
Transcript of "Approximation algorithms for large-scale kernel methods" (Taher Dameh, School of Computing Science, Simon Fraser University)

Page 1:

Approximation algorithms for large-scale kernel methods

Taher Dameh, School of Computing Science
Simon Fraser University, tdameh@cs.sfu.ca
March 29th, 2010

Joint work with F. Gao, M. Hefeeda and W. Abdel-Majeed

Page 2:

Outline

Introduction
Motivation
Locality Sensitive Hashing
Z and H Curves
Affinity Propagation
Results

Page 3:

Introduction

Kernel-based machine learning methods require O(N²) time and space to compute and store their non-sparse Gram matrices.

We are developing methods to approximate the Gram matrix with a band matrix.

[Figure: N points → N×N Gram matrix]
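To make the quadratic cost and the band idea concrete, here is a minimal NumPy sketch (the RBF kernel, the bandwidth gamma, and the window size w are illustrative assumptions, not values from the talk):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Dense N x N Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    return np.exp(-gamma * d2)                        # O(N^2) time and space

def band_approximation(X, w, gamma=1.0):
    """Evaluate the kernel only within a window of w neighbors on each side
    (the points are assumed to be ordered so that neighbors are close in space)."""
    N = len(X)
    K = np.zeros((N, N))
    for i in range(N):
        lo, hi = max(0, i - w), min(N, i + w + 1)
        d2 = np.sum((X[lo:hi] - X[i])**2, axis=1)
        K[i, lo:hi] = np.exp(-gamma * d2)             # only O(N * w) entries kept
    return K
```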


Page 4:

Motivation

Exact vs. approximate answers: an approximate answer may be good enough and much faster.

Time-quality and memory trade-offs.

From a machine learning point of view, we can live with a bounded (controlled) error as long as it lets us run on large-scale data that we otherwise could not process at all because of the memory usage.

Page 5:

Ideas of approximation

To construct the approximate band matrix, we evaluate the kernel function only within a fixed neighborhood around each point.

This low-rank method relies on the observation that when the kernel is a Radial Basis Function (a real-valued function whose value depends only on the Euclidean distance between its inputs), its eigen-spectrum decays rapidly.

(Most of the information is stored in the first few eigenvectors.)
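A quick way to see the eigen-spectrum decay this refers to (a sketch; the random 2-D Gaussian data and gamma = 0.5 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # illustrative data
sq = np.sum(X**2, axis=1)
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

eigvals = np.linalg.eigvalsh(K)[::-1]      # sorted, largest first
print(eigvals[:10] / eigvals.sum())        # the top eigenvalues carry most of the mass
```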


Page 6:

How to choose this neighborhood window? Since the kernel function decreases monotonically with the Euclidean distance between the input points, we can compute the kernel function only between close points.

We need a fast and reliable technique to order the points:

Space-filling curves: Z-curve and H-curve (a Z-curve sketch follows below)
Locality Sensitive Hashing
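One standard way to get such an ordering is the Z-curve: interleave the bits of the quantized coordinates and sort by the resulting Morton code (a sketch for 2-D points with 16-bit integer coordinates; the talk's actual curve construction may differ):

```python
def interleave16(x: int, y: int) -> int:
    """Morton (Z-order) code of two 16-bit integers: the bits of x and y interleaved."""
    code = 0
    for b in range(16):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code

def z_order(points):
    """Sort 2-D integer points along the Z-curve so that nearby points
    tend to end up close in the ordering."""
    return sorted(points, key=lambda p: interleave16(p[0], p[1]))

print(z_order([(5, 1), (0, 0), (1, 1), (7, 7), (1, 0)]))
```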


Page 7:

LSH: Motivation

Similarity search over large-scale, high-dimensional data.

Exact vs. approximate answers: an approximate answer may be good enough and much faster; a time-quality trade-off.

Page 8:

LSH: Key idea

Hash the data points using several LSH functions, so that the probability of collision is higher for closer objects.

Algorithm (a runnable sketch follows below):

• Input
− a set of N points {p1, …, pN}
− L (the number of hash tables)

• Output
− hash tables Ti, i = 1, 2, …, L

• For each i = 1, 2, …, L
− initialize Ti with a random hash function gi(·)

• For each i = 1, 2, …, L and each j = 1, 2, …, N
− store point pj in bucket gi(pj) of hash table Ti
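A runnable sketch of this construction, using one common LSH family (random projections quantized into width-w bins); the parameters L, k, and w are illustrative assumptions, not values from the talk:

```python
import numpy as np
from collections import defaultdict

def build_lsh_tables(points, L=4, k=3, w=2.0, seed=0):
    """Build L hash tables; g_i(p) concatenates k hashes h(p) = floor((a.p + b)/w),
    so that closer points collide with higher probability."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    tables = []
    for _ in range(L):
        A = rng.normal(size=(k, d))          # k random projection directions
        b = rng.uniform(0.0, w, size=k)      # random offsets
        T = defaultdict(list)
        for j, p in enumerate(points):
            key = tuple(np.floor((A @ p + b) / w).astype(int))  # g_i(p)
            T[key].append(j)                 # store index j in bucket g_i(p_j)
        tables.append(T)
    return tables

points = np.random.default_rng(1).normal(size=(100, 5))
tables = build_lsh_tables(points)
```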


Page 9:

LSH: Algorithm

[Figure: each point pi in the dataset P is hashed by g1, g2, …, gL into buckets g1(pi), g2(pi), …, gL(pi) of hash tables T1, T2, …, TL]

Page 10:

LSH: Analysis

• A family H of (r1, r2, p1, p2)-sensitive functions {hi(·)} satisfies:

− if dist(p, q) < r1, then ProbH[h(q) = h(p)] ≥ p1

− if dist(p, q) ≥ r2, then ProbH[h(q) = h(p)] ≤ p2

− where p1 > p2 and r1 < r2

• LSH functions: gi(·) = (h1(·), …, hk(·)), the concatenation of k functions from H

Page 11:

Our approach

Hash the N points using an LSH function family.

Compute the kernel function only between points in the same bucket (the kernel value is taken to be 0 between points in different buckets).

Using a hash table with m buckets, in the best case we achieve O(N²/m) memory and computation. (A sketch of this step follows below.)
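A sketch of this step under the same toy conventions as the LSH sketch above (one hash table; cross-bucket entries are implicitly zero because only the per-bucket blocks are stored):

```python
import numpy as np

def bucketed_gram(points, buckets, gamma=1.0):
    """Compute the RBF kernel only within each bucket, storing one small
    block per bucket instead of the full N x N matrix."""
    blocks = {}
    for key, idx in buckets.items():
        B = points[idx]                       # points falling in this bucket
        sq = np.sum(B**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * B @ B.T
        blocks[key] = np.exp(-gamma * d2)     # len(idx) x len(idx) block
    return blocks
```

With m roughly equal-sized buckets, each block has about (N/m)² entries, so the total storage is on the order of N²/m. For example, `bucketed_gram(points, tables[0])` uses the first table from the LSH sketch above.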


Page 12:

Validation Methods

Low level (matrix level), sketched below:
Frobenius norm
Eigen-spectrum

High level (application level):
Affinity Propagation
Support Vector Machines
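For the matrix-level checks, a sketch of the two measures, comparing a full Gram matrix K with an approximation K_approx (both assumed dense here for simplicity):

```python
import numpy as np

def matrix_level_error(K, K_approx, top=10):
    """Low-level validation: relative Frobenius error and the change
    in the leading part of the eigen-spectrum."""
    frob = np.linalg.norm(K - K_approx) / np.linalg.norm(K)   # Frobenius by default
    ev = np.linalg.eigvalsh(K)[::-1][:top]            # top eigenvalues, exact
    ev_a = np.linalg.eigvalsh(K_approx)[::-1][:top]   # top eigenvalues, approximate
    return frob, np.abs(ev - ev_a)
```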


Page 13:

Example

Six points P0, …, P5; LSH maps P0, P1, P2 to bucket 0 and P3, P4, P5 to bucket 1.

Full similarity matrix S(i,k):

i\k    0    1    2    3    4    5
0     -1   -1   -4  -25  -36  -49
1     -1   -1   -1  -26  -37  -50
2     -4   -1   -1  -29  -40  -53
3    -25  -26  -29   -1   -1   -4
4    -36  -37  -40   -1   -1   -1
5    -49  -50  -53   -4   -1   -1

Per-bucket sub-matrices:

S0(i,k):                 S1(i,k):
i\k    0    1    2       i\k    3    4    5
0     -1   -1   -4       3     -1   -1   -4
1     -1   -1   -1       4     -1   -1   -1
2     -4   -1   -1       5     -4   -1   -1

Hash table after LSH:
bucket 0: P0, P1, P2
bucket 1: P3, P4, P5

[Figure: the six points plotted, forming two visible groups that LSH separates into the two buckets]

FrobNorm (S) = 230.469

FrobNorm ( [S0 S1] ) = 217.853
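The S table can be reproduced as s(i,k) = -||xi - xk||² with the diagonal set to the preference -1; the coordinates below are one choice consistent with the table (an assumption, since the slide's original figure is not recoverable):

```python
import numpy as np

# Hypothetical coordinates that reproduce the S(i,k) table above.
P = np.array([[0, 0], [0, 1], [0, 2],    # P0, P1, P2 -> bucket 0
              [5, 0], [6, 0], [7, 0]])   # P3, P4, P5 -> bucket 1

sq = np.sum(P**2, axis=1)
S = -(sq[:, None] + sq[None, :] - 2 * P @ P.T)   # s(i,k) = -||xi - xk||^2
np.fill_diagonal(S, -1)                          # AP preference on the diagonal
print(S)          # matches the full table; S0 and S1 are its two diagonal blocks
```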


Page 14:

Results: Memory Usage

N points   All data   Z512     Z1024    LSH3000   LSH5000
64 K       4 GB       32 MB    64 MB    19 MB     18 MB
128 K      16 GB      64 MB    128 MB   77 MB     76 MB
256 K      64 GB      128 MB   256 MB   309 MB    304 MB
512 K      256 GB     256 MB   512 MB   1244 MB   1231 MB
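As a rough sanity check on the "All data" column, the 64 K row is consistent with one byte per Gram-matrix entry (an assumption; the slides do not state the storage format):

$$(64 \times 1024)^2 \ \text{entries} \times 1 \ \text{byte} \approx 4.3 \times 10^{9} \ \text{bytes} \approx 4 \ \text{GB}$$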



Page 16:

References

[1] M. Hussein and W. Abd-Almageed, “Efficient band approximation of gram matrices for large scale kernel methods on gpus,” in Proc. of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SuperComputing’09), Portland, OR, November 2009.

[2] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” Communications of the ACM, vol. 51, no. 1, pp. 117–122, January 2008.

[3] B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.


Page 17:

AP : Motivation

No prior needed on the number of clusters
Independence from initialization
Reasonable processing time to achieve good performance

Page 18:

Affinity Propagation

Treat each data point as a node in a network.

Consider all data points as potential cluster centers (exemplars).

Start the clustering from a similarity between each pair of data points.

Exchange messages between data points until a good set of cluster centers is found. (A minimal sketch follows below.)
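A minimal dense sketch of the message-passing loop, following Frey and Dueck [3] (the damping factor and iteration count are illustrative assumptions):

```python
import numpy as np

def affinity_propagation(S, iters=100, damping=0.5):
    """Minimal affinity propagation: exchange responsibility and availability
    messages until exemplars emerge. S is the similarity matrix with the
    preferences on its diagonal."""
    S = np.asarray(S, dtype=float)
    N = S.shape[0]
    R = np.zeros((N, N))
    A = np.zeros((N, N))
    for _ in range(iters):
        # responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(N), idx].copy()
        AS[np.arange(N), idx] = -np.inf          # mask the max to find the runner-up
        second = np.max(AS, axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(N), idx] = S[np.arange(N), idx] - second
        R = damping * R + (1 - damping) * Rnew
        # availabilities: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))         # keep r(k,k) itself in the column sums
        Anew = np.sum(Rp, axis=0)[None, :] - Rp
        dA = np.diag(Anew).copy()                # self-availability has no min(0, .)
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, dA)
        A = damping * A + (1 - damping) * Anew
    return np.argmax(A + R, axis=1)              # exemplar index for each point
```

S is assumed dense here; at the scales discussed in the talk, the per-bucket blocks from the previous slides would stand in for the full matrix.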


Page 19:

Terminology and Notation

Similarity s(i,k): the individual evidence that data point k is suited to be the exemplar for data point i.

(This is the kernel function, i.e. the entries of the Gram matrix.)

Page 20:

Responsibility r(i,k): accumulated evidence, sent from point i to candidate exemplar k, of how well-suited k is to serve as the exemplar for i.
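The update rule, as given by Frey and Dueck [3]:

$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \big\{ a(i,k') + s(i,k') \big\}$$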


Page 21:

Availability a(i,k): accumulated evidence, sent from candidate exemplar k to point i, of how appropriate it would be for i to pick k as its exemplar.
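The corresponding updates from [3] (the second line is the self-availability):

$$a(i,k) \leftarrow \min\Big\{0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\Big\}, \qquad i \neq k$$

$$a(k,k) \leftarrow \sum_{i' \neq k} \max\{0, r(i',k)\}$$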



Page 23:

Flow chart

[Figure: flow chart of the overall approach]