Similarities, Distances and Manifold Learning Prof. Richard C. Wilson Dept. of Computer Science...

Similarities, Distances and Manifold Learning

Prof. Richard C. Wilson

Dept. of Computer Science, University of York

Background

• Typically objects are characterised by features

– Face images

– SIFT features

– Object spectra

– ...

• If we measure n features → n-dimensional space

• The arena for our problem is an n-dimensional vector space

Background

• Example: Eigenfaces

• Raw pixel values: n by m gives nm features

• Feature space is space of all n by m images

Background

• The space of all face-like images is smaller than the space of all images

• Assumption is faces lie on a smaller manifold embedded in the global space

All images

Face images

Manifold: A space which locally looks Euclidean

Manifold learning: Finding the manifold representing the objects we are interested in

All objects should be on the manifold, non-objects outside

Part I: Euclidean Space

Position, Similarity and Distance

Manifold Learning in Euclidean space

Some famous techniques

Part II: Non-Euclidean Manifolds

Assessing Data

Nature and Properties of Manifolds

Data Manifolds

Learning some special types of manifolds

Part III: Advanced Techniques

Methods for intrinsically curved manifolds

Thanks to Edwin Hancock, Eliza Xu and Bob Duin for contributions, and to the EU SIMBAD project for support

Part I: Euclidean Space

Position

The main arena for pattern recognition and machine learning problems is vector space

– A set of n well-defined features collected into a vector $\mathbf{x} \in \mathbb{R}^n$

Also defined are addition of vectors and multiplication by a scalar

Feature vector → position

Similarity

To make meaningful progress, we need a notion of similarity

Inner product

• The inner-product ‹x,y› can be considered to be a similarity between x and y

$$\langle \mathbf{x}, \mathbf{y} \rangle = \sum_i x_i y_i$$

Induced norm

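• The self-similarity ‹x,x› is the square of the ‘size’ of x and gives rise to the induced norm, or length, of x:

$$\|\mathbf{x}\| = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$$

• Finally, the length of x allows the definition of a distance in our vector space as the length of the vector joining x and y:

$$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\| = \sqrt{\langle \mathbf{x} - \mathbf{y}, \mathbf{x} - \mathbf{y} \rangle}$$

• So the inner product also gets us distance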

Euclidean space

• If we have a vector space for features, and the usual inner product, all three are connected:

Position: $\mathbf{x}, \mathbf{y}$

Similarity: $\langle \mathbf{x}, \mathbf{y} \rangle$

Distance: $d(\mathbf{x}, \mathbf{y})$

non-Euclidean Inner Product

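• If the inner-product has the form

$$\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \mathbf{y} = \sum_i x_i y_i$$

• Then the vector space is Euclidean

• Note we recover all the expected results for Euclidean space, i.e.

$$\|\mathbf{x}\| = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$$

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$

• The inner-product doesn’t have to be like this; for example in Einstein’s special relativity, the inner-product of spacetime is (taking $x_4$ as the time coordinate; the overall sign is a matter of convention)

$$\langle \mathbf{x}, \mathbf{y} \rangle = x_1 y_1 + x_2 y_2 + x_3 y_3 - x_4 y_4$$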

The Golden Trio

• In Euclidean space, the concepts of position, similarity and distance are elegantly connected

Position X

Similarity K

Distance D

Point position matrix

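• In a normal manifold learning problem, we have a set of samples $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_m\}$

• These can be collected together in a matrix X, one sample per row

$$X = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_m^T \end{pmatrix}$$

I use this convention, but others may write them vertically (samples as columns)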

Centring

A common and important operation is centring: moving the mean to the origin

– Centred points behave better

– $JX/m$ is the mean matrix, so $X - JX/m$ is the centred matrix

– J is the m × m all-ones matrix

This can be done with C

– C is the centring matrix (and is symmetric, $C = C^T$)

$$C = I - J/m, \qquad X_c = CX$$
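A minimal NumPy sketch of the centring operation, assuming the row-per-sample convention used here. The data values and the helper name `centring_matrix` are made up for illustration.

```python
import numpy as np

def centring_matrix(m):
    """Return C = I - J/m, with J the m x m all-ones matrix."""
    return np.eye(m) - np.ones((m, m)) / m

# Hypothetical data: m = 4 samples (rows) with n = 2 features (columns).
X = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0], [7.0, 2.0]])
C = centring_matrix(X.shape[0])
Xc = C @ X  # equivalent to X - X.mean(axis=0)
```

In practice subtracting the column means directly is cheaper; the explicit matrix C is mainly useful because it reappears when centring K and D.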

Position-Similarity

• The similarity matrix K is defined as

$$K_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

• From the definition of X, we simply get

$$K = XX^T$$

• The Gram matrix is the similarity matrix of the centred points (from the definition of X)

$$K_c = CXX^TC^T = CKC$$

– i.e. a centring operation on K

• $K_c$ is really a kernel matrix for the points (linear kernel)

Position X → Similarity K

Position-Similarity

• To go from K to X, we need to consider the eigendecomposition of K

$$K = U \Lambda U^T$$

• Comparing with $K = XX^T$: as long as we can take the square root of Λ, we can find X as

$$X = U \Lambda^{1/2}$$

Similarity K → Position X

Kernel embedding

First manifold learning method – kernel embedding

Finds a Euclidean manifold from object similarities

• Embeds a kernel matrix into a set of points in Euclidean space (the points are automatically centred)

• K must have no negative eigenvalues, i.e. it is a kernel matrix (Mercer condition)

$$K = U \Lambda U^T, \qquad X = U \Lambda^{1/2}$$
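Kernel embedding can be sketched in a few lines of NumPy. This is an illustration rather than a reference implementation: the function name `kernel_embedding`, the tolerance `tol` and the test points `P` are assumptions introduced here.

```python
import numpy as np

def kernel_embedding(K, tol=1e-9):
    """Embed a PSD kernel matrix K as points X with X @ X.T == K."""
    lam, U = np.linalg.eigh((K + K.T) / 2)   # eigenvalues in ascending order
    scale = max(np.abs(lam).max(), 1.0)
    if lam[0] < -tol * scale:
        raise ValueError("K has negative eigenvalues, so it is not a kernel matrix")
    keep = lam > tol * scale                 # drop numerically zero directions
    lam, U = lam[keep][::-1], U[:, keep][:, ::-1]
    return U * np.sqrt(lam)                  # X = U Lambda^(1/2)

rng = np.random.default_rng(0)
P = rng.normal(size=(5, 2))                  # hypothetical points, used only to build K
P -= P.mean(axis=0)
K = P @ P.T
X = kernel_embedding(K)
```

The recovered X matches P only up to a rotation or reflection, since K fixes inner products and not the coordinate axes.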

Similarity-Distance

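• We can easily determine the squared-distance matrix $D_s$ from K

$$d(\mathbf{x}_i, \mathbf{x}_j)^2 = \langle \mathbf{x}_i - \mathbf{x}_j, \mathbf{x}_i - \mathbf{x}_j \rangle = \langle \mathbf{x}_i, \mathbf{x}_i \rangle + \langle \mathbf{x}_j, \mathbf{x}_j \rangle - 2 \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

$$D_{s,ij} = K_{ii} + K_{jj} - 2K_{ij}$$

Similarity K → Distance D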

Similarity-Distance

What about finding K from $D_s$?

$$D_{s,ij} = K_{ii} + K_{jj} - 2K_{ij}$$

Looking at this equation, we might imagine that $K = -\tfrac{1}{2} D_s$ is a suitable choice

• That choice is not centred; the relationship is actually

$$K = -\tfrac{1}{2} C D_s C$$
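Both directions of the similarity-distance link can be checked numerically. The sketch below uses hypothetical random points and made-up function names; it shows that centring $D_s$ returns the kernel of the centred points, $CKC$, and not the original uncentred K.

```python
import numpy as np

def squared_distances_from_kernel(K):
    """D_s[i, j] = K[i, i] + K[j, j] - 2 K[i, j]."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

def kernel_from_squared_distances(Ds):
    """K = -1/2 C D_s C, the kernel of the centred points."""
    m = Ds.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    return -0.5 * C @ Ds @ C

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))       # hypothetical, deliberately uncentred points
K = X @ X.T
Ds = squared_distances_from_kernel(K)
Kc = kernel_from_squared_distances(Ds)
```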

Classic MDS

• Classic Multidimensional Scaling embeds a (squared) distance matrix into Euclidean space

• Using what we have so far, the algorithm is simple

1. Compute the kernel: $K = -\tfrac{1}{2} C D_s C$

2. Eigendecompose the kernel: $K = U \Lambda U^T$

3. Embed the kernel: $X = U \Lambda^{1/2}$

• This is MDS

Distance D → Position X
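The three MDS steps translate almost line for line into NumPy. This is a sketch under the assumption that the input is a matrix of squared distances; the 3-4-5 triangle is a made-up example and `classical_mds` is a name chosen here.

```python
import numpy as np

def classical_mds(Ds, dim):
    """Embed a squared-distance matrix Ds into `dim` Euclidean dimensions."""
    m = Ds.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ Ds @ C                        # 1. compute the kernel
    lam, U = np.linalg.eigh(K)                   # 2. eigendecompose the kernel
    top = np.argsort(lam)[::-1][:dim]            #    largest eigenvalues first
    return U[:, top] * np.sqrt(np.clip(lam[top], 0.0, None))  # 3. embed

# Hypothetical input: side lengths of a 3-4-5 right triangle.
D = np.array([[0.0, 3.0, 4.0],
              [3.0, 0.0, 5.0],
              [4.0, 5.0, 0.0]])
X = classical_mds(D**2, dim=2)
```

Clipping negative eigenvalues to zero is a choice made in this sketch so that non-Euclidean input does not produce NaNs; for Euclidean input it changes nothing.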

The Golden Trio

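Position X, Similarity K, Distance D

– Similarity → Position: kernel embedding

– Distance → Position: MDS

– Similarity → Distance: $D_{s,ij} = K_{ii} + K_{jj} - 2K_{ij}$

– Distance → Similarity: $K = -\tfrac{1}{2} C D_s C$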

Kernel methods

• A kernel is a function k(i,j) which computes an inner-product

$$k(i,j) = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

– But without needing to know the actual points (the space is implicit)

• Using a kernel function we can directly compute K without knowing X

Kernel function → Similarity K

Kernel methods

• The implied space may be very high dimensional, but a true kernel will always produce a positive semidefinite K and the implied space will be Euclidean

• Many (most?) PR algorithms can be kernelized

– Made to use K rather than X or D

• The trick is to note that any interesting vector should lie in the space spanned by the examples we are given

• Hence it can be written as a linear combination

$$\mathbf{u} = \alpha_1 \mathbf{x}_1 + \alpha_2 \mathbf{x}_2 + \dots + \alpha_m \mathbf{x}_m = X^T \boldsymbol{\alpha}$$

• Look for α instead of u

Kernel PCA

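• What about PCA? PCA looks for the direction u of maximum variance (X centred, n counting the samples here)

$$\mathbf{u}^* = \arg\max_{\mathbf{u}} \mathbf{u}^T \Sigma \mathbf{u} = \arg\max_{\mathbf{u}} \frac{1}{n} \mathbf{u}^T X^T X \mathbf{u}$$

• Let’s kernelize, using $\mathbf{u} = X^T \boldsymbol{\alpha}$:

$$\frac{1}{n} \mathbf{u}^T X^T X \mathbf{u} = \frac{1}{n} (X^T \boldsymbol{\alpha})^T X^T X (X^T \boldsymbol{\alpha}) = \frac{1}{n} \boldsymbol{\alpha}^T X X^T X X^T \boldsymbol{\alpha} = \frac{1}{n} \boldsymbol{\alpha}^T K^2 \boldsymbol{\alpha}$$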

Kernel PCA

• $K^2$ has the same eigenvectors as K, so the eigenvectors of PCA are the same as the eigenvectors of K

• The eigenvalues of PCA are related to the eigenvalues of K by

$$\lambda_{PCA} = \frac{1}{n} \lambda_K^2$$

• Kernel PCA is a kernel embedding with an externally provided kernel matrix

Kernel PCA

• So kernel PCA gives the same solution as kernel embedding

– The eigenvalues are modified a bit

• They are essentially the same thing in Euclidean space

• MDS uses the kernel and kernel embedding

• MDS and PCA are essentially the same thing in Euclidean space

• Kernel embedding, MDS and PCA all give the same answer for a set of points in Euclidean space
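The claim that these methods agree for Euclidean points can be checked on a small example. The sketch below uses hypothetical random data and compares PCA projections with the kernel embedding of the centred linear kernel; the two agree up to the sign of each axis.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))          # hypothetical data: 6 samples, 3 features
Xc = X - X.mean(axis=0)

# PCA: project the centred points onto the eigenvectors of the covariance matrix.
cov = Xc.T @ Xc / len(Xc)
_, V = np.linalg.eigh(cov)
pca = Xc @ V[:, ::-1]                # columns ordered by decreasing variance

# Kernel embedding of the centred linear kernel K = Xc Xc^T.
lam, U = np.linalg.eigh(Xc @ Xc.T)
top = np.argsort(lam)[::-1][:3]
emb = U[:, top] * np.sqrt(lam[top])
```

Since MDS is kernel embedding applied to $-\tfrac{1}{2} C D_s C$, which equals the centred kernel, the same coordinates follow from the distance matrix as well.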

Some useful observations

• Your similarity matrix is Euclidean iff it has no negative eigenvalues (i.e. it is a kernel matrix and PSD)

• By similar reasoning, your distance matrix is Euclidean iff the similarity matrix derived from it is PSD

• If the feature space is small but the number of samples is large, then the covariance matrix is small and it is better to do normal PCA (on the covariance matrix)

• If the feature space is large and the number of samples is small, then the kernel matrix will be small and it is better to do kernel embedding

Part II: Non-Euclidean Manifolds

Non-linear data

• Much of the data in computer vision lies in a high-dimensional feature space but is constrained in some way

– The space of all images of a face is a subspace of the space of all possible images

– The subspace is highly non-linear but low dimensional (described by a few parameters)

Non-linear data

• This cannot be exploited by linear subspace methods like PCA

– These assume that the subspace is a Euclidean space as well

• A classic example is the ‘swiss roll’ data

‘Flat’ Manifolds

• Fundamentally different types of data, for example:

• The embedding of this data into the high-dimensional space is highly curved

– This is called extrinsic curvature, the curvature of the manifold with respect to the embedding space

• Now imagine that this manifold was a piece of paper; you could unroll the paper into a flat plane without distorting it

– No intrinsic curvature, in fact it is isometric to (a region of) Euclidean space

• This manifold is different:

• It must be stretched to map it onto a plane

– It has non-zero intrinsic curvature

• A flatlander living on this manifold can tell that it is curved, for example by measuring the ratio of the radius to the circumference of a circle

• In the first case, we might still hope to find a Euclidean embedding

• We can never find a distortion free Euclidean embedding of the second (in the sense that the distances will always have errors)

Curved manifold

Intrinsically Euclidean Manifolds

• We cannot use the previous methods on the second type of manifold, but there is still hope for the first

• The manifold is embedded in Euclidean space, but Euclidean distance is not the correct way to measure distance

• The Euclidean distance ‘shortcuts’ the manifold

• The geodesic distance calculates the shortest path along the manifold

Geodesics

• The geodesic generalizes the concept of distance to curved manifolds

– The shortest path joining two points which lies completely within the manifold

• If we can correctly compute the geodesic distances, and the manifold is intrinsically flat, we should get Euclidean distances which we can plug into our Euclidean geometry machine

Geodesic distances → Distance D → Similarity K → Position X

ISOMAP

• ISOMAP is exactly such an algorithm

• Approximate geodesic distances are computed for the points from a graph

• Nearest neighbours graph

– For neighbours, Euclidean distance ≈ geodesic distance

– For non-neighbours, geodesic distance approximated by shortest distance in graph

• Once we have distances D, we can use MDS to find a Euclidean embedding

ISOMAP

• ISOMAP:

– Neighbourhood graph

– Shortest path algorithm

– MDS

• ISOMAP is distance-preserving – embedded distances should be close to geodesic distances
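The three ISOMAP stages can be sketched compactly in NumPy. This is an illustrative toy version, not the reference implementation: it assumes a k-nearest-neighbour graph, uses a dense Floyd-Warshall shortest-path pass (fine only for small data), and the half-circle test data is made up.

```python
import numpy as np

def isomap(X, n_neighbours, dim):
    m = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # 1. Neighbourhood graph: keep each point's nearest neighbours, made symmetric.
    G = np.full((m, m), np.inf)
    nearest = np.argsort(d, axis=1)[:, 1:n_neighbours + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(m), n_neighbours)
    G[rows, nearest.ravel()] = d[rows, nearest.ravel()]
    G = np.minimum(G, G.T)
    np.fill_diagonal(G, 0.0)

    # 2. Shortest paths (Floyd-Warshall) approximate the geodesic distances.
    for k in range(m):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    if np.isinf(G).any():
        raise ValueError("neighbourhood graph is disconnected")

    # 3. Classical MDS on the squared geodesic distances.
    C = np.eye(m) - np.ones((m, m)) / m
    lam, U = np.linalg.eigh(-0.5 * C @ G**2 @ C)
    top = np.argsort(lam)[::-1][:dim]
    return U[:, top] * np.sqrt(np.clip(lam[top], 0.0, None))

# Hypothetical data: 30 points along a half circle of radius 1 (a 1D manifold in 2D).
t = np.linspace(0.0, np.pi, 30)
arc = np.column_stack([np.cos(t), np.sin(t)])
Y = isomap(arc, n_neighbours=2, dim=1)
```

The two ends of the arc are a Euclidean distance of 2 apart, but the 1D embedding places them roughly π apart, which is the length along the manifold.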

Laplacian Eigenmap

• The Laplacian Eigenmap is another graph-based method of embedding non-linear manifolds into Euclidean space

• As with ISOMAP, form a neighbourhood graph for the datapoints

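• Find the graph Laplacian as follows

• The adjacency matrix A is

$$A_{ij} = \begin{cases} e^{-d_{ij}^2 / t} & \text{if } i \text{ and } j \text{ are connected} \\ 0 & \text{otherwise} \end{cases}$$

• The ‘degree’ matrix D is the diagonal matrix

$$D_{ii} = \sum_j A_{ij}$$

• The normalized graph Laplacian is

$$L = I - D^{-1/2} A D^{-1/2}$$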

Laplacian Eigenmap

• We find the Laplacian eigenmap embedding using the eigendecomposition of L

$$L = U \Lambda U^T$$

• The embedded positions are

$$X = D^{-1/2} U$$

• Similar to ISOMAP

– Structure preserving not distance preserving
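A small NumPy sketch of the Laplacian eigenmap, following the formulas for A, D and L. One point is an assumption not spelled out on the slide: the sketch drops the trivial eigenvector with eigenvalue 0 and keeps the next `dim` eigenvectors, which is the usual practice. The function name and the line-graph example are hypothetical.

```python
import numpy as np

def laplacian_eigenmap(d, connected, t, dim):
    A = np.where(connected, np.exp(-d**2 / t), 0.0)   # heat-kernel edge weights
    deg = A.sum(axis=1)                               # diagonal of the degree matrix D
    inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    lam, U = np.linalg.eigh(L)                        # ascending eigenvalues
    # Assumption: skip the trivial eigenvector (eigenvalue 0), keep the next `dim`.
    return inv_sqrt[:, None] * U[:, 1:dim + 1]        # X = D^(-1/2) U

# Hypothetical example: 5 points on a line, each joined to its immediate neighbours.
pos = np.arange(5.0)
d = np.abs(pos[:, None] - pos[None, :])
connected = d == 1.0
X = laplacian_eigenmap(d, connected, t=1.0, dim=1)
```

The 1D embedding keeps the points in their original order along the line, but the spacing is not preserved, which illustrates structure preservation without distance preservation.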

Locally-Linear Embedding

• Locally-linear Embedding is another classic method which also begins with a neighbourhood graph

• We reconstruct point i (in the original data) from a weighted sum of the neighbouring points

$$\hat{\mathbf{x}}_i = \sum_j W_{ij} \mathbf{x}_j$$

• $W_{ij}$ is 0 for any point j not in the neighbourhood (and for i=j)

• We find the weights by minimising the reconstruction error

$$\min_W \sum_i |\mathbf{x}_i - \hat{\mathbf{x}}_i|^2$$

– Subject to the constraints that the weights are non-negative and sum to 1

$$W_{ij} \ge 0, \qquad \sum_j W_{ij} = 1$$

• Gives a relatively simple closed-form solution

Locally-Linear Embedding

• These weights encode how well a point j represents a point i and can be interpreted as the adjacency between i and j

• A low dimensional embedding is then found by finding points $\mathbf{y}_i$ which minimise the error

$$\min \sum_i |\mathbf{y}_i - \hat{\mathbf{y}}_i|^2, \qquad \hat{\mathbf{y}}_i = \sum_j W_{ij} \mathbf{y}_j$$

• In other words, we find a low-dimensional embedding which preserves the adjacency relationships

• The solution to this embedding problem turns out to be simply the eigenvectors of the matrix M

$$M = (I - W)^T (I - W)$$

• LLE is scale-free: the final points have the covariance matrix I

– Unit scale
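A hedged sketch of LLE in NumPy. It makes two simplifying assumptions that differ from the slide: only the sum-to-one constraint is enforced (the usual closed-form solution, with no non-negativity constraint), and a small regulariser `reg` is added so the local linear system is always solvable. The names `lle`, `reg` and the spiral data are made up for illustration.

```python
import numpy as np

def lle(X, n_neighbours, dim, reg=1e-3):
    m = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)[:, 1:n_neighbours + 1]  # exclude the point itself

    # Reconstruction weights: solve a small linear system per point.
    W = np.zeros((m, m))
    for i in range(m):
        Z = X[nearest[i]] - X[i]                  # neighbours relative to point i
        G = Z @ Z.T                               # local Gram matrix
        G += reg * max(np.trace(G), 1e-12) * np.eye(n_neighbours)
        w = np.linalg.solve(G, np.ones(n_neighbours))
        W[i, nearest[i]] = w / w.sum()            # enforce sum-to-one only

    # Embedding: bottom eigenvectors of M = (I - W)^T (I - W), skipping the constant one.
    I = np.eye(m)
    lam, U = np.linalg.eigh((I - W).T @ (I - W))
    return U[:, 1:dim + 1] * np.sqrt(m), W        # scaled so that Y^T Y / m = I

# Hypothetical data: 40 points along a planar spiral.
t = np.linspace(0.0, 3.0, 40)
spiral = np.column_stack([t * np.cos(t), t * np.sin(t)])
Y, W = lle(spiral, n_neighbours=4, dim=1)
```

The scaling by √m is what makes the output unit scale: the original distance scale of the data does not survive.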

Comparison

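• LLE might seem like quite a different process to the previous two, but it is actually very similar

• We can interpret the process as producing a kernel matrix followed by scale-free kernel embedding

| | ISOMAP | Lap. Eigenmap | LLE |
|---|---|---|---|
| Representation | Neighbourhood graph | Neighbourhood graph | Neighbourhood graph |
| Similarity matrix | From geodesic distances | Graph Laplacian | Reconstruction weights |
| Embedding | $X = U \Lambda^{1/2}$ | $X = D^{-1/2} U$ | $X = U$ |

– In each case U comes from an eigendecomposition, $K = U \Lambda U^T$

– For LLE the kernel matrix is, for a suitable constant k,

$$K = (k - 1) I - \frac{k}{n} J + W + W^T - W^T W$$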

Comparison

• ISOMAP is the only method which directly computes and uses the geodesic distances

– The other two depend indirectly on the distances through local structure

• LLE is scale-free, so the original distance scale is lost, but the local structure is preserved

• Computing the necessary local dimensionality to find the correct nearest neighbours is a problem for all such methods

Non-Euclidean data

• Data is Euclidean iff K is PSD

• Unless you are using a kernel function, this is often not true

• Why does this happen?

What type of data do I have?

• Starting point: distance matrix

• However we do not know a priori if our measurements are representable on a manifold

– We will call them dissimilarities

• Our starting point to answer the question “What type of data do I have?” will be a matrix of dissimilarities D between objects

• Types of dissimilarities

– Euclidean (no intrinsic curvature)

– Non-Euclidean, metric (curved manifold)

– Non-metric (no point-like manifold representation)

Causes

• Example: Chicken pieces data

• Distance by alignment

• Global alignment of everything could find Euclidean distances

• Only local alignments are practical

Causes

Dissimilarities may also be non-metric

The data is metric if it obeys the metric conditions

1. $D_{ij} \ge 0$ (non-negativity)

2. $D_{ij} = 0$ iff i = j (identity of indiscernibles)

3. $D_{ij} = D_{ji}$ (symmetry)

4. $D_{ij} \le D_{ik} + D_{kj}$ (triangle inequality)

Reasonable dissimilarities should meet 1 & 2

Causes

• Symmetry: $D_{ij} = D_{ji}$

• May not be symmetric by definition

• Alignment: i→j may find a better solution than j→i

Causes

• Triangle violations: $D_{ij} \le D_{ik} + D_{kj}$ fails

• ‘Extended objects’: i and j each touch k but not each other, so $D_{ik} = 0$ and $D_{kj} = 0$ but $D_{ij} > 0$

• Finally, noise in the measure of D can cause all of these effects

Tests(1)

• Find the similarity matrix

$$K = -\tfrac{1}{2} C D_s C$$

• The data is Euclidean iff K is positive semidefinite (no negative eigenvalues)

– K is a kernel, explicit embedding from kernel embedding

• We can then use K in a kernel algorithm

Tests(2)

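• Negative eigenfraction (NEF), where $\lambda_i$ are the eigenvalues of K

$$\mathrm{NEF} = \frac{\sum_{\lambda_i < 0} |\lambda_i|}{\sum_i |\lambda_i|}$$

• Between 0 and 0.5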
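The first two tests can be sketched together: build K from the dissimilarities and measure how much of its spectrum is negative. The function name and both example matrices are hypothetical. The second example is the path metric of a star graph (a centre at distance 1 from three leaves, leaves at distance 2 from each other), which is metric but not Euclidean.

```python
import numpy as np

def negative_eigenfraction(D):
    """NEF of the kernel K = -1/2 C D_s C built from a dissimilarity matrix D."""
    m = D.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ (D**2) @ C
    lam = np.linalg.eigvalsh(K)
    return np.abs(lam[lam < 0]).sum() / np.abs(lam).sum()

# Euclidean example: a 3-4-5 triangle.
triangle = np.array([[0.0, 3.0, 4.0],
                     [3.0, 0.0, 5.0],
                     [4.0, 5.0, 0.0]])

# Non-Euclidean example: star graph, vertex 0 is the centre.
star = np.array([[0.0, 1.0, 1.0, 1.0],
                 [1.0, 0.0, 2.0, 2.0],
                 [1.0, 2.0, 0.0, 2.0],
                 [1.0, 2.0, 2.0, 0.0]])

nef_triangle = negative_eigenfraction(triangle)
nef_star = negative_eigenfraction(star)
```

For the star, K has eigenvalues 2, 2, 0 and -0.25, so the NEF is 0.25 / 4.25, about 0.06.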

Tests(3)

1. $D_{ij} \ge 0$ (non-negativity)

2. $D_{ij} = 0$ iff i = j (identity of indiscernibles)

3. $D_{ij} = D_{ji}$ (symmetry)

4. $D_{ij} \le D_{ik} + D_{kj}$ (triangle inequality)

– Check these for your data (the 4th involves checking all triples)

– Metric data is embeddable on a (curved) Riemannian manifold
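The four conditions can be checked directly on a dissimilarity matrix. The sketch below is a brute-force version: the triangle check builds an m × m × m array, so it only suits small problems. The function name, the tolerance `tol` and the example matrices are assumptions made here.

```python
import numpy as np

def metric_violations(D, tol=1e-12):
    """Report which of the four metric conditions the matrix D violates."""
    D = np.asarray(D, dtype=float)
    off_diagonal = ~np.eye(len(D), dtype=bool)
    return {
        "non-negativity": bool((D < -tol).any()),
        "identity": bool((np.abs(np.diag(D)) > tol).any()
                         or (D[off_diagonal] <= tol).any()),
        "symmetry": bool((np.abs(D - D.T) > tol).any()),
        # Entry [i, k, j] compares D_ij with D_ik + D_kj over all triples.
        "triangle": bool((D[:, None, :] > D[:, :, None] + D[None, :, :] + tol).any()),
    }

good = metric_violations([[0, 3, 4], [3, 0, 5], [4, 5, 0]])
bad = metric_violations([[0, 1, 3], [1, 0, 1], [3, 1, 0]])   # 3 > 1 + 1
```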

Corrections

• If the data is non-metric or non-Euclidean, we can ‘correct it’

• Symmetry violations

– Average: $D_{ij} = D_{ji} \leftarrow \tfrac{1}{2}(D_{ij} + D_{ji})$

– For min-cost distances, the minimum may be more appropriate: $D_{ij} = D_{ji} \leftarrow \min(D_{ij}, D_{ji})$

• Triangle violations

– Constant offset: $D_{ij} \leftarrow D_{ij} + c \quad (i \ne j)$

– This will also remove non-Euclidean behaviour for large enough c

• Euclidean violations

– Discard negative eigenvalues

• There are many other approaches*

* “On Euclidean corrections for non-Euclidean dissimilarities”, Duin, Pekalska, Harol, Lee and Bunke, S+SSPR 08
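Each of these corrections is a one-liner in NumPy. The sketch below uses made-up function names. `smallest_triangle_offset` assumes a symmetric D with zero diagonal and finds the least c for which the offset matrix satisfies every triangle inequality.

```python
import numpy as np

def symmetrise(D, how="average"):
    """Average D with its transpose, or take the elementwise minimum."""
    return np.minimum(D, D.T) if how == "min" else 0.5 * (D + D.T)

def smallest_triangle_offset(D):
    """Smallest c >= 0 with D_ij + c <= (D_ik + c) + (D_kj + c) for all triples."""
    gap = D[:, None, :] - D[:, :, None] - D[None, :, :]   # D_ij - D_ik - D_kj
    return max(0.0, float(gap.max()))

def add_offset(D, c):
    """Add c to every off-diagonal entry, leaving the diagonal at zero."""
    return D + c * (1.0 - np.eye(len(D)))

def clip_negative_eigenvalues(K):
    """Discard the negative eigenvalues of a similarity matrix K."""
    lam, U = np.linalg.eigh(K)
    return (U * np.clip(lam, 0.0, None)) @ U.T

D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])          # hypothetical: violates 3 <= 1 + 1
c = smallest_triangle_offset(D)
D_fixed = add_offset(D, c)
```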

Part III: Advanced techniques for non-Euclidean Embeddings

Known Manifolds

• Sometimes we have data which lies on a known but non-Euclidean manifold

• Examples in Computer Vision

– Surface normals

– Rotation matrices

– Flow tensors (DT-MRI)

• This is not Manifold Learning, as we already know what the manifold is

• What tools do we need to be able to process data like this?

– As before, distances are the key

Example: 2D direction

Direction of an edge in an image, encoded as a unit vector

The average of the direction vectors isn’t even a direction vector (not unit length), let alone the correct ‘average’ direction

$$\bar{\mathbf{x}} = \frac{1}{n} \sum_i \mathbf{x}_i$$

The normal definition of mean is not correct

– Because the manifold is curved

[Figure: unit vectors $\mathbf{x}_1$, $\mathbf{x}_2$ and their average $\bar{\mathbf{x}}$]

Tangent space

• The tangent space ($T_P$) is the Euclidean space which is parallel to the manifold (M) at a particular point (P)

• The tangent space is a very useful tool because it is Euclidean

[Figure: manifold M with the tangent space $T_P$ at the point P]

Exponential Map

• Exponential map:

$$\mathrm{Exp}_P : T_P \to M, \qquad A = \mathrm{Exp}_P X$$

• $\mathrm{Exp}_P$ maps a point X on the tangent plane onto a point A on the manifold

– P is the centre of the mapping and is at the origin on the tangent space

– The mapping is one-to-one in a local region of P

– The most important property of the mapping is that the distances to the centre P are preserved

– The geodesic distance on the manifold equals the Euclidean distance on the tangent plane (for distances to the centre only)

$$d_{T_P}(X, P) = d_M(A, P)$$

Exponential map

• The log map goes the other way, from manifold to tangent plane

$$\mathrm{Log}_P : M \to T_P, \qquad X = \mathrm{Log}_P A$$

Exponential Map

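• Example on the circle: embed the circle in the complex plane

• A point on the circle is a complex number with magnitude 1 and can be written $x + iy = \exp(i\theta)$

$$P = e^{i\theta_P}, \qquad A = e^{i\theta_A}$$

• In this case it turns out that the map is related to the normal exp and log functions

$$X = \mathrm{Log}_P A = -i \log \frac{A}{P} = -i \log \frac{e^{i\theta_A}}{e^{i\theta_P}} = \theta_A - \theta_P$$

$$A = \mathrm{Exp}_P X = P \exp(iX) = \exp(i\theta_P) \exp\big(i(\theta_A - \theta_P)\big) = \exp(i\theta_A)$$

[Figure: the unit circle M in the complex plane (axes Re, Im), with the tangent space $T_P$ at P, the point A and its tangent coordinate X]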
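The circle example maps directly onto Python's built-in complex numbers. In this sketch the angles 0.3 and 1.1 are arbitrary and the function names are made up. `cmath.log` returns the principal branch, so the tangent coordinate always lies in (-π, π].

```python
import cmath

def log_map(P, A):
    """Tangent coordinate of A seen from P: X = -i log(A / P), a real angle offset."""
    return (-1j * cmath.log(A / P)).real

def exp_map(P, X):
    """Point on the circle reached from P by the tangent offset X: A = P exp(iX)."""
    return P * cmath.exp(1j * X)

P = cmath.exp(1j * 0.3)    # hypothetical centre, theta_P = 0.3
A = cmath.exp(1j * 1.1)    # hypothetical point, theta_A = 1.1
X = log_map(P, A)          # theta_A - theta_P
A_back = exp_map(P, X)
```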

Intrinsic mean

• The mean of a set of samples is usually defined as the sum of the samples divided by the number

– This is only true in Euclidean space

• A more general formula

$$\bar{\mathbf{x}} = \arg\min_{\mathbf{x}} \sum_i d_g^2(\mathbf{x}, \mathbf{x}_i)$$

• Minimises the squared distances from the mean to the samples (equivalent in Euclidean space)

Intrinsic mean

• We can compute this intrinsic mean using the exponential map

• If we knew what the mean M was, we could use it as the centre of a map

$$X_i = \mathrm{Log}_M A_i$$

• From the properties of the Exp-map, the distances are the same

$$d_e(X_i, M) = d_g(A_i, M)$$

• So the mean on the tangent plane is equal to the mean on the manifold

Intrinsic mean

• Start with a guess at the mean and move towards the correct answer

• This gives us the following algorithm

– Guess at a mean $M_0$

1. Map on to the tangent plane using $M_k$

2. Compute the mean on the tangent plane to get the new estimate $M_{k+1}$

$$M_{k+1} = \mathrm{Exp}_{M_k} \left( \frac{1}{n} \sum_i \mathrm{Log}_{M_k} A_i \right)$$
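The iteration can be sketched for the simplest curved manifold, the unit circle in the complex plane, where the Log and Exp maps reduce to the ordinary complex log and exp. The function name, the starting guess (the first sample) and the fixed iteration count are choices made for this sketch, and the sample angles are invented.

```python
import cmath

def intrinsic_mean(points, n_iter=50):
    """Iterate M <- Exp_M(mean of Log_M(A_i)) for unit complex numbers A_i."""
    M = points[0]                                   # initial guess M_0
    for _ in range(n_iter):
        # 1. Map every sample onto the tangent space at the current estimate.
        tangent = [(-1j * cmath.log(A / M)).real for A in points]
        # 2. Average there and map the result back onto the circle.
        M = M * cmath.exp(1j * sum(tangent) / len(tangent))
    return M

angles = [0.1, 0.5, 1.2]                            # hypothetical directions
points = [cmath.exp(1j * a) for a in angles]
M = intrinsic_mean(points)                          # angle 0.6
naive = sum(points) / len(points)                   # Euclidean mean
```

The Euclidean mean `naive` has magnitude below 1, so it is not a direction at all, while M stays on the circle at the average angle.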

Intrinsic Mean

• For many manifolds, this procedure will converge to the intrinsic mean

– Convergence not always guaranteed

• Other statistics and probability distributions on manifolds are problematic

– Can hypothesise a normal distribution on the tangent plane, but distortions are inevitable

Some useful manifolds and maps

• Some useful manifolds and exponential maps

• Directional vectors (surface normals etc.)

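• a, p unit vectors, x lies in an (n-1)D space

$$\langle \mathbf{a}, \mathbf{a} \rangle = 1$$

$$\mathbf{x} = \frac{\theta}{\sin\theta} (\mathbf{a} - \mathbf{p} \cos\theta) \quad \text{(Log map)}$$

$$\mathbf{a} = \mathbf{p} \cos\theta + \frac{\sin\theta}{\theta} \mathbf{x} \quad \text{(Exp map)}$$

– θ is the angle between a and p, so $\cos\theta = \langle \mathbf{a}, \mathbf{p} \rangle$ and $\theta = |\mathbf{x}|$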
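A sketch of the Log and Exp maps for unit vectors in NumPy. It assumes θ is the angle between a and p (equal to |x|), guards the θ → 0 limit, and does not handle antipodal points, where sin θ = 0 and the Log map is undefined. Function names and test vectors are hypothetical.

```python
import numpy as np

def sphere_log(p, a):
    """Tangent vector x at p pointing towards a, with |x| equal to the geodesic distance."""
    c = np.clip(a @ p, -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(p)
    return theta / np.sin(theta) * (a - p * c)

def sphere_exp(p, x):
    """Point on the unit sphere reached from p along the tangent vector x."""
    theta = np.linalg.norm(x)
    if theta < 1e-12:
        return p.copy()
    return p * np.cos(theta) + np.sin(theta) / theta * x

p = np.array([0.0, 0.0, 1.0])      # hypothetical centre: the north pole
a = np.array([1.0, 0.0, 0.0])      # hypothetical point on the equator
x = sphere_log(p, a)
a_back = sphere_exp(p, x)
```

Here x has length π/2, the arc length from the pole to the equator, and is orthogonal to p as a tangent vector should be.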


• Symmetric positive definite matrices (covariance, flow tensors etc)

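• A is symmetric positive definite, X is just symmetric

$$\mathbf{u}^T A \mathbf{u} > 0 \quad \forall \mathbf{u} \ne 0$$

• log is the matrix log defined as a generalized matrix function

$$X = P^{1/2} \log\left( P^{-1/2} A P^{-1/2} \right) P^{1/2} \quad \text{(Log map)}$$

$$A = P^{1/2} \exp\left( P^{-1/2} X P^{-1/2} \right) P^{1/2} \quad \text{(Exp map)}$$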
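For symmetric matrices the matrix log, exp and square root can all be computed through an eigendecomposition, which gives a short NumPy sketch of these two maps. The helper `sym_fun` and the example matrices P and A are assumptions introduced for illustration.

```python
import numpy as np

def sym_fun(S, f):
    """Apply a scalar function f to a symmetric matrix S through its eigenvalues."""
    lam, U = np.linalg.eigh((S + S.T) / 2)
    return (U * f(lam)) @ U.T

def spd_log(P, A):
    """Log map at P: symmetric tangent matrix X for the SPD matrix A."""
    root = sym_fun(P, np.sqrt)
    inv_root = sym_fun(P, lambda lam: 1.0 / np.sqrt(lam))
    return root @ sym_fun(inv_root @ A @ inv_root, np.log) @ root

def spd_exp(P, X):
    """Exp map at P: SPD matrix A for the symmetric tangent matrix X."""
    root = sym_fun(P, np.sqrt)
    inv_root = sym_fun(P, lambda lam: 1.0 / np.sqrt(lam))
    return root @ sym_fun(inv_root @ X @ inv_root, np.exp) @ root

P = np.array([[2.0, 0.5], [0.5, 1.0]])      # hypothetical SPD centre
A = np.array([[1.0, -0.2], [-0.2, 3.0]])    # hypothetical SPD sample
X = spd_log(P, A)
A_back = spd_exp(P, X)
```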


• Orthogonal matrices (rotation matrices, eigenvector matrices)

• A orthogonal, X antisymmetric ($X + X^T = 0$)

$$A A^T = I$$

$$X = \log\left( P^T A \right) \quad \text{(Log map)}$$

$$A = P \exp(X) \quad \text{(Exp map)}$$

• These are the matrix exp and log functions as before

• In fact there are multiple solutions to the matrix log

– Only one is the required real antisymmetric matrix; not easy to find

– Rest are complex

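Embedding on $S^n$

• On $S^2$ (surface of a sphere in 3D) the following parameterisation is well known

$$\mathbf{x} = (r \sin\theta \cos\phi,\; r \sin\theta \sin\phi,\; r \cos\theta)^T$$

• The distance between two points (the length of the geodesic) is

$$d_{xy} = r \cos^{-1}\left( \sin\theta_x \sin\theta_y \cos(\phi_x - \phi_y) + \cos\theta_x \cos\theta_y \right)$$

[Figure: points x and y on the sphere, the angle $\theta_{xy}$ between them and the arc of length $r\theta_{xy}$]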

More Spherical Geometry

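• But on a sphere, the distance is the highlighted arc-length

– Much neater to use inner-product

– And works in any number of dimensions

$$\langle \mathbf{x}, \mathbf{y} \rangle = r^2 \cos\theta_{xy}$$

$$d_{xy} = r \theta_{xy} = r \cos^{-1}\left( \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{r^2} \right)$$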

Spherical Embedding

• Say we had the distances between some objects ($d_{ij}$), measured on the surface of a [hyper]sphere of dimension n

• The sphere (and objects) can be embedded into an n+1 dimensional space

– Let X be the matrix of point positions

• $Z = XX^T$ is a kernel matrix

• But $Z_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$

• And $d_{xy} = r \cos^{-1}\left( \langle \mathbf{x}, \mathbf{y} \rangle / r^2 \right)$, so

$$Z_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle = r^2 \cos\frac{d_{ij}}{r}$$

• We can compute Z from D and find the spherical embedding!

Spherical Embedding

• But wait, we don’t know what r is!

• The distances D are non-Euclidean, and if we use the wrong radius, Z is not a kernel matrix

– Negative eigenvalues

• Use this to find the radius

– Choose r to minimise the negative eigenvalues

$$r^* = \arg\min_r \left| \lambda_0[Z(r)] \right|$$

– Here $\lambda_0[Z(r)]$ is the most negative eigenvalue of $Z(r)$
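One simple way to try this out is a grid search over candidate radii, scoring each by the magnitude of the smallest eigenvalue of Z(r). This is a sketch of the idea, not necessarily the optimiser used in the original work. The function names, the grid and the synthetic data (points on a unit sphere) are assumptions.

```python
import numpy as np

def spherical_kernel(D, r):
    """Z_ij = r^2 cos(d_ij / r)."""
    return r**2 * np.cos(D / r)

def find_radius(D, radii):
    """Grid search: pick the radius whose Z(r) has the smallest |minimum eigenvalue|."""
    scores = [abs(np.linalg.eigvalsh(spherical_kernel(D, r))[0]) for r in radii]
    return float(radii[int(np.argmin(scores))])

def spherical_embedding(D, r, dim):
    """Kernel embedding of Z(r) into `dim` dimensions (dim = n + 1 for S^n)."""
    lam, U = np.linalg.eigh(spherical_kernel(D, r))
    top = np.argsort(lam)[::-1][:dim]
    return U[:, top] * np.sqrt(np.clip(lam[top], 0.0, None))

# Hypothetical data: geodesic distances between 10 random points on the unit sphere.
rng = np.random.default_rng(2)
pts = rng.normal(size=(10, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
D = np.arccos(np.clip(pts @ pts.T, -1.0, 1.0))

r = find_radius(D, np.linspace(0.5, 2.0, 16))
X = spherical_embedding(D, r, dim=3)
```

With the correct radius every embedded point lies at distance r from the origin, and X reproduces the original points up to a rotation or reflection.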

Example: Texture Mapping

• As an alternative to unwrapping object onto a plane and texture-mapping the plane

• Embed onto a sphere and texture-map the sphere

Plane Sphere

Backup slides

Laplacian and related processes

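• As well as embedding objects onto manifolds, we can model many interesting processes on manifolds

• Example: the way ‘heat’ flows across a manifold can be very informative

$$\frac{\partial u}{\partial t} = \nabla^2 u \quad \text{(heat equation)}$$

• $\nabla^2$ is the Laplacian, and in 3D Euclidean space it is

$$\nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}$$

• On a sphere it is

$$\nabla^2 = \frac{1}{r^2 \sin\theta} \frac{\partial}{\partial\theta} \left( \sin\theta \frac{\partial}{\partial\theta} \right) + \frac{1}{r^2 \sin^2\theta} \frac{\partial^2}{\partial\phi^2}$$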

Heat flow

• Heat flow allows us to do interesting things on a manifold

• Smoothing: Heat flow is a diffusion process (will smooth the data)

• Characterising the manifold (heat content, heat kernel coefficients...)

• The Laplacian depends on the geometry of the manifold

– We may not know this

– It may be hard to calculate explicitly

• Graph Laplacian

Graph Laplacian

• Given a set of datapoints on the manifold, describe them by a graph

– Vertices are datapoints, edges are adjacency relation

• Adjacency matrix (for example, with σ a length scale)

$$A_{ij} = \exp\left( -d_{ij}^2 / \sigma^2 \right)$$

• Then the graph Laplacian is

$$V_{ii} = \sum_j A_{ij}, \qquad L = V - A$$

• The graph Laplacian is a discrete approximation of the manifold Laplacian

Heat Kernel

• Using the graph Laplacian, we can easily implement heat-flow methods on the manifold using the heat-kernel

$$\frac{d\mathbf{u}}{dt} = -L \mathbf{u} \quad \text{(heat equation)}$$

$$H = \exp(-Lt) \quad \text{(heat kernel)}$$

• Can diffuse a function on the manifold by

$$\mathbf{f}' = H \mathbf{f}$$
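Because L is symmetric, the heat kernel can be computed from its eigendecomposition. The sketch below diffuses a unit of heat placed at one end of a small path graph. The graph, the function name and the diffusion times are made up for illustration.

```python
import numpy as np

def heat_kernel(A, t):
    """H = exp(-L t) for the graph Laplacian L = V - A."""
    L = np.diag(A.sum(axis=1)) - A
    lam, U = np.linalg.eigh(L)
    return (U * np.exp(-lam * t)) @ U.T

# Hypothetical graph: a path of 5 vertices with unit edge weights.
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
f = np.array([1.0, 0.0, 0.0, 0.0, 0.0])     # all the heat starts at one end
f_later = heat_kernel(A, t=1.0) @ f         # f' = H f
f_final = heat_kernel(A, t=1e3) @ f         # long-time limit
```

Diffusion conserves the total heat and, on a connected graph, spreads it evenly over all vertices as t grows, which is the smoothing behaviour described for heat flow.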
