
Applications of Information Geometry

• Clustering based on Divergence: Cluster Center
• Stochastic Reasoning: Belief Propagation in Graphical Models
• Support Vector Machine
• Bayesian Framework and Restricted Boltzmann Machine
• Natural Gradient in Multilayer Perceptron Learning
• Independent Component Analysis
• Sparse Signal Analysis and Minkovskian Gradient
• Convex Optimization

Clustering of Points   $D = \{x_1, x_2, x_3, \ldots, x_N\}$

Clustering: divide $D$ into clusters $C_1, \ldots, C_m$; each point is assigned to the cluster whose center is closest, $\arg\min_i D[x : \theta_i^*]$.

Center of a cluster $C$:

$\theta_C^* = \arg\min_\theta \sum_{x_i \in C} D[x_i : \theta]$

For a Bregman divergence, the center is simply the arithmetic average in the dual coordinate system.

k‐means: clustering algorithm

For a Bregman divergence

$D[\theta_i : \theta^*] = \psi(\theta_i) - \psi(\theta^*) - \nabla\psi(\theta^*)\cdot(\theta_i - \theta^*),$

setting $\dfrac{\partial}{\partial\theta^*}\sum_i D[\theta_i : \theta^*] = 0$ gives

$\eta^* = \dfrac{1}{m}\sum_i \eta_i,$

the arithmetic mean in the dual coordinates $\eta = \nabla\psi(\theta)$.

k‐means:  clustering algorithm

1. Choose $k$ cluster centers.
2. Classify the patterns of $D$ into clusters (nearest center under the divergence).
3. Calculate new cluster centers (dual-coordinate averages) and repeat; see the sketch below.
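A minimal sketch of this loop, not from the slides: it assumes the generalized KL divergence $D(x : c) = \sum_j (x_j\log(x_j/c_j) - x_j + c_j)$ on positive data as the Bregman divergence, in which case the cluster center is the plain arithmetic mean, i.e. the average in the dual coordinates of $\psi(x) = \sum_j (x_j\log x_j - x_j)$.

```python
# Bregman k-means sketch (generalized KL divergence is an illustrative assumption).
import numpy as np

def gkl(x, c):
    # generalized KL (Bregman) divergence between points x and a center c
    return np.sum(x * np.log(x / c) - x + c, axis=-1)

def bregman_kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: classify each pattern to the nearest center under the divergence
        d = np.stack([gkl(X, c) for c in centers], axis=1)   # shape (N, k)
        labels = d.argmin(axis=1)
        # step 3: recompute each center as the cluster mean (dual-coordinate average)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

if __name__ == "__main__":
    X = np.abs(np.random.default_rng(1).normal(5.0, 1.0, size=(200, 2)))
    centers, labels = bregman_kmeans(X, k=3)
    print(centers)
```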

Voronoi Diagram:  Boundaries of Clusters

$D[\theta : \theta_1^*] = D[\theta : \theta_2^*]$: the boundary between two clusters is the set of points equidistant (in divergence) from the two centers.

Total Bregman divergence (Vemuri)

$\mathrm{tBD}(p : q) = \dfrac{\psi(p) - \psi(q) - \nabla\psi(q)\cdot(p - q)}{\sqrt{1 + |\nabla\psi(q)|^2}}$

t-center of a cluster $C$: a weighted average (in the dual coordinates),

$c^* = \dfrac{\sum_i w_i\, p_i}{\sum_i w_i}, \qquad w_i = \dfrac{1}{\sqrt{1 + |\nabla\psi(p_i)|^2}}$
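A minimal sketch, assuming the squared-Euclidean case $\psi(x) = \frac{1}{2}|x|^2$ (so $\nabla\psi(x) = x$): the t-center is the weighted average with weights $w_i = 1/\sqrt{1 + |x_i|^2}$, and a distant outlier's pull stays bounded.

```python
import numpy as np

def t_center(points):
    # weights w_i = 1 / sqrt(1 + |x_i|^2): far-away points get small weights
    w = 1.0 / np.sqrt(1.0 + np.sum(points**2, axis=1))
    return (w[:, None] * points).sum(axis=0) / w.sum()

pts = np.array([[0.9, 1.1], [1.1, 0.9], [1.0, 1.0], [50.0, 50.0]])  # one outlier
print("t-center:", t_center(pts))   # stays near (1, 1)
print("mean    :", pts.mean(axis=0))  # dragged toward the outlier
```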

Conformal change of divergence

$\tilde D[p : q] = \sigma(q)\, D[p : q]$

$\tilde g_{ij}(p) = \sigma(p)\, g_{ij}(p)$

$\tilde T_{ijk} = \sigma\, T_{ijk} + s_k\, g_{ij} + s_j\, g_{ik} + s_i\, g_{jk}, \qquad s_i = \partial_i \log \sigma$

t-center is robust

Add an outlier $y$ to the data $\{x_1, \ldots, x_n\}$ and let $c^*$ be the t-center. The influence function $z(y)$ measures how much the contaminating point moves the estimator; the estimator is robust when $z(y)$ stays bounded as $|y| \to \infty$.

For the t-center, $z(y) \propto w(y)\,(y - c^*)$ with weight $w(y) = \dfrac{1}{\sqrt{1 + |\nabla\psi(y)|^2}}$, so the influence function is bounded: robust.

Euclidean case, $\psi(x) = \frac{1}{2}|x|^2$:

$w(y) = \dfrac{1}{\sqrt{1 + |y|^2}}, \qquad z(y) \propto \dfrac{y - c^*}{\sqrt{1 + |y|^2}}$ (bounded).

With the extra point, the t-center minimizes $\dfrac{1}{N + 1}\Big(\sum_i \mathrm{tBD}[x_i : c] + \mathrm{tBD}[y : c]\Big)$, and the outlier enters only through the bounded weight $w(y)$.

MPEG7 database
• Great intraclass variability and small interclass dissimilarity.

Shape representation

First clustering then retrieval

Other TBD applications

Diffusion tensor imaging (DTI) analysis [Vemuri]
• Interpolation
• Segmentation

Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari and Frank Nielsen, Total Bregman Divergence and its Applications to DTI Analysis, IEEE TMI, to appear

Information Geometry of Stochastic Reasoning

Belief Propagation in Graphical Model

• Shun-ichi Amari (RIKEN BSI)
• Shiro Ikeda (Inst. Statist. Math.)
• Toshiyuki Tanaka (Kyoto U.)

Stochastic Reasoning:  Graphical Model

$p(x, y, z, r, s)$

$q(x, y, z \mid r, s), \qquad x, y, z, \ldots = \pm 1$

Stochastic Reasoning: $q(x_1, x_2, x_3, \ldots \mid \text{observation})$

$\mathbf{X} = (x_1, x_2, x_3, \ldots), \qquad x_i = \pm 1$

$\hat{\mathbf{X}} = \arg\max\ q(x_1, x_2, x_3, \ldots)$: maximum likelihood

$\hat{x}_i = \mathrm{sgn}\, E[x_i]$: least bit-error-rate estimator

Mean Value Marginalization: projection to independent distributions

$q_0(\mathbf{x}) = q_1(x_1)\, q_2(x_2) \cdots q_n(x_n)$

$q_i(x_i) = \sum_{x_j,\ j \neq i} q(x_1, \ldots, x_n)$

$E_q[\mathbf{x}] = E_{q_0}[\mathbf{x}]$

$q(\mathbf{x}) = \exp\Big\{\sum_{i<j} w_{ij}\, x_i x_j + \sum_i h_i x_i - \psi\Big\}, \qquad x_i \in \{-1, +1\}$

More generally, for a graphical model with cliques $r = 1, \ldots, L$,

$q(\mathbf{x}) = \exp\Big\{\sum_{r} c_r(\mathbf{x}) + \sum_i h_i x_i - \psi\Big\}$

Boltzmann machine, spin glass, neural networks; Turbo codes, LDPC codes

cliques

Computationally Difficult

$E_q[\mathbf{x}] = \sum_{\mathbf{x}} \mathbf{x}\, q(\mathbf{x})$ and the marginals require summing over all $2^n$ configurations of $\mathbf{x}$.
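A minimal sketch, not from the slides, of exact mean-value marginalization for a small Boltzmann machine: the toy weights are assumptions, and the brute-force sum over all $2^n$ states is exactly what becomes infeasible for large $n$, motivating the approximations listed next.

```python
import itertools
import numpy as np

def exact_means(W, h):
    # E_q[x_i] for q(x) ∝ exp{ sum_{i<j} w_ij x_i x_j + sum_i h_i x_i }, x_i in {-1,+1}
    n = len(h)
    states = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
    energies = 0.5 * np.einsum('si,ij,sj->s', states, W, states) + states @ h
    p = np.exp(energies)
    p /= p.sum()                      # normalization (partition function)
    return p @ states

rng = np.random.default_rng(0)
n = 6
W = rng.normal(scale=0.3, size=(n, n)); W = np.triu(W, 1); W = W + W.T
h = rng.normal(scale=0.2, size=n)
print(exact_means(W, h))              # feasible for n = 6, hopeless for n = 100
```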

mean‐field approximation

belief propagation

tree propagation, CCCP (convex‐concave)

Information Geometry of Mean Field Approximation

• m-projection: $p^* = \arg\min_{p \in M_0} D[q : p]$
• e-projection: $p^* = \arg\min_{p \in M_0} D[p : q]$

$D[q : p] = \sum_{\mathbf{x}} q(\mathbf{x}) \log\dfrac{q(\mathbf{x})}{p(\mathbf{x})}$

$M_0 = \Big\{p_0(\mathbf{x}) = \prod_i p_i(x_i)\Big\}$: the submanifold of independent distributions; in general $q(\mathbf{x}) \notin M_0$.

The m-projection $p(\mathbf{x}, \bar{\boldsymbol\theta}^*)$ of $q$ onto $M_0$ is characterized by

$\sum_{\mathbf{x}} \mathbf{x}\,\big\{q(\mathbf{x}) - p(\mathbf{x}, \bar{\boldsymbol\theta}^*)\big\} = 0, \qquad\text{i.e.}\quad E_q[\mathbf{x}] = E_{p^*}[\mathbf{x}].$

m‐projection keeps the expectation of x

Information Geometry

$M_0 = \big\{p_0(\mathbf{x}, \boldsymbol\theta) = \exp\{\boldsymbol\theta\cdot\mathbf{x} - \psi_0\}\big\}$

$M_r = \big\{p_r(\mathbf{x}, \boldsymbol\zeta_r) = \exp\{c_r(\mathbf{x}) + \boldsymbol\zeta_r\cdot\mathbf{x} - \psi_r\}\big\}, \qquad r = 1, \ldots, L$

$q(\mathbf{x}) = \exp\Big\{\sum_r c_r(\mathbf{x}) + \mathbf{h}\cdot\mathbf{x} - \psi\Big\}$

Belief Propagation Algorithm

$p_r(\mathbf{x}, \boldsymbol\xi_r) = \exp\{c_r(\mathbf{x}) + \boldsymbol\xi_r\cdot\mathbf{x} - \psi_r\}$ treats clique $r$ exactly and receives the other cliques' messages through $\boldsymbol\xi_r$; $p_0(\mathbf{x}, \boldsymbol\theta) \in M_0$ carries the current belief. Writing $\pi_0$ for the m-projection onto $M_0$, one BP step is

$\boldsymbol\zeta_r^{t+1} = \pi_0 \circ p_r(\mathbf{x}, \boldsymbol\xi_r^{t}) - \boldsymbol\xi_r^{t}, \qquad \boldsymbol\theta^{t+1} = \sum_r \boldsymbol\zeta_r^{t+1}, \qquad \boldsymbol\xi_r^{t+1} = \boldsymbol\theta^{t+1} - \boldsymbol\zeta_r^{t+1}.$

$p_0(\mathbf{x}, \boldsymbol\theta^{t})$: belief for $q(\mathbf{x})$.

Equilibrium of BP: $(\boldsymbol\theta^*, \boldsymbol\xi_r^*)$

1) m-condition: $p_0(\mathbf{x}, \boldsymbol\theta^*)$ is the m-projection of every $p_r(\mathbf{x}, \boldsymbol\xi_r^*)$ onto $M_0$; equivalently, $p_0$ and all the $p_r$ lie in a common m-flat submanifold (they share the same expectation of $\mathbf{x}$).

2) e-condition: $\boldsymbol\theta^* = \dfrac{1}{L - 1}\sum_r \boldsymbol\xi_r^*$; equivalently, $q$, $p_0$ and the $p_r$ lie in a common e-flat submanifold.

Belief Propagation  e‐condition OK

CCCP m‐condition OK

e-flat submanifold: $E = \Big\{(\boldsymbol\theta; \boldsymbol\xi_1, \ldots, \boldsymbol\xi_L) : (L - 1)\,\boldsymbol\theta = \sum_r \boldsymbol\xi_r\Big\}$

m-condition: $\pi_0 \circ p_r(\mathbf{x}, \boldsymbol\xi_r) = p_0(\mathbf{x}, \boldsymbol\theta)$ for all $r$.

Convex-Concave Computational Procedure (CCCP): A. Yuille

Decompose the objective as $F(\boldsymbol\theta) = F_1(\boldsymbol\theta) - F_2(\boldsymbol\theta)$ with $F_1$, $F_2$ convex, and iterate

$\nabla F_1(\boldsymbol\theta^{t+1}) = \nabla F_2(\boldsymbol\theta^{t}).$

Elimination of double loops.
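A minimal sketch of the CCCP iteration on a one-dimensional toy objective (the functions below are illustrative assumptions, not the slides' free energy): each step solves $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^{t})$ in closed form and never increases $F = F_1 - F_2$.

```python
import numpy as np

F1 = lambda t: t**4 / 4.0          # convex part
F2 = lambda t: 2.0 * t**2          # the concave part of F is -F2 (F2 convex)
dF2 = lambda t: 4.0 * t

t = 3.0
for _ in range(30):
    # grad F1(t) = t^3, so solve t_new^3 = dF2(t_old) directly
    t = np.cbrt(dF2(t))
print(t, F1(t) - F2(t))            # converges to a stationary point of F (t = 2)
```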

Geometry of Support Vector Machine: modification of the kernel

Simple perceptron: decision surface

$f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b, \qquad d = \dfrac{|\mathbf{w}\cdot\mathbf{x} + b|}{|\mathbf{w}|}$

minimize $|\mathbf{w}|^2$ under the constraints $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \geq 1$

$L(\mathbf{w}, b, \boldsymbol\alpha) = \frac{1}{2}|\mathbf{w}|^2 - \sum_i \alpha_i\,\big\{y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1\big\}$

$\alpha_i > 0$: $(\mathbf{x}_i, y_i)$ is a support vector; otherwise $\alpha_i = 0$

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad f(\mathbf{x}) = \sum_i \alpha_i y_i\, \mathbf{x}_i\cdot\mathbf{x} + b$

Embedding in higher‐dimensional space

$\mathbf{x} \mapsto \mathbf{z} = \boldsymbol\varphi(\mathbf{x}), \qquad f(\mathbf{x}) = \mathbf{w}\cdot\boldsymbol\varphi(\mathbf{x}) + b$

$\mathbf{z}$: infinite-dimensional

Kernel trick: $K(\mathbf{x}, \mathbf{x}') = \boldsymbol\varphi(\mathbf{x})\cdot\boldsymbol\varphi(\mathbf{x}')$

$f(\mathbf{x}) = \sum_i \alpha_i y_i\, K(\mathbf{x}, \mathbf{x}_i) + b$

Riemannian metric induced by the kernel:

$ds^2 = |\mathbf{z}(\mathbf{x} + d\mathbf{x}) - \mathbf{z}(\mathbf{x})|^2 = \sum_{i,j} g_{ij}(\mathbf{x})\, dx^i dx^j$

$g_{ij}(\mathbf{x}) = \dfrac{\partial}{\partial x^i}\dfrac{\partial}{\partial x'^j} K(\mathbf{x}, \mathbf{x}')\Big|_{\mathbf{x}' = \mathbf{x}}$

Conformal Transformation of Kernel

$\tilde K(\mathbf{x}, \mathbf{x}') = \sigma(\mathbf{x})\,\sigma(\mathbf{x}')\, K(\mathbf{x}, \mathbf{x}'), \qquad \sigma(\mathbf{x}) = \exp\{-\kappa\, f(\mathbf{x})^2\}$

(e.g. with a Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp\{-\kappa\,|\mathbf{x} - \mathbf{x}'|^2\}$)

$\tilde g_{ij}(\mathbf{x}) = \sigma(\mathbf{x})^2\, g_{ij}(\mathbf{x}) + \partial_i\sigma(\mathbf{x})\,\partial_j\sigma(\mathbf{x})$

The conformal factor magnifies the induced Riemannian metric near the decision boundary $f(\mathbf{x}) = 0$.
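A minimal sketch of the two-pass idea above: train once, rescale the kernel conformally by $\sigma(\mathbf{x}) = \exp\{-\kappa f(\mathbf{x})^2\}$ using the first decision function, and retrain. The dataset, sklearn usage and all parameter values are illustrative assumptions, not the slides' exact recipe.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

def gaussian_kernel(A, B, gamma=2.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# first pass: ordinary Gaussian-kernel SVM
svm1 = SVC(kernel=gaussian_kernel).fit(X, y)
f = svm1.decision_function                 # f(x) from the first pass

def conformal_kernel(A, B, kappa=1.0):
    sA = np.exp(-kappa * f(A) ** 2)        # conformal factor sigma(x)
    sB = np.exp(-kappa * f(B) ** 2)
    return sA[:, None] * gaussian_kernel(A, B) * sB[None, :]

# second pass: retrain with the conformally transformed kernel
svm2 = SVC(kernel=conformal_kernel).fit(X, y)
print(svm1.score(X, y), svm2.score(X, y))
```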

Basic Principles of Unsupervised and Supervised Learning Toward Deep Learning

Shun-ichi Amari (RIKEN Brain Science Institute); collaborators: R. Karakida, M. Okada (U. Tokyo)

Deep Learning: Self-Organization + Supervised Learning

RBM: Restricted Boltzmann Machine; Auto-Encoder, Recurrent Net

Dropout, Contrastive divergence

Boltzmann Machine

RBM:  Restricted Boltzmann Machine

RBM

Simple Hebbian Self‐Organization

self‐organization of  

Self‐Organization

Interaction of Hidden Neurons

Bayesian Duality in Exponential Family

Data $x$   ↔   Parameter (higher-order concepts)

Curved exponential family

RBM: $\mathbf{x} = \mathbf{v}$ with $\boldsymbol\theta = W\mathbf{h}$; dually $\mathbf{x} = \mathbf{h}$ with $\boldsymbol\theta = \mathbf{v}W$

Gaussian Boltzmann Machine

Equilibrium Solution

Equilibrium Solution

General Solution 

You can choose $m$ ($\leq k$) eigenvalues to form a solution

• diagonalized by 

Stable Solution: the case of m = k

‐ No connections within layers  

Contrastive Divergence


• How to train an RBM: Maximum Likelihood (ML) learning is hard

• RBM: a 2-layered probabilistic neural network

Sampling

(Figure: Gibbs sampling chain from the input toward equilibrium.)

Many iterations of Gibbs sampling demand too much computational time.

(Figure: visible layer v and hidden layer h coupled by weights W.)

Contrastive Divergence Solution
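A minimal sketch of standard CD-1 training for a Bernoulli RBM (an assumed textbook setup, not the slides' derivation): one Gibbs step replaces running the chain to equilibrium, with positive statistics from the data and negative statistics from the one-step reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 4, 0.05
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)          # visible / hidden biases

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    ph0 = sigmoid(v0 @ W + c)                    # P(h=1 | v0)
    h0 = (rng.random(n_hid) < ph0).astype(float) # sample hidden units
    pv1 = sigmoid(h0 @ W.T + b)                  # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + c)
    # contrastive divergence update: data term minus reconstruction term
    return np.outer(v0, ph0) - np.outer(pv1, ph1), v0 - pv1, ph0 - ph1

data = (rng.random((200, n_vis)) < 0.3).astype(float)   # toy binary data
for epoch in range(20):
    for v in data:
        dW, db, dc = cd1_step(v)
        W += lr * dW; b += lr * db; c += lr * dc
```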

Two Manifolds

Geometry of CDn (contrastive divergence)

Bernoulli‐Gaussian RBM

ICA (R. Karakida)

Simulation

(Figure: uniformly distributed sources p(s), mixing, input, and the ICA solution output.)

The number of Neurons: N = M = 2,  σ = 1/2

Independent sources are extracted in G‐B RBM


Information Geometry of Neuromanifolds

Natural Gradient and Singularities

Shun-ichi Amari, RIKEN Brain Science Institute

Multilayer Perceptron

Mathematical Neurons

$y = \varphi\Big(\sum_i w_i x_i - h\Big) = \varphi(\mathbf{w}\cdot\mathbf{x})$

(input $\mathbf{x}$, output $y$, activation function $\varphi(u)$)

Multilayer Perceptrons

$y = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x}) + n$

$p(y \mid \mathbf{x}; \boldsymbol\theta) = c\,\exp\Big\{-\frac{1}{2}\big(y - f(\mathbf{x},\boldsymbol\theta)\big)^2\Big\}, \qquad f(\mathbf{x},\boldsymbol\theta) = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x})$

$\mathbf{x} = (x_1, x_2, \ldots, x_n), \qquad \boldsymbol\theta = (\mathbf{w}_1, \ldots, \mathbf{w}_m; v_1, \ldots, v_m)$

Multilayer Perceptron: Neuromanifold

$y = f(\mathbf{x}, \boldsymbol\theta) = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x}), \qquad \boldsymbol\theta = (\mathbf{w}_1, \ldots, \mathbf{w}_m; v_1, \ldots, v_m)$

neuromanifold: the space of functions $\{f(\mathbf{x}, \boldsymbol\theta)\}$

Universal Function Approximator

$f(\mathbf{x}) = \sum_{i=1}^{m} v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x})$ approximates any reasonable function arbitrarily well as the number of hidden units $m$ increases.

Learning from examples

examples $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$

learning: estimation of $\hat f(\mathbf{x})$

Backpropagation --- gradient learning

examples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_t, y_t)$: training set

$E(\mathbf{x}, y; \boldsymbol\theta) = \frac{1}{2}\big\{y - f(\mathbf{x}, \boldsymbol\theta)\big\}^2 = -\log p(y \mid \mathbf{x}; \boldsymbol\theta) + \mathrm{const}, \qquad y = f(\mathbf{x}, \boldsymbol\theta) + n$

$\Delta\boldsymbol\theta_t = -\eta\,\dfrac{\partial E(\mathbf{x}_t, y_t; \boldsymbol\theta_t)}{\partial\boldsymbol\theta}, \qquad f(\mathbf{x}, \boldsymbol\theta) = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x})$

Multilayer Perceptron: Neuromanifold, Metric and Topology

$y = f(\mathbf{x}, \boldsymbol\theta) = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x}), \qquad \boldsymbol\theta = (\mathbf{w}_1, \ldots, \mathbf{w}_m; v_1, \ldots, v_m)$

neuromanifold: the space of functions $\{f(\mathbf{x}, \boldsymbol\theta)\}$

Metric: Riemannian manifold

$ds^2 = \sum_{i,j} g_{ij}\, d\theta^i\, d\theta^j = d\boldsymbol\theta^{\mathsf T} G\, d\boldsymbol\theta$

$g_{ij}(\boldsymbol\theta) = E\left[\dfrac{\partial \log p(y \mid x;\boldsymbol\theta)}{\partial\theta^i}\,\dfrac{\partial \log p(y \mid x;\boldsymbol\theta)}{\partial\theta^j}\right]$ (Fisher information)

Topology: Neuromanifold

• Metrical structure

• Topological structure

singularities

Geometry of singular model

$y = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x}) + n$

singular when $v_i = 0$ or $|\mathbf{w}_i| = 0$

Parameter Space $S$

Equivalence:
1) $v_i = 0$ (then $\mathbf{w}_i$ is arbitrary)
2) $\mathbf{w}_i = \mathbf{w}_j$ (then only $v_i + v_j$ matters)

$M = S/\!\sim$

Topological Singularities

$S$: parameter space

$M$: behavioral space

$M = S/\!\sim$

2 hidden units:

$y = v_1\,\varphi(\mathbf{w}_1\cdot\mathbf{x}) + v_2\,\varphi(\mathbf{w}_2\cdot\mathbf{x}) + n$

singular set: $\mathbf{w}_1 = \mathbf{w}_2$, or $v_1 = 0$, or $v_2 = 0$

Gaussian mixtures

$p(x) = \sum_i v_i\,\dfrac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{1}{2}(x - w_i)^2\Big\}$

Two-component Gaussian mixture:

$p(x; v, w_1, w_2) = v\,\varphi(x - w_1) + (1 - v)\,\varphi(x - w_2), \qquad \varphi(x) = \dfrac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{x^2}{2}\Big\}$

singular: $w_1 = w_2$, or $v = 0$, or $v = 1$

Learning, Estimation, and Model Selection

$E_{\mathrm{gen}} = D\big[p_0(y \mid \mathbf{x}) : p(y \mid \mathbf{x}; \hat{\boldsymbol\theta})\big], \qquad E_{\mathrm{train}} = E_{\mathrm{emp}}$

For a regular model of dimension $d$:

$E_{\mathrm{gen}} = \dfrac{d}{2n}, \qquad E_{\mathrm{gen}} - E_{\mathrm{train}} = \dfrac{d}{n}$

Problem of Backprop

• slow convergence --- plateau --- saddle
• local minima

$\Delta\boldsymbol\theta_t = -\eta\,\nabla l(\mathbf{x}_t, y_t; \boldsymbol\theta_t)$

Flaws of MLP: slow convergence (plateaus), local minima

Boosting, Bagging, SVM

(Figure: error curve during training.)

Steepest Direction --- Natural Gradient

$\tilde\nabla l(\boldsymbol\theta) = G^{-1}(\boldsymbol\theta)\,\nabla l(\boldsymbol\theta)$

$|d\boldsymbol\theta|^2 = \sum_{i,j} g_{ij}\, d\theta^i\, d\theta^j = d\boldsymbol\theta^{\mathsf T} G\, d\boldsymbol\theta$

$\Delta\boldsymbol\theta_t = -\eta_t\,\tilde\nabla l(\mathbf{x}_t, y_t; \boldsymbol\theta_t)$

Natural Gradient

maximize $dl = \nabla l\cdot d\boldsymbol\theta$ under $|d\boldsymbol\theta|^2 = \sum g_{ij}\, d\theta^i\, d\theta^j = \varepsilon^2$

$\Rightarrow\quad d\boldsymbol\theta \propto G^{-1}\nabla l, \qquad \Delta\boldsymbol\theta_t = -\eta_t\, G^{-1}\nabla l(\mathbf{x}_t, y_t; \boldsymbol\theta_t)$

Information Geometry of MLP --- Natural Gradient Learning

S. Amari; H.Y. Park

$\Delta\boldsymbol\theta_t = -\eta_t\,\hat G_t^{-1}\,\nabla l_t$

Adaptive natural gradient learning: the inverse Fisher matrix is estimated on-line,

$\hat G_{t+1}^{-1} = (1 + \varepsilon_t)\,\hat G_t^{-1} - \varepsilon_t\,\hat G_t^{-1}\,\nabla f\,(\nabla f)^{\mathsf T}\,\hat G_t^{-1}$
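A minimal sketch of adaptive natural-gradient learning for a tiny one-hidden-layer regression MLP; the toy teacher signal and all hyperparameters are assumptions, not the slides' experiments. The inverse Fisher estimate is updated on-line with the rank-one rule above and the parameters move along $-\hat G^{-1}\nabla E$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 2                                   # hidden units, inputs
theta = 0.1 * rng.normal(size=m * n + m)      # (w_1..w_m, v_1..v_m), flattened

def unpack(th):
    return th[:m * n].reshape(m, n), th[m * n:]

def f_and_grad(th, x):
    W, v = unpack(th)
    h = np.tanh(W @ x)
    f = v @ h
    dv = h
    dW = (v * (1 - h**2))[:, None] * x[None, :]
    return f, np.concatenate([dW.ravel(), dv])

G_inv = np.eye(theta.size)
eps, lr = 0.01, 0.05
for t in range(2000):
    x = rng.normal(size=n)
    y = np.sin(x[0]) + 0.1 * rng.normal()     # toy teacher signal
    f, g = f_and_grad(theta, x)
    # adaptive update of the inverse Fisher matrix estimate
    Gg = G_inv @ g
    G_inv = (1 + eps) * G_inv - eps * np.outer(Gg, Gg)
    # natural-gradient step on the squared loss (grad loss = -(y - f) * g)
    theta += lr * (y - f) * (G_inv @ g)

X_test = rng.normal(size=(200, n))
preds = np.array([f_and_grad(theta, x)[0] for x in X_test])
print("test MSE:", np.mean((np.sin(X_test[:, 0]) - preds) ** 2))
```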

Regular statistical model $M = \{p(\mathbf{x}, \boldsymbol\theta)\}$, $G$: Fisher information

$E\big[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^{\mathsf T}\big] = \dfrac{1}{n}\,G^{-1}$

$E\Big[KL\big(p_0(\mathbf{x}) : p(\mathbf{x}, \hat{\boldsymbol\theta})\big)\Big] = \dfrac{d}{2n}$

AIC, MDL

Landscape of error at a singularity: Milnor attractor

Dynamics of Learning

$\dfrac{d\boldsymbol\theta}{dt} = -\eta\,\nabla l \qquad\text{or}\qquad \dfrac{d\boldsymbol\theta}{dt} = -\eta\,G^{-1}\nabla l$

Near a singularity the averaged dynamics reduce to two variables $(u, z)$:

$\dfrac{du}{dt} = f(u, z), \qquad \dfrac{dz}{dt} = k(u, z), \qquad \dfrac{du}{dz} = \dfrac{f(u, z)}{k(u, z)}$

Coordinate Transformation

New coordinates $(u, z)$ are built from $(v_1, v_2, \mathbf{w}_1, \mathbf{w}_2)$ so that the singular region $\mathbf{w}_1 = \mathbf{w}_2$ (and the associated imbalance of the output weights) takes a simple form.

Dynamics of Learning: Natural gradient works!!

With the natural gradient, $\dfrac{d\boldsymbol\theta}{dt} = -\eta\,G^{-1}\nabla l$, the reduced $(u, z)$ dynamics escape the plateau near the singularity quickly.

Dynamic vector fields: General case (|z|<1 part stable)

Dynamic vector fields: General case (|z|>1 part stable )

Adaptive Natural Gradient works well

Signal Processing --- ICA: Independent Component Analysis

$\mathbf{x}_t = A\,\mathbf{s}_t$: recover $\mathbf{s}_t$ from the observed $\mathbf{x}_t$

sparse component analysis

positive matrix factorization

mixture and unmixing of independent signals

$x_i = \sum_j A_{ij}\, s_j, \qquad \mathbf{x} = A\,\mathbf{s}$

(Figure: sources $s_1, \ldots, s_n$ mixed into observations $x_1, \ldots, x_m$.)

Independent Component Analysis

$\mathbf{x} = A\,\mathbf{s}, \qquad \mathbf{y} = W\mathbf{x}, \qquad W = A^{-1}$

observations: $\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(t)$
recover: $\mathbf{s}(1), \mathbf{s}(2), \ldots, \mathbf{s}(t)$

Semiparametric Statistical Model

$p(\mathbf{x}; W, r) = |W|\; r(W\mathbf{x}), \qquad r(\mathbf{s}) = \prod_i r_i(s_i)$: unknown, $\qquad W = A^{-1}$

observations $\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(t)$

Natural Gradient

$\tilde\nabla l(W) = \dfrac{\partial l}{\partial W}\, W^{\mathsf T} W$

Space of Matrices: Lie group

$dX = dW\, W^{-1}$: non-holonomic basis; $W \mapsto W + dW$ corresponds to $I \mapsto I + dX$

$ds^2 = \mathrm{tr}\big(dX\, dX^{\mathsf T}\big) = \mathrm{tr}\big(dW\,W^{-1}\,(dW\,W^{-1})^{\mathsf T}\big)$

Information Geometry of ICA

natural gradient, estimating functions, stability, efficiency

$S = \{p(\mathbf{y})\}, \qquad I = \{q_1(y_1)\, q_2(y_2)\cdots q_n(y_n)\}, \qquad \{p(W\mathbf{x})\}$

$l(W) = KL\big[p(\mathbf{y}; W) : q(\mathbf{y})\big], \qquad \mathbf{y} = W\mathbf{x}$

Estimating Functions

$E_r\big[\mathbf{F}(\mathbf{y}, W)\big] = 0$ at $W = A^{-1}$, $\quad \neq 0$ otherwise

estimating equation: $\sum_t \mathbf{F}(\mathbf{y}_t, W) = 0, \qquad \mathbf{y}_t = W\mathbf{x}_t$

learning: $\Delta W_t = -\eta_t\,\mathbf{F}(\mathbf{y}_t, W_t)$

Admissible class

canonical estimating function: $\mathbf{F}(\mathbf{y}, W) = \big\{I - \boldsymbol\varphi(\mathbf{y})\,\mathbf{y}^{\mathsf T}\big\}\,W$

admissible class: $\tilde{\mathbf{F}}(\mathbf{y}, W) = R(W)\,\mathbf{F}(\mathbf{y}, W)$

estimating equation: $\sum_t \mathbf{F}(\mathbf{y}_t, W) = 0, \qquad \mathbf{y}_t = W\mathbf{x}_t$

on-line learning: $\Delta W_t = -\eta_t\, R(W_t)\,\mathbf{F}(\mathbf{y}_t, W_t)$
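A minimal sketch of on-line natural-gradient ICA with the canonical estimating function $\{I - \boldsymbol\varphi(\mathbf{y})\mathbf{y}^{\mathsf T}\}W$; the Laplace sources, $\boldsymbol\varphi(\mathbf{y}) = \tanh(\mathbf{y})$ and the learning rate are illustrative assumptions, not the slides' experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 20000, 3
S = rng.laplace(size=(T, n))            # independent super-Gaussian sources
A = rng.normal(size=(n, n))             # unknown mixing matrix
X = S @ A.T                             # observations x_t = A s_t

W = np.eye(n)
eta = 0.005
for x in X:
    y = W @ x
    # natural-gradient / estimating-function update: dW = eta (I - phi(y) y^T) W
    W += eta * (np.eye(n) - np.outer(np.tanh(y), y)) @ W

# W A should be close to a scaled permutation matrix if separation succeeded
print(np.round(W @ A, 2))
```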

Basis Given: overcomplete case --- Sparse Solution

$\mathbf{x} = \sum_i \hat s_i\,\mathbf{a}_i = A\,\hat{\mathbf{s}}$: many solutions; we want many $\hat s_i = 0$

generalized inverse ($L_2$-norm): $\min \sum_i \hat s_i^{\,2}$

sparse solution ($L_1$-norm): $\min \sum_i |\hat s_i|$

Minkovskian Gradient

$\min\ \varphi(\boldsymbol\beta)$: convex function, under the constraint $F(\boldsymbol\beta) \leq c$

typical case:

$\varphi(\boldsymbol\beta) = \frac{1}{2}\|\mathbf{y} - X\boldsymbol\beta\|^2 = \frac{1}{2}(\boldsymbol\beta - \boldsymbol\beta^*)^{\mathsf T} G\,(\boldsymbol\beta - \boldsymbol\beta^*) + \mathrm{const}, \qquad F_p(\boldsymbol\beta) = \frac{1}{p}\sum_i |\beta_i|^p$

$p = 2,\ 1,\ 1/2$

Optimization under Sparsity Condition

$\mathbf{y} = X\boldsymbol\beta + \mathbf{n}, \qquad \boldsymbol\beta$: $k$-sparse, $\qquad m \geq 2k\log n$

$y_t = \sum_i \beta_i\, x_{ti} + n_t, \qquad t = 1, 2, \ldots, N$

$X = (\mathbf{x}_1, \mathbf{x}_2, \ldots), \qquad \mathbf{y} = X\boldsymbol\beta + \mathbf{n}$

Linear regression

Overcomplete: $N < k$, under-determined, infinitely many solutions

Sparsity constraint solves the problem

$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta}\ \frac{1}{2}\sum_t\big(y_t - \mathbf{x}_t\cdot\boldsymbol\beta\big)^2$

$\sum_t\big(y_t - \mathbf{x}_t\cdot\boldsymbol\beta\big)\,\mathbf{x}_t = 0$

$G\,\hat{\boldsymbol\beta} = X^{\mathsf T}\mathbf{y}, \qquad G = X^{\mathsf T}X$

$\hat{\boldsymbol\beta} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}\mathbf{y} = X^{\dagger}\mathbf{y}$

Maximum likelihood estimator

Euclidean case (Gaussian noise):

Not sparse

Sparse Solution

$\min_{\boldsymbol\beta}\ \varphi(\boldsymbol\beta)$ under $F(\boldsymbol\beta) \leq c, \qquad \varphi(\boldsymbol\beta) = \frac{1}{2}(\boldsymbol\beta - \boldsymbol\beta^*)^{\mathsf T} G\,(\boldsymbol\beta - \boldsymbol\beta^*)$

penalty $F$:

$F_0(\boldsymbol\beta) = \#\{i : \beta_i \neq 0\}$ ($L_0$): sparsest solution
$F_1(\boldsymbol\beta) = \sum_i |\beta_i|$ ($L_1$): sparse solution
$F_p(\boldsymbol\beta) = \sum_i |\beta_i|^p, \quad 0 < p < 1$
$F_2(\boldsymbol\beta) = \sum_i \beta_i^2$ ($L_2$): generalized inverse solution (not sparse)

Sparse solution: overcomplete case

$\min_{\boldsymbol\beta}\ D(\boldsymbol\beta : \boldsymbol\beta^*)$ under $F(\boldsymbol\beta) = c$: orthogonal projection of $\boldsymbol\beta^*$ onto the constraint surface along a dual geodesic.

Fig. 1: constraint regions $F_p(\boldsymbol\beta) \leq c$ in $\mathbb{R}^2$ for a) $p = 2$, b) $p = 1$, c) $p < 1$ (non-convex).

L1-constrained optimization: LASSO, LARS

$P_c$ problem: $\min\ \varphi(\boldsymbol\beta)$ under $F(\boldsymbol\beta) \leq c$, solution $\boldsymbol\beta^*_c$

$P_\lambda$ problem: $\min\ \varphi(\boldsymbol\beta) + \lambda\,F(\boldsymbol\beta)$, solution $\boldsymbol\beta^*_\lambda$

For $p \geq 1$ the solutions $\boldsymbol\beta^*_c$ and $\boldsymbol\beta^*_\lambda$ coincide under the correspondence $\lambda = \lambda(c)$.

For $p < 1$ the correspondence $\lambda = \lambda(c)$ may be multiple-valued and non-continuous, and the stability of the two problems differs.

Projection from $\boldsymbol\beta^*$ onto $F(\boldsymbol\beta) = c$ (information geometry).

Fig. 5: subgradient of $F$ on the constraint set $F(\boldsymbol\beta) = c$.

LASSO path and LARS path (stagewise solution)

$\min\ \varphi(\boldsymbol\beta)$ under $F(\boldsymbol\beta) \leq c$ and $\min\ \varphi(\boldsymbol\beta) + \lambda\,F(\boldsymbol\beta)$: the solution paths are related through the $c \leftrightarrow \lambda$ correspondence.

Active set and gradient

$A = \{i : \beta_i \neq 0\}$: active set

$\partial_i F_p(\boldsymbol\beta) = \mathrm{sgn}(\beta_i)\,|\beta_i|^{p-1}$ for $i \in A$; for $p = 1$ and $i \notin A$ the subgradient lies in $[-1, 1]$.

Solution path

As $c$ varies, the solution $\boldsymbol\beta_c$ follows a differential equation in $c$ governed by $K_c = G + \lambda_c\,\nabla\nabla F(\boldsymbol\beta_c)$ on the active set, with turning points where the active set $A$ changes; for $L_1$, $\nabla F = \big(\mathrm{sgn}\,\beta_i\big)$ on the active set.

Solution path in the subspace of the active set

Within the subspace of active coordinates the path moves in the active direction determined by $K_A$ and $\nabla F_A$; a turning point occurs when a coordinate enters or leaves the active set.

Gradient Descent Method

$\{\partial_i L(\boldsymbol\theta)\}$: covariant; $\{g^{ij}\,\partial_j L(\boldsymbol\theta)\}$: contravariant

$\min\ L(\boldsymbol\theta + \mathbf{a})$ under $\|\mathbf{a}\|^2 = \varepsilon^2$: $\qquad \boldsymbol\theta_{t+1} = \boldsymbol\theta_t - c_t\,\tilde\nabla L(\boldsymbol\theta_t)$

Riemannian metric: $\|\mathbf{a}\|_G^2 = \sum_{i,j} g_{ij}\,a^i a^j$

Minkovskian metric ($L_p$-norm): $\|\mathbf{a}\|_p^p = \sum_i |a_i|^p$

Steepest direction of $L$ under each metric:

Under the $L_p$-norm, with $f_i = \partial_i L$,

$a_i = c\,\mathrm{sgn}(f_i)\,|f_i|^{\frac{1}{p-1}}$ (Minkovskian gradient);

under the Riemannian metric, $\mathbf{a} = c\,G^{-1}\mathbf{f}$ (natural gradient); in the Euclidean case, $a_i = c\,f_i$.

As $p \to 1$ the steepest direction concentrates on the largest gradient component:

$F_i = \mathrm{sgn}(f_i)$ for $i$ with $|f_i| = \max_j |f_j|$, $\quad F_i = 0$ otherwise.

$\boldsymbol\beta_{t+1} = \boldsymbol\beta_t - \eta\,\mathbf{F}$: LASSO. Try for various $p$, $p \to 1$; try for various noise functions. LASSO and flat geometry.

Extended LARS (p = 1) and Minkovskian gradient

Steepest direction under the $L_1$-norm $\|\mathbf{a}\|_1 = \sum_i |a_i|$: maximize $\mathbf{f}\cdot\mathbf{a}$ under $\|\mathbf{a}\|_1 = 1$, which gives

$a_i = \mathrm{sgn}(f_i)$ for $i \in \arg\max_j |f_j|$, $\quad a_i = 0$ otherwise.

$\boldsymbol\beta_{t+1} = \boldsymbol\beta_t - \eta\,\mathbf{F}$: LARS (see the sketch below).
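A minimal sketch of the $p \to 1$ Minkovskian-gradient idea, in the spirit of LARS / stagewise regression; the toy regression problem and step size are illustrative assumptions, not the slides' algorithm details. Only the coordinate with the largest gradient magnitude moves at each step, which greedily builds a sparse coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 30, 100                              # few observations, many coefficients
X = rng.normal(size=(N, k))
beta_true = np.zeros(k); beta_true[[3, 17, 42]] = [2.0, -1.5, 1.0]   # sparse truth
y = X @ beta_true + 0.01 * rng.normal(size=N)

beta = np.zeros(k)
eta = 0.01
for _ in range(4000):
    f = -X.T @ (y - X @ beta)               # gradient of 0.5 * ||y - X beta||^2
    i = np.argmax(np.abs(f))                # steepest coordinate under the L1 norm
    beta[i] -= eta * np.sign(f[i])          # move only that coordinate
print(np.nonzero(np.round(beta, 1))[0])     # mostly the true support {3, 17, 42}
```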

c-trajectory and λ-trajectory

Example (one-dimensional): $\varphi(\beta) = \frac{1}{2}(\beta - \beta^*)^2$

L1/2 constraint: non-convex optimization

$P_c$: $\min\ \varphi(\boldsymbol\beta)$ under $\sum_i |\beta_i|^{1/2} \leq c$

$P_\lambda$: $\min\ \varphi(\boldsymbol\beta) + \lambda\sum_i |\beta_i|^{1/2}$

$\hat{\boldsymbol\beta} = R(\cdot)$: Xu Zongben's (half-thresholding) operator

An Example of the greedy path 

(Figure: greedy solution path in the $(\beta_1, \beta_2)$-plane.)

LASSO and LARS:

$p > 1$: $\boldsymbol\beta_c$ is non-sparse; $\quad p \leq 1$: sparse

Minkovskian gradient

The constrained solution $\boldsymbol\beta_c$ traces a path as $c$ increases, governed by $\varphi$ (through $G$) and by the penalty $F$ (through its gradient and Hessian on the active set); for $p < 1$ the path can jump between branches.

Solution Path:

not continuous, not monotone; jumps as $c$ varies

Linear Programming --- inner method

$\max\ \sum_i c_i x_i$ under $\sum_j A_{ij} x_j \leq b_i, \quad x_i \geq 0$

log-barrier: $\varphi(\mathbf{x}) = -\sum_i \log\Big(b_i - \sum_j A_{ij} x_j\Big) - \sum_i \log x_i$

Convex Cone Programming

P: positive semi-definite matrices

convex potential (barrier) function

dual geodesic approach

$A\mathbf{x} = \mathbf{b}, \qquad \min\ \mathbf{c}\cdot\mathbf{x}$

Support vector machine

Convex Programming --- Inner Method

LP: $A\mathbf{x} \leq \mathbf{b}, \qquad \min\ \mathbf{c}\cdot\mathbf{x}$

barrier: $-\sum_i \log\Big(b_i - \sum_j A_{ij} x_j\Big) - \sum_i \log x_i$

Simplex method ; inner method

Barrier function

Polynomial-Time Algorithm

The curvature of the central path governs the admissible step size (polynomial-time bound).

$\min\ t\,\mathbf{c}\cdot\mathbf{x} + \varphi(\mathbf{x})$: the central path $\mathbf{x}(t)$ is followed like a geodesic (see the sketch below).
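A minimal sketch of the inner (barrier) method on a toy LP; the problem data, Newton inner loop and schedule for $t$ are illustrative assumptions. It minimizes $t\,\mathbf{c}\cdot\mathbf{x} + \varphi(\mathbf{x})$ with the log-barrier $\varphi$ for increasing $t$, following the central path toward the LP optimum.

```python
import numpy as np

c = np.array([-1.0, -2.0])                 # minimize c.x
A = np.array([[1.0, 1.0], [1.0, 3.0]])
b = np.array([4.0, 6.0])

def phi(x, t):
    slack = b - A @ x
    if np.any(slack <= 0) or np.any(x <= 0):
        return np.inf                      # outside the feasible region
    return t * (c @ x) - np.log(slack).sum() - np.log(x).sum()

def newton_direction(x, t):
    slack = b - A @ x
    g = t * c + A.T @ (1.0 / slack) - 1.0 / x
    H = A.T @ np.diag(1.0 / slack**2) @ A + np.diag(1.0 / x**2)
    return np.linalg.solve(H, g)

x = np.array([0.5, 0.5])                   # strictly feasible starting point
t = 1.0
for _ in range(12):                        # outer loop: raise t along the central path
    for _ in range(20):                    # inner loop: damped Newton on the barrier
        dx = newton_direction(x, t)
        step = 1.0
        while phi(x - step * dx, t) > phi(x, t):
            step *= 0.5                    # backtrack: stay feasible and decrease
        x = x - step * dx
    t *= 2.0
print(x, c @ x)                            # approaches the LP optimum (3, 1), value -5
```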

Integration of evidences:

arithmetic mean, geometric mean, harmonic mean, α-mean

$x_1, x_2, \ldots, x_m$

Generalized mean: f-mean. $f(u)$: monotone; the f-representation of $u$.

$m_f(a, b) = f^{-1}\Big\{\dfrac{f(a) + f(b)}{2}\Big\}$

scale-free: $m_f(ca, cb) = c\,m_f(a, b)$, which requires

$f(u) = u^{\frac{1-\alpha}{2}} \quad (\alpha \neq 1), \qquad f(u) = \log u \quad (\alpha = 1)$

α-representation

α-mean $m_\alpha(a, b)$:

$\alpha = -1$: $\dfrac{a + b}{2}$ (arithmetic)
$\alpha = 1$: $\sqrt{ab}$ (geometric)
$\alpha = 3$: $\dfrac{2ab}{a + b}$ (harmonic)
$\alpha = 0$: $\dfrac{1}{4}\big(\sqrt{a} + \sqrt{b}\big)^2$
$\alpha \to +\infty$: $\min(a, b)$; $\alpha \to -\infty$: $\max(a, b)$

applied to distributions: $m_\alpha\big(p_1(s), p_2(s)\big)$
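A minimal sketch, not from the slides, of the α-mean of two positive values via the α-representation $f_\alpha(u) = u^{(1-\alpha)/2}$ (and $\log u$ for $\alpha = 1$): it recovers the familiar means for particular α and interpolates between min and max as α varies.

```python
import numpy as np

def alpha_mean(a, b, alpha):
    if alpha == 1.0:                          # limiting case: geometric mean
        return np.sqrt(a * b)
    r = (1.0 - alpha) / 2.0
    return ((a**r + b**r) / 2.0) ** (1.0 / r)

a, b = 1.0, 4.0
for alpha in [-1.0, 0.0, 1.0, 3.0]:
    print(alpha, alpha_mean(a, b, alpha))
# -1.0 -> 2.5 (arithmetic), 0.0 -> 2.25, 1.0 -> 2.0 (geometric), 3.0 -> 1.6 (harmonic)
```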

Various Means

$\dfrac{a + b}{2}$: arithmetic $\qquad \sqrt{ab}$: geometric $\qquad \dfrac{2}{\frac{1}{a} + \frac{1}{b}}$: harmonic

Any other mean?

Family of Distributions $p_1(s), \ldots, p_k(s)$

mixture family: $p_{\mathrm{mix}}(s) = \sum_{i=1}^{k} t_i\, p_i(s), \qquad \sum_i t_i = 1$

exponential family: $\log p_{\exp}(s) = \sum_i t_i \log p_i(s) - \psi$

f-family: $p(x; \mathbf{t}) = f^{-1}\Big\{\sum_i t_i\, f\big(p_i(x)\big)\Big\}$

α-geodesic projection

robust estimator