
Applications of Information Geometry

• Clustering based on Divergence: Cluster Center
• Stochastic Reasoning: Belief Propagation in Graphical Models
• Support Vector Machine
• Bayesian Framework and Restricted Boltzmann Machine
• Natural Gradient in Multilayer Perceptron Learning
• Independent Component Analysis
• Sparse Signal Analysis and Minkovskian Gradient
• Convex Optimization

Clustering of Points   $D = \{x_1, x_2, x_3, \ldots, x_N\}$

Clustering: divide $D$ into clusters $C_1, \ldots, C_m$; each point is assigned to the cluster whose center is closest, $\arg\min_i D[x : \theta_i^*]$.

Center of a cluster $C$:

$\theta_C^* = \arg\min_\theta \sum_{x_i \in C} D[x_i : \theta]$

For a Bregman divergence, the center is simply the arithmetic average in the dual coordinate system.

k‐means: clustering algorithm

For a Bregman divergence

$D[\theta_i : \theta^*] = \psi(\theta_i) - \psi(\theta^*) - \nabla\psi(\theta^*)\cdot(\theta_i - \theta^*),$

setting $\dfrac{\partial}{\partial\theta^*}\sum_i D[\theta_i : \theta^*] = 0$ gives

$\eta^* = \dfrac{1}{m}\sum_i \eta_i,$

the arithmetic mean in the dual coordinates $\eta = \nabla\psi(\theta)$.

k‐means:  clustering algorithm

1. Choose $k$ cluster centers.
2. Classify the patterns of $D$ into clusters (nearest center under the divergence).
3. Calculate new cluster centers (dual-coordinate averages) and repeat; see the sketch below.
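A minimal sketch of this loop, not from the slides: it assumes the generalized KL divergence $D(x : c) = \sum_j (x_j\log(x_j/c_j) - x_j + c_j)$ on positive data as the Bregman divergence, in which case the cluster center is the plain arithmetic mean, i.e. the average in the dual coordinates of $\psi(x) = \sum_j (x_j\log x_j - x_j)$.

```python
# Bregman k-means sketch (generalized KL divergence is an illustrative assumption).
import numpy as np

def gkl(x, c):
    # generalized KL (Bregman) divergence between points x and a center c
    return np.sum(x * np.log(x / c) - x + c, axis=-1)

def bregman_kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: classify each pattern to the nearest center under the divergence
        d = np.stack([gkl(X, c) for c in centers], axis=1)   # shape (N, k)
        labels = d.argmin(axis=1)
        # step 3: recompute each center as the cluster mean (dual-coordinate average)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

if __name__ == "__main__":
    X = np.abs(np.random.default_rng(1).normal(5.0, 1.0, size=(200, 2)))
    centers, labels = bregman_kmeans(X, k=3)
    print(centers)
```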

Voronoi Diagram:  Boundaries of Clusters

$D[\theta : \theta_1^*] = D[\theta : \theta_2^*]$: the boundary between two clusters is the set of points equidistant (in divergence) from the two centers.

Total Bregman divergence (Vemuri)

$\mathrm{tBD}(p : q) = \dfrac{\psi(p) - \psi(q) - \nabla\psi(q)\cdot(p - q)}{\sqrt{1 + |\nabla\psi(q)|^2}}$

t-center of a cluster $C$: a weighted average (in the dual coordinates),

$c^* = \dfrac{\sum_i w_i\, p_i}{\sum_i w_i}, \qquad w_i = \dfrac{1}{\sqrt{1 + |\nabla\psi(p_i)|^2}}$
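A minimal sketch, assuming the squared-Euclidean case $\psi(x) = \frac{1}{2}|x|^2$ (so $\nabla\psi(x) = x$): the t-center is the weighted average with weights $w_i = 1/\sqrt{1 + |x_i|^2}$, and a distant outlier's pull stays bounded.

```python
import numpy as np

def t_center(points):
    # weights w_i = 1 / sqrt(1 + |x_i|^2): far-away points get small weights
    w = 1.0 / np.sqrt(1.0 + np.sum(points**2, axis=1))
    return (w[:, None] * points).sum(axis=0) / w.sum()

pts = np.array([[0.9, 1.1], [1.1, 0.9], [1.0, 1.0], [50.0, 50.0]])  # one outlier
print("t-center:", t_center(pts))   # stays near (1, 1)
print("mean    :", pts.mean(axis=0))  # dragged toward the outlier
```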

Conformal change of divergence

$\tilde D[p : q] = \sigma(q)\, D[p : q]$

$\tilde g_{ij}(p) = \sigma(p)\, g_{ij}(p)$

$\tilde T_{ijk} = \sigma\, T_{ijk} + s_k\, g_{ij} + s_j\, g_{ik} + s_i\, g_{jk}, \qquad s_i = \partial_i \log \sigma$

t-center is robust

Add an outlier $y$ to the data $\{x_1, \ldots, x_n\}$ and let $c^*$ be the t-center. The influence function $z(y)$ measures how much the contaminating point moves the estimator; the estimator is robust when $z(y)$ stays bounded as $|y| \to \infty$.

For the t-center, $z(y) \propto w(y)\,(y - c^*)$ with weight $w(y) = \dfrac{1}{\sqrt{1 + |\nabla\psi(y)|^2}}$, so the influence function is bounded: robust.

Euclidean case, $\psi(x) = \frac{1}{2}|x|^2$:

$w(y) = \dfrac{1}{\sqrt{1 + |y|^2}}, \qquad z(y) \propto \dfrac{y - c^*}{\sqrt{1 + |y|^2}}$ (bounded).

With the extra point, the t-center minimizes $\dfrac{1}{N + 1}\Big(\sum_i \mathrm{tBD}[x_i : c] + \mathrm{tBD}[y : c]\Big)$, and the outlier enters only through the bounded weight $w(y)$.

MPEG7 database
• Great intraclass variability and small interclass dissimilarity.

Shape representation

First clustering then retrieval

Other TBD applications

Diffusion tensor imaging (DTI) analysis [Vemuri]
• Interpolation
• Segmentation

Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari and Frank Nielsen, Total Bregman Divergence and its Applications to DTI Analysis, IEEE TMI, to appear

Information Geometry of Stochastic Reasoning

Belief Propagation in Graphical Model

• Shun-ichi Amari (RIKEN BSI)
• Shiro Ikeda (Inst. Statist. Math.)
• Toshiyuki Tanaka (Kyoto U.)

Stochastic Reasoning:  Graphical Model

$p(x, y, z, r, s)$

$q(x, y, z \mid r, s), \qquad x, y, z, \ldots = \pm 1$

Stochastic Reasoning: $q(x_1, x_2, x_3, \ldots \mid \text{observation})$

$\mathbf{X} = (x_1, x_2, x_3, \ldots), \qquad x_i = \pm 1$

$\hat{\mathbf{X}} = \arg\max\ q(x_1, x_2, x_3, \ldots)$: maximum likelihood

$\hat{x}_i = \mathrm{sgn}\, E[x_i]$: least bit-error-rate estimator

Mean Value Marginalization: projection to independent distributions

$q_0(\mathbf{x}) = q_1(x_1)\, q_2(x_2) \cdots q_n(x_n)$

$q_i(x_i) = \sum_{x_j,\ j \neq i} q(x_1, \ldots, x_n)$

$E_q[\mathbf{x}] = E_{q_0}[\mathbf{x}]$

$q(\mathbf{x}) = \exp\Big\{\sum_{i<j} w_{ij}\, x_i x_j + \sum_i h_i x_i - \psi\Big\}, \qquad x_i \in \{-1, +1\}$

More generally, for a graphical model with cliques $r = 1, \ldots, L$,

$q(\mathbf{x}) = \exp\Big\{\sum_{r} c_r(\mathbf{x}) + \sum_i h_i x_i - \psi\Big\}$

Boltzmann machine, spin glass, neural networks; Turbo codes, LDPC codes

cliques

Computationally Difficult

$E_q[\mathbf{x}] = \sum_{\mathbf{x}} \mathbf{x}\, q(\mathbf{x})$ and the marginals require summing over all $2^n$ configurations of $\mathbf{x}$.
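A minimal sketch, not from the slides, of exact mean-value marginalization for a small Boltzmann machine: the toy weights are assumptions, and the brute-force sum over all $2^n$ states is exactly what becomes infeasible for large $n$, motivating the approximations listed next.

```python
import itertools
import numpy as np

def exact_means(W, h):
    # E_q[x_i] for q(x) ∝ exp{ sum_{i<j} w_ij x_i x_j + sum_i h_i x_i }, x_i in {-1,+1}
    n = len(h)
    states = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
    energies = 0.5 * np.einsum('si,ij,sj->s', states, W, states) + states @ h
    p = np.exp(energies)
    p /= p.sum()                      # normalization (partition function)
    return p @ states

rng = np.random.default_rng(0)
n = 6
W = rng.normal(scale=0.3, size=(n, n)); W = np.triu(W, 1); W = W + W.T
h = rng.normal(scale=0.2, size=n)
print(exact_means(W, h))              # feasible for n = 6, hopeless for n = 100
```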

mean‐field approximation

belief propagation

tree propagation, CCCP (convex‐concave)

Information Geometry of Mean Field Approximation

• m-projection: $p^* = \arg\min_{p \in M_0} D[q : p]$
• e-projection: $p^* = \arg\min_{p \in M_0} D[p : q]$

$D[q : p] = \sum_{\mathbf{x}} q(\mathbf{x}) \log\dfrac{q(\mathbf{x})}{p(\mathbf{x})}$

$M_0 = \Big\{p_0(\mathbf{x}) = \prod_i p_i(x_i)\Big\}$: the submanifold of independent distributions; in general $q(\mathbf{x}) \notin M_0$.

The m-projection $p(\mathbf{x}, \bar{\boldsymbol\theta}^*)$ of $q$ onto $M_0$ is characterized by

$\sum_{\mathbf{x}} \mathbf{x}\,\big\{q(\mathbf{x}) - p(\mathbf{x}, \bar{\boldsymbol\theta}^*)\big\} = 0, \qquad\text{i.e.}\quad E_q[\mathbf{x}] = E_{p^*}[\mathbf{x}].$

m‐projection keeps the expectation of x

Information Geometry

$M_0 = \big\{p_0(\mathbf{x}, \boldsymbol\theta) = \exp\{\boldsymbol\theta\cdot\mathbf{x} - \psi_0\}\big\}$

$M_r = \big\{p_r(\mathbf{x}, \boldsymbol\zeta_r) = \exp\{c_r(\mathbf{x}) + \boldsymbol\zeta_r\cdot\mathbf{x} - \psi_r\}\big\}, \qquad r = 1, \ldots, L$

$q(\mathbf{x}) = \exp\Big\{\sum_r c_r(\mathbf{x}) + \mathbf{h}\cdot\mathbf{x} - \psi\Big\}$

Belief Propagation Algorithm

$p_r(\mathbf{x}, \boldsymbol\xi_r) = \exp\{c_r(\mathbf{x}) + \boldsymbol\xi_r\cdot\mathbf{x} - \psi_r\}$ treats clique $r$ exactly and receives the other cliques' messages through $\boldsymbol\xi_r$; $p_0(\mathbf{x}, \boldsymbol\theta) \in M_0$ carries the current belief. Writing $\pi_0$ for the m-projection onto $M_0$, one BP step is

$\boldsymbol\zeta_r^{t+1} = \pi_0 \circ p_r(\mathbf{x}, \boldsymbol\xi_r^{t}) - \boldsymbol\xi_r^{t}, \qquad \boldsymbol\theta^{t+1} = \sum_r \boldsymbol\zeta_r^{t+1}, \qquad \boldsymbol\xi_r^{t+1} = \boldsymbol\theta^{t+1} - \boldsymbol\zeta_r^{t+1}.$

$p_0(\mathbf{x}, \boldsymbol\theta^{t})$: belief for $q(\mathbf{x})$.

Equilibrium of BP: $(\boldsymbol\theta^*, \boldsymbol\xi_r^*)$

1) m-condition: $p_0(\mathbf{x}, \boldsymbol\theta^*)$ is the m-projection of every $p_r(\mathbf{x}, \boldsymbol\xi_r^*)$ onto $M_0$; equivalently, $p_0$ and all the $p_r$ lie in a common m-flat submanifold (they share the same expectation of $\mathbf{x}$).

2) e-condition: $\boldsymbol\theta^* = \dfrac{1}{L - 1}\sum_r \boldsymbol\xi_r^*$; equivalently, $q$, $p_0$ and the $p_r$ lie in a common e-flat submanifold.

Belief Propagation  e‐condition OK

CCCP m‐condition OK

e-flat submanifold: $E = \Big\{(\boldsymbol\theta; \boldsymbol\xi_1, \ldots, \boldsymbol\xi_L) : (L - 1)\,\boldsymbol\theta = \sum_r \boldsymbol\xi_r\Big\}$

m-condition: $\pi_0 \circ p_r(\mathbf{x}, \boldsymbol\xi_r) = p_0(\mathbf{x}, \boldsymbol\theta)$ for all $r$.

Convex-Concave Computational Procedure (CCCP): A. Yuille

Decompose the objective as $F(\boldsymbol\theta) = F_1(\boldsymbol\theta) - F_2(\boldsymbol\theta)$ with $F_1$, $F_2$ convex, and iterate

$\nabla F_1(\boldsymbol\theta^{t+1}) = \nabla F_2(\boldsymbol\theta^{t}).$

Elimination of double loops.
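A minimal sketch of the CCCP iteration on a one-dimensional toy objective (the functions below are illustrative assumptions, not the slides' free energy): each step solves $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^{t})$ in closed form and never increases $F = F_1 - F_2$.

```python
import numpy as np

F1 = lambda t: t**4 / 4.0          # convex part
F2 = lambda t: 2.0 * t**2          # the concave part of F is -F2 (F2 convex)
dF2 = lambda t: 4.0 * t

t = 3.0
for _ in range(30):
    # grad F1(t) = t^3, so solve t_new^3 = dF2(t_old) directly
    t = np.cbrt(dF2(t))
print(t, F1(t) - F2(t))            # converges to a stationary point of F (t = 2)
```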

Geometry of Support Vector Machine: modification of the kernel

Simple perceptron: decision surface

$f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b, \qquad d = \dfrac{|\mathbf{w}\cdot\mathbf{x} + b|}{|\mathbf{w}|}$

minimize $|\mathbf{w}|^2$ under the constraints $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \geq 1$

$L(\mathbf{w}, b, \boldsymbol\alpha) = \frac{1}{2}|\mathbf{w}|^2 - \sum_i \alpha_i\,\big\{y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1\big\}$

$\alpha_i > 0$: $(\mathbf{x}_i, y_i)$ is a support vector; otherwise $\alpha_i = 0$

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad f(\mathbf{x}) = \sum_i \alpha_i y_i\, \mathbf{x}_i\cdot\mathbf{x} + b$

Embedding in higher‐dimensional space

$\mathbf{x} \mapsto \mathbf{z} = \boldsymbol\varphi(\mathbf{x}), \qquad f(\mathbf{x}) = \mathbf{w}\cdot\boldsymbol\varphi(\mathbf{x}) + b$

$\mathbf{z}$: infinite-dimensional

Kernel trick: $K(\mathbf{x}, \mathbf{x}') = \boldsymbol\varphi(\mathbf{x})\cdot\boldsymbol\varphi(\mathbf{x}')$

$f(\mathbf{x}) = \sum_i \alpha_i y_i\, K(\mathbf{x}, \mathbf{x}_i) + b$

Riemannian metric induced by the kernel:

$ds^2 = |\mathbf{z}(\mathbf{x} + d\mathbf{x}) - \mathbf{z}(\mathbf{x})|^2 = \sum_{i,j} g_{ij}(\mathbf{x})\, dx^i dx^j$

$g_{ij}(\mathbf{x}) = \dfrac{\partial}{\partial x^i}\dfrac{\partial}{\partial x'^j} K(\mathbf{x}, \mathbf{x}')\Big|_{\mathbf{x}' = \mathbf{x}}$

Conformal Transformation of Kernel

$\tilde K(\mathbf{x}, \mathbf{x}') = \sigma(\mathbf{x})\,\sigma(\mathbf{x}')\, K(\mathbf{x}, \mathbf{x}'), \qquad \sigma(\mathbf{x}) = \exp\{-\kappa\, f(\mathbf{x})^2\}$

(e.g. with a Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp\{-\kappa\,|\mathbf{x} - \mathbf{x}'|^2\}$)

$\tilde g_{ij}(\mathbf{x}) = \sigma(\mathbf{x})^2\, g_{ij}(\mathbf{x}) + \partial_i\sigma(\mathbf{x})\,\partial_j\sigma(\mathbf{x})$

The conformal factor magnifies the induced Riemannian metric near the decision boundary $f(\mathbf{x}) = 0$.
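A minimal sketch of the two-pass idea above: train once, rescale the kernel conformally by $\sigma(\mathbf{x}) = \exp\{-\kappa f(\mathbf{x})^2\}$ using the first decision function, and retrain. The dataset, sklearn usage and all parameter values are illustrative assumptions, not the slides' exact recipe.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

def gaussian_kernel(A, B, gamma=2.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# first pass: ordinary Gaussian-kernel SVM
svm1 = SVC(kernel=gaussian_kernel).fit(X, y)
f = svm1.decision_function                 # f(x) from the first pass

def conformal_kernel(A, B, kappa=1.0):
    sA = np.exp(-kappa * f(A) ** 2)        # conformal factor sigma(x)
    sB = np.exp(-kappa * f(B) ** 2)
    return sA[:, None] * gaussian_kernel(A, B) * sB[None, :]

# second pass: retrain with the conformally transformed kernel
svm2 = SVC(kernel=conformal_kernel).fit(X, y)
print(svm1.score(X, y), svm2.score(X, y))
```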

Basic Principles of Unsupervised and Supervised Learning Toward Deep Learning

Shun-ichi Amari (RIKEN Brain Science Institute); collaborators: R. Karakida, M. Okada (U. Tokyo)

Deep Learning: Self-Organization + Supervised Learning

RBM: Restricted Boltzmann Machine; Auto-Encoder, Recurrent Net

Dropout, Contrastive divergence

Boltzmann Machine

RBM:  Restricted Boltzmann Machine

RBM

Simple Hebbian Self‐Organization

self‐organization of  

Self‐Organization

Interaction of Hidden Neurons

Bayesian Duality in Exponential Family

Data $x$   ↔   Parameter (higher-order concepts)

Curved exponential family

RBM: $\mathbf{x} = \mathbf{v}$ with $\boldsymbol\theta = W\mathbf{h}$; dually $\mathbf{x} = \mathbf{h}$ with $\boldsymbol\theta = \mathbf{v}W$

Gaussian Boltzmann Machine

Equilibrium Solution

Equilibrium Solution

General Solution 

You can choose $m$ ($\leq k$) eigenvalues to form a solution

• diagonalized by 

Stable Solution: the case of m = k

‐ No connections within layers  

Contrastive Divergence


• How to train an RBM: Maximum Likelihood (ML) learning is hard

• RBM: a 2-layered probabilistic neural network

Sampling

(Figure: Gibbs sampling chain from the input toward equilibrium.)

Many iterations of Gibbs sampling demand too much computational time.

(Figure: visible layer v and hidden layer h coupled by weights W.)

Contrastive Divergence Solution
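A minimal sketch of standard CD-1 training for a Bernoulli RBM (an assumed textbook setup, not the slides' derivation): one Gibbs step replaces running the chain to equilibrium, with positive statistics from the data and negative statistics from the one-step reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 4, 0.05
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)          # visible / hidden biases

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    ph0 = sigmoid(v0 @ W + c)                    # P(h=1 | v0)
    h0 = (rng.random(n_hid) < ph0).astype(float) # sample hidden units
    pv1 = sigmoid(h0 @ W.T + b)                  # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + c)
    # contrastive divergence update: data term minus reconstruction term
    return np.outer(v0, ph0) - np.outer(pv1, ph1), v0 - pv1, ph0 - ph1

data = (rng.random((200, n_vis)) < 0.3).astype(float)   # toy binary data
for epoch in range(20):
    for v in data:
        dW, db, dc = cd1_step(v)
        W += lr * dW; b += lr * db; c += lr * dc
```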

Two Manifolds

Geometry of CDn (contrastive divergence)

Bernoulli‐Gaussian RBM

ICA (R. Karakida)

Simulation

(Figure: uniformly distributed sources p(s), mixing, input, and the ICA solution output.)

The number of Neurons: N = M = 2,  σ = 1/2

Independent sources are extracted in G‐B RBM


Information Geometry of Neuromanifolds

Natural Gradient and Singularities

Shun-ichi Amari, RIKEN Brain Science Institute

Multilayer Perceptron

Mathematical Neurons

$y = \varphi\Big(\sum_i w_i x_i - h\Big) = \varphi(\mathbf{w}\cdot\mathbf{x})$

(input $\mathbf{x}$, output $y$, activation function $\varphi(u)$)

Multilayer Perceptrons

$y = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x}) + n$

$p(y \mid \mathbf{x}; \boldsymbol\theta) = c\,\exp\Big\{-\frac{1}{2}\big(y - f(\mathbf{x},\boldsymbol\theta)\big)^2\Big\}, \qquad f(\mathbf{x},\boldsymbol\theta) = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x})$

$\mathbf{x} = (x_1, x_2, \ldots, x_n), \qquad \boldsymbol\theta = (\mathbf{w}_1, \ldots, \mathbf{w}_m; v_1, \ldots, v_m)$

Multilayer Perceptron: Neuromanifold

$y = f(\mathbf{x}, \boldsymbol\theta) = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x}), \qquad \boldsymbol\theta = (\mathbf{w}_1, \ldots, \mathbf{w}_m; v_1, \ldots, v_m)$

neuromanifold: the space of functions $\{f(\mathbf{x}, \boldsymbol\theta)\}$

Universal Function Approximator

$f(\mathbf{x}) = \sum_{i=1}^{m} v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x})$ approximates any reasonable function arbitrarily well as the number of hidden units $m$ increases.

Learning from examples

examples $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$

learning: estimation of $\hat f(\mathbf{x})$

Backpropagation --- gradient learning

examples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_t, y_t)$: training set

$E(\mathbf{x}, y; \boldsymbol\theta) = \frac{1}{2}\big\{y - f(\mathbf{x}, \boldsymbol\theta)\big\}^2 = -\log p(y \mid \mathbf{x}; \boldsymbol\theta) + \mathrm{const}, \qquad y = f(\mathbf{x}, \boldsymbol\theta) + n$

$\Delta\boldsymbol\theta_t = -\eta\,\dfrac{\partial E(\mathbf{x}_t, y_t; \boldsymbol\theta_t)}{\partial\boldsymbol\theta}, \qquad f(\mathbf{x}, \boldsymbol\theta) = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x})$

Multilayer Perceptron: Neuromanifold, Metric and Topology

$y = f(\mathbf{x}, \boldsymbol\theta) = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x}), \qquad \boldsymbol\theta = (\mathbf{w}_1, \ldots, \mathbf{w}_m; v_1, \ldots, v_m)$

neuromanifold: the space of functions $\{f(\mathbf{x}, \boldsymbol\theta)\}$

Metric: Riemannian manifold

$ds^2 = \sum_{i,j} g_{ij}\, d\theta^i\, d\theta^j = d\boldsymbol\theta^{\mathsf T} G\, d\boldsymbol\theta$

$g_{ij}(\boldsymbol\theta) = E\left[\dfrac{\partial \log p(y \mid x;\boldsymbol\theta)}{\partial\theta^i}\,\dfrac{\partial \log p(y \mid x;\boldsymbol\theta)}{\partial\theta^j}\right]$ (Fisher information)

Topology: Neuromanifold

• Metrical structure

• Topological structure

singularities

Geometry of singular model

$y = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x}) + n$

singular when $v_i = 0$ or $|\mathbf{w}_i| = 0$

Parameter Space $S$

Equivalence:
1) $v_i = 0$ (then $\mathbf{w}_i$ is arbitrary)
2) $\mathbf{w}_i = \mathbf{w}_j$ (then only $v_i + v_j$ matters)

$M = S/\!\sim$

Topological Singularities

$S$: parameter space

$M$: behavioral space

$M = S/\!\sim$

2 hidden units:

$y = v_1\,\varphi(\mathbf{w}_1\cdot\mathbf{x}) + v_2\,\varphi(\mathbf{w}_2\cdot\mathbf{x}) + n$

singular set: $\mathbf{w}_1 = \mathbf{w}_2$, or $v_1 = 0$, or $v_2 = 0$

Gaussian mixtures

$p(x) = \sum_i v_i\,\dfrac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{1}{2}(x - w_i)^2\Big\}$

Two-component Gaussian mixture:

$p(x; v, w_1, w_2) = v\,\varphi(x - w_1) + (1 - v)\,\varphi(x - w_2), \qquad \varphi(x) = \dfrac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{x^2}{2}\Big\}$

singular: $w_1 = w_2$, or $v = 0$, or $v = 1$

Learning, Estimation, and Model Selection

$E_{\mathrm{gen}} = D\big[p_0(y \mid \mathbf{x}) : p(y \mid \mathbf{x}; \hat{\boldsymbol\theta})\big], \qquad E_{\mathrm{train}} = E_{\mathrm{emp}}$

For a regular model of dimension $d$:

$E_{\mathrm{gen}} = \dfrac{d}{2n}, \qquad E_{\mathrm{gen}} - E_{\mathrm{train}} = \dfrac{d}{n}$

Problem of Backprop

• slow convergence --- plateau --- saddle
• local minima

$\Delta\boldsymbol\theta_t = -\eta\,\nabla l(\mathbf{x}_t, y_t; \boldsymbol\theta_t)$

Flaws of MLP: slow convergence (plateaus), local minima

Boosting, Bagging, SVM

(Figure: error curve during training.)

Steepest Direction --- Natural Gradient

$\tilde\nabla l(\boldsymbol\theta) = G^{-1}(\boldsymbol\theta)\,\nabla l(\boldsymbol\theta)$

$|d\boldsymbol\theta|^2 = \sum_{i,j} g_{ij}\, d\theta^i\, d\theta^j = d\boldsymbol\theta^{\mathsf T} G\, d\boldsymbol\theta$

$\Delta\boldsymbol\theta_t = -\eta_t\,\tilde\nabla l(\mathbf{x}_t, y_t; \boldsymbol\theta_t)$

Natural Gradient

maximize $dl = \nabla l\cdot d\boldsymbol\theta$ under $|d\boldsymbol\theta|^2 = \sum g_{ij}\, d\theta^i\, d\theta^j = \varepsilon^2$

$\Rightarrow\quad d\boldsymbol\theta \propto G^{-1}\nabla l, \qquad \Delta\boldsymbol\theta_t = -\eta_t\, G^{-1}\nabla l(\mathbf{x}_t, y_t; \boldsymbol\theta_t)$

Information Geometry of MLP --- Natural Gradient Learning

S. Amari; H.Y. Park

$\Delta\boldsymbol\theta_t = -\eta_t\,\hat G_t^{-1}\,\nabla l_t$

Adaptive natural gradient learning: the inverse Fisher matrix is estimated on-line,

$\hat G_{t+1}^{-1} = (1 + \varepsilon_t)\,\hat G_t^{-1} - \varepsilon_t\,\hat G_t^{-1}\,\nabla f\,(\nabla f)^{\mathsf T}\,\hat G_t^{-1}$
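A minimal sketch of adaptive natural-gradient learning for a tiny one-hidden-layer regression MLP; the toy teacher signal and all hyperparameters are assumptions, not the slides' experiments. The inverse Fisher estimate is updated on-line with the rank-one rule above and the parameters move along $-\hat G^{-1}\nabla E$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 2                                   # hidden units, inputs
theta = 0.1 * rng.normal(size=m * n + m)      # (w_1..w_m, v_1..v_m), flattened

def unpack(th):
    return th[:m * n].reshape(m, n), th[m * n:]

def f_and_grad(th, x):
    W, v = unpack(th)
    h = np.tanh(W @ x)
    f = v @ h
    dv = h
    dW = (v * (1 - h**2))[:, None] * x[None, :]
    return f, np.concatenate([dW.ravel(), dv])

G_inv = np.eye(theta.size)
eps, lr = 0.01, 0.05
for t in range(2000):
    x = rng.normal(size=n)
    y = np.sin(x[0]) + 0.1 * rng.normal()     # toy teacher signal
    f, g = f_and_grad(theta, x)
    # adaptive update of the inverse Fisher matrix estimate
    Gg = G_inv @ g
    G_inv = (1 + eps) * G_inv - eps * np.outer(Gg, Gg)
    # natural-gradient step on the squared loss (grad loss = -(y - f) * g)
    theta += lr * (y - f) * (G_inv @ g)

X_test = rng.normal(size=(200, n))
preds = np.array([f_and_grad(theta, x)[0] for x in X_test])
print("test MSE:", np.mean((np.sin(X_test[:, 0]) - preds) ** 2))
```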

Regular statistical model $M = \{p(\mathbf{x}, \boldsymbol\theta)\}$, $G$: Fisher information

$E\big[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^{\mathsf T}\big] = \dfrac{1}{n}\,G^{-1}$

$E\Big[KL\big(p_0(\mathbf{x}) : p(\mathbf{x}, \hat{\boldsymbol\theta})\big)\Big] = \dfrac{d}{2n}$

AIC, MDL

Landscape of error at a singularity: Milnor attractor

Dynamics of Learning

$\dfrac{d\boldsymbol\theta}{dt} = -\eta\,\nabla l \qquad\text{or}\qquad \dfrac{d\boldsymbol\theta}{dt} = -\eta\,G^{-1}\nabla l$

Near a singularity the averaged dynamics reduce to two variables $(u, z)$:

$\dfrac{du}{dt} = f(u, z), \qquad \dfrac{dz}{dt} = k(u, z), \qquad \dfrac{du}{dz} = \dfrac{f(u, z)}{k(u, z)}$

Coordinate Transformation

New coordinates $(u, z)$ are built from $(v_1, v_2, \mathbf{w}_1, \mathbf{w}_2)$ so that the singular region $\mathbf{w}_1 = \mathbf{w}_2$ (and the associated imbalance of the output weights) takes a simple form.

Dynamics of Learning: Natural gradient works!!

With the natural gradient, $\dfrac{d\boldsymbol\theta}{dt} = -\eta\,G^{-1}\nabla l$, the reduced $(u, z)$ dynamics escape the plateau near the singularity quickly.

Dynamic vector fields: General case (|z|<1 part stable)

Dynamic vector fields: General case (|z|>1 part stable )

Adaptive Natural Gradient works well

Signal Processing --- ICA: Independent Component Analysis

$\mathbf{x}_t = A\,\mathbf{s}_t$: recover $\mathbf{s}_t$ from the observed $\mathbf{x}_t$

sparse component analysis

positive matrix factorization

mixture and unmixing of independent signals

$x_i = \sum_j A_{ij}\, s_j, \qquad \mathbf{x} = A\,\mathbf{s}$

(Figure: sources $s_1, \ldots, s_n$ mixed into observations $x_1, \ldots, x_m$.)

Independent Component Analysis

$\mathbf{x} = A\,\mathbf{s}, \qquad \mathbf{y} = W\mathbf{x}, \qquad W = A^{-1}$

observations: $\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(t)$
recover: $\mathbf{s}(1), \mathbf{s}(2), \ldots, \mathbf{s}(t)$

Semiparametric Statistical Model

$p(\mathbf{x}; W, r) = |W|\; r(W\mathbf{x}), \qquad r(\mathbf{s}) = \prod_i r_i(s_i)$: unknown, $\qquad W = A^{-1}$

observations $\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(t)$

Natural Gradient

$\tilde\nabla l(W) = \dfrac{\partial l}{\partial W}\, W^{\mathsf T} W$

Space of Matrices: Lie group

$dX = dW\, W^{-1}$: non-holonomic basis; $W \mapsto W + dW$ corresponds to $I \mapsto I + dX$

$ds^2 = \mathrm{tr}\big(dX\, dX^{\mathsf T}\big) = \mathrm{tr}\big(dW\,W^{-1}\,(dW\,W^{-1})^{\mathsf T}\big)$

Information Geometry of ICA

natural gradient, estimating functions, stability, efficiency

$S = \{p(\mathbf{y})\}, \qquad I = \{q_1(y_1)\, q_2(y_2)\cdots q_n(y_n)\}, \qquad \{p(W\mathbf{x})\}$

$l(W) = KL\big[p(\mathbf{y}; W) : q(\mathbf{y})\big], \qquad \mathbf{y} = W\mathbf{x}$

Estimating Functions

$E_r\big[\mathbf{F}(\mathbf{y}, W)\big] = 0$ at $W = A^{-1}$, $\quad \neq 0$ otherwise

estimating equation: $\sum_t \mathbf{F}(\mathbf{y}_t, W) = 0, \qquad \mathbf{y}_t = W\mathbf{x}_t$

learning: $\Delta W_t = -\eta_t\,\mathbf{F}(\mathbf{y}_t, W_t)$

Admissible class

canonical estimating function: $\mathbf{F}(\mathbf{y}, W) = \big\{I - \boldsymbol\varphi(\mathbf{y})\,\mathbf{y}^{\mathsf T}\big\}\,W$

admissible class: $\tilde{\mathbf{F}}(\mathbf{y}, W) = R(W)\,\mathbf{F}(\mathbf{y}, W)$

estimating equation: $\sum_t \mathbf{F}(\mathbf{y}_t, W) = 0, \qquad \mathbf{y}_t = W\mathbf{x}_t$

on-line learning: $\Delta W_t = -\eta_t\, R(W_t)\,\mathbf{F}(\mathbf{y}_t, W_t)$
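A minimal sketch of on-line natural-gradient ICA with the canonical estimating function $\{I - \boldsymbol\varphi(\mathbf{y})\mathbf{y}^{\mathsf T}\}W$; the Laplace sources, $\boldsymbol\varphi(\mathbf{y}) = \tanh(\mathbf{y})$ and the learning rate are illustrative assumptions, not the slides' experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 20000, 3
S = rng.laplace(size=(T, n))            # independent super-Gaussian sources
A = rng.normal(size=(n, n))             # unknown mixing matrix
X = S @ A.T                             # observations x_t = A s_t

W = np.eye(n)
eta = 0.005
for x in X:
    y = W @ x
    # natural-gradient / estimating-function update: dW = eta (I - phi(y) y^T) W
    W += eta * (np.eye(n) - np.outer(np.tanh(y), y)) @ W

# W A should be close to a scaled permutation matrix if separation succeeded
print(np.round(W @ A, 2))
```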

Basis Given: overcomplete case --- Sparse Solution

$\mathbf{x} = \sum_i \hat s_i\,\mathbf{a}_i = A\,\hat{\mathbf{s}}$: many solutions; we want many $\hat s_i = 0$

generalized inverse ($L_2$-norm): $\min \sum_i \hat s_i^{\,2}$

sparse solution ($L_1$-norm): $\min \sum_i |\hat s_i|$

Minkovskian Gradient

$\min\ \varphi(\boldsymbol\beta)$: convex function, under the constraint $F(\boldsymbol\beta) \leq c$

typical case:

$\varphi(\boldsymbol\beta) = \frac{1}{2}\|\mathbf{y} - X\boldsymbol\beta\|^2 = \frac{1}{2}(\boldsymbol\beta - \boldsymbol\beta^*)^{\mathsf T} G\,(\boldsymbol\beta - \boldsymbol\beta^*) + \mathrm{const}, \qquad F_p(\boldsymbol\beta) = \frac{1}{p}\sum_i |\beta_i|^p$

$p = 2,\ 1,\ 1/2$

Optimization under Sparsity Condition

$\mathbf{y} = X\boldsymbol\beta + \mathbf{n}, \qquad \boldsymbol\beta$: $k$-sparse, $\qquad m \geq 2k\log n$

$y_t = \sum_i \beta_i\, x_{ti} + n_t, \qquad t = 1, 2, \ldots, N$

$X = (\mathbf{x}_1, \mathbf{x}_2, \ldots), \qquad \mathbf{y} = X\boldsymbol\beta + \mathbf{n}$

Linear regression

Overcomplete: $N < k$, under-determined, infinitely many solutions

Sparsity constraint solves the problem

$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta}\ \frac{1}{2}\sum_t\big(y_t - \mathbf{x}_t\cdot\boldsymbol\beta\big)^2$

$\sum_t\big(y_t - \mathbf{x}_t\cdot\boldsymbol\beta\big)\,\mathbf{x}_t = 0$

$G\,\hat{\boldsymbol\beta} = X^{\mathsf T}\mathbf{y}, \qquad G = X^{\mathsf T}X$

$\hat{\boldsymbol\beta} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}\mathbf{y} = X^{\dagger}\mathbf{y}$

Maximum likelihood estimator

Euclidean case (Gaussian noise):

Not sparse

Sparse Solution

$\min_{\boldsymbol\beta}\ \varphi(\boldsymbol\beta)$ under $F(\boldsymbol\beta) \leq c, \qquad \varphi(\boldsymbol\beta) = \frac{1}{2}(\boldsymbol\beta - \boldsymbol\beta^*)^{\mathsf T} G\,(\boldsymbol\beta - \boldsymbol\beta^*)$

penalty $F$:

$F_0(\boldsymbol\beta) = \#\{i : \beta_i \neq 0\}$ ($L_0$): sparsest solution
$F_1(\boldsymbol\beta) = \sum_i |\beta_i|$ ($L_1$): sparse solution
$F_p(\boldsymbol\beta) = \sum_i |\beta_i|^p, \quad 0 < p < 1$
$F_2(\boldsymbol\beta) = \sum_i \beta_i^2$ ($L_2$): generalized inverse solution (not sparse)

Sparse solution: overcomplete case

$\min_{\boldsymbol\beta}\ D(\boldsymbol\beta : \boldsymbol\beta^*)$ under $F(\boldsymbol\beta) = c$: orthogonal projection of $\boldsymbol\beta^*$ onto the constraint surface along a dual geodesic.

Fig. 1: constraint regions $F_p(\boldsymbol\beta) \leq c$ in $\mathbb{R}^2$ for a) $p = 2$, b) $p = 1$, c) $p < 1$ (non-convex).

L1-constrained optimization: LASSO, LARS

$P_c$ problem: $\min\ \varphi(\boldsymbol\beta)$ under $F(\boldsymbol\beta) \leq c$, solution $\boldsymbol\beta^*_c$

$P_\lambda$ problem: $\min\ \varphi(\boldsymbol\beta) + \lambda\,F(\boldsymbol\beta)$, solution $\boldsymbol\beta^*_\lambda$

For $p \geq 1$ the solutions $\boldsymbol\beta^*_c$ and $\boldsymbol\beta^*_\lambda$ coincide under the correspondence $\lambda = \lambda(c)$.

For $p < 1$ the correspondence $\lambda = \lambda(c)$ may be multiple-valued and non-continuous, and the stability of the two problems differs.

Projection from $\boldsymbol\beta^*$ onto $F(\boldsymbol\beta) = c$ (information geometry).

Fig. 5: subgradient of $F$ on the constraint set $F(\boldsymbol\beta) = c$.

LASSO path and LARS path (stagewise solution)

$\min\ \varphi(\boldsymbol\beta)$ under $F(\boldsymbol\beta) \leq c$ and $\min\ \varphi(\boldsymbol\beta) + \lambda\,F(\boldsymbol\beta)$: the solution paths are related through the $c \leftrightarrow \lambda$ correspondence.

Active set and gradient

$A = \{i : \beta_i \neq 0\}$: active set

$\partial_i F_p(\boldsymbol\beta) = \mathrm{sgn}(\beta_i)\,|\beta_i|^{p-1}$ for $i \in A$; for $p = 1$ and $i \notin A$ the subgradient lies in $[-1, 1]$.

Solution path

As $c$ varies, the solution $\boldsymbol\beta_c$ follows a differential equation in $c$ governed by $K_c = G + \lambda_c\,\nabla\nabla F(\boldsymbol\beta_c)$ on the active set, with turning points where the active set $A$ changes; for $L_1$, $\nabla F = \big(\mathrm{sgn}\,\beta_i\big)$ on the active set.

Solution path in the subspace of the active set

Within the subspace of active coordinates the path moves in the active direction determined by $K_A$ and $\nabla F_A$; a turning point occurs when a coordinate enters or leaves the active set.

Gradient Descent Method

$\{\partial_i L(\boldsymbol\theta)\}$: covariant; $\{g^{ij}\,\partial_j L(\boldsymbol\theta)\}$: contravariant

$\min\ L(\boldsymbol\theta + \mathbf{a})$ under $\|\mathbf{a}\|^2 = \varepsilon^2$: $\qquad \boldsymbol\theta_{t+1} = \boldsymbol\theta_t - c_t\,\tilde\nabla L(\boldsymbol\theta_t)$

Riemannian metric: $\|\mathbf{a}\|_G^2 = \sum_{i,j} g_{ij}\,a^i a^j$

Minkovskian metric ($L_p$-norm): $\|\mathbf{a}\|_p^p = \sum_i |a_i|^p$

Steepest direction of $L$ under each metric:

Under the $L_p$-norm, with $f_i = \partial_i L$,

$a_i = c\,\mathrm{sgn}(f_i)\,|f_i|^{\frac{1}{p-1}}$ (Minkovskian gradient);

under the Riemannian metric, $\mathbf{a} = c\,G^{-1}\mathbf{f}$ (natural gradient); in the Euclidean case, $a_i = c\,f_i$.

As $p \to 1$ the steepest direction concentrates on the largest gradient component:

$F_i = \mathrm{sgn}(f_i)$ for $i$ with $|f_i| = \max_j |f_j|$, $\quad F_i = 0$ otherwise.

$\boldsymbol\beta_{t+1} = \boldsymbol\beta_t - \eta\,\mathbf{F}$: LASSO. Try for various $p$, $p \to 1$; try for various noise functions. LASSO and flat geometry.

Extended LARS (p = 1) and Minkovskian gradient

Steepest direction under the $L_1$-norm $\|\mathbf{a}\|_1 = \sum_i |a_i|$: maximize $\mathbf{f}\cdot\mathbf{a}$ under $\|\mathbf{a}\|_1 = 1$, which gives

$a_i = \mathrm{sgn}(f_i)$ for $i \in \arg\max_j |f_j|$, $\quad a_i = 0$ otherwise.

$\boldsymbol\beta_{t+1} = \boldsymbol\beta_t - \eta\,\mathbf{F}$: LARS (see the sketch below).
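A minimal sketch of the $p \to 1$ Minkovskian-gradient idea, in the spirit of LARS / stagewise regression; the toy regression problem and step size are illustrative assumptions, not the slides' algorithm details. Only the coordinate with the largest gradient magnitude moves at each step, which greedily builds a sparse coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 30, 100                              # few observations, many coefficients
X = rng.normal(size=(N, k))
beta_true = np.zeros(k); beta_true[[3, 17, 42]] = [2.0, -1.5, 1.0]   # sparse truth
y = X @ beta_true + 0.01 * rng.normal(size=N)

beta = np.zeros(k)
eta = 0.01
for _ in range(4000):
    f = -X.T @ (y - X @ beta)               # gradient of 0.5 * ||y - X beta||^2
    i = np.argmax(np.abs(f))                # steepest coordinate under the L1 norm
    beta[i] -= eta * np.sign(f[i])          # move only that coordinate
print(np.nonzero(np.round(beta, 1))[0])     # mostly the true support {3, 17, 42}
```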

c-trajectory and λ-trajectory

Example (one-dimensional): $\varphi(\beta) = \frac{1}{2}(\beta - \beta^*)^2$

L1/2 constraint: non-convex optimization

$P_c$: $\min\ \varphi(\boldsymbol\beta)$ under $\sum_i |\beta_i|^{1/2} \leq c$

$P_\lambda$: $\min\ \varphi(\boldsymbol\beta) + \lambda\sum_i |\beta_i|^{1/2}$

$\hat{\boldsymbol\beta} = R(\cdot)$: Xu Zongben's (half-thresholding) operator

An Example of the greedy path 

(Figure: greedy solution path in the $(\beta_1, \beta_2)$-plane.)

LASSO and LARS:

$p > 1$: $\boldsymbol\beta_c$ is non-sparse; $\quad p \leq 1$: sparse

Minkovskian gradient

The constrained solution $\boldsymbol\beta_c$ traces a path as $c$ increases, governed by $\varphi$ (through $G$) and by the penalty $F$ (through its gradient and Hessian on the active set); for $p < 1$ the path can jump between branches.

Solution Path:

not continuous, not monotone; jumps as $c$ varies

Linear Programming --- inner method

$\max\ \sum_i c_i x_i$ under $\sum_j A_{ij} x_j \leq b_i, \quad x_i \geq 0$

log-barrier: $\varphi(\mathbf{x}) = -\sum_i \log\Big(b_i - \sum_j A_{ij} x_j\Big) - \sum_i \log x_i$

Convex Cone Programming

P: positive semi-definite matrices

convex potential (barrier) function

dual geodesic approach

$A\mathbf{x} = \mathbf{b}, \qquad \min\ \mathbf{c}\cdot\mathbf{x}$

Support vector machine

Convex Programming --- Inner Method

LP: $A\mathbf{x} \leq \mathbf{b}, \qquad \min\ \mathbf{c}\cdot\mathbf{x}$

barrier: $-\sum_i \log\Big(b_i - \sum_j A_{ij} x_j\Big) - \sum_i \log x_i$

Simplex method ; inner method

Barrier function

Polynomial-Time Algorithm

The curvature of the central path governs the admissible step size (polynomial-time bound).

$\min\ t\,\mathbf{c}\cdot\mathbf{x} + \varphi(\mathbf{x})$: the central path $\mathbf{x}(t)$ is followed like a geodesic (see the sketch below).
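A minimal sketch of the inner (barrier) method on a toy LP; the problem data, Newton inner loop and schedule for $t$ are illustrative assumptions. It minimizes $t\,\mathbf{c}\cdot\mathbf{x} + \varphi(\mathbf{x})$ with the log-barrier $\varphi$ for increasing $t$, following the central path toward the LP optimum.

```python
import numpy as np

c = np.array([-1.0, -2.0])                 # minimize c.x
A = np.array([[1.0, 1.0], [1.0, 3.0]])
b = np.array([4.0, 6.0])

def phi(x, t):
    slack = b - A @ x
    if np.any(slack <= 0) or np.any(x <= 0):
        return np.inf                      # outside the feasible region
    return t * (c @ x) - np.log(slack).sum() - np.log(x).sum()

def newton_direction(x, t):
    slack = b - A @ x
    g = t * c + A.T @ (1.0 / slack) - 1.0 / x
    H = A.T @ np.diag(1.0 / slack**2) @ A + np.diag(1.0 / x**2)
    return np.linalg.solve(H, g)

x = np.array([0.5, 0.5])                   # strictly feasible starting point
t = 1.0
for _ in range(12):                        # outer loop: raise t along the central path
    for _ in range(20):                    # inner loop: damped Newton on the barrier
        dx = newton_direction(x, t)
        step = 1.0
        while phi(x - step * dx, t) > phi(x, t):
            step *= 0.5                    # backtrack: stay feasible and decrease
        x = x - step * dx
    t *= 2.0
print(x, c @ x)                            # approaches the LP optimum (3, 1), value -5
```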

Integration of evidences:

arithmetic mean, geometric mean, harmonic mean, α-mean

$x_1, x_2, \ldots, x_m$

Generalized mean: f-mean. $f(u)$: monotone; the f-representation of $u$.

$m_f(a, b) = f^{-1}\Big\{\dfrac{f(a) + f(b)}{2}\Big\}$

scale-free: $m_f(ca, cb) = c\,m_f(a, b)$, which requires

$f(u) = u^{\frac{1-\alpha}{2}} \quad (\alpha \neq 1), \qquad f(u) = \log u \quad (\alpha = 1)$

α-representation

α-mean $m_\alpha(a, b)$:

$\alpha = -1$: $\dfrac{a + b}{2}$ (arithmetic)
$\alpha = 1$: $\sqrt{ab}$ (geometric)
$\alpha = 3$: $\dfrac{2ab}{a + b}$ (harmonic)
$\alpha = 0$: $\dfrac{1}{4}\big(\sqrt{a} + \sqrt{b}\big)^2$
$\alpha \to +\infty$: $\min(a, b)$; $\alpha \to -\infty$: $\max(a, b)$

applied to distributions: $m_\alpha\big(p_1(s), p_2(s)\big)$
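A minimal sketch, not from the slides, of the α-mean of two positive values via the α-representation $f_\alpha(u) = u^{(1-\alpha)/2}$ (and $\log u$ for $\alpha = 1$): it recovers the familiar means for particular α and interpolates between min and max as α varies.

```python
import numpy as np

def alpha_mean(a, b, alpha):
    if alpha == 1.0:                          # limiting case: geometric mean
        return np.sqrt(a * b)
    r = (1.0 - alpha) / 2.0
    return ((a**r + b**r) / 2.0) ** (1.0 / r)

a, b = 1.0, 4.0
for alpha in [-1.0, 0.0, 1.0, 3.0]:
    print(alpha, alpha_mean(a, b, alpha))
# -1.0 -> 2.5 (arithmetic), 0.0 -> 2.25, 1.0 -> 2.0 (geometric), 3.0 -> 1.6 (harmonic)
```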

Various Means

$\dfrac{a + b}{2}$: arithmetic $\qquad \sqrt{ab}$: geometric $\qquad \dfrac{2}{\frac{1}{a} + \frac{1}{b}}$: harmonic

Any other mean?

Family of Distributions $p_1(s), \ldots, p_k(s)$

mixture family: $p_{\mathrm{mix}}(s) = \sum_{i=1}^{k} t_i\, p_i(s), \qquad \sum_i t_i = 1$

exponential family: $\log p_{\exp}(s) = \sum_i t_i \log p_i(s) - \psi$

f-family: $p(x; \mathbf{t}) = f^{-1}\Big\{\sum_i t_i\, f\big(p_i(x)\big)\Big\}$

α-geodesic projection

robust estimator