# Toward Deeper Understanding of Neural Networks: The · PDF fileToward Deeper Understanding of...

date post

02-Jul-2018Category

## Documents

view

212download

0

Embed Size (px)

### Transcript of Toward Deeper Understanding of Neural Networks: The · PDF fileToward Deeper Understanding of...

Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on

Expressivity

Amit Daniely Roy Frostig Yoram Singer

Google Research

NIPS, 2017Presenter: Chao Jiang

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 1 /

34

Outline

1 IntroductionOverview

2 ReviewReproducing Kernel Hilbert SpaceDuality

3 Terminology and Main Conclusion

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 2 /

34

Outline

1 IntroductionOverview

2 ReviewReproducing Kernel Hilbert SpaceDuality

3 Terminology and Main Conclusion

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 3 /

34

Overview

They define an object termed a computation skeleton that describes adistilled structure of feed-forward networks.

They show that the representation generated by random initializationis sufficiently rich to approximately express the functions in Hall functions in H can be approximated by tuning the weights of thelast layer, which is a convex optimization task.

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 4 /

34

Outline

1 IntroductionOverview

2 ReviewReproducing Kernel Hilbert SpaceDuality

3 Terminology and Main Conclusion

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 5 /

34

Part I: Function Basis

In Rn space, we can use n independent vectors to represent anyvector by linear combination. The n independent vectors can beviewed as a set of basis.

if x = (x1, x2, , xn), y = (y1, y2, , yn), we can get

< x , y >=n

i=1

xiyi

Until now, this is the review of vector basis. These knowledge can beextended to functions and function space.

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 6 /

34

Function Basis

A function is an infinite vector.

For a function defined on the interval [a, b], we take samples by aninterval x .If we sample the function f (x) at points a, x1 , xn, b, then we cantransform the function into a vector (f (a), f (x1), , f (xn), f (b))T .When x 0, the vector should be more and more close to thefunction and at last, it becomes infinite.

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 7 /

34

Inner Product of Functions Similarly

Since functions are so close to vectors, we can also define the innerproduct of functions similarly.

For two functions f and g sampling by interval x , the inner productcould be defined as:

< f , g >= limx0

i

f (xi )g(xi )x =

f (x)g(x)dx

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 8 /

34

Inner Product of Functions Similarly

The expression of function inner product is seen everywhere. It hasvarious meanings in various context.

For example, if X is a continuous random variable with probabilitydensity function f (x) , i.e., f (x) > 0 and

f (x)dx = 1 , then the

expectation

E [g(x)] =

f (x)g(x)dx =< f , g >

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 9 /

34

Inner Product of Functions Similarly

Similar to vector basis, we can use a set of functions to representother functions.

The difference is that in a vector space, we only need finite vectors toconstruct a complete basis set, but in function space, we may needinfinite basis functions.

Two functions can be regarded as orthogonal if their inner product iszero.

In function space, we can also have a set of function basis that aremutually orthogonal.

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 10 /

34

Inner Product of Functions Similarly

Kernel methods have been widely used in a variety of data analysistechniques.

The motivation of kernel method arises in mapping a vector in Rnspace as another vector in a feature space.

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 11 /

34

Eigen Decomposition

For a real symmetric matrix A , there exists real number and vectorq so that

Aq = q

For A Rnn, we can find n eigenvalues (i ) along with n orthogonaleigenvectors (qi ). As a result, A can be decomposited as

Here {qi}ni=1 is a set of orthogonal basis of Rn.Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on Expressivity

NIPS, 2017 Presenter: Chao Jiang 12 /34

Eigen Decomposition

A matrix is a description of the transformation in a linear space.

The transformation direction is eigenvector.

The transformation scale is eigenvalue.

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 13 /

34

Kernel Function

A function f (x) can be viewed as an infinite vector, then for afunction with two independent variables K (x , y), we can view it as aninfinite matrix.

If K (x , y) = K (y , x) and f (x)K (x , y)f (y)dxdy 0

for any function f , then K (x , y) is symmetric and positive definite,then K (x , y) is a kernel function.

Similar to matrix eigenvalue and eigenvector, there exists eigenvalue and eigenfunction (x), so that

K (x , y)(x)dx = (y)

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 14 /

34

Kernel Function

K (x , y) =i=0

ii (x)i (y) (1)

Here, < i , j >= 0fori 6= jTherefore, {i}i=1 construct a set of orthogonal basis for a functionspace.

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 15 /

34

Reproducing Kernel Hilbert Space

{ii}i=1 is a set of orthogonal basis and construct a Hilbert space

HAny function or vector in the space can be represented as the linearcombination of the basis. Suppose

f =i=1

fiii

we can denote f as an infinite vector in H:

f = (f1, f2, )THfor another function

g = (g1, g2, )THwe have

< f , g >H=i=1

figi

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 16 /

34

Reproducing Kernel Hilbert Space

We use K (x , y) to denote the number of K at point x , y , use K (, )to denote the function itself, and use K (x , ) to denote the x-throw of the matrix.

K (x , ) =i=0

ii (x)i

In space H, we can denote

K (x , ) = (11(x),

22(x), )TH

Therefore,

< K (x , ,K (y , )) >H=i=0

ii (x)i (y) = K (x , y)

This is the reproducing property, thus H is called reproducing kernelHilbert space (RKHS).

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 17 /

34

Go Back: How to Map a Point into A Feature Space

we define a mapping:

(x) = K (x , ) = (11(x),

22(x), )TH

then we can map the point x to H.

< (x),(y) >H=< K (x , ),K (y , ) >H= K (x , y)

As a result, we do not need to actually know what is the mapping,where is the feature space, or what is the basis of the feature space.

Kernel trick: for a symmetric positive-definite function K , there mustexist at least one mapping and one feature space H, so that

< (x),(y) >H= K (x , y)

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 18 /

34

Outline

1 IntroductionOverview

2 ReviewReproducing Kernel Hilbert SpaceDuality

3 Terminology and Main Conclusion

Amit Daniely, Roy Frostig, Yoram Singer Toward Deeper Understanding of Neural Networks:The power of Initialization and a Dual View on ExpressivityNIPS, 2017 Presenter: Chao Jiang 19 /

34

Duality

In mathematical optimizati

*View more*