
  • Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity

    Amit Daniely Roy Frostig Yoram Singer

    Google Research

    NIPS, 2017. Presenter: Chao Jiang


  • Outline

    1 Introduction
        Overview

    2 Review
        Reproducing Kernel Hilbert Space
        Duality

    3 Terminology and Main Conclusion



  • Overview

    They define an object termed a computation skeleton that describes a distilled structure of feed-forward networks.

    They show that the representation generated by random initialization is sufficiently rich to approximately express the functions in $\mathcal{H}$: all functions in $\mathcal{H}$ can be approximated by tuning the weights of the last layer, which is a convex optimization task.



  • Part I: Function Basis

    In $\mathbb{R}^n$, we can use $n$ linearly independent vectors to represent any vector as a linear combination. These $n$ independent vectors can be viewed as a basis.

    If $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, then

    $$\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$$

    Up to now, this is a review of vector bases; a small numeric sketch follows. This knowledge can be extended to functions and function spaces.
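    A minimal illustrative sketch (not from the slides; the random orthonormal basis is an arbitrary choice): any vector in $\mathbb{R}^3$ is a linear combination of basis vectors, and with an orthonormal basis the coefficients are inner products.

```python
import numpy as np

# Represent a vector in an orthonormal basis of R^3 and check the inner product formula.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # columns of Q form an orthonormal basis
x = np.array([1.0, 2.0, 3.0])
coef = Q.T @ x                                     # coefficients c_i = <x, q_i>
print(np.allclose(Q @ coef, x))                    # True: x = Σ_i c_i q_i
print(np.isclose(np.dot(x, x), np.sum(x * x)))     # True: <x, y> = Σ_i x_i y_i (here y = x)
```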


  • Function Basis

    A function is an infinite vector.

    For a function defined on the interval $[a, b]$, we take samples with spacing $\Delta x$. If we sample the function $f(x)$ at points $a, x_1, \ldots, x_n, b$, then we can represent the function as the vector $(f(a), f(x_1), \ldots, f(x_n), f(b))^T$. As $\Delta x \to 0$, the vector gets closer and closer to the function, and in the limit it becomes infinite-dimensional.
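    A minimal sketch of this idea (our own illustration; the function and grid sizes are arbitrary choices): the finer the sampling, the longer the vector and the better it represents the function.

```python
import numpy as np

# Discretize f(x) = sin(x) on [a, b] = [0, 2π] at two resolutions.
f = np.sin
a, b = 0.0, 2 * np.pi
coarse = f(np.linspace(a, b, 11))     # large Δx  -> vector in R^11
fine = f(np.linspace(a, b, 1001))     # small Δx  -> vector in R^1001
print(coarse.shape, fine.shape)       # (11,) (1001,)
```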


  • Inner Product of Functions Similarly

    Since functions are so close to vectors, we can also define the inner product of functions similarly.

    For two functions $f$ and $g$ sampled with spacing $\Delta x$, the inner product can be defined as:

    $$\langle f, g \rangle = \lim_{\Delta x \to 0} \sum_i f(x_i)\, g(x_i)\, \Delta x = \int f(x)\, g(x)\, dx$$
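    A minimal numeric sketch of this definition (our own example; the interval, grid, and functions are arbitrary choices):

```python
import numpy as np

# Approximate <f, g> = ∫ f(x) g(x) dx on [0, 2π] by the sum Σ_i f(x_i) g(x_i) Δx.
x = np.linspace(0, 2 * np.pi, 10_001)
dx = x[1] - x[0]
f = np.sin(x)
g = np.sin(x)
print(np.sum(f * g) * dx)   # ≈ π, the exact value of ∫ sin²(x) dx on [0, 2π]
```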


  • Inner Product of Functions Similarly

    The expression of the function inner product is seen everywhere. It has various meanings in various contexts.

    For example, if $X$ is a continuous random variable with probability density function $f(x)$, i.e., $f(x) > 0$ and $\int f(x)\, dx = 1$, then the expectation is

    $$E[g(X)] = \int f(x)\, g(x)\, dx = \langle f, g \rangle$$
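    A minimal numeric sketch (our own example): for $X \sim N(0, 1)$ and $g(x) = x^2$, the expectation $E[g(X)] = \langle f, g \rangle = 1$.

```python
import numpy as np

# E[g(X)] = ∫ f(x) g(x) dx with f the standard normal density and g(x) = x².
x = np.linspace(-8, 8, 100_001)             # truncate the real line; the tails are negligible
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density
g = x**2
print(np.sum(f * g) * dx)                   # ≈ 1.0 = E[X²] for X ~ N(0, 1)
```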


  • Inner Product of Functions Similarly

    Similar to a vector basis, we can use a set of functions to represent other functions.

    The difference is that in a vector space we only need finitely many vectors to construct a complete basis, but in a function space we may need infinitely many basis functions.

    Two functions can be regarded as orthogonal if their inner product is zero.

    In a function space, we can also have a set of basis functions that are mutually orthogonal.
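    A minimal check of orthogonality in function space (our own example): $\sin(x)$ and $\sin(2x)$ are orthogonal on $[0, 2\pi]$.

```python
import numpy as np

# Inner products of sine basis functions on [0, 2π], via the Riemann sum above.
x = np.linspace(0, 2 * np.pi, 10_001)
dx = x[1] - x[0]
print(np.sum(np.sin(x) * np.sin(2 * x)) * dx)   # ≈ 0: sin(x) is orthogonal to sin(2x)
print(np.sum(np.sin(x) * np.sin(x)) * dx)       # ≈ π: sin(x) is not orthogonal to itself
```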


  • Inner Product of Functions Similarly

    Kernel methods have been widely used in a variety of data analysis techniques.

    The motivation for kernel methods arises from mapping a vector in $\mathbb{R}^n$ to a vector in a feature space.


  • Eigen Decomposition

    For a real symmetric matrix $A$, there exist a real number $\lambda$ and a vector $q$ such that

    $$Aq = \lambda q$$

    For $A \in \mathbb{R}^{n \times n}$, we can find $n$ eigenvalues $\lambda_i$ along with $n$ orthogonal eigenvectors $q_i$. As a result, $A$ can be decomposed as

    $$A = Q \Lambda Q^{T} = \sum_{i=1}^{n} \lambda_i\, q_i q_i^{T}$$

    where $Q = (q_1, \ldots, q_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. Here $\{q_i\}_{i=1}^{n}$ is an orthogonal basis of $\mathbb{R}^n$.

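    A minimal numeric sketch of this decomposition (our own example with a random symmetric matrix):

```python
import numpy as np

# Eigendecomposition of a real symmetric matrix: A = Q Λ Qᵀ.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                                # make A symmetric
lam, Q = np.linalg.eigh(A)                       # eigenvalues and orthonormal eigenvectors
print(np.allclose(A, Q @ np.diag(lam) @ Q.T))    # True: A = Q Λ Qᵀ
print(np.allclose(Q.T @ Q, np.eye(4)))           # True: the q_i are orthonormal
```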

  • Eigen Decomposition

    A matrix is a description of a linear transformation of a linear space.

    The eigenvectors give the directions of the transformation.

    The eigenvalues give the scaling factors along those directions.


  • Kernel Function

    A function $f(x)$ can be viewed as an infinite vector; likewise, a function of two variables $K(x, y)$ can be viewed as an infinite matrix.

    If $K(x, y) = K(y, x)$ and

    $$\int\!\!\int f(x)\, K(x, y)\, f(y)\, dx\, dy \ge 0$$

    for any function $f$, then $K(x, y)$ is symmetric and positive definite, and $K(x, y)$ is a kernel function.

    Similar to matrix eigenvalues and eigenvectors, there exist an eigenvalue $\lambda$ and an eigenfunction $\psi(x)$ such that

    $$\int K(x, y)\, \psi(x)\, dx = \lambda\, \psi(y)$$
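    A minimal discrete sketch of this eigen-equation (our own example; the RBF kernel and grid are arbitrary choices): on a grid, the integral becomes a matrix-vector product, so eigenvectors of the scaled Gram matrix play the role of sampled eigenfunctions.

```python
import numpy as np

# Discretize ∫ K(x, y) ψ(x) dx = λ ψ(y): with K the Gram matrix on a grid,
# the integral becomes (K @ ψ) * Δx, so eigenpairs of K·Δx approximate (λ, ψ).
x = np.linspace(-3, 3, 400)
dx = x[1] - x[0]
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)            # K[i, j] = exp(-(x_i - x_j)² / 2)
print(np.allclose(K, K.T))                                   # True: symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)                 # True: positive semidefinite
lam, psi = np.linalg.eigh(K * dx)                            # eigenvalues / sampled eigenfunctions
i = -1                                                       # index of the largest eigenvalue
print(np.allclose(K @ psi[:, i] * dx, lam[i] * psi[:, i]))   # True: ∫ K ψ dx = λ ψ
```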


  • Kernel Function

    $$K(x, y) = \sum_{i=0}^{\infty} \lambda_i\, \psi_i(x)\, \psi_i(y) \qquad (1)$$

    Here, $\langle \psi_i, \psi_j \rangle = 0$ for $i \ne j$. Therefore, $\{\psi_i\}_{i=1}^{\infty}$ constitutes a set of orthogonal basis functions for a function space.
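    A minimal finite-dimensional sketch of expansion (1) (our own example, reusing a sampled RBF kernel): eigendecomposing the Gram matrix and summing $\lambda_i\, \psi_i(x)\, \psi_i(y)$ recovers it.

```python
import numpy as np

# Finite analogue of K(x, y) = Σ_i λ_i ψ_i(x) ψ_i(y): rebuild the sampled kernel
# from the eigenpairs of its Gram matrix.
x = np.linspace(-3, 3, 200)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
lam, Psi = np.linalg.eigh(K)             # columns of Psi: sampled ψ_i (orthonormal)
K_rebuilt = (Psi * lam) @ Psi.T          # Σ_i λ_i ψ_i ψ_iᵀ
print(np.allclose(K, K_rebuilt))         # True
```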


  • Reproducing Kernel Hilbert Space

    $\{\sqrt{\lambda_i}\, \psi_i\}_{i=1}^{\infty}$ is a set of orthogonal basis functions that spans a Hilbert space $\mathcal{H}$.

    Any function (vector) in this space can be represented as a linear combination of the basis functions. Suppose

    $$f = \sum_{i=1}^{\infty} f_i \sqrt{\lambda_i}\, \psi_i$$

    We can denote $f$ as an infinite vector in $\mathcal{H}$: $f = (f_1, f_2, \ldots)_{\mathcal{H}}^{T}$. For another function $g = (g_1, g_2, \ldots)_{\mathcal{H}}^{T}$, we have

    $$\langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^{\infty} f_i g_i$$
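    In coordinates, the $\mathcal{H}$ inner product is just a dot product of coefficient vectors. A minimal truncated sketch (the coefficients below are made-up numbers for illustration):

```python
import numpy as np

# Truncate the basis {√λ_i ψ_i} to 4 terms; functions become coefficient vectors.
f_coef = np.array([0.5, -1.2, 0.0, 0.3])   # f = Σ_i f_i √λ_i ψ_i
g_coef = np.array([1.0, 0.4, -0.7, 0.0])   # g = Σ_i g_i √λ_i ψ_i
print(np.dot(f_coef, g_coef))              # <f, g>_H = Σ_i f_i g_i ≈ 0.02
```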


  • Reproducing Kernel Hilbert Space

    We use $K(x, y)$ to denote the value of $K$ at the point $(x, y)$, $K(\cdot, \cdot)$ to denote the function itself, and $K(x, \cdot)$ to denote the $x$-th "row" of the infinite matrix:

    $$K(x, \cdot) = \sum_{i=0}^{\infty} \lambda_i\, \psi_i(x)\, \psi_i$$

    In the space $\mathcal{H}$, we can write

    $$K(x, \cdot) = (\sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \ldots)_{\mathcal{H}}^{T}$$

    Therefore,

    $$\langle K(x, \cdot), K(y, \cdot) \rangle_{\mathcal{H}} = \sum_{i=0}^{\infty} \lambda_i\, \psi_i(x)\, \psi_i(y) = K(x, y)$$

    This is the reproducing property; thus $\mathcal{H}$ is called a reproducing kernel Hilbert space (RKHS).
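    A minimal sketch with a kernel whose feature space is finite-dimensional, so everything can be checked explicitly (our own example, not from the slides): for $K(x, y) = \langle x, y \rangle^2$ on $\mathbb{R}^2$, the vector representing $K(x, \cdot)$ is $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$.

```python
import numpy as np

# For K(x, y) = (x·y)² on R², K(x, ·) corresponds to φ(x) = (x₁², √2 x₁x₂, x₂²),
# and the reproducing property reads <K(x,·), K(y,·)>_H = <φ(x), φ(y)> = K(x, y).
def K(x, y):
    return np.dot(x, y) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([-0.5, 3.0])
print(np.isclose(np.dot(phi(x), phi(y)), K(x, y)))   # True
```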


  • Go Back: How to Map a Point into A Feature Space

    We define a mapping:

    $$\Phi(x) = K(x, \cdot) = (\sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \ldots)_{\mathcal{H}}^{T}$$

    Then we can map the point $x$ to $\mathcal{H}$:

    $$\langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}} = \langle K(x, \cdot), K(y, \cdot) \rangle_{\mathcal{H}} = K(x, y)$$

    As a result, we do not need to know explicitly what the mapping is, what the feature space is, or what the basis of the feature space is.

    Kernel trick: for a symmetric positive-definite function $K$, there must exist at least one mapping $\Phi$ and one feature space $\mathcal{H}$ such that

    $$\langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}} = K(x, y)$$
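    A minimal sketch of why this matters (our own example): for the RBF kernel the feature space is infinite-dimensional, yet feature-space geometry needs only kernel evaluations.

```python
import numpy as np

# Kernel trick with the RBF kernel K(x, y) = exp(-γ ||x - y||²): we never form Φ(x),
# yet the feature-space squared distance is K(x, x) - 2 K(x, y) + K(y, y).
def K(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([0.0, -1.0])
dist_sq = K(x, x) - 2 * K(x, y) + K(y, y)
print(dist_sq)   # ||Φ(x) - Φ(y)||²_H, computed without ever constructing Φ
```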



  • Duality

    In mathematical optimization theory, duality means that an optimization problem may be viewed from either of two perspectives: the primal problem or the dual problem.