Output Units and Cost Function in FNN


Transcript of Output Units and Cost Function in FNN

Page 1: Output Units and Cost Function in FNN

Deep Neural Network: Cost Functions and Output Units

Jiaming Lin
[email protected]
DATALab@III, NetDBLab@NTU

January 9, 2017

Page 2: Output Units and Cost Function in FNN

Outline

1 Introduction

2 Output Units and Cost Functions
– Binary
– Multinoulli

3 Deterministic and Generic Model

4 Conclusions and Discussions

Page 3: Output Units and Cost Function in FNN

Introduction

In neural network learning...

The selection of the output unit depends on the learning problem.
– Classification: sigmoid, softmax or linear.
– Linear regression: linear.

Determine and analyse the cost function.
– Is the cost function †analytic?
– Can the learning progress well (first-order derivative)?

Deterministic and Generic Model.
– Data is more complicated in many cases.

Note: †For simplicity, we say a function is analytic to mean it is infinitely differentiable on its domain.




Page 8: Output Units and Cost Function in FNN

Binary

index | x1    | · · · | xn    | target
1     | 0     | · · · | 1     | Class A
2     | 1     | · · · | 0     | Class B
3     | 1     | · · · | 1     | Class A
· · · | · · · | · · · | · · · | · · ·
m     | 0     | · · · | 0     | Class B

Page 9: Output Units and Cost Function in FNN

Binary

The prediction is ŷ = S(z), where
S is the sigmoid function, and
z is the input of the output layer:

z = w⊤h + b (1)

with w the weight, h the output of the hidden layer, and b the bias.
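As a minimal sketch (not from the slides), the binary output unit ŷ = S(z) with z = w⊤h + b can be written in NumPy; the hidden output h, weight w and bias b below are made-up values for illustration:

```python
import numpy as np

def sigmoid(z):
    # S(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical hidden-layer output, weight and bias
h = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, 0.3])
b = 0.2

z = w @ h + b        # z = w^T h + b, equation (1)
y_hat = sigmoid(z)   # prediction, always in (0, 1)
print(z, y_hat)
```

The sigmoid squashes the real-valued score z into (0, 1), so ŷ can be read as the probability of Class A.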

Page 10: Output Units and Cost Function in FNN

Cost Function

A cost function can be derived in many ways; we discuss two of the most common:

Mean Square Error

Cross Entropy

Page 11: Output Units and Cost Function in FNN

Cost Function

Mean Square Error

Let y^(i) denote the data label, and ŷ^(i) = S(z^(i)) the prediction. We may define the cost function C_mse by

C_mse = (1/m) Σ_{i=1}^{m} (ŷ^(i) − y^(i))²  (2)

where m is the data size, and z^(i), y^(i) and ŷ^(i) are real numbers.
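Equation (2) on toy values can be sketched as follows; the labels and predictions are made up for illustration:

```python
import numpy as np

def mse_cost(y, y_hat):
    # C_mse = (1/m) * sum_i (y_hat^(i) - y^(i))^2, equation (2)
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y_hat - y) ** 2)

y     = [1, 0, 1, 0]          # labels
y_hat = [0.9, 0.2, 0.8, 0.1]  # sigmoid outputs
print(mse_cost(y, y_hat))     # small when predictions match labels
```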

Page 12: Output Units and Cost Function in FNN

Cost Function

Cross Entropy

Adapting the symbols above, the cost function defined by cross entropy is

C_ce = (1/m) Σ_{i=1}^{m} [ y^(i) ln(ŷ^(i)) + (1 − y^(i)) ln(1 − ŷ^(i)) ]  (2)

where m is the data size, and z^(i), y^(i) and ŷ^(i) are real numbers.

Page 13: Output Units and Cost Function in FNN

Comparison between MSE and Cross Entropy

Problem: which one is better?

Analyticity (infinitely differentiable)

Learning ability (first-order derivatives)

Page 14: Output Units and Cost Function in FNN

Comparison between MSE and Cross Entropy

Analyticity:

C_mse = (1/m) Σ_{i=1}^{m} (ŷ^(i) − y^(i))²

C_ce = (1/m) Σ_{i=1}^{m} [ y^(i) ln(ŷ^(i)) + (1 − y^(i)) ln(1 − ŷ^(i)) ]

Computationally, the value of ŷ^(i) = S(z^(i)) can overflow to 1 or underflow to 0 when z^(i) is very positive or very negative. Therefore, given a fixed y^(i) ∈ {0, 1},

C_ce is undefined when ŷ^(i) is 0 or 1.

C_mse is polynomial in ŷ^(i) and thus analytic everywhere.

Page 15: Output Units and Cost Function in FNN

Comparison between MSE and Cross Entropy

Learning Ability: compare the gradients

∂C_mse/∂w = [S(z) − y][1 − S(z)]S(z)h,  (3)

∂C_ce/∂w = [y − S(z)]h  (4)

respectively, where S is the sigmoid and z = w⊤h + b.

Page 16: Output Units and Cost Function in FNN

Comparison between MSE and Cross Entropy

MSE: [S(z) − y][1 − S(z)]S(z)h          Cross Entropy: [y − S(z)]h

If y = 1 and ŷ → 1, steps → 0           If y = 1 and ŷ → 1, steps → 0
If y = 1 and ŷ → 0, steps → 0           If y = 1 and ŷ → 0, steps → 1
If y = 0 and ŷ → 1, steps → 0           If y = 0 and ŷ → 1, steps → −1
If y = 0 and ŷ → 0, steps → 0           If y = 0 and ŷ → 0, steps → 0

In the case of Mean Square Error, progress gets stuck when z is very positive or very negative.
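The table above can be checked numerically. A sketch comparing the two gradient magnitudes from equations (3) and (4) for a confidently wrong prediction (y = 1 with very negative z; h is set to 1 for simplicity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse(z, y, h=1.0):
    # [S(z) - y][1 - S(z)] S(z) h, equation (3)
    s = sigmoid(z)
    return (s - y) * (1 - s) * s * h

def grad_ce(z, y, h=1.0):
    # [y - S(z)] h, equation (4)
    return (y - sigmoid(z)) * h

# Confidently wrong answers: y = 1 but z very negative
for z in [-2.0, -10.0, -30.0]:
    print(z, grad_mse(z, 1.0), grad_ce(z, 1.0))
# The MSE gradient vanishes as z -> -inf (learning stalls),
# while the cross-entropy gradient approaches 1 (learning proceeds).
```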


Page 18: Output Units and Cost Function in FNN

The Unstable Issue in Cross Entropy

We have mentioned the unstable issue of cross entropy. Precisely,

ŷ = S(z) can underflow to 0 when z is very negative,

ŷ = S(z) can overflow to 1 when z is very positive.

Therefore, given a fixed y ∈ {0, 1}, the function

C = y ln ŷ + (1 − y) ln(1 − ŷ)

could be undefined when z is very positive or very negative.


Page 20: Output Units and Cost Function in FNN

The Unstable Issue in Cross Entropy

Alternatively, regarding z as the variable of the cross entropy,

C = y ln S(z) + (1 − y) ln(1 − S(z))  (5)
  = −ζ(−z) + z(y − 1),  (6)

where ζ is the softplus function and z is a real number.
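Equation (6) is the basis of the numerically stable "cross entropy from logits" form found in practice. A sketch of the negative of (6) as a loss, using the stable softplus identity ζ(u) = max(u, 0) + ln(1 + e^(−|u|)) so that no exponential overflows for any z:

```python
import numpy as np

def softplus(u):
    # zeta(u) = ln(1 + e^u), computed stably as max(u, 0) + ln(1 + e^{-|u|})
    return np.maximum(u, 0.0) + np.log1p(np.exp(-np.abs(u)))

def stable_bce_from_logits(z, y):
    # Negative of C in equation (6): loss = zeta(-z) + z(1 - y)
    # which expands to max(z, 0) - z*y + ln(1 + e^{-|z|})
    return softplus(-z) + z * (1.0 - y)

z = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
y = np.array([  1.0,  1.0, 0.0, 0.0,  0.0])
print(stable_bce_from_logits(z, y))  # finite even for extreme z
```

For moderate z this agrees with the naive −[y ln S(z) + (1 − y) ln(1 − S(z))], but it stays finite where the naive form would take ln of 0.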

Page 21: Output Units and Cost Function in FNN

The Unstable Issue in Cross Entropy

We may obtain the analyticity of C by showing that dC/dz is a product of analytic functions.

Page 22: Output Units and Cost Function in FNN

The Unstable Issue in Cross Entropy

In the cases of the right answer:

y = 1 and ŷ = S(z) → 1 ⇒ z → ∞, C → 0,

y = 0 and ŷ = S(z) → 0 ⇒ z → −∞, C → 0.

In the cases of the wrong answer:

y = 1 and ŷ = S(z) → 0 ⇒ z → −∞, ∇C → 1,

y = 0 and ŷ = S(z) → 1 ⇒ z → ∞, ∇C → −1.


Page 24: Output Units and Cost Function in FNN

Multinoulli: Output Unit and Cost Function

Generalize the binary case to multiple classes.
Linear output units and #(output units) = #(classes).
Cost function evaluated by cross entropy.

Cost Function in Multinoulli Problems

Suppose the size of the dataset is m and there are K classes. Then we can obtain the cost function from cross entropy:

C(w) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y^(i) = k} ln [ exp(z_k^(i)) / Σ_{j=1}^{K} exp(z_j^(i)) ]  (7)

where z_k^(i) = w_k⊤ h^(i) + b_k and h^(i) is the output of the hidden layer corresponding to example x^(i).
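A sketch of equation (7) for a couple of examples (the logits are made up); the softmax shifts by the maximum logit so the exponentials never overflow:

```python
import numpy as np

def softmax(z):
    # Subtract max(z) first so np.exp never overflows
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multinoulli_cost(Z, labels):
    # C(w) = -sum_i ln softmax(z^(i))[y^(i)], equation (7);
    # the indicator 1{y^(i) = k} simply picks out the true class.
    return -sum(np.log(softmax(z)[k]) for z, k in zip(Z, labels))

Z = np.array([[2.0, 0.5, -1.0],   # logits z^(i) for each example
              [0.1, 0.2,  3.0]])
labels = [0, 2]                   # true classes y^(i)
print(multinoulli_cost(Z, labels))
```

The cost is near 0 when the true class dominates the logits, and grows as the model assigns it less probability.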



Page 27: Output Units and Cost Function in FNN

A Lemma for Simplifying the Cost Function

Analyticity (infinitely differentiable)

Learning ability (first-order derivatives)

To claim the above properties, we should first show a lemma.

Lemma 1

For the output z = w⊤h + b with z = [z1, . . . , zK], we have

ln Σ_{j=1}^{K} exp(z_j) ≈ max_j {z_j}.  (8)

Page 28: Output Units and Cost Function in FNN

A Lemma for Simplifying the Cost Function

Proof.

Without loss of generality, we may assume z1 > . . . > zK; the remaining work is to show that, for all ε > 0,

ln [ e^{z1} (1 + Σ_{j=2}^{K} e^{z_j − z1}) ] = z1 + ln (1 + Σ_{j=2}^{K} e^{z_j − z1}) ≤ z1 + ε.

Intuitively, ln Σ_{j=1}^{K} exp(z_j) can be well approximated by max_j {z_j}.
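Lemma 1 is the familiar log-sum-exp trick. A sketch showing that ln Σ exp(z_j) can be computed via max_j z_j without overflow, exactly as in the factorization used in the proof:

```python
import numpy as np

def logsumexp(z):
    # ln sum_j exp(z_j) = m + ln sum_j exp(z_j - m), with m = max_j z_j;
    # every exponent is <= 0, so nothing overflows.
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1000.0, 999.0, 990.0])
print(logsumexp(z))              # finite; naive np.log(np.sum(np.exp(z))) overflows
print(logsumexp(z) - np.max(z))  # the gap to max_j z_j is small
```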

Page 29: Output Units and Cost Function in FNN

Analyticity

We may rewrite the cost function as

C(w) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y^(i) = k} [ z_k^(i) − ln Σ_{j=1}^{K} exp(z_j^(i)) ].

Each summand is a subtraction of analytic functions and thus analytic, and the term 1{y^(i) = k} is actually a constant. The total cost is a sum of analytic functions and thus analytic.

Page 30: Output Units and Cost Function in FNN

Learning Ability

Property 2

By the rule of sums in derivatives, we may simplify (7) as follows:

C^(i) = Σ_{k=1}^{K} 1{y = k} [ z_k − ln Σ_{j=1}^{K} exp(z_j) ],  (8)

the cost contributed by the example x^(i) to the total cost C.

1 If the model gives the right answer, then the error is close to 0.

2 If the model gives the wrong answer, then the learning can progress well.

Page 31: Output Units and Cost Function in FNN

Learning Ability

Proof (The Right Answer).

Suppose the true label is class n. By the assumption, we know z_n is the maximal. Then

−ε ≤ Σ_{k=1}^{K} 1{y = k} [ z_k − ln Σ_{j=1}^{K} exp(z_j) ]
   = z_n − ln Σ_{j=1}^{K} exp(z_j)
   < z_n − max_j {z_j} = 0.

This shows that −ε ≤ C^(i) < 0 for an arbitrarily small ε.

Page 32: Output Units and Cost Function in FNN

Learning Ability

Proof (The Wrong Answer).

Suppose the true label is class n. By assumption, the prediction z_n given by the model is not the maximal. Using the fact that

z_n ≠ max_j {z_j} ⇒ softmax(z_n) ≪ 1,

there exists a sufficiently large δ > 0 such that | softmax(z_n) − 1 | > δ.

Page 33: Output Units and Cost Function in FNN

Learning Ability

Proof (The Wrong Answer, Cont.)

Then

∂C^(i)/∂z_n = ∂/∂z_n [ z_n − ln Σ_{j=1}^{K} e^{z_j} ] = 1 − softmax(z_n) > δ.

This shows the gradient is sufficiently large and also predictable (bounded by 1); therefore the learning can progress well.


Page 35: Output Units and Cost Function in FNN

Learning Processes Overview

         Deterministic                    Generic
Step 1   Model function                   Probability distribution
         – Linear                         – Gaussian
         – Sigmoid                        – Bernoulli
Step 2   Design error evaluations         Maximum Likelihood Estimate
         – MSE
         – Cross Entropy
Step 3   Learning one statistic           Learning the full distribution
         – Mean
         – Median

To describe some complicated data, it is easier to build a model with the generic method.


Page 37: Output Units and Cost Function in FNN

Generic Modeling for Binary Classification

Step 1: Use the Bernoulli distribution as the likelihood function.

p(y | x) = p^y (1 − p)^{1−y}
         = S(z)^y (1 − S(z))^{1−y}

Step 2: Minimize the negative log-likelihood, where

ln p(y | x^(i)) = y ln S(z) + (1 − y) ln(1 − S(z))

Step 3: We can learn the full distribution.

p(y | x′) = S(z′)^y (1 − S(z′))^{1−y},

where we denote z′ = w⊤x′ + b and S is the sigmoid.


Page 39: Output Units and Cost Function in FNN

Generic Modeling for Linear Regression: Step 1

Given a training feature x, use the Gaussian distribution as the likelihood function

p(y | x) = (1/√(2πσ²)) exp( −(µ − y)² / (2σ²) ),

where, denoting the output of the hidden layer as h_x, the weights w = [w1, w2] and biases b = [b1, b2],

µ = w1⊤ h_x + b1
σ = w2⊤ h_x + b2

Intuitively, µ and σ are two linear output units; they are functions of x.
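Step 1's two linear output heads can be sketched as follows; the hidden output h_x, weights and biases are made-up values for illustration:

```python
import numpy as np

# Hypothetical hidden-layer output and the two heads' parameters
h_x = np.array([0.3, -0.7, 1.2])
w1, b1 = np.array([0.5, 0.1, -0.2]), 0.0   # head for mu
w2, b2 = np.array([0.2, -0.3, 0.1]), 0.5   # head for sigma

mu    = w1 @ h_x + b1   # mean of the Gaussian likelihood
sigma = w2 @ h_x + b2   # scale; note a linear unit can go negative,
                        # one motivation for the reparameterization in Step 2
print(mu, sigma)
```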

Page 40: Output Units and Cost Function in FNN

Generic Modeling for Linear Regression: Step 2

Recall that the maximum likelihood estimate is equivalent to minimizing the negative log-likelihood, that is

(µ, σ) = argmin_{(µ,σ)} ( − Σ_x ln p(y | x) )  (8)

However, for each summand,

C_x = ln p(y | x) = −(1/2) [ ln(2πσ²) + (µ − y)²/σ² ]

∂C_x/∂σ = −σ⁻¹ + (µ − y)² σ⁻³

the gradients and errors become unstable when σ is close to 0.


Page 42: Output Units and Cost Function in FNN

Generic Modeling for Linear Regression: Step 2

To prevent the gradients and errors from being unstable, we may substitute the term 1/(2σ²) with v; then for each summand in the log-likelihood

C_x = −(1/2) ln(π/v) − (µ − y)² v,

∂C_x/∂µ = −2v(µ − y),

∂C_x/∂v = 1/(2v) − (µ − y)².

Note that this substitution is valid only when the variance is not too large.
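The substitution v = 1/(2σ²) and the gradients above can be sketched on toy values (µ, v and y below are made up); the gradients stay finite as long as v > 0:

```python
import numpy as np

def cx_and_grads(mu, v, y):
    # C_x = -(1/2) ln(pi / v) - (mu - y)^2 * v   (log-likelihood summand)
    cx      = -0.5 * np.log(np.pi / v) - (mu - y) ** 2 * v
    dcx_dmu = -2.0 * v * (mu - y)                # d C_x / d mu
    dcx_dv  = 1.0 / (2.0 * v) - (mu - y) ** 2    # d C_x / d v
    return cx, dcx_dmu, dcx_dv

cx, gmu, gv = cx_and_grads(mu=1.5, v=0.8, y=1.0)
print(cx, gmu, gv)
```

A finite-difference check on µ confirms the analytic gradient matches the formula.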

Page 43: Output Units and Cost Function in FNN

Generic Modeling for Linear Regression: Step 2

If the variance σ is fixed and chosen by the user, then by comparing the log-likelihood and MSE, we can see that minimizing the NLL is equivalent to minimizing the MSE.

C_mse = (1/m) Σ_{i=1}^{m} ‖ŷ^(i) − y^(i)‖²

C_nll = Σ_{i=1}^{m} C_x^(i) = −(1/2) [ m ln(2πσ²) + Σ_{i=1}^{m} ‖µ_x^(i) − y^(i)‖² / σ² ]


Page 45: Output Units and Cost Function in FNN

Generic Modeling for Linear Regression: Step 3

Full distribution from Generic: µ and σ in this case.

Single statistic from Deterministic: µ in this case.

Experiment (ref): generate random data based on the formula

y = x + 7.0 sin(0.75x) + ε

where ε is Gaussian noise with µ = 0, σ = 1.

Page 46: Output Units and Cost Function in FNN

Generic Modeling for Linear Regression: Step 3

FNN config: #(hidden layer) = 1, width = 20, and the hidden unit is tanh.

(Figure: fitted results, Generic vs. Deterministic.)

Page 47: Output Units and Cost Function in FNN

More Complicated Cases

Complicated data distributions.

In some cases, it is almost impossible to describe the data via deterministic methods.

Generic methods might perform better in complicated cases.

Page 48: Output Units and Cost Function in FNN

Mixture Density Network

Generate random data based on the formula

x = y + 7.0 sin(0.75y) + ε

where ε is Gaussian noise with µ = 0, σ = 1.

Page 49: Output Units and Cost Function in FNN

Mixture Density Network

First, we try using MSE to define the cost function, with one hidden layer of width = 20 and tanh hidden units.

Page 50: Output Units and Cost Function in FNN

Mixture Density Network

The reason is that minimizing MSE is equivalent to minimizing the negative log-likelihood for a simple Gaussian.

Page 51: Output Units and Cost Function in FNN

Mixture Density Network

The mixture density network: the Gaussian mixture with n components is defined by the conditional probability distribution

p(y | x) = Σ_{i=1}^{n} p(c = i | x) N(y; µ^(i)(x), Σ^(i)(x)).  (9)

Network configuration:

1 The number of components n needs to be fine-tuned (trial and error).

2 3 × n output units.
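The mixture negative log-likelihood for equation (9) can be sketched for one-dimensional y; the mixture weights, means and scales below stand in for what the 3 × n output units would produce (values are made up, with π from a softmax and σ > 0 assumed):

```python
import numpy as np

def mdn_nll(y, pi, mu, sigma):
    # p(y|x) = sum_i pi_i * N(y; mu_i, sigma_i^2), equation (9);
    # return -ln p(y|x) for one scalar target y.
    comp = pi * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(np.sum(comp))

# Hypothetical outputs of a 3-component head
pi    = np.array([0.2, 0.5, 0.3])   # mixture weights, sum to 1
mu    = np.array([-1.0, 0.0, 2.0])  # component means
sigma = np.array([0.5, 1.0, 0.8])   # component scales
print(mdn_nll(0.1, pi, mu, sigma))
```

In practice the inner sum is computed with the log-sum-exp trick from Lemma 1 to avoid underflow when all components are far from y.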

Page 52: Output Units and Cost Function in FNN

Mixture Density Network

Experiment (ref):

#(components) = 24,

two hidden layers with width = 24 and tanh activation,

#(output units) = 3 × 24 and they are linear.


Page 54: Output Units and Cost Function in FNN

Conclusions and Discussions

In classification problems, cross entropy is a naturally better error measure than the other methods.

An improvement to cross entropy avoids numerical instability.
– The MNIST example from TensorFlow.

Determining whether a cost function is good or not:
– Is the cost function analytic?
– Can the learning progress well?

Deterministic vs. Generic
– Deterministic learns a single statistic while generic learns the full distribution.
– When the data distribution is not normal (high kurtosis or fat tails), generic might be better.
– Generic methods are easier to apply to complicated cases.


Page 58: Output Units and Cost Function in FNN

Thank you.