Page 1:

Introduction to Computer Vision

Dr. Sedat Ozer

Introduction to Neural Networks: Part 2

Page 2:

Use of Python

• Both GPUs and CPUs have parallelization instructions, and many Python built-in functions support them.
• Use built-in functions for matrix (or vector) operations and avoid using for loops in your code whenever you can!
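As a quick illustration (my own example, not from the slides), the snippet below computes the same dot product with an explicit Python loop and with NumPy's built-in np.dot; the built-in call runs in optimized native code and is typically orders of magnitude faster:

import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Loop version: interpreted Python, one multiply-add at a time.
c_loop = 0.0
for i in range(n):
    c_loop += a[i] * b[i]

# Vectorized version: a single call into optimized, parallelized code.
c_vec = np.dot(a, b)

print(np.isclose(c_loop, c_vec))  # same result, up to float rounding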


Page 3:

Cost function and GD implementation

Remember the loss for a single sample:

L(a, y) = −[y log(a) + (1 − y) log(1 − a)]

Cost function:

J(w, b) = (1/m) Σ_{i=1..m} L(ŷ(i), y(i))

Derivatives of the loss (single sample):

∂L(a, y)/∂z = a − y
∂L(a, y)/∂w1 = x1(a − y)
∂L(a, y)/∂w2 = x2(a − y)

Derivatives of the cost function:

∂J(w, b)/∂w1 = (1/m) Σ_{i=1..m} ∂L(a(i), y(i))/∂w1
∂J(w, b)/∂w2 = (1/m) Σ_{i=1..m} ∂L(a(i), y(i))/∂w2
∂J(w, b)/∂b = (1/m) Σ_{i=1..m} ∂L(a(i), y(i))/∂b

Final derivatives to be used: dw1, dw2, db (with learning rate α = 0.00001).

One iteration of gradient descent:

J = 0; dw1 = 0; dw2 = 0; db = 0
for i = 1 to m:
    z(i) = wᵀx(i) + b
    a(i) = σ(z(i))
    J += −[y(i) log(a(i)) + (1 − y(i)) log(1 − a(i))]
    dz(i) = a(i) − y(i)
    dw1 += x1(i) dz(i)
    dw2 += x2(i) dz(i)
    db += dz(i)
J /= m; dw1 /= m; dw2 /= m; db /= m
w1 := w1 − α·dw1
w2 := w2 − α·dw2
b := b − α·db

Implement all that:

[Diagram: a single neuron with inputs x1 and x2, weights w1 and w2, and bias b, computing z = wᵀx + b and ŷ = σ(z).]


Page 4:

Let's Re-implement Logistic Regression

(Training) data:

X (n×m) = [x(1) … x(m)]   (each column is a sample; each row is a feature)

Remember:

Z = [z(1) … z(m)] = wᵀX + [b, b, …, b]   (1×m)

Implementation for Z: Z = np.dot(w.T, X) + b

A = [a(1) … a(m)] = σ(Z)

Y = [y(1), y(2), …, y(m)]

Note that in this particular example the output is a vector; therefore the matrix Z (and thus the matrix A) becomes a (1×m) row vector.


Page 5:

A Better Implementation in Python

For-loop version (one iteration of gradient descent):

J = 0; dw1 = 0; dw2 = 0; db = 0
for i = 1 to m:
    z(i) = wᵀx(i) + b
    a(i) = σ(z(i))
    J += −[y(i) log(a(i)) + (1 − y(i)) log(1 − a(i))]
    dz(i) = a(i) − y(i)
    dw1 += x1(i) dz(i)
    dw2 += x2(i) dz(i)
    db += dz(i)
J /= m; dw1 /= m; dw2 /= m; db /= m
w1 := w1 − α·dw1; w2 := w2 − α·dw2; b := b − α·db

Vectorized version (one iteration of gradient descent):

Z = wᵀX + b
A = σ(Z)
dZ = A − Y
dw = (1/m) X dZᵀ
db = (1/m) np.sum(dZ)
w := w − α·dw
b := b − α·db
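The vectorized update translates almost line-for-line into working NumPy. A sketch under my own naming assumptions (X is (n, m) with samples as columns, Y is (1, m)):

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def gd_iteration_vectorized(X, Y, w, b, alpha=0.00001):
    m = X.shape[1]
    Z = np.dot(w.T, X) + b              # (1, m): all z(i) at once
    A = sigmoid(Z)                      # (1, m): all activations at once
    dZ = A - Y                          # (1, m)
    dw = (1.0 / m) * np.dot(X, dZ.T)    # (n, 1)
    db = (1.0 / m) * np.sum(dZ)         # scalar
    w = w - alpha * dw                  # gradient-descent updates
    b = b - alpha * db
    return w, b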


Page 6:

Neural “Networks”


Page 7:

Neural Networks

• So far, we have seen only a "single neuron" (with two models):
  • Linear model
  • Logistic regression model
• The performance of a single unit (a single neuron) is limited!
• Higher performance can be achieved by forming a network of multiple neurons.


Page 8:

!"!#!$

%&

x

w

b

' = )*+ + - . = /(') ℒ(., &)

x

4["]

7["]'["] = 4["]+ + 7["] 8["] = /(9["]) '[#] = 4[#]:["] + -[#] .[#] = /('[#]) ℒ(.[#], &)

!"!#!$

%&

;[#]

-[#]

A Neuron vs. A Neural Network

<=>.?@

<=>.?@Lecture Notes for Computer Vision Sedat Ozer

Page 9:

A 2-layer FC-NN Example:

[Diagram: input layer (L0) with x1…x5, hidden layer (L1) with 3 units, output layer (L2) with 1 unit producing ŷ.]

x = a[0]; the hidden layer produces a[1]; and ŷ = a[2].

Each layer has its own weights and bias values. So… the kth layer would have W[k] and b[k]:

W[1] is a (3×5) matrix, b[1] is a (3×1) vector.
W[2] is a (1×3) matrix (i.e., a row vector), b[2] is a (1×1) vector.

The weight vector of the 1st unit at the first layer (subscript: unit; bracketed superscript: layer):

w1[1] = [w11, w21, w31, w41, w51]ᵀ   (5×1, a vector)

The weight matrix for the entire layer 1 (one row per unit):

W[1] = | w11 w21 w31 w41 w51 |   ← 1st unit in L1 (w1[1]ᵀ)
       | w12 w22 w32 w42 w52 |   ← 2nd unit in L1 (w2[1]ᵀ)
       | w13 w23 w33 w43 w53 |   ← 3rd unit in L1 (w3[1]ᵀ)
(3×5)

a[1] = [a1[1], a2[1], a3[1]]ᵀ   (3×1)

ŷ = a[2] = a1[2]   (1×1)


Page 10:

What if I have more than 2 classes?

• In logistic regression we assumed that we had only two classes: Class 0 and Class 1.
• What if I have more than two classes?
• One typical approach is the one-vs-all approach, where, for example, you first consider Class 0 as one class and the combination of all the other classes as the "other" class; then you consider Class 1 as one class and the combination of all the other classes as the "other" class, and so on. (A minimal sketch follows this list.)
• Another approach might be using Softmax Regression instead of logistic regression: define a new cost function and derive all the weight update rules according to that cost function.
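A minimal one-vs-all sketch (illustrative code of mine; train_logistic is a hypothetical binary trainer returning (w, b), not something defined on the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all_train(X, y, num_classes, train_logistic):
    models = []
    for k in range(num_classes):
        y_k = (y == k).astype(float)      # class k vs. "all the others"
        models.append(train_logistic(X, y_k))
    return models

def one_vs_all_predict(X, models):
    # Pick the class whose binary classifier is most confident.
    scores = np.stack([sigmoid(np.dot(X, w) + b) for (w, b) in models])
    return np.argmax(scores, axis=0)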


Page 11:

Softmax Regression

• Remember logistic regression: we had only two classes (Class 0 and Class 1, i.e., y(i) ∈ {0, 1}).
• Softmax Regression is the case where we have K classes (K > 2) such that y(i) ∈ {1, …, K}.
• The sigmoid function is no longer used.
• Now the output can take K different values rather than just two.

" #, % = 1()

*+,

-ℒ(a(1), 3(1))ℒ a, 3 = −)

*+,

K31567(8*)89 = :39 = 7(;9) =

<=>∑*+,@ <=A

Softmax Loss Function Softmax Cost FunctionSoftmax Activation Function
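A small NumPy sketch of these three pieces (illustrative; subtracting max(z) before exponentiating is a standard numerical-stability trick I have added, not something stated on the slide):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))         # shift-invariant, avoids overflow
    return e / np.sum(e)              # a_k = e^(z_k) / sum_j e^(z_j)

def softmax_loss(a, y_onehot):
    # L(a, y) = -sum_k y_k log(a_k), with y one-hot over the K classes.
    return -np.sum(y_onehot * np.log(a))

# Example with K = 3 classes, true class at index 1:
z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a, softmax_loss(a, np.array([0.0, 1.0, 0.0])))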


Page 12:

Let's have a look at an example:

Features (n: total number of features = 6):

• location (zip): x1
• income/wealth: x2
• house size: x3
• # bedrooms: x4
• # bathrooms: x5
• population: x6

Intermediate notions (what hidden units might capture): family size, local amenities, school quality, walkability.

Output: y indicates market popularity (a discrete value: popular / not popular).


Page 13:

A Fully Connected (FC) Neural Network (6 inputs, 1 output)

[Diagram: a fully connected network with six inputs x1…x6 and a single output ŷ; n = total number of features = 6.]


Page 14:

A Fully Connected (FC) Neural Network (6 inputs, 2 outputs)

[Diagram: a fully connected network with six inputs x1…x6 and two outputs ŷ1 and ŷ2; each circle is a "unit" (a neuron); n = total number of features = 6.]


Page 15:

A 2-layer FC-NN Example:

[Diagram: input layer (L0) with x1…x5, hidden layer (L1) with 3 units, output layer (L2) producing ŷ.]

x = a[0],  a[1] = [a1[1], a2[1], a3[1]]ᵀ (3×1),  ŷ = a[2] = a1[2] (1×1)

The weight matrix for the entire layer 1:

W[1] = | w11 w21 w31 w41 w51 |
       | w12 w22 w32 w42 w52 |   (3×5)
       | w13 w23 w33 w43 w53 |

z[1] = W[1]x + b[1]:

| w11 w21 w31 w41 w51 |   | x1 |          | z1[1] |
| w12 w22 w32 w42 w52 | · | x2 | + b[1] = | z2[1] |
| w13 w23 w33 w43 w53 |   | x3 |          | z3[1] |
        (3×5)             | x4 |  (3×1)    (3×1)
                          | x5 |
                          (5×1)

a[1] = [σ(z1[1]), σ(z2[1]), σ(z3[1])]ᵀ = σ(z[1])   (3×1)


Page 16:

A 2-layer FC-NN Example:

[Diagram: input layer (L0) with x1…x5, hidden layer (L1) with 3 units, output layer (L2) producing ŷ.]

x = a[0],  a[1] = [a1[1], a2[1], a3[1]]ᵀ (3×1),  ŷ = a[2] = a1[2] (1×1)

z[1] = W[1]x + b[1]: a (3×5) matrix times a (5×1) vector, plus a (3×1) bias, giving the (3×1) vector [z1[1], z2[1], z3[1]]ᵀ.

a[1] = [σ(z1[1]), σ(z2[1]), σ(z3[1])]ᵀ = σ(z[1])   (3×1)

Steps to compute the output (a logistic-regression unit at each layer) for one input sample:

z[1] = W[1]x + b[1]
a[1] = σ(z[1])
z[2] = W[2]a[1] + b[2]
a[2] = σ(z[2])
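These four steps transcribe directly into NumPy. A sketch (the 5-3-1 layer sizes follow the example above; the random initialization is my own illustrative choice):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 5)), np.zeros((3, 1))  # layer 1: (3x5), (3x1)
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))  # layer 2: (1x3), (1x1)

x = rng.normal(size=(5, 1))     # one input sample, (5x1)

z1 = np.dot(W1, x) + b1         # z[1] = W[1]x + b[1], (3x1)
a1 = sigmoid(z1)                # a[1] = sigma(z[1])
z2 = np.dot(W2, a1) + b2        # z[2] = W[2]a[1] + b[2], (1x1)
a2 = sigmoid(z2)                # y_hat = a[2]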

Page 17:

Computation with fewer for-loops

m = number of training samples.

Algorithm 1 (loop over the m samples):

for i = 1 to m:
    z[1](i) = W[1]x(i) + b[1]
    a[1](i) = σ(z[1](i))
    z[2](i) = W[2]a[1](i) + b[2]
    a[2](i) = σ(z[2](i))

Algorithm 2 (vectorized over the m samples):

Z[1] = W[1]X + b[1]
A[1] = σ(Z[1])
Z[2] = W[2]A[1] + b[2]
A[2] = σ(Z[2])

where X = [x(1) x(2) … x(m)] stacks the samples as columns, A[1] = [a[1](1) a[1](2) … a[1](m)], and Z[1] has the same shape as A[1].

[Diagram: a small network with inputs x1, x2, x3 and output ŷ.]

Page 18:

Another 2-Layer FC-NN Example

[Diagram: input layer with 6 features (x1…x6), hidden layer with 4 units (a1[1]…a4[1]), output layer with 2 units producing ŷ1 and ŷ2.]

A 2-layer NN with 6 inputs and 2 outputs:

x = a[0]

a[1] = [a1[1], a2[1], a3[1], a4[1]]ᵀ   (4×1)

ŷ = a[2] = [a1[2], a2[2]]ᵀ = [ŷ1, ŷ2]ᵀ   (2×1)

QUESTION: What are the dims of W[1] and W[2]? Dims = (a × b); a = ?, b = ?

a = 4 and b = 6 for W[1]; by the same rule, W[2] is (2×4).
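A quick shape check of that answer (illustrative code; the general rule is that W[k] has shape (number of units in layer k) × (number of units in layer k−1)):

import numpy as np

n_in, n_hidden, n_out = 6, 4, 2      # 6 features, 4 hidden units, 2 outputs
W1 = np.zeros((n_hidden, n_in))      # (4, 6)
W2 = np.zeros((n_out, n_hidden))     # (2, 4)
x = np.zeros((n_in, 1))

# Biases and activations omitted: this only checks that the dims line up.
a1 = np.dot(W1, x)                   # (4, 1)
a2 = np.dot(W2, a1)                  # (2, 1): two outputs
print(a1.shape, a2.shape)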


Page 19:

Common Activation Functions

Sigmoid:     a = g(z) = 1 / (1 + e^(−z))
tanh:        a = g(z) = (e^z − e^(−z)) / (e^z + e^(−z))
ReLU:        a = g(z) = max(0, z)
Leaky ReLU:  a = g(z) = max(0.01z, z)

[Plots: a(z) versus z for each of the four functions.]

Page 20:

Activation Function as: g(z)

Algorithm 2 (with σ as the activation):

z[1] = W[1]x + b[1]
a[1] = σ(z[1])
z[2] = W[2]a[1] + b[2]
a[2] = σ(z[2])

Algorithm 2 (with a generic activation g):

z[1] = W[1]x + b[1]
a[1] = g(z[1])
z[2] = W[2]a[1] + b[2]
a[2] = g(z[2])

Page 21:

With or Without the Activation Function

Let's have a look at the case where we do not use any activation function. (That is also equivalent to setting g(z[1]) = z[1].)

With an activation function:

z[1] = W[1]x + b[1]
a[1] = g(z[1])
z[2] = W[2]a[1] + b[2]
a[2] = g(z[2])

Without one (a[1] = z[1]):

z[1] = W[1]x + b[1]
a[1] = z[1]
z[2] = W[2]z[1] + b[2]
a[2] = z[2] = W[2]z[1] + b[2]
            = W[2][W[1]x + b[1]] + b[2]
            = W[2]W[1]x + W[2]b[1] + b[2]
            = Wx + b   (with W = W[2]W[1] and b = W[2]b[1] + b[2])

So ŷ = a[2] = z[2] = Wx + b: the output is always a linear function of the input!
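A tiny numeric check of this collapse (my own illustration): two stacked affine layers with no activation agree exactly with the single collapsed layer.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))
x = rng.normal(size=(5, 1))

# Two stacked linear layers (no activation)...
two_layer = np.dot(W2, np.dot(W1, x) + b1) + b2

# ...equal one linear layer with W = W2 W1 and b = W2 b1 + b2.
W = np.dot(W2, W1)
b = np.dot(W2, b1) + b2
one_layer = np.dot(W, x) + b

print(np.allclose(two_layer, one_layer))   # True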


Page 22:

Derivatives for the Activation Functions

• Remember that the updating process of the parameters depends on the derivatives!
• That also depends on the derivative of the chosen activation function!
• (We used the sigmoid function previously in our logistic regression implementation.)

Page 23:

Derivatives for the Activation Functions

Sigmoid: a = g(z) = 1 / (1 + e^(−z))
g′(z) = dg(z)/dz = g(z)(1 − g(z)) = a(1 − a)

tanh: a = g(z) = (e^z − e^(−z)) / (e^z + e^(−z))
g′(z) = dg(z)/dz = 1 − (tanh z)² = 1 − a²

ReLU: a = g(z) = max(0, z)
g′(z) = dg(z)/dz = 0 if z < 0; 1 if z > 0; undefined if z = 0

Leaky ReLU: a = g(z) = max(0.01z, z)
g′(z) = dg(z)/dz = 0.01 if z < 0; 1 if z ≥ 0

[Plots: a(z) versus z for each of the four functions.]
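These pair up naturally in code. A sketch (function names are mine) returning each activation together with its derivative:

import numpy as np

def sigmoid(z):
    a = 1.0 / (1.0 + np.exp(-z))
    return a, a * (1.0 - a)            # g'(z) = a(1 - a)

def tanh(z):
    a = np.tanh(z)
    return a, 1.0 - a ** 2             # g'(z) = 1 - a^2

def relu(z):
    # Derivative taken as 0 at z = 0 by convention (undefined there).
    return np.maximum(0.0, z), (z > 0).astype(float)

def leaky_relu(z):
    return np.maximum(0.01 * z, z), np.where(z < 0, 0.01, 1.0)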

Page 24:

Forward pass (computation graph):

W[1]x + b[1] = z[1] → σ(z[1]) = a[1] → W[2]a[1] + b[2] = z[2] → σ(z[2]) = a[2] → L(a[2], y)

Backward pass (one sample):

dz[2] = a[2] − y
dW[2] = dz[2] a[1]ᵀ
db[2] = dz[2]
dz[1] = W[2]ᵀ dz[2] ∗ g[1]′(z[1])   (elementwise product)
dW[1] = dz[1] xᵀ
db[1] = dz[1]

Backward pass (vectorized over m samples):

dZ[2] = A[2] − Y
dW[2] = (1/m) dZ[2] A[1]ᵀ
db[2] = (1/m) np.sum(dZ[2], axis=1, keepdims=True)
dZ[1] = W[2]ᵀ dZ[2] ∗ g[1]′(Z[1])   (elementwise product)
dW[1] = (1/m) dZ[1] Xᵀ
db[1] = (1/m) np.sum(dZ[1], axis=1, keepdims=True)
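A sketch tying this vectorized backward pass to the forward pass from earlier (names are mine; g1_prime is the derivative of whichever hidden activation g[1] was chosen):

import numpy as np

# Cached from the forward pass: X (n0, m), Z1 and A1 (n1, m),
# A2 and Y (1, m); W2 has shape (1, n1).
def backward(X, Y, Z1, A1, A2, W2, g1_prime):
    m = X.shape[1]
    dZ2 = A2 - Y                                          # (1, m)
    dW2 = (1.0 / m) * np.dot(dZ2, A1.T)                   # (1, n1)
    db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)  # (1, 1)
    dZ1 = np.dot(W2.T, dZ2) * g1_prime(Z1)                # (n1, m), elementwise
    dW1 = (1.0 / m) * np.dot(dZ1, X.T)                    # (n1, n0)
    db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)  # (n1, 1)
    return dW1, db1, dW2, db2

# Example for tanh hidden units: g1_prime = lambda Z: 1.0 - np.tanh(Z) ** 2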
