Lecture Note 2 – Calculus and Probability
Shuaiqiang Wang, Department of CS & IS, University of Jyväskylä
http://users.jyu.fi/~swang/ · [email protected]
Part 1: Calculus
Definition
• Given a function $f(x)$, the derivative is
$f'(x) = \frac{d}{dx} f(x) = \lim_{t \to 0} \frac{f(x+t) - f(x)}{t}$
• Chain rule: $\frac{d}{dx} f(t) = \frac{df}{dt} \cdot \frac{dt}{dx}$
• The derivative of a constant is zero: $\frac{d}{dx} 2 = 0$
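A minimal Python sketch of the limit definition, approximating $f'(x)$ with a small finite step $t$ (the test function and step size are illustrative choices):

```python
def numerical_derivative(f, x, t=1e-6):
    """Approximate f'(x) by the difference quotient (f(x+t) - f(x)) / t."""
    return (f(x + t) - f(x)) / t

# Example: f(x) = x**3, whose exact derivative at x = 2 is 3 * 2**2 = 12.
print(numerical_derivative(lambda x: x**3, 2.0))  # ~12.000006
```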
Polynomial Function
• Example: $\frac{d}{dx} x^a = a x^{a-1}$
Proof: Polynomial Function
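One standard argument, assuming $a$ is a positive integer and expanding $(x+t)^a$ with the binomial theorem:

$\frac{d}{dx} x^a = \lim_{t \to 0} \frac{(x+t)^a - x^a}{t} = \lim_{t \to 0} \left( a x^{a-1} + \binom{a}{2} x^{a-2} t + \cdots + t^{a-1} \right) = a x^{a-1}$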
Logarithm Function
• $\frac{d}{dx} \ln x = \frac{1}{x}$, where the base of $\ln$ is $e$
• Example: $\frac{d}{dx} \ln(x^2+2)$: let $t = x^2+2$, then
$\frac{d}{dt} \ln t \times \frac{dt}{dx} = \frac{1}{t} \times 2x = \frac{2x}{x^2+2}$
Proof: Logarithm Function
• $\frac{d}{dx} \ln x = \lim_{t \to 0} \frac{\ln(x+t) - \ln x}{t} = \lim_{t \to 0} \frac{1}{t} \ln\left(1 + \frac{t}{x}\right)$
• Let $u = t/x$. Then when $t \to 0$, $u \to 0$, and
• $= \lim_{u \to 0} \frac{1}{ux} \ln(1+u) = \frac{1}{x} \ln\left(\lim_{u \to 0} (1+u)^{1/u}\right) = \frac{1}{x} \ln e = \frac{1}{x}$
Exponential Function
• Example: $\frac{d}{dx} e^x = e^x$
• $\frac{d}{dx} e^{x^2+x}$: let $t = x^2+x$, then
$\frac{d}{dt} e^t \times \frac{dt}{dx} = e^t \times (2x+1) = (2x+1) e^{x^2+x}$
Proof: Exponential Function
• Let’s calculate $\lim_{t \to 0} \frac{e^t - 1}{t}$. Let $u = e^t - 1$, so $t = \ln(1+u)$. Then when $t \to 0$, $u \to 0$, and
• $\lim_{t \to 0} \frac{e^t - 1}{t} = \lim_{u \to 0} \frac{u}{\ln(1+u)} = \lim_{u \to 0} \frac{1}{\ln(1+u)^{1/u}} = \frac{1}{\ln e} = 1$
• Thus $\frac{d}{dx} e^x = \lim_{t \to 0} \frac{e^{x+t} - e^x}{t} = e^x \lim_{t \to 0} \frac{e^t - 1}{t} = e^x$
Exponential Function
• $\frac{d}{dx} a^x = a^x \ln a$
• Proof: let $t = x \ln a$, so $a^x = e^{x \ln a} = e^t$. Then
• $\frac{d}{dx} a^x = \frac{d}{dt} e^t \times \frac{dt}{dx} = e^t \ln a$
• Thus $\frac{d}{dx} a^x = a^x \ln a$
Taylor Series
$f(x) = \sum_{i=0}^{\infty} \frac{f^{(i)}(a)}{i!} (x-a)^i$
• When $a = 0$: $f(x) = \sum_{i=0}^{\infty} \frac{f^{(i)}(0)}{i!} x^i$
• Example: $e^x = \sum_{i=0}^{\infty} \frac{x^i}{i!} = 1 + x + \frac{x^2}{2!} + \cdots$
Partial Derivative and Gradient
$\boldsymbol{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$, for example $f(\boldsymbol{x}) = a x_1 x_2 + b x_2^2$
The partial derivative of a function $f$ with respect to a certain variable is the derivative of $f$ regarding all other variables as constants.
$\nabla f(\boldsymbol{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$
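For the example above, taking partial derivatives term by term gives:

$\nabla f(\boldsymbol{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} a x_2 \\ a x_1 + 2 b x_2 \end{bmatrix}$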
Taylor Approximation
Taylor approximation (truncated at order $k$): $f(x) \approx \sum_{i=0}^{k} \frac{f^{(i)}(a)}{i!} (x-a)^i$
Taylor series: $f(x) = \sum_{i=0}^{\infty} \frac{f^{(i)}(a)}{i!} (x-a)^i$
First-Order Taylor Approximation
• 1 dimension: $f(x) \approx f(a) + f'(a)(x-a)$
• $n$ dimensions: $f(\boldsymbol{x}) \approx f(\boldsymbol{a}) + \nabla f(\boldsymbol{a})^\top (\boldsymbol{x} - \boldsymbol{a})$
• Both hold when $x$ is close to $a$.
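For instance, with $f(x) = e^x$ and $a = 0$:

$e^x \approx e^0 + e^0 (x - 0) = 1 + x$, so $e^{0.1} \approx 1.1$, close to the true value $1.10517\ldots$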
Gradient Descent Optimization
According to the first-order Taylor approximation of $f(\boldsymbol{x})$:
$f(\boldsymbol{x}_n + h\boldsymbol{u}) = f(\boldsymbol{x}_n) + h \nabla f(\boldsymbol{x}_n)^\top \boldsymbol{u} + O(h) \quad (1)$
where $h$ is the learning rate, and $\boldsymbol{u}$ is a unit vector representing the direction.
Let $\boldsymbol{x}_{n+1} = \boldsymbol{x}_n + h\boldsymbol{u}$, which is the value of $\boldsymbol{x}$ in the next iteration.
Our optimization objective function is:
$\arg\min_{\boldsymbol{u}} f(\boldsymbol{x}_n + h\boldsymbol{u}) = \arg\min_{\boldsymbol{u}} h \nabla f(\boldsymbol{x}_n)^\top \boldsymbol{u}$
The optimal solution is: $\boldsymbol{u} = -\frac{\nabla f(\boldsymbol{x}_n)}{\|\nabla f(\boldsymbol{x}_n)\|}$
Gradient Descent Algorithm
For $n = 1, 2, \ldots, N_{\max}$:
    $\boldsymbol{g}_n = \nabla f(\boldsymbol{x}_n)$
    if $\|\boldsymbol{g}_n\| \le \epsilon$, return $\boldsymbol{x}_n$
    $\boldsymbol{x}_{n+1} = \boldsymbol{x}_n - h \boldsymbol{g}_n$
End
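A minimal Python sketch of this loop (the quadratic test function, step size, and tolerance below are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, h=0.1, tol=1e-6, n_max=1000):
    """Minimize f by repeatedly stepping against its gradient grad_f."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_max):
        g = grad_f(x)                  # g_n = grad f(x_n)
        if np.linalg.norm(g) <= tol:   # stop when the gradient is near zero
            return x
        x = x - h * g                  # x_{n+1} = x_n - h * g_n
    return x

# Example: f(x) = (x1 - 1)^2 + (x2 + 2)^2, minimized at (1, -2).
grad = lambda x: 2 * (x - np.array([1.0, -2.0]))
print(gradient_descent(grad, x0=[0.0, 0.0]))  # ~[ 1. -2.]
```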
Part 2: Probability
Independent Events
• Let $A$ and $B$ be two independent events:
$P(A,B) = P(A) P(B)$
• Example 1: Coin tossing – each toss is independent of the previous ones
• Example 2: Taking exams – each exam is independent of the previous ones (write $p$ for the probability of failing a single exam)
– Fail 3 times: $p^3$
– Pass at least 1 time: $1 - p^3$
Conditional Probability
• A person went to the sauna 6 times during the last 10 days, at most once per day.
• It snowed 8 days during the last 10 days.
• It snowed on 4 of the 6 sauna days.
• P(sauna | snow) = ?
• P(snow | sauna) = ?
$P(A|B) = \frac{P(A,B)}{P(B)}$
Example
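Worked solution, treating each of the 10 days as equally likely:

$P(\text{sauna}, \text{snow}) = \frac{4}{10}, \quad P(\text{sauna} \mid \text{snow}) = \frac{4/10}{8/10} = \frac{1}{2}, \quad P(\text{snow} \mid \text{sauna}) = \frac{4/10}{6/10} = \frac{2}{3}$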
Bayes’ Theorem
$P(\theta|y) = \frac{P(y,\theta)}{P(y)}$
Since $P(y,\theta) = P(y|\theta) P(\theta) = P(\theta|y) P(y)$,
then $P(\theta|y) = \frac{P(y|\theta) P(\theta)}{P(y)}$
Bayes’ Theorem
$P(\theta|y) = \frac{P(y,\theta)}{P(y)} = \frac{P(y|\theta) P(\theta)}{P(y)}$
• With the same data $y$ and the same prior $P(\theta)$, the posterior $P(\theta|y)$ is proportional to the likelihood $P(y|\theta)$.
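As a sanity check, Bayes' theorem links the two conditional probabilities computed in the sauna example above:

$P(\text{snow} \mid \text{sauna}) = \frac{P(\text{sauna} \mid \text{snow}) \, P(\text{snow})}{P(\text{sauna})} = \frac{(1/2)(8/10)}{6/10} = \frac{2}{3}$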
Maximum Likelihood Estimation
• Input: a set of observations $y = \{y_1, \ldots, y_n\}$ with parameters $\theta$
• Output: the estimation of $\theta$
• Assume that all of the observations are independent
• Thus their joint probability can be calculated as
$\mathcal{L}(y|\theta) = \prod_{i=1}^{n} P(y_i|\theta)$
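For instance, for $n$ independent coin tosses with $k$ heads, modeled as Bernoulli trials with $P(\text{head}) = \theta$ (an illustrative model, not from the slides):

$\mathcal{L}(y|\theta) = \theta^{k} (1-\theta)^{n-k}$, which is maximized at $\hat{\theta} = k/n$.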
Maximum Likelihood Estimation
• We try to find the $\theta$ under which the given observations $y$ have the largest probability
• With the same observations $y$ and model $P(y_i|\theta)$, we can actually maximize $\mathcal{L}(y|\theta)$:
$\hat{\theta} = \arg\max_{\theta} \mathcal{L}(y|\theta) = \arg\max_{\theta} \prod_{i=1}^{n} P(y_i|\theta)$
Optimization
• Since $\log$ is an increasing function, it is equivalent to maximizing the log-likelihood:
$\hat{\theta} = \arg\max_{\theta} \log \mathcal{L}(y|\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log P(y_i|\theta)$
• Then we can optimize it with gradient descent (equivalently, minimize the negative log-likelihood).
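A minimal Python sketch tying the two parts together, assuming Bernoulli coin-toss data (the data, step size, and tolerance are illustrative): gradient descent on the negative log-likelihood recovers the closed-form MLE $\hat{\theta} = k/n$.

```python
import numpy as np

# Illustrative data: 10 coin tosses, 7 heads (y_i = 1) and 3 tails (y_i = 0).
y = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 1], dtype=float)

def neg_log_likelihood_grad(theta):
    """Gradient of -sum_i log P(y_i | theta) for Bernoulli observations."""
    return -(np.sum(y) / theta - np.sum(1 - y) / (1 - theta))

theta = 0.5                        # initial guess
h = 0.01                           # learning rate
for _ in range(1000):
    g = neg_log_likelihood_grad(theta)
    if abs(g) <= 1e-6:             # stop when the gradient is near zero
        break
    theta -= h * g                 # theta_{n+1} = theta_n - h * g_n

print(theta)  # ~0.7, matching the closed-form MLE k/n = 7/10
```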
Any Questions?