Introduction to Machine Learning: Logistic Regression
Bhaskar Mukhoty, Shivam Bansal
Indian Institute of Technology Kanpur · Summer School 2019
May 28, 2019
Bhaskar Mukhoty, Shivam Bansal (Indian Institute of Technology Kanpur, Summer School 2019) · Introduction to Machine Learning · May 28, 2019
Lecture Outline
The classification problem
Intuition of hyperplanes
The Heaviside function and 0-1 loss
The sigmoid function and cross-entropy loss
Convex functions and gradient descent
The classification problem
Figure: Hyperplane Separation of Data
We assume the data is linearly separable.
We want to learn a hyperplane that separates the two classes.
What does a hyperplane look like?
In 2-dimensional space, a hyperplane is a line.
In d-dimensional space, it is a (d − 1)-dimensional plane.
A hyperplane can be uniquely identified using a unit vector w perpendicular to the plane and a point on the plane.
The hyperplane divides the d-dimensional space into two half-spaces.
Equation of a hyperplane
Suppose we are given a point r = [r1, r2, · · · , rd]⊤ on the hyperplane, and a vector w = [w1, w2, · · · , wd]⊤ perpendicular to the plane.
The direction of w is called the direction of the hyperplane.
Any point x = [x1, x2, · · · , xd]⊤ on the hyperplane must satisfy w ⊥ (x − r).
Hence we have w⊤(x − r) = 0, i.e.
w⊤x = w⊤r = w1r1 + w2r2 + · · · + wdrd = b
That is, for every x in the plane P we have w⊤x = b. Here b is called the bias.
Figure: Point on a hyperplane P
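The relation above can be checked numerically. A minimal sketch (the specific w and r are hypothetical values, not from the slides): compute b = w⊤r, then verify that moving from r along any direction orthogonal to w stays on the plane.

```python
import numpy as np

# Hyperplane in 2-D: unit direction w and a point r on the plane.
w = np.array([1.0, 1.0]) / np.sqrt(2.0)
r = np.array([2.0, 0.0])

b = w @ r  # bias: w^T r

# Any point r + s*v with v orthogonal to w stays on the plane.
v = np.array([1.0, -1.0])   # v is perpendicular to w
x = r + 3.0 * v             # another point on the same hyperplane

print(np.isclose(w @ x, b))  # True: w^T x = b holds on the plane
```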
Characterization of half-spaces using the dot product
Points on the hyperplane satisfy w⊤x = b.
Points in the half-space that lies in the direction of w satisfy w⊤(x − r) > 0, i.e. w⊤x > b.
Points in the half-space opposite to the direction of w satisfy w⊤x < b.
Figure: Point above a hyperplane P
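The sign of w⊤x − b decides which half-space a point lies in. A small sketch (w, b, and the test points are hypothetical):

```python
import numpy as np

w = np.array([1.0, 0.0])
b = 2.0  # hyperplane: w^T x = b, i.e. the vertical line x1 = 2

def side(x):
    """+1 in the half-space along w, -1 in the opposite one, 0 on the plane."""
    return int(np.sign(w @ x - b))

print(side(np.array([3.0, 5.0])))  # 1:  w^T x = 3 > b
print(side(np.array([0.0, 0.0])))  # -1: w^T x = 0 < b
print(side(np.array([2.0, 7.0])))  # 0:  the point lies on the plane
```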
The bias is often hidden in an extended feature
If we write x̃ = [x1, x2, · · · , xd, 1]⊤
and w̃ = [w1, w2, · · · , wd, −b]⊤,
then w⊤x − b = 0 is the same as w̃⊤x̃ = 0.
Instead of searching for a pair {w, b}, we only search for w̃.
Instead of the characterization w⊤x > b, we characterize by w̃⊤x̃ > 0.
However, instead of writing w̃⊤x̃, people often write w⊤x and compare it with 0.
When comparing with 0, it is understood that the bias b is hidden.
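The extended-feature trick can be verified in one line. A sketch with hypothetical values of w, b, and x:

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 3.0
x = np.array([4.0, 1.0])

# Extended feature and weight vectors absorb the bias.
x_tilde = np.append(x, 1.0)  # [x1, x2, 1]
w_tilde = np.append(w, -b)   # [w1, w2, -b]

print(np.isclose(w @ x - b, w_tilde @ x_tilde))  # True: the two forms agree
```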
The Heaviside function and the 0-1 loss
Suppose we have a hyperplane w.
We can now predict the label ypred ∈ {0, 1} of a point xi as h(⟨w, xi⟩), where h is called the Heaviside function.
We can define the 0-1 loss as
L(w) = { 0  if h(⟨w, xi⟩) = yi
       { 1  if h(⟨w, xi⟩) ≠ yi
Figure: Heaviside function
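A direct implementation of this predictor and loss (the toy data set is hypothetical; the bias is assumed absorbed into w as on the previous slide):

```python
import numpy as np

def heaviside(z):
    """h(z) = 1 if z >= 0 else 0."""
    return (z >= 0).astype(int)

def zero_one_loss(w, X, y):
    """Average 0-1 loss of hyperplane w on points X with labels y."""
    y_pred = heaviside(X @ w)
    return np.mean(y_pred != y)

X = np.array([[2.0, 1.0], [-1.0, -2.0], [1.0, 3.0], [-2.0, 1.0]])
y = np.array([1, 0, 1, 0])
w = np.array([1.0, 0.0])  # hyperplane x1 = 0

print(zero_one_loss(w, X, y))  # 0.0: w classifies all four points correctly
```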
Non-separable data
The data might not always be separable by a hyperplane, i.e. there may not be any such w*.
Even if the data is separable, finding the w which minimizes the 0-1 loss is difficult.
The 0-1 loss is not differentiable.
Figure: non-separable data
Figure courtesy: https://appliedmachinelearning.blog/2017/03/09/understanding-support-vector-machines-a-primer/
The sigmoid function
The sigmoid function:
σ(x) = 1 / (1 + exp(−x))
The t-sigmoid function:
σt(x) = 1 / (1 + exp(−tx))
σt approaches the Heaviside function as t becomes large.
Figure: t-sigmoid
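The limiting behaviour is easy to observe numerically. A sketch (the value t = 50 is just an illustrative choice):

```python
import numpy as np

def sigmoid_t(x, t=1.0):
    """t-sigmoid: 1 / (1 + exp(-t*x)); t = 1 gives the ordinary sigmoid."""
    return 1.0 / (1.0 + np.exp(-t * x))

print(sigmoid_t(0.0))           # 0.5: the sigmoid is centered at 0
print(sigmoid_t(2.0, t=50.0))   # ~1.0: large t approaches the Heaviside step
print(sigmoid_t(-2.0, t=50.0))  # ~0.0
```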
Logistic Regression
Assume yi takes label 1 with probability πi, i.e. Pr(yi = 1) = πi,
and Pr(yi = 0) = 1 − πi, where
πi = σ(⟨w, xi⟩) = 1 / (1 + exp(−⟨w, xi⟩))
Likelihood and log-likelihood:
p(yi | πi) = πi^yi (1 − πi)^(1−yi)
log p(yi | πi) = yi log πi + (1 − yi) log(1 − πi)
Figure: sigmoid function
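The Bernoulli log-likelihood rewards confident correct predictions and heavily penalizes confident wrong ones. A sketch (the probabilities 0.99 and 0.01 are illustrative):

```python
import numpy as np

def log_likelihood(y, pi):
    """Bernoulli log-likelihood: y*log(pi) + (1-y)*log(1-pi)."""
    return y * np.log(pi) + (1 - y) * np.log(1 - pi)

# A confident, correct prediction gives a log-likelihood near 0 ...
print(log_likelihood(1, 0.99))  # ~ -0.01
# ... while a confident, wrong prediction is heavily penalized.
print(log_likelihood(1, 0.01))  # ~ -4.6
```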
The negative log-likelihood and the cross-entropy loss

p(y | w) = ∏_{i=1}^{n} p(yi | πi)

NLL(w) = − log p(y | w) = − Σ_{i=1}^{n} log p(yi | πi)
       = − Σ_{i=1}^{n} [ yi log(σ(⟨w, xi⟩)) + (1 − yi) log(1 − σ(⟨w, xi⟩)) ]

wMLE = arg min_w NLL(w)

wMLE has no closed-form solution.
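The NLL itself is straightforward to compute even though its minimizer has no closed form. A sketch on a hypothetical toy data set:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Negative log-likelihood (cross-entropy loss) of logistic regression."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 2.0], [-1.0, -1.0], [2.0, 0.5]])
y = np.array([1, 0, 1])

# With w = 0 every prediction is 0.5, so NLL = n * log(2).
print(nll(np.zeros(2), X, y))  # ~2.08
# A w aligned with the labels gives a lower loss.
print(nll(np.array([2.0, 2.0]), X, y) < nll(np.zeros(2), X, y))  # True
```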
Visualizing cross-entropy loss
Figure: σt(·)
Figure: − log(σt(·))
Figure: − log(1 − σt(·))
Convex Functions
Figure: Convex and Non-convex functions
A convex function always lies above its tangent at any point:
∀x, y : f(y) ≥ f(x) + (y − x)⊤∇f(x)
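The first-order condition can be spot-checked numerically. A sketch using f(x) = x², a simple convex function (the choice of f and the random sampling are illustrative, not from the slides):

```python
import numpy as np

# First-order convexity check for f(x) = x^2:
# f(y) >= f(x) + (y - x) * f'(x) must hold for all x, y.
f = lambda x: x ** 2
grad = lambda x: 2 * x

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=100), rng.normal(size=100)
print(np.all(f(ys) >= f(xs) + (ys - xs) * grad(xs)))  # True
```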
Cross-entropy is a convex function
− log(σt(x)) and − log(1 − σt(x)) are convex.
The sum of two convex functions is convex.
The Gradient Descent Algorithm
Initialize w0 ← 0.
Update wt+1 ← wt − η ∇NLL(wt).
The step length η needs to be tuned.
GD on a convex function converges to a global optimum.
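The update rule can be sketched on a simple convex function before applying it to the NLL. Here f(w) = (w − 3)² is an illustrative choice with known minimum w* = 3:

```python
# Gradient descent on the convex function f(w) = (w - 3)^2.
grad = lambda w: 2 * (w - 3.0)

w, eta = 0.0, 0.1
for _ in range(100):
    w = w - eta * grad(w)  # w_{t+1} <- w_t - eta * grad f(w_t)

print(round(w, 4))  # 3.0: GD converges to the global minimum
```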
Gradient of the Cross-Entropy Loss

∂NLL(w)/∂w = − Σ_{i=1}^{n} [ yi ∂/∂w log(σ(⟨w, xi⟩)) + (1 − yi) ∂/∂w log(1 − σ(⟨w, xi⟩)) ]
           = − Σ_{i=1}^{n} [ yi (1 − σ(⟨w, xi⟩)) − (1 − yi) σ(⟨w, xi⟩) ] xi
           = Σ_{i=1}^{n} [ σ(⟨w, xi⟩) − yi ] xi

using d/dx σ(x) = σ(x)(1 − σ(x)).

The gradient descent update is therefore:
wt+1 ← wt − η Σ_{i=1}^{n} [ σ(⟨wt, xi⟩) − yi ] xi
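Putting the pieces together, the update above gives a complete training loop. A minimal sketch on a hypothetical separable data set (the bias is absorbed via a constant-1 feature; η and the step count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, n_steps=500):
    """Gradient descent on the cross-entropy loss."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (sigmoid(X @ w) - y)  # sum_i (sigma(<w,xi>) - yi) xi
        w = w - eta * grad
    return w

# Separable toy data: label is 1 iff x1 > 0; second column is the constant 1.
X = np.array([[2.0, 1.0], [1.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]])
y = np.array([1, 1, 0, 0])

w = fit_logistic(X, y)
print((sigmoid(X @ w) > 0.5).astype(int))  # [1 1 0 0]: training data fit
```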
Questions?