
Introduction to Machine Learning: Logistic Regression

Bhaskar Mukhoty, Shivam Bansal

Indian Institute of Technology Kanpur, Summer School 2019

May 28, 2019


Lecture Outline

The classification problem

Intuition of hyperplanes

The Heaviside function and 0-1 loss

The sigmoid function and cross-entropy loss

Convex functions and Gradient Descent


The classification problem

Figure: Hyperplane Separation of Data

We assume separable data.

We want to learn a hyperplane that separates them.


What does a hyperplane look like?

In 2-dimensional space, a hyperplane is a line.

In $d$-dimensional space, it is a $(d-1)$-dimensional plane.

The hyperplane can be uniquely identified using a unit vector $w$ perpendicular to the plane and a point on the plane.

The hyperplane divides the $d$-dimensional space into two half-spaces.


Equation of a hyperplane

Suppose we are given a point $r = [r_1, r_2, \ldots, r_d]^\top$ on the hyperplane and a vector $w = [w_1, w_2, \ldots, w_d]^\top$ perpendicular to the plane.

The direction of $w$ is called the direction of the hyperplane.

Any point $x = [x_1, x_2, \ldots, x_d]^\top$ on the hyperplane must satisfy $w \perp (x - r)$.

We have $w^\top (x - r) = 0$, so

$$w^\top x = w^\top r = w_1 r_1 + w_2 r_2 + \cdots + w_d r_d = b$$

That is, for all $x$ in the plane $P$, we have $w^\top x = b$; the constant $b$ is called the bias.

Figure: Point on a hyperplane P
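As a quick numeric sanity check of this derivation (a minimal sketch; the specific $w$, $r$, and $v$ below are made-up illustrations, not from the slides):

```python
import numpy as np

# Made-up normal vector w and point r defining the hyperplane w^T x = b in 2D.
w = np.array([1.0, 2.0])
r = np.array([3.0, 0.0])

b = w @ r  # bias: b = w^T r = 1*3 + 2*0 = 3

# Moving from r along any direction v perpendicular to w stays on the plane.
v = np.array([-2.0, 1.0])   # w @ v == 0, so v lies in the plane
x = r + 5.0 * v             # another point on the same hyperplane
print(w @ x, b)             # prints 3.0 3.0, confirming w^T x = b
```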


Characterization of half-spaces using the dot product

Points on the hyperplane satisfy $w^\top x = b$.

Points in the half-space that lies in the direction of the hyperplane satisfy $w^\top (x - r) > 0$, i.e., $w^\top x > b$.

Points in the half-space that lies opposite to the direction of $w$ have $w^\top x < b$.

Figure: Point above ahyperplane P
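A small NumPy sketch of this sign test (the $w$, $b$, and sample points are illustrative assumptions):

```python
import numpy as np

w = np.array([1.0, 2.0])  # assumed direction of the hyperplane
b = 3.0                   # assumed bias, so the plane is w^T x = 3

def side(x):
    """Classify x by the sign of w^T x - b."""
    s = w @ x - b
    if s > 0:
        return "half-space in the direction of w (w^T x > b)"
    if s < 0:
        return "half-space opposite to w (w^T x < b)"
    return "on the hyperplane (w^T x = b)"

print(side(np.array([5.0, 5.0])))  # w^T x = 15 > 3
print(side(np.array([0.0, 0.0])))  # w^T x = 0 < 3
print(side(np.array([1.0, 1.0])))  # w^T x = 3
```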


The bias is often hidden in an extended feature

If we write $\tilde{x} = [x_1, x_2, \ldots, x_d, 1]^\top$

and $\tilde{w} = [w_1, w_2, \ldots, w_d, -b]^\top$,

then $w^\top x - b = 0$ is the same as $\tilde{w}^\top \tilde{x} = 0$.

Instead of searching for a pair $\{w, b\}$, we search only for $\tilde{w}$.

Instead of characterizing $w^\top x > b$, we characterize $\tilde{w}^\top \tilde{x} > 0$.

However, instead of writing $\tilde{w}^\top \tilde{x}$, people often write $w^\top x$ and compare it with 0. When the comparison is with 0, it is understood that the bias $b$ is hidden.
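A sketch of this extension in NumPy (the data matrix, weights, and bias are assumptions for illustration):

```python
import numpy as np

X = np.array([[2.0, 1.0],   # assumed data, one point per row
              [0.5, 3.0]])
w = np.array([1.0, 2.0])    # assumed weights
b = 3.0                     # assumed bias

# Append a constant 1 to every point and fold -b into the weight vector.
X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])
w_tilde = np.append(w, -b)

# w^T x - b and w~^T x~ agree for every point.
print(X @ w - b)          # [1.  3.5]
print(X_tilde @ w_tilde)  # [1.  3.5]
```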


The Heaviside function and the 0-1 loss

Suppose we have a hyperplane $w$.

We can now predict the label $y_{\text{pred}} \in \{0, 1\}$ of a point $x_i$ as $h(\langle w, x_i \rangle)$, where $h$ is called the Heaviside function.

We can define the 0-1 loss as

$$L(w) = \begin{cases} 0 & h(\langle w, x_i \rangle) = y_i \\ 1 & h(\langle w, x_i \rangle) \neq y_i \end{cases}$$

Figure: Heaviside function
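A sketch of the prediction rule and the 0-1 loss averaged over a dataset (the toy data is an assumption; the slide's $L(w)$ is the per-point version):

```python
import numpy as np

def heaviside(z):
    """h(z) = 1 if z >= 0 else 0 (the value taken at exactly 0 is a convention)."""
    return (z >= 0).astype(float)

def zero_one_loss(w, X, y):
    """Fraction of points where h(<w, x_i>) != y_i."""
    return np.mean(heaviside(X @ w) != y)

# Toy data, assumed for illustration.
X = np.array([[1.0, 2.0], [-1.0, -1.0], [2.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([1.0, 1.0])
print(zero_one_loss(w, X, y))  # 0.0: this w classifies all three points correctly
```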


Non-separable data

The data might not always be separable by a hyperplane, i.e., there may not be any separating $w^*$.

Even if the data is separable, finding a $w$ that minimizes the 0-1 loss is difficult.

The 0-1 loss is not differentiable.

Figure: non-separable data

Figure courtesy: https://appliedmachinelearning.blog/2017/03/09/understanding-support-vector-machines-a-primer/


The sigmoid function

The sigmoid function:

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

The $t$-sigmoid function:

$$\sigma_t(x) = \frac{1}{1 + \exp(-tx)}$$

$\sigma_t$ approaches the Heaviside function as $t$ becomes large.

Figure: t-sigmoid
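A minimal sketch of $\sigma_t$; the sample inputs and values of $t$ are arbitrary choices to show the limiting behaviour:

```python
import numpy as np

def sigmoid_t(x, t=1.0):
    """sigma_t(x) = 1 / (1 + exp(-t*x)); t = 1 gives the ordinary sigmoid."""
    return 1.0 / (1.0 + np.exp(-t * x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for t in (1, 5, 50):
    print(t, np.round(sigmoid_t(x, t), 3))
# As t grows, the outputs snap to 0 or 1, approaching the Heaviside step.
```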


Logistic Regression

Assume $y_i$ takes label 1 with probability $\pi_i$, i.e., $\Pr(y_i = 1) = \pi_i$,

and $\Pr(y_i = 0) = 1 - \pi_i$, where

$$\pi_i = \sigma(\langle w, x_i \rangle) = \frac{1}{1 + \exp(-\langle w, x_i \rangle)}$$

Likelihood and log-likelihood:

$$p(y_i \mid \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$$

$$\log p(y_i \mid \pi_i) = y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i)$$

Figure: sigmoid function


The negative log-likelihood and cross-entropy loss

$$p(\mathbf{y} \mid w) = \prod_{i=1}^{n} p(y_i \mid \pi_i)$$

$$\mathrm{NLL}(w) = -\log p(\mathbf{y} \mid w) = -\sum_{i=1}^{n} \log p(y_i \mid \pi_i) = -\sum_{i=1}^{n} \left[ y_i \log \sigma(\langle w, x_i \rangle) + (1 - y_i) \log(1 - \sigma(\langle w, x_i \rangle)) \right]$$

$$w_{\mathrm{MLE}} = \arg\min_{w} \mathrm{NLL}(w)$$

No closed-form solution: $w_{\mathrm{MLE}}$ has no closed-form solution.
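A direct transcription of $\mathrm{NLL}(w)$ into NumPy (the clipping constant eps is my addition for numerical safety, not part of the slide's formula; the toy data is assumed):

```python
import numpy as np

def nll(w, X, y, eps=1e-12):
    """Cross-entropy / negative log-likelihood for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # pi_i = sigma(<w, x_i>)
    p = np.clip(p, eps, 1.0 - eps)       # keep the logs finite
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

X = np.array([[1.0, 1.0], [2.0, 1.0]])  # toy data
y = np.array([0.0, 1.0])
print(nll(np.zeros(2), X, y))           # 2 * log(2) ≈ 1.386 at w = 0
```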


Visualizing cross-entropy loss

Figure: $\sigma_t(\cdot)$, $-\log(\sigma_t(\cdot))$, and $-\log(1 - \sigma_t(\cdot))$ (three panels)


Convex Functions

Figure: Convex and Non-convex functions

A convex function always lies above its tangent at any point:

$$\forall x, y: \quad f(y) \geq f(x) + (y - x)^\top \nabla f(x)$$
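A numeric spot-check of this condition for one of the cross-entropy terms, $f(x) = -\log(\sigma(x))$, whose derivative is $\sigma(x) - 1$ (a sketch, not a proof; the sampled points are arbitrary):

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
f = lambda x: -np.log(sigma(x))      # one term of the cross-entropy loss
fprime = lambda x: sigma(x) - 1.0    # d/dx [-log(sigma(x))]

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=100), rng.normal(size=100)
# First-order condition: f lies above all of its tangents.
assert np.all(f(ys) >= f(xs) + (ys - xs) * fprime(xs) - 1e-9)
print("first-order convexity condition holds at all sampled pairs")
```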


Cross-entropy is a convex function

$-\log(\sigma_t(x))$ and $-\log(1 - \sigma_t(x))$ are convex.

The sum of two convex functions is convex, so $\mathrm{NLL}(w)$, a sum of such terms, is convex.


The Gradient Descent Algorithm

Initialize $w_0 \leftarrow 0$.

Update: $w_{t+1} \leftarrow w_t - \eta \nabla \mathrm{NLL}(w_t)$.

The step length $\eta$ needs to be tuned.

GD on a convex function converges to a global optimum.
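The loop in code, as a minimal sketch (grad stands for any gradient oracle; here it is demonstrated on an arbitrary convex quadratic, and eta and steps are assumptions that would need tuning in practice):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Repeatedly step against the gradient returned by the callable `grad`."""
    w = w0.copy()
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Demo on f(w) = ||w - 1||^2, whose gradient is 2(w - 1).
w_star = gradient_descent(lambda w: 2.0 * (w - 1.0), np.zeros(3))
print(w_star)  # close to [1. 1. 1.], the global optimum
```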


Gradient of Cross-Entropy Loss

$$\frac{\partial \mathrm{NLL}(w)}{\partial w} = -\sum_{i=1}^{n} \left[ y_i \frac{\partial \log \sigma(\langle w, x_i \rangle)}{\partial w} + (1 - y_i) \frac{\partial \log(1 - \sigma(\langle w, x_i \rangle))}{\partial w} \right]$$

$$= -\sum_{i=1}^{n} \left[ y_i (1 - \sigma(\langle w, x_i \rangle)) - (1 - y_i) \sigma(\langle w, x_i \rangle) \right] x_i$$

$$= \sum_{i=1}^{n} \left[ \sigma(\langle w, x_i \rangle) - y_i \right] x_i$$

using $\frac{d}{dx} \sigma(x) = \sigma(x)(1 - \sigma(x))$.

The gradient descent update becomes

$$w_{t+1} \leftarrow w_t - \eta \sum_{i=1}^{n} \left[ \sigma(\langle w_t, x_i \rangle) - y_i \right] x_i$$
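Putting the gradient and the update rule together gives a complete training loop; a sketch under assumed hyperparameters and a made-up separable toy set (the bias is hidden in an extended feature, as on the earlier slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.1, steps=1000):
    """Gradient descent on NLL with gradient sum_i (sigma(<w, x_i>) - y_i) x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * (X.T @ (sigmoid(X @ w) - y))  # X.T @ (...) is the sum above
    return w

# Toy separable data, assumed for illustration; append a constant-1 column.
X = np.array([[0.0, 0.5], [1.0, 1.5], [2.0, 2.5], [3.0, 3.5]])
X = np.hstack([X, np.ones((len(X), 1))])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = train_logistic(X, y)
print(np.round(sigmoid(X @ w), 2))  # probabilities close to 0, 0, 1, 1
```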


Questions?
