Introduction to Machine Learning: Logistic Regression
Bhaskar Mukhoty, Shivam Bansal
Indian Institute of Technology Kanpur · Summer School 2019
May 28, 2019
Bhaskar Mukhoty, Shivam Bansal (Indian Institute of Technology Kanpur, Summer School 2019) · Introduction to Machine Learning · May 28, 2019
Lecture Outline
The classification problem
Intuition of hyperplanes
The Heaviside function and 0-1 loss
The sigmoid function and cross-entropy loss
Convex functions and gradient descent
The classification problem
Figure: Hyperplane Separation of Data
We assume the data is linearly separable.
We want to learn a hyperplane that separates the two classes.
What does a hyperplane look like?
In 2-dimensional space, a hyperplane is a line.
In d-dimensional space, it is a (d − 1)-dimensional plane.
A hyperplane can be uniquely identified using a unit vector w perpendicular to the plane and a point on the plane.
The hyperplane divides the d-dimensional space into two half-spaces.
Equation of a hyperplane
Suppose we are given a point r = [r1, r2, · · · , rd]⊤ on the hyperplane, and a vector w = [w1, w2, · · · , wd]⊤ perpendicular to the plane.
The direction of w is called the direction of the hyperplane.
Any point x = [x1, x2, · · · , xd]⊤ on the hyperplane must satisfy w ⊥ (x − r).
Hence we have w⊤(x − r) = 0, i.e.
w⊤x = w⊤r = w1r1 + w2r2 + · · · + wdrd = b
That is, for every x in the plane P we have w⊤x = b. Here b is called the bias.
Figure: Point on a hyperplane P
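The relation above can be checked numerically. A minimal sketch (the specific w and r are hypothetical values, not from the slides): compute b = w⊤r, then verify that moving from r along any direction orthogonal to w stays on the plane.

```python
import numpy as np

# Hyperplane in 2-D: unit direction w and a point r on the plane.
w = np.array([1.0, 1.0]) / np.sqrt(2.0)
r = np.array([2.0, 0.0])

b = w @ r  # bias: w^T r

# Any point r + s*v with v orthogonal to w stays on the plane.
v = np.array([1.0, -1.0])   # v is perpendicular to w
x = r + 3.0 * v             # another point on the same hyperplane

print(np.isclose(w @ x, b))  # True: w^T x = b holds on the plane
```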
Characterization of half-spaces using the dot product
Points on the hyperplane satisfy w⊤x = b.
Points in the half-space that lies in the direction of w satisfy w⊤(x − r) > 0, i.e. w⊤x > b.
Points in the half-space opposite to the direction of w satisfy w⊤x < b.
Figure: Point above a hyperplane P
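The sign of w⊤x − b decides which half-space a point lies in. A small sketch (w, b, and the test points are hypothetical):

```python
import numpy as np

w = np.array([1.0, 0.0])
b = 2.0  # hyperplane: w^T x = b, i.e. the vertical line x1 = 2

def side(x):
    """+1 in the half-space along w, -1 in the opposite one, 0 on the plane."""
    return int(np.sign(w @ x - b))

print(side(np.array([3.0, 5.0])))  # 1:  w^T x = 3 > b
print(side(np.array([0.0, 0.0])))  # -1: w^T x = 0 < b
print(side(np.array([2.0, 7.0])))  # 0:  the point lies on the plane
```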
The bias is often hidden in an extended feature
If we write x̃ = [x1, x2, · · · , xd, 1]⊤
and w̃ = [w1, w2, · · · , wd, −b]⊤,
then w⊤x − b = 0 is the same as w̃⊤x̃ = 0.
Instead of searching for a pair {w, b}, we only search for w̃.
Instead of the characterization w⊤x > b, we characterize by w̃⊤x̃ > 0.
However, instead of writing w̃⊤x̃, people often write w⊤x and compare it with 0.
When comparing with 0, it is understood that the bias b is hidden.
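The extended-feature trick can be verified in one line. A sketch with hypothetical values of w, b, and x:

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 3.0
x = np.array([4.0, 1.0])

# Extended feature and weight vectors absorb the bias.
x_tilde = np.append(x, 1.0)  # [x1, x2, 1]
w_tilde = np.append(w, -b)   # [w1, w2, -b]

print(np.isclose(w @ x - b, w_tilde @ x_tilde))  # True: the two forms agree
```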
The Heaviside function and the 0-1 loss
Suppose we have a hyperplane w.
We can now predict the label ypred ∈ {0, 1} of a point xi as h(⟨w, xi⟩), where h is called the Heaviside function.
We can define the 0-1 loss as
L(w) = { 0  if h(⟨w, xi⟩) = yi
       { 1  if h(⟨w, xi⟩) ≠ yi
Figure: Heaviside function
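A direct implementation of this predictor and loss (the toy data set is hypothetical; the bias is assumed absorbed into w as on the previous slide):

```python
import numpy as np

def heaviside(z):
    """h(z) = 1 if z >= 0 else 0."""
    return (z >= 0).astype(int)

def zero_one_loss(w, X, y):
    """Average 0-1 loss of hyperplane w on points X with labels y."""
    y_pred = heaviside(X @ w)
    return np.mean(y_pred != y)

X = np.array([[2.0, 1.0], [-1.0, -2.0], [1.0, 3.0], [-2.0, 1.0]])
y = np.array([1, 0, 1, 0])
w = np.array([1.0, 0.0])  # hyperplane x1 = 0

print(zero_one_loss(w, X, y))  # 0.0: w classifies all four points correctly
```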
Non-separable data
The data might not always be separable by a hyperplane, i.e. there may not be any such w*.
Even if the data is separable, finding the w which minimizes the 0-1 loss is difficult.
The 0-1 loss is not differentiable.
Figure: non-separable data
Figure courtesy: https://appliedmachinelearning.blog/2017/03/09/understanding-support-vector-machines-a-primer/
The sigmoid function
The sigmoid function:
σ(x) = 1 / (1 + exp(−x))
The t-sigmoid function:
σt(x) = 1 / (1 + exp(−tx))
σt approaches the Heaviside function as t becomes large.
Figure: t-sigmoid
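The limiting behaviour is easy to observe numerically. A sketch (the value t = 50 is just an illustrative choice):

```python
import numpy as np

def sigmoid_t(x, t=1.0):
    """t-sigmoid: 1 / (1 + exp(-t*x)); t = 1 gives the ordinary sigmoid."""
    return 1.0 / (1.0 + np.exp(-t * x))

print(sigmoid_t(0.0))           # 0.5: the sigmoid is centered at 0
print(sigmoid_t(2.0, t=50.0))   # ~1.0: large t approaches the Heaviside step
print(sigmoid_t(-2.0, t=50.0))  # ~0.0
```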
Logistic Regression
Assume yi takes label 1 with probability πi, i.e. Pr(yi = 1) = πi,
and Pr(yi = 0) = 1 − πi, where
πi = σ(⟨w, xi⟩) = 1 / (1 + exp(−⟨w, xi⟩))
Likelihood and log-likelihood:
p(yi | πi) = πi^yi (1 − πi)^(1−yi)
log p(yi | πi) = yi log πi + (1 − yi) log(1 − πi)
Figure: sigmoid function
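The Bernoulli log-likelihood rewards confident correct predictions and heavily penalizes confident wrong ones. A sketch (the probabilities 0.99 and 0.01 are illustrative):

```python
import numpy as np

def log_likelihood(y, pi):
    """Bernoulli log-likelihood: y*log(pi) + (1-y)*log(1-pi)."""
    return y * np.log(pi) + (1 - y) * np.log(1 - pi)

# A confident, correct prediction gives a log-likelihood near 0 ...
print(log_likelihood(1, 0.99))  # ~ -0.01
# ... while a confident, wrong prediction is heavily penalized.
print(log_likelihood(1, 0.01))  # ~ -4.6
```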
The negative log-likelihood and the cross-entropy loss

p(y | w) = ∏_{i=1}^{n} p(yi | πi)

NLL(w) = − log p(y | w) = − Σ_{i=1}^{n} log p(yi | πi)
       = − Σ_{i=1}^{n} [ yi log(σ(⟨w, xi⟩)) + (1 − yi) log(1 − σ(⟨w, xi⟩)) ]

wMLE = arg min_w NLL(w)

wMLE has no closed-form solution.
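The NLL itself is straightforward to compute even though its minimizer has no closed form. A sketch on a hypothetical toy data set:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Negative log-likelihood (cross-entropy loss) of logistic regression."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 2.0], [-1.0, -1.0], [2.0, 0.5]])
y = np.array([1, 0, 1])

# With w = 0 every prediction is 0.5, so NLL = n * log(2).
print(nll(np.zeros(2), X, y))  # ~2.08
# A w aligned with the labels gives a lower loss.
print(nll(np.array([2.0, 2.0]), X, y) < nll(np.zeros(2), X, y))  # True
```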
Visualizing cross-entropy loss
Figure: σt(·)
Figure: − log(σt(·))
Figure: − log(1 − σt(·))
Convex Functions
Figure: Convex and Non-convex functions
A convex function always lies above its tangent at any point:
∀x, y : f(y) ≥ f(x) + (y − x)⊤∇f(x)
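The first-order condition can be spot-checked numerically. A sketch using f(x) = x², a simple convex function (the choice of f and the random sampling are illustrative, not from the slides):

```python
import numpy as np

# First-order convexity check for f(x) = x^2:
# f(y) >= f(x) + (y - x) * f'(x) must hold for all x, y.
f = lambda x: x ** 2
grad = lambda x: 2 * x

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=100), rng.normal(size=100)
print(np.all(f(ys) >= f(xs) + (ys - xs) * grad(xs)))  # True
```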
Cross-entropy is a convex function
− log(σt(x)) and − log(1 − σt(x)) are convex.
The sum of two convex functions is convex.
The Gradient Descent Algorithm
Initialize w0 ← 0.
Update wt+1 ← wt − η ∇NLL(wt).
The step length η needs to be tuned.
GD on a convex function converges to a global optimum.
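The update rule can be sketched on a simple convex function before applying it to the NLL. Here f(w) = (w − 3)² is an illustrative choice with known minimum w* = 3:

```python
# Gradient descent on the convex function f(w) = (w - 3)^2.
grad = lambda w: 2 * (w - 3.0)

w, eta = 0.0, 0.1
for _ in range(100):
    w = w - eta * grad(w)  # w_{t+1} <- w_t - eta * grad f(w_t)

print(round(w, 4))  # 3.0: GD converges to the global minimum
```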
Gradient of the Cross-Entropy Loss

∂NLL(w)/∂w = − Σ_{i=1}^{n} [ yi ∂/∂w log(σ(⟨w, xi⟩)) + (1 − yi) ∂/∂w log(1 − σ(⟨w, xi⟩)) ]
           = − Σ_{i=1}^{n} [ yi (1 − σ(⟨w, xi⟩)) − (1 − yi) σ(⟨w, xi⟩) ] xi
           = Σ_{i=1}^{n} [ σ(⟨w, xi⟩) − yi ] xi

using d/dx σ(x) = σ(x)(1 − σ(x)).

The gradient descent update is therefore:
wt+1 ← wt − η Σ_{i=1}^{n} [ σ(⟨wt, xi⟩) − yi ] xi
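Putting the pieces together, the update above gives a complete training loop. A minimal sketch on a hypothetical separable data set (the bias is absorbed via a constant-1 feature; η and the step count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, n_steps=500):
    """Gradient descent on the cross-entropy loss."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (sigmoid(X @ w) - y)  # sum_i (sigma(<w,xi>) - yi) xi
        w = w - eta * grad
    return w

# Separable toy data: label is 1 iff x1 > 0; second column is the constant 1.
X = np.array([[2.0, 1.0], [1.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]])
y = np.array([1, 1, 0, 0])

w = fit_logistic(X, y)
print((sigmoid(X @ w) > 0.5).astype(int))  # [1 1 0 0]: training data fit
```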
Questions?