
Linear Regression & Gradient Descent

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.

Regression

Given:
– Data $X = \left\{ x^{(1)}, \ldots, x^{(n)} \right\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \left\{ y^{(1)}, \ldots, y^{(n)} \right\}$ where $y^{(i)} \in \mathbb{R}$

[Figure: September Arctic sea ice extent (1,000,000 sq km) vs. year, 1970–2020, with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013).]

Linear Regression

• Hypothesis:
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j \qquad (\text{assume } x_0 = 1)$$

• Fit the model by minimizing the sum of squared errors

[Figures courtesy of Greg Shakhnarovich]

Least Squares Linear Regression

• Cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

• Fit by solving $\min_\theta \; J(\theta)$
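As a concrete reference, here is a minimal NumPy sketch of the hypothesis and the least squares cost above; the array layout (a leading column of ones for $x_0 = 1$) and the variable names are illustrative assumptions, not part of the slides.

```python
import numpy as np

def h(theta, X):
    # Hypothesis h_theta(x) = sum_{j=0}^{d} theta_j * x_j, assuming X already
    # contains a leading column of ones so that x_0 = 1.
    return X @ theta

def cost(theta, X, y):
    # J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2
    n = len(y)
    residuals = h(theta, X) - y
    return np.sum(residuals ** 2) / (2 * n)
```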

Intuition Behind the Cost Function

(For fixed $\theta$, $h_\theta(x)$ is a function of $x$; $J(\theta)$ is a function of the parameters.)

[Figures: several choices of $\theta$ shown as fitted lines over the data, side by side with the corresponding values of $J(\theta)$. Slides by Andrew Ng]

Basic Search Procedure

• Choose an initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$

[Figure: surface plot of $J(\theta_0, \theta_1)$ over $\theta_0, \theta_1$. Figure by Andrew Ng]


Since the least squares objective function is convex, we don't need to worry about local minima in linear regression.

Gradient Descent

• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
where $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$

[Figure: $J(\theta)$ plotted against $\theta$, showing steps descending toward the minimum.]

Gradient Descent

• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

For linear regression:
$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \, \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
&= \frac{\partial}{\partial \theta_j} \, \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^{\!2} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$
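A quick way to sanity-check this derivation is to compare the analytic gradient against finite differences of $J(\theta)$. The sketch below does this on toy data; the helper names and random data are illustrative assumptions, not from the slides.

```python
import numpy as np

def h(theta, X):
    # Hypothesis, assuming X has a leading column of ones (x_0 = 1).
    return X @ theta

def J(theta, X, y):
    # Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2
    n = len(y)
    return np.sum((h(theta, X) - y) ** 2) / (2 * n)

def grad_J(theta, X, y):
    # Analytic gradient: dJ/dtheta_j = (1/n) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    n = len(y)
    return X.T @ (h(theta, X) - y) / n

# Toy data: n = 5 examples, d = 2 features plus the constant x_0 = 1.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
y = rng.normal(size=5)
theta = rng.normal(size=3)

# Central finite differences of J along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (J(theta + eps * e, X, y) - J(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(numeric, grad_J(theta, X, y)))   # expect True
```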

Gradient Descent for Linear Regression

• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$

• To achieve a simultaneous update:
  – At the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$
  – Use this stored value in the update step loop

• Assume convergence when $\left\lVert \theta_{\text{new}} - \theta_{\text{old}} \right\rVert_2 < \epsilon$, where the L2 norm is
$$\lVert v \rVert_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \ldots + v_{|v|}^2}$$
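Putting the pieces together, here is a minimal sketch of batch gradient descent for linear regression with the simultaneous update and the $\lVert \theta_{\text{new}} - \theta_{\text{old}} \rVert_2 < \epsilon$ convergence test from this slide. The function name, toy data, and default hyperparameters are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10_000):
    """X is n x (d+1) with a leading column of ones (x_0 = 1); y has length n."""
    n, _ = X.shape
    theta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        # Compute all predictions first, so every theta_j is updated using the
        # same h_theta(x^(i)) values -- the "simultaneous update" from the slide.
        predictions = X @ theta
        gradient = X.T @ (predictions - y) / n
        theta_new = theta - alpha * gradient
        if np.linalg.norm(theta_new - theta) < eps:   # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta

# Toy usage: y is roughly 3 + 2x plus noise, so theta should approach [3, 2].
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=50)
X = np.column_stack([np.ones(50), x])
print(gradient_descent(X, y))
```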

Gradient Descent (illustration)

(For fixed $\theta$, $h_\theta(x)$ is a function of $x$; $J(\theta)$ is a function of the parameters.)

[Figures: successive gradient descent iterations on the housing data, starting from $h(x) = -900 - 0.1x$, showing the fitted line alongside the corresponding point on the cost contours at each step. Slides by Andrew Ng]

Choosing α

• α too small: slow convergence
• α too large: increasing value for $J(\theta)$
  – May overshoot the minimum
  – May fail to converge
  – May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
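The monitoring advice above can be turned into a few lines of code: run a handful of iterations for several candidate values of α and check whether $J(\theta)$ decreases at every step. The specific α values and toy data below are illustrative assumptions.

```python
import numpy as np

def J(theta, X, y):
    # Least squares cost, used only for monitoring progress.
    n = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * n)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.uniform(0, 2, size=30)])
y = 1.0 + 4.0 * X[:, 1] + rng.normal(scale=0.3, size=30)

for alpha in (0.01, 0.05, 1.5):
    theta = np.zeros(2)
    costs = []
    for _ in range(20):
        theta = theta - alpha * X.T @ (X @ theta - y) / len(y)
        costs.append(J(theta, X, y))
    # A healthy learning rate gives a monotonically decreasing cost;
    # an overly large one (here 1.5) makes J(theta) grow.
    decreasing = all(a > b for a, b in zip(costs, costs[1:]))
    print(f"alpha={alpha}: J decreased at every iteration: {decreasing}")
```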

Extending Linear Regression to More Complex Models

• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs
    • e.g., log, exp, square root, square, etc.
  – Polynomial transformations
    • example: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3$
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables
    • example: $x_3 = x_1 \times x_2$

This allows the use of linear regression techniques to fit non-linear datasets (see the feature-construction sketch below).
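To make the transformations above concrete, here is a small sketch that builds polynomial and interaction features so that ordinary linear regression can be run on them; the helper names and chosen degrees are illustrative assumptions.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a 1-D input x to columns [1, x, x^2, ..., x^degree]."""
    return np.vstack([x ** j for j in range(degree + 1)]).T

def with_interaction(x1, x2):
    """Stack a bias column, x1, x2, and the interaction x3 = x1 * x2."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])

x = np.linspace(0, 2, 5)
print(polynomial_features(x, 3).shape)          # (5, 4): columns 1, x, x^2, x^3
print(with_interaction(x, np.log1p(x)).shape)   # (5, 4): bias, x1, x2, x1*x2
```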

Linear Basis Function Models

• Generally,
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
where $\phi_j(x)$ is a basis function
• Typically, $\phi_0(x) = 1$ so that $\theta_0$ acts as a bias
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models

• Polynomial basis functions: $\phi_j(x) = x^j$
  – These are global; a small change in x affects all basis functions
• Gaussian basis functions: $\phi_j(x) = \exp\!\left( -\dfrac{(x - \mu_j)^2}{2s^2} \right)$
  – These are local; a small change in x only affects nearby basis functions. μj and s control location and scale (width).

Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models

• Sigmoidal basis functions: $\phi_j(x) = \sigma\!\left( \dfrac{x - \mu_j}{s} \right)$, where $\sigma(a) = \dfrac{1}{1 + e^{-a}}$
  – These are also local; a small change in x only affects nearby basis functions. μj and s control location and scale (slope).

Based on slide by Christopher Bishop (PRML)
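For concreteness, here is a sketch of Gaussian and sigmoidal basis expansions of a 1-D input; the centers μj and scale s used below are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per center mu_j."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))

x = np.linspace(0, 1, 100)
mu = np.linspace(0, 1, 9)                                  # 9 evenly spaced centers
Phi = np.hstack([np.ones((100, 1)), gaussian_basis(x, mu, s=0.1)])
print(Phi.shape)                                           # (100, 10): bias column + 9 local features
```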

Example of Fitting a Polynomial Curve with a Linear Model

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$$

Quality of Fit

Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ...but fails to generalize to new examples

[Figures: price vs. size fits illustrating underfitting (high bias), a correct fit, and overfitting (high variance). Based on example by Andrew Ng]

Regularization

• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of $\theta_j$
  – Can incorporate into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model selection)

Regularization

• Linear regression objective function:
$$J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$$
  – $\lambda$ is the regularization parameter ($\lambda \geq 0$)
  – No regularization on $\theta_0$!
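As a concrete counterpart to the objective above, here is a minimal sketch of the regularized cost with $\theta_0$ excluded from the penalty, as noted on the slide; the function and variable names are illustrative assumptions.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = fit term + (lambda/2) * sum_{j>=1} theta_j^2 (theta_0 not penalized)."""
    n = len(y)
    fit = np.sum((X @ theta - y) ** 2) / (2 * n)    # model fit to data
    penalty = (lam / 2) * np.sum(theta[1:] ** 2)    # regularization, skipping theta_0
    return fit + penalty
```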

Understanding Regularization

• Note that
$$\sum_{j=1}^{d} \theta_j^2 = \lVert \theta_{1:d} \rVert_2^2$$
  – This is the squared magnitude of the feature coefficient vector!
• We can also think of this as:
$$\sum_{j=1}^{d} (\theta_j - 0)^2 = \lVert \theta_{1:d} - \vec{0} \rVert_2^2$$
• L2 regularization pulls coefficients toward 0


Understanding Regularization

• What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)?
  – The penalty drives $\theta_1, \ldots, \theta_d$ to $\approx 0$, leaving only the constant hypothesis $h_\theta(x) \approx \theta_0$.

[Figure: price vs. size with the higher-order coefficients shrunk to 0, so the fitted curve flattens to a horizontal line. Based on example by Andrew Ng]

Regularized Linear Regression

• Cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

• Fit by solving $\min_\theta \; J(\theta)$

• Gradient update (from $\frac{\partial}{\partial \theta_0} J(\theta)$ and $\frac{\partial}{\partial \theta_j} J(\theta)$; the $\lambda\theta_j$ term comes from the regularization):
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$$

Regularized Linear Regression

• We can rewrite the gradient step for $\theta_j$ as:
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$$
$$\theta_j \leftarrow \theta_j \,(1 - \alpha\lambda) \; - \; \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
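Here is a minimal sketch of one regularized gradient descent step in this rewritten "weight decay" form, keeping $\theta_0$ unregularized; the function name and array layout are assumptions for illustration.

```python
import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    """One step: theta_j <- theta_j*(1 - alpha*lam) - alpha*(1/n)*sum_i (h(x^(i)) - y^(i))*x_j^(i),
    with theta_0 left unshrunk (no regularization on the bias)."""
    n = len(y)
    grad_fit = X.T @ (X @ theta - y) / n     # (1/n) * sum_i (h_theta(x^(i)) - y^(i)) x^(i)
    decay = np.ones_like(theta)
    decay[1:] = 1.0 - alpha * lam            # shrink every coefficient except theta_0
    return decay * theta - alpha * grad_fit
```

Writing the update this way makes it explicit that L2 regularization multiplicatively shrinks each coefficient by $(1 - \alpha\lambda)$ on every iteration before applying the usual data-fit correction.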