
CPSC 340: Machine Learning and Data Mining

Least Squares (Fall 2020)

Admin
• Assignment 3 is up:
– Start early; this is usually the longest assignment.

• We’re going to start using calculus and linear algebra a lot.
– You should start reviewing these ASAP if you are rusty.
– A review of relevant calculus concepts is here.
– A review of relevant linear algebra concepts is here.

Supervised Learning Round 2: Regression
• We’re going to revisit supervised learning:

• Previously, we considered classification:
– We assumed yi was discrete: yi = ‘spam’ or yi = ‘not spam’.

• Now we’re going to consider regression:
– We allow yi to be numerical: yi = 10.34cm.

Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Does the number of lung cancer deaths change with the number of cigarettes?
– Does the number of skin cancer deaths change with latitude?

http://www.cvgs.k12.va.us:81/digstats/main/inferant/d_regrs.html
https://onlinecourses.science.psu.edu/stat501/node/11

Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Do people in big cities walk faster?
– Is the universe expanding, shrinking, or staying the same size?

http://hosting.astro.cornell.edu/academics/courses/astro201/hubbles_law.htm
https://www.nature.com/articles/259557a0.pdf

Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Does the number of gun deaths change with gun ownership?
– Does the number of violent crimes change with violent video games?

http://www.vox.com/2015/10/3/9444417/gun-violence-united-states-america
https://www.soundandvision.com/content/violence-and-video-games

Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Does a higher gender equality index lead to more women STEM grads?

• Note that we’re doing supervised learning:
– Trying to predict the value of 1 variable (the ‘yi’ values), instead of measuring correlation between 2.

• Supervised learning does not give causality:
– OK: “Higher index is correlated with lower grad %”.
– OK: “Higher index helps predict lower grad %”.
– BAD: “Higher index leads to lower grad %”.

• People/media get these confused all the time, be careful!
• There are lots of potential reasons for this correlation.

https://www.weforum.org/agenda/2018/02/does-gender-equality-result-in-fewer-female-stem-grads/

Handling Numerical Labels
• One way to handle numerical yi: discretize.
– E.g., for ‘age’ we could use {‘age ≤ 20’, ‘20 < age ≤ 30’, ‘age > 30’}.
– Now we can apply methods for classification to do regression.
– But coarse discretization loses resolution.
– And fine discretization requires lots of data.
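A minimal sketch of this discretization idea, assuming numpy and scikit-learn are available; the data and the bin edges are made up for illustration:

```python
# Hypothetical illustration: turn a numerical label into classes, then classify.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [2.5], [4.0], [5.5], [7.0]])  # one feature per example
y = np.array([18.0, 24.5, 28.0, 33.0, 41.0])       # numerical labels (e.g., age)

# Discretize y into {y <= 20, 20 < y <= 30, y > 30} (classes 0, 1, 2).
y_class = np.digitize(y, bins=[20.0, 30.0])

# Any classifier can now be trained on the coarse classes.
model = DecisionTreeClassifier().fit(X, y_class)
print(model.predict([[3.0]]))  # predicts a bin, not an exact number
```

Note that the prediction only identifies a bin, which is exactly the loss of resolution mentioned above.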

• There exist regression versions of classification methods:
– Regression trees, probabilistic models, non-parametric models.

• Today: one of the oldest, but still most popular/important methods:
– Linear regression based on squared error.
– Interpretable and the building block for more-complex methods.

Linear Regression in 1 Dimension
• Assume we only have 1 feature (d = 1):
– E.g., xi is the number of cigarettes and yi is the number of lung cancer deaths.

• Linear regression makes predictions ŷi using a linear function of xi:

ŷi = w xi

• The parameter ‘w’ is the weight or regression coefficient of xi.
– We’re temporarily ignoring the y-intercept.

• As xi changes, the slope ‘w’ affects the rate that ŷi increases/decreases:
– Positive ‘w’: ŷi increases as xi increases.
– Negative ‘w’: ŷi decreases as xi increases.

Aside: Terminology Woes
• Different fields use different terminology and symbols.
– Data points = objects = examples = rows = observations.
– Inputs = predictors = features = explanatory variables = regressors = independent variables = covariates = columns.
– Outputs = outcomes = targets = response variables = dependent variables (also called a “label” if it’s categorical).
– Regression coefficients = weights = parameters = betas.

• With linear regression, the symbols are inconsistent too:
– In ML, the data is X and y, and the weights are w.
– In statistics, the data is X and y, and the weights are β.
– In optimization, the data is A and b, and the weights are x.

Least Squares Objective
• Our linear model is given by:

ŷi = w xi

• So we make predictions for a new example x̃ by using:

ŷ = w x̃

• But we can’t use the same error as before:
– It is unlikely to find a line where ŷi = yi exactly for many points.

• This is due to noise, the relationship not being quite linear, or just floating-point issues.
– The “best” model may have |ŷi − yi| small but not exactly 0.

Least Squares Objective
• Instead of “exact yi”, we evaluate the “size” of the error in a prediction.
• The classic way is setting the slope ‘w’ to minimize the sum of squared errors:

f(w) = Σi (w xi − yi)²  (where Σi sums over the ‘n’ training examples)

• There are some justifications for this choice.
– A probabilistic interpretation is coming later in the course.

• But usually, it is done because it is easy to minimize.
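As a concrete illustration, here is a minimal sketch of this objective in Python, with made-up data (numpy assumed):

```python
# Sum of squared errors f(w) = sum_i (w*x_i - y_i)^2 for the 1D model.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # feature values (made up)
y = np.array([1.2, 1.9, 3.2, 3.9])  # numerical labels (made up)

def f(w):
    residuals = w * x - y           # prediction error on each example
    return np.sum(residuals ** 2)   # squared errors, summed over examples

print(f(0.5), f(1.0), f(1.5))       # f is smallest near the best slope
```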

Minimizing a Differentiable Function
• Math 101 approach to minimizing a differentiable function ‘f’:
1. Take the derivative of ‘f’.
2. Find points ‘w’ where the derivative f’(w) is equal to 0.
3. Choose the smallest one (and check that f’’(w) is positive).

Digression: Multiplying by a Positive Constant
• Note that this problem:

minimize f(w) = Σi (w xi − yi)²

• Has the same set of minimizers as this problem:

minimize (1/2) Σi (w xi − yi)²

• And these also have the same minimizers:

minimize α Σi (w xi − yi)², for any constant α > 0

• I can multiply ‘f’ by any positive constant and not change the solution.
– The derivative will still be zero at the same locations.
– We’ll use this trick a lot!

(Quora trolling on the ethics of this)

Finding the Least Squares Solution
• Finding the ‘w’ that minimizes the sum of squared errors:

f(w) = (1/2) Σi (w xi − yi)²
f’(w) = Σi xi (w xi − yi) = w Σi xi² − Σi xi yi

• Setting f’(w) = 0 and solving gives:

w = (Σi xi yi) / (Σi xi²)

• Let’s check that this is a minimizer by checking the second derivative:

f’’(w) = Σi xi²

– Since (anything)² is non-negative and (anything non-zero)² > 0, if we have one non-zero feature then f’’(w) > 0 and this is a minimizer.
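A minimal sketch of this closed-form solution, reusing the made-up data from the earlier sketch:

```python
# Closed-form 1D least squares (no intercept): w = sum_i(x_i*y_i) / sum_i(x_i^2).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.2, 3.9])

w = np.sum(x * y) / np.sum(x ** 2)   # the 'w' where f'(w) = 0
print(w)                             # fitted slope
print(np.sum((w * x - y) ** 2))      # sum of squared errors at the minimizer
```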

Least Squares Objective/Solution (Another View)
• Least squares minimizes a quadratic that is a sum of quadratics:
– Each term (w xi − yi)² is a non-negative quadratic in ‘w’.
– Their sum is itself a quadratic: an upward parabola with a unique minimum (when some xi is non-zero).

(pause)

Motivation: Combining Explanatory Variables
• Smoking is not the only contributor to lung cancer.
– For example, there are environmental factors like exposure to asbestos.

• How can we model the combined effect of smoking and asbestos?
• A simple way is with a 2-dimensional linear function:

ŷi = w1 xi1 + w2 xi2

• We have a weight w1 for feature ‘1’ (smoking) and w2 for feature ‘2’ (asbestos exposure).

Least Squares in 2 Dimensions
• Linear model:

ŷi = w1 xi1 + w2 xi2

• This defines a two-dimensional plane, not just a line!

Different Notations for Least Squares
• If we have ‘d’ features, the d-dimensional linear model is:

ŷi = w1 xi1 + w2 xi2 + … + wd xid

– In words, our model is that the output is a weighted sum of the inputs.

• We can re-write this in summation notation:

ŷi = Σj wj xij  (summing over j = 1, …, d)

• We can also re-write this in vector notation:

ŷi = wᵀxi
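A minimal sketch checking that the three notations agree, with a made-up weight vector and example (numpy assumed):

```python
# Expanded, summation, and vector notations all give the same prediction.
import numpy as np

w = np.array([0.5, -1.0, 2.0])    # d = 3 weights (made up)
x_i = np.array([1.0, 3.0, 0.5])   # features of one example (made up)

expanded = w[0]*x_i[0] + w[1]*x_i[1] + w[2]*x_i[2]  # w1*xi1 + w2*xi2 + w3*xi3
summation = sum(w[j] * x_i[j] for j in range(3))    # sum over j of wj*xij
vector = w @ x_i                                    # w^T xi (inner product)
print(expanded, summation, vector)                  # all three match
```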

Notation Alert (again)
• In this course, all vectors are assumed to be column vectors, so ‘w’ and ‘xi’ are d-by-1.

• So wᵀxi is a scalar:

wᵀxi = w1 xi1 + w2 xi2 + … + wd xid = Σj wj xij

• So rows of ‘X’ are actually the transposes of the column vectors xi: row ‘i’ of X is xiᵀ.

Least Squares in d Dimensions
• The linear least squares model in d dimensions minimizes:

f(w) = (1/2) Σi (wᵀxi − yi)²

• Dates back to 1801: Gauss used it to predict the location of Ceres.
• How do we find the best vector ‘w’ in ‘d’ dimensions?
– Can we set the “partial derivative” of each variable to 0?
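Before turning to partial derivatives, here is a minimal sketch of this d-dimensional objective, where each row of ‘X’ holds one example’s features (data made up, numpy assumed):

```python
# d-dimensional least squares objective f(w) = (1/2) * ||Xw - y||^2.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])        # n = 3 examples, d = 2 features (made up)
y = np.array([5.0, 3.0, 7.0])     # numerical labels (made up)

def f(w):
    r = X @ w - y                 # residual w^T xi - yi for every example
    return 0.5 * (r @ r)          # half the sum of squared errors

print(f(np.array([1.0, 1.0])))
```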

Partial Derivatives
• The partial derivative ∂f/∂wj is the rate of change of ‘f’ as wj changes, with all other variables held fixed.

http://msemac.redwoods.edu/~darnold/math50c/matlab/pderiv/index.xhtml

Least Squares Partial Derivatives (1 Example)
• The linear least squares objective in d dimensions for 1 example ‘i’:

f(w) = (1/2) (wᵀxi − yi)² = (1/2) (Σj wj xij − yi)²

• Computing the partial derivative for variable ‘1’ (using the chain rule):

∂f/∂w1 = xi1 (wᵀxi − yi)

Least Squares Partial Derivatives (‘n’ Examples)
• The linear least squares partial derivative for variable 1 on example ‘i’:

∂fi/∂w1 = xi1 (wᵀxi − yi)

• For a generic variable ‘j’ we would have:

∂fi/∂wj = xij (wᵀxi − yi)

• And if ‘f’ is summed over all ‘n’ examples we would have:

∂f/∂wj = Σi xij (wᵀxi − yi)

• Unfortunately, the partial derivative for wj depends on all of {w1, w2, …, wd} through wᵀxi.
– I can’t just “set it equal to 0 and solve for wj”.
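A minimal sketch of these partial derivatives, continuing the X, y, and f from the previous sketch; note how every ∂f/∂wj involves the whole vector ‘w’ through wᵀxi:

```python
# Partial derivatives df/dwj = sum_i xij*(w^T xi - yi), one coordinate at a time.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])
y = np.array([5.0, 3.0, 7.0])
w = np.array([1.0, 1.0])

residuals = X @ w - y   # (w^T xi - yi) for all examples; depends on all of w
partials = np.array([np.sum(X[:, j] * residuals) for j in range(X.shape[1])])
print(partials)         # the partial derivative for each wj
print(X.T @ residuals)  # the same values as one matrix expression
```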

Gradient and Critical Points in d Dimensions
• Generalizing “set the derivative to 0 and solve” in d dimensions:
– Find ‘w’ where the gradient vector equals the zero vector.

• The gradient is the vector with partial derivative ‘j’ in position ‘j’:

∇f(w) = [∂f/∂w1, ∂f/∂w2, …, ∂f/∂wd]ᵀ

http://msemac.redwoods.edu/~darnold/math50c/matlab/pderiv/index.xhtml
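As a sanity check, here is a minimal sketch comparing the gradient formula against centered finite differences, using the same made-up X and y as above:

```python
# Check grad f(w) = X^T (Xw - y) against numerical slopes of f.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])
y = np.array([5.0, 3.0, 7.0])
w = np.array([1.0, 1.0])

def f(w):
    r = X @ w - y
    return 0.5 * (r @ r)

grad = X.T @ (X @ w - y)        # analytic gradient, one entry per wj
eps = 1e-6
for j in range(len(w)):
    e = np.zeros(len(w)); e[j] = eps
    slope = (f(w + e) - f(w - e)) / (2 * eps)  # centered difference in wj
    print(grad[j], slope)                      # the two should closely agree
```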

Summary
• Regression considers the case of a numerical yi.
• Least squares is a classic method for fitting linear models.
– With 1 feature, it has a simple closed-form solution.
– It can be generalized to ‘d’ features.

• The gradient is the vector containing the partial derivatives of all variables.
• Next time: