
CPSC 340: Machine Learning and Data Mining

Least Squares (Fall 2020)

Admin
• Assignment 3 is up:
– Start early; this is usually the longest assignment.

• We’re going to start using calculus and linear algebra a lot.
– You should start reviewing these ASAP if you are rusty.
– A review of relevant calculus concepts is here.
– A review of relevant linear algebra concepts is here.

Supervised Learning Round 2: Regression
• We’re going to revisit supervised learning:

• Previously, we considered classification:
– We assumed yi was discrete: yi = ‘spam’ or yi = ‘not spam’.

• Now we’re going to consider regression:
– We allow yi to be numerical: yi = 10.34cm.

Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Does the number of lung cancer deaths change with the number of cigarettes?
– Does the number of skin cancer deaths change with latitude?

http://www.cvgs.k12.va.us:81/digstats/main/inferant/d_regrs.html
https://onlinecourses.science.psu.edu/stat501/node/11

Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Do people in big cities walk faster?
– Is the universe expanding, shrinking, or staying the same size?

http://hosting.astro.cornell.edu/academics/courses/astro201/hubbles_law.htm
https://www.nature.com/articles/259557a0.pdf

Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Does the number of gun deaths change with gun ownership?
– Does the number of violent crimes change with violent video games?

http://www.vox.com/2015/10/3/9444417/gun-violence-united-states-america
https://www.soundandvision.com/content/violence-and-video-games

Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Does a higher gender equality index lead to more women STEM grads?

• Note that we’re doing supervised learning:
– Trying to predict the value of 1 variable (the ‘yi’ values), instead of measuring correlation between 2.

• Supervised learning does not give causality:
– OK: “Higher index is correlated with lower grad %”.
– OK: “Higher index helps predict lower grad %”.
– BAD: “Higher index leads to lower grad %”.

• People/media get these confused all the time, be careful!
• There are lots of potential reasons for this correlation.

https://www.weforum.org/agenda/2018/02/does-gender-equality-result-in-fewer-female-stem-grads/

Handling Numerical Labels
• One way to handle numerical yi: discretize.
– E.g., for ‘age’ we could use {‘age ≤ 20’, ‘20 < age ≤ 30’, ‘age > 30’}.
– Now we can apply methods for classification to do regression.
– But coarse discretization loses resolution.
– And fine discretization requires lots of data.
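A minimal sketch of this discretization idea, assuming numpy and scikit-learn are available; the data and the bin edges are made up for illustration:

```python
# Hypothetical illustration: turn a numerical label into classes, then classify.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [2.5], [4.0], [5.5], [7.0]])  # one feature per example
y = np.array([18.0, 24.5, 28.0, 33.0, 41.0])       # numerical labels (e.g., age)

# Discretize y into {y <= 20, 20 < y <= 30, y > 30} (classes 0, 1, 2).
y_class = np.digitize(y, bins=[20.0, 30.0])

# Any classifier can now be trained on the coarse classes.
model = DecisionTreeClassifier().fit(X, y_class)
print(model.predict([[3.0]]))  # predicts a bin, not an exact number
```

Note that the prediction only identifies a bin, which is exactly the loss of resolution mentioned above.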

• There exist regression versions of classification methods:
– Regression trees, probabilistic models, non-parametric models.

• Today: one of the oldest, but still most popular/important methods:
– Linear regression based on squared error.
– Interpretable and the building block for more-complex methods.

Linear Regression in 1 Dimension
• Assume we only have 1 feature (d = 1):
– E.g., xi is the number of cigarettes and yi is the number of lung cancer deaths.

• Linear regression makes predictions ŷi using a linear function of xi:

ŷi = w xi

• The parameter ‘w’ is the weight or regression coefficient of xi.
– We’re temporarily ignoring the y-intercept.

• As xi changes, the slope ‘w’ affects the rate that ŷi increases/decreases:
– Positive ‘w’: ŷi increases as xi increases.
– Negative ‘w’: ŷi decreases as xi increases.

Aside: Terminology Woes
• Different fields use different terminology and symbols.
– Data points = objects = examples = rows = observations.
– Inputs = predictors = features = explanatory variables = regressors = independent variables = covariates = columns.
– Outputs = outcomes = targets = response variables = dependent variables (also called a “label” if it’s categorical).
– Regression coefficients = weights = parameters = betas.

• With linear regression, the symbols are inconsistent too:
– In ML, the data is X and y, and the weights are w.
– In statistics, the data is X and y, and the weights are β.
– In optimization, the data is A and b, and the weights are x.

Least Squares Objective
• Our linear model is given by:

ŷi = w xi

• So we make predictions for a new example x̃ by using:

ŷ = w x̃

• But we can’t use the same error as before:
– It is unlikely to find a line where ŷi = yi exactly for many points.

• This is due to noise, the relationship not being quite linear, or just floating-point issues.
– The “best” model may have |ŷi − yi| small but not exactly 0.

Least Squares Objective
• Instead of “exact yi”, we evaluate the “size” of the error in a prediction.
• The classic way is setting the slope ‘w’ to minimize the sum of squared errors:

f(w) = Σi (w xi − yi)²  (where Σi sums over the ‘n’ training examples)

• There are some justifications for this choice.
– A probabilistic interpretation is coming later in the course.

• But usually, it is done because it is easy to minimize.
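As a concrete illustration, here is a minimal sketch of this objective in Python, with made-up data (numpy assumed):

```python
# Sum of squared errors f(w) = sum_i (w*x_i - y_i)^2 for the 1D model.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # feature values (made up)
y = np.array([1.2, 1.9, 3.2, 3.9])  # numerical labels (made up)

def f(w):
    residuals = w * x - y           # prediction error on each example
    return np.sum(residuals ** 2)   # squared errors, summed over examples

print(f(0.5), f(1.0), f(1.5))       # f is smallest near the best slope
```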

Minimizing a Differentiable Function
• Math 101 approach to minimizing a differentiable function ‘f’:
1. Take the derivative of ‘f’.
2. Find points ‘w’ where the derivative f’(w) is equal to 0.
3. Choose the smallest one (and check that f’’(w) is positive).

Digression: Multiplying by a Positive Constant
• Note that this problem:

minimize f(w) = Σi (w xi − yi)²

• Has the same set of minimizers as this problem:

minimize (1/2) Σi (w xi − yi)²

• And these also have the same minimizers:

minimize α Σi (w xi − yi)², for any constant α > 0

• I can multiply ‘f’ by any positive constant and not change the solution.
– The derivative will still be zero at the same locations.
– We’ll use this trick a lot!

(Quora trolling on the ethics of this)

Finding the Least Squares Solution
• Finding the ‘w’ that minimizes the sum of squared errors:

f(w) = (1/2) Σi (w xi − yi)²
f’(w) = Σi xi (w xi − yi) = w Σi xi² − Σi xi yi

• Setting f’(w) = 0 and solving gives:

w = (Σi xi yi) / (Σi xi²)

• Let’s check that this is a minimizer by checking the second derivative:

f’’(w) = Σi xi²

– Since (anything)² is non-negative and (anything non-zero)² > 0, if we have one non-zero feature then f’’(w) > 0 and this is a minimizer.
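A minimal sketch of this closed-form solution, reusing the made-up data from the earlier sketch:

```python
# Closed-form 1D least squares (no intercept): w = sum_i(x_i*y_i) / sum_i(x_i^2).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.2, 3.9])

w = np.sum(x * y) / np.sum(x ** 2)   # the 'w' where f'(w) = 0
print(w)                             # fitted slope
print(np.sum((w * x - y) ** 2))      # sum of squared errors at the minimizer
```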

Least Squares Objective/Solution (Another View)
• Least squares minimizes a quadratic that is a sum of quadratics:
– Each term (w xi − yi)² is a non-negative quadratic in ‘w’.
– Their sum is itself a quadratic: an upward parabola with a unique minimum (when some xi is non-zero).

(pause)

Motivation: Combining Explanatory Variables
• Smoking is not the only contributor to lung cancer.
– For example, there are environmental factors like exposure to asbestos.

• How can we model the combined effect of smoking and asbestos?
• A simple way is with a 2-dimensional linear function:

ŷi = w1 xi1 + w2 xi2

• We have a weight w1 for feature ‘1’ (smoking) and w2 for feature ‘2’ (asbestos exposure).

Least Squares in 2 Dimensions
• Linear model:

ŷi = w1 xi1 + w2 xi2

• This defines a two-dimensional plane, not just a line!

Different Notations for Least Squares
• If we have ‘d’ features, the d-dimensional linear model is:

ŷi = w1 xi1 + w2 xi2 + … + wd xid

– In words, our model is that the output is a weighted sum of the inputs.

• We can re-write this in summation notation:

ŷi = Σj wj xij  (summing over j = 1, …, d)

• We can also re-write this in vector notation:

ŷi = wᵀxi
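A minimal sketch checking that the three notations agree, with a made-up weight vector and example (numpy assumed):

```python
# Expanded, summation, and vector notations all give the same prediction.
import numpy as np

w = np.array([0.5, -1.0, 2.0])    # d = 3 weights (made up)
x_i = np.array([1.0, 3.0, 0.5])   # features of one example (made up)

expanded = w[0]*x_i[0] + w[1]*x_i[1] + w[2]*x_i[2]  # w1*xi1 + w2*xi2 + w3*xi3
summation = sum(w[j] * x_i[j] for j in range(3))    # sum over j of wj*xij
vector = w @ x_i                                    # w^T xi (inner product)
print(expanded, summation, vector)                  # all three match
```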

Notation Alert (again)
• In this course, all vectors are assumed to be column vectors, so ‘w’ and ‘xi’ are d-by-1.

• So wᵀxi is a scalar:

wᵀxi = w1 xi1 + w2 xi2 + … + wd xid = Σj wj xij

• So rows of ‘X’ are actually the transposes of the column vectors xi: row ‘i’ of X is xiᵀ.

Least Squares in d Dimensions
• The linear least squares model in d dimensions minimizes:

f(w) = (1/2) Σi (wᵀxi − yi)²

• Dates back to 1801: Gauss used it to predict the location of Ceres.
• How do we find the best vector ‘w’ in ‘d’ dimensions?
– Can we set the “partial derivative” of each variable to 0?
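Before turning to partial derivatives, here is a minimal sketch of this d-dimensional objective, where each row of ‘X’ holds one example’s features (data made up, numpy assumed):

```python
# d-dimensional least squares objective f(w) = (1/2) * ||Xw - y||^2.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])        # n = 3 examples, d = 2 features (made up)
y = np.array([5.0, 3.0, 7.0])     # numerical labels (made up)

def f(w):
    r = X @ w - y                 # residual w^T xi - yi for every example
    return 0.5 * (r @ r)          # half the sum of squared errors

print(f(np.array([1.0, 1.0])))
```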

Partial Derivatives
• The partial derivative ∂f/∂wj is the rate of change of ‘f’ as wj changes, with all other variables held fixed.

http://msemac.redwoods.edu/~darnold/math50c/matlab/pderiv/index.xhtml

Least Squares Partial Derivatives (1 Example)
• The linear least squares objective in d dimensions for 1 example ‘i’:

f(w) = (1/2) (wᵀxi − yi)² = (1/2) (Σj wj xij − yi)²

• Computing the partial derivative for variable ‘1’ (using the chain rule):

∂f/∂w1 = xi1 (wᵀxi − yi)

Least Squares Partial Derivatives (‘n’ Examples)
• The linear least squares partial derivative for variable 1 on example ‘i’:

∂fi/∂w1 = xi1 (wᵀxi − yi)

• For a generic variable ‘j’ we would have:

∂fi/∂wj = xij (wᵀxi − yi)

• And if ‘f’ is summed over all ‘n’ examples we would have:

∂f/∂wj = Σi xij (wᵀxi − yi)

• Unfortunately, the partial derivative for wj depends on all of {w1, w2, …, wd} through wᵀxi.
– I can’t just “set it equal to 0 and solve for wj”.
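A minimal sketch of these partial derivatives, continuing the X, y, and f from the previous sketch; note how every ∂f/∂wj involves the whole vector ‘w’ through wᵀxi:

```python
# Partial derivatives df/dwj = sum_i xij*(w^T xi - yi), one coordinate at a time.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])
y = np.array([5.0, 3.0, 7.0])
w = np.array([1.0, 1.0])

residuals = X @ w - y   # (w^T xi - yi) for all examples; depends on all of w
partials = np.array([np.sum(X[:, j] * residuals) for j in range(X.shape[1])])
print(partials)         # the partial derivative for each wj
print(X.T @ residuals)  # the same values as one matrix expression
```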

Gradient and Critical Points in d Dimensions
• Generalizing “set the derivative to 0 and solve” in d dimensions:
– Find ‘w’ where the gradient vector equals the zero vector.

• The gradient is the vector with partial derivative ‘j’ in position ‘j’:

∇f(w) = [∂f/∂w1, ∂f/∂w2, …, ∂f/∂wd]ᵀ

http://msemac.redwoods.edu/~darnold/math50c/matlab/pderiv/index.xhtml
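As a sanity check, here is a minimal sketch comparing the gradient formula against centered finite differences, using the same made-up X and y as above:

```python
# Check grad f(w) = X^T (Xw - y) against numerical slopes of f.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])
y = np.array([5.0, 3.0, 7.0])
w = np.array([1.0, 1.0])

def f(w):
    r = X @ w - y
    return 0.5 * (r @ r)

grad = X.T @ (X @ w - y)        # analytic gradient, one entry per wj
eps = 1e-6
for j in range(len(w)):
    e = np.zeros(len(w)); e[j] = eps
    slope = (f(w + e) - f(w - e)) / (2 * eps)  # centered difference in wj
    print(grad[j], slope)                      # the two should closely agree
```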

Summary
• Regression considers the case of a numerical yi.
• Least squares is a classic method for fitting linear models.
– With 1 feature, it has a simple closed-form solution.
– It can be generalized to ‘d’ features.

• The gradient is the vector containing the partial derivatives of all variables.
• Next time: