Introduction to Machine Learning (Bishop PRML Ch. 1)
Alireza Ghane

Transcript of the lecture notes at vda.univie.ac.at/Teaching/ML/15s/LectureNotes/01_basics.pdf
Outline
Course Info.: People, References, Resources
Machine Learning: What, Why, and How?
Curve Fitting: (e.g.) Regression and Model Selection
Decision Theory: ML, Loss Function, MAP
Probability Theory: (e.g.) Probabilities and Parameter Estimation
Conclusion
Course Info.
Home: http://vda.univie.ac.at/Teaching/ML/15s
Discussions: https://moodle.univie.ac.at/
Dr. Torsten Möller
Alireza Ghane
Registration
Course max participants: 25
Course registered: 61
Number of seats: 48
Excess: 13
Sign your name on the sheet.
If you miss the first two sessions, you will be automatically SIGNED OFF the course!
References
Main Textbook: Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer 2006.
Other Useful Resources:
• The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
• Machine Learning, Tom Mitchell.
• Pattern Classification (2nd ed.), Richard O. Duda, Peter E. Hart, and David G. Stork.
• Machine Learning: An Algorithmic Perspective, Stephen Marsland.
• The Top Ten Algorithms in Data Mining, X. Wu, V. Kumar.
• Learning from Data, Cherkassky-Mulier.
Online Courses: Andrew Ng: http://ml-class.org/
Grading
• Grading:
  • Assignments / Labs (50%): 5 assignments, 10% each
  • Final Exam (40%)
  • Class Feedback (10%)
• Assignment late policy:
  • 5 grace days for all assignments together
  • after the grace days, 25% penalty for each day
• Programming with:
  • MATLAB (licensed): http://de.mathworks.com/
  • Octave (free): https://www.gnu.org/software/octave/
Course Topics
We will cover techniques in the standard ML toolkit:
maximum likelihood, regularization, support vector machines (SVM), Fisher's linear discriminant (LDA), boosting, principal components analysis (PCA), Markov random fields (MRF), neural networks, graphical models, belief propagation, expectation-maximization (EM), mixture models, mixtures of experts (MoE), hidden Markov models (HMM), particle filters, Markov chain Monte Carlo (MCMC), Gibbs sampling, ...
Background
• Calculus: $E = mc^2 \Rightarrow \frac{\partial E}{\partial c} = 2mc$
• Linear algebra (PRML Appendix C): $A u_i = \lambda_i u_i$; $\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T \mathbf{a}) = \mathbf{a}$
• Probability (PRML Ch. 1.2): $p(X) = \sum_Y p(X, Y)$; $p(x) = \int p(x, y)\,dy$; $\mathbb{E}_x[f] = \int p(x) f(x)\,dx$
It will be possible to refresh, but if you've never seen these before, this course will be very difficult.
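As a quick, hedged illustration of the sum rule and expectation above, here is a small NumPy check on a toy discrete joint distribution (the numbers are made up for the example):

```python
import numpy as np

# Toy joint distribution p(X, Y): rows index X in {0, 1, 2}, columns index Y in {0, 1}.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])
assert np.isclose(p_xy.sum(), 1.0)

# Sum rule: p(X) = sum_Y p(X, Y)
p_x = p_xy.sum(axis=1)

# Expectation of some function f(X): E_X[f] = sum_x p(x) f(x)
f = np.array([1.0, 2.0, 3.0])
E_f = np.sum(p_x * f)

print(p_x, E_f)   # p_x sums to 1; E_f is a weighted average of the values of f
```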
What is Machine Learning (ML)?
• Algorithms that automatically improve performance through experience
• Often this means defining a model by hand, and using data to fit its parameters
Why ML?
• The real world is complex – difficult to hand-craft solutions.
• ML is the preferred framework for applications in many fields:
  • Computer Vision
  • Natural Language Processing, Speech Recognition
  • Robotics
  • ...
Hand-written Digit Recognition
!"#$%&'"()#*$#+ !,- )( !.,$#/+ +%((/#/01/! 233456 334
7/#()#-! 8/#9 */:: )0 "'%! /$!9 +$"$;$!/ +,/ ") "'/ :$1< )(
8$#%$"%)0 %0 :%&'"%0& =>?@ 2ABC D,!" -$</! %" ($!"/#56E'/ 7#)")"97/ !/:/1"%)0 $:&)#%"'- %! %::,!"#$"/+ %0 F%&6 GH6
C! !//0I 8%/*! $#/ $::)1$"/+ -$%0:9 ()# -)#/ 1)-7:/J1$"/&)#%/! *%"' '%&' *%"'%0 1:$!! 8$#%$;%:%"96 E'/ 1,#8/-$#</+ 3BK7#)") %0 F%&6 L !')*! "'/ %-7#)8/+ 1:$!!%(%1$"%)07/#()#-$01/ ,!%0& "'%! 7#)")"97/ !/:/1"%)0 !"#$"/&9 %0!"/$+)( /.,$::9K!7$1/+ 8%/*!6 M)"/ "'$" */ );"$%0 $ >6? 7/#1/0"
/##)# #$"/ *%"' $0 $8/#$&/ )( )0:9 (),# "*)K+%-/0!%)0$:
8%/*! ()# /$1' "'#//K+%-/0!%)0$: );D/1"I "'$0<! ") "'/
(:/J%;%:%"9 7#)8%+/+ ;9 "'/ -$"1'%0& $:&)#%"'-6
!"# $%&'() *+,-. */0+12.33. 4,3,5,6.
N,# 0/J" /J7/#%-/0" %08):8/! "'/ OAPQKR !'$7/ !%:'),/""/
+$"$;$!/I !7/1%(%1$::9 B)#/ PJ7/#%-/0" BPK3'$7/KG 7$#" SI
*'%1' -/$!,#/! 7/#()#-$01/ )( !%-%:$#%"9K;$!/+ #/"#%/8$:
=>T@6 E'/ +$"$;$!/ 1)0!%!"! )( GI?HH %-$&/!U RH !'$7/
1$"/&)#%/!I >H %-$&/! 7/# 1$"/&)#96 E'/ 7/#()#-$01/ %!
-/$!,#/+ ,!%0& "'/ !)K1$::/+ V;,::!/9/ "/!"IW %0 *'%1' /$1'
!"# $%%% &'()*(+&$,)* ,) -(&&%') ()(./*$* ()0 1(+2$)% $)&%..$3%)+%4 5,.6 784 ),6 784 (-'$. 7997
:;<6 #6 (== >? @AB C;DE=FDD;?;BG 1)$*& @BD@ G;<;@D HD;I< >HJ CB@A>G KLM >H@ >? "94999N6 &AB @BO@ FP>QB BFEA G;<;@ ;IG;EF@BD @AB BOFCR=B IHCPBJ
?>==>SBG PT @AB @JHB =FPB= FIG @AB FDD;<IBG =FPB=6
:;<6 U6 M0 >PVBE@ JBE><I;@;>I HD;I< @AB +,$.W79 GF@F DB@6 +>CRFJ;D>I >?@BD@ DB@ BJJ>J ?>J **04 *AFRB 0;D@FIEB K*0N4 FIG *AFRB 0;D@FIEB S;@A!!"#$%&$' RJ>@>@TRBD K*0WRJ>@>N QBJDHD IHCPBJ >? RJ>@>@TRB Q;BSD6 :>J**0 FIG *04 SB QFJ;BG @AB IHCPBJ >? RJ>@>@TRBD HI;?>JC=T ?>J F==>PVBE@D6 :>J *0WRJ>@>4 @AB IHCPBJ >? RJ>@>@TRBD RBJ >PVBE@ GBRBIGBG >I@AB S;@A;IW>PVBE@ QFJ;F@;>I FD SB== FD @AB PB@SBBIW>PVBE@ D;C;=FJ;@T6
:;<6 "96 -J>@>@TRB Q;BSD DB=BE@BG ?>J @S> G;??BJBI@ M0 >PVBE@D ?J>C @AB+,$. GF@F DB@ HD;I< @AB F=<>J;@AC GBDEJ;PBG ;I *BE@;>I !676 X;@A @A;DFRRJ>FEA4 Q;BSD FJB F==>EF@BG FGFR@;QB=T GBRBIG;I< >I @AB Q;DHF=E>CR=BO;@T >? FI >PVBE@ S;@A JBDRBE@ @> Q;BS;I< FI<=B6
Belongie et al. PAMI 2002
• Difficult to hand-craft rules about digits
Hand-written Digit Recognition
$x_i$ = [image of a handwritten digit]

$t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)$
• Represent the input image as a vector $x_i \in \mathbb{R}^{784}$.
• Suppose we have a target vector $t_i$
  • This is supervised learning
  • Discrete, finite label set: perhaps $t_i \in \{0, 1\}^{10}$, a classification problem
• Given a training set $\{(x_1, t_1), \ldots, (x_N, t_N)\}$, the learning problem is to construct a "good" function $y(x)$ from these.
  • $y : \mathbb{R}^{784} \to \mathbb{R}^{10}$
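A minimal sketch of this representation in Python/NumPy (the image here is a stand-in, not real MNIST data):

```python
import numpy as np

# Stand-in for a 28x28 grayscale digit image.
image = np.random.rand(28, 28)

# Represent the input as a vector x_i in R^784.
x_i = image.reshape(-1)
assert x_i.shape == (784,)

# One-hot target t_i in {0, 1}^10 for the label "3", as on the slide:
# t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
t_i = np.zeros(10)
t_i[3] = 1.0

# A learned function y: R^784 -> R^10 would score each class; the predicted
# digit would then be np.argmax(y(x_i)).
```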
Face Detection
Schneiderman and Kanade, IJCV 2002
• Classification problem
• ti ∈ {0, 1, 2}, non-face, frontal face, profile face.
Spam Detection
• Classification problem
• ti ∈ {0, 1}, non-spam, spam
• xi: counts of words, e.g. Viagra, stock, outperform, multi-bagger
Caveat - Horses (source?)
• Once upon a time there were two neighboring farmers, Jed and Ned. Each owned a horse, and the horses both liked to jump the fence between the two farms. Clearly the farmers needed some means to tell whose horse was whose.
• So Jed and Ned got together and agreed on a scheme for discriminating between horses. Jed would cut a small notch in one ear of his horse. Not a big, painful notch, but just big enough to be seen. Well, wouldn't you know it, the day after Jed cut the notch in his horse's ear, Ned's horse caught on the barbed wire fence and tore his ear the exact same way!
• Something else had to be devised, so Jed tied a big blue bow on the tail of his horse. But the next day, Jed's horse jumped the fence, ran into the field where Ned's horse was grazing, and chewed the bow right off the other horse's tail. Ate the whole bow!
• Finally, Jed suggested, and Ned concurred, that they should pick a feature that was less apt to change. Height seemed like a good feature to use. But were the heights different? Well, each farmer went and measured his horse, and do you know what? The brown horse was a full inch taller than the white one!

Moral of the story: ML provides theory and tools for setting parameters. Make sure you have the right model and features. Think about your "feature vector x."
Stock Price Prediction
• Problems in which ti is continuous are called regression
• E.g. ti is stock price, xi contains company profit, debt, cash flow, gross sales, number of spam emails sent, ...
Clustering Images
Wang et al., CVPR 2006
• Only xi is defined: unsupervised learning
• E.g. xi describes an image; find groups of similar images
Types of Learning Problems
• Supervised Learning
  • Classification
  • Regression
• Unsupervised Learning
  • Density estimation
  • Clustering: k-means, mixture models, hierarchical clustering
  • Hidden Markov models
• Reinforcement Learning
An Example - Polynomial Curve Fitting
• Suppose we are given a training set of N observations $(x_1, \ldots, x_N)$ and $(t_1, \ldots, t_N)$, with $x_i, t_i \in \mathbb{R}$
• Regression problem, estimate y(x) from these data
Polynomial Curve Fitting
• What form is y(x)?
• Let's try polynomials of degree M:
  $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M$
• This is the hypothesis space.
• How do we measure success?
• Sum of squared errors:
  $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2$
• Among functions in the class, choose that which minimizes this error

[Figure: a data point $(x_n, t_n)$ and the corresponding prediction $y(x_n, \mathbf{w})$ on the fitted curve]
Polynomial Curve Fitting
• Error function:
  $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2$
• Best coefficients:
  $\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})$
• Found using pseudo-inverse (more later)
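A minimal sketch (with synthetic data, which is an assumption here) of the pseudo-inverse solution for the minimizing coefficients $\mathbf{w}^*$:

```python
import numpy as np

# Synthetic 1-D regression data (assumed for illustration).
rng = np.random.default_rng(0)
N, M = 10, 3
x = np.sort(rng.uniform(0.0, 1.0, N))
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Design matrix Phi[n, j] = x_n^j, so y(x_n, w) = Phi[n] @ w.
Phi = np.vander(x, M + 1, increasing=True)

# Least-squares solution via the pseudo-inverse: w* = Phi^+ t.
w_star = np.linalg.pinv(Phi) @ t

# Training error E(w*) = 1/2 * sum_n (y(x_n, w*) - t_n)^2.
E = 0.5 * np.sum((Phi @ w_star - t) ** 2)
print(w_star, E)
```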
Which Degree of Polynomial?
[Figure: polynomial fits of several different degrees M to the same training data]
• A model selection problem
• $M = 9 \Rightarrow E(\mathbf{w}^*) = 0$: this is over-fitting
Generalization
[Figure: training and test $E_{\mathrm{RMS}}$ as a function of polynomial degree M]
• Generalization is the holy grail of ML
  • Want good performance for new data
• Measure generalization using a separate set
  • Use root-mean-squared (RMS) error: $E_{\mathrm{RMS}} = \sqrt{2E(\mathbf{w}^*)/N}$
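Continuing the earlier sketch, the RMS error can be evaluated on both a training set and a held-out test set for several degrees M (again with synthetic data, an assumption):

```python
import numpy as np

def fit_poly(x, t, M):
    """Least-squares fit of a degree-M polynomial via the pseudo-inverse."""
    return np.linalg.pinv(np.vander(x, M + 1, increasing=True)) @ t

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), i.e. the root-mean-squared residual."""
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)          # assumed underlying curve
x_train = rng.uniform(0.0, 1.0, 10)
t_train = true_f(x_train) + rng.normal(scale=0.3, size=10)
x_test = rng.uniform(0.0, 1.0, 100)
t_test = true_f(x_test) + rng.normal(scale=0.3, size=100)

for M in (0, 1, 3, 9):
    w = fit_poly(x_train, t_train, M)
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
# Training error keeps dropping with M, while test error typically rises for large M.
```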
Controlling Over-fitting: Regularization
• As order of polynomial M increases, so do coefficient magnitudes
• Penalize large coefficients in the error function:
  $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$
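A minimal sketch of the regularized fit; setting the gradient of this penalized error to zero gives the closed form $\mathbf{w}^* = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T\mathbf{t}$ (synthetic data and the example $\lambda$ values are assumptions):

```python
import numpy as np

def fit_poly_ridge(x, t, M, lam):
    """Minimize 1/2 sum (y(x_n, w) - t_n)^2 + lam/2 ||w||^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

for lam in (0.0, np.exp(-18), 1.0):
    w = fit_poly_ridge(x, t, M=9, lam=lam)
    print(lam, np.linalg.norm(w))   # larger lambda shrinks the coefficient magnitudes
```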
Controlling Over-fitting: Regularization
[Figure: degree-9 fits with two different regularization strengths $\lambda$]
Controlling Over-fitting: Regularization
[Figure: training and test $E_{\mathrm{RMS}}$ as a function of $\ln\lambda$]
• Note the $E_{\mathrm{RMS}}$ for the training set: a perfect match of the model to the training set is a result of over-fitting
• Training and test error show similar trend
Over-fitting: Dataset size
[Figure: degree-9 fits to training sets of two different sizes N]
• With more data, more complex model (M = 9) can be fit
• Rule of thumb: 10 datapoints for each parameter
Validation Set
• Split training data into training set and validation set
• Train different models (e.g. different order polynomials) on the training set
• Choose the model (e.g. order of polynomial) with minimum error on the validation set
Cross-validation
[Figure: S = 4 cross-validation runs; each run holds out a different fold for validation]
• Data are often limited
• Cross-validation creates S groups of data; use S − 1 to train, the other to validate
• Extreme case, leave-one-out cross-validation (LOO-CV): S is the number of training data points
• Cross-validation is an effective method for model selection, but can be slow
• Models with multiple complexity parameters: exponential number of runs (see the sketch below)
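A minimal sketch of S-fold cross-validation used to pick the polynomial degree (synthetic data and S = 4 are assumptions for the example); the chosen model is then retrained on all of the training data:

```python
import numpy as np

def fit_poly(x, t, M):
    """Least-squares polynomial fit via the pseudo-inverse."""
    return np.linalg.pinv(np.vander(x, M + 1, increasing=True)) @ t

def rms_error(w, x, t):
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

def cv_error(x, t, M, S=4):
    """Average validation RMS error over S folds (train on S-1, validate on 1)."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(x)), S)
    errors = []
    for i in range(S):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(S) if j != i])
        w = fit_poly(x[train], t[train], M)
        errors.append(rms_error(w, x[val], t[val]))
    return np.mean(errors)

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=20)

best_M = min(range(10), key=lambda M: cv_error(x, t, M))
w_final = fit_poly(x, t, best_M)   # retrain the chosen model on all training data
print(best_M)
```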
Summary
• Want models that generalize to new data
  • Train model on training set
  • Measure performance on held-out test set
• Performance on the test set is a good estimate of performance on new data
Summary - Model Selection
• Which model to use? E.g. which degree polynomial?
• Training set error is lower with a more complex model
• Can't just choose the model with lowest training error
• Peeking at test error is unfair, e.g. picking the polynomial with lowest test error
• Performance on the test set is then no longer a good estimate of performance on new data
Summary - Solutions I
• Use a validation set
  • Train models on the training set, e.g. different degree polynomials
  • Measure performance on a held-out validation set
  • Measure performance of that model on a held-out test set
• Can use cross-validation on the training set instead of a separate validation set if little data and lots of time
  • Choose the model with lowest error over all cross-validation folds (e.g. polynomial degree)
  • Retrain that model using all training data (e.g. polynomial coefficients)
Summary - Solutions II
• Use regularization
  • Train a complex model (e.g. high order polynomial) but penalize being "too complex" (e.g. large weight magnitudes)
  • Need to balance error vs. regularization (λ)
• Choose λ using cross-validation
• Get more data
Decision Theory
For a sample x, decide which class ($C_k$) it is from.
Ideas:
• Maximum Likelihood
• Minimum Loss/Cost (e.g. misclassification rate)
• Maximum A Posteriori (MAP)
Decision: Maximum Likelihood
• Inference step: determine statistics from the training data: $p(x, t)$ or $p(x \mid C_k)$
• Decision step: determine the optimal t for a test input x:
  $t = \arg\max_k \underbrace{p(x \mid C_k)}_{\text{Likelihood}}$
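A minimal sketch of the maximum-likelihood decision rule with assumed 1-D Gaussian class-conditional densities (all numbers here are made up):

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """1-D Gaussian density."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Assumed class-conditional densities p(x | C_k), e.g. obtained in the inference step.
class_params = {0: (0.0, 1.0), 1: (2.0, 1.5)}   # (mean, std) per class

def decide_ml(x):
    """Pick the class whose likelihood p(x | C_k) of the observation is largest."""
    likelihood = {k: gauss_pdf(x, m, s) for k, (m, s) in class_params.items()}
    return max(likelihood, key=likelihood.get)

print(decide_ml(0.2), decide_ml(1.8))   # near each class mean, that class wins
```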
Decision: Minimum Misclassification Rate
$p(\text{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx$

In general, for multiple classes: $p(\text{mistake}) = \sum_k \sum_{j \neq k} \int_{R_j} p(x, C_k)\,dx$

[Figure: joint densities $p(x, C_1)$ and $p(x, C_2)$ over x, with decision regions $R_1$ and $R_2$]

$\hat{x}$: decision boundary. $x_0$: optimal decision boundary,
$x_0 : \arg\min_{R_1} \{p(\text{mistake})\}$
Decision: Minimum Loss/Cost
• Misclassification rate:
  $\mathcal{R} : \arg\min_{\{R_i \mid i \in \{1, \cdots, K\}\}} \sum_k \sum_j L(R_j, C_k)$
• Weighted loss/cost function:
  $\mathcal{R} : \arg\min_{\{R_i \mid i \in \{1, \cdots, K\}\}} \sum_k \sum_j W_{j,k}\, L(R_j, C_k)$
  This is useful when:
  • The populations of the classes are different
  • The failure cost is non-symmetric
  • ...
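For a single input, the same idea is usually written with posteriors: decide the class j that minimizes the expected loss $\sum_k W_{j,k}\, p(C_k \mid x)$. A minimal sketch with assumed numbers:

```python
import numpy as np

# Assumed posteriors p(C_k | x) for one input x, and an assumed loss matrix:
# W[j, k] is the cost of deciding class j when the true class is k.
posterior = np.array([0.8, 0.2])       # p(C_0 | x), p(C_1 | x)
W = np.array([[0.0, 100.0],            # deciding C_0 when truth is C_1 is very costly
              [1.0,   0.0]])           # deciding C_1 when truth is C_0 is a mild cost

expected_loss = W @ posterior          # expected loss of each possible decision
decision = int(np.argmin(expected_loss))
print(expected_loss, decision)         # C_0 is more probable, yet the asymmetric cost picks C_1
```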
Decision: Maximum A Posteriori (MAP)

Bayes' Theorem: $P\{A \mid B\} = \dfrac{P\{B \mid A\}\,P\{A\}}{P\{B\}}$

$\underbrace{p(C_k \mid x)}_{\text{Posterior}} \propto \underbrace{p(x \mid C_k)}_{\text{Likelihood}} \; \underbrace{p(C_k)}_{\text{Prior}}$

• Provides an a posteriori belief for the estimation, rather than a single point estimate.
• Can utilize a priori information in the decision.
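A minimal sketch of the MAP rule, with assumed Gaussian likelihoods and an assumed prior; a strong prior can overturn the pure maximum-likelihood decision:

```python
import numpy as np

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

class_params = {0: (0.0, 1.0), 1: (2.0, 1.0)}   # assumed p(x | C_k): (mean, std)
prior = {0: 0.95, 1: 0.05}                      # assumed p(C_k): class 0 is far more common

def decide_map(x):
    """argmax_k p(x | C_k) p(C_k), i.e. the unnormalized posterior."""
    score = {k: gauss_pdf(x, m, s) * prior[k] for k, (m, s) in class_params.items()}
    return max(score, key=score.get)

print(decide_map(1.2))   # the likelihood alone prefers class 1 at x = 1.2; the prior flips it to class 0
```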
Coin Tossing
• Let's say you're given a coin, and you want to find out P(heads), the probability that if you flip it, it lands as "heads".
• Flip it a few times: H H T
• P(heads) = 2/3
• Hmm... is this rigorous? Does this make sense?
Coin Tossing - Model
• Bernoulli distribution: $P(\text{heads}) = \mu$, $P(\text{tails}) = 1 - \mu$
• Assume coin flips are independent and identically distributed (i.i.d.)
  • i.e. all are separate samples from the Bernoulli distribution
• Given data $D = \{x_1, \ldots, x_N\}$, heads: $x_i = 1$, tails: $x_i = 0$, the likelihood of the data is:
  $p(D \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n}(1 - \mu)^{1 - x_n}$
Maximum Likelihood Estimation
• Given D with h heads and t tails
• What should µ be?
• Maximum Likelihood Estimation (MLE): choose the µ which maximizes the likelihood of the data
  $\mu_{ML} = \arg\max_\mu \, p(D \mid \mu)$
• Since ln(·) is monotone increasing:
  $\mu_{ML} = \arg\max_\mu \, \ln p(D \mid \mu)$
• Likelihood:
  $p(D \mid \mu) = \prod_{n=1}^{N} \mu^{x_n}(1 - \mu)^{1 - x_n}$
• Log-likelihood:
  $\ln p(D \mid \mu) = \sum_{n=1}^{N} x_n \ln\mu + (1 - x_n)\ln(1 - \mu)$
• Take the derivative, set to 0:
  $\frac{d}{d\mu}\ln p(D \mid \mu) = \sum_{n=1}^{N} x_n\frac{1}{\mu} - (1 - x_n)\frac{1}{1 - \mu} = \frac{1}{\mu}h - \frac{1}{1 - \mu}t$
  $\Rightarrow \mu = \frac{h}{t + h}$
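A minimal numeric check (with made-up flips) that the closed form $\mu_{ML} = h/(h+t)$ agrees with a brute-force maximization of the log-likelihood:

```python
import numpy as np

flips = np.array([1, 1, 0, 1, 0, 1, 1, 0])       # assumed data: 1 = heads, 0 = tails
h = int(flips.sum())
t = len(flips) - h

mu_closed_form = h / (h + t)                     # the MLE derived above

mus = np.linspace(1e-6, 1 - 1e-6, 100001)
log_likelihood = h * np.log(mus) + t * np.log(1 - mus)
mu_grid = mus[np.argmax(log_likelihood)]

print(mu_closed_form, mu_grid)                   # both are approximately 0.625
```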
Bayesian Learning
• Wait, does this make sense? What if I flip 1 time and get heads? Do I believe µ = 1?

• Learn µ the Bayesian way:

  $P(\mu\mid\mathcal{D}) = \frac{P(\mathcal{D}\mid\mu)\,P(\mu)}{P(\mathcal{D})}$

  $\underbrace{P(\mu\mid\mathcal{D})}_{\text{posterior}} \;\propto\; \underbrace{P(\mathcal{D}\mid\mu)}_{\text{likelihood}}\;\underbrace{P(\mu)}_{\text{prior}}$

• The prior encodes the knowledge that most coins are roughly 50-50

• A conjugate prior makes the math simpler and gives an easy interpretation
• For the Bernoulli, the conjugate prior is the Beta distribution
Intro. to Machine Learning Alireza Ghane / Greg Mori 71
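A small numerical sketch of this idea (NumPy assumed; the one-head dataset and the Beta(5, 5)-shaped prior are illustrative choices):

```python
import numpy as np

# Grid over mu in (0, 1).
mu = np.linspace(0.001, 0.999, 999)

# One flip, one head: likelihood p(D|mu) = mu, so ML says mu = 1.
likelihood = mu

# Prior encoding "most coins are roughly fair": Beta(5, 5), up to a constant.
prior = mu**(5 - 1) * (1 - mu)**(5 - 1)

# Posterior ∝ likelihood × prior.
posterior = likelihood * prior

print("ML estimate: ", mu[np.argmax(likelihood)])   # ~1.0
print("MAP estimate:", mu[np.argmax(posterior)])    # pulled back toward 0.5
```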
![Page 75: Introduction to Machine Learning - univie.ac.atvda.univie.ac.at/Teaching/ML/15s/LectureNotes/01_basics.pdf · Introduction to Machine Learning Bishop PRML Ch. 1 Alireza Ghane. Course](https://reader035.fdocuments.in/reader035/viewer/2022062414/5f09ebcf7e708231d42924b0/html5/thumbnails/75.jpg)
Course Info. Machine Learning Curve Fitting Decision Theory Probability Theory Conclusion
Beta Distribution
• We will use the Beta distribution to express our prior knowledge about coins:

  $\mathrm{Beta}(\mu\mid a, b) = \underbrace{\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}}_{\text{normalization}}\;\mu^{a-1}(1-\mu)^{b-1}$

• Parameters a and b control the shape of this distribution
Intro. to Machine Learning Alireza Ghane / Greg Mori 74
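A quick way to see how a and b control the shape is to probe the density with SciPy (the (a, b) pairs below are arbitrary examples):

```python
from scipy.stats import beta

# mean = a/(a+b); mode = (a-1)/(a+b-2) when a, b > 1.
for a, b in [(0.5, 0.5), (1, 1), (2, 2), (8, 2)]:
    mode = f"{(a - 1) / (a + b - 2):.2f}" if (a > 1 and b > 1) else "n/a"
    print(f"Beta({a},{b}): pdf(0.5) = {beta.pdf(0.5, a, b):.3f}, "
          f"mean = {a / (a + b):.2f}, mode = {mode}")
```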
![Page 76: Introduction to Machine Learning - univie.ac.atvda.univie.ac.at/Teaching/ML/15s/LectureNotes/01_basics.pdf · Introduction to Machine Learning Bishop PRML Ch. 1 Alireza Ghane. Course](https://reader035.fdocuments.in/reader035/viewer/2022062414/5f09ebcf7e708231d42924b0/html5/thumbnails/76.jpg)
Course Info. Machine Learning Curve Fitting Decision Theory Probability Theory Conclusion
Posterior
$P(\mu\mid\mathcal{D}) \;\propto\; P(\mathcal{D}\mid\mu)\,P(\mu)$

$\propto\; \underbrace{\prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n}}_{\text{likelihood}}\;\underbrace{\mu^{a-1}(1-\mu)^{b-1}}_{\text{prior}}$

$\propto\; \mu^{h}(1-\mu)^{t}\,\mu^{a-1}(1-\mu)^{b-1}$

$\propto\; \mu^{h+a-1}(1-\mu)^{t+b-1}$

• The simple form of the posterior is due to the use of a conjugate prior

• Parameters a and b act as extra observations

• Note that as $N = h + t \to \infty$, the prior is ignored
Intro. to Machine Learning Alireza Ghane / Greg Mori 75
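In code, the conjugate update is just pseudo-counting; a minimal sketch with illustrative numbers:

```python
# Conjugate Bernoulli/Beta update: the posterior is again a Beta distribution,
# with a and b acting as extra (pseudo-)observations. Numbers are illustrative.
a, b = 2, 2            # prior Beta(a, b)
h, t = 7, 3            # observed heads and tails

a_post, b_post = h + a, t + b          # posterior ∝ mu^(h+a-1) (1-mu)^(t+b-1)
post_mean = a_post / (a_post + b_post)
mu_ml = h / (h + t)
print(f"posterior = Beta({a_post}, {b_post}); mean = {post_mean:.3f} vs mu_ML = {mu_ml:.3f}")
# As N = h + t grows, the pseudo-counts are swamped and the two estimates agree.
```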
![Page 79: Introduction to Machine Learning - univie.ac.atvda.univie.ac.at/Teaching/ML/15s/LectureNotes/01_basics.pdf · Introduction to Machine Learning Bishop PRML Ch. 1 Alireza Ghane. Course](https://reader035.fdocuments.in/reader035/viewer/2022062414/5f09ebcf7e708231d42924b0/html5/thumbnails/79.jpg)
Course Info. Machine Learning Curve Fitting Decision Theory Probability Theory Conclusion
Maximum A Posteriori
• Given the posterior $P(\mu\mid\mathcal{D})$ we could compute a single value, known as the Maximum a Posteriori (MAP) estimate for µ:

  $\mu_{MAP} = \arg\max_{\mu} P(\mu\mid\mathcal{D})$

• This is known as point estimation

• However, the correct Bayesian thing to do is to use the full distribution over µ, i.e. compute

  $\mathbb{E}_{\mu}[f] = \int p(\mu\mid\mathcal{D})\, f(\mu)\, d\mu$

• This integral is usually hard to compute
Intro. to Machine Learning Alireza Ghane / Greg Mori 78
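For the Beta posterior above, both the MAP estimate and expectations under the full posterior are easy to obtain; a sketch with SciPy (the posterior parameters are the illustrative ones from the previous sketch):

```python
from scipy.stats import beta

a_post, b_post = 9, 5      # illustrative posterior Beta(h + a, t + b) from above

# MAP: the single value maximizing the posterior (point estimation).
mu_map = (a_post - 1) / (a_post + b_post - 2)

# Fully Bayesian: expectation under the whole posterior. For f(mu) = mu this is
# the posterior mean; in general such integrals are approximated, e.g. by sampling.
samples = beta.rvs(a_post, b_post, size=100_000, random_state=0)
print(f"mu_MAP = {mu_map:.3f}, E[mu|D] ≈ {samples.mean():.3f} "
      f"(exact {a_post / (a_post + b_post):.3f})")
```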
![Page 82: Introduction to Machine Learning - univie.ac.atvda.univie.ac.at/Teaching/ML/15s/LectureNotes/01_basics.pdf · Introduction to Machine Learning Bishop PRML Ch. 1 Alireza Ghane. Course](https://reader035.fdocuments.in/reader035/viewer/2022062414/5f09ebcf7e708231d42924b0/html5/thumbnails/82.jpg)
Course Info. Machine Learning Curve Fitting Decision Theory Probability Theory Conclusion
Polynomial Curve Fitting: What We Did
• What form is y(x)?
• Let's try polynomials of degree M:

  $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M$

• This is the hypothesis space.

• How do we measure success?
• Sum of squared errors:

  $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2$

• Among functions in the class, choose that which minimizes this error

[Figure: training points $(x_n, t_n)$ with the fitted curve $y(x, \mathbf{w})$; the error measures the vertical distances between the curve and the targets.]
Intro. to Machine Learning Alireza Ghane 81
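A minimal sketch of this recipe with NumPy (the noisy sin(2πx) data and the degree M = 3 are illustrative assumptions in the spirit of the running example):

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

M = 3                                          # polynomial degree (hypothesis space)
Phi = np.vander(x, M + 1, increasing=True)     # design matrix [1, x, ..., x^M]

# Minimize E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2: ordinary least squares.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("fitted coefficients w:", np.round(w, 3))
```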
![Page 85: Introduction to Machine Learning - univie.ac.atvda.univie.ac.at/Teaching/ML/15s/LectureNotes/01_basics.pdf · Introduction to Machine Learning Bishop PRML Ch. 1 Alireza Ghane. Course](https://reader035.fdocuments.in/reader035/viewer/2022062414/5f09ebcf7e708231d42924b0/html5/thumbnails/85.jpg)
Course Info. Machine Learning Curve Fitting Decision Theory Probability Theory Conclusion
Curve Fitting: Probabilistic Approach
[Figure: at an input $x_0$, the target $t$ is modeled as a Gaussian of width $2\sigma$ centered on $y(x_0, \mathbf{w})$, i.e. the density $p(t\mid x_0, \mathbf{w}, \beta)$ sits on the curve $y(x, \mathbf{w})$.]

$p(\mathbf{t}\mid\mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\right)$

$\ln p(\mathbf{t}\mid\mathbf{x}, \mathbf{w}, \beta) = -\underbrace{\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2}_{\beta E(\mathbf{w})} + \underbrace{\frac{N}{2}\ln\beta}_{\text{const.}} - \underbrace{\frac{N}{2}\ln(2\pi)}_{\text{const.}}$

Maximize log-likelihood ⇔ minimize E(w). Can optimize for β as well.
Intro. to Machine Learning Alireza Ghane 84
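Maximizing the log-likelihood over w is exactly the least-squares fit; the noise precision β can then be estimated too. A small sketch (data are the same illustrative kind as above; setting the derivative with respect to β to zero gives 1/β_ML = (1/N) Σ_n {y(x_n, w_ML) − t_n}²):

```python
import numpy as np

# Illustrative data, as in the earlier sketch.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

Phi = np.vander(x, 4, increasing=True)               # cubic polynomial basis
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)       # ML for w == least squares

# ML for the noise precision: 1/beta_ML = (1/N) * sum_n (y(x_n, w_ML) - t_n)^2
beta_ml = len(t) / np.sum((Phi @ w_ml - t) ** 2)
print(f"beta_ML = {beta_ml:.2f}  (noise std ≈ {beta_ml ** -0.5:.3f})")
```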
![Page 88: Introduction to Machine Learning - univie.ac.atvda.univie.ac.at/Teaching/ML/15s/LectureNotes/01_basics.pdf · Introduction to Machine Learning Bishop PRML Ch. 1 Alireza Ghane. Course](https://reader035.fdocuments.in/reader035/viewer/2022062414/5f09ebcf7e708231d42924b0/html5/thumbnails/88.jpg)
Course Info. Machine Learning Curve Fitting Decision Theory Probability Theory Conclusion
Curve Fitting: Bayesian Approach
[Figure: as before, the density $p(t\mid x_0, \mathbf{w}, \beta)$ is a Gaussian of width $2\sigma$ centered on the curve $y(x, \mathbf{w})$.]

$p(\mathbf{t}\mid\mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\right)$

Posterior distribution: $p(\mathbf{w}\mid\mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t}\mid\mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w}\mid\alpha)$

Minimize: $\underbrace{\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2}_{\beta E(\mathbf{w})} + \underbrace{\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}}_{\text{regularization}}$
Intro. to Machine Learning Alireza Ghane 87
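Minimizing this penalized error is regularized (ridge) least squares with effective weight λ = α/β; a minimal sketch, with hypothetical α and β values and the same illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, 10, increasing=True)   # degree-9 polynomial, prone to over-fitting

# MAP with a Gaussian prior on w: minimize beta/2 * ||Phi w - t||^2 + alpha/2 * w^T w,
# whose solution is the ridge estimate with lambda = alpha / beta.
alpha, beta = 5e-3, 11.1                  # hypothetical hyperparameter values
lam = alpha / beta
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
print("MAP coefficients:", np.round(w_map, 2))
```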
![Page 89: Introduction to Machine Learning - univie.ac.atvda.univie.ac.at/Teaching/ML/15s/LectureNotes/01_basics.pdf · Introduction to Machine Learning Bishop PRML Ch. 1 Alireza Ghane. Course](https://reader035.fdocuments.in/reader035/viewer/2022062414/5f09ebcf7e708231d42924b0/html5/thumbnails/89.jpg)
Course Info. Machine Learning Curve Fitting Decision Theory Probability Theory Conclusion
Curve Fitting: Bayesian
$p(t\mid x_0, \mathbf{w}, \beta, \alpha) = \mathcal{N}\!\left(t \mid P_{\mathbf{w}}(x_0),\, Q_{\mathbf{w},\beta,\alpha}(x_0)\right)$

$p(t\mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\!\left(t \mid m(x),\, s^2(x)\right)$

$m(x) = \phi(x)^T S \sum_{n=1}^{N} \phi(x_n)\, t_n$

$s^2(x) = \beta^{-1}\left(1 + \phi(x)^T S\, \phi(x)\right)$

$S^{-1} = \frac{\alpha}{\beta} I + \sum_{n=1}^{N} \phi(x_n)\,\phi(x_n)^T$

[Figures: the Gaussian predictive density $p(t\mid x_0, \mathbf{w}, \beta)$ around the curve $y(x, \mathbf{w})$, and the resulting predictive mean with an uncertainty band over the training data on $x \in [0, 1]$, $t \in [-1, 1]$.]
Intro. to Machine Learning Alireza Ghane 88
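These closed-form expressions are straightforward to evaluate directly; a minimal sketch implementing m(x), s²(x), and S⁻¹ for a polynomial basis (the data, basis degree, and hyperparameters α, β are illustrative assumptions):

```python
import numpy as np

def phi(x, M=9):
    """Polynomial basis functions phi(x) = (1, x, ..., x^M)."""
    return np.array([x**m for m in range(M + 1)])

rng = np.random.default_rng(0)
xs = np.linspace(0, 1, 10)
ts = np.sin(2 * np.pi * xs) + rng.normal(scale=0.2, size=xs.shape)
alpha, beta = 5e-3, 11.1                        # hypothetical hyperparameters

# S^{-1} = (alpha/beta) I + sum_n phi(x_n) phi(x_n)^T
Phi = np.stack([phi(x) for x in xs])            # N x (M+1) design matrix
S = np.linalg.inv(alpha / beta * np.eye(Phi.shape[1]) + Phi.T @ Phi)

x_new = 0.35
m = phi(x_new) @ S @ Phi.T @ ts                 # predictive mean m(x)
s2 = (1 + phi(x_new) @ S @ phi(x_new)) / beta   # predictive variance s^2(x)
print(f"p(t | x = {x_new}) ≈ N({m:.3f}, {s2:.4f})")
```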
![Page 92: Introduction to Machine Learning - univie.ac.atvda.univie.ac.at/Teaching/ML/15s/LectureNotes/01_basics.pdf · Introduction to Machine Learning Bishop PRML Ch. 1 Alireza Ghane. Course](https://reader035.fdocuments.in/reader035/viewer/2022062414/5f09ebcf7e708231d42924b0/html5/thumbnails/92.jpg)
Course Info. Machine Learning Curve Fitting Decision Theory Probability Theory Conclusion
Conclusion
• Readings: Chapter 1.1, 1.3, 1.5, 2.1
• Types of learning problems
  • Supervised: regression, classification
  • Unsupervised

• Learning as optimization
  • Squared error loss function
  • Maximum likelihood (ML)
  • Maximum a posteriori (MAP)

• Want generalization, avoid over-fitting
  • Cross-validation
  • Regularization
  • Bayesian prior on model parameters
Intro. to Machine Learning Alireza Ghane 91