PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 1: INTRODUCTION

Transcript of PATTERN RECOGNITION AND MACHINE LEARNING, Chapter 1: Introduction.

Page 1: PATTERN RECOGNITION AND MACHINE LEARNING (szhou/568/BayesianCurveFitting_1.pdf)


Page 2:

Example  

Handwritten Digit Recognition

Page 3:

Polynomial Curve Fitting

• The input data set x of 10 points was generated by choosing values of x_n, for n = 1, . . . , 10, spaced uniformly in the range [0, 1].
• The target data set t was obtained by first computing the corresponding values of the function sin(2πx) and then adding a small level of random noise having a Gaussian distribution to each value t_n.
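This generation procedure can be sketched in plain Python. The noise standard deviation of 0.3 and the seed below are illustrative assumptions; the slide only says "a small level" of noise:

```python
import math
import random

def make_dataset(n=10, noise_sd=0.3, seed=0):
    """Toy curve-fitting data: x_n spaced uniformly in [0, 1],
    t_n = sin(2*pi*x_n) plus Gaussian noise (noise_sd is an assumption)."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]   # uniform spacing over [0, 1]
    ts = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ts

xs, ts = make_dataset()
```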

Page 4:

Sum-of-Squares Error Function

Note: E(w) is a quadratic function of each w_i, so the partial derivative of E(w) with respect to each w_i is a linear function of the w's. Setting all of these derivatives to zero therefore yields a system of linear equations with a unique minimizer w*.

This method is also called linear least squares.
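As a sketch of linear least squares for the polynomial model y(x, w) = Σ_j w_j x^j: the code below builds and solves the normal equations (AᵀA) w = Aᵀt directly (function names are ours, not from the slides; for larger problems one would use a numerically stabler decomposition):

```python
def polyfit_ls(xs, ts, degree):
    """Minimize E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 for
    y(x, w) = sum_j w_j * x^j by solving (A^T A) w = A^T t."""
    m = degree + 1
    # Normal-equation matrix A^T A and right-hand side A^T t.
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(t * x ** i for x, t in zip(xs, ts)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    w = [0.0] * m
    for i in range(m - 1, -1, -1):
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, m))) / a[i][i]
    return w

# Example: points lying exactly on t = 1 + 2x are fit perfectly.
w = polyfit_ls([0.0, 0.5, 1.0], [1.0, 2.0, 3.0], degree=1)
```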

Page 5:

0th  Order  Polynomial  

Page 6:

1st  Order  Polynomial  

Page 7:

3rd  Order  Polynomial  

Page 8:

9th  Order  Polynomial  

Page 9:

Over-fitting

Root-Mean-Square (RMS) Error:

Test set: 100 data points generated using exactly the same procedure used to generate the training set points, but with new choices for the random noise values included in the target values t_n.
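A minimal implementation of the RMS error, assuming the standard PRML definition E_RMS = sqrt(2 E(w*) / N) (the formula itself appears only as an image on the slide):

```python
import math

def rms_error(w, xs, ts):
    """E_RMS = sqrt(2 * E(w) / N), E(w) the sum-of-squares error.
    Dividing by N lets data sets of different sizes be compared, and the
    square root puts the error on the same scale as the targets t."""
    def y(x):
        return sum(wj * x ** j for j, wj in enumerate(w))
    e = 0.5 * sum((y(x) - t) ** 2 for x, t in zip(xs, ts))
    return math.sqrt(2 * e / len(xs))

# A perfect fit (t = 1 + 2x with w = [1, 2]) gives zero RMS error.
err = rms_error([1.0, 2.0], [0.0, 0.5, 1.0], [1.0, 2.0, 3.0])
```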

Page 10:

Polynomial  Coefficients      

Page 11:

Data  Set  Size:    

9th  Order  Polynomial  

Page 12:

Data  Set  Size:    9th  Order  Polynomial  

One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model. (However, the number of parameters is not necessarily the most appropriate measure of model complexity, understood as how hard the model is to learn from limited data.)

Page 13:

Regularization

Penalize  large  coefficient  values  
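A sketch of the penalized fit: adding the term λ/2 · ||w||² to the sum-of-squares error turns the normal equations into (AᵀA + λI) w = Aᵀt, which shrinks large coefficients. (PRML's formulation often leaves w_0 out of the penalty; for simplicity this sketch penalizes every coefficient.)

```python
def polyfit_ridge(xs, ts, degree, lam):
    """Minimize 1/2 * sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2
    by solving (A^T A + lam * I) w = A^T t (all coefficients penalized)."""
    m = degree + 1
    a = [[sum(x ** (i + j) for x in xs) + (lam if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    b = [sum(t * x ** i for x, t in zip(xs, ts)) for i in range(m)]
    for col in range(m):                      # Gaussian elimination
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    w = [0.0] * m
    for i in range(m - 1, -1, -1):
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, m))) / a[i][i]
    return w

# lam = 0 recovers ordinary least squares; a large lam shrinks w toward 0.
w0 = polyfit_ridge([0.0, 0.5, 1.0], [1.0, 2.0, 3.0], 1, 0.0)
w_big = polyfit_ridge([0.0, 0.5, 1.0], [1.0, 2.0, 3.0], 1, 100.0)
```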

Page 14:

Regularization:

Page 15:

Regularization:

Page 16:

Regularization:          vs.

Page 17:

Polynomial  Coefficients      

Page 18:

Probability  Theory  

Marginal Probability    Conditional Probability    Joint Probability

 

Page 19:

Probability  Theory  

Sum  Rule        

Product  Rule    

Page 20:

The Rules of Probability

Sum Rule:  p(X) = Σ_Y p(X, Y)

Product Rule:  p(X, Y) = p(Y | X) p(X).  Also: p(X, Y) = p(X | Y) p(Y)

Equating the two factorizations gives p(Y | X) p(X) = p(X | Y) p(Y).

Thus:
1. p(Y | X) = p(X | Y) p(Y) / p(X)   (Bayes' theorem)
2. p(X) = Σ_Y p(X | Y) p(Y)   (the denominator, by the sum and product rules)
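These rules can be checked numerically on a small made-up joint distribution (the probability values below are arbitrary illustrations):

```python
# Toy joint distribution p(X, Y) over X in {0, 1}, Y in {0, 1}.
p_joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def p_x(x):   # sum rule: p(X) = sum_Y p(X, Y)
    return sum(p for (xi, _), p in p_joint.items() if xi == x)

def p_y(y):
    return sum(p for (_, yi), p in p_joint.items() if yi == y)

def p_y_given_x(y, x):   # product rule rearranged: p(Y|X) = p(X, Y) / p(X)
    return p_joint[(x, y)] / p_x(x)

def p_x_given_y(x, y):
    return p_joint[(x, y)] / p_y(y)

# Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X) must agree with the
# conditional computed directly from the joint table.
bayes = p_x_given_y(0, 1) * p_y(1) / p_x(0)
direct = p_y_given_x(1, 0)
```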

Page 21:

Bayes’  Theorem  

posterior  ∝  likelihood  ×  prior  

Page 22:

Probability  Theory  

Apples  and  Oranges  

p(B = r) = 4/10,  p(B = b) = 6/10
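With the box priors 4/10 and 6/10 from the slide, the posterior over boxes given an observed fruit follows from Bayes' theorem. The per-box fruit proportions below are the textbook's example values (red box: 2 apples, 6 oranges; blue box: 3 apples, 1 orange), assumed here since the slide shows them only as a figure:

```python
# Priors over boxes: p(B=r) = 4/10, p(B=b) = 6/10.
p_box = {"r": 0.4, "b": 0.6}
# Assumed fruit proportions per box (PRML's example counts).
p_fruit_given_box = {"r": {"apple": 2 / 8, "orange": 6 / 8},
                     "b": {"apple": 3 / 4, "orange": 1 / 4}}

def posterior_box(fruit):
    """p(B | F = fruit) = p(F | B) p(B) / p(F), by Bayes' theorem."""
    evidence = sum(p_fruit_given_box[b][fruit] * p_box[b] for b in p_box)
    return {b: p_fruit_given_box[b][fruit] * p_box[b] / evidence
            for b in p_box}

post = posterior_box("orange")
```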

Page 23:

Bayes’  Theorem  Generalized  

p(t | x) = Σ_w p(t | x, w) p(w | x)

p(w | t, x) = p(t | x, w) p(w | x) / p(t | x)

p(w | t) = p(t | w) p(w) / p(t)

Page 24:

Probability Densities

The density is the derivative of the cumulative distribution function: P′(x) = p(x).

Page 25:

Transformed Densities

Suppose x = g(y). Then the corresponding densities are related by p_y(y) = p_x(g(y)) |g′(y)|, where the factor |g′(y)| accounts for the change of scale under the transformation.
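The transformation rule can be checked by sampling; the particular choice g(y) = y² on [0, 1] is ours. If x is uniform on [0, 1] and x = g(y) = y², then y = √x has density p_y(y) = 1 · |2y| = 2y, so P(a < y < b) should be b² − a²:

```python
import random

rng = random.Random(42)
n = 200_000
# Draw x ~ Uniform(0, 1) and transform: y = g^{-1}(x) = sqrt(x).
ys = [rng.random() ** 0.5 for _ in range(n)]

a, b = 0.2, 0.7
frac = sum(a < y < b for y in ys) / n   # empirical P(a < y < b)
exact = b * b - a * a                   # integral of p_y(y) = 2y over (a, b)
```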

Page 26:

Expectations

E[f] = Σ_x p(x) f(x)   (discrete),   E[f] = ∫ p(x) f(x) dx   (continuous)

Conditional Expectation (discrete):  E_x[f | y] = Σ_x p(x | y) f(x)

Approximate Expectation (discrete and continuous):  E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n), for samples x_n drawn from p(x)
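The approximate (Monte Carlo) expectation in a few lines; the test function f(x) = x² and the Gaussian sampling distribution are our illustrative choices. For x ~ N(0, 1), E[x²] = 1 exactly, so the sample average should be close to 1:

```python
import random

# E[f] ~= (1/N) * sum_n f(x_n), with x_n drawn from p(x).
rng = random.Random(0)
n = 100_000
estimate = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
```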

Page 27:

Variances and Covariances

var[x] = E[(x − E[x])²]   (the square of the standard deviation)

cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])];  for a vector x, cov[x] is a symmetric matrix
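These definitions in plain Python, treating a list of samples as a uniform empirical distribution (so each mean is a simple average):

```python
def mean(v):
    return sum(v) / len(v)

def var(v):
    """var[x] = E[(x - E[x])^2]; its square root is the standard deviation."""
    m = mean(v)
    return mean([(x - m) ** 2 for x in v])

def cov(u, v):
    """cov[x, y] = E[(x - E[x]) * (y - E[y])].  Since cov(u, v) == cov(v, u),
    the covariance matrix of a vector is symmetric."""
    mu, mv = mean(u), mean(v)
    return mean([(x - mu) * (y - mv) for x, y in zip(u, v)])

u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 1.0, 4.0, 3.0]
```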

Page 28:

The Gaussian Distribution

Page 29:

Gaussian  Mean  and  Variance  

µ  (mean)

σ²  (variance)

β = 1/σ²  (precision: the bigger β is, the smaller σ² is, thus the more "precise" the distribution is)

Page 30:

The Gaussian Distribution

For the normal distribution, values less than one standard deviation away from the mean account for 68.27% of the set, values within two standard deviations account for 95.45%, and values within three standard deviations account for 99.73%.
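These percentages follow from the Gaussian CDF: the probability of falling within k standard deviations of the mean is erf(k/√2), which the standard library can evaluate directly:

```python
import math

def within_k_sigma(k):
    """P(|x - mu| < k * sigma) for a Gaussian = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2.0))

fracs = [within_k_sigma(k) for k in (1, 2, 3)]
```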

Page 31:

The Multivariate Gaussian

Gaussian distribution defined over a D-dimensional vector x of continuous variables:

N(x | µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ))

where the D-dimensional vector µ is called the mean, the D × D symmetric matrix Σ is called the covariance, and |Σ| denotes the determinant of Σ.
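A sketch evaluating this density for D = 2 with an explicit 2×2 inverse and determinant (illustrative only; the example covariance matrix is ours):

```python
import math

def mvn_pdf_2d(x, mu, sigma):
    """N(x | mu, Sigma) for D = 2:
    (2*pi)^(-D/2) * |Sigma|^(-1/2) * exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))."""
    (a, b), (c, d) = sigma
    det = a * d - b * c                                  # |Sigma|
    inv = [[d / det, -b / det], [-c / det, a / det]]     # Sigma^{-1}
    dx = [x[0] - mu[0], x[1] - mu[1]]
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    # For D = 2, (2*pi)^(D/2) = 2*pi.
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

# At x = mu the exponent is 0, so the density equals 1 / (2*pi*sqrt(|Sigma|)).
peak = mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]])
```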