CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter...

31
CSE546: Linear Regression Bias / Variance Tradeoff Winter 2012 Luke ZeBlemoyer Slides adapted from Carlos Guestrin

Transcript of CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter...

Page 1: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

CSE546:  Linear  Regression  Bias  /  Variance  Tradeoff  

Winter  2012  

Luke  ZeBlemoyer      

Slides  adapted  from  Carlos  Guestrin  

Page 2: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

PredicJon  of  conJnuous  variables  •  Billionaire  says:  Wait,  that’s  not  what  I  meant!            •  You  say:  Chill  out,  dude.  •  He  says:  I  want  to  predict  a  conJnuous  variable  for  conJnuous  inputs:  I  want  to  predict  salaries  from  GPA.  

•  You  say:  I  can  regress  that…    

Page 3: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

0 20 0

20

40

0 10 20 30 40

0 10

20 30

20 22 24 26

Linear Regression

Prediction Prediction

Page 4: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Ordinary Least Squares (OLS)

0 20 0

Error or “residual”

Prediction

Observation

Page 5: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

The  regression  problem  •  Instances:  <xj,  tj>  •  Learn:  Mapping  from  x  to  t(x)  •  Hypothesis  space:  

–  Given,  basis  funcJons  {h1,…,hk} –  Find  coeffs  w={w1,…,wk}  

–  Why  is  this  usually  called  linear  regression?  •  model  is  linear  in  the  parameters  •  Can  we  esJmate  funcJons  that  are  not  lines???  

•  Precisely,  minimize  the  residual  squared  error:  

Page 6: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Regression:  matrix  notaJon  

N data points

K basis functions

N observed outputs

measurements weights

K basis func

Page 7: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Regression  soluJon:  simple  matrix  math  

where

k×k matrix for k basis functions

k×1 vector

Page 8: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

But,  why?  •  Billionaire  (again)  says:  Why  sum  squared  error???  

•  You  say:  Gaussians,  Dr.  Gateson,  Gaussians…  

•  Model:  predicJon  is  linear  funcJon  plus  Gaussian  noise  –  t(x)  =  ∑i  wi  hi(x)  +  ε

 

•  Learn  w  using  MLE:  

Page 9: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Maximizing  log-­‐likelihood  Maximize wrt w:

Least-squares Linear Regression is MLE for Gaussians!!!

µMLE , σMLE = argmaxµ,σ

P (D | µ,σ)

= −N�

i=1

(xi − µ)

σ2= 0

= −N�

i=1

xi +Nµ = 0

= −N

σ+

N�

i=1

(xi − µ)2

σ3= 0

argmaxwln

�1

σ√2π

�N

+N�

j=1

−[tj −�

i wihi(xj)]2

2σ2

= argmaxw

N�

j=1

−[tj −�

i wihi(xj)]2

2σ2

2

µMLE , σMLE = argmaxµ,σ

P (D | µ,σ)

= −N�

i=1

(xi − µ)

σ2= 0

= −N�

i=1

xi +Nµ = 0

= −N

σ+

N�

i=1

(xi − µ)2

σ3= 0

argmaxwln

�1

σ√2π

�N

+N�

j=1

−[tj −�

i wihi(xj)]2

2σ2

= argmaxw

N�

j=1

−[tj −�

i wihi(xj)]2

2σ2

2

µMLE , σMLE = argmaxµ,σ

P (D | µ,σ)

= −N�

i=1

(xi − µ)

σ2= 0

= −N�

i=1

xi +Nµ = 0

= −N

σ+

N�

i=1

(xi − µ)2

σ3= 0

argmaxwln

�1

σ√2π

�N

+N�

j=1

−[tj −�

i wihi(xj)]2

2σ2

= argmaxw

N�

j=1

−[tj −�

i wihi(xj)]2

2σ2

= argminw

N�

j=1

[tj −�

i

wihi(xj)]2

2

Page 10: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Bias-­‐Variance  tradeoff  –  IntuiJon    

•  Model  too  simple:  does  not  fit  the  data  well  – A  biased  soluJon  

 •  Model  too  complex:  small  changes  to  the  data,  soluJon  changes  a  lot  – A  high-­‐variance  soluJon  

x

t

M = 0

0 1

−1

0

1

x

t

M = 9

0 1

−1

0

1

Page 11: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

(Squared)  Bias  of  learner  •  Given:  dataset  D  with  m  samples  •  Learn:  for  different  datasets  D,  you  will  get  different  

funcJons  h(x)  •  Expected  predicJon  (averaged  over  hypotheses):  ED[h(x)] •  Bias:  difference  between  expected  predicJon  and  truth  

– Measures  how  well  you  expect  to  represent  true  soluJon  

–  Decreases  with  more  complex  model    

Page 12: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Variance  of  learner  •  Given:  dataset  D  with  m  samples  •  Learn:  for  different  datasets  D,  you  will  get  different  

funcJons  h(x)  •  Expected  predicJon  (averaged  over  hypotheses):  ED[h(x)] •  Variance:  difference  between  what  you  expect  to  learn  and  

what  you  learn  from  a  from  a  parJcular  dataset    – Measures  how  sensiJve  learner  is  to  specific  dataset  –  Decreases  with  simpler  model  

Page 13: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Bias–Variance  decomposiJon  of  error  

•  Consider  simple  regression  problem  f:XàT                  f(x)  =  g(x)  +  ε

•  Collect  some  data,  and  learn  a  funcJon  h(x)  •  What  are  sources  of  predicJon  error?      

noise  ~  N(0,σ)  

determinisJc  

Page 14: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Sources  of  error  1  –  noise    

•  What  if  we  have  perfect  learner,  infinite  data?  –  If  our  learning  soluJon  h(x)  saJsfies  h(x)=g(x)  – SJll  have  remaining,  unavoidable  error  of                                                    σ2  due  to  noise  ε

f(x) = g(x) + ε

Page 15: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Sources  of  error  2  –  Finite  data  

• What  if  we  have  imperfect  learner,  or  only  m  training  examples?  

•  What  is  our  expected  squared  error  per  example?  –  ExpectaJon  taken  over  random  training  sets  D  of  size  m,  drawn  from  distribuJon  P(X,T)  

 

f(x) = g(x) + ε

Page 16: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Bias-­‐Variance  DecomposiJon  of  Error  Assume target function: t(x) = g(x) + ε

•  Then expected squared error over fixed size training sets D drawn from P(X,T) can be expressed as sum of three components:

Where:

Bishop Chapter 3

Page 17: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Bias-­‐Variance  Tradeoff  •  Choice  of  hypothesis  class  introduces  learning  bias  – More  complex  class  →  less  bias  – More  complex  class  →  more  variance  

Page 18: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Training  set  error  

•  Given  a  dataset  (Training  data)  •  Choose  a  loss  funcJon  

– e.g.,  squared  error  (L2)  for  regression  •  Training  error:  For  a  parJcular  set  of  parameters,  loss  funcJon  on  training  data:  

Page 19: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Training  error  as  a  funcJon  of  model  complexity  

Page 20: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

PredicJon  error  

•  Training  set  error  can  be  poor  measure  of  “quality”  of  soluJon  

•  PredicJon  error  (true  error):  We  really  care  about  error  over  all  possibiliJes:  

Page 21: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

PredicJon  error  as  a  funcJon  of  model  complexity  

Page 22: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

CompuJng  predicJon  error  •  To  correctly  predict  error  

•  Monte Carlo integration (sampling approximation) •  Sample a set of i.i.d. points {x1,…,xM} from p(x) •  Approximate integral with sample average

•  Hard integral! •  May not know t(x) for every x, may not know p(x)

Page 23: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Why  training  set  error  doesn’t  approximate  predicJon  error?  

•  Sampling  approximaJon  of  predicJon  error:  

•  Training  error  :  

•  Very  similar  equaJons!!!    –  Why  is  training  set  a  bad  measure  of  predicJon  error???  

Page 24: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Why  training  set  error  doesn’t  approximate  predicJon  error?  

•  Sampling  approximaJon  of  predicJon  error:  

•  Training  error  :  

•  Very  similar  equaJons!!!    –  Why  is  training  set  a  bad  measure  of  predicJon  error???  

Because you cheated!!!

Training error good estimate for a single w, But you optimized w with respect to the training error,

and found w that is good for this set of samples

Training error is a (optimistically) biased estimate of prediction error

Page 25: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Test  set  error  •  Given  a  dataset,  randomly  split  it  into  two  parts:    – Training  data  –  {x1,…,  xNtrain}  – Test  data  –  {x1,…,  xNtest}  

•  Use  training  data  to  opJmize  parameters  w  •  Test  set  error:  For  the  final  solu)on  w*,  evaluate  the  error  using:  

Page 26: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Test  set  error  as  a  funcJon  of  model  complexity  

Page 27: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Overfirng:  this  slide  is  so  important  we  are  looking  at  it  again!  

•  Assume:  – Data  generated  from  distribuJon  D(X,Y) –  A hypothesis space H

•  Define:  errors  for  hypothesis  h ∈ H –  Training error: errortrain(h) –  Data (true) error: errortrue(h)

•  We say h overfits the training data if there exists an h’ ∈ H such that:

errortrain(h) < errortrain(h’) and errortrue(h) > errortrue(h’)

Page 28: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Summary:  error  esJmators    

•  Gold  Standard:  

 •  Training:  opJmisJcally  biased  

•  Test:  our  final  meaure,  unbiased?  

Page 29: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Error  as  a  funcJon  of  number  of  training  examples  for  a  fixed  model  complexity  

little data infinite data

Page 30: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

Summary:  error  esJmators    

•  Gold  Standard:  

 •  Training:  opJmisJcally  biased  

•  Test:  our  final  meaure,  unbiased?  

Be careful!!!

Test set only unbiased if you never never ever ever do any any any any learning on the test data

For example, if you use the test set to select

the degree of the polynomial… no longer unbiased!!! (We will address this problem later in the semester)

Page 31: CSE546:(Linear(Regression( Bias(/(Variance(Tradeoff( Winter ...courses.cs.washington.edu/courses/cse546/12wi/slides/cse...Maximizing(loghlikelihood(Maximize wrt w: Least-squares Linear

What  you  need  to  know  •  Regression  

–  Basis  funcJon  =  features  –  OpJmizing  sum  squared  error  –  RelaJonship  between  regression  and  Gaussians  

•  Bias-­‐Variance  trade-­‐off  •  Play  with  Applet  

–  hBp://mste.illinois.edu/users/exner/java.f/leastsquares/  

•  True  error,  training  error,  test  error  –  Never  learn  on  the  test  data  

•  Overfirng