Page 1

Bias-Variance Tradeoff and Model Selection

Aarti Singh and Eric Xing

Machine Learning 10-701/15-781, Oct 1, 2012

Page 2

True vs. Empirical Risk

True Risk: target performance measure
    Classification – probability of misclassification
    Regression – mean squared error
measured on a random test point (X, Y).

Empirical Risk: performance on training data
    Classification – proportion of misclassified examples
    Regression – average squared error
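The two empirical-risk definitions above can be written down directly; a minimal sketch in plain Python (the function names are illustrative):

```python
def empirical_risk_classification(y_true, y_pred):
    """Proportion of misclassified examples on the training data."""
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / len(y_true)

def empirical_risk_regression(y_true, y_pred):
    """Average squared error on the training data."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

print(empirical_risk_classification([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.25
print(empirical_risk_regression([1.0, 2.0], [1.5, 2.0]))          # 0.125
```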

Page 3

Overfitting

Is the following predictor a good one? What is its empirical risk (its performance on the training data)?

    Zero!

What about its true risk?

    Greater than zero.

It will predict very poorly on a new random test point: large generalization error!
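A predictor that simply memorizes the training data makes this concrete: its empirical risk is exactly zero, yet its error on fresh data stays large. A small sketch (the noisy dataset and the memorizing predictor are illustrative assumptions, not from the slides):

```python
import random
random.seed(0)

# Hypothetical noisy training data: y = x + Gaussian noise, with distinct x's.
train = [(x / 10, x / 10 + random.gauss(0, 0.5)) for x in range(10)]

def memorizer(x):
    """Return the training label of the nearest training point (memorization)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Empirical risk (average squared error on the training data) is exactly zero...
emp = sum((memorizer(x) - y) ** 2 for x, y in train) / len(train)
print(emp)  # 0.0

# ...but on fresh noisy test points the error is far from zero.
test = [(x / 10 + 0.03, x / 10 + 0.03 + random.gauss(0, 0.5)) for x in range(10)]
true_est = sum((memorizer(x) - y) ** 2 for x, y in test) / len(test)
print(true_est > emp)  # True
```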

Page 4

Overfitting

If  we  allow  very  complicated  predictors,  we  could  overfit  the  training  data.    

Examples: Classification (0-NN classifier)

[Figure: classifying "Football player?" (Yes/No) from Height and Weight]

Page 5

Overfitting

[Figure: polynomial regression fits of order k = 1, 2, 3, 7 on the same training data]

If  we  allow  very  complicated  predictors,  we  could  overfit  the  training  data.    

Examples: Regression (polynomial of order k – degree up to k−1)

Page 6

Effect of Model Complexity

Empirical risk is no longer a good indicator of true risk.

[Figure: true and empirical risk vs. model complexity, for a fixed number of training data]

If we allow very complicated predictors, we could overfit the training data.

Page 7

Behavior of True Risk

Regression – True Risk (mean squared error):

    R(f) = E[(f(X) − Y)^2]
         = E[(f(X) − E[f(X)])^2] + E[(E[f(X)] − f*(X))^2] + σ^2
           (variance)              (bias^2)                 (Bayes error = R(f*))
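The decomposition can be checked by simulation. A Monte-Carlo sketch, under assumptions made up here for illustration (f*(x) = x², Gaussian noise, an order-0 polynomial model, and evaluation at a single fixed test point x0 rather than averaged over X):

```python
import random
random.seed(1)

def f_star(x):
    return x ** 2

def train_constant_predictor(n=20, sigma=0.3):
    # "Train" the order-0 polynomial: the constant prediction is the mean label.
    ys = [f_star(random.random()) + random.gauss(0, sigma) for _ in range(n)]
    return sum(ys) / len(ys)

# Many independent training sets -> many predictors.
preds = [train_constant_predictor() for _ in range(2000)]
mean_pred = sum(preds) / len(preds)

x0 = 0.9  # fixed test point (an illustrative choice)
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
bias_sq = (mean_pred - f_star(x0)) ** 2

print("variance:", round(variance, 4))
print("bias^2:  ", round(bias_sq, 4))
print(bias_sq > variance)  # True: this too-restricted model is dominated by bias
```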

Page 8

Bias – Variance Tradeoff

Regression – Excess Risk:

    R(f̂_n) − R(f*) = variance + bias^2

Notice: the optimal predictor does not have zero error – the noise variance σ^2 remains as a random component, while the bias^2 term reflects the model restriction.

Page 9

Bias – Variance Tradeoff: Derivation

Regression: expand R(f̂_n) around the optimal predictor; the cross term in the expansion is 0.

Notice: the optimal predictor does not have zero error, R(f*) > 0.

Page 10

Bias – Variance Tradeoff: Derivation

Regression (derivation, continued):

Notice: the optimal predictor does not have zero error, R(f*) > 0.

Variance – how much does the predictor f̂_n vary about its mean for different training datasets?

Page 11

Bias – Variance Tradeoff: Derivation

The cross term is 0 since the noise is independent and zero mean, leaving the noise variance.

Second term: bias^2 – how much does the mean of the predictor differ from the optimal predictor?

Page 12

Bias – Variance Tradeoff

Large bias, small variance – poor approximation, but robust/stable.

Small bias, large variance – good approximation, but unstable.

[Figure: fits of a simple model and a complex model on 3 independent training datasets]

Page 13

Behavior of True Risk

Want f̂_n to be as good as the optimal predictor f*.

Excess risk:

    R(f̂_n) − R(f*) = [R(f̂_n) − inf_{f∈F} R(f)] + [inf_{f∈F} R(f) − R(f*)]

        Estimation error – due to randomness of the training data.
        Approximation error – due to the restriction of the model class.

Page 14

Behavior of True Risk

    R(f̂_n) − R(f*) = [R(f̂_n) − inf_{f∈F} R(f)] + [inf_{f∈F} R(f) − R(f*)]

        (estimation error + approximation error)

Page 15

Examples of Model Spaces

Model Spaces with increasing complexity:

•  Nearest-Neighbor classifiers with varying neighborhood sizes k = 1, 2, 3, …
       Small neighborhood => higher complexity

•  Decision Trees with depth k or with k leaves
       Higher depth / more leaves => higher complexity

•  Regression with polynomials of order k = 0, 1, 2, …
       Higher degree => higher complexity

•  Kernel Regression with bandwidth h
       Small bandwidth => higher complexity

How can we select the right complexity model?

Page 16

Model Selection

Setup: model classes F1, F2, F3, … of increasing complexity.

We can select the right complexity model in a data-driven/adaptive way:

§  Hold-out or Cross-validation

§  Structural Risk Minimization

§  Complexity Regularization

§  Information Criteria – AIC, BIC, Minimum Description Length (MDL)

Page 17

Hold-out method

We would like to pick the model that has the smallest generalization error. We can judge generalization error using an independent sample of data.

Hold-out procedure: n data points available.

1) Split into two sets (NOT test data!):
       Training dataset D_T
       Validation dataset D_V

2) Use D_T to train a predictor from each model class, evaluated on the training dataset D_T.

Page 18

Hold-out method

3) Use D_V to select the model class which has the smallest empirical error on D_V (evaluated on the validation dataset D_V).

4) Output the hold-out predictor from the selected model class.

Intuition: small error on one set of data will not imply small error on a randomly sub-sampled second set of data unless the procedure is "stable" – validating on an independent set ensures this.
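Steps 1–4 can be sketched end to end. Everything here is illustrative (noisy linear data; the "model classes" are k-NN regressors with different neighborhood sizes k):

```python
import random
random.seed(2)

def knn_predict(train, x, k):
    """k-NN regression: average the labels of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def avg_sq_error(train, holdout, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in holdout) / len(holdout)

xs = [random.random() for _ in range(40)]
data = [(x, 2 * x + random.gauss(0, 0.2)) for x in xs]

D_T, D_V = data[:30], data[30:]                                 # 1) split (not test data!)
errors = {k: avg_sq_error(D_T, D_V, k) for k in (1, 3, 5, 9)}   # 2) train, then validate
best_k = min(errors, key=errors.get)                            # 3) smallest error on D_V
print(best_k, round(errors[best_k], 3))                         # 4) the hold-out choice
```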

Page 19

Hold-out method

Drawbacks:

§  We may not have enough data to afford setting one subset aside for getting a sense of generalization abilities.

§  The validation error may be misleading (a bad estimate of generalization error) if we get an "unfortunate" split.

Limitations of hold-out can be overcome by a family of random sub-sampling methods, at the expense of more computation.

Page 20

Cross-validation

K-fold cross-validation

1) Create a K-fold partition of the dataset.

2) Form K hold-out predictors, each time using one partition as validation and the remaining K−1 as training datasets.

K predictors for each model class: { f̂_1, f̂_2, …, f̂_K }

[Figure: Run 1, Run 2, …, Run K – in each run a different fold serves as the validation set, the rest as training]
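The K-fold partition in step 1 can be sketched as follows (K and the dataset size are illustrative; striped index assignment is one simple scheme):

```python
def k_fold_splits(n, K):
    """Yield (train_idx, val_idx) pairs: each fold is the validation set once."""
    folds = [list(range(i, n, K)) for i in range(K)]
    for i in range(K):
        val = folds[i]
        train = [j for k, f in enumerate(folds) if k != i for j in f]
        yield train, val

splits = list(k_fold_splits(10, 5))
print(len(splits))  # 5 runs, one hold-out predictor per run
print(sorted(splits[0][0] + splits[0][1]) == list(range(10)))  # True
```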

Page 21

Cross-validation

Leave-one-out (LOO) cross-validation

1) Special case of K-fold with K = n partitions.

2) Equivalently, train on n−1 samples and validate on only one sample per run, for n runs.

K predictors for each model class: { f̂_1, f̂_2, …, f̂_K }

[Figure: Run 1, Run 2, …, Run K – each run holds out a single sample for validation]

Page 22

Cross-validation

Random subsampling

1) Randomly subsample a fixed fraction αn (0 < α < 1) of the dataset for validation.

2) Form the hold-out predictor with the remaining data as training data. Repeat K times.

K predictors for each model class: { f̂_1, f̂_2, …, f̂_K }

[Figure: Run 1, Run 2, …, Run K – each run validates on a random αn-subset]

Page 23

Model selection by Cross-validation

3) Use D_V to select the model class which has the smallest empirical error on D_V (evaluated on the validation dataset D_V).

4) Cross-validated predictor: the final predictor f̂ is the average/majority vote over the K hold-out estimates { f̂_1, f̂_2, …, f̂_K }.

Page 24

Estimating generalization error

We want to estimate the error of a predictor f̂ based on n data points.

Hold-out (≡ 1-fold): the error estimate is the empirical error on the single validation set.
K-fold / LOO / random sub-sampling: the error estimate is the average of the validation errors of the K predictors f̂_1, …, f̂_K.

If K is large (close to n), the bias of the error estimate is small, since each training set has close to n data points. However, the variance of the error estimate is high, since each validation set has few data points and might deviate a lot from the mean.

[Figure: Run 1, Run 2, …, Run K – training and validation splits]

Page 25

Practical Issues in Cross-validation

How do we decide the values for K and α?

§  Large K
    +  The bias of the error estimate will be small (many training points).
    -  The variance of the error estimate will be large (few validation points).
    -  The computational time will be very large as well (many experiments).

§  Small K
    +  The number of experiments and, therefore, the computation time are reduced.
    +  The variance of the error estimate will be small (many validation points).
    -  The bias of the error estimate will be large (few training points).

Common choice: K = 10, α = 0.1.

Page 26

Structural Risk Minimization

Penalize models using a bound on the deviation of the true and empirical risks. With high probability (via concentration bounds, covered later),

    R(f) ≤ R̂_n(f) + C(f)

– a high-probability upper bound on the true risk, where the bound C(f) on the deviation from the true risk is large for complex models.

Page 27

Structural Risk Minimization

Deviation bounds are typically pretty loose for small sample sizes. In practice, choose the penalty by cross-validation!

Problem: identify a flood plain from noisy satellite images.

[Figure: noiseless image, noisy image, true flood plain (elevation level > x)]

Page 28

Structural Risk Minimization (continued)

[Figure: flood-plain estimates for the same problem – the true flood plain (elevation level > x) and estimates under the theoretical penalty, the CV penalty, and zero penalty]

Page 29

Occam's Razor

William of Ockham (1285–1349), Principle of Parsimony: "One should not increase, beyond what is necessary, the number of entities required to explain anything."

Alternatively, seek the simplest explanation. Penalize complex models based on:

•  Prior information (bias)

•  Information Criteria (MDL, AIC, BIC)

Page 30

Importance of Domain Knowledge

Compton Gamma-Ray Observatory Burst and Transient Source Experiment (BATSE) – distribution of photon arrivals

Oil Spill Contamination

Page 31

Complexity Regularization

Penalize complex models using prior knowledge. The cost of a model plays the role of a (negative) log prior.

Bayesian viewpoint:

    prior probability of f, p(f): the cost is small if f is highly probable, and large if f is improbable.

    ERM (empirical risk minimization) over a restricted class F ≡ a uniform prior on f ∈ F, with zero probability for other predictors.

Page 32

Complexity Regularization

Penalize complex models using prior knowledge (cost of model ≡ log prior).

Examples:

    MAP estimators

    Regularized Linear Regression – Ridge Regression, Lasso: penalize models based on some norm of the regression coefficients.

How to choose the tuning parameter λ? Cross-validation!
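As a sketch of how such a penalty behaves, here is one-dimensional ridge regression through the origin (an illustrative special case with a closed form): the coefficient shrinks toward zero as the tuning parameter λ grows.

```python
# beta_hat = argmin_beta sum (y - beta*x)^2 + lam * beta^2
#          = sum(x*y) / (sum(x^2) + lam)
def ridge_coef(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(ridge_coef(xs, ys, 0.0))   # 2.0 – ordinary least squares
print(ridge_coef(xs, ys, 14.0))  # 1.0 – the penalty shrinks the coefficient
```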

Page 33

Information Criteria – AIC, BIC

Penalize complex models based on their information content.

    AIC (Akaike IC):  C(f) = # parameters

    BIC (Bayesian IC):  C(f) = # parameters × log n

BIC penalizes complex models more heavily – it limits the complexity of models as the number of training data n becomes large.
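The two penalty terms above can be compared directly (a sketch of just the C(f) penalties; the data-fit term is assumed to come from the fitting procedure):

```python
import math

# Penalty terms exactly as defined above.
def aic_penalty(num_params):
    return num_params

def bic_penalty(num_params, n):
    return num_params * math.log(n)

print(aic_penalty(3))                        # 3
print(bic_penalty(3, 100) > aic_penalty(3))  # True: BIC penalizes more for large n
```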

Page 34

Information Criteria – MDL

Penalize complex models based on their information content: C(f) = # bits needed to describe f (its description length).

MDL (Minimum Description Length) example: binary dyadic decision trees with k leaves have 2k − 1 nodes, so the description takes

    (2k − 1) bits to encode the tree structure  +  k bits to encode the label (0/1) of each leaf.

For example, 5 leaves => 9 bits to encode the structure.
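The description-length arithmetic above is easy to check (the function name is illustrative):

```python
# Description length of a binary dyadic decision tree with k leaves,
# following the slide: (2k - 1) bits for structure + k bits for leaf labels.
def description_length_bits(k_leaves):
    return (2 * k_leaves - 1) + k_leaves

print(2 * 5 - 1)                   # 9 bits of structure for 5 leaves
print(description_length_bits(5))  # 14 bits in total
```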

Page 35

Summary

True and Empirical Risk

Overfitting

Bias vs. Variance tradeoff; Approximation error vs. Estimation error

Model Selection, Estimating Generalization Error:

§  Hold-out, K-fold cross-validation

§  Structural Risk Minimization

§  Complexity Regularization

§  Information Criteria – AIC, BIC, MDL