Machine Learning & Data Mining (CS/CNS/EE 155)
Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization

Transcript of lecture slides.

Page 1:

Machine Learning & Data Mining CS/CNS/EE 155

Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization

Page 2:

Announcements

•  Homework 6 Released – Due Tuesday March 1st

•  Miniproject 2 to be released next week – Due March 8th

Page 3:

Today

•  Some useful matrix properties – Useful for homework

•  Latent Factor Models – Low-rank models with missing values

•  Non-negative matrix factorization

Page 4:

Recap: Orthogonal Matrix

•  A matrix U is orthogonal if U U^T = U^T U = I
   – For any column u: u^T u = 1
   – For any two columns u, u': u^T u' = 0
   – U is a rotation matrix, and U^T is the inverse rotation
   – If x' = U^T x, then x = U x'

[Figure from Lecture 12 (Principal Component Analysis): a point x and its rotated coordinates x']

Page 5:

Recap: Orthogonal Matrix

•  Any subset of columns of U defines a subspace

[Figure from Lecture 12 (Principal Component Analysis): projecting x onto the first principal component, u_1 u_1^T x]

$$x' = U_{1:K}^T x$$
Transform into new coordinates: treat U_{1:K} as the new axes.

$$\mathrm{proj}_{U_{1:K}}(x) = U_{1:K} U_{1:K}^T x$$
Project x onto U_{1:K} in the original space (the "Low Rank" subspace).
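A minimal numpy sketch of these two operations (variable names are illustrative, not from the slides): transforming into the new coordinates and projecting back in the original space.

```python
import numpy as np

# Assumed setup: U is a D x D orthogonal matrix (e.g., from PCA/SVD), x is a D-vector,
# and U_K = U[:, :K] holds the first K columns (the "low rank" subspace).
D, K = 5, 2
U, _ = np.linalg.qr(np.random.randn(D, D))   # random orthogonal matrix
x = np.random.randn(D)
U_K = U[:, :K]

x_new = U_K.T @ x          # x' = U_{1:K}^T x : coordinates in the new axes (K-dim)
x_proj = U_K @ U_K.T @ x   # proj_{U_{1:K}}(x) = U_{1:K} U_{1:K}^T x : projection in original space (D-dim)

# Sanity check: projecting twice changes nothing (projections are idempotent).
assert np.allclose(U_K @ U_K.T @ x_proj, x_proj)
```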

Page 6:

Recap: Singular Value Decomposition

SVD: $X = U \Sigma V^T$, with U orthogonal, Σ diagonal, V orthogonal.

$$X = [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N}$$

"Residual":
$$\sum_{i=1}^{N} \left\| x_i - U_{1:K} U_{1:K}^T x_i \right\|^2$$

U_{1:K} is the K-dimensional subspace with the smallest residual.

Page 7:

Recap: SVD & PCA

SVD: $X = U \Sigma V^T$ (U orthogonal, Σ diagonal, V orthogonal)

$$X X^T = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T$$

PCA: $X X^T = U \Lambda U^T$ (U orthogonal, Λ diagonal), so the PCA eigenvalues are the squared singular values, $\Lambda = \Sigma^2$.
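A small numpy sketch (illustrative, not part of the slides) checking the relation above: the eigenvectors of X X^T are the left singular vectors of X, and the PCA eigenvalues are the squared singular values.

```python
import numpy as np

D, N = 4, 10
X = np.random.randn(D, N)

# SVD: X = U diag(s) V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# PCA-style eigendecomposition of X X^T
lam, U_pca = np.linalg.eigh(X @ X.T)          # eigh returns eigenvalues in ascending order
lam, U_pca = lam[::-1], U_pca[:, ::-1]        # sort descending to match the SVD ordering

# Eigenvalues of X X^T equal the squared singular values of X
assert np.allclose(lam, s**2)

# Columns agree up to sign
assert np.allclose(np.abs(U_pca), np.abs(U))
```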

Page 8:

Recap: Eigenfaces

•  Each column of U' is an "Eigenface"
•  Each column of V'^T = coefficients of a student

[Figure from Lecture 12: X' = U' V'^T decomposition of the (225000-dimensional) face images, relative to the average face]

Page 9:

Matrix Norms

•  Frobenius Norm:
$$\|X\|_{Fro} = \sqrt{\sum_{ij} X_{ij}^2} = \sqrt{\sum_d \sigma_d^2}$$

•  Trace Norm:
$$\|X\|_* = \sum_d \sigma_d = \mathrm{trace}\left(\sqrt{X^T X}\right)$$

where $X = U \Sigma V^T$ and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_D)$.

Each σ_d is guaranteed to be non-negative. By convention: σ_1 ≥ σ_2 ≥ … ≥ σ_D ≥ 0.
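A quick numpy sketch (illustrative) confirming that both norms can be read off the singular values:

```python
import numpy as np

X = np.random.randn(6, 4)
s = np.linalg.svd(X, compute_uv=False)        # singular values, all >= 0, sorted descending

fro = np.sqrt((X**2).sum())                   # ||X||_Fro = sqrt(sum_ij X_ij^2)
trace_norm = s.sum()                          # ||X||_*  = sum_d sigma_d

assert np.allclose(fro, np.sqrt((s**2).sum()))             # Frobenius norm from singular values
assert np.allclose(fro, np.linalg.norm(X, 'fro'))
assert np.allclose(trace_norm, np.linalg.norm(X, 'nuc'))   # trace norm = nuclear norm
```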

Page 10:

Properties of Matrix Norms

$$\|X\|_{Fro}^2 = \mathrm{trace}(X^T X) = \mathrm{trace}\left((U\Sigma V^T)^T U\Sigma V^T\right) = \mathrm{trace}(V\Sigma^2 V^T) = \mathrm{trace}(\Sigma^2 V^T V) = \mathrm{trace}(\Sigma^2) = \sum_d \sigma_d^2$$

Cyclic property of the trace: $\mathrm{trace}(ABC) = \mathrm{trace}(BCA) = \mathrm{trace}(CAB)$

where $X = U\Sigma V^T$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_D)$.

Each σ_d is guaranteed to be non-negative. By convention: σ_1 ≥ σ_2 ≥ … ≥ σ_D ≥ 0.

Page 11:

Properties of Matrix Norms

$$\|X\|_* = \mathrm{trace}\left(\sqrt{(U\Sigma V^T)^T U\Sigma V^T}\right) = \mathrm{trace}\left(\sqrt{V\Sigma U^T U\Sigma V^T}\right) = \mathrm{trace}\left(\sqrt{V\Sigma^2 V^T}\right) = \mathrm{trace}(V\Sigma V^T) = \mathrm{trace}(\Sigma V^T V) = \mathrm{trace}(\Sigma) = \sum_d \sigma_d$$

Cyclic property of the trace: $\mathrm{trace}(ABC) = \mathrm{trace}(BCA) = \mathrm{trace}(CAB)$

where $X = U\Sigma V^T$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_D)$.

Each σ_d is guaranteed to be non-negative. By convention: σ_1 ≥ σ_2 ≥ … ≥ σ_D ≥ 0.

Page 12:

Frobenius Norm = Squared Norm

•  Matrix version of the L2 norm:
$$\|X\|_{Fro}^2 = \sum_{ij} X_{ij}^2 = \sum_d \sigma_d^2$$

•  Useful for regularizing matrix models

where $X = U\Sigma V^T$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_D)$.

Page 13:

Recall: L1 & Sparsity

•  w is sparse if mostly 0's – Small L0 Norm

•  Why not L0 Regularization? – Not continuous!

•  L1 induces sparsity – And is continuous!

$$\arg\min_w \ \lambda \|w\|_0 + \sum_{i=1}^{N} \left(y_i - w^T x_i\right)^2, \qquad \|w\|_0 = \sum_d 1_{[w_d \neq 0]}$$

$$\arg\min_w \ \lambda \|w\|_1 + \sum_{i=1}^{N} \left(y_i - w^T x_i\right)^2$$

(Omitting b & … for simplicity)

Page 14:

Trace Norm = L1 of Eigenvalues

•  A matrix X is considered low rank if it has few non-zero singular values:
$$\mathrm{Rank}(X) = \sum_d 1_{[\sigma_d > 0]}$$
(aka "spectral sparsity"; not continuous!)

$$\|X\|_* = \sum_d \sigma_d = \mathrm{trace}\left(\sqrt{X^T X}\right)$$

where $X = U\Sigma V^T$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_D)$.

Page 15:

Other Useful Properties

•  Cauchy-Schwarz:
$$\langle A, B\rangle^2 = \mathrm{trace}(A^T B)^2 \leq \langle A, A\rangle\,\langle B, B\rangle = \mathrm{trace}(A^T A)\,\mathrm{trace}(B^T B) = \|A\|_F^2\,\|B\|_F^2$$

•  AM-GM Inequality (true for any norm):
$$\|A\|\,\|B\| = \sqrt{\|A\|^2\,\|B\|^2} \leq \tfrac{1}{2}\left(\|A\|^2 + \|B\|^2\right)$$

•  Orthogonal Transformation Invariance of Norms (if U is a full-rank orthogonal matrix):
$$\|UA\|_{F} = \|A\|_{F}, \qquad \|UA\|_* = \|A\|_*$$

•  Trace Norm of Diagonals (if A is a square diagonal matrix):
$$\|A\|_* = \sum_i A_{ii}$$

Page 16:

Recap: SVD & PCA

•  SVD:  $X = U\Sigma V^T$

•  PCA:  $X X^T = U\Sigma^2 U^T$

•  The first K columns of U are the best rank-K subspace, i.e. they minimize the Frobenius norm residual:
$$\left\| X - U_{1:K} U_{1:K}^T X \right\|_{Fro}^2$$
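A short numpy sketch (illustrative) of the rank-K projection and its Frobenius residual:

```python
import numpy as np

D, N, K = 8, 20, 3
X = np.random.randn(D, N)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_K = U[:, :K]

# Project the data onto the top-K left singular vectors
X_hat = U_K @ (U_K.T @ X)

# Frobenius residual || X - U_{1:K} U_{1:K}^T X ||_Fro^2
residual = np.linalg.norm(X - X_hat, 'fro')**2

# The residual equals the sum of the discarded squared singular values
assert np.allclose(residual, (s[K:]**2).sum())
```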

Page 17:

Latent Factor Models

Page 18:

Netflix Problem

•  Y_ij = rating user i gives to movie j

•  Solve using SVD!

[Figure: Y (M users × N movies) = U (M × K) V^T (K × N), the "Latent Factors"]

$$y_{ij} \approx u_i^T v_j$$

Page 19:

Example

Miniproject 2: create your own.

$$y_{ij} \approx u_i^T v_j$$

http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf

Page 20:

Actual Netflix Problem

•  Many missing values!

[Figure: Y (M users × N movies, with missing values) = U (M × K) V^T (K × N), the "Latent Factors"]

Page 21:

Collaborative Filtering

•  M Users, N Items
•  Small subset of user/item pairs have ratings
•  Most are missing

•  Applicable to any user/item rating problem – Amazon, Pandora, etc.

•  Goal: Predict the missing values.

Page 22:

Latent Factor Formulation

•  Only labels, no features:  $S = \{y_{ij}\}$

•  Learn a latent representation over users U and movies V such that:
$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \sum_{ij}\left(y_{ij} - u_i^T v_j\right)^2$$

Page 23:

Connection to Trace Norm

•  Suppose we consider all U, V that achieve perfect reconstruction: Y = UV^T

•  Find U, V with the lowest complexity:
$$\arg\min_{Y=UV^T} \ \frac{1}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right)$$

•  Complexity is equivalent to the trace norm:
$$\|Y\|_* = \min_{Y=UV^T} \ \frac{1}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right)$$

(Prove in homework!)

Page 24:

Proof (One Direction)

We will prove:
$$\|Y\|_* \geq \min_{Y=AB^T} \ \frac{1}{2}\left(\|A\|_{Fro}^2 + \|B\|_{Fro}^2\right)$$

SVD: $Y = U\Sigma V^T$. Choose: $A = U\sqrt{\Sigma}$, $B = V\sqrt{\Sigma}$.

Then:
$$\min_{Y=AB^T} \frac{1}{2}\left(\|A\|_{Fro}^2 + \|B\|_{Fro}^2\right)
\leq \frac{1}{2}\left(\|U\sqrt{\Sigma}\|_{Fro}^2 + \|V\sqrt{\Sigma}\|_{Fro}^2\right)
= \frac{1}{2}\left(\mathrm{trace}\left((U\sqrt{\Sigma})^T U\sqrt{\Sigma}\right) + \mathrm{trace}\left((V\sqrt{\Sigma})^T V\sqrt{\Sigma}\right)\right)$$
$$= \frac{1}{2}\left(\mathrm{trace}\left(\sqrt{\Sigma}\, U^T U \sqrt{\Sigma}\right) + \mathrm{trace}\left(\sqrt{\Sigma}\, V^T V \sqrt{\Sigma}\right)\right)
= \frac{1}{2}\left(\mathrm{trace}\left(\sqrt{\Sigma}\sqrt{\Sigma}\right) + \mathrm{trace}\left(\sqrt{\Sigma}\sqrt{\Sigma}\right)\right)
= \mathrm{trace}(\Sigma) = \|Y\|_*$$
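A small numpy sketch (illustrative) of the construction used in the proof: taking A = U√Σ, B = V√Σ makes ½(||A||² + ||B||²) equal the trace norm of Y.

```python
import numpy as np

Y = np.random.randn(5, 7)
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

A = U * np.sqrt(s)          # A = U sqrt(Sigma): scales column d of U by sqrt(sigma_d)
B = Vt.T * np.sqrt(s)       # B = V sqrt(Sigma)

assert np.allclose(A @ B.T, Y)                 # perfect reconstruction Y = A B^T
half_sum = 0.5 * (np.linalg.norm(A, 'fro')**2 + np.linalg.norm(B, 'fro')**2)
assert np.allclose(half_sum, s.sum())          # equals ||Y||_* = sum_d sigma_d
```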

Page 25:

Interpreting Model

•  Latent-Factor Model Objective:
$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \sum_{ij}\left(y_{ij} - u_i^T v_j\right)^2$$

•  Related to:
$$\arg\min_{W} \ \lambda \|W\|_* + \sum_{ij}\left(y_{ij} - w_{ij}\right)^2, \qquad
\|W\|_* = \min_{W=UV^T} \frac{1}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right)$$

(Equivalent when the rank of U, V matches the rank of W.)

Find the best low-rank approximation to Y!

Page 26:

User/Movie Symmetry

•  If we knew V, then linear regression to learn U – Treat V as features

•  If we knew U, then linear regression to learn V – Treat U as features

$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \sum_{ij}\left(y_{ij} - u_i^T v_j\right)^2$$

Page 27:

Optimization

•  Only train over observed y_ij
•  Two ways to optimize:
   – Gradient Descent
   – Alternating optimization

•  Closed form (for each sub-problem) – Homework question

$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \sum_{ij}\omega_{ij}\left(y_{ij} - u_i^T v_j\right)^2, \qquad \omega_{ij}\in\{0,1\}$$

Page 28:

Gradient Calculation

$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_{ij}\omega_{ij}\left(y_{ij} - u_i^T v_j\right)^2$$

$$\partial_{u_i} = \lambda u_i - \sum_j \omega_{ij}\, v_j\left(y_{ij} - u_i^T v_j\right)$$

Closed-form solution (assuming V fixed):
$$u_i = \left(\lambda I_K + \sum_j \omega_{ij}\, v_j v_j^T\right)^{-1}\left(\sum_j \omega_{ij}\, y_{ij}\, v_j\right)$$

Page 29:

Gradient Descent Options

•  Stochastic Gradient Descent – Update all model parameters for a single data point

•  Alternating SGD – Update a single column of parameters at a time

$$\partial_{u_i} = \lambda u_i - v_j\left(y_{ij} - u_i^T v_j\right), \qquad u_i \leftarrow u_i - \eta\, \partial_{u_i}$$
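A minimal sketch (illustrative; the variable names and sampling scheme are assumptions, not from the slides) of the stochastic gradient update for a single observed rating y_ij; the symmetric update for v_j is included for completeness:

```python
import numpy as np

def sgd_step(U, V, i, j, y_ij, lam=0.1, eta=0.01):
    """One SGD update on the single observed rating y_ij (updates u_i and v_j in place)."""
    err = y_ij - U[i] @ V[j]                 # residual y_ij - u_i^T v_j
    grad_u = lam * U[i] - err * V[j]         # gradient of (lam/2)||u_i||^2 + 1/2 (y_ij - u_i^T v_j)^2
    grad_v = lam * V[j] - err * U[i]
    U[i] -= eta * grad_u
    V[j] -= eta * grad_v

# Toy usage: M users, N movies, K latent dimensions
M, N, K = 50, 40, 5
U, V = 0.1 * np.random.randn(M, K), 0.1 * np.random.randn(N, K)
sgd_step(U, V, i=3, j=7, y_ij=4.0)
```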

Page 30:

Alternating Optimization

•  Initialize U & V randomly
•  Loop (see the sketch below):
   – Choose the next u_i or v_j
   – Solve optimally (assuming all other variables fixed):

$$u_i = \left(\lambda I_K + \sum_j \omega_{ij}\, v_j v_j^T\right)^{-1}\left(\sum_j \omega_{ij}\, y_{ij}\, v_j\right)$$
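A compact numpy sketch (illustrative, assuming a dense rating matrix Y with a binary observation mask W) of alternating optimization with the closed-form update above:

```python
import numpy as np

def als(Y, W, K=5, lam=0.1, n_iters=20):
    """Alternating least squares: Y (M x N) ratings, W (M x N) binary mask of observed entries."""
    M, N = Y.shape
    U = 0.1 * np.random.randn(M, K)
    V = 0.1 * np.random.randn(N, K)
    for _ in range(n_iters):
        # Fix V, solve each u_i: (lam I + sum_j w_ij v_j v_j^T) u_i = sum_j w_ij y_ij v_j
        for i in range(M):
            A = lam * np.eye(K) + (V.T * W[i]) @ V
            b = V.T @ (W[i] * Y[i])
            U[i] = np.linalg.solve(A, b)
        # Fix U, solve each v_j symmetrically
        for j in range(N):
            A = lam * np.eye(K) + (U.T * W[:, j]) @ U
            b = U.T @ (W[:, j] * Y[:, j])
            V[j] = np.linalg.solve(A, b)
    return U, V

# Toy usage: ~30% of the ratings observed
M, N = 30, 20
Y = np.random.rand(M, N) * 5
W = (np.random.rand(M, N) < 0.3).astype(float)
U, V = als(Y * W, W)
```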

Page 31:

Tradeoffs

•  Alternating optimization is much faster in terms of #iterations
   – But requires inverting a matrix:
$$u_i = \left(\lambda I_K + \sum_j \omega_{ij}\, v_j v_j^T\right)^{-1}\left(\sum_j \omega_{ij}\, y_{ij}\, v_j\right)$$

•  Gradient descent is faster for high-dimensional problems
   – Also allows for streaming data
$$u_i \leftarrow u_i - \eta\, \partial_{u_i}$$

Page 32:

[Figure from http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf]

Page 33:

Recap: Collaborative Filtering

•  Goal: predict every user/item rating

•  Challenge: only a small subset is observed

•  Assumption: there exists a low-rank subspace that captures all the variability in describing different users and items

Page 34:

Aside: Multitask Learning

•  M Tasks:  $S_m = \{(x_i, y_i^m)\}_{i=1}^N$

•  Example: personalized recommender system – One task per user:

$$\arg\min_{W} \ \frac{\lambda}{2} R(W) + \frac{1}{2}\sum_m \sum_i \left(y_i^m - w_m^T x_i\right)^2$$

(R(W) is the regularizer.)

Page 35:

How to Regularize?

$$S_m = \{(x_i, y_i^m)\}_{i=1}^N, \qquad \arg\min_{W} \ \frac{\lambda}{2}R(W) + \frac{1}{2}\sum_m \sum_i\left(y_i^m - w_m^T x_i\right)^2$$

•  Standard L2 Norm:
$$\arg\min_{W} \ \frac{\lambda}{2}\|W\|^2 + \sum_m\sum_i \left(y_i^m - w_m^T x_i\right)^2 = \sum_m\left(\frac{\lambda}{2}\|w_m\|^2 + \sum_i \left(y_i^m - w_m^T x_i\right)^2\right)$$

•  Decomposes to independent tasks – For each task, learn D parameters

Page 36:

How to Regularize?

$$S_m = \{(x_i, y_i^m)\}_{i=1}^N, \qquad \arg\min_{W} \ \frac{\lambda}{2}R(W) + \frac{1}{2}\sum_m \sum_i\left(y_i^m - w_m^T x_i\right)^2$$

•  Trace Norm:
$$\arg\min_{W} \ \frac{\lambda}{2}\|W\|_* + \sum_m\sum_i \left(y_i^m - w_m^T x_i\right)^2$$

•  Induces W to have low rank across all tasks

Page 37:

Recall: Trace Norm & Latent Factor Models

•  Suppose we consider all U, V that achieve perfect reconstruction: W = UV^T

•  Find U, V with the lowest complexity:
$$\arg\min_{W=UV^T} \ \frac{1}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right)$$

•  Claim: complexity is equivalent to the trace norm:
$$\|W\|_* = \min_{W=UV^T} \ \frac{1}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right)$$

Page 38:

How to Regularize?

$$S_m = \{(x_i, y_i^m)\}_{i=1}^N, \qquad \arg\min_{W} \ \frac{\lambda}{2}R(W) + \frac{1}{2}\sum_m \sum_i\left(y_i^m - w_m^T x_i\right)^2$$

•  Latent Factor Approach:
$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_m\sum_i \left(y_i^m - u_m^T V x_i\right)^2$$

•  Learns a feature projection x' = Vx
•  Learns a K-dimensional model per task

Page 39:

Tradeoff

•  D*N parameters:
$$\arg\min_{W} \ \sum_m\left(\frac{\lambda}{2}\|w_m\|^2 + \frac{1}{2}\sum_i \left(y_i^m - w_m^T x_i\right)^2\right)$$

•  D*K + N*K parameters:
$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_m\sum_i \left(y_i^m - u_m^T V x_i\right)^2$$
   – Statistically more efficient
   – Great if the low-rank assumption is a good one

Page 40:

Multitask Learning

•  M Tasks:  $S_m = \{(x_i, y_i^m)\}_{i=1}^N$

•  Example: personalized recommender system
   – One task per user
   – If x is a topic feature representation:
      •  V is a subspace of correlated topics
      •  Projects multiple topics together

$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_m\sum_i \left(y_i^m - u_m^T V x_i\right)^2$$

Page 41:

Reduction to Collaborative Filtering

•  Suppose each x_i is a single indicator, x_i = e_i
•  Then: $V x_i = v_i$
•  Exactly Collaborative Filtering!

$$S_m = \{(x_i, y_i^m)\}_{i=1}^N, \qquad
\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_m\sum_i \left(y_i^m - u_m^T V x_i\right)^2$$

$$x_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T \ \Rightarrow\ V x_i = v_i$$

$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_m\sum_i \left(y_i^m - u_m^T v_i\right)^2$$

Page 42:

Latent Factor Multitask Learning vs Collaborative Filtering

Multitask learning:
•  Projects x into a low-dimensional subspace Vx
•  Learns a low-dimensional model per task
$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_m\sum_i \left(y_i^m - u_m^T V x_i\right)^2$$

Collaborative filtering:
•  Creates a low-dimensional feature for each movie
•  Learns a low-dimensional model per user
$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_m\sum_i \left(y_i^m - u_m^T v_i\right)^2$$

Page 43:

General Bilinear Models

•  Users described by features z
•  Items described by features x

•  Learn a projection of z and x into a common low-dimensional space – Linear model in the low-dimensional space

$$S = \{(x_i, z_i, y_i)\}, \qquad
\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \sum_i \left(y_i - z_i^T U^T V x_i\right)^2$$

Page 44:

Why are Bilinear Models Useful?

$$S = \{(x_i, z_i, y_i)\}$$

General bilinear model (U: F×K, V: D×K):
$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_i \left(y_i - z_i^T U^T V x_i\right)^2$$

Multitask / latent factor model (U: M×K, V: D×K):
$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_m\sum_i \left(y_i - u_m^T V x_i\right)^2$$

Collaborative filtering (U: M×K, V: N×K):
$$\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_m\sum_i \left(y_i - u_m^T v_i\right)^2$$

Page 45:

Story So Far: Latent Factor Models

•  Simplest Case: reduces to SVD of the matrix Y
   – No missing values
   – (z, x) indicator features

•  General Case: projects a high-dimensional feature representation into a low-dimensional linear model

$$S = \{(x_i, z_i, y_i)\}, \qquad
\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \frac{1}{2}\sum_i \left(y_i - z_i^T U^T V x_i\right)^2$$

Page 46:

Non-Negative Matrix Factorization

Page 47:

Limitations of PCA & SVD

[Figure: data with all features non-negative; the PCA/SVD solution vs. a better solution?]

Page 48:

Non-Negative Matrix Factorization

•  Assume Y is non-negative
•  Find non-negative U & V

[Figure: Y (M × N) = U (M × K) V^T (K × N), the "Latent Factors"]

Page 49:

CS 155 Non-Negative Face Basis

[Figure: non-negative basis faces with example non-negative reconstruction coefficients]

Page 50:

CS 155 Eigenface Basis

[Figure: eigenface basis]

Page 51:

Aside: Non-Orthogonal Projections

•  If the columns of A are not orthogonal, A^T A ≠ I
   – How to reverse the transformation x' = A^T x?
   – Solution: Pseudoinverse!

SVD: $A = U\Sigma V^T$

Pseudoinverse: $A^+ = V\Sigma^+ U^T$, where $\Sigma^+ = \mathrm{diag}(\sigma_1^+, \ldots, \sigma_D^+)$ and
$$\sigma^+ = \begin{cases} 1/\sigma & \text{if } \sigma > 0 \\ 0 & \text{otherwise} \end{cases}$$

$$A^{+T} A^T x = U\Sigma^+ V^T V \Sigma U^T x = U_{1:K} U_{1:K}^T x$$

Intuition: use the rank-K orthogonal basis that spans A.
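A small numpy sketch (illustrative) of reversing a non-orthogonal transformation via the pseudoinverse, matching the identity above:

```python
import numpy as np

D, K = 6, 3
A = np.random.randn(D, K)                  # columns are not orthogonal, so A^T A != I
x = np.random.randn(D)

x_prime = A.T @ x                          # forward transform x' = A^T x
A_pinv = np.linalg.pinv(A)                 # A^+ = V Sigma^+ U^T

# A^{+T} A^T x recovers the projection of x onto the column span of A
x_back = A_pinv.T @ x_prime

# Same thing via the rank-K orthogonal basis that spans A (from the thin SVD)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(x_back, U @ (U.T @ x))
```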

Page 52:

Objective Function

$$\arg\min_{U\geq 0,\, V\geq 0} \ \sum_{ij} \ell\left(y_{ij},\, u_i^T v_j\right)$$

•  Squared Loss – Penalizes squared distance:
$$\ell(a, b) = (a - b)^2$$

•  Generalized Relative Entropy – aka unnormalized KL divergence – Penalizes the ratio:
$$\ell(a, b) = a \log\frac{a}{b} - a + b$$

•  Train using gradient descent (see the sketch below)

http://hebb.mit.edu/people/seung/papers/nmfconverge.pdf
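A minimal sketch (illustrative) of training NMF with the squared loss by gradient descent, projecting back onto the non-negative orthant after each step. This is plain projected gradient descent, not the multiplicative updates from the linked paper.

```python
import numpy as np

def nmf_pgd(Y, K=10, eta=1e-3, n_iters=500):
    """Non-negative matrix factorization of a non-negative Y (M x N) with squared loss."""
    M, N = Y.shape
    rng = np.random.default_rng(0)
    U = rng.random((M, K))        # initialize non-negative
    V = rng.random((N, K))
    for _ in range(n_iters):
        R = U @ V.T - Y           # reconstruction residual
        grad_U = R @ V            # gradient of 1/2 sum_ij (u_i^T v_j - y_ij)^2 w.r.t. U
        grad_V = R.T @ U
        U = np.maximum(U - eta * grad_U, 0.0)   # gradient step + projection onto U >= 0
        V = np.maximum(V - eta * grad_V, 0.0)
    return U, V

# Toy usage on a non-negative matrix
Y = np.random.rand(40, 30)
U, V = nmf_pgd(Y, K=5)
print(np.linalg.norm(Y - U @ V.T, 'fro'))
```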

Page 53:

SVD/PCA vs NNMF

•  SVD/PCA:
   – Finds the best orthogonal basis faces
      •  Basis faces can be negative
   – Coefficients can be negative
   – Often trickier to visualize
   – Better reconstructions with fewer basis faces
      •  Basis faces capture the most variation

•  NNMF:
   – Finds the best set of non-negative basis faces
   – Non-negative coefficients
      •  Often non-overlapping
   – Easier to visualize
   – Requires more basis faces for good reconstructions

Page 54:

Non-Negative Latent Factor Models

$$S = \{(x_i, z_i, y_i)\}, \qquad
\arg\min_{U,V} \ \frac{\lambda}{2}\left(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\right) + \sum_i \ell\left(y_i,\, z_i^T U^T V x_i\right)$$

•  Simplest Case: reduces to NNMF of the matrix Y
   – No missing values
   – (z, x) indicator features

•  General Case: projects high-dimensional non-negative features into a low-dimensional non-negative linear model

Page 55:

Modeling NBA Gameplay Using Non-Negative Spatial Latent Factor Models

Page 56:

Fine-Grained Spatial Models

$F_s(x)$:
•  Discretize the court
   – 1×1 foot cells
   – 2000 cells
•  1 weight per cell
   – 2000 weights

[Figure: discretized court (40 feet shown); Figs. 5 & 6 from the paper depict the spatial coefficients of the learned shooting factors (L) and receiving-pass factors (M), plus players with high affinity to each factor]

Page 57:

Fine-Grained Spatial Models

$F_s(x)$:
•  Discretize the court
   – 1×1 foot cells
   – 2000 cells
•  1 weight per cell
   – 2000 weights

But most players haven't played that much!

Page 58:

Latent Factor Model (Shooting)

Visualizing location factors L; visualizing players B_b^T L.

•  Shooting Score:
$$F_s(x) = B_{b(x)}^T L_{l(x)}$$
(F_s = B^T L, with Player Factors B and Location Factors L, each K-dimensional.)

Assume that players can be represented by a K-dimensional vector; learn a common K-dimensional latent feature representation.

[Figure: Figs. 5 & 6 from the paper, "Shooting Factors (L)" and "Receiving Pass Factors (M)": spatial coefficients of the latent factors, and players with high affinity to each factor]

http://www.yisongyue.com/publications/icdm2014_bball_predict.pdf

Page 59:

Training Data

STATS SportsVU, 2012/2013 Season, 630 Games, 80K Possessions, 380 frames per possession

Page 60:

Prediction

•  Game state: x
   – Coordinates of all players
   – Who is the ball handler

•  Event: y
   – Ball handler will shoot
   – Ball handler will pass (to whom?)
   – Ball handler will hold onto the ball
   – 6 possibilities

•  Goal: Learn P(y|x)

http://www.yisongyue.com/publications/icdm2014_bball_predict.pdf

Page 61:

Logistic Regression (Simple Version: Just for Shooting)

$$P(y \mid x) = \frac{\exp\{F(y \mid x)\}}{Z(x \mid F)}, \qquad
Z(x \mid F) = \sum_{y' \in \{s, \perp\}} \exp\{F(y' \mid x)\}$$

$$F(y' \mid x) = \begin{cases} F_s(x) & y' = s \ \text{(shot)} \\ F_\perp & y' = \perp \ \text{(hold on to ball; offset or bias)} \end{cases}$$

$$P(y = s \mid x) = \frac{1}{1 + \exp\{-F_s(x) + F_\perp\}}$$

Shooting score (latent factor model): $F_s(x) = B_{b(x)}^T L_{l(x)}$, with Player Factors B and Location Factors L, learned as a common K-dimensional latent representation.

Page 62:

Learning the Model

•  Given training data:  $S = \{(x, y)\}$  (player configuration, what happened next)

•  Learn the parameters of the model:
$$\arg\min_{F_s, F_\perp} \ \frac{\lambda}{2}\|F_s\|^2 + \sum_{(x,y)\in S} \ell\left(y,\, F_s(x) - F_\perp\right)$$
where ℓ is the log loss and
$$P(y = s \mid x) = \frac{1}{1 + \exp\{-F_s(x) + F_\perp\}}$$

Page 63:

Optimization via Gradient Descent

$$\arg\min_{B\geq 0,\, L\geq 0,\, F_\perp} \ \frac{\lambda}{2}\left(\|B\|^2 + \|L\|^2\right) + \sum_{(x,y)} \ell\left(y,\, B_{b(x)}^T L_{l(x)} - F_\perp\right)$$

$$\partial_{L_i} = \lambda L_i - \sum_{(x,y)} \frac{\partial \log P(y \mid x)}{\partial L_i}, \qquad
\frac{\partial \log P(y \mid x)}{\partial L_i} = \left(1_{[y=s]} - P(s \mid x)\right) B_{b(x)}$$

(Shooting score: $F_s(x) = B_{b(x)}^T L_{l(x)}$, with Player Factors B and Location Factors L; players are represented by a common K-dimensional latent feature representation.)

http://www.yisongyue.com/publications/icdm2014_bball_predict.pdf
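A sketch (illustrative; the function and variable names are assumptions, not from the paper) of one gradient step on a location factor L_i from a single frame, following the gradient above; clipping to zero is one simple way to keep the factors non-negative.

```python
import numpy as np

def update_location_factor(L, B, F_perp, i, b, y_is_shot, lam=0.01, eta=0.1):
    """One gradient step on L[i] from a frame where player b handles the ball at cell i."""
    # P(shot | x) under the simple two-event logistic model
    F_s = B[b] @ L[i]
    p_shot = 1.0 / (1.0 + np.exp(-F_s + F_perp))
    # d log P(y|x) / dL_i = (1[y = shot] - P(shot|x)) * B_b
    grad_log_lik = (float(y_is_shot) - p_shot) * B[b]
    grad = lam * L[i] - grad_log_lik
    # Gradient step, then project onto the non-negative orthant (L >= 0)
    L[i] = np.maximum(L[i] - eta * grad, 0.0)

# Toy usage: 2000 court cells, 100 players, K = 10 latent dimensions
n_cells, n_players, K = 2000, 100, 10
L = np.random.rand(n_cells, K)
B = np.random.rand(n_players, K)
update_location_factor(L, B, F_perp=0.0, i=42, b=7, y_is_shot=True)
```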

Page 64:

Where are Players Likely to Receive Passes?

Visualizing Location Factors M

Enforce Non-Negativity (Accuracy Worse) (More Interpretable)

Latent Factor Model (Passing):
$$F_p(i, x) = P_i^T M_{l(i,x)}$$
with Player Factors P and Location Factors M ($F_p = P^T M$).

[Figure: Fig. 6 from the paper, "Receiving Pass Factors (M)": spatial coefficients of the latent factors for receiving a pass at each location, and players with high affinity to each factor]

http://www.yisongyue.com/publications/icdm2014_bball_predict.pdf

Page 65: Machine(Learning(&(DataMining€¦ · Machine(Learning(&(DataMining (CS/CNS/EE155& Lecture(13:(LatentFactor(Models(&(Non9Negave(Matrix(Factorizaon(Lecture(13:(LatentFactor(Models(&(Non9Negave(Matrix

1 2 3 4 5 6 7 8 9 10 11 12

Passer Factors (Q1)

1 2 3 4 5 6 7 8 9 10 11 12

Receiver Factors (Q2)

Fig. 7. Top Row: Depicting the spatial coefficents of the latent factors corresponding to passing from each location. Bottom Row: Depicting the associatedspatial coefficents of the latent corresponding to receiving a pass at each location. Figure 8 shows examples of aggregate passing and receiving coefficients.

A. Where do Players Tend to Shoot?

We begin by inspecting the player-by-location shooting factors B and L. We estimated a 10-dimensional representation, whose factors visually correspond to known shooting patterns. Our results are shown in Figure 5. The top row depicts the ten learned factors (i.e., the rows of L), and the bottom row depicts the spatial coefficients of players which have high affinity for each of the factors. The first three factors correspond to players who shoot at the 3-point line. The next three factors correspond to players who shoot midrange shots. The seventh factor corresponds to players who shoot just inside the lane. The eighth factor corresponds to players who shoot while attacking the basket from the baseline. The final two factors correspond to players who shoot near the basket.

We note that the first factor and the last two factors have the strongest coefficients, which is unsurprising since those are the most valuable shooting locations. Note that the first factor also corresponds to the large error reduction in the "corner three" locations in the middle row of Figure 4(b). We further note that none of the factors have strong coefficients along the baseline behind the basket, which is also a region of large error reduction in the middle row of Figure 4(b).

A similar analysis was conducted by [8], but with two key differences. First, [8] did not consider the in-game state but only modeled the shot selection chart. In contrast, we are interested in modeling the conditional probability of a player shooting given the current game state. For example, a player may often occupy a certain location without shooting the ball. Second, as described below, we also model and analyze many more spatial factors than just player shooting tendencies.

B. Where do Players Tend to Receive Passes?

We now inspect the player-by-location passing factors P and M. We estimated a 10-dimensional representation, whose factors visually correspond to known passing patterns. The top row in Figure 6 shows the ten learned factors (i.e., the rows of M), and the bottom row depicts the spatial coefficients of players which have high affinity for each of the factors.

The first factor corresponds to players that receive passes while cutting across the baseline. In fact, many of the factors give off a visual effect of being in motion (i.e., spatial blur),

Fig. 8. Top Row: Depicting spatial coefficients of passing to different locations on the court. The "X" denotes the location of the passer and corresponds to a row in Q1^T Q2. Bottom Row: Depicting the spatial coefficients of receiving a pass from different locations on the court. The "X" denotes the location of the receiver, and corresponds to a column in Q1^T Q2.

which suggests that players are often moving right before receiving a pass.10 The second factor corresponds to players that receive passes in the low post, which also corresponds to areas of high error reduction in the top row of Figure 4(b).11 The next three factors depict various midrange regions. Typically, right-handed players associate more with the fourth factor, and left-handed players associate more with the fifth factor.12 The next three factors depict regions at the three-point line, and the final two factors depict regions in the backcourt.

The players with the strongest coefficients tend to be from playoff teams that have tracking equipment deployed on their home courts during the 2012-2013 NBA season, since those players have the most training data available for them. For example, we see that Figure 5 and Figure 6 contain several players from the San Antonio Spurs, which had the most games in our dataset.

10 This effect is most likely an artifact of how we formulated our task to predict events up to 1 second into the future. It may be beneficial to model multiple time intervals into the future (e.g., half a second, one second).

11 Note that although players often receive passes in the low post, they typically do not shoot from that location, as evidenced in the shooting factors in Figure 5. Capturing this sharp spatial contrast between receiving passes and shooting contributes to the error reduction in those regions in Figure 4(b).

12 This effect is due to players preferring to drive with their strong hand.
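The inspection procedure described above (treat each row of L or M as a spatial factor, then look at which players have the highest affinity for it) is straightforward to reproduce once factor matrices are learned. A minimal sketch, with hypothetical player names and a random non-negative factor matrix standing in for the learned one:

```python
import numpy as np

K, D = 10, 400                                    # latent dims, number of players (illustrative)
rng = np.random.default_rng(1)
B = rng.exponential(size=(K, D))                  # player shooting factors (non-negative stand-in)
player_names = [f"player_{j}" for j in range(D)]  # placeholder names

# For each latent factor k, rank players by their coefficient B[k, j]:
# these are the players with the highest affinity for that spatial pattern.
for k in range(K):
    top = np.argsort(B[k])[::-1][:5]
    print(f"factor {k}: " + ", ".join(player_names[j] for j in top))
```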


How  do  passes  tend  to  flow?  

Q1  

Q2  

65  http://www.yisongyue.com/publications/icdm2014_bball_predict.pdf

Latent Factor Model (Passing) #2

•  Passing Score (adds a location-to-location term):  F_p = P^T M + Q_1^T Q_2

[Diagram: Q_1^T (D × K) × Q_2 (K × D) gives a D × D matrix of location-to-location scores; Q_1 holds the Source Location Factors (the passer's court cell) and Q_2 the Target Location Factors (the receiver's court cell)]

Lecture 15: Recent Applications of Latent Factor Models   26

http://www.yisongyue.com/publications/icdm2014_bball_predict.pdf

F_p(i, x) = P_i^T M_{ℓ(i,x)} + Q_{1,ℓ(x)}^T Q_{2,ℓ(i,x)}

Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization
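A sketch of how the extended score would be computed, again with illustrative shapes and random factors; the indexing convention for ℓ(x) (the passer's cell) and ℓ(i,x) (player i's cell) is an assumption read off the formula above. The first term is player-specific, the second depends only on the source and target court cells.

```python
import numpy as np

K, D = 10, 300                        # latent dims, number of court cells (illustrative)
n_players = 400
rng = np.random.default_rng(2)
P  = rng.normal(size=(K, n_players))  # player passing factors
M  = rng.normal(size=(K, D))          # location passing factors
Q1 = rng.normal(size=(K, D))          # source-location factors (passer's cell)
Q2 = rng.normal(size=(K, D))          # target-location factors (receiver's cell)

def passing_score(i, ball_cell, recv_cell):
    """F_p(i, x) = P_i^T M_{l(i,x)} + Q1_{l(x)}^T Q2_{l(i,x)}."""
    return P[:, i] @ M[:, recv_cell] + Q1[:, ball_cell] @ Q2[:, recv_cell]

# Aggregate pass-flow coefficients, as in Fig. 8: entry (s, t) of Q1^T Q2
# scores passes from source cell s to target cell t.
flow = Q1.T @ Q2
```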

Page 66: Machine Learning & Data Mining CS/CNS/EE 155, Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization

How do passes tend to flow?

Passing  From  “X”  

Passing  To  “X”  

66  http://www.yisongyue.com/publications/icdm2014_bball_predict.pdf


Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization

Page 67: Machine Learning & Data Mining CS/CNS/EE 155, Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization

Tensor  Latent  Factor  Models  

Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization   67

Page 68: Machine Learning & Data Mining CS/CNS/EE 155, Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization

Tensor Factorization

Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization   68

[Diagram: tensor factorization with missing values — Y (an M × N × Q tensor with missing entries) ≈ factor matrices U (M × K), V (N × K), and W (Q × K)]
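To make the diagram concrete, a minimal NumPy sketch; the dimension names follow the diagram and the factors are random placeholders. Each entry of the reconstructed tensor is a 3-way product of one row from each factor matrix, summed over the K latent dimensions (the 3-way dot product defined on the next slide).

```python
import numpy as np

M_, N_, Q_, K = 50, 40, 30, 5          # tensor dims and latent dim (illustrative)
rng = np.random.default_rng(3)
U = rng.normal(size=(M_, K))           # factors for the first mode
V = rng.normal(size=(N_, K))           # factors for the second mode
W = rng.normal(size=(Q_, K))           # factors for the third mode

# Rank-K reconstruction: Y_hat[i, j, q] = sum_k U[i, k] * V[j, k] * W[q, k]
Y_hat = np.einsum('ik,jk,qk->ijq', U, V, W)

# A single entry is the 3-way dot product of the three latent vectors:
i, j, q = 7, 3, 11
assert np.isclose(Y_hat[i, j, q], np.sum(U[i] * V[j] * W[q]))
```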

Page 69: Machine Learning & Data Mining CS/CNS/EE 155, Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization

Tri-Linear Model

•  Prediction via 3-way dot product:
   –  Related to Hadamard Product

•  Example: online advertising
   –  User profile z
   –  Item description x
   –  Query q

Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization   69

argmin_{U,V,W}  (λ/2) ( ||U||_Fro^2 + ||V||_Fro^2 + ||W||_Fro^2 ) + Σ_i ℓ( y_i, ⟨U^T z_i, V^T x_i, W^T q_i⟩ )

⟨a, b, c⟩ = Σ_k a_k b_k c_k

Solve using Gradient Descent
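The objective above can be minimized with plain stochastic gradient descent. A minimal sketch with a squared loss ℓ(y, f) = ½(y − f)²; the loss, step size, initialization, and function names are illustrative choices, not prescribed by the slide.

```python
import numpy as np

def three_way_dot(a, b, c):
    """<a, b, c> = sum_k a_k * b_k * c_k"""
    return np.sum(a * b * c)

def fit_trilinear(data, dz, dx, dq, K=5, lam=0.1, lr=0.01, epochs=50, seed=0):
    """SGD on lam/2*(||U||^2 + ||V||^2 + ||W||^2) + sum_i 0.5*(y_i - <U^T z_i, V^T x_i, W^T q_i>)^2.
    `data` is a list of (z, x, q, y) tuples with z, x, q given as 1-D feature vectors."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(dz, K))
    V = 0.1 * rng.normal(size=(dx, K))
    W = 0.1 * rng.normal(size=(dq, K))
    for _ in range(epochs):
        for z, x, q, y in data:
            a, b, c = U.T @ z, V.T @ x, W.T @ q    # K-dim latent projections
            err = three_way_dot(a, b, c) - y       # residual of the prediction
            # Gradient of the squared loss w.r.t. each factor matrix is an outer
            # product of its input with the residual-weighted elementwise product
            # of the other two latent vectors; lam*U etc. come from the Frobenius penalty.
            U -= lr * (np.outer(z, err * b * c) + lam * U)
            V -= lr * (np.outer(x, err * a * c) + lam * V)
            W -= lr * (np.outer(q, err * a * b) + lam * W)
    return U, V, W
```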

Page 70: Machine Learning & Data Mining CS/CNS/EE 155, Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization

Summary: Latent Factor Models

•  Learns a low-rank model of a matrix of observations Y
   –  Dimensions of Y can have various semantics

•  Can tolerate missing values in Y

•  Can also use features

•  Widely used in industry

Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization   70

Netflix Problem

•  Y_ij = rating user i gives to movie j

•  Solve using SVD!

Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization   18

[Diagram: Y (M Users × N Movies) = U V^T, with U (M × K) and V (N × K) the "Latent Factors"]

y_ij ≈ u_i^T v_j
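As a concrete illustration of the low-rank model y_ij ≈ u_i^T v_j when most of Y is missing, here is a minimal SGD sketch that fits the factors from observed entries only. The squared loss, L2 regularization, and hyperparameters are illustrative assumptions; the recap slide itself poses the fully observed SVD version.

```python
import numpy as np

def factorize(ratings, n_users, n_movies, K=20, lam=0.1, lr=0.01, epochs=50, seed=0):
    """Fit y_ij ~ u_i^T v_j from an incomplete ratings matrix given as
    (user, movie, rating) triples, using SGD over the observed entries only."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(n_users, K))
    V = 0.1 * rng.normal(size=(n_movies, K))
    for _ in range(epochs):
        for i, j, y in ratings:
            err = U[i] @ V[j] - y                       # residual on this observed entry
            U[i], V[j] = (U[i] - lr * (err * V[j] + lam * U[i]),
                          V[j] - lr * (err * U[i] + lam * V[j]))
    return U, V

# Toy usage: three observed ratings; every other entry of Y is missing.
U, V = factorize([(0, 1, 5.0), (0, 2, 3.0), (1, 1, 4.0)], n_users=2, n_movies=3)
print(np.round(U @ V.T, 2))   # predicted ratings for all (user, movie) pairs
```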

Page 71: Machine Learning & Data Mining CS/CNS/EE 155, Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization

Next  Week    

•  Embeddings  

•  Deep  Learning  

•  Next Thursday: Recitation on Advanced Optimization Techniques

Lecture 13: Latent Factor Models & Non-Negative Matrix Factorization   71