Improving Response Prediction for Dyadic Data
Nik Tuzov
April 2008
http://www.stat.purdue.edu/~ntuzov/
Dyadic Data
• Dyadic data: a “response value” is associated with a pair of objects
Applications:
• Social networks
• Internet advertising
• Recommendation systems
Unsupervised learning
• Example: Collaborative filtering (MovieLens project)
User \ Movie   1    2    3    4    5
User 1         A    B    C    D    A
User 2         A    B    C    C    A
User 3         A    B    C    X?   A
User 4         Y?   -    -    -    B
User 5         -    -    -    -    -
• Movie 1 is “similar” to 5, hence Y is likely “B”
• Users 1, 2, 3 are “similar” to each other, hence X is likely “C” or “D”
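The reasoning behind the guess for X can be sketched in a few lines of Python. This is only a toy illustration, not the MovieLens method: the `ratings` dictionary copies the rows for users 1 to 3 from the example, and the helper names `agreement` and `candidates` are hypothetical.

```python
# Ratings of users 1-3 from the example; movie 4 of user 3 is the unknown X.
ratings = {
    1: {1: "A", 2: "B", 3: "C", 4: "D", 5: "A"},
    2: {1: "A", 2: "B", 3: "C", 4: "C", 5: "A"},
    3: {1: "A", 2: "B", 3: "C", 5: "A"},
}

def agreement(u, v):
    """Fraction of co-rated movies on which users u and v gave the same rating."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    return sum(ratings[u][m] == ratings[v][m] for m in shared) / len(shared)

def candidates(user, movie, threshold=1.0):
    """Ratings given to `movie` by users whose agreement with `user` reaches `threshold`."""
    return sorted({
        ratings[v][movie]
        for v in ratings
        if v != user and movie in ratings[v] and agreement(user, v) >= threshold
    })

x_candidates = candidates(user=3, movie=4)  # ["C", "D"]
```

Users 1 and 2 agree with user 3 on every co-rated movie, so their two ratings of movie 4 become the candidates for X.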
Co-clustering with Bregman differences
• K*L rectangular clusters – direct products of row/column clusters
Movie rating $y_{ij}$ belongs to cluster $(\rho(i), \gamma(j))$, where
$\rho(i) = I$, I takes values from 1 to K
$\gamma(j) = J$, J takes values from 1 to L
Co-clustering with Bregman differences
(example from http://videolectures.net/kdd07_agarwal_pdlfm/)
PDLF-GLM Model
(Agarwal & Merugu ’07)
$Y = [y_{ij}]$ - response matrix (movie ratings), $i = 1, \dots, m$, $j = 1, \dots, n$
$x_{ij}$ - s-dimensional observed covariate
K*L - number of clusters or "blocks" in the response matrix
$\delta_{IJ}$ - scalar "offset" or "interaction term" for cluster (I, J), $I = 1, \dots, K$, $J = 1, \dots, L$
$\pi_{IJ}$ - proportion of $[y_{ij}]$ belonging to cluster (I, J)
$f(y; \theta)$ - probability density of a GLM model; $\theta$ is a scalar parameter
$\beta$ - s-dimensional vector of regression coefficients of the GLM model
Then the conditional density of $y_{ij}$ is represented as:
$$p(y_{ij} \mid x_{ij}) = \sum_{I, J} \pi_{IJ} \, f(y_{ij}; \beta^t x_{ij} + \delta_{IJ})$$
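The conditional density is a mixture over the K*L blocks, with each block shifting the linear predictor by its own offset. A minimal Python sketch for a binary response follows. It assumes a Bernoulli density with a logistic link for the GLM. The function name `pdlf_density` and all numeric values are hypothetical illustrations, not taken from the slides.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def pdlf_density(y, x, beta, delta, pi):
    """Mixture density p(y | x) for y in {0, 1}.

    delta and pi are K-by-L arrays (block offsets and block proportions);
    pi is assumed to sum to 1.
    """
    theta = x @ beta + delta        # K-by-L array: linear predictor plus block offset
    p1 = sigmoid(theta)             # P(y = 1) within each block
    f = p1 if y == 1 else 1.0 - p1  # Bernoulli density evaluated per block
    return float((pi * f).sum())    # weight each block by its proportion

# Hypothetical toy values: s = 2 covariates, K = L = 2 blocks
beta = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
delta = np.array([[0.25, -0.4], [0.0, 0.1]])
pi = np.array([[0.4, 0.1], [0.2, 0.3]])

p_one = pdlf_density(1, x, beta, delta, pi)
p_zero = pdlf_density(0, x, beta, delta, pi)
```

Because each block contributes a proper Bernoulli density and the proportions sum to 1, `p_one + p_zero` equals 1.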
Neural Network as alternative to GLM
Denote:
$y_{ij}$ - observed (non-missing) response
$(\hat{y}_{ij} \mid x_{ij}, \hat{w})$ - fitted response from the neural network, given covariate $x_{ij}$ and estimated weights $\hat{w}$
The predicted response for dyad (i, j) is: $(\hat{y}_{ij} \mid x_{ij}, \hat{w}) + \delta_{\rho(i), \gamma(j)}$
Input: Response matrix $Y = [y_{ij}]$, $i = 1, \dots, m$, $j = 1, \dots, n$
Covariates $X = [x_{ij}]$, $x_{ij}$ is s-dimensional
Number of row clusters: K
Number of column clusters: L
Output: Neural network weights $w$ (dimension depends on the network),
K*L offsets $\{\delta_{IJ}\}$,
$\rho(\cdot), \gamma(\cdot)$ - row and column cluster assignments, m- and n-dimensional,
that minimize the overall sum of squared differences between the observed
and predicted response for dyad (i, j).
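Algorithm
Method:
Initialize:
$\rho(\cdot), \gamma(\cdot)$ - randomly
$\delta_{IJ} = 0$
$w$ - by fitting the network based just on Y and X
Repeat
Step 1: Update offsets (interaction effects) $\delta_{IJ}$:
For each $I = 1, \dots, K$, $J = 1, \dots, L$
$$\delta_{IJ} = \arg\min_{\delta} \sum_{i \in I,\, j \in J} \left( y_{ij} - (\hat{y}_{ij} \mid x_{ij}, w) - \delta \right)^2$$
that is,
$$\delta_{IJ} = \operatorname{mean}_{i \in I,\, j \in J} \left( y_{ij} - (\hat{y}_{ij} \mid x_{ij}, w) \right)$$
Step 2: Update neural network weights, $w$:
$$w = \arg\min_{w} \sum_{\text{all } i, j} \left( y_{ij} - (\hat{y}_{ij} \mid x_{ij}, w) - \delta_{\rho(i), \gamma(j)} \right)^2,$$
which amounts to fitting the network with covariates $x_{ij}$ and "adjusted" response values $(y_{ij} - \delta_{\rho(i), \gamma(j)})$
Step 3: Update row cluster assignments: for each $i = 1, \dots, m$
$$\rho(i) = \arg\min_{I} \sum_{j = 1}^{n} \left( y_{ij} - (\hat{y}_{ij} \mid x_{ij}, w) - \delta_{I, \gamma(j)} \right)^2$$
Step 4: Update column cluster assignments (similar to Step 3)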
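The four steps can be sketched as an alternating minimisation in Python. This is a sketch under stated assumptions, not the author's Matlab code: a linear least-squares fit stands in for the neural network so the example stays short and every step is an exact minimiser. The function name `fit_offsets_model`, the fixed iteration count and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_offsets_model(Y, X, mask, K, L, n_iter=10, seed=0):
    """Alternate Steps 1-4; a linear model replaces the network (assumption)."""
    rng = np.random.default_rng(seed)
    m, n, s = X.shape
    obs = mask.astype(bool)
    Y = np.where(obs, Y, 0.0)               # placeholder for missing cells, never used in sums
    rho = rng.integers(K, size=m)           # random row cluster assignments
    gamma = rng.integers(L, size=n)         # random column cluster assignments
    delta = np.zeros((K, L))                # offsets start at zero
    A = np.concatenate([X, np.ones((m, n, 1))], axis=2)  # covariates plus intercept

    def fit(target):
        # Least-squares fit on observed cells only (stand-in for network training)
        w, *_ = np.linalg.lstsq(A[obs], target[obs], rcond=None)
        return w

    w = fit(Y)                              # initial fit from Y and X alone
    losses = []
    for _ in range(n_iter):
        resid = Y - A @ w
        # Step 1: each offset is the mean residual over observed cells of its block
        for I in range(K):
            for J in range(L):
                block = obs & (rho[:, None] == I) & (gamma[None, :] == J)
                if block.any():
                    delta[I, J] = resid[block].mean()
        # Step 2: refit the model on the offset-adjusted response
        w = fit(Y - delta[rho][:, gamma])
        resid = Y - A @ w
        # Step 3: move each row to the row cluster with the smallest squared error
        cost = np.stack([(((resid - delta[I][gamma][None, :]) ** 2) * obs).sum(axis=1)
                         for I in range(K)], axis=1)
        rho = cost.argmin(axis=1)
        # Step 4: the same for columns
        cost = np.stack([(((resid - delta[rho, J][:, None]) ** 2) * obs).sum(axis=0)
                         for J in range(L)], axis=1)
        gamma = cost.argmin(axis=1)
        pred = A @ w + delta[rho][:, gamma]
        losses.append(float((((Y - pred) ** 2) * obs).sum()))
    return w, delta, rho, gamma, losses

# Synthetic data with a 2-by-2 block structure (hypothetical values)
rng = np.random.default_rng(1)
m, n, s, K, L = 30, 20, 3, 2, 2
X = rng.normal(size=(m, n, s))
true_delta = np.array([[1.0, -1.0], [-1.0, 1.0]])
Y = X @ np.array([0.5, -0.3, 0.2]) + true_delta[rng.integers(K, size=m)][:, rng.integers(L, size=n)]
mask = rng.random((m, n)) < 0.7             # about 70% of dyads observed
w, delta, rho, gamma, losses = fit_offsets_model(Y, X, mask, K, L)
```

With the linear stand-in, each step cannot increase the sum of squared errors over observed dyads, so `losses` is non-increasing. A neural network trained by an iterative optimiser would not carry that guarantee for Step 2.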
Data: MovieLens
• 20603 ratings, 346 users, 966 movies
• From 1 to 198 ratings per movie, 32 to 105 ratings per user.
• 50 covariates for each (user, movie) pair
• 5700 observations held out for validation
• Performance is measured by the area under the Receiver Operating Characteristic (ROC) curve
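For a binary response, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal Python sketch of that pairwise computation follows. The function name `roc_auc` and the toy labels and scores are hypothetical.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney identity)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # Count positive/negative pairs ranked correctly; ties count one half
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

auc_perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])   # 1.0
auc_mixed = roc_auc([0, 1, 0, 1], [0.1, 0.3, 0.35, 0.8])    # 0.75
```

The pairwise form needs memory proportional to the number of positive/negative pairs, which is acceptable for a validation set of a few thousand observations.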
Neural Network Topology
$Z_l = \sigma(\alpha_{0l} + \alpha_l^t X)$, $l = 1, \dots, r$ - "derived features"
$\hat{y} = \sigma(\beta_0 + \beta_1^t Z)$ - fitted probability of {y = 1}
$\sigma(v) = 1/(1 + e^{-v})$ - sigmoid function
Total number of parameters ("weights"): $(s + 1) r + (r + 1)$
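The topology is a single hidden layer of r sigmoid nodes feeding one sigmoid output. The Python sketch below runs one forward pass and counts the weights. The parameter names `alpha0`, `alpha`, `beta0`, `beta` are assumed labels for the hidden-layer and output-layer weights, and the random values are placeholders. The sizes s = 50 and r = 40 match the covariate count and node count quoted in the slides.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def n_weights(s, r):
    """r hidden nodes with s inputs plus a bias each, then one output with r inputs plus a bias."""
    return (s + 1) * r + (r + 1)

def forward(x, alpha0, alpha, beta0, beta):
    z = sigmoid(alpha0 + alpha @ x)          # r derived features
    return float(sigmoid(beta0 + beta @ z))  # fitted probability of {y = 1}

s, r = 50, 40
rng = np.random.default_rng(0)
alpha0, alpha = rng.normal(size=r), rng.normal(size=(r, s))
beta0, beta = rng.normal(), rng.normal(size=r)

prob = forward(rng.normal(size=s), alpha0, alpha, beta0, beta)
total = n_weights(s, r)   # 51 * 40 + 41 = 2081
```

With about 14,900 training ratings left after the hold-out, 2081 weights is a sizeable model, which is consistent with the concern about overfitting on the next slide.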
Number of nodes?
• 40 nodes appear to be enough (produce similar overfitting)
Results
Model               Logistic regression  Neural network  PDLF-Logistic  PDLF-Neural  PDLF-Neural  PDLF-Neural
Clusters            N/A                  N/A             4 * 4          4 * 4        6 * 6        3 * 4
Hidden nodes        1                    40              1              40           40           40
Validation ROC      0.62                 0.6742          0.6913         0.7128       0.6919       0.708
Max. cluster size   N/A                  N/A             2022           1913         5184         1847
Min. cluster size   N/A                  N/A             274            412          5            709
Max delta           N/A                  N/A             0.25           0.13         0.23         0.02
Min delta           N/A                  N/A             -0.4           -0.57        -0.36        -0.62
New Covariates?
Sample movies from the cluster with delta = -0.57:
Title                                                       Release date
Nosferatu (Nosferatu, eine Symphonie des Grauens) (1922)    1-Jan-22
Blue Angel, The (Blaue Engel, Der) (1930)                   1-Jan-30
Pinocchio (1940)                                            1-Jan-40
Dial M for Murder (1954)                                    1-Jan-54
8 1/2 (1963)                                                1-Jan-63
Carrie (1976)                                               1-Jan-76
Top Gun (1986)                                              1-Jan-86
Bram Stoker's Dracula (1992)                                1-Jan-92
Mortal Kombat: Annihilation (1997)                          1-Jan-97
Sphere (1998)                                               13-Feb-98
• 756 ratings; 23 females and 55 males; no documentaries
Contribution to ROC
Is Neural Network useful?
• The gain in ROC area depends on the order: if the extra linear features (neural network) are added first, the gain from co-clustering is reduced
• The opposite is also true
• Hence, the information in the linear features is similar to that in the clusters, so:
• For this dataset, the neural network is not so helpful, but…
• For other dyadic datasets, the neural network can be a lot more useful
Related Work
• What if we want to predict the response on (Web page, Search query, Web user) triples?
• B. Long, X. Wu, Z. Zhang, and P. S. Yu. Unsupervised learning on k-partite graphs. In KDD, 2006.
Additional Info
• To obtain a detailed report and Matlab code, please visit my website:
http://www.stat.purdue.edu/~ntuzov/
• The project is posted in “Software skills / Matlab” section
• Questions? Contact me on [email protected]