MS Project Report - Final - GrockIt on Kaggle.com
“What do you know?” - Latent feature approach for the Kaggle’s GrockIt challenge
Rohan Anil [email protected]
University of California, San Diego, Department of Computer Science & Engineering, La Jolla, CA 92092 USA
March 19, 2012
Abstract
We describe our efforts in solving the GrockIt competition in this report. The goal of the competition was to improve the state of the art in student evaluation. The task was to predict whether a student will answer the next test question correctly. The dataset was provided by GrockIt, an online platform for students to practice questions for competitive exams. At the end of the competition there were a total of 252 teams, 581 individuals and 1803 submissions. We train latent feature log-linear (LFL) models for the prediction task. We exploit the rich meta-information associated with the questions and the dyads to improve the performance of the models. Finally, we explore different techniques to blend the predictions from multiple models. The competition used binomial capped deviance (BCD) as the metric to rank teams. Our team, ‘UCSD Triton’, placed 4th on the public leaderboard with a BCD of 0.24665 and 5th on the final private leaderboard with a BCD of 0.24792.
1. Introduction
Kaggle.com is a data-mining competition platform. We participated in one of its competitions, titled “What do you know?”, which aimed to improve the state of the art in student evaluation, and finished 5th. The competition dataset was provided by GrockIt, an online learning platform. GrockIt provides tools to help students prepare for competitive exams such as the GMAT, ACT and SAT. The dataset mainly contains performance information of students on various questions.
Inspired by the success of latent-feature methods on the Netflix prize challenge (Töscher et al., 2009), we examine whether latent-feature methods are competitive in solving this task. To improve our leaderboard rank, we explore two different ensemble learning techniques, i.e. linear regression and gradient boosted decision trees. The report is organized as follows. Section 2 introduces the task of dyadic prediction, the dataset that was available for the competition, the metric used to rank the teams, and the generic latent feature log-linear (LFL) model (Menon & Elkan, 2010). Section 3 formulates the student-evaluation task as a dyadic-prediction task and derives the stochastic gradient update rules for the LFL model. Section 4 describes the two techniques we used for ensembling and their results on the dataset, and we conclude in Section 5.
2. GrockIt-Kaggle dataset
The dataset contains student responses to various questions. There are a total of 179,107 users and 6,046 questions in the training set. The dataset is similar to a typical dyadic dataset, with a couple of key differences: i) there can be duplicate dyad pairs in the training set with different outcomes, since a student can answer a question many times; ii) in some game types, students can answer questions collaboratively. The value of the feature “number of players” is the number of students answering the question. Grockit provides a chat-box for the students to discuss the answer. If the question is a multiple-choice question, students can leave comments on the choices. The next question is not displayed until everyone answers the question.
We can interpret each pair of user and question as a dyad.
(s, q) → y(s,q)
s ∈ S: Students
q ∈ Q: Questions
y ∈ O: Outcome
MS Project:Competing on Kaggle in GrockIts What do you know ?
Training set T = ((s, q) → y)
Figure 1. GrockIt-Kaggle dataset
The training set contains a total of 4,851,476 dyad pairs, each with its corresponding label, which is the outcome. The outcome can be of four types: i) correct, ii) incorrect, iii) skipped and iv) timed-out. The validation set contains a total of 80,075 dyads of 80,075 unique students and the test set contains 93,100 dyads of 93,100 unique students. The validation set students are a subset of the test set students. The students who appear in the test set but not in the validation set are those who have only a few dyads in the training set. The validation set has later outcomes relative to the training set, and the test set has later outcomes relative to the validation set.
Our task is to predict the probability of a correct outcome for every dyad in the test set.
Pr(y = ‘Correct′|(s, q) ∈ Test Set)
The validation and test sets do not contain any skipped or timed-out outcomes. The competition uses 40% of the test data to rank teams on the public leaderboard and 60% of the test data for the final ranking on the private leaderboard, which was only revealed at the end of the competition. This measure was used by the organizers to prevent overfitting.
2.1. Baseline - Rasch Model
A baseline was provided by Kaggle for the dataset. The baseline uses the Rasch model (Rasch, 1960). Rasch models are widely used in educational psychology research. The prediction from the model is

$$\Pr((s, q) \to 1) = \frac{\exp(B_s - \delta_q)}{1 + \exp(B_s - \delta_q)}$$

where $B_s$ is interpreted as the ability of student $s$ and $\delta_q$ is interpreted as the difficulty of question $q$.
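As a small illustration of the Rasch prediction, the sketch below evaluates the formula for given ability and difficulty values. The function name `rasch_probability` and the parameter values are hypothetical; the sketch does not show how the baseline estimated its parameters.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Pr((s, q) -> 1) = exp(B_s - delta_q) / (1 + exp(B_s - delta_q))."""
    # Algebraically equal to the logistic function of (ability - difficulty).
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values, for illustration only.
p_equal = rasch_probability(ability=0.0, difficulty=0.0)   # ability matches difficulty
p_strong = rasch_probability(ability=2.0, difficulty=0.0)  # able student, same question
print(p_equal, p_strong)
```

When ability equals difficulty the predicted probability is one half, and it increases monotonically with the gap between the two.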
2.2. Side Information
The dataset contains two types of side-information: question side-information, listed in Table 1, and student-question interaction side-information, listed in Table 2.
Table 1. Side-information associated with a question.

| Type | Description |
|---|---|
| Question-Type | i) Multiple Choice, ii) Free Response |
| Group | i) ACT, ii) GMAT, iii) SAT |
| Track | 9 types, listed in appendix |
| Subtrack | 15 types |
| Tags | each question is tagged with the subjects it is from, finer granularity than subtrack |
| Question set | questions which share a question set id and share similarity in presentation on the screen |
Table 2. Side-information for a dyad.

| Type | Description |
|---|---|
| Game | 12 types of games |
| Number of players | players in the game |
| Started | Date and time the question was seen by the student |
| Answered-at | Date and time the question was answered by the student |
| Deactivated | Date and time the question cleared from the screen |
2.3. Preprocessing the dataset
Our task is to predict the probability of a correct outcome for the dyads in the test set. The test set is guaranteed to have only correct and incorrect outcomes. The training set also contains dyads with skipped and timed-out outcomes, as illustrated in Figure 2. We create two training sets from the original set. In the first training set, we exclude all dyads with a skipped or timed-out outcome. In the second training set, we include all dyads but treat the skipped and timed-out outcomes as incorrect outcomes.
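The two training sets can be derived in a single pass over the data. The sketch below assumes each dyad is a `(student, question, outcome)` tuple with the outcome stored as one of four strings; this encoding and the function name are hypothetical and not taken from the competition files.

```python
def make_training_sets(dyads):
    """Build the two training sets described in the text.

    dyads: iterable of (student, question, outcome), where outcome is one of
    "correct", "incorrect", "skipped", "timed-out" (assumed encoding).
    Returns (without_skips, with_skips_as_incorrect), each a list of
    (student, question, label) with label 1 for correct and 0 otherwise.
    """
    without_skips, with_skips = [], []
    for student, question, outcome in dyads:
        label = 1 if outcome == "correct" else 0
        if outcome in ("correct", "incorrect"):
            # First set: skipped and timed-out dyads are dropped.
            without_skips.append((student, question, label))
        # Second set: every dyad is kept; skipped/timed-out count as incorrect.
        with_skips.append((student, question, label))
    return without_skips, with_skips
```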
2.4. Binomial Capped Deviance
The competition uses binomial capped deviance (BCD) as the metric to rank the teams. The metric is similar to the log-likelihood. For a binary outcome the BCD is calculated as follows.
Let Pr(y = 1|(s, q) ∈ T) be the predicted probability of a correct (1) outcome for the dyad (s, q) observed in the training set T. The BCD of the training set is then
[Figure: bar chart of dyad counts per outcome (Correct, Incorrect, Skipped, Timed-Out); vertical axis scaled by 10^6]
Figure 2. Histogram of outcomes in the training set

[Figure: bar chart of question counts per type (Multiple Choice, Free response); vertical axis from 0 to 6000]
Figure 3. Division of questions by types

[Figure: bar chart over groups (ACT, GMAT, SAT); vertical axis: # of observations of dyads, scaled by 10^6]
Figure 4. Number of dyads in the training set for a particular group
$$\mathrm{BCD}(T) = \frac{1}{N_T} \sum_{((s,q) \to y) \in T} \left( y \log(p) + (1 - y) \log(1 - p) \right)$$

where $N_T$ is the number of dyads in $T$ and

$$p = \max(0.01, \min(0.99, \Pr(y = 1 \mid (s, q) \in T)))$$
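The capped metric can be computed directly from labels and predicted probabilities. The sketch below follows the formula exactly as written, using the natural logarithm. The text does not state the base of the logarithm or the sign convention used on the leaderboard, so both are assumptions here, and the function name is hypothetical.

```python
import math


def binomial_capped_deviance(labels, probs, cap=0.01):
    """Mean of y*log(p) + (1-y)*log(1-p) with p capped to [cap, 1-cap].

    labels: sequence of 0/1 outcomes.
    probs:  sequence of predicted probabilities of outcome 1.
    """
    total = 0.0
    for y, pr in zip(labels, probs):
        p = max(cap, min(1.0 - cap, pr))  # cap extreme predictions
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)
```

The cap bounds the penalty for a confident wrong prediction: a prediction of 0 for a correct answer costs the same as a prediction of 0.01.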
3. Dyadic Prediction
A dyadic prediction task is a learning task which involves predicting a class label for a pair of items (Hofmann et al., 1999).
(u, i) → y(u,i) where u ∈ U, i ∈ I and y ∈ Y
The task involves a training set of pairs (u, i), each with its corresponding label y, i.e. T = {((u, i) → y)}. Sometimes more information is available in the dataset, i.e. information associated with each item of the dyad and information associated with the dyadic pair. This information can be processed into an explicit feature vector, which is termed the side-information.
3.1. Latent feature log-linear model
For this competition, we need to predict the probability of a correct outcome for the dyadic pairs in the test set. We also need to leverage the side-information available in the dataset. This motivates the use of the latent feature log-linear (LFL) model for dyadic prediction (Menon & Elkan, 2010). LFL can predict well-calibrated probabilities and incorporate side-information in the training process.

Let $(u, i) \to y \in Y$, where $|Y| > 2$, i.e. the multi-class case. The probability of label $y$ is

$$\Pr(y \mid (u, i)) \propto \exp(U^y_u \cdot I^y_i)$$

where
$U^y$ is a matrix of size (number of items of type $u$) $\times\ k$,
$I^y$ is a matrix of size (number of items of type $i$) $\times\ k$,
$k$ is the number of latent features,
$U^y_u$ is the $u$-th row vector of the matrix $U^y$,
$I^y_i$ is the $i$-th row vector of the matrix $I^y$.
We predict the label which has the highest probability according to the model,

$$\hat{y} = \arg\max_{y} \frac{\Pr(y \mid (u, i))}{\sum_{y'} \Pr(y' \mid (u, i))}$$

We train the LFL model using the negative log-likelihood (LL) as the objective function, and we use stochastic gradient descent (SGD) to learn the model parameters:

$$LL = \sum_{((u,i) \to y) \in T} -\log \Pr(y \mid (u, i)) = \sum_{((u,i) \to y) \in T} ll_{(u,i)}$$
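In stochastic gradient descent, the contribution of each example $(u, i) \to y \in T$ to the negative log-likelihood is

$$ll_{(u,i) \to y} = -\log(p), \quad \text{where } p = \Pr(y \mid (u, i))$$

The derivatives with respect to the latent vectors of the observed label $y$ are

$$\frac{\partial\, ll}{\partial U^y_u} = -\frac{1}{p} \frac{\partial p}{\partial U^y_u}$$

$$\frac{\partial p}{\partial U^y_u} = (1 - p) \times p \times I^y_i$$

$$\frac{\partial\, ll}{\partial U^y_u} = -(1 - p) \times I^y_i$$

and similarly

$$\frac{\partial\, ll}{\partial I^y_i} = -(1 - p) \times U^y_u$$

The derivatives with respect to $U^{y'}_u$ and $I^{y'}_i$, where $y' \in Y$ and $y' \neq y$, are

$$\frac{\partial\, ll}{\partial U^{y'}_u} = p_{(u,i) \to y'} \times I^{y'}_i$$

and similarly

$$\frac{\partial\, ll}{\partial I^{y'}_i} = p_{(u,i) \to y'} \times U^{y'}_u$$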
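After adding regularization terms to the objective function we get

$$\text{Objective} = \sum_{(u,i) \in T} \left( -\log(p_{(u,i) \to y}) + \frac{\mu}{2} |U_u|^2 + \frac{\mu}{2} |I_i|^2 \right)$$

The update rules of the SGD algorithm used to minimize the objective function are

$$U^y_u = U^y_u - \lambda \times \left( \frac{\partial\, ll}{\partial U^y_u} + \mu \times U^y_u \right)$$

$$I^y_i = I^y_i - \lambda \times \left( \frac{\partial\, ll}{\partial I^y_i} + \mu \times I^y_i \right)$$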
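To make the multi-class update concrete, the following is a minimal sketch of one regularized SGD step. It assumes the latent matrices are stored as nested lists indexed as `U[y][u]` and `I[y][i]`; the function names, the default learning rate `lam` and the regularization `mu` are hypothetical choices. The gradient of the per-example loss with respect to the score of class c is the predicted probability of c minus the indicator that c is the observed label.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_probabilities(U, I, u, i):
    """Softmax over labels of the scores U[y][u] . I[y][i]."""
    scores = [dot(U[y][u], I[y][i]) for y in range(len(U))]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def sgd_step(U, I, u, i, y, lam=0.1, mu=0.01):
    """One SGD step on -log Pr(y | (u, i)) plus L2 regularization."""
    probs = class_probabilities(U, I, u, i)
    for c in range(len(U)):
        # d ll / d score_c = p_c - 1[c == y]
        g = probs[c] - (1.0 if c == y else 0.0)
        u_old, i_old = U[c][u], I[c][i]
        U[c][u] = [a - lam * (g * b + mu * a) for a, b in zip(u_old, i_old)]
        I[c][i] = [b - lam * (g * a + mu * b) for a, b in zip(u_old, i_old)]
```

Both vectors are updated from their values before the step, so the update of `I[c][i]` does not see the already modified `U[c][u]`.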
3.2. LFL on Kaggle-GrockIt dataset
The test set only contains two outcomes, i) correct and ii) incorrect, for which we use the binary LFL model. The binary case can be written as follows:

$$\Pr(y \mid (s, q)) = \frac{\exp(S^y_s \cdot Q^y_q)}{\sum_{y'} \exp(S^{y'}_s \cdot Q^{y'}_q)}$$

where y = 1 for a correct response and 0 for an incorrect response. We can fix $S^0$ and $Q^0$ to be zero, i.e. keep class “0” as the base class, which gives

$$\Pr(y = 1 \mid (s, q)) = \frac{1}{1 + \exp(-S^1_s \cdot Q^1_q)}$$

The binary LFL model has appeared in the literature before (Schein et al., 2003; Agarwal & Chen, 2009). We use stochastic gradient descent to train this model.
[Figure: bar chart; horizontal axis: Number of Players (# = 1, # >= 2); vertical axis: # of dyads, scaled by 10^6]
Figure 5. Number of dyads with number of players
3.3. SGD Training
In the SGD algorithm, we do not randomize the order of the dataset. We order the answered questions by time and run stochastic gradient descent, so that the model adapts to recently answered questions.
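A minimal sketch of time-ordered SGD for the binary LFL model follows. It assumes each dyad is a `(timestamp, student, question, label)` tuple and uses the 0.99 learning-rate decay from Algorithm 1. The initialization scale, the default hyperparameters and the function names are hypothetical, and the early stopping on validation BCD from Algorithm 1 is left out to keep the sketch short.

```python
import math
import random


def predict(S, Q, s, q):
    """Pr(y = 1 | (s, q)) for the binary LFL model."""
    z = sum(a * b for a, b in zip(S[s], Q[q]))
    return 1.0 / (1.0 + math.exp(-z))


def train_lfl(dyads, n_students, n_questions, k=5, lam=0.05, mu=0.01,
              epochs=20, seed=0):
    """dyads: list of (timestamp, s, q, y) with y in {0, 1}."""
    rng = random.Random(seed)
    S = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_students)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_questions)]
    # Process dyads in time order; the order is never shuffled.
    ordered = sorted(dyads, key=lambda d: d[0])
    for _ in range(epochs):
        for _, s, q, y in ordered:
            err = predict(S, Q, s, q) - y
            s_old, q_old = S[s], Q[q]
            S[s] = [a - lam * (err * b + mu * a) for a, b in zip(s_old, q_old)]
            Q[q] = [b - lam * (err * a + mu * b) for a, b in zip(s_old, q_old)]
        lam *= 0.99  # decay the learning rate after every epoch
    return S, Q
```

Because the most recent dyads are visited last in every epoch, the final parameters are weighted towards recent behaviour.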
3.4. Parallelism
Updates of the stochastic gradient descent algorithm are independent for dyads (u, i) and (u′, i′) where u ≠ u′ and i ≠ i′. We exploit this parallelism by splitting the input training set into non-overlapping blocks. We observed this parallelism while competing in the KDD Cup 2011 and later found that the same observation was made independently by another group (Gemulla et al., 2011). The threads process a set of blocks such that no two blocks have the same column or row index, as shown in Figure 6. Figure 7 shows the time taken for an epoch of LFL versus the number of cores on the dataset, using a latent feature size of 5. The experiments were run on an Intel Core-i5 450M processor.
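One way to choose blocks that share no row or column index is a cyclic shift over a square grid of blocks. The sketch below shows only this scheduling idea. The cyclic scheme, the equal-width partition and the function names are assumptions and are not necessarily the scheme shown in Figure 6.

```python
def block_schedule(n_blocks):
    """Yield n_blocks rounds of (row_block, col_block) pairs.

    Within a round no two pairs share a row block or a column block, so the
    dyads of each pair can be processed by a separate thread without two
    threads touching the same student or question vector.
    """
    for shift in range(n_blocks):
        yield [(t, (t + shift) % n_blocks) for t in range(n_blocks)]


def partition(dyads, n_students, n_questions, n_blocks):
    """Group dyads (s, q, y) by (student block, question block)."""
    blocks = {}
    for s, q, y in dyads:
        key = (s * n_blocks // n_students, q * n_blocks // n_questions)
        blocks.setdefault(key, []).append((s, q, y))
    return blocks
```

Over all rounds every block of the grid is visited exactly once, so one pass over the schedule corresponds to one epoch.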
3.5. LFL Models
In the Grockit-Kaggle dataset, side-information is available for questions as group, track, sub-track, question-type and tags. All of the available side-information is categorical in nature. Although LFL is a powerful model that can leverage any type of side-information, we only experimented with the models in Table 3 during the competition.
The following section describes how to add side-information for a categorical variable “group” to the LFL model.

Figure 6. Parallelism

Algorithm 1 Stochastic Gradient Descent
Input: Dyads (s, q) → y ∈ Training Set
  EL: epoch limit
  µ: regularization
  λ: learning rate
  previous-bcd: DOUBLE_MAX
  k: latent feature size
for epoch = 1 to EL do
  for each (s, q) → y do
    // Update latent vectors, with p_y = Pr(y = 1 | (s, q))
    S^1_s = S^1_s − λ × ((p_y − y) × Q^1_q + µ × S^1_s)
    Q^1_q = Q^1_q − λ × ((p_y − y) × S^1_s + µ × Q^1_q)
    current-bcd = calculate validation set bcd
    if (current-bcd > previous-bcd) then
      break
    else
      previous-bcd = current-bcd
    end if
  end foreach
  λ = λ × 0.99
end for

Figure 7. Parallelism
3.6. Adding Side-Information to the LFL model
For a question q, let g = group(q). We add a latentvector for each group. Let the prediction equation be,
$$\Pr(y = 1 \mid (s, q)) = \frac{1}{1 + \exp(-S^1_s \cdot (Q^1_q + G^1_g))}$$
The update rules are,
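$$S^1_s = S^1_s - \lambda \times ((p_y - y) \times (Q^1_q + G^1_g) + \mu \times S^1_s)$$

$$Q^1_q = Q^1_q - \lambda \times ((p_y - y) \times S^1_s + \mu \times Q^1_q)$$

$$G^1_g = G^1_g - \lambda \times ((p_y - y) \times S^1_s + \mu \times G^1_g)$$

where
$S^1$ is a matrix of size (number of students) $\times\ k$,
$Q^1$ is a matrix of size (number of questions) $\times\ k$,
$G^1$ is a matrix of size (number of groups) $\times\ k$.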
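The update rules with a group latent vector can be sketched as a single SGD step. The nested-list storage, the function names and the default values of `lam` and `mu` are hypothetical. The prediction uses the sum of the question vector and the group vector, and the question and group vectors receive the same gradient direction.

```python
import math


def predict_with_group(S, Q, G, s, q, g):
    """Pr(y = 1 | (s, q)) = 1 / (1 + exp(-S_s . (Q_q + G_g)))."""
    item = [a + b for a, b in zip(Q[q], G[g])]
    z = sum(a * b for a, b in zip(S[s], item))
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step_with_group(S, Q, G, s, q, g, y, lam=0.05, mu=0.01):
    """One regularized SGD step on a dyad (s, q) whose question is in group g."""
    err = predict_with_group(S, Q, G, s, q, g) - y
    s_old = S[s]
    item = [a + b for a, b in zip(Q[q], G[g])]
    S[s] = [a - lam * (err * b + mu * a) for a, b in zip(s_old, item)]
    Q[q] = [b - lam * (err * a + mu * b) for a, b in zip(s_old, Q[q])]
    G[g] = [b - lam * (err * a + mu * b) for a, b in zip(s_old, G[g])]
```

Since a group vector is shared by all questions in the group, it is updated far more often than any single question vector, which lets rarely seen questions borrow strength from their group.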
3.7. Results
As discussed in the previous section, the training set contains four types of outcomes: i) correct, ii) incorrect, iii) skipped and iv) timed-out. We train the LFL models on the training set after excluding skipped and timed-out outcomes (results in Table 4) and on the entire training set with skipped and timed-out responses treated as incorrect responses (results in Table 5). All the model parameters were tuned by grid search. The test predictions were obtained after re-training the model on both the training and validation sets with the tuned parameters. The results on the 60% portion of the test set became available only after the end of the competition.

Table 3. LFL models.

| No. | Prediction p((s, q) → 1) | Description |
|---|---|---|
| 1 | $\frac{1}{1 + \exp(-S^1_s \cdot Q^1_q)}$ | Basic LFL model |
| 2 | $\frac{1}{1 + \exp(-S^1_s \cdot (Q^1_q + G^1_g))}$ | G: Group |
| 3 | $\frac{1}{1 + \exp(-S^1_s \cdot (Q^1_q + T^1_t))}$ | T: Track |
| 4 | $\frac{1}{1 + \exp(-S^1_s \cdot (Q^1_q + ST^1_{st}))}$ | ST: Subtrack |
| 5 | $\frac{1}{1 + \exp(-S^1_s \cdot (Q^1_q + QT^1_{qt}))}$ | QT: Question Type |
| 6 | $\frac{1}{1 + \exp(-S^1_s \cdot (Q^1_q + GT^1_{gt}))}$ | GT: Game Type |
| 7 | $\frac{1}{1 + \exp(-S^1_s \cdot (Q^1_q + G^1_g + ST^1_{st}))}$ | G: Group & ST: Subtrack |
| 8 | $\frac{1}{1 + \exp(-S^1_s \cdot (Q^1_q + G^1_g + GT^1_{gt}))}$ | G: Group & GT: Game Type |
| 9 | $\frac{1}{1 + \exp(-S^1_s \cdot (Q^1_q + ST^1_{st} + GT^1_{gt}))}$ | ST: Subtrack & GT: Game Type |
| 10 | $\frac{1}{1 + \exp(-S^1_s \cdot (Q^1_q + ST^1_{st} + QT^1_{qt}))}$ | ST: Subtrack & QT: Question Type |
Table 4. Results (BCD) on training set after excluding skipped and timed-out responses.

| Model | Validation | Test (40%) | Test (60%) |
|---|---|---|---|
| Rasch | - | 0.25663 | 0.25766 |
| LFL-1 | 0.252465 | 0.25398 | 0.25483 |
| LFL-2 | 0.252250 | 0.25343 | 0.25451 |
| LFL-3 | 0.251941 | 0.25340 | 0.25450 |
| LFL-4 | 0.251842 | 0.25288 | 0.25389 |
| LFL-5 | 0.252021 | 0.25300 | 0.25426 |
| LFL-6 | 0.251819 | 0.25296 | 0.25446 |
| LFL-7 | 0.252014 | 0.25328 | 0.25433 |
| LFL-8 | 0.251916 | 0.25310 | 0.25458 |
| LFL-9 | 0.251561 | 0.25266 | 0.25423 |
| LFL-10 | 0.251802 | 0.25291 | 0.25389 |
| Average | 0.251320 | 0.25250 | 0.25362 |
Table 5. Results (BCD) on the entire training set with skipped and timed-out responses treated as incorrect responses.

| Model | Validation | Test (40%) | Test (60%) |
|---|---|---|---|
| LFL-1 | 0.259718 | 0.25784 | 0.25921 |
| LFL-2 | 0.259337 | 0.25719 | 0.25869 |
| LFL-3 | 0.258626 | 0.25668 | 0.25799 |
| LFL-4 | 0.258597 | 0.25654 | 0.25795 |
| LFL-5 | 0.259242 | 0.25692 | 0.25847 |
| LFL-6 | 0.258606 | 0.25641 | 0.25810 |
| LFL-7 | 0.258980 | 0.25683 | 0.25832 |
| LFL-8 | 0.259162 | 0.25700 | 0.25874 |
| LFL-9 | 0.258639 | 0.25646 | 0.25822 |
| LFL-10 | 0.258897 | 0.25665 | 0.25821 |
| Average | 0.258363 | 0.25624 | 0.25778 |
4. Ensemble Learning
The main motivation for ensemble learning in this competition is that no single model performs well on every dyad (Takacs et al., 2009). Combining predictions from multiple models can outperform each of the individual models. In the following section we describe different techniques to combine predictions.
The simplest technique to combine predictions from multiple models is linear regression. Define a matrix P where p_{i,j} is the prediction for dyad i using model j. Linear regression then learns a weight vector w such that
Pw = Y
where Y is a column vector whose entry Y_i is the true label of dyad i. The prediction for dyad i is then

$$\sum_j p_{i,j} w_j$$

To avoid overfitting, we tune the parameters of each model using a held-out set. To achieve the best performance on the test set, it is advisable to create a held-out set which is similar to the test set. For the competition, we trained the LFL models on the training set and predicted on the validation set. We cross-validate, i.e. tune the learning rate λ, the regularization µ and the number of epochs, by treating the validation set as the held-out set. Finally, the validation set predictions are used for linear regression (ensemble learning).
Figure 8. Decision tree. X1 to X10 are training examples

We re-train each of the LFL models on both the training set and the validation set using the tuned parameters to generate the predictions for the test set. The predictions from all the models and the weights learned from linear regression are used for the final test set predictions. The results from linear regression are listed in Table 6.
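The blending step can be sketched as an ordinary least-squares fit of the weight vector w on the validation predictions, followed by a weighted sum of the test predictions. The sketch solves the normal equations with a small Gaussian elimination routine so that it needs no external library. The function names are hypothetical and the fit has no intercept term, which is an assumption.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_blend_weights(P, Y):
    """Least-squares w for P w = Y via the normal equations (P^T P) w = P^T Y."""
    m = len(P[0])
    PtP = [[sum(row[a] * row[b] for row in P) for b in range(m)] for a in range(m)]
    PtY = [sum(row[a] * y for row, y in zip(P, Y)) for a in range(m)]
    return solve(PtP, PtY)


def blend(P, w):
    """Blended prediction sum_j p_ij * w_j for every dyad i."""
    return [sum(p * wj for p, wj in zip(row, w)) for row in P]
```

In use, `fit_blend_weights` would be called on the validation-set predictions and labels, and `blend` on the test-set predictions of the re-trained models. A blended value can fall outside [0, 1], which the cap in the BCD metric absorbs.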
Table 6. Results from linear regression. i) Without: training without skipped and timed-out responses. ii) With: training with skipped and timed-out responses treated as incorrect responses.

| Description | Validation | Test (40%) | Test (60%) |
|---|---|---|---|
| Without | 0.25114 | 0.25195 | 0.25311 |
| With | 0.256500 | 0.25599 | 0.25770 |
| Combined | 0.251025 | 0.25179 | 0.25300 |
4.1. Gradient Boosted Decision Trees
The main motivation for the use of Gradient Boosted Decision Trees (GBDT) (Friedman, 1999) is to include side-information in the ensemble learning. GBDT is a powerful learning algorithm that is widely used (see Li & Xu, 2009, chap. 6). The core of the algorithm is a decision tree learner.
A decision tree is illustrated in Figure 8. Internal nodes are decision functions that select the subtree to send the example through. The leaves are nodes with a small set of training examples, which is used for the final prediction.
As illustrated, decision trees can capture interactions between features and can handle both numeric and categorical features.
At each node we choose the split, either a threshold value for a numeric feature or a subset of categories for a categorical feature, that minimizes a particular criterion. For the regression task, the sum of squared errors is generally used as the criterion.
Decision functions are learned at every node so that each one partitions the dataset into two, and this process is applied recursively until the leaf node has only a few examples. The decision function at the root node is learned using the entire dataset. We can restrict the depth of the decision tree using the parameter d, which halts the recursive tree building at that depth.
The example to be predicted is used as input to the decision tree, and the decision function at every node sends the example towards a particular subtree. Finally we end up at a leaf node, which contains a set of training examples. For the regression task the prediction is the average of the target regression values of that set.
In gradient boosting, we select the base learner and the loss function. For this competition we used a decision tree as the base learner and squared error as the loss function. Gradient boosting is an iterative procedure where we fit a base learner on the gradients from the previous iteration.
$$f(x) = \sum_{j=0}^{J} \rho_j \cdot T_j(x) \qquad (1)$$
where Tj(..) is a decision tree.
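Algorithm 2 Gradient Boosting
Initialize: $f_0 = T_0 = \frac{\text{Count}(\text{Outcome} == \text{'Correct'})}{\text{Total}}$ and $\rho_0 = 1$
Input: loss $L(y_i, f_{j-1}(x_i)) = (y_i - f_{j-1}(x_i))^2$, training data $(x_i, y_i)$
for iteration j = 1 to J do
  (a) $\tilde{y}_i = -\frac{\partial L(y_i, f_{j-1}(x_i))}{\partial f_{j-1}}$ for i = 1 .. n
  (b) Fit tree $T_j(\cdot)$ to $(x_i, \tilde{y}_i)$ for i = 1 .. n
  (c) $\rho_j = \arg\min_{\rho} \sum_{i=1}^{n} L(y_i, f_{j-1}(x_i) + \rho\, T_j(x_i))$
  (d) $f_j = f_{j-1} + \rho_j \times T_j(x)$
end for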
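The boosting loop of Algorithm 2 can be sketched with depth-one trees (stumps) as the base learner, which is a simplification of the deeper trees used in the report. For squared error the line search in step (c) has a closed form, which the sketch uses. All function names are hypothetical.

```python
def fit_stump(X, targets):
    """Depth-one regression tree: best (feature, threshold) by squared error."""
    mean = sum(targets) / len(targets)
    best = (sum((t - mean) ** 2 for t in targets), None, None, mean, mean)
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X})[:-1]:
            left = [t for row, t in zip(X, targets) if row[f] <= thr]
            right = [t for row, t in zip(X, targets) if row[f] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if sse < best[0]:
                best = (sse, f, thr, lm, rm)
    _, feat, thr, lm, rm = best
    if feat is None:  # no split improves on a constant prediction
        return lambda row: mean
    return lambda row: lm if row[feat] <= thr else rm


def gradient_boost(X, y, n_trees=50):
    """Gradient boosting with squared-error loss, following steps (a) to (d)."""
    f0 = sum(y) / len(y)  # initial model: fraction of correct outcomes
    preds = [f0] * len(y)
    trees = []
    for _ in range(n_trees):
        grad = [2.0 * (yi - pi) for yi, pi in zip(y, preds)]  # (a) negative gradient
        tree = fit_stump(X, grad)                             # (b) fit base learner
        t = [tree(row) for row in X]
        denom = sum(v * v for v in t)
        if denom == 0.0:
            break  # the tree predicts zero everywhere; nothing left to fit
        # (c) closed-form minimizer of sum_i (y_i - f_i - rho * t_i)^2
        rho = sum((yi - pi) * v for yi, pi, v in zip(y, preds, t)) / denom
        preds = [pi + rho * v for pi, v in zip(preds, t)]     # (d) update the model
        trees.append((rho, tree))
    return lambda row: f0 + sum(rho * tree(row) for rho, tree in trees)
```

On a toy dataset such as `X = [[0], [1], [2], [3]]` with `y = [0, 1, ...]` split cleanly by one threshold, a single stump already fits the labels and the loop stops once the gradients vanish.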
4.2. Application of GBDTs
The input feature vector of a training example contains the predictions of the LFL models, the side-information associated with the question, interaction features between users and questions, and meta-features that we compute from the statistics of the dataset. We can add a regularization parameter λ, 0 < λ ≤ 1, to step (d) of Algorithm 2, which results in the following update:

$$f_j = f_{j-1} + \lambda \times \rho_j \times T_j(x)$$

The side-information and meta-information we add as features in ensemble learning are listed in Table 12 and Table 13 in the appendix. For each question, the dataset contains a set of tags. Each tag corresponds to a subject that the question is from. We first manually merge the duplicated tags. Then we cluster the tags using spectral clustering (Ng et al., 2001), with the normalized co-occurrence of tags as the similarity measure used to generate the affinity matrix A.
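Algorithm 3 Spectral Clustering
(a) $A(i, j) = \frac{1}{|Q|} \sum_{k=1}^{|Q|} \mathbb{I}(\{tag_i, tag_j\} \subseteq \text{Tags}(q_k))$
(b) $D(i, i) = \sum_j A(i, j)$
(c) $NL = D^{-1/2} A D^{-1/2}$, $NL \in \mathbb{R}^{n \times n}$
(d) $V \in \mathbb{R}^{n \times k}$ contains the first k eigenvectors of NL
(e) $U(i, j) = V(i, j) / \left( \sum_k V(i, k)^2 \right)^{1/2}$
Cluster the rows of U using k-means. The cluster $c_i$ for row i is the cluster for $tag_i$.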
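The first three steps of the clustering, building the co-occurrence affinity matrix and normalizing it symmetrically as in Ng et al. (2001), can be sketched without any numerical library. The eigen-decomposition and k-means steps are omitted. The input format (one set of tags per question), the zero diagonal and the function name are assumptions.

```python
import math


def normalized_affinity(question_tags, tags):
    """Return D^{-1/2} A D^{-1/2} for the tag co-occurrence affinity A.

    question_tags: list with one collection of tags per question.
    tags: list of all distinct tags; row i of the result belongs to tags[i].
    """
    n, n_questions = len(tags), len(question_tags)
    index = {t: i for i, t in enumerate(tags)}
    A = [[0.0] * n for _ in range(n)]
    for tag_set in question_tags:
        ids = [index[t] for t in tag_set]
        for i in ids:
            for j in ids:
                if i != j:  # assumed convention: zero diagonal
                    A[i][j] += 1.0 / n_questions
    degree = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]
```

Tags that never co-occur with another tag get an all-zero row here; how such tags were handled in the competition is not stated in the text.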
There are a total of 281 tags in the dataset. After merging duplicated tags, we cluster the tags into 40 clusters. The resulting binary feature vector of length 40 is used as a meta-feature.
We use the validation set predictions from the LFL models, combined with side-information and meta-features, as the training set for GBDT. We observe a marginal improvement over linear regression with GBDT (Table 7). Adding the temporal meta-features listed in Table 14 to the training set further improves the BCD (Table 8).
For the final submission, we randomly split the GBDT training set into two folds. We train GBDT with varying µ and depth parameters on fold one and predict on fold two, and vice versa, and save the predictions for performing linear regression. The test set predictions are calculated using the weights from linear regression and the predictions from models trained on the entire training set. The training set here refers to the validation set predictions from the LFL models, combined with side-information and meta-features.
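The two-fold procedure produces, for every training example, a prediction from a model that did not see that example. The following is a minimal sketch that treats the learner as a black box. The `fit` callback, the seed and the function name are hypothetical, and any learner, for example a GBDT with one particular parameter setting, could be plugged in.

```python
import random


def out_of_fold_predictions(X, y, fit, seed=0):
    """Two-fold out-of-fold predictions.

    fit(X_train, y_train) must return a function mapping a row to a prediction.
    Each example is predicted by the model trained on the other fold.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # random split into two folds
    half = len(idx) // 2
    folds = [idx[:half], idx[half:]]
    oof = [None] * len(X)
    for k in (0, 1):
        train, held = folds[1 - k], folds[k]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held:
            oof[i] = model(X[i])
    return oof
```

Running this once per GBDT parameter combination gives one column of out-of-fold predictions per model, and those columns form the matrix on which the final linear regression weights are fitted.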
Table 7. Results from GBDT with the side-information listed in Table 12 and the meta-features listed in Table 13.

| Param (N, µ, d) | Validation | Test (40%) | Test (60%) |
|---|---|---|---|
| 100, 0.1, 3 | 0.24782 | 0.25138 | 0.25269 |
Table 8. Results from GBDT with side-information, i) meta-features in Table 13 and ii) temporal meta-features in Table 14.

| Param (N, µ, d) | Validation | Test (40%) | Test (60%) |
|---|---|---|---|
| 200, 0.1, 3 | 0.24148 | 0.24787 | 0.24906 |
| 200, 0.05, 3 | 0.24404 | 0.24793 | 0.24913 |
| 100, 0.1, 5 | 0.23563 | 0.24812 | 0.24894 |
| 100, 0.1, 3 | 0.24401 | 0.24792 | 0.24922 |
Table 9. Results from linear regression on 15 GBDT models, using the following parameter combinations: (µ = 0.1, 0.2, 0.3, 0.4, 0.5) and (depth = 3, 5, 7).

| Test (40%) | Test (60%) |
|---|---|
| 0.24665 | 0.24792 |
5. Conclusions
We interpreted the task for the competition as a dyadic prediction task and experimented with different variations of LFL. The basic LFL model performs better than the baseline and would have placed us 26th in the competition. We explored different ways to encode categorical side-information features in the LFL model. The average prediction from all the LFL models would have placed us 14th. We further explored different techniques to combine predictions, i.e. linear regression and gradient boosted decision trees. Using linear regression or GBDT to combine predictions would have placed us 11th in the competition. The final improvement in BCD was achieved after adding temporal features and using linear regression on different parameter combinations of GBDT, placing us 5th on the private leaderboard and 4th on the public leaderboard. Table 10 contains both the public and private leaderboard ranks of the various methods and Table 11 lists the top 10 finishers in the competition.

Table 10. Leaderboard ranks of different methods

| Method | Public | Private |
|---|---|---|
| Rasch Model | 78 | 76 |
| Basic LFL | 27 | 26 |
| Average of LFL models | 17 | 14 |
| Ensemble methods: Linear regression | 11 | 11 |
| Ensemble methods: GBDT | 11 | 11 |
| Ensemble methods: GBDT + LR with temporal features | 4 | 5 |

Table 11. Private Leaderboard ranks

| Rank | Team | Location |
|---|---|---|
| 1 | Steffen | University of Konstanz |
| 2 | D’yakonov Alexander | Moscow State University |
| 3 | Ekla | |
| 4 | Planet Thanet & Birutas | UK & Brazil |
| 5 | UCSD-Triton | UC San Diego |
| 6 | James Petterson | Australian National University |
| 7 | Indy Actuaries | Indianapolis |
| 8 | Yetiman | Northpole |
| 9 | Gxav | Singapore |
| 10 | Two Tacos | UC Irvine |
6. Acknowledgments
The author wishes to acknowledge the valuable inputs, insights and advice from Aditya Menon, PhD student, CSE, UCSD and Charles Elkan, Professor, CSE, UCSD during and after the competition, without which the project would not have been possible.
References
Agarwal, Deepak and Chen, Bee-Chung. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pp. 19–28, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.
Friedman, Jerome H. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38:367–378, 1999.
Gemulla, Rainer, Nijkamp, Erik, Haas, Peter J., and Sismanis, Yannis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0813-7.
Hofmann, Thomas, Puzicha, Jan, and Jordan, Michael I. Learning from dyadic data. In Proceedings of the 1998 conference on Advances in neural information processing systems II, pp. 466–472, Cambridge, MA, USA, 1999. MIT Press. ISBN 0-262-11245-0.
Li, Xiaochun and Xu, Ronghui (eds.). High-dimensional data analysis in cancer research. Springer, CA, U.S.A, 2009.
Menon, Aditya Krishna and Elkan, Charles. A log-linear model with latent features for dyadic prediction. In ICDM’10, pp. 364–373, 2010.
Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856. MIT Press, 2001.
Rasch, Georg. Estimation of parameters and control of the model for two response categories, 1960.
Schein, Andrew I., Saul, Lawrence K., and Ungar, Lyle H. A generalized linear model for principal component analysis of binary data, 2003.
Takacs, Gabor, Pilaszy, Istvan, Nemeth, Bottyan, and Tikk, Domonkos. Scalable collaborative filtering approaches for large recommender systems. J. Mach. Learn. Res., 10:623–656, June 2009. ISSN 1532-4435.
Töscher, Andreas, Jahrer, Michael, and Bell, Robert M. The BigChaos solution to the Netflix grand prize, 2009.
Table 12. Side-information

Side-information for a question (all are categorical)
Question-Type
Group
Track
Subtrack
Tags

Interaction side-information (all are categorical)
Game
Number of players (no=1, no>1)
7. Appendix
7.1. Binomial Capped Deviance
Algorithm 4 Binomial Capped Deviance
Input: Dyads (s, q) → y ∈ Validation set (V)
  S: latent matrix for students
  Q: latent matrix for questions
BCD = 0
for each (s, q) → y do
  p = p(y = 1|(s, q)) = 1 / (1 + exp(−S^1_s · Q^1_q))
  if p > .99 then
    p = .99
  end if
  if p < .01 then
    p = .01
  end if
  BCD = BCD + y × log(p) + (1 − y) × log(1 − p)
end foreach
BCD = BCD / |V|
return BCD
Table 13. Meta-features

Meta-features (numeric)
Variance in outcomes for the user
Variance in outcomes for the question
Ratio of correct responses for the following: 1. user, 2. question, 3. group, 4. track, 5. sub-track, 6. game-type
log(number of responses of the type) for the following: 1. user, 2. question, 3. group, 4. track, 5. sub-track, 6. game-type
Preprocessed tags (feature vector of length 40)
Table 14. Meta-features

Temporal meta-features (numeric)
1. Time to answer the question.
2. Average time (correct outcome) for the current: User, Question, Group, Track, Subtrack, Game Type
3. Average time (incorrect outcome) for the current: User, Question, Group, Track, Subtrack, Game Type
4. Difference between the time to answer the question and the average time (correct outcome) for the current: User, Question, Group, Track, Subtrack, Game Type
5. Difference between the time to answer the question and the average time (incorrect outcome) for the current: User, Question, Group, Track, Subtrack, Game Type