
“What do you know?” - A latent feature approach for Kaggle’s GrockIt challenge

Rohan Anil [email protected]

University of California, San Diego, Department of Computer Science & Engineering, La Jolla, CA 92092 USA

March 19, 2012

Abstract

We describe our efforts in solving the GrockIt competition in this report. The goal of the competition was to improve the state of the art in student evaluation; the task was to predict whether a student will answer the next test question correctly. The dataset was provided by GrockIt, an online platform where students practice questions for competitive exams. At the end of the competition there were a total of 252 teams, 581 individuals and 1803 submissions. We train latent feature log-linear (LFL) models for the prediction task and exploit the rich meta-information associated with questions and the dyad to improve the performance of the models. Finally, we explore different techniques to blend the predictions from multiple models. The competition used binomial capped deviance (BCD) as the metric to rank teams. Our team, ‘UCSD Triton’, placed 4th on the public leaderboard with a BCD of 0.24665 and 5th on the final private leaderboard with a BCD of 0.24792.

1. Introduction

Kaggle.com is a data-mining competition platform. We participated in one of its competitions, titled “What do you know?”, aimed at improving the state of the art in student evaluation, and finished at rank 5. The competition dataset was provided by GrockIt, an online learning platform that provides tools to help students prepare for competitive exams such as the GMAT, ACT and SAT. The dataset mainly contains performance information of students on various questions.

Inspired by the success of latent-feature methods on the Netflix Prize challenge (Töscher et al., 2009), we investigate whether latent-feature methods are competitive on this task. To improve our leaderboard rank, we explore two different ensemble learning techniques, namely linear regression and gradient boosted decision trees. The report is organized as follows. Section 2 introduces the task of dyadic prediction, the dataset that was available for the competition, the metric used to rank the teams, and the generic latent feature log-linear (LFL) model (Menon & Elkan, 2010). Section 3 formulates the student-evaluation task as a dyadic-prediction task and derives the stochastic gradient update rules for the LFL model. Section 4 describes two techniques we used for ensembling and their results on the dataset, and we conclude in Section 5.

2. GrockIt-Kaggle dataset

The dataset contains student responses to various questions. There are a total of 179,107 users and 6,046 questions in the training set. The dataset is similar to a typical dyadic dataset, with a couple of key differences: i) duplicate dyad pairs with different outcomes can exist in the training set, since a student can answer a question many times; ii) in some game types, students can collaboratively answer questions. The value of the feature “number of players” is the number of students answering the question. GrockIt provides a chat-box for the students to discuss the answer, and if the question is a multiple-choice question, students can leave comments on the choices. The next question is not displayed until everyone answers the current question.

We can interpret each pair of user and question as a dyad:

(s, q) → y_(s,q)

s ∈ S: Students, q ∈ Q: Questions, y ∈ O: Outcomes


Training set T = {(s, q) → y}

Figure 1. GrockIt-Kaggle dataset

The training set contains a total of 4,851,476 dyad pairs, each with its corresponding label, the outcome. The outcome can be of four types: i) correct, ii) incorrect, iii) skipped and iv) timed-out. The validation set contains 80,075 dyads from 80,075 unique students, and the test set contains 93,100 dyads from 93,100 unique students. The validation-set students are a subset of the test-set students; the students who appear in the test set but not in the validation set are those with only a few dyads in the training set. The validation set has later outcomes relative to the training set, and the test set has later outcomes relative to the validation set.

Our task is to predict the probability of a correct outcome for every dyad in the test set:

Pr(y = ‘Correct’ | (s, q) ∈ Test set)

The validation and test sets do not contain any skipped or timed-out outcomes. The competition uses 40% of the test data to rank teams on the public leaderboard and 60% of the test data for the final ranking on the private leaderboard, which was only revealed at the end of the competition. The organizers used this measure to prevent overfitting.

2.1. Baseline - Rasch Model

Kaggle provided a baseline for the dataset based on the Rasch model (Rasch, 1960), which is widely used in educational psychology research. The prediction from the model is

Pr((s, q) → 1) = exp(B_s − δ_q) / (1 + exp(B_s − δ_q))

where B_s is interpreted as the ability of student s and δ_q is interpreted as the difficulty of question q.
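As a concrete illustration, a minimal Python sketch of the Rasch prediction under the parameterization above (the function and variable names are our own, not from the baseline code):

import numpy as np

def rasch_predict(ability, difficulty):
    # Rasch model: probability that a student with the given
    # ability answers a question of the given difficulty correctly.
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

# A student slightly stronger than the question is difficult:
print(rasch_predict(0.5, -0.2))  # ~0.67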

2.2. Side Information

The dataset contains two types of side-information: question side-information, listed in Table 1, and student-question interaction side-information, listed in Table 2.

Table 1. Side-information associated with a question.

Type           Description
Question-Type  i) Multiple Choice, ii) Free Response
Group          i) ACT, ii) GMAT, iii) SAT
Track          9 types, listed in the appendix
Subtrack       15 types
Tags           each question is tagged with the subjects it is from; finer granularity than subtrack
Question-set   questions which share a question-set id and share similarity in presentation on the screen

Table 2. Side-information for a dyad.

Type               Description
Game               12 types of games
Number of players  number of players in the game
Started            date and time the question was seen by the student
Answered-at        date and time the question was answered by the student
Deactivated        date and time the question was cleared from the screen

2.3. Preprocessing the dataset

Our task is to predict the probability of a correct outcome for the dyads in the test set. The test set is guaranteed to contain only correct and incorrect outcomes, while the training set also contains dyads with skipped and timed-out outcomes, as illustrated in Figure 2. We create two training sets from the original set. In the first training set, we exclude all dyads with a skipped or timed-out outcome. In the second training set, we include all dyads but treat the skipped and timed-out outcomes as incorrect outcomes.

2.4. Binomial Capped Deviance

The competition uses binomial capped deviance (BCD) as the metric to rank the teams. The metric is closely related to the negative log-likelihood. For binary outcomes, BCD is calculated as follows. Let Pr(y = 1|(s, q) ∈ T) be the predicted probability of a correct (1) outcome for a dyad (s, q) observed in the training set T.


Figure 2. Histogram of outcomes (correct, incorrect, skipped, timed-out) in the training set.

Figure 3. Division of questions by type (multiple choice vs. free response).

Figure 4. Number of observed dyads in the training set per group (ACT, GMAT, SAT).

The BCD of the training set is then

BCD(T) = −(1/N_T) Σ_((s,q)→y ∈ T) ( y log(p) + (1 − y) log(1 − p) )

where N_T is the number of dyads in T and

p = max(0.01, min(0.99, Pr(y = 1|(s, q) ∈ T)))
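For concreteness, a small Python sketch of this metric as we read the definition (assuming NumPy arrays of 0/1 labels and predicted probabilities):

import numpy as np

def binomial_capped_deviance(y_true, y_pred):
    # Mean negative log-likelihood with predictions capped
    # to [0.01, 0.99], matching the BCD definition above.
    p = np.clip(y_pred, 0.01, 0.99)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))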

3. Dyadic Prediction

A dyadic prediction task is a learning task which involves predicting a class label for a pair of items (Hofmann et al., 1999):

(u, i) → y_(u,i), where u ∈ U, i ∈ I and y ∈ Y

The task involves a training set of pairs (u, i), each with its corresponding label y, i.e., T = {(u, i) → y}. There is sometimes more information available in the dataset, i.e., information associated with each item of the dyad and information associated with the dyadic pair. This information can be processed into an explicit feature vector, which is termed the side-information.

3.1. Latent feature log-linear model

For this competition, we need to predict the probability of a correct outcome for the dyadic pairs in the test set, and we also want to leverage the side-information available in the dataset. This motivates the use of the latent feature log-linear (LFL) model for dyadic prediction (Menon & Elkan, 2010). The LFL model can predict well-calibrated probabilities and can incorporate side-information in the training process.

Let (u, i) → y ∈ Y, where |Y| > 2, i.e., a multi-class problem. The probability is

Pr(y|(u, i)) ∝ exp(U^y_u · I^y_i)

U^y → (number of items of type u) × k matrix
I^y → (number of items of type i) × k matrix
k = number of latent features
U^y_u = u-th row vector of matrix U^y
I^y_i = i-th row vector of matrix I^y

We predict the label which has the highest probability according to the model,

ŷ = argmax_y Pr(y|(u, i)) / Σ_y′ Pr(y′|(u, i))

We train the LFL model using the negative log-likelihood as the objective function and use stochastic gradient descent (SGD) to learn the model parameters. The negative log-likelihood (LL) is

LL = Σ_((u,i)→y ∈ T) −log Pr(y|(u, i)) = Σ_((u,i)→y ∈ T) ll_(u,i)


In stochastic gradient descent (SGD), the contribution of each example (u, i) → y ∈ T to the negative log-likelihood, with p = Pr(y|(u, i)), is

ll_((u,i)→y) = −log(p)

∂ll/∂U^y_u = −(1/p) ∂p/∂U^y_u

∂p/∂U^y_u = (1 − p) × p × I^y_i

∂ll/∂U^y_u = −(1 − p) × I^y_i

and similarly ∂ll/∂I^y_i = −(1 − p) × U^y_u

The derivatives with respect to U^y′_u and I^y′_i, where y′ ∈ Y and y′ ≠ y, are

∂ll/∂U^y′_u = p_((u,i)→y′) × I^y′_i

and similarly ∂ll/∂I^y′_i = p_((u,i)→y′) × U^y′_u

After adding regularization terms to the objective function we get

Objective = Σ_((u,i)→y ∈ T) ( −log(p_((u,i)→y)) + (μ/2)‖U_u‖² + (μ/2)‖I_i‖² )

The update rules in the SGD algorithm used to minimize the objective function are

U^y_u = U^y_u − λ × (∂ll/∂U^y_u + μ × U^y_u)

I^y_i = I^y_i − λ × (∂ll/∂I^y_i + μ × I^y_i)

3.2. LFL on Kaggle-GrockIt dataset

The test set only contains two outcomes, i) correct and ii) incorrect, for which we use the binary LFL model. The binary case can be written as follows:

Pr(y|(s, q)) = exp(S^y_s · Q^y_q) / Σ_y′ exp(S^y′_s · Q^y′_q)

where y = 1 for a correct response and 0 for an incorrect response. We can fix S^0 and Q^0 to be zero, i.e., keep class “0” as the base class:

Pr(y = 1|(s, q)) = 1 / (1 + exp(−S^1_s · Q^1_q))

The binary LFL model has appeared in the literature before (Schein et al., 2003; Agarwal & Chen, 2009). We use stochastic gradient descent to train this model.
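A minimal Python sketch of binary-LFL training with SGD, mirroring the update rules above (the array shapes, initialization scale and hyperparameter values are illustrative assumptions, not our exact settings):

import numpy as np

def train_binary_lfl(dyads, n_students, n_questions, k=5,
                     lam=0.01, mu=0.001, epochs=10):
    # dyads: list of (s, q, y) with y in {0, 1}.
    rng = np.random.default_rng(0)
    S = 0.01 * rng.standard_normal((n_students, k))
    Q = 0.01 * rng.standard_normal((n_questions, k))
    for _ in range(epochs):
        for s, q, y in dyads:
            p = 1.0 / (1.0 + np.exp(-S[s] @ Q[q]))  # Pr(y = 1 | (s, q))
            g = p - y                                # gradient factor
            S_s = S[s].copy()                        # pre-update copy of S_s
            S[s] -= lam * (g * Q[q] + mu * S[s])
            Q[q] -= lam * (g * S_s + mu * Q[q])
        lam *= 0.99                                  # learning-rate decay
    return S, Q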

Figure 5. Number of dyads by number of players (one player vs. more than one).

3.3. SGD Training

In the SGD algorithm, we do not randomize the dataset. Instead, we order the dyads by time and run stochastic gradient descent over them in that order, so that the model adapts to recently answered questions.

3.4. Parallelism

The updates of the stochastic gradient descent algorithm are independent for dyads (u, i) and (u′, i′) where u ≠ u′ and i ≠ i′. We exploit this parallelism by splitting the input training set into non-overlapping blocks. We observed this parallelism while competing in the KDD Cup 2011 and later found that the same observation was made independently by another group (Gemulla et al., 2011). Each thread processes a set of blocks such that no two blocks share a row or column index, as shown in Figure 6. Figure 7 shows the time taken for one epoch of LFL versus the number of cores, using a latent feature size of 5. The experiments were run on an Intel Core i5-450M processor.
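A sketch of the block-scheduling idea (our own illustration, not the exact implementation; student_of and question_of are hypothetical accessors): dyads are bucketed into an n × n grid of blocks by student and question id, and each parallel round processes blocks whose row and column indices are pairwise distinct, so no two threads ever touch the same latent vectors.

from collections import defaultdict

def schedule_blocks(dyads, n_blocks, student_of, question_of):
    # Bucket dyads into an n_blocks x n_blocks grid, then yield
    # 'diagonal' rounds of blocks with pairwise-distinct row and
    # column indices; blocks in a round can run in parallel.
    grid = defaultdict(list)
    for d in dyads:
        r = student_of(d) % n_blocks
        c = question_of(d) % n_blocks
        grid[(r, c)].append(d)
    for shift in range(n_blocks):
        yield [grid[(r, (r + shift) % n_blocks)] for r in range(n_blocks)]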

3.5. LFL Models

In the GrockIt-Kaggle dataset, side-information is available for each question: group, track, sub-track, question-type and tags.


Figure 6. Parallelism

Algorithm 1 Stochastic Gradient Descent

Input: dyads (s, q) → y ∈ Training set
EL: epoch limit
μ: regularization
λ: learning rate
previous-bcd = DOUBLE_MAX
k: latent feature size
for epoch = 1 to EL do
  for each (s, q) → y do
    // Update latent vectors
    S^1_s = S^1_s − λ × ((p_y − y) × Q^1_q + μ × S^1_s)
    Q^1_q = Q^1_q − λ × ((p_y − y) × S^1_s + μ × Q^1_q)
  end for each
  current-bcd = BCD on the validation set
  if current-bcd > previous-bcd then
    break
  else
    previous-bcd = current-bcd
  end if
  λ = λ × 0.99
end for

Figure 7. Time per epoch of LFL vs. number of cores.

All of the available side-information is categorical in nature. Although the LFL model can leverage any type of side-information, we only experimented with the models in Table 3 during the competition.

The following section describes how to add side-information for the categorical variable “group” to the LFL model.

3.6. Adding Side-Information to the LFL Model

For a question q, let g = group(q). We add a latent vector for each group, so the prediction equation becomes

Pr(y = 1|(s, q)) = 1 / (1 + exp(−S^1_s · (Q^1_q + G^1_g)))

The update rules are

S^1_s = S^1_s − λ × ((p_y − y) × (Q^1_q + G^1_g) + μ × S^1_s)

Q^1_q = Q^1_q − λ × ((p_y − y) × S^1_s + μ × Q^1_q)

G^1_g = G^1_g − λ × ((p_y − y) × S^1_s + μ × G^1_g)

S^1 → (number of students) × k matrix
Q^1 → (number of questions) × k matrix
G^1 → (number of groups) × k matrix
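Continuing the earlier training sketch, one SGD step for this group-augmented model might look as follows (illustrative only; G is a groups × k matrix and g = group(q)):

import numpy as np

def sgd_step_with_group(S, Q, G, s, q, g, y, lam=0.01, mu=0.001):
    # One SGD step for the group-augmented binary LFL model.
    p = 1.0 / (1.0 + np.exp(-S[s] @ (Q[q] + G[g])))
    grad = p - y
    S_s = S[s].copy()  # use the pre-update student vector below
    S[s] -= lam * (grad * (Q[q] + G[g]) + mu * S[s])
    Q[q] -= lam * (grad * S_s + mu * Q[q])
    G[g] -= lam * (grad * S_s + mu * G[g])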

3.7. Results

As discussed in the previous section, the training set contains four types of outcomes: i) correct, ii) incorrect, iii) skipped and iv) timed-out. We train the LFL models both on the training set after excluding skipped and timed-out outcomes, with results presented in Table 4, and on the entire training set with skipped and timed-out responses treated as incorrect, with results presented in Table 5.


Table 3. LFL models.

Prediction p((s, q) → 1)                                 Description
1.  1/(1 + exp(−S^1_s · Q^1_q))                          Basic LFL model
2.  1/(1 + exp(−S^1_s · (Q^1_q + G^1_g)))                G - Group
3.  1/(1 + exp(−S^1_s · (Q^1_q + T^1_t)))                T - Track
4.  1/(1 + exp(−S^1_s · (Q^1_q + ST^1_st)))              ST - Subtrack
5.  1/(1 + exp(−S^1_s · (Q^1_q + QT^1_qt)))              QT - Question Type
6.  1/(1 + exp(−S^1_s · (Q^1_q + GT^1_gt)))              GT - Game Type
7.  1/(1 + exp(−S^1_s · (Q^1_q + G^1_g + ST^1_st)))      G - Group & ST - Subtrack
8.  1/(1 + exp(−S^1_s · (Q^1_q + G^1_g + GT^1_gt)))      G - Group & GT - Game Type
9.  1/(1 + exp(−S^1_s · (Q^1_q + ST^1_st + GT^1_gt)))    ST - Subtrack & GT - Game Type
10. 1/(1 + exp(−S^1_s · (Q^1_q + ST^1_st + QT^1_qt)))    ST - Subtrack & QT - Question Type

All model parameters were tuned by grid search. The test predictions were obtained after re-training each model on both the training and validation sets with the tuned parameters. The results on the 60% portion of the test set only became available after the end of the competition.

Table 4. Results (BCD) on the training set after excluding skipped and timed-out responses.

Model    Validation  Test(40%)  Test(60%)
Rasch    -           0.25663    0.25766
LFL-1    0.252465    0.25398    0.25483
LFL-2    0.252250    0.25343    0.25451
LFL-3    0.251941    0.25340    0.25450
LFL-4    0.251842    0.25288    0.25389
LFL-5    0.252021    0.25300    0.25426
LFL-6    0.251819    0.25296    0.25446
LFL-7    0.252014    0.25328    0.25433
LFL-8    0.251916    0.25310    0.25458
LFL-9    0.251561    0.25266    0.25423
LFL-10   0.251802    0.25291    0.25389
Average  0.251320    0.25250    0.25362


Table 5. Results (BCD) on the entire training set with skipped and timed-out responses treated as incorrect.

Model    Validation  Test(40%)  Test(60%)
LFL-1    0.259718    0.25784    0.25921
LFL-2    0.259337    0.25719    0.25869
LFL-3    0.258626    0.25668    0.25799
LFL-4    0.258597    0.25654    0.25795
LFL-5    0.259242    0.25692    0.25847
LFL-6    0.258606    0.25641    0.25810
LFL-7    0.258980    0.25683    0.25832
LFL-8    0.259162    0.25700    0.25874
LFL-9    0.258639    0.25646    0.25822
LFL-10   0.258897    0.25665    0.25821
Average  0.258363    0.25624    0.25778

4. Ensemble Learning

The main motivation for ensemble learning in this competition is that no single model performs well on every dyad (Takacs et al., 2009); combining predictions from multiple models can outperform each of the individual models. In the following sections we describe the different techniques we used to combine predictions.

The simplest technique to combine predictions from multiple models is linear regression. Define a matrix P where p_(i,j) is the prediction for dyad i using model j. Linear regression then learns a weight vector w such that

Pw ≈ Y

where Y is a column vector whose entry Y_i is the true label of dyad i. The prediction for dyad i is then

Σ_j p_(i,j) w_j
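A least-squares blend can be computed in a few lines (a sketch; P and y stand for the validation-set prediction matrix and labels, and the clipping to the BCD caps is our own choice):

import numpy as np

def fit_blend_weights(P, y):
    # Least-squares weights w minimizing ||P w - y||^2.
    w, *_ = np.linalg.lstsq(P, y, rcond=None)
    return w

def blend(P_test, w):
    # Blended predictions, clipped to the BCD caps.
    return np.clip(P_test @ w, 0.01, 0.99)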

To avoid overfitting, we tune the parameters of each model using a held-out set. To achieve the best performance on the test set, it is advisable to create a held-out set which is similar to the test set. For the competition, we trained the LFL models on the training set and predicted on the validation set. We cross-validated, i.e., tuned the learning rate λ, the regularization μ and the number of epochs, by treating the validation set as the held-out set. Finally, the validation-set predictions are used for linear regression (ensemble learning).

We re-train each LFL model on both the training set and the validation set using the tuned parameters to generate predictions on the test set. The predictions from all the models and the weights learned by linear regression are used for the final test-set predictions. The results from linear regression are listed in Table 6.


Figure 8. Decision tree. X1 to X10 are training examples.

Table 6. Results from linear regression. i) WITHOUT: training without skipped and timed-out responses. ii) WITH: training with skipped and timed-out responses treated as incorrect.

Description  Validation  Test(40%)  Test(60%)
Without      0.25114     0.25195    0.25311
With         0.256500    0.25599    0.25770
Combined     0.251025    0.25179    0.25300

4.1. Gradient Boosted Decision Trees

The main motivation for using Gradient Boosted Decision Trees (GBDT) (Friedman, 1999) is to include side-information in the ensemble learning. GBDT is a powerful learning algorithm that is widely used (see Li & Xu, 2009, chap. 6). The core of the algorithm is a decision tree learner.

A decision tree is illustrated in Figure 8. The internal nodes are decision functions that select which subtree to send an example through, and the leaves are nodes holding a small set of training examples which is used for the final prediction.

As illustrated, decision trees can capture interactions between features and can handle both numeric and categorical features.

For a numeric feature, we find a split value, and for a categorical feature a subset of categories, that minimizes a particular criterion; for regression tasks, the sum of squared errors is generally used as the criterion.

A decision function is learned at every node so that it partitions the dataset into two, and this process is applied recursively until a leaf node has only a few examples. The decision function at the root node is learned using the entire dataset. We can restrict the depth of the decision tree using a parameter d, which halts the recursive tree building at that depth.

The example to be predicted is used as input to the decision tree, and the decision function at every node sends the example towards a particular subtree. We finally end up at a leaf node which contains a set of training examples; for a regression task, the prediction is the average of the target values of that set.

In gradient boosting, we select a base learner and a loss function. For this competition we used decision trees as the base learner and squared error as the loss function. Gradient boosting is an iterative procedure in which we fit a base learner to the gradients from the previous iteration:

f(x) = Σ_{j=0}^{J} ρ_j · T_j(x)    (1)

where T_j(·) is a decision tree.

Algorithm 2 Gradient Boosting

Initialize: f_0 = T_0 = Count(Outcome == ‘Correct’)/Total and ρ_0 = 1
Input: L(y_i, f_{j−1}(x_i)) = (y_i − f_{j−1}(x_i))², training data (x_i, y_i)
for j = 1 to J do
  (a) ỹ_i = −∂L(y_i, f_{j−1}(x_i))/∂f_{j−1} for i = 1..n
  (b) Fit tree T_j(·) to (x_i, ỹ_i) for i = 1..n
  (c) ρ_j = argmin_ρ Σ_i L(y_i, f_{j−1}(x_i) + ρ T_j(x_i))
  (d) f_j = f_{j−1} + ρ_j × T_j(x)
end for

4.2. Application of GBDTs

The input feature vector of a training example contains the predictions of the LFL models, the side-information associated with the question, interaction features between users and questions, and meta-features that we compute from statistics of the dataset. We can add a regularization (shrinkage) parameter λ, 0 < λ ≤ 1, to step (d) of Algorithm 2, which results in the following update:

f_j = f_{j−1} + λ × ρ_j × T_j(x)
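A compact Python sketch of Algorithm 2 with squared-error loss and shrinkage, where the negative gradient is simply the residual y − f(x) (scikit-learn's DecisionTreeRegressor stands in for the base learner; for squared error the line search gives ρ_j = 1, and all parameter values are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, depth=3, shrinkage=0.1):
    f = np.full(len(y), y.mean())        # f0: overall rate of correct outcomes
    trees = []
    for _ in range(n_trees):
        residual = y - f                 # negative gradient of squared error
        t = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        f += shrinkage * t.predict(X)    # step (d) with shrinkage lambda
        trees.append(t)
    return trees, y.mean()

def gb_predict(trees, f0, X, shrinkage=0.1):
    return f0 + shrinkage * np.sum([t.predict(X) for t in trees], axis=0)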

The side-information and meta-information we add as features in ensemble learning are listed in Table 12 and Table 13 in the appendix.


For each question, the dataset contains a set of tags, each corresponding to a subject that the question is from. We first manually merge duplicated tags. We then cluster the tags using spectral clustering (Ng et al., 2001), with the normalized co-occurrence of tags as the similarity measure used to generate the affinity matrix A.

Algorithm 3 Spectral Clustering

(a) A(i, j) = ( Σ_{k=1}^{|Q|} I(tag_i, tag_j ⊆ Tags(q_k)) ) / |Q|
(b) D(i, i) = Σ_j A(i, j)
(c) NL = D^(−1/2) L D^(−1/2), where L is the graph Laplacian of A, NL ∈ R^(n×n)
(d) V ∈ R^(n×k) contains the first k eigenvectors of NL
(e) U(i, j) = V(i, j) / ( Σ_k V(i, k)² )^(1/2)

Cluster the rows of U using k-means; the cluster c_i assigned to row i is the cluster for tag_i.

There are a total of 281 tags in the dataset. After merging duplicated tags, we cluster the tags into 40 clusters. The resulting binary feature vector of length 40 is used as a meta-feature.
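A sketch of this tag-clustering step using scikit-learn's SpectralClustering on a precomputed co-occurrence affinity (the helper name and input format are hypothetical; only the 40-cluster count and the normalized co-occurrence affinity come from the text):

import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_tags(tag_sets, n_tags, n_clusters=40):
    # tag_sets: one set of tag indices per question.
    A = np.zeros((n_tags, n_tags))
    for tags in tag_sets:
        for i in tags:
            for j in tags:
                A[i, j] += 1.0
    A /= len(tag_sets)  # normalized co-occurrence affinity
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return sc.fit_predict(A)  # cluster id for each tag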

We use the validation-set predictions from the LFL models, combined with the side-information and meta-features, as the training set for GBDT. We observe a marginal improvement over linear regression with GBDT (Table 7). Adding the temporal meta-features listed in Table 14 to the training set further improves the BCD (Table 8).

For the final submission, we randomly split the GBDT training set into two folds. We train GBDT with varying μ and depth parameters on fold one and predict on fold two, and vice versa, and save the predictions for a final round of linear regression. The test-set predictions are calculated using the weights from this linear regression and predictions from models trained on the entire training set. The training set here refers to the validation-set predictions from the LFL models, combined with the side-information and meta-features.

Table 7. Results from GBDT with the side-information listed in Table 12 and the meta-features listed in Table 13.

Param(N,μ,d)  Validation  Test(40%)  Test(60%)
100,0.1,3     0.24782     0.25138    0.25269

Table 8. Results from GBDT with side-information, i) meta-features in Table 13 and ii) temporal meta-features in Table 14.

Param(N,μ,d)  Validation  Test(40%)  Test(60%)
200,0.1,3     0.24148     0.24787    0.24906
200,0.05,3    0.24404     0.24793    0.24913
100,0.1,5     0.23563     0.24812    0.24894
100,0.1,3     0.24401     0.24792    0.24922

Table 9. Results from linear regression on 15 GBDT models, using the parameter combinations μ ∈ {0.1, 0.2, 0.3, 0.4, 0.5} and depth ∈ {3, 5, 7}.

Test(40%)  Test(60%)
0.24665    0.24792

5. Conclusions

We interpreted the task for the competition as a dyadic prediction task and experimented with different variations of the LFL model. The basic LFL model performs better than the baseline and would have placed us 26th in the competition. We explored different ways to encode categorical side-information in the LFL model; the average prediction of all the LFL models would have placed us at rank 14. We further explored techniques to combine predictions, namely linear regression and gradient boosted decision trees.

Table 10. Leaderboard ranks of different methods

Method                            Public  Private
Rasch Model                       78      76
Basic LFL                         27      26
Average of LFL models             17      14
Ensemble methods:
Linear regression                 11      11
GBDT                              11      11
GBDT + LR with temporal features  4       5


Table 11. Private Leaderboard ranks

Rank  Team                     Location
1     Steffen                  University of Konstanz
2     D’yakonov Alexander      Moscow State University
3     Ekla
4     Planet Thanet & Birutas  UK & Brazil
5     UCSD-Triton              UC San Diego
6     James Petterson          Australian National University
7     Indy Actuaries           Indianapolis
8     Yetiman                  Northpole
9     Gxav                     Singapore
10    Two Tacos                UC Irvine

Using linear regression or GBDT to combine predictions would have placed us at rank 11 in the competition. The final improvement in BCD was achieved by adding temporal features and using linear regression over different parameter combinations of GBDT, placing us 5th on the private leaderboard and 4th on the public leaderboard. Table 10 contains the public and private leaderboard performances of the various methods, and Table 11 lists the top 10 finishers in the competition.

6. Acknowledgments

The author wishes to acknowledge the valuable input, insights and advice, during and after the competition, of Aditya Menon, PhD student, CSE, UCSD, and Charles Elkan, Professor, CSE, UCSD, without which this project would not have been possible.

References

Agarwal, Deepak and Chen, Bee-Chung. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 19–28, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.

Friedman, Jerome H. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38:367–378, 1999.

Gemulla, Rainer, Nijkamp, Erik, Haas, Peter J., and Sismanis, Yannis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0813-7.

Hofmann, Thomas, Puzicha, Jan, and Jordan, Michael I. Learning from dyadic data. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pp. 466–472, Cambridge, MA, USA, 1999. MIT Press. ISBN 0-262-11245-0.

Li, Xiaochun and Xu, Ronghui (eds.). High-dimensional Data Analysis in Cancer Research. Springer, CA, USA, 2009.

Menon, Aditya Krishna and Elkan, Charles. A log-linear model with latent features for dyadic prediction. In ICDM ’10, pp. 364–373, 2010.

Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856. MIT Press, 2001.

Rasch, Georg. Estimation of parameters and control of the model for two response categories, 1960.

Schein, Andrew I., Saul, Lawrence K., and Ungar, Lyle H. A generalized linear model for principal component analysis of binary data, 2003.

Takacs, Gabor, Pilaszy, Istvan, Nemeth, Bottyan, and Tikk, Domonkos. Scalable collaborative filtering approaches for large recommender systems. J. Mach. Learn. Res., 10:623–656, June 2009. ISSN 1532-4435.

Töscher, Andreas, Jahrer, Michael, and Bell, Robert M. The BigChaos solution to the Netflix Grand Prize, 2009.


Table 12. Side-information

Side-information for a question (all categorical):
Question-Type, Group, Track, Subtrack, Tags

Interaction side-information (all categorical):
Game, Number of players (no = 1, no > 1)

7. Appendix

7.1. Binomial Capped Deviance

Algorithm 4 Binomial Capped Deviance

Input: dyads (s, q) → y ∈ Validation set (V)
S: latent matrix for students
Q: latent matrix for questions
BCD = 0
for each (s, q) → y do
  p = Pr(y = 1|(s, q)) = 1 / (1 + exp(−S^1_s · Q^1_q))
  if p > 0.99 then p = 0.99 end if
  if p < 0.01 then p = 0.01 end if
  BCD = BCD + y × log(p) + (1 − y) × log(1 − p)
end for each
BCD = −BCD / |V|
return BCD

Table 13. Meta-features

Meta-features (numeric):
- variance in outcomes for the user
- variance in outcomes for the question
- ratio of correct responses for: 1. user, 2. question, 3. group, 4. track, 5. sub-track, 6. game-type
- log(number of responses of each type) for: 1. user, 2. question, 3. group, 4. track, 5. sub-track, 6. game-type
- preprocessed tags (feature vector of length 40)

Table 14. Temporal meta-features

Temporal meta-features (numeric):
1. Time to answer the question.
2. Average time (correct outcome) for the current: user, question, group, track, subtrack, game type.
3. Average time (incorrect outcome) for the current: user, question, group, track, subtrack, game type.
4. Difference between the time to answer the question and the average time (correct outcome) for the current: user, question, group, track, subtrack, game type.
5. Difference between the time to answer the question and the average time (incorrect outcome) for the current: user, question, group, track, subtrack, game type.