Internet Mathematics 2011


Yandex Relevance Prediction Challenge
Overview of the “CLL” team’s solution

R. Gareev¹, D. Kalyanov², A. Shaykhutdinova¹, N. Zhiltsov¹

¹ Kazan (Volga Region) Federal University

² 10tracks.ru

28 December 2011

1 / 52

Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions

2 / 52


Problem statement

• Predict document relevance from user behavior, a.k.a. «Implicit Relevance Feedback»

• See also http://imat-relpred.yandex.ru/en for more details

4 / 52

User session example

[Session diagram, one region:
Q1 at T = 0 returns URLs 1 2 3 4 5; clicks: URL 3 at T = 10, URL 5 at T = 35, URL 1 at T = 100.
Q2 at T = 130 returns URLs 6 7 8 9 10; clicks: URL 6 at T = 150, URL 9 at T = 170.]

5 / 52

Labeled data

Given judgements for some pairs of documents and queries:

• a document Dj is relevant for a query Qi from a region R, or

• a document Dj is not relevant for a query Qi from a region R

6 / 52

The problem

• Given a set Q of search queries, for each (q, R) ∈ Q provide a sorted list of documents D1, . . . , Dm that are relevant to q in the region R

• Area Under the ROC Curve (AUC), averaged over all the test query-region pairs, is the target evaluation metric

7 / 52

AUC score

• Consider a ranked list of documents D1, . . . , Di, . . . , Dm, where D1, . . . , Di is the prefix of length i
• (FPR(i), TPR(i)) gives a single point on the ROC curve
• AUC is the area under the ROC curve
• AUC equals the probability that a randomly chosen relevant document comes before a randomly chosen non-relevant document

8 / 52
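The probabilistic definition of AUC above can be checked directly with a few lines of base R; the scores and labels below are illustrative toy data, not contest data:

```r
# AUC via its probabilistic definition: the chance that a randomly chosen
# relevant document is scored above a randomly chosen non-relevant one.
pairwiseAUC <- function(scores, labels) {
  pos <- scores[labels == 1]   # relevant documents
  neg <- scores[labels == 0]   # non-relevant documents
  # fraction of (relevant, non-relevant) pairs ranked correctly; ties count 1/2
  mean(outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n)))
}

scores <- c(0.9, 0.8, 0.4, 0.3, 0.2)  # classifier certainty scores
labels <- c(1, 0, 1, 0, 0)            # relevance judgements
pairwiseAUC(scores, labels)           # 5 of 6 pairs are ordered correctly
```

This pairwise form agrees with the area under the ROC curve traced by sweeping the prefix length i.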

Our problem restatement

• We consider it as a machine learning task
• Using relevance judgements, learn a classifier H(R,Q,D) that predicts whether document D is relevant to a query Q from a region R
• Replace RegionID, QueryID and DocumentID with related features extracted from the click log
• Use the classifier H(R,Q,D) to compute a list, sorted w.r.t. the classifier’s certainty scores, for a query Q from a region R

9 / 52


Features

• A «feature» is a function of (Q,R,D).
• Each feature is either associated or not associated with its related region

Types
• Document features
• Query features
• Time-concerned features

11 / 52

Document features

1 (Q,D) → Number of occurrences of the URL in the SERP list
2 (Q,D) → Number of clicks
3 (Q,D) → Click-through rate
4 (Q,D) → Average position in the click sequence
5 (Q,D) → Average rank in the SERP list
6 (Q,D) → Average rank in the SERP list when the URL is clicked
7 (Q,D) → Probability of being last clicked
8 (Q,D) → Probability of being first clicked
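Feature 3 (click-through rate) can be sketched per (Q,D) pair in base R from a toy impression log; the column names and data are assumptions for illustration, not the contest log format:

```r
# One row per SERP impression of a URL for a query; Clicked is 1 if the URL
# was clicked in that impression.
impressions <- data.frame(
  QueryID = c(174, 174, 174, 1974, 1974),
  URLID   = c(1625, 1625, 2510, 17562, 17562),
  Clicked = c(1, 0, 1, 1, 1)
)
# click-through rate = mean of the click indicator per (QueryID, URLID) pair
ctr <- aggregate(Clicked ~ QueryID + URLID, data = impressions, FUN = mean)
names(ctr)[3] <- "CTR"
ctr   # one row per (QueryID, URLID) pair with its click-through rate
```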

12 / 52

User session example

(The session diagram from slide 5 is shown again.)

13 / 52

Query features

1 (Q) → Average number of clicks in a subsession
2 (Q) → Probability of being rewritten (i.e., of not being the last query in a session)
3 (Q) → Probability of being resolved (probability of its results being last clicked)
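Query feature 2 (probability of being rewritten) can be sketched as the share of sessions in which the query is not the session’s final query; toy data and column names are assumptions, not the contest format:

```r
# Each row is one query issued in a session, in session order.
queries <- data.frame(
  SessionID = c(1, 1, 2, 3),
  QueryID   = c(5, 558, 5, 5)
)
# the last occurrence of each SessionID is that session's final query
isLast <- !duplicated(queries$SessionID, fromLast = TRUE)
queries$Rewritten <- as.numeric(!isLast)
# probability of being rewritten, per query
aggregate(Rewritten ~ QueryID, data = queries, FUN = mean)
```

Here query 5 is rewritten in one of its three sessions, giving probability 1/3.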

14 / 52

User session example

(The session diagram from slide 5 is shown again.)

15 / 52

Time-concerned features

1 (Q) → Average time to first click
2 (Q,D) → Average time spent reading a document D

16 / 52

User session example

(The session diagram from slide 5 is shown again.)

17 / 52


Two-phase extraction

1 Normalization
  • lookup filtering by the ’Important triples’ set
  • normalization is specific to each feature

2 Grouping and aggregating

19 / 52

Important triples

20 / 52

Normalization

• Converting click-log entries to a relational table with the following attributes:
  • feature domain attributes, e.g.:
    • (Q,R,U), (Q,U) for document features
    • (Q,R), (Q) for query features
  • feature attribute value
• Sequential processing ’session-by-session’:
  • reject spam sessions
  • emit values (possibly repeated)

21 / 52

Normalization example (I)

Click log (with SessionID, TimePassed omitted):

Action  QueryID  RegionID  URLs
Q       174      0         1625 1627 1623 2510 2524
Q       1974     0         2091 17562 1626 1623 1627
C       17562
C       1627
C       1625
C       2510

Intermediate table for the ’Average click position’ feature:

QueryID  URLID  RegionID  ClickPosition
1974     17562  0         1
1974     1627   0         2
174      1625   0         1
174      2510   0         2

22 / 52

Normalization example (II)

Click log (SessionID omitted):

Time  Action  QueryID  RegionID  URLs
0     Q       5        0         99 16 87 39
6     C       84
120   Q       558      0         84 5043 5041 5039
125   Q       8768     0         74672 74661 74674 74671
145   C       74661

Intermediate table for the ’Time to first click’ feature:

QueryID  RegionID  FirstClickTime
5        0         6
8768     0         20
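The ’Time to first click’ values above can be re-derived mechanically: for each Q entry, take the time of the first C entry before the next Q, minus the query’s time. A base-R sketch on the toy log (illustrative only, not contest code):

```r
# Raw log rows in order; QueryID is NA on click rows.
log2 <- data.frame(
  Time    = c(0, 6, 120, 125, 145),
  Action  = c("Q", "C", "Q", "Q", "C"),
  QueryID = c(5, NA, 558, 8768, NA)
)
qRows <- which(log2$Action == "Q")
firstClickTime <- sapply(seq_along(qRows), function(k) {
  i <- qRows[k]
  nextQ <- if (k < length(qRows)) qRows[k + 1] else nrow(log2) + 1
  cRows <- which(log2$Action == "C")
  cRows <- cRows[cRows > i & cRows < nextQ]   # clicks belonging to this query
  if (length(cRows) == 0) NA_real_ else log2$Time[cRows[1]] - log2$Time[i]
})
# queries with no click (here 558) contribute no row to the intermediate table
data.frame(QueryID = log2$QueryID[qRows], FirstClickTime = firstClickTime)
```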

23 / 52

Aggregation example (by triple)
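The aggregation table shown on this slide is not preserved in the transcript. A base-R sketch of the aggregation phase, grouping emitted values by the (QueryID, URLID, RegionID) triple (toy values, assumed column names):

```r
# Values emitted by the normalization phase, possibly repeated per triple.
emitted <- data.frame(
  QueryID = c(1974, 1974, 174, 174, 174),
  URLID   = c(17562, 1627, 1625, 1625, 2510),
  RegionID = 0,
  ClickPosition = c(1, 2, 1, 3, 2)
)
# average the emitted values within each triple
avgPos <- aggregate(ClickPosition ~ QueryID + URLID + RegionID,
                    data = emitted, FUN = mean)
avgPos  # e.g. URL 1625 for query 174 averages positions 1 and 3
```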

24 / 52

Aggregation example (by QU-pair)

25 / 52


Our final ML-based solution in a nutshell

• Binary classification task for predicting assessors’ labels
• 26 features extracted from the click log
• Gradient Boosted Trees learning model (gbm R package)
• Tuning the model’s parameters w.r.t. AUC averaged over the given query-region pairs
• Ranking URLs according to the best model’s probability scores

27 / 52

Training data

28 / 52

Training data: target values

29 / 52

Training data: feature values

30 / 52

Training data: missing values

31 / 52

Data Analysis Scheme

1 Given initial training and test sets
2 Partition the initial training set into two sets:
  • training set (3/4)
  • test set (1/4)
3 Consider the following models:
  • Gradient Boosted Trees (Bernoulli distribution; 0–1 loss function)
  • Gradient Boosted Trees (AdaBoost distribution; exponential loss function)
  • Logistic Regression
4 Learn and tune parameters w.r.t. the target metric (Area under the ROC curve) on the training set using 3-fold cross-validation
5 Obtain estimates of the target metric on the test set
6 Choose the optimal model, refit it on the whole initial training set and apply it to the initial test set

32 / 52


Boosting [Schapire, 1990]

• Given a training set (x1, y1), . . . , (xN, yN), yi ∈ {−1,+1}
• For t = 1, . . . , T:
  • construct a distribution Dt on {1, . . . , N}
  • sample examples from it, concentrating on the “hardest” ones
  • learn a “weak classifier” (at least better than random)

      ht : X → {−1,+1}

    with error εt on Dt:

      εt = P_{i∼Dt}(ht(xi) ≠ yi)

• Output the final classifier H as a weighted majority vote of the ht

33 / 52

AdaBoost [Freund & Schapire, 1997]

• Constructing Dt:
  • D1(i) = 1/N
  • given Dt and ht:

      Dt+1(i) = (Dt(i) / Zt) × e^{−αt}  if yi = ht(xi)
      Dt+1(i) = (Dt(i) / Zt) × e^{+αt}  if yi ≠ ht(xi)

    where Zt is a normalization factor and

      αt = (1/2) ln((1 − εt) / εt) > 0

• Final classifier:

      H(x) = sign( Σt αt ht(x) )

34 / 52
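One round of the AdaBoost weight update above can be traced on toy labels with a fixed weak classifier; this is purely illustrative, not contest code:

```r
y <- c(1, 1, -1, -1, 1)         # true labels
h <- c(1, -1, -1, -1, 1)        # weak classifier's predictions (one mistake)
D <- rep(1 / 5, 5)              # D_1(i) = 1/N
eps <- sum(D[h != y])           # weighted error eps_t = 0.2
alpha <- 0.5 * log((1 - eps) / eps)
# up-weight the misclassified example, down-weight the rest
Dnext <- D * exp(ifelse(y == h, -alpha, alpha))
Dnext <- Dnext / sum(Dnext)     # divide by Z_t so the weights sum to 1
Dnext                           # the misclassified example now carries weight 0.5
```

The next round thus concentrates half of the total weight on the single example the weak classifier got wrong.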

Gradient boosted trees[Friedman, 2001]

• Stochastic gradient descent optimization of the loss function
• Decision-tree model as the weak classifier
• Does not require feature normalization
• No need to handle missing values specially
• Reported good performance in relevance prediction problems [Piwowarski et al., 2009], [Hassan et al., 2010] and [Gulin et al., 2011]

35 / 52

Gradient boosted trees: gbm R package implementation

• Two distributions are available for classification tasks: Bernoulli and AdaBoost
• Three basic parameters: interaction depth (depth of each tree), number of trees (iterations) and shrinkage (learning rate)

36 / 52

Logistic regression: glm, stats R package

• Preprocess the initial training data by imputing missing values with the help of bagged trees
• Fit the generalized linear model:

      f(x) = 1 / (1 + e^{−z}),  where z = β0 + β1 x1 + · · · + βk xk
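Fitting this model with glm (the stats package ships with base R) looks as follows; the two-feature toy data is an assumption for illustration, not the contest features:

```r
d <- data.frame(
  x1 = c(0.1, 0.6, 0.35, 0.8, 0.3, 0.7),
  x2 = c(1, 0, 1, 0, 1, 1),
  y  = c(0, 0, 0, 1, 1, 1)
)
# family = binomial gives the logistic link, i.e. f(x) = 1 / (1 + exp(-z))
fit <- glm(y ~ x1 + x2, data = d, family = binomial)
coef(fit)                             # beta_0, beta_1, beta_2
p <- predict(fit, type = "response")  # fitted probabilities in (0, 1)
```

These probabilities are what the solution uses as certainty scores when ranking URLs.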

37 / 52

Tuning the gbm bernoulli model

3-fold CV estimate of AUC for the optimal parameters: 0.6457435

38 / 52

Tuning the gbm adaboost model

3-fold CV estimate of AUC for the optimal parameters: 0.6455384

39 / 52

Comparative performance of the three optimal models (test error estimates)

Model                Optimal parameter values                            Test estimate of AUC
gbm bernoulli        interaction.depth=2, n.trees=500, shrinkage=0.01    0.6324717
gbm adaboost         interaction.depth=4, n.trees=700, shrinkage=0.01    0.6313393
logistic regression  -                                                   0.618648

40 / 52

Variable importance according to the best model

41 / 52


Contest statistics

• 101 participants, 84 of them eligible for a prize
• Two-stage evaluation procedure: a validation set and a test set (their sizes were unknown during the contest)
• Validation set size ≈ 11 000 instances
• Test set size ≈ 20 000 instances

43 / 52

Preliminary Results: validation set

19th place (AUC=0.650004)

44 / 52

Final Results: test set

34th place (AUC=0.643346)

#    Team         AUC
1    cointegral*  0.667362
2    Evlampiy*    0.66506
3    alsafr*      0.664527
4    alexeigor*   0.663169
5    keinorhasen  0.660982
6    mmp          0.659914
7    Cutter*      0.659452
8    S-n-D        0.658103
...  ...          ...
34   CLL          0.643346
...  ...          ...

45 / 52

Acknowledgements

We would like to thank:

• the organizers from Yandex for an exciting challenge

• E.L. Stolov, V.Y. Mikhailov, V.D. Solovyev and other colleagues from Kazan Federal University for fruitful discussions and support

46 / 52


References I

[Freund & Schapire, 1997] Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting // Journal of Computer and System Sciences. – V. 55. – No. 1. – 1997. – P. 119–139.

[Friedman, 2001] Friedman, J. Greedy Function Approximation: A Gradient Boosting Machine // Annals of Statistics. – V. 29. – No. 5. – 2001. – P. 1189–1232.

[Gulin et al., 2011] Gulin, A., Kuralenok, I., Pavlov, D. Winning The Transfer Learning Track of Yahoo!’s Learning To Rank Challenge with YetiRank // JMLR: Workshop and Conference Proceedings. – 2011. – P. 63–76.

[Hassan et al., 2010] Hassan, A., Jones, R., Klinkner, K.L. Beyond DCG: User behavior as a predictor of a successful search // Proceedings of the Third ACM International Conference on Web Search and Data Mining. – ACM. – 2010. – P. 221–230.

48 / 52

References II

[Piwowarski et al., 2009] Piwowarski, B., Dupret, G., Jones, R. Mining User Web Search Activity with Layered Bayesian Networks or How to Capture a Click in its Context // Proceedings of the Second ACM International Conference on Web Search and Data Mining. – ACM. – 2009. – P. 162–171.

[Schapire, 1990] Schapire, R. The strength of weak learnability // Machine Learning. – V. 5. – No. 2. – 1990. – P. 197–227.

49 / 52


Compute AUC for a gbm model

ComputeAUC <- function(fit, ntrees, testSet) {
  require(ROCR)
  require(foreach)
  require(gbm)

  pureTestSet <- subset(testSet,
                        select = -c(QueryID, RegionID, URLID, RelevanceLabel))
  queryRegions <- unique(subset(testSet, select = c(QueryID, RegionID)))
  count <- nrow(queryRegions)

  aucValues <- foreach(i = 1:count, .combine = "c") %do% {
    queryId  <- queryRegions[i, "QueryID"]
    regionId <- queryRegions[i, "RegionID"]
    inGroup  <- testSet$QueryID == queryId & testSet$RegionID == regionId
    true.labels <- testSet[inGroup, ]$RelevanceLabel
    m <- mean(true.labels)
    if (m == 0 | m == 1) {
      # AUC is undefined when the group contains only one class
      curAUC <- NA
    } else {
      gbm.predictions <- predict.gbm(fit, pureTestSet[inGroup, ],
                                     n.trees = ntrees, type = "response")
      pred <- prediction(gbm.predictions, true.labels)
      perf <- performance(pred, "auc")
      curAUC <- perf@y.values[[1]][1]
    }
    curAUC
  }
  return(mean(aucValues, na.rm = TRUE))
}

51 / 52

Tuning AUC for gbm model

TuningGbmFit <- function(trainSet, foldsNum = 3, interactionDepth = 4,
                         minNumTrees = 100, maxNumTrees = 1500, step = 100,
                         shrinkage = .01, distribution = "bernoulli",
                         aucfunction = ComputeAUC) {
  require(gbm)
  require(foreach)
  require(caret)
  require(sqldf)

  FUN <- match.fun(aucfunction)
  ntreesSeq <- seq(from = minNumTrees, to = maxNumTrees, by = step)
  folds <- createFolds(trainSet$QueryID, foldsNum, T, T)

  aucvalues <- foreach(i = 1:length(folds), .combine = "rbind") %do% {
    inTrain <- folds[[i]]
    cvTrainData <- trainSet[inTrain, ]
    cvTestData  <- trainSet[-inTrain, ]
    pureCvTrainData <- subset(cvTrainData,
                              select = -c(QueryID, RegionID, URLID))

    gbmFit <- gbm(formula = formula(pureCvTrainData), data = pureCvTrainData,
                  distribution = distribution,
                  interaction.depth = interactionDepth,
                  n.trees = maxNumTrees, shrinkage = shrinkage)

    # evaluate every candidate tree count on the held-out fold
    foreach(n = ntreesSeq, .combine = "rbind") %do% {
      auc <- FUN(gbmFit, n, cvTestData)
      c(n, auc)
    }
  }
  aucvalues <- as.data.frame(aucvalues)
  avgAuc <- sqldf("select V1 as ntrees, avg(V2) as AvgAUC
                   from aucvalues group by V1")
  return(avgAuc)
}

52 / 52