By: Priya Goyal (10535) Ayush Mittal (11183) IIT …...[5] L. Mamykina, B. Manoim, M. Mittal, G....

1st November, 2013 By: Priya Goyal (10535) Ayush Mittal (11183) IIT Kanpur Advisor: Prof. Amitabha Mukerjee IIT Kanpur

1st November, 2013

By:

Priya Goyal (10535) Ayush Mittal (11183)

IIT Kanpur

Advisor: Prof. Amitabha Mukerjee

IIT Kanpur

Stack Overflow follows a standard Q&A format:

Fit: Questions about specific programming problems, software algorithms, coding techniques, etc.

Unfit: Opinion-based questions and questions that tend to generate discussion rather than answers; these are to be 'closed'.

Task: Build a classifier that predicts, from the question as submitted, whether the question will be closed and, if so, the reason why it is closed.

Reasons for closing a question:

• Off-topic

• Not constructive

• Not a real question

• Too Localized

• Exact Duplicate

Off Topic: not about programming, or related to another site on the Stack Exchange network.

• Is there a way to turn off the automatic text translation at the MSDN library pages? I do prefer……..

Too Localized: not helpful in the future; relevant only to a small geographic area, a particular moment in time, etc.

• What do you do on Friday afternoon when you have lost your drive to work?

Not Constructive: not a good fit for the Q&A format. Answers are generally expected to involve facts, references, etc., but the question will likely solicit opinion, debate, arguments, polling, or extended discussion.

• What is the best comment in the source code you have ever encountered?

Not a real question: not clear what is being asked. The question is vague, ambiguous or incomplete and can't be answered in its current form.

Exact Duplicate: excluded here, since it involves the post history contained in the Stack Overflow database dump (6 GB in size).

Currently, questions are closed by experienced users (3000+ reputation points) and community moderators via a systematic voting method.

The Stack Overflow user base is growing exponentially, and more than 6,000 new questions are asked every weekday. Only 6% of them end up being closed.

A recent study by Denzil Correa et al. (2013) shows the trend in community participation in closing questions over 4 years:

• ~27% of questions are closed by moderators only.

• >40% require moderator intervention.

• Declining trend in closing questions by experienced users.

• There are currently only 16 moderators, so their workload is increasing.

Shah et al. propose a classification model, with features based on human-assessed aspects and question-answer meta-information, to predict answer quality on Answers-based CQA [3].

Sakai et al. propose evaluation methods based on graded-relevance IR metrics to find the best answers [4].

[5] Li et al. analyse factors affecting question quality and propose a Mutual Reinforcement-based Label Propagation approach to predict question quality.

However, these approaches focus on answer quality on CQA sites, whereas it has been shown that answer quality depends directly on question quality.

The training dataset has 3,664,927 posts (through Oct. 2012), and each post has metadata associated with it.

The balanced validation train dataset is also available with nearly 2 million posts.

The distribution of closed/open questions is:

• Not a real question: 38,622

• Not constructive: 20,897

• Off-topic: 20,865

• Too localized: 8,910

• Open: 3,575,678

The metadata is:

PostId, PostCreationDate, OwnerId, OwnerCreationDate, Title, BodyMarkDown, Tag1, Tag2, Tag3, Tag4, Tag5, OpenStatus, etc.

For calculating the baseline, separate balanced validation data (with approximately the same number of closed and open questions) and a training dataset (~4 GB) are available. Benchmarks (prior, uniform and basic) are also available for measuring performance.
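As a rough illustration of the simplest benchmarks, the sketch below derives class probabilities from the class counts listed above. It assumes that the "prior" benchmark predicts each class with its frequency in the training data and that the "uniform" benchmark gives every class the same probability; the slides do not define them, so both function names and this reading are assumptions.

```python
# Class counts copied from the distribution listed for the training data.
CLASS_COUNTS = {
    "not a real question": 38_622,
    "not constructive": 20_897,
    "off topic": 20_865,
    "too localized": 8_910,
    "open": 3_575_678,
}


def prior_benchmark(counts):
    """Assumed 'prior' benchmark: predict each class with its training frequency."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def uniform_benchmark(counts):
    """Assumed 'uniform' benchmark: predict every class with equal probability."""
    return {label: 1.0 / len(counts) for label in counts}


priors = prior_benchmark(CLASS_COUNTS)
uniform = uniform_benchmark(CLASS_COUNTS)

for label in CLASS_COUNTS:
    print(f"{label:22s} prior={priors[label]:.4f} uniform={uniform[label]:.4f}")
```

Under this reading, every question receives the same five probabilities, which gives a floor that any learned model should beat.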

Goal: Given the raw test data, we have to predict the probability that a question is in each of the 5 classes.

For this multi-class classification problem with a lot of data and features, we intend to use the Vowpal Wabbit (vw) library, developed at Yahoo! Research.

Tools used: Python, NLTK, scikit-learn and other libraries.

The overall approach is:

• Pre-process dataset

• Build vector-model of feature set

• Run learning, validation and predictions

• Normalize and predict score

Vowpal Wabbit is a library of learning algorithms developed at Yahoo! Research by John Langford.

Its learning algorithms are intrinsically fast and designed for big data (spanning trees are formed across worker nodes), with various optimization algorithms available.

Highly useful when there is a lot of data and a large feature set (hashing approach).

The baseline algorithm is sparse gradient descent (GD) on a loss function (squared, logistic, hinge, quantile).

Learning-rate control, regularization, feature masking, feature interactions (quadratic, cubic, etc.), reinforcement learning, active learning, a neural network reduction, LDA, etc. are also implemented.
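To make the hashing approach and sparse gradient descent concrete, here is a minimal pure-Python sketch of online logistic regression over hashed features. It is not Vowpal Wabbit's implementation: the hash function (`zlib.crc32`), the table size, the learning rate and the toy examples are all assumptions chosen for illustration.

```python
import math
import zlib

NUM_BITS = 18                 # hypothetical table size: 2**18 weight slots
NUM_SLOTS = 1 << NUM_BITS


def hash_feature(name):
    """Map a feature name to a weight index (the hashing trick)."""
    return zlib.crc32(name.encode("utf-8")) % NUM_SLOTS


def predict(weights, features):
    """Logistic prediction; `features` maps feature name -> value."""
    score = sum(weights.get(hash_feature(f), 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def sgd_update(weights, features, label, learning_rate=0.5):
    """One sparse gradient step on logistic loss; label is 0 or 1."""
    error = predict(weights, features) - label   # gradient of the loss w.r.t. the score
    for f, v in features.items():
        idx = hash_feature(f)
        # Only the weights of features present in this example are touched.
        weights[idx] = weights.get(idx, 0.0) - learning_rate * error * v


weights = {}  # sparse weight table: only touched slots are stored
examples = [  # toy data, invented for this sketch (1 = closed, 0 = open)
    ({"word=best": 1.0, "word=ever": 1.0}, 1),
    ({"word=exception": 1.0, "has_code": 1.0}, 0),
]
for _ in range(50):
    for feats, label in examples:
        sgd_update(weights, feats, label)

print(predict(weights, examples[0][0]), predict(weights, examples[1][0]))
```

Because feature names are hashed straight to indices, no vocabulary has to be built or kept in memory, which is what makes this style of learner practical for very large feature sets.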

A feature vector is constructed and converted into vw format.

Three types of features are available: user features, post features, tag features.

An extended feature vector is constructed for each post:

◦ Segment the document into code and non-code sections.

◦ Segment non-code section into sentences and into words (NLTK)

◦ Post and tag features: number of sentences ending with a question mark or an exclamation mark, number of sentences starting with 'I', length of the code block, text block and title, number of lines, number of tags, non-word portion, URLs, digits, number of occurrences of 'thanks', open status, etc. are considered.

◦ User features: time the user has been on Stack Overflow, reputation at the time of post creation, number of good posts by the user, user ID, etc.
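The steps above can be sketched roughly as follows. This is only an illustration: a regular expression stands in for the NLTK sentence tokenizer so the example runs without extra downloads, code is detected with the Markdown convention of lines indented by four spaces, and every function name, feature name and the sample post are hypothetical. The output line follows the basic vw input layout `label |namespace name:value ...`.

```python
import re


def split_code_and_text(body_markdown):
    """Treat Markdown lines indented by 4+ spaces (or a tab) as code."""
    code, text = [], []
    for line in body_markdown.splitlines():
        (code if re.match(r"^( {4,}|\t)", line) else text).append(line)
    return "\n".join(code), "\n".join(text)


def split_sentences(text):
    """Naive regex splitter, a stand-in for NLTK's sentence tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def post_features(title, body_markdown, tags):
    """Compute a few of the post and tag features listed above."""
    code, text = split_code_and_text(body_markdown)
    sentences = split_sentences(text)
    return {
        "ends_question": sum(s.endswith("?") for s in sentences),
        "ends_exclaim": sum(s.endswith("!") for s in sentences),
        "starts_i": sum(s.startswith("I ") for s in sentences),
        "code_len": len(code),
        "text_len": len(text),
        "title_len": len(title),
        "num_lines": len(body_markdown.splitlines()),
        "num_tags": len(tags),
        "num_urls": len(re.findall(r"https?://\S+", text)),
        "num_digits": sum(ch.isdigit() for ch in text),
        "num_thanks": len(re.findall(r"\bthanks\b", text, flags=re.IGNORECASE)),
    }


def to_vw_line(label, features, namespace="post"):
    """Format one example as `label |namespace name:value ...`, skipping zeros."""
    pairs = " ".join(f"{k}:{v}" for k, v in features.items() if v != 0)
    return f"{label} |{namespace} {pairs}"


# Hypothetical post used only to exercise the functions.
body = "I get an error here. Why does this fail?\n\n    print(1/0)\n\nThanks!"
feats = post_features("Division error", body, ["python"])
line = to_vw_line(1, feats)
print(line)
```

User features (account age, reputation at post creation, and so on) could be added in a second namespace in the same way; zero-valued features are dropped because the format is sparse.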

Since we are dealing with large data and a large feature set, building a classifier with machine learning algorithms such as kernel SVMs, random forests, neural nets, etc. would not be trivial.

We will use VW for this purpose. It utilizes the logistic loss function and one-against-all classification.

The raw per-class scores obtained are passed through a sigmoid function and then normalized so that their sum is 1.
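A minimal sketch of this normalization step, assuming the learner returns one raw score per class for each question (the example scores are invented):

```python
import math


def sigmoid(x):
    """Squash a raw score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalize_scores(raw_scores):
    """Apply the sigmoid to each one-against-all score, then rescale to sum to 1."""
    squashed = [sigmoid(s) for s in raw_scores]
    total = sum(squashed)
    return [p / total for p in squashed]


# Hypothetical raw scores for the 5 classes of one question.
probs = normalize_scores([-2.0, -3.5, -1.0, -4.0, 3.0])
print(probs)
```

The sigmoid alone only maps each score into (0, 1); the division by the total is what makes the five values a valid probability distribution.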

The results are evaluated using the multiclass logarithmic loss function.

Multiclass logarithmic loss function: It is the negative log-likelihood of the model that says each test observation is chosen independently from a distribution that places the submitted probability mass on the corresponding class, for each observation:

$$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log(p_{i,j})$$

where $N$ is the number of observations, $M$ is the number of class labels, $\log$ is the natural logarithm, $y_{i,j}$ is 1 if observation $i$ is in class $j$ and 0 otherwise, and $p_{i,j}$ is the predicted probability that observation $i$ is in class $j$.
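The definition above can be computed with a few lines of Python. This is a sketch: true classes are passed as indices rather than as a 0/1 matrix, so only the term with $y_{i,j} = 1$ is evaluated for each observation, and the clipping constant `eps`, which avoids `log(0)`, is an assumption rather than part of the definition.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """y_true: list of class indices; y_prob: list of per-class probability rows."""
    total = 0.0
    for true_class, row in zip(y_true, y_prob):
        # Clip the probability of the true class to keep the logarithm finite.
        p = min(max(row[true_class], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(y_true)


# Predicting a uniform distribution over 5 classes gives log(5).
uniform_rows = [[0.2] * 5 for _ in range(3)]
print(multiclass_log_loss([0, 4, 2], uniform_rows))
```

Confident wrong predictions are penalized heavily, which is why well-calibrated probabilities matter more for this metric than the predicted label alone.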

[1] Why are some questions closed, and what does "closed" mean? http://stackoverflow.com/help/closed-questions

[2] What is a day in life of a stack overflow moderator? http://meta.stackoverflow.com/a/166630/214223 February 2013.

[3] C. Shah and J. Pomerantz. Evaluating and predicting answer quality in community qa. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 411–418. ACM, 2010.

[4] T. Sakai, D. Ishikawa, N. Kando, Y. Seki, K. Kuriyama, and C.-Y. Lin. Using graded-relevance metrics for evaluating community qa answer selection. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 187–196. ACM, 2011.

[5] L. Mamykina, B. Manoim, M. Mittal, G. Hripcsak, and B. Hartmann. Design lessons from the fastest q&a site in the west. In Proceedings of the 2011 annual conference on Human factors in computing systems, pages 2857–2866. ACM, 2011.

[6] http://fastml.com/predicting-closed-questions-on-stack-overflow/

[7] https://github.com/JohnLangford/vowpal_wabbit/wiki

[8] https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow/data

Questions?