Match Your Research Interest with Stanford AI...
Haochong (Kevin) Shen and Weiqing Li
[email protected] and [email protected]
Match Your Research Interest with Stanford AI Professor
Motivation
This project started from the idea of a content recommendation system. We set a more concrete goal: build a model that recommends professors to prospective students based on their research and study interests.
We investigated multiple machine learning models for the classification task. We trained them on the professors' published papers and predicted the best-matched professor from a student's stated research interests. We then subjectively compared these predictions with the per-professor topic modeling and clustering results from unsupervised learning methods.
Dataset
We collected papers published since 2005 by Stanford computer science professors who focus on artificial intelligence (AI), using an online journal database. This totals 615 papers. We used five-fold cross-validation to split the data into training and validation sets.
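A minimal sketch of a five-fold split over 615 documents, using only the Python standard library. The shuffling, the fixed seed, and the function name `k_fold_indices` are assumptions made for illustration; the poster does not say whether the folds were shuffled or stratified by professor.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle once so each fold mixes papers from all professors (assumption).
    random.Random(seed).shuffle(indices)
    # Spread any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(615, k=5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} val={len(val_idx)}")
```

With 615 papers and five folds, each validation fold holds 123 papers and the remaining 492 are used for training. Every paper is validated on exactly once.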
Below is the distribution of our dataset.
Methodology
Results
We chose both supervised and unsupervised learning models for comparison. For both, we preprocess the text documents and extract features. For the supervised models, we train a classifier with the author as the label and tune the model for the best result. We then feed AI topics as input, and the professor the model outputs is our predicted best match. For the unsupervised models, we run clustering methods and extract the topics of each professor's papers. We then rank the professors by the result for each topic to find the most prolific professor.
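The supervised branch can be sketched as follows, assuming a bag-of-words representation and a multinomial Naive Bayes (MNB) classifier with Laplace smoothing written from scratch. This is an illustration, not our actual implementation: the tokenizer, the smoothing value `alpha`, the placeholder labels "Professor A/B/C", and the toy documents are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep alphabetic tokens; a stand-in for real preprocessing."""
    return re.findall(r"[a-z]+", text.lower())


class MultinomialNB:
    """Multinomial Naive Bayes over word counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)        # documents per author
        self.word_counts = defaultdict(Counter)  # word counts per author
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return log P(author) + sum of log P(word | author) for each author."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        n_docs = sum(self.class_docs.values())
        scores = {}
        for label, n in self.class_docs.items():
            counts = self.word_counts[label]
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / total)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training set: hypothetical paper texts with placeholder author labels.
docs = [
    "neural networks for image classification and object detection",
    "scene understanding from images with convolutional networks",
    "dependency parsing and word embeddings for natural language",
    "machine translation with neural language models",
    "robot grasping and manipulation with tactile sensing",
    "motion planning for robot arms",
]
labels = ["Professor A", "Professor A", "Professor B",
          "Professor B", "Professor C", "Professor C"]

model = MultinomialNB().fit(docs, labels)

# A student's research interest is the query; authors are ranked by score.
query = "natural language parsing"
scores = model.log_scores(query)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Ranking by score rather than returning only the top label lets the system suggest several candidate professors, which suits the recommendation use case.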
Discussion and Future Work
Supervised Learning Unsupervised Learning
Professor         Course        Published Papers
Andrew Ng         CS229         126
Chris Manning     CS224N        109
Dan Boneh         CS229         90
Doug James        CS205A        34
Fei-fei Li        CS231A        27
Jeannette Bohg    CS223A        24
Mike Genesereth   CS157         23
Percy Liang       CS221         36
Silvio Savarese   CS231A        100
Stefano Ermon     CS221, CS228  46
Total                           615
Our test results lead to some interesting observations. First, each professor has a distinct area of focus, and the models capture it. The relatively high test error can be attributed to two factors. First, we used only the first author as the label, even though most papers are collaborations. Second, professors with similar areas of focus are hard for the model to differentiate, and we should not expect it to separate them.
We have a relatively small dataset to work with. More data covering more areas of study would give us a better idea of how our model performs. Also, our current optimization target does not exactly align with the needs of this application and needs improvement. Finally, we would like to try neural network models such as recurrent neural networks (RNNs).
                                      Logistic Regression  MNB                  SVM
Professor Name   No. of Testing Docs  Precision  F1-score  Precision  F1-score  Precision  F1-score
Andrew Ng        11                   0.73       0.73      0.7        0.67      0.8        0.76
Chris Manning    9                    0.69       0.82      0.69       0.82      0.64       0.78
Dan Boneh        8                    1          1         1          1         1          1
Doug James       3                    1          1         1          1         1          1
Fei-fei Li       3                    0          0         1          0.5       0          0
Jeannette Bohg   2                    1          1         1          0.67      1          1
Mike Genesereth  2                    0.5        0.5       0.67       0.8       0.5        0.5
Percy Liang      3                    1          0.8       1          0.8       1          0.8
Silvio Savarese  8                    0.88       0.88      0.88       0.88      0.88       0.88
Stefano Ermon    4                    1          1         1          1         1          1
Average/Total    53                   0.80       0.81      0.85       0.82      0.80       0.81
Metric           Logistic Regression  MNB   SVM
Accuracy         0.83                 0.83  0.83
Micro Precision  0.83                 0.83  0.83
Macro Precision  0.78                 0.89  0.78
Micro Recall     0.83                 0.83  0.83
Macro Recall     0.78                 0.80  0.78
Micro F1         0.83                 0.83  0.83
Macro F1         0.77                 0.81  0.77
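To make the micro and macro rows easier to read, here is a small sketch of how the two averages are computed for a single-label, multi-class problem. The labels and predictions are toy values, not our test set, and treating an undefined precision or recall as 0 is an assumed convention.

```python
from collections import Counter


def averaged_metrics(y_true, y_pred):
    """Return accuracy plus micro- and macro-averaged (precision, recall, F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def safe_div(a, b):
        # Assumed convention: an undefined ratio counts as 0.
        return a / b if b else 0.0

    def f1(p, r):
        return safe_div(2 * p * r, p + r)

    # Macro: compute each metric per class, then take the unweighted mean.
    prec = [safe_div(tp[c], tp[c] + fp[c]) for c in classes]
    rec = [safe_div(tp[c], tp[c] + fn[c]) for c in classes]
    f1s = [f1(p, r) for p, r in zip(prec, rec)]
    n = len(classes)
    macro = (sum(prec) / n, sum(rec) / n, sum(f1s) / n)

    # Micro: pool the counts over all classes, then compute each metric once.
    all_tp, all_fp, all_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = safe_div(all_tp, all_tp + all_fp)
    micro_r = safe_div(all_tp, all_tp + all_fn)
    micro = (micro_p, micro_r, f1(micro_p, micro_r))

    return {"accuracy": safe_div(all_tp, len(y_true)), "micro": micro, "macro": macro}


# Toy example with one small class ("C") that is never predicted correctly.
y_true = ["A", "A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "A"]
result = averaged_metrics(y_true, y_pred)
print(result)
```

When every document has exactly one label, each error adds one false positive and one false negative, so micro precision, micro recall, and accuracy coincide, which is consistent with the identical 0.83 values in the table. Macro averages weight every professor equally, so a small class with a score of 0 (Fei-fei Li under Logistic Regression and SVM) pulls them below the micro values.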