Match Your Research Interest with Stanford AI Professor



Haochong (Kevin) Shen and Weiqing Li

    [email protected] and [email protected]


Motivation

The project began as an idea for a content recommendation system. For the project, we set a more concrete goal: building a model that recommends professors to potential students based on their research and study interests.

We investigated multiple machine learning models for the classification task, trained them on the professors' published papers, and predicted the best-matched professor from a student's stated research interests. We then subjectively compared these predictions with the topic-modeling and clustering results obtained for each professor using unsupervised learning methods.

Dataset

We collected papers published since 2005 by Stanford computer science professors focusing on artificial intelligence (AI) from an online journal database, for a total of 615 papers. We used five-fold cross-validation to split the data into training and validation sets, as sketched below.
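As a rough illustration of that split, the five folds could be produced as follows. The `papers` and `authors` lists are placeholders, and the use of scikit-learn's StratifiedKFold is an assumption for illustration, not something stated on the poster.

```python
# Hypothetical sketch of the five-fold split described above.
# `papers` (paper texts) and `authors` (professor labels) are placeholders.
from sklearn.model_selection import StratifiedKFold

papers = [f"placeholder paper text {i}" for i in range(100)]
authors = [f"Professor {i % 10}" for i in range(100)]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(papers, authors)):
    train_docs = [papers[i] for i in train_idx]   # used to fit the model
    val_docs = [papers[i] for i in val_idx]       # held out for evaluation
    print(f"fold {fold}: {len(train_docs)} train / {len(val_docs)} validation")
```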

    Below is the distribution of our dataset.

    Methodology

    Results

We chose both supervised and unsupervised learning models for comparison. For both, we preprocess the text documents and extract features. For the supervised models, we then train a classifier using the (first) author as the label and tune it for the best result; finally, we feed AI topics as inputs, and the predicted professor is our best-match recommendation. For the unsupervised models, we run clustering methods and extract topics from each professor's papers, then rank the professors on each topic to find the most prolific one. A sketch of both pipelines is given below.
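The poster does not spell out the feature extraction or libraries used, so the following is a minimal sketch under assumptions: TF-IDF / bag-of-words features, scikit-learn estimators, and tiny placeholder paper texts. It shows the supervised classifier trained with the author as the label and the unsupervised topic-modeling side.

```python
# Minimal sketch of the two pipelines described above (assumed implementation).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import LatentDirichletAllocation

papers = [
    "convolutional networks for large scale image recognition",
    "neural attention models for question answering",
    "lattice based constructions for public key cryptography",
]                                                   # placeholder paper texts
authors = ["Fei-fei Li", "Chris Manning", "Dan Boneh"]  # first-author labels

# Supervised side: text features + a classifier trained with the author as
# the label. MultinomialNB is shown; swapping in LogisticRegression or
# LinearSVC gives the other two models reported on the poster.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("model", MultinomialNB()),
])
clf.fit(papers, authors)

# A student's research interest is fed in as a query document; the predicted
# class is the recommended professor.
print(clf.predict(["image recognition with deep convolutional networks"]))

# Unsupervised side: topic modeling over the corpus; professors can then be
# ranked by the topic weights of their papers.
counts = CountVectorizer(stop_words="english").fit_transform(papers)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)   # per-paper topic mixture used for ranking
print(doc_topics.round(2))
```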

    Discussion and Future Work

[Poster panels: Supervised Learning; Unsupervised Learning]

Professor          Course          Published Papers
Andrew Ng          CS229           126
Chris Manning      CS224N          109
Dan Boneh          CS229            90
Doug James         CS205A           34
Fei-fei Li         CS231A           27
Jeannette Bohg     CS223A           24
Mike Genesereth    CS157            23
Percy Liang        CS221            36
Silvio Savarese    CS231A          100
Stefano Ermon      CS221, CS228     46
Total                              615

Our test results lead to some interesting observations. First, each professor has a distinct area of focus that the model captures. The relatively high test error can be attributed to a few things. First, we used only the first author as the label, even though papers are usually collaborative. Also, professors with similar areas of focus are hard for the model to differentiate, and we do not expect it to do so.

We also have a relatively small dataset to work with; more data covering more areas of study would give us a better idea of how the model performs. In addition, our current optimization target does not exactly align with the needs of this application and should be improved. Finally, we would like to try neural network models such as recurrent neural networks (RNNs).

                                   Logistic Regression     MNB                     SVM
Professor Name     Testing Docs    Precision  F1-score     Precision  F1-score     Precision  F1-score
Andrew Ng          11              0.73       0.73         0.70       0.67         0.80       0.76
Chris Manning       9              0.69       0.82         0.69       0.82         0.64       0.78
Dan Boneh           8              1.00       1.00         1.00       1.00         1.00       1.00
Doug James          3              1.00       1.00         1.00       1.00         1.00       1.00
Fei-fei Li          3              0.00       0.00         1.00       0.50         0.00       0.00
Jeannette Bohg      2              1.00       1.00         1.00       0.67         1.00       1.00
Mike Genesereth     2              0.50       0.50         0.67       0.80         0.50       0.50
Percy Liang         3              1.00       0.80         1.00       0.80         1.00       0.80
Silvio Savarese     8              0.88       0.88         0.88       0.88         0.88       0.88
Stefano Ermon       4              1.00       1.00         1.00       1.00         1.00       1.00
Average/Total      53              0.80       0.81         0.85       0.82         0.80       0.81

Metric             Logistic Regression   MNB     SVM
Accuracy           0.83                  0.83    0.83
Micro Precision    0.83                  0.83    0.83
Macro Precision    0.78                  0.89    0.78
Micro Recall       0.83                  0.83    0.83
Macro Recall       0.78                  0.80    0.78
Micro F1           0.83                  0.83    0.83
Macro F1           0.77                  0.81    0.77
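The micro- and macro-averaged scores above could be computed as sketched below; `y_true` and `y_pred` are illustrative placeholders, and the use of scikit-learn metrics is an assumption rather than something stated on the poster. Macro averaging weights every professor equally, which is why professors with few test documents (e.g., Fei-fei Li) pull the macro scores below the micro ones.

```python
# Sketch of how the micro/macro scores in the table above could be computed.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["Andrew Ng", "Chris Manning", "Dan Boneh", "Andrew Ng"]  # placeholders
y_pred = ["Andrew Ng", "Andrew Ng", "Dan Boneh", "Andrew Ng"]

print("accuracy:", accuracy_score(y_true, y_pred))
# Micro-averaging pools all decisions; macro-averaging averages the per-class
# scores, so classes with few documents weigh as much as the frequent ones.
for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(avg, round(p, 2), round(r, 2), round(f1, 2))
```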
