R Package Recommendation

7/28/2019 R Package Recommendation


Yingying Xu

Abstract: Newcomers to the R programming language often face the problem of choosing packages that are relevant and of high quality. An automated R package recommendation system would be useful for showing which packages typical users have installed, along with the measurable qualities of those packages. When building a recommendation system, we are interested in the probability that a user installs a package, so this paper explores various features that might influence the user's decision and experiments with several statistical learning methods.

1. Introduction

Each programming language has a large number of libraries/packages that extend the functionality of the core language. Fluent use of a number of libraries is as important for every programmer as mastery of the basic syntax.

Newcomers to a programming language often face the problem of choosing libraries that are relevant and of high quality. Inspecting each library manually by reading its functionality description can be a daunting task, and such descriptions provide little information on the quality of the library. Thus, an automated package recommendation system would be useful for telling the programmer which packages other people have installed, and which measurable properties produce this result.

R is a language and software environment for statistical computing and graphics. In this paper, we try several methods for building an R package recommendation engine. CRAN, a popular website for R, is a network of FTP and web servers around the world that stores code and documentation for R. On CRAN, each R package summary contains a short description of what the package does, the authors and maintainer of the package, its imports and dependencies, and the other packages it suggests. The rich metadata of each R package enables us to explore many potential relationships and make fairly accurate predictions of whether a user will install a package.

1. Data Description

1.1 Network Graph

The R package network contains several kinds of entities: package, topic/task view, author, maintainer, and user. A topic/task view enables users to browse packages by topic. There are currently 29 topics/task views available on CRAN. The links between packages are imports, depends, and suggests. A package is written by one or more authors, maintained by one maintainer (usually one of the authors), included in one or several topics/task views, and installed by users. A maintainer can maintain one or several packages at the same time. We also know whether a package is a recommended package on CRAN, and whether it is a core package.
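The link structure described above can be turned into per-package counts of incoming links. The following is a minimal sketch in Python, assuming the crawled links are available as (source, link type, target) triples; the package names and the `count_in_links` helper are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict

# Hypothetical edges: (source package, link type, target package).
# ("pkgB", "depends", "pkgA") means that pkgB depends on pkgA.
EDGES = [
    ("pkgB", "depends", "pkgA"),
    ("pkgC", "depends", "pkgA"),
    ("pkgC", "imports", "pkgB"),
    ("pkgD", "suggests", "pkgA"),
]


def count_in_links(edges):
    """Return {package: Counter(link type -> number of incoming links)}."""
    counts = defaultdict(Counter)
    for _source, link_type, target in edges:
        counts[target][link_type] += 1
    return counts


in_links = count_in_links(EDGES)
print(in_links["pkgA"]["depends"])   # number of packages that depend on pkgA
print(in_links["pkgA"]["suggests"])  # number of packages that suggest pkgA
```

Keeping one counter per link type makes it easy to add further link types later without changing the function.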

1.2 Training data

The training data are taken from the Dataists R recommendation system contest (https://github.com/johnmyleswhite/r_recommendation_system). The graph data were obtained by crawling the entire CRAN website, and the predictors were derived from the rich metadata. The training data contain 99,640 rows describing installation information for 1865 packages and 52 users of R. The data form a matrix in which each row provides the following information:

Package: The name of the current R package.

User: The numeric ID of the current user who may or may not have installed the current package.

Installed: A dummy variable indicating whether the current package is installed by the current user.

DependencyCount: The number of other R packages that depend upon the current package.

SuggestionCount: The number of other R packages that suggest the current package.

ImportCount: The number of other R packages that import the current package.

ViewsIncluding: The number of task views on CRAN that include the current package.

CorePackage: A dummy variable indicating whether the current package is part of core R.

RecommendedPackage: A dummy variable indicating whether the current package is a recommended R package.

Maintainer: The name and e-mail address of the package's maintainer.

PackagesMaintaining: The number of other R packages that are being maintained by the current package's maintainer.
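The fields listed above map naturally onto a typed record. The sketch below assumes the data are stored as CSV with these column names and that a missing label is written as `NA`; both the file layout and the two sample rows are assumptions made for illustration and are not taken from the contest data.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    package: str
    user: int
    installed: Optional[int]  # dummy variable; None when the label is missing
    dependency_count: int
    suggestion_count: int
    import_count: int
    views_including: int
    core_package: int
    recommended_package: int
    maintainer: str
    packages_maintaining: int


# Two made-up rows; the values are illustrative only.
SAMPLE = (
    "Package,User,Installed,DependencyCount,SuggestionCount,ImportCount,"
    "ViewsIncluding,CorePackage,RecommendedPackage,Maintainer,PackagesMaintaining\n"
    "pkgA,1,1,5,2,3,1,0,0,Jane Doe <jane@example.org>,4\n"
    "pkgB,1,NA,0,0,0,0,0,0,John Roe <john@example.org>,1\n"
)


def parse_rows(text):
    """Parse CSV text into Row records, mapping 'NA' labels to None."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        label = rec["Installed"]
        rows.append(Row(
            package=rec["Package"],
            user=int(rec["User"]),
            installed=None if label == "NA" else int(label),
            dependency_count=int(rec["DependencyCount"]),
            suggestion_count=int(rec["SuggestionCount"]),
            import_count=int(rec["ImportCount"]),
            views_including=int(rec["ViewsIncluding"]),
            core_package=int(rec["CorePackage"]),
            recommended_package=int(rec["RecommendedPackage"]),
            maintainer=rec["Maintainer"],
            packages_maintaining=int(rec["PackagesMaintaining"]),
        ))
    return rows


rows = parse_rows(SAMPLE)
print(rows[0].package, rows[0].dependency_count, rows[1].installed)
```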

Besides the given features, we can also use the open source website crawler to crawl the entire CRAN website and derive more informative links from the network graph, such as the actual names of the packages that depend on, import, or suggest the current package, and the names of the task views in which the current package appears. This is discussed in the following section.

2. Feature Selection

Several intuitions and assumptions underlie the prediction of whether a user will install an R package in this paper. A package is more likely to be installed by a user if



f. Let !!!! = !| ! |3. The predicted value for an input x is

!

!

!!! !

3.4 Naive Bayes

Naive Bayes assumes that, conditional on the class label, the features are independent of each other, i.e.,

$$P(y \mid x) = \frac{1}{Z} P(y) \prod_{i=1}^{p} P(x_i \mid y) \quad \text{(independence assumption)}$$

where $y$ is the class label, $x$ is the feature vector, $x_i$ is the $i$-th feature, $p$ is the total number of features, and $Z$ is a normalizing factor ensuring $P(y = 1 \mid x) + P(y = -1 \mid x) = 1$.

$P(y \mid x)$ is calculated for each sample point, and $y$ is assigned to the class that has the higher probability.

Despite the naive assumption of this model, it achieves a good AUC score in the experiment.
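To make the factorised posterior concrete, here is a minimal sketch for binary features, assuming the usual form P(y | x) = (1/Z) P(y) ∏ P(x_i | y). The add-one smoothing, the toy data, and the function names are assumptions of this sketch, not details taken from the experiment.

```python
def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate P(y) and P(x_i = 1 | y) for binary features.

    alpha is an add-alpha smoothing constant (an assumption of this sketch).
    """
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    prior = {c: sum(1 for label in y if label == c) / n for c in classes}
    cond = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        cond[c] = [
            (sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(p)
        ]
    return prior, cond


def posterior(x, prior, cond):
    """Return {class: P(y = class | x)} under the independence assumption."""
    unnormalized = {}
    for c in prior:
        score = prior[c]
        for i, value in enumerate(x):
            p_one = cond[c][i]
            score *= p_one if value == 1 else (1.0 - p_one)
        unnormalized[c] = score
    z = sum(unnormalized.values())  # normalizing factor Z
    return {c: s / z for c, s in unnormalized.items()}


# Toy data: label -1 = installed, 1 = not installed.
X = [[1, 1], [1, 0], [0, 0], [0, 1], [0, 0], [0, 0]]
y = [-1, -1, 1, 1, 1, 1]
prior, cond = fit_naive_bayes(X, y)
post = posterior([1, 1], prior, cond)
prediction = max(post, key=post.get)  # class with the higher probability
print(post, prediction)
```

Dividing by the sum of the unnormalized scores plays the role of the normalizing factor, so the two class probabilities always add up to 1.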

3.5 SVM

SVMs with linear, quadratic, and Gaussian kernels were tried, and the quadratic kernel returned the best AUC.

Linear kernel:

$$k(x, x') = \langle x, x' \rangle$$

Gaussian radial basis function kernel:

$$k(x, x') = \exp(-\sigma \|x - x'\|^2)$$

Rational quadratic kernel:

$$k(x, x') = 1 - \frac{\|x - x'\|^2}{\|x - x'\|^2 + c}$$

The rational quadratic kernel is less computationally intensive than the Gaussian kernel.
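The three kernels can be written as plain functions. This sketch assumes the common forms k(x, x') = ⟨x, x'⟩, exp(-σ‖x - x'‖²), and 1 - ‖x - x'‖² / (‖x - x'‖² + c); the parameter names `sigma` and `c` and their default values are assumptions, not settings reported in the paper.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def squared_distance(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def gaussian_kernel(x, z, sigma=1.0):
    # Requires one exponential per pair of points.
    return math.exp(-sigma * squared_distance(x, z))


def rational_quadratic_kernel(x, z, c=1.0):
    # Uses only additions and one division per pair of points.
    d2 = squared_distance(x, z)
    return 1.0 - d2 / (d2 + c)


a, b = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(a, b))
print(gaussian_kernel(a, b))
print(rational_quadratic_kernel(a, b))
```

The rational quadratic form avoids the exponential, which is the usual reason it is cheaper to evaluate than the Gaussian kernel.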

4. Experiment

4.1 Data preprocessing

The training data contain a large number of rows with missing class labels (we don't know whether the user installed the package). Since there are far more (10 times more) training examples in class 1 (not installed) than in class -1 (installed), the missing class labels are simply replaced by 1.
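The imputation step amounts to filling every missing label with the majority class. A minimal sketch, assuming missing labels are represented as `None` and using the 1 / -1 encoding described above; the helper name is hypothetical.

```python
from collections import Counter


def fill_missing_labels(labels, fill_value=1):
    """Replace missing labels (None) with fill_value (1 = not installed)."""
    return [fill_value if label is None else label for label in labels]


labels = [1, None, -1, None, 1, 1]
known = Counter(label for label in labels if label is not None)
filled = fill_missing_labels(labels)
print(known)   # class balance among the known labels
print(filled)
```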

4.2 Predictors

In logistic regression, the final predictors used in the experiments are: Package, User, DependencyScore, SuggestionScore, DependencyScore, ViewsIncluding, CorePackage, RecommendedPackage, Maintainer, and PackagesMaintaining. The meaning of each predictor is explained in sections 1.2 and 2.

In KNN, the total number of sample points is used as K; the numerical predictors are used when calculating the cosine similarity between two packages.
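With K equal to the number of sample points, an unweighted vote would always return the majority class, so some weighting by similarity is needed. The sketch below shows one plausible reading, a cosine-similarity-weighted average of the labels; the weighting scheme, the toy vectors, and the function names are assumptions of this sketch and may differ from the scheme used in the experiment.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for all-zero vectors (an assumption)
    return dot / (norm_a * norm_b)


def knn_score(query, train_X, train_y):
    """Similarity-weighted average of the labels over ALL training points."""
    weights = [cosine_similarity(query, x) for x in train_X]
    total = sum(abs(w) for w in weights)
    if total == 0.0:
        return 0.0
    return sum(w * label for w, label in zip(weights, train_y)) / total


# Toy numerical predictors; label -1 = installed, 1 = not installed.
train_X = [[5.0, 1.0], [4.0, 2.0], [0.0, 3.0]]
train_y = [-1, -1, 1]
score = knn_score([6.0, 1.0], train_X, train_y)
predicted = -1 if score < 0 else 1
print(score, predicted)
```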

For core packages, we predict the class label to be -1 (indicating that the package will be installed) for all users.

4.3 Results

Five-fold cross-validation AUC is used as the criterion.

The results are summarized in Table 4.3.1:

Model                                            Average AUC in 5-fold cross-validation
Baseline model                                   0.9028
KNN with K = # observations                      0.9472
L1 regularized logistic regression + Adaboost    0.9635
Naive Bayes                                      0.9019
SVM with quadratic kernel                        0.9455

Table 4.3.1
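The evaluation criterion, average AUC over five folds, can be sketched as follows. This is an illustration under stated assumptions: the installed class (-1) is treated as the positive class, higher scores mean "more likely installed", folds are contiguous blocks, and the `fit` / `predict` callables and toy data are hypothetical stand-ins for the models in the table.

```python
def auc(scores, labels, positive=-1):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_auc(X, y, fit, predict, k=5):
    """Average held-out AUC over k folds."""
    fold_aucs = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [l for i, l in enumerate(y) if i not in held_out]
        model = fit(train_X, train_y)
        scores = [predict(model, X[i]) for i in test_idx]
        fold_aucs.append(auc(scores, [y[i] for i in test_idx]))
    return sum(fold_aucs) / len(fold_aucs)


# Toy example: the "model" simply scores each row by its single feature.
X = [[0.9], [0.1], [0.8], [0.3], [0.2], [0.6], [0.7], [0.4], [0.95], [0.05]]
y = [-1, 1, -1, 1, -1, 1, -1, 1, -1, 1]
mean_auc = cross_validated_auc(
    X, y,
    fit=lambda train_X, train_y: None,
    predict=lambda model, x: x[0],
)
print(mean_auc)
```

The pairwise definition is quadratic in the fold size; for data of the size described here a rank-based computation would be preferable, but the pairwise form keeps the definition of AUC visible.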

5. Discussion

Among the models stated above, L1 regularized logistic regression with Adaboost gives the highest AUC (0.9635). In this paper, only in-links, such as being suggested by other packages, are considered. In future studies, out-links, such as importing or suggesting other packages, could also be used. Classification might also be improved by classifying the packages in each topic separately.

REFERENCES

[1] Alexandros Karatzoglou and David Meyer. Support Vector Machines in R. Journal of Statistical Software, Volume 15, Issue 9, April 2006.

[2] Greg Ridgeway. Generalized Boosted Models: A guide to the gbm package. August 3, 2007.

[3] Jerome H. Friedman. Regularized Discriminant Analysis. Journal of the American Statistical Association, Vol. 84, No. 405 (Mar., 1989), pp. 165-175.

[4] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.

[5] http://cran.r-project.org/web/packages/e1071/index.html

[6] http://www.cs.princeton.edu/~schapire/boost.html

[7] http://cran.r-project.org/web/packages/glmnet/index.html

[8] http://cran.r-project.org/

[9] Training data set and the web crawler open source code are downloaded from: https://github.com/johnmyleswhite/r_recommendation_system
