R Package Recommendation


  • 7/28/2019 R Package Recommendation


    R Package Recommendation

    Yingying Xu

    Abstract: Newcomers to the R programming language often face the
    problem of choosing packages that are relevant and of high quality.
    An automated R package recommendation system would be useful in
    showing which packages typical users have installed, along with the
    measurable qualities of those packages. When building a
    recommendation system, we are interested in the probability of a
    user installing a package. In this paper, we therefore explore
    various features that might influence the user's decision and
    experiment with several statistical learning methods.

    1. Introduction

    Each programming language has a large number of libraries/packages
    that extend the functionality of the core language. Fluency with a
    number of libraries is as important for every programmer as mastery
    of the basic syntax. Newcomers to a programming language often face
    the problem of choosing libraries that are relevant and of high
    quality. Inspecting each library manually by reading its
    functionality description can be a daunting task, and such
    descriptions provide little information on the quality of the
    library. Thus, an automated package recommendation system would be
    useful in telling the programmer what packages other people have
    installed, and which measurable properties drive that result.

    R is a language and software environment for statistical computing
    and graphics. In this paper, we try several methods for building an
    R package recommendation engine. CRAN, a popular website for R, is a
    network of FTP and web servers around the world that stores code and
    documentation for R. On CRAN, each R package summary contains a
    short description of what the package does, the authors and
    maintainer of the package, its imports and dependencies, and the
    other packages it suggests. The rich metadata of each R package
    enables us to explore many potential relationships and make fairly
    accurate predictions of whether a user will install a package.

    1. Data Description

    1.1 Network Graph

    The R package network contains several kinds of entities: package,
    topic/task view, author, maintainer, and user. Topic/task views
    enable users to browse packages by topic. There are currently 29
    topics/task views available on CRAN. The links between packages are
    imports, depends, and suggests. A package is written by one or
    several authors, maintained by one maintainer (usually one of the
    authors), included in one or several topics/task views, and
    installed by users. A maintainer can maintain one or several
    packages at the same time. We also know whether a package is a
    recommended package on CRAN and whether it is a core package.

    1.2 Training data

    The training data are taken from the Dataists R recommendation
    system contest
    (https://github.com/johnmyleswhite/r_recommendation_system). The
    graph data were obtained by crawling the entire CRAN website, and
    the predictors were derived from the rich metadata. The training
    data contain 99,640 rows describing installation information for
    1865 packages and 52 users of R. The data form a matrix in which
    each row provides the following information:

    Package: The name of the current R package.

    User: The numeric ID of the current user who may or may not have
    installed the current package.

    Installed: A dummy variable indicating whether the current package
    is installed by the current user.

    DependencyCount: The number of other R packages that depend upon
    the current package.

    SuggestionCount: The number of other R packages that suggest the
    current package.

    ImportCount: The number of other R packages that import the current
    package.

    ViewsIncluding: The number of task views on CRAN that include the
    current package.

    CorePackage: A dummy variable indicating whether the current
    package is part of core R.

    RecommendedPackage: A dummy variable indicating whether the current
    package is a recommended R package.

    Maintainer: The name and e-mail address of the package's
    maintainer.

    PackagesMaintaining: The number of other R packages that are being
    maintained by the current package's maintainer.
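
    The three in-link counts above can be derived directly from a crawled
    edge list. The following is a minimal Python sketch, not the contest's
    actual crawler: the edge format, the package names (pkgA to pkgD), and
    the helper names are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical edge list: (source, relation, target) means that
# `source` depends on / suggests / imports `target`.
EDGES = [
    ("pkgA", "depends", "pkgC"),
    ("pkgB", "depends", "pkgC"),
    ("pkgA", "suggests", "pkgD"),
    ("pkgB", "imports", "pkgD"),
    ("pkgD", "imports", "pkgC"),
]


def in_link_counts(edges):
    """Count, for each target package, how many other packages link to it."""
    counts = {"depends": Counter(), "suggests": Counter(), "imports": Counter()}
    for source, relation, target in set(edges):  # set() drops duplicate edges
        if source != target:  # only *other* packages count
            counts[relation][target] += 1
    return counts


def package_features(package, counts):
    """Return the in-link predictors for one package (0 if never linked)."""
    return {
        "DependencyCount": counts["depends"][package],
        "SuggestionCount": counts["suggests"][package],
        "ImportCount": counts["imports"][package],
    }


counts = in_link_counts(EDGES)
print(package_features("pkgC", counts))
print(package_features("pkgD", counts))
```

    Counting only in-links mirrors the definitions above, where each count
    refers to other packages that point at the current package.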

    Besides the given features, we can also use the open-source web
    crawler to crawl the entire CRAN website and derive more informative
    links from the network graph, such as the names of the packages that
    depend on, import, or suggest the current package, and the names of
    the task views in which the current package appears. This is
    discussed in the following section.

    2. Feature Selection

    Several intuitions and assumptions underlie the prediction of
    whether a user will install an R package. A package is more likely
    to be installed by a user if


    f. Let [expression not recoverable from the source].

    3. The predicted value for an input x is [expression not recoverable
    from the source].

    3.4 Naive Bayes

    Naive Bayes assumes that, conditional on the class label, the
    features are independent of each other, i.e.,

    $$P(y \mid x) = \frac{1}{Z}\, P(y)\, P(x \mid y) = \frac{1}{Z}\, P(y) \prod_{i=1}^{p} P(x_i \mid y),$$

    where the second equality uses the independence assumption, $y$ is
    the class label, $x$ is the feature vector, $x_i$ is the $i$-th
    feature, $p$ is the total number of features, and $Z$ is a
    normalizing factor ensuring $P(y = 1 \mid x) + P(y = -1 \mid x) = 1$.
    $P(y \mid x)$ is calculated for each sample point, and $y$ is
    assigned to the class with the higher probability.

    Despite the naive assumption of this model, it achieves a good AUC
    score (0.9019) in the experiment.
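
    The decision rule above can be sketched in a few lines of Python. This
    is only an illustration under stated assumptions, not the
    implementation used in the experiment: the features are taken to be
    binary (for example CorePackage and RecommendedPackage), the
    conditional probabilities are estimated with Laplace smoothing, and
    the toy data and function names are hypothetical. Labels follow the
    paper's convention of -1 for installed and 1 for not installed.

```python
def fit_naive_bayes(X, y, smoothing=1.0):
    """Estimate P(y) and P(x_i = 1 | y) for binary features."""
    classes = sorted(set(y))
    n_features = len(X[0])
    prior = {c: sum(1 for label in y if label == c) / len(y) for c in classes}
    cond = {}  # cond[c][i] approximates P(x_i = 1 | y = c)
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        cond[c] = [
            (sum(r[i] for r in rows) + smoothing) / (len(rows) + 2 * smoothing)
            for i in range(n_features)
        ]
    return prior, cond


def posterior(x, prior, cond):
    """Return P(y | x) per class; dividing by Z makes the values sum to 1."""
    unnormalized = {}
    for c in prior:
        p = prior[c]
        for i, value in enumerate(x):
            p *= cond[c][i] if value == 1 else 1.0 - cond[c][i]
        unnormalized[c] = p
    z = sum(unnormalized.values())
    return {c: p / z for c, p in unnormalized.items()}


def predict(x, prior, cond):
    """Assign the class with the higher posterior probability."""
    post = posterior(x, prior, cond)
    return max(post, key=post.get)


# Hypothetical toy data: columns are [CorePackage, RecommendedPackage].
X = [[1, 1], [1, 0], [0, 1], [0, 0], [0, 0], [0, 0]]
y = [-1, -1, -1, 1, 1, 1]
prior, cond = fit_naive_bayes(X, y)
print(posterior([1, 1], prior, cond), predict([1, 1], prior, cond))
```

    The count-valued predictors from section 1.2 would need a different
    likelihood (for example a binned or Gaussian one); the binary case is
    used here only to keep the sketch short.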

    3.5 SVM

    SVMs with linear, quadratic, and Gaussian kernels were tried; the
    quadratic kernel returns the best AUC.

    Linear kernel:

    $$k(x, x') = \langle x, x' \rangle$$

    Gaussian radial basis function kernel:

    $$k(x, x') = \exp\left(-\sigma \lVert x - x' \rVert^2\right)$$

    Rational quadratic kernel:

    $$k(x, x') = 1 - \frac{\lVert x - x' \rVert^2}{\lVert x - x' \rVert^2 + c}$$

    The rational quadratic kernel is less computationally intensive
    than the Gaussian kernel.
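
    As a rough illustration of the three kernels, the Python sketch below
    evaluates each one on plain lists of numbers. It assumes the Gaussian
    kernel takes the form exp(-sigma * ||x - x'||^2) and the rational
    quadratic kernel the form 1 - ||x - x'||^2 / (||x - x'||^2 + c); the
    parameter names sigma and c and their default values are assumptions,
    not settings reported in this paper.

```python
import math


def squared_distance(x, z):
    """Squared Euclidean distance ||x - z||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def linear_kernel(x, z):
    """Inner product <x, z>."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, sigma=1.0):
    """exp(-sigma * ||x - z||^2); sigma is an assumed width parameter."""
    return math.exp(-sigma * squared_distance(x, z))


def rational_quadratic_kernel(x, z, c=1.0):
    """1 - ||x - z||^2 / (||x - z||^2 + c); c > 0 is an assumed constant."""
    d2 = squared_distance(x, z)
    return 1.0 - d2 / (d2 + c)


a, b = [0.0, 0.0], [1.0, 0.0]
print(linear_kernel([1, 2], [3, 4]))
print(gaussian_kernel(a, b))
print(rational_quadratic_kernel(a, b))
```

    The rational quadratic kernel needs only a division per pair of
    points, whereas the Gaussian kernel needs an exponential, which is
    consistent with the remark on computational cost above.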

    4. Experiment

    4.1 Data preprocessing

    The training data contain a large number of rows with missing class
    labels (we do not know whether the user installed the package).
    Since there are far more (about 10 times more) training examples in
    class 1 (not installed) than in class -1 (installed), the missing
    class labels are simply replaced by 1.
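
    A minimal Python sketch of this imputation step is shown below. It
    assumes each row is a dictionary whose Installed field is -1, 1, or
    None when the label is missing; that row format and the function name
    are hypothetical.

```python
def fill_missing_labels(rows, default_label=1):
    """Return copies of the rows with missing labels set to the majority class."""
    filled = []
    for row in rows:
        label = row["Installed"]
        filled.append(
            {**row, "Installed": default_label if label is None else label}
        )
    return filled


# Hypothetical rows: -1 = installed, 1 = not installed, None = unknown.
rows = [
    {"Package": "pkgA", "User": 1, "Installed": -1},
    {"Package": "pkgB", "User": 1, "Installed": None},
    {"Package": "pkgC", "User": 2, "Installed": 1},
]
filled = fill_missing_labels(rows)
print([row["Installed"] for row in filled])
```

    Returning new dictionaries leaves the raw rows untouched, so the
    number of imputed labels can still be checked afterwards.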

    4.2 Predictors

    In logistic regression, the final predictors used in the
    experiments are: Package, User, DependencyScore, SuggestionScore,
    DependencyScore, ViewsIncluding, CorePackage, RecommendedPackage,
    Maintainer, and PackagesMaintaining. The meaning of each predictor
    is explained in sections 1.2 and 2.

    In KNN, the total number of sample points is used as K; the cosine
    similarity between two packages is calculated from the numerical
    predictors.

    For core packages, we predict the class label to be -1 (indicating
    that the package will be installed) for all users.
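
    One way to read the KNN setup above is as a similarity-weighted vote
    over every labeled package. The Python sketch below follows that
    reading; the weighting scheme, the decision threshold at zero, and all
    names and numbers are assumptions made for illustration, since the
    text specifies only that K equals the number of sample points, that
    cosine similarity is computed on the numerical predictors, and that
    core packages are always predicted as -1.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_score(target, labeled):
    """Similarity-weighted mean label over all labeled packages (K = all)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for features, label in labeled:
        weight = cosine_similarity(target, features)
        weighted_sum += weight * label
        total_weight += abs(weight)
    return weighted_sum / total_weight if total_weight > 0 else 0.0


def predict_label(target, labeled, is_core=False):
    """Core packages are always predicted as installed (-1)."""
    if is_core:
        return -1
    return -1 if knn_score(target, labeled) < 0 else 1


# Hypothetical numerical predictors: [DependencyCount, SuggestionCount, ImportCount].
labeled = [([10, 5, 3], -1), ([0, 1, 0], 1)]
print(predict_label([8, 4, 2], labeled))
print(predict_label([0, 1, 0], labeled))
```

    With K equal to the full sample, the neighbors differ only in their
    weights, so the similarity weighting carries all of the information.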

    4.3 Results

    Five-fold cross-validation AUC is used as the criterion. The
    results are summarized in Table 4.3.1:

    Model                                            Average AUC in 5-fold cross-validation
    Baseline model                                   0.9028
    KNN with K = # observations                      0.9472
    L1 regularized logistic regression + Adaboost    0.9635
    Naive Bayes                                      0.9019
    SVM with quadratic kernel                        0.9455

    Table 4.3.1
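
    The evaluation criterion can be sketched as follows. This Python
    example is a hedged illustration, not the evaluation code used to
    produce the table: it computes AUC as the fraction of
    (installed, not installed) pairs that are ranked correctly, treats -1
    (installed) as the positive class, and splits the data into five folds
    by index. The fit and score callables are hypothetical placeholders
    for any of the models above.

```python
def auc(scores, labels, positive_label=-1):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == positive_label]
    neg = [s for s, l in zip(scores, labels) if l != positive_label]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:  # O(n^2) pairwise comparison, fine for a sketch
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5):
    """Assign index i to fold i mod k."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validated_auc(X, y, fit, score, k=5):
    """Average held-out AUC over k folds."""
    fold_aucs = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [l for i, l in enumerate(y) if i not in held_out]
        model = fit(train_X, train_y)
        scores = [score(model, X[i]) for i in test_idx]
        fold_aucs.append(auc(scores, [y[i] for i in test_idx]))
    return sum(fold_aucs) / len(fold_aucs)


# Hypothetical toy problem: the single feature is already a perfect score.
X = [[1.0]] * 5 + [[0.0]] * 5
y = [-1] * 5 + [1] * 5
mean_auc = cross_validated_auc(X, y, lambda tx, ty: None, lambda m, x: x[0])
print(mean_auc)
```

    Higher scores are assumed to mean "more likely installed"; a model
    that outputs the probability of class 1 instead would need its scores
    negated or the positive label switched.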

    5. Discussion

    Among the models stated above, L1 regularized logistic regression
    combined with Adaboost gives the highest AUC (0.9635). In this
    paper, only in-links, such as being suggested by other packages,
    are considered. In future studies, out-links, such as importing or
    suggesting other packages, could also be used. Classification might
    also improve if the packages in each topic are classified
    separately.

    REFERENCES

    [1] Alexandros Karatzoglou and David Meyer. Support Vector Machines in R. Journal of Statistical Software, Volume 15, Issue 9, April 2006.

    [2] Greg Ridgeway. Generalized Boosted Models: A guide to the gbm package. August 3, 2007.

    [3] Jerome H. Friedman. Regularized Discriminant Analysis. Journal of the American Statistical Association, Vol. 84, No. 405 (Mar., 1989), pp. 165-175.

    [4] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.

    [5] http://cran.r-project.org/web/packages/e1071/index.html

    [6] http://www.cs.princeton.edu/~schapire/boost.html

    [7] http://cran.r-project.org/web/packages/glmnet/index.html

    [8] http://cran.r-project.org/


    [9] The training data set and the open-source web crawler code were downloaded from: https://github.com/johnmyleswhite/r_recommendation_system