R User Group Singapore, Data Mining with R (Workshop II) - Random forests
R for Data Analysis and Data Mining
©UFS
R for Data Analysis and Data Mining
Jianping Liu
Mar 19, 2014
2
Outline
• R and RStudio installation
• Basics of R : data types and operators
• R for Statistical Analysis and Data mining
3
What is R?
• “a language and environment for statistical computing and graphics”; a combination of a statistical package (interactive statistical analysis) and a programming language
• a dialect of the S language, which was developed at AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks starting in the 1970s; R itself dates from the 1990s
• Runs on multiple platforms and various devices: MacOS, Windows, Linux, PC, iPhone …
• Frequent releases and bug fixes; active development
• Free
Installation of R and Resources online
• http://www.r-project.org/ # R download & installation
• http://www.rstudio.com/ # RStudio installation
• http://www.rseek.org/ # web-based R search
• http://cran.r-project.org/doc/manuals/R-intro.html
• http://www.rdatamining.com/ # data mining examples
• http://www.ats.ucla.edu/stat/r/ # Stat analysis examples
• http://www.coursera.org # R Programming start 4/7/2014
5
RStudio : an integrated development environment for R
6
The uses of R
• R may be used as a calculator
• R provides numerical or graphical summaries of data
• R has extensive graphical abilities
• R will handle a variety of specific analyses
• R is an interactive programming language
• Software for Data Analysis: Programming with R (Statistics and Computing) by John M. Chambers (Springer)
• S Programming (Statistics and Computing) by Brian D. Ripley and William N. Venables (Springer)
useofR
7
Packages
• install.packages("name of the package")
• library(pkg)
• detach("package:pkg")
• update.packages()
Example:
install.packages("sos")
library(sos)
Alert: R is case sensitive
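The case-sensitivity alert and the install/load steps can be sketched as follows. This is only an illustration: load_or_install is a hypothetical helper name, not part of R, and the call at the end uses the stats package (shipped with R) so that nothing is downloaded.

```r
# Function names are case sensitive: only the lowercase spelling exists.
exists("install.packages")   # TRUE
exists("Install.packages")   # FALSE

# Hypothetical helper (not part of R): install a package only when it is
# missing, then attach it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)    # needs an internet connection and a CRAN mirror
  }
  library(pkg, character.only = TRUE)
}

load_or_install("stats")     # already installed, so it is only attached
```

Replacing "stats" with "sos" would reproduce the example on this slide, at the cost of a download on first use.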
8
Getting help and info
• help(package = "sos") # documentation on a package
• ?'&&'
• ??audit
• help.search("time series")
• library(sos)
• findFn("time series")
• example(data.frame)
• demo(lm.glm, package = "stats", ask = TRUE)
helpsearch.R
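Besides the help functions listed above, base R can also search by object name. The short sketch below is an addition, not taken from the slides; it uses apropos() and args(), both of which ship with R.

```r
# apropos() lists objects on the search path whose names match a pattern.
hits <- apropos("^read\\.csv")
hits

# args() shows a function's argument list without opening the help page.
args(read.csv)
```

This is handy when only part of a function name is remembered; ?read.csv then opens the full documentation.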
9
Data Types and Basic Operations
R has five “atomic” classes of objects:
• Character
• Numeric (real numbers)
• Integer
• Complex
• Logical (TRUE/FALSE)
The most basic object is a vector
• A vector contains objects of the same class: c()
• A list can contain objects of various classes: list()
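A minimal sketch of the five atomic classes and of the difference between c() and list(); the variable names are chosen only for illustration.

```r
# One value of each atomic class; class() reports the class name.
class("a")        # "character"
class(3.14)       # "numeric"
class(2L)         # "integer" (the L suffix requests an integer)
class(1 + 2i)     # "complex"
class(TRUE)       # "logical"

# c() builds a vector; mixing classes coerces every element to one class.
mixed <- c(1, "a", TRUE)
class(mixed)      # "character"

# list() keeps each element's own class.
l <- list(1, "a", TRUE)
sapply(l, class)  # "numeric" "character" "logical"
```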
10
Data Types and Basic Operations
Matrices are vectors with a dimension attribute.
• The dimension attribute is itself an integer vector of length 2 (nrow, ncol)
• Matrices are filled column-wise by default; specify byrow = TRUE to fill row-wise
Factors are used to represent categorical data.
• Factors can be unordered or ordered.
• One can think of a factor as an integer vector where each integer has a label.
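The points about matrices and factors can be illustrated with a short sketch; the object names and values are made up for the example.

```r
# matrix() fills column by column unless byrow = TRUE is given.
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
dim(m_col)          # 2 3
m_col[1, ]          # 1 3 5
m_row[1, ]          # 1 2 3

# A plain vector becomes a matrix once it is given a dim attribute.
v <- 1:6
dim(v) <- c(2, 3)
identical(v, m_col) # TRUE

# A factor stores integer codes plus a label (level) for each code.
size <- factor(c("low", "high", "low"), levels = c("low", "high"), ordered = TRUE)
as.integer(size)    # 1 2 1
levels(size)        # "low" "high"
```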
11
Data frames are used to store tabular data
• They are fundamental to the use of the R modelling and graphics functions
• They are represented as a special type of list where every element of the list has to have the same length
• Unlike matrices, data frames can store different classes of objects in each column (just like lists); matrices must have every element be the same class
• Data frames are usually created by calling read.table() or read.csv()
• Can be converted to a matrix by calling data.matrix()
Data Types and Basic Operations
datatypes
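A small sketch of the data frame properties listed on this slide. The column names and values are invented, and read.csv() is given inline text (its text argument) so the example runs without a file.

```r
# Columns may have different classes, but all must have the same length.
df <- data.frame(
  name  = c("a", "b", "c"),
  score = c(90.5, 82, 77.25),
  pass  = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
sapply(df, class)   # name: character, score: numeric, pass: logical
is.list(df)         # TRUE: a data frame is a special kind of list

# read.csv() returns a data frame.
csv <- read.csv(text = "x,y\n1,2\n3,4")
dim(csv)            # 2 2

# data.matrix() converts the selected columns to a numeric matrix.
m <- data.matrix(df[, c("score", "pass")])
is.matrix(m)        # TRUE
```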
12
R for Regression Analysis
http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf (Faraway, Practical Regression and Anova using R)
logitRegression.R
• Regression analysis is the analysis of the relationship between a response or outcome variable and another set of variables
• The relationship is expressed through a statistical model equation that predicts a response variable (also called a dependent variable or criterion) from a function of explanatory variables (also called independent variables, predictors, factors, or carriers) and parameters
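As a rough illustration of a response predicted from explanatory variables, the sketch below fits a linear and a logistic model. It uses the cars and mtcars datasets that ship with R; the choice of datasets and variables is an assumption for the example and is not taken from logitRegression.R.

```r
# Linear model: stopping distance (response) explained by speed.
fit <- lm(dist ~ speed, data = cars)
coef(fit)                                   # estimated intercept and slope
predict(fit, newdata = data.frame(speed = 15))

# Logistic regression: transmission type (0/1) explained by car weight.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(logit_fit)$coefficients             # estimates, std. errors, p-values
```

summary(fit) prints the full regression table, and plot(fit) draws the standard diagnostic plots.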
13
R for Time series Analysis
• Introductory Time Series with R by P.S.P. Cowpertwait and A.V. Metcalfe. Springer, 2009
http://elena.aut.ac.nz/~pcowpert/ts/#RScripts
• Time Series Analysis and Its Applications: With R Examples (3rd ed) by R.H. Shumway and D.S. Stoffer. Springer Texts in Statistics, 2011 (package: astsa)
http://www.stat.pitt.edu/stoffer/tsa3/
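A minimal sketch of time series handling in base R, independent of the two books above. AirPassengers ships with R; the values in x are made up for the example.

```r
# AirPassengers: monthly airline passenger totals, 1949-1960.
frequency(AirPassengers)                 # 12 observations per year

# Split the series into trend, seasonal and random components.
parts <- decompose(AirPassengers, type = "multiplicative")
names(parts)

# ts() builds a time series from a plain vector (made-up quarterly values).
x <- ts(c(5, 7, 6, 9, 8, 11, 10, 13), start = c(2013, 1), frequency = 4)
window(x, start = c(2014, 1))            # the four quarters of 2014

# Autocorrelations, computed without drawing the plot.
acf(x, plot = FALSE)
```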
14
R Reference Card
R_referencecard_2.0
R_referencecard_regression
R_referencecard_timeseries
R_referencecard_data_mining
15
Data Mining with Rattle
# to install package rattle and load the GUI
install.packages("rattle", dependencies = c("Depends", "Suggests"))
library(rattle)
rattle()
• Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery (Use R!) by Graham Williams
• http://www.r-project.org/doc/bib/R-books.html
16
Drawbacks of R
• Limited support for dynamic or interactive graphics
• Objects must generally be stored in physical memory
• Functionality is based on consumer demand and user contributions
• Not ideal for all situations
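The memory point can be made concrete with object.size(), which reports how much memory an object occupies. The vector length below is an arbitrary choice for illustration.

```r
# One million doubles take roughly 8 MB, all of it held in RAM.
x <- numeric(1e6)
print(object.size(x), units = "MB")

# Removing the object and running the garbage collector frees the memory.
bytes <- as.numeric(object.size(x))
rm(x)
invisible(gc())
```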
17
Thank you !