Lecture 3_R Basics
-
Copyright 2014, Simplilearn, All rights reserved.
Lesson 3
Basic Analytic Techniques Using R
-
Objectives
After completing this lesson, you will be able to:
Get a basic introduction to R
Understand exploration of data
Explore data using R
Visualize data using R
Understand diagnostic analytics
Implement diagnostic analytics using R
-
Introduction to R
Programming language for graphics and statistical computation
Freely available under the GNU General Public License
Used in data mining and statistical analysis
Includes time series analysis and linear and nonlinear modeling, among others
Very active community and many contributed packages
Very little programming knowledge necessary
Can be downloaded from http://www.r-project.org/
RStudio (optional)
-
Packages
install.packages('package_name')
library(package_name)
Loading data
data(dataset_name)
read and write functions
getwd() and setwd(dir)
read and write functions take a full path name, or a path relative to the working directory
Example: read.csv("C:/Rtutorials/Sampledata.csv")
Assignment operator: <-
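The package, path, and assignment commands above can be combined into one short session. This is a minimal sketch, not the course's own example: it writes the built-in iris data to a temporary CSV file (the name csv_path and the temporary location are assumptions made so the code runs on any machine, in place of the C:/Rtutorials path), and install.packages() is left commented out because it needs a network connection.

```r
# install.packages("MASS")   # one-time download from CRAN (needs a network connection)
library(datasets)            # attach a package that is already installed

data(iris)                   # load a built-in data set into the workspace

getwd()                      # print the current working directory

# Hypothetical file location: a temporary file instead of C:/Rtutorials/...
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)   # write a data frame to disk

# "<-" assigns the result of read.csv() to the name iris_copy
iris_copy <- read.csv(csv_path)
nrow(iris_copy)              # number of rows read back
```

Passing row.names = FALSE keeps write.csv() from adding an extra column of row numbers, so the file read back has the same five columns as iris.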
-
Data exploration using R
Basic functions for data exploration in R
Data stored as data frames
Data frames: tabular representation of data with rows and columns
Every row denotes a particular case
Sample data frame (iris data set)
Sepal length  Sepal width  Petal length  Petal width  Species
7.9           3.8          6.4           2            I. virginica
7.7           3            6.1           2.3          I. virginica
5.6           2.5          3.9           1.1          I. versicolor
5.6           2.8          4.9           2            I. virginica
5.5           4.2          1.4           0.2          I. setosa
5.5           3.5          1.3           0.2          I. setosa
7.1           3            5.9           2.1          I. virginica
7             3.2          4.7           1.4          I. versicolor
-
Viewing data frame
The iris data set is a built-in data frame
View data set
iris
View first few rows of the data set
head(iris, n)
View last few rows of the data set
tail(iris, n)
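As a small illustration of the viewing functions above, the following sketch calls head() and tail() on iris with and without the n argument. The variable names first_rows and last_rows are chosen here for illustration only.

```r
head(iris)                    # first 6 rows (the default when n is omitted)

first_rows <- head(iris, 3)   # first 3 rows
last_rows  <- tail(iris, 3)   # last 3 rows

first_rows
last_rows
```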
-
Dimensions of data frame
View the dimensions of the data set
dim(iris)
View the number of columns
ncol(iris)
View the number of rows
nrow(iris)
-
Attributes of data frame
View column names/headers
names(iris)
View all attributes
attributes(iris)
-
View column data
iris$Petal.Length
iris[, "Petal.Length"]
-
View row data
View particular rows
iris[10:15, ]
View particular rows of a single column
iris[10:15, "Petal.Length"]
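The dimension, column, and row commands from the last few slides can be tried together. The sketch below assumes only the built-in iris data frame; the variable names are illustrative. It also checks that the $ form and the bracket form return the same column.

```r
dim(iris)      # number of rows, then number of columns
names(iris)    # column headers

# Two equivalent ways to pull out one column as a numeric vector
by_dollar  <- iris$Petal.Length
by_bracket <- iris[, "Petal.Length"]
identical(by_dollar, by_bracket)

rows_10_15  <- iris[10:15, ]                  # all columns, rows 10 to 15
petal_10_15 <- iris[10:15, "Petal.Length"]    # one column, rows 10 to 15
rows_10_15
petal_10_15
```

Inside the brackets the part before the comma selects rows and the part after it selects columns; leaving either side empty selects everything along that dimension.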
-
Summarizing data in R
summary(data_frame)
Example: summary(iris)
Output: minimum, 1st quartile, median, mean, 3rd quartile, and maximum for each numeric column; counts per level for factor columns
table(dataframe$columnname)
Example: table(iris$Species)
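A short sketch of both summarizing functions on iris follows. The name species_counts is an illustrative choice, not part of the course material.

```r
summary(iris)                            # per-column summary of the whole data frame

species_counts <- table(iris$Species)    # frequency of each species
species_counts
```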
-
Individual summaries in R
Here column_name is a numeric column such as iris$Sepal.Length
min(column_name)
max(column_name)
range(column_name)
mean(column_name)
median(column_name)
IQR(column_name)
sd(column_name)
var(column_name)
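To see the individual summary functions in action, the sketch below applies each of them to one numeric column of iris. The choice of Sepal.Length and the name x are assumptions made for illustration.

```r
x <- iris$Sepal.Length   # the column to summarize

min(x); max(x)
range(x)                 # minimum and maximum together
mean(x); median(x)
IQR(x)                   # 3rd quartile minus 1st quartile
sd(x); var(x)            # sd(x) is the square root of var(x)
```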
-
Subset summary
aggregate(): function for column-wise aggregation
Numerical summaries for a subset of data
aggregate(formula, data, FUN)
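One way to read the aggregate() signature is through a formula of the form value ~ group. The sketch below, which assumes the iris data and uses mean as the summary function, computes per-species means for one column and then for all numeric columns.

```r
# Mean sepal length for each species
mean_sepal <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
mean_sepal

# A dot on the left of the tilde stands for every other column in the data
mean_all <- aggregate(. ~ Species, data = iris, FUN = mean)
mean_all
```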
-
Data Visualization in R
plot(): generic function for plotting in R
plot(iris)
-
Data Visualization in R
Plot Sepal Length against Species
Arguments of the plot function: main, xlab, ylab
plot(iris$Sepal.Length, iris$Species,
main = "Iris Data",
xlab = "Sepal Length",
ylab = "Species")
-
Pie Charts
Pie charts visualize the numerical proportion of different classes as sectors of a circle
pie(table(iris$Species),
main = "Iris Data by Species")
Table of data by species:
Setosa  Virginica  Versicolor
50      50         50
-
Bar plots
USPersonalExpenditure
                     1940   1945   1950  1955  1960
Food and Tobacco     22.2   44.5   59.6  73.2  86.8
Household Operation  10.5   15.5   29    36.5  46.2
Medical and Health   3.53   5.76   9.71  14    21.1
Personal Care        1.04   1.98   2.45  3.4   5.4
Private Education    0.341  0.974  1.8   2.8   3.64
barplot(USPersonalExpenditure,
main = "US Personal Expenditure by Year",
xlab = "Year",
ylab = "Expenditures")
Box Plot
boxplot(Sepal.Length ~ Species,
data = iris,
main = "Iris Data Set",
xlab = "Species type",
ylab = "Sepal Length")
-
Histogram
Histograms depict frequency distributions
islands dataset
hist(iris$Sepal.Length,
main = "Iris data",
xlab = "Sepal Length",
ylab = "Frequency")
[Figure: histogram of the iris data; x-axis Sepal length, y-axis Frequency]
-
Correlation
Class of statistical relationships between variables
cor.test() uses Pearson's correlation by default
cor.test(column1, column2)
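As an illustration of cor.test(), the sketch below tests the correlation between two numeric iris columns. The choice of Sepal.Length and Petal.Length is an assumption made here; any pair of numeric columns would work the same way.

```r
# Pearson correlation (the default method) between two numeric columns
result <- cor.test(iris$Sepal.Length, iris$Petal.Length)
result

result$estimate    # the correlation coefficient
result$p.value     # p-value for the null hypothesis of zero correlation
```

The returned object is a list, so individual pieces such as the estimate and the p-value can be extracted with $ rather than read off the printed output.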
-
Analysis of variance
aov(): function to fit an Analysis of Variance model
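The slide names aov() without an example, so here is a minimal sketch under the assumption that the question is whether mean sepal length differs between the three iris species. The model formula and variable names are illustrative choices.

```r
# One-way ANOVA: does mean Sepal.Length differ across Species?
fit <- aov(Sepal.Length ~ Species, data = iris)

anova_table <- summary(fit)   # ANOVA table with F statistic and p-value
anova_table

# Extract the p-value for the Species term from the table
p_value <- anova_table[[1]][["Pr(>F)"]][1]
p_value
```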
-
Chi-squared test
Implement and print the result of chi-squared test for goodness of fit
margin.table(HairEyeColor, 1)
chisq.test(variable) or chisq.test(variable, p = probabilities)
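The two forms of chisq.test() can be sketched on the hair-color totals from HairEyeColor. The probability vector below is hypothetical, invented purely to show the p argument; note that it has to be passed by name, because the second positional argument of chisq.test() is y, not p.

```r
# Total count of students for each hair color (first dimension of the table)
hair <- margin.table(HairEyeColor, 1)
hair

# Goodness of fit against equal probabilities for all four colors
equal_fit <- chisq.test(hair)
equal_fit

# Goodness of fit against hypothetical probabilities (must sum to 1)
assumed_p <- c(0.2, 0.5, 0.1, 0.2)
custom_fit <- chisq.test(hair, p = assumed_p)
custom_fit
```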
-
T-test
Paired t-tests
data(anorexia, package = "MASS")
Columns: Treat (treatment), Prewt (pre-treatment weight), Postwt (post-treatment weight)
-
T-test
Independent t-tests
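Both kinds of t-test can be sketched on the anorexia data, assuming the MASS package is installed (it ships with standard R distributions). The gain column and the comparison of the "CBT" and "Cont" treatment groups are illustrative choices made here, not taken from the slides.

```r
library(MASS)   # provides the anorexia data set
data(anorexia)

# Paired t-test: the same patients are weighed before and after treatment
paired_fit <- t.test(anorexia$Postwt, anorexia$Prewt, paired = TRUE)
paired_fit

# Independent t-test: weight gain compared between two separate groups
anorexia$gain <- anorexia$Postwt - anorexia$Prewt
two_groups <- droplevels(subset(anorexia, Treat %in% c("CBT", "Cont")))
indep_fit <- t.test(gain ~ Treat, data = two_groups)
indep_fit
```

The paired test works on the difference within each patient, while the independent test compares the means of two unrelated groups; by default t.test() uses the Welch version, which does not assume equal variances.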
-
Summary
Here is a quick recap of what we have learned in this lesson:
A basic introduction to R
Data exploration using R
Data visualizations in R
Pie charts
Bar plots
Box plots
Histogram
Diagnostic analytics using R
Chi-squared test
T-tests
Analysis of variance
-
Quiz
-
QUIZ
1 Which of the following is not a functionality of R?
a. Linear modeling
b. Nonlinear modeling
c. Developing web applications
d. Time series analysis
Answer: c.
Explanation: R is a statistical analysis and data mining tool.
-
QUIZ
2 Which of the following is the function to display the first 5 rows of data?
a. top(data,5)
b. first(data,5)
c. first(5, data)
d. head(data,5)
Answer: d.
Explanation: The head() function is used to display the first few rows of data.
-
QUIZ
3 What will be the result of the following command: class(iris)
a. integer
b. matrix
c. data.frame
d. vector
Answer: c.
Explanation: iris is a sample data set that is stored as a data frame.
-
QUIZ
4 What is the significance of the dot symbol before the tilde in the aggregate function?
a. Aggregate no columns
b. Aggregate the first column
c. Aggregate the last column
d. Aggregate all columns
Answer: d.
Explanation: The dot symbol in the formula specifies aggregation on all other columns.
-
QUIZ
5 Create a histogram of the islands dataset. What is the highest frequency of the dataset?
a. 40
b. 30
c. 45
d. 55
Answer: a.
Explanation: Plot the histogram using hist(islands). The highest frequency read from the graph is about 40.
-
QUIZ
6 The heights of a sample population are recorded before and after taking a height-increasing drug. Which of the following commands would BEST check whether the drug has an effect on a person's height?
a. t.test(preheight, postheight)
b. t.test(preheight, postheight, paired = TRUE)
c. chisq.test(preheight, postheight)
d. aov(preheight~postheight)
Answer: b.
Explanation: The variables are paired, so a paired t-test is the best way to test for an effect.
-
Thank You