Lecture 3_R Basics

  • Copyright 2014, Simplilearn, All rights reserved.

    Lesson 3

    Basic Analytic Techniques Using R

  • Copyright 2014, Simplilearn, All rights reserved.

    After completing this course, you will be able to:

    Get a basic introduction to R

    Understand exploration of data

    Explore data using R

    Visualize data using R

    Understand diagnostic analytics

    Implement diagnostic analytics using R

    Objectives

  • Copyright 2014, Simplilearn, All rights reserved.

    Introduction to R

    Programming language for graphics and statistical computations

    Freely available under the GNU General Public License

    Used in data mining and statistical analysis

    Includes time series analysis, linear and nonlinear modeling, among others

    Very active community and package contributions

    Very little programming language knowledge necessary

    Can be downloaded from http://www.r-project.org/

    RStudio (optional)

  • Copyright 2014, Simplilearn, All rights reserved.

    Packages

    install.packages('package_name')

    library(package_name)

    Loading data

    data(dataset_name)

    read and write functions

    getwd() and setwd(dir)

    read and write functions take a full path name, or a path relative to the working directory

    Example: read.csv("C:/Rtutorials/Sampledata.csv")

    Assignment operator: <-
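
    A minimal sketch of the steps on this slide, assuming a writable temporary directory. The file name iris_copy.csv is hypothetical and only stands in for a real path such as the one in the example above.

```r
# Load an installed package. stats ships with R; a package that is not yet
# installed would first need install.packages("package_name").
library(stats)

# Load a built-in data set into the workspace.
data(iris)

# Show the current working directory.
getwd()

# Write a data frame to a CSV file and read it back using a full path.
csv_path <- file.path(tempdir(), "iris_copy.csv")  # hypothetical file name
write.csv(iris, csv_path, row.names = FALSE)
iris_copy <- read.csv(csv_path)

# The assignment operator <- stores a result in a variable.
n_rows <- nrow(iris_copy)
```

    Building the path with file.path() avoids hard-coding a drive letter, so the same lines should run on Windows, macOS, and Linux.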

  • Copyright 2014, Simplilearn, All rights reserved.

    Basic functions for data exploration in R

    Data is stored as data frames

    A data frame is a tabular representation of data with rows and columns

    Every row denotes a particular case

    Sample data frame (iris data set)

    Data exploration using R

    Sepal length  Sepal width  Petal length  Petal width  Species
    7.9           3.8          6.4           2            I. virginica
    7.7           3            6.1           2.3          I. virginica
    5.6           2.5          3.9           1.1          I. versicolor
    5.6           2.8          4.9           2            I. virginica
    5.5           4.2          1.4           0.2          I. setosa
    5.5           3.5          1.3           0.2          I. setosa
    7.1           3            5.9           2.1          I. virginica
    7             3.2          4.7           1.4          I. versicolor

  • Copyright 2014, Simplilearn, All rights reserved.

    The iris data set is a built-in data frame

    View data set

    iris

    View first few rows of the data set

    head(iris, n)

    View last few rows of the data set

    tail(iris, n)

    Viewing data frame
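
    As a small illustration of the viewing functions above, the sketch below takes n = 3; the value 3 is an arbitrary choice, not one from the slide.

```r
# First and last three rows of the built-in iris data frame.
first_rows <- head(iris, 3)
last_rows  <- tail(iris, 3)

print(first_rows)
print(last_rows)
```

    When n is omitted, head() and tail() return six rows.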

  • Copyright 2014, Simplilearn, All rights reserved.

    View the dimensions of the data set

    dim(iris)

    View the number of columns

    ncol(iris)

    View the number of rows

    nrow(iris)

    Dimensions of data frame
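
    The three dimension functions can be checked against each other, as in this short sketch (variable names are illustrative):

```r
# dim() returns rows first, then columns.
dims   <- dim(iris)
n_rows <- nrow(iris)
n_cols <- ncol(iris)

print(dims)  # 150 rows, 5 columns
```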

  • Copyright 2014, Simplilearn, All rights reserved.

    View column names/headers

    names(iris)

    View all attributes

    attributes(iris)

    Attributes of Data frame
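
    A brief sketch of what these two calls return for iris; the variable names are illustrative only.

```r
# Column names of the data frame.
col_names <- names(iris)
print(col_names)

# attributes() returns a list that includes the column names,
# the class, and the row names.
attrs <- attributes(iris)
print(names(attrs))
```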

  • Copyright 2014, Simplilearn, All rights reserved.

    iris$Petal.Length

    iris[ , "Petal.Length"]

    View column data
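
    Both forms of column access return the same numeric vector, which the following sketch verifies:

```r
# Access a column with the $ operator and with bracket indexing.
col_dollar  <- iris$Petal.Length
col_bracket <- iris[, "Petal.Length"]

# Both are plain numeric vectors with one value per row.
same_result <- identical(col_dollar, col_bracket)
print(same_result)
```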

  • Copyright 2014, Simplilearn, All rights reserved.

    View particular rows

    iris[10:15, ]

    View particular rows of a single column

    iris[10:15, "Petal.Length"]

    View row data
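
    A short sketch of row selection: the first index picks rows, the second picks columns, and leaving the second empty keeps every column.

```r
# Rows 10 to 15, all columns: the result is still a data frame.
some_rows <- iris[10:15, ]

# Rows 10 to 15 of one column: the result is a numeric vector.
some_values <- iris[10:15, "Petal.Length"]

print(some_rows)
print(some_values)
```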

  • Copyright 2014, Simplilearn, All rights reserved.

    summary(data_frame)

    summary(iris)

    Output: minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each numeric column

    table(dataframe$columnname)

    Example : table(iris$Species)

    Summarizing data in R
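
    The sketch below applies both summarizing functions to iris; restricting summary() to a single column is an illustrative choice to keep the output short.

```r
# Six-number summary of one numeric column.
sepal_summary <- summary(iris$Sepal.Length)
print(sepal_summary)

# Frequency count of each value in a categorical column.
species_counts <- table(iris$Species)
print(species_counts)
```

    For a factor column such as Species, summary(iris) reports counts per level rather than quartiles.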

  • Copyright 2014, Simplilearn, All rights reserved.

    min(column_name)

    max(column_name)

    range(column_name)

    mean(column_name)

    median(column_name)

    IQR(column_name)

    sd(column_name)

    var(column_name)

    Individual summaries in R
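
    In practice column_name is written as a column of a data frame. A minimal sketch, assuming the Sepal.Length column of iris as the example:

```r
x <- iris$Sepal.Length

x_min    <- min(x)
x_max    <- max(x)
x_range  <- range(x)    # c(minimum, maximum)
x_mean   <- mean(x)
x_median <- median(x)
x_iqr    <- IQR(x)      # 3rd quartile minus 1st quartile
x_sd     <- sd(x)       # sample standard deviation
x_var    <- var(x)      # sample variance, equal to sd(x)^2

print(c(min = x_min, max = x_max, mean = x_mean, median = x_median))
```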

  • Copyright 2014, Simplilearn, All rights reserved.

    aggregate() - function for column-wise aggregation

    Numerical summaries for a subset of data

    aggregate(formula, data, FUN)

    Subset summary
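
    A sketch of the formula interface, assuming the mean as the summary function. The left side of the formula names the column to summarize and the right side names the grouping column; a dot on the left side stands for all remaining columns.

```r
# Mean sepal length for each species.
by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(by_species)

# Mean of every other column for each species.
all_columns <- aggregate(. ~ Species, data = iris, FUN = mean)
print(all_columns)
```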

  • Copyright 2014, Simplilearn, All rights reserved.

    plot() - generic function for plotting in R

    plot(iris)

    Data Visualization in R

  • Copyright 2014, Simplilearn, All rights reserved.

    Plot Sepal Length against Species

    Arguments of the plot function: main, xlab, ylab

    plot(iris$Sepal.Length, iris$Species,
         main = "Iris Data",
         xlab = "Sepal Length",
         ylab = "Species")

    Data Visualization in R
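
    Species is a factor, so on a numeric axis it appears as the integer codes 1 to 3. The variant below is only a sketch of one way to label that axis with the species names; the explicit conversion and the axis() call are additions, not part of the slide.

```r
# Convert the factor to its integer codes explicitly.
species_codes <- as.integer(iris$Species)

# Draw the scatter plot without the default y axis.
plot(iris$Sepal.Length, species_codes,
     main = "Iris Data",
     xlab = "Sepal Length",
     ylab = "Species",
     yaxt = "n")

# Add a y axis that shows the species names instead of 1, 2, 3.
axis(2, at = seq_along(levels(iris$Species)), labels = levels(iris$Species))
```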

  • Copyright 2014, Simplilearn, All rights reserved.

    Pie charts visualize the numerical proportions of different classes as sectors of a circle

    pie(table(iris$Species),
        main = "Iris Data by Species")

    Table of data by species:

    setosa  versicolor  virginica
        50          50         50

    Pie Charts
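
    A possible extension, sketched under the assumption that percentage labels are wanted on the slices; the label format is an illustrative choice.

```r
# Count the rows per species and convert the counts to percentages.
counts <- table(iris$Species)
pct <- round(100 * counts / sum(counts), 1)

# Build labels such as "setosa (33.3%)".
slice_labels <- paste0(names(counts), " (", pct, "%)")

pie(counts,
    labels = slice_labels,
    main = "Iris Data by Species")
```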

  • Copyright 2014, Simplilearn, All rights reserved.

    USPersonalExpenditure

                         1940    1945   1950   1955   1960
    Food and Tobacco     22.2    44.5   59.6   73.2   86.8
    Household Operation  10.5    15.5   29     36.5   46.2
    Medical and Health    3.53    5.76   9.71  14     21.1
    Personal Care         1.04    1.98   2.45   3.4    5.4
    Private Education     0.341   0.974  1.8    2.6    3.64

    barplot(USPersonalExpenditure,
            main = "US Personal Expenditure by Year",
            xlab = "Year",
            ylab = "Expenditures")

    Bar plots
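
    By default barplot() stacks the rows of a matrix within each column. The sketch below shows a grouped variant with a legend; the beside, legend.text, and args.legend settings are illustrative choices rather than part of the slide.

```r
# One group of bars per year, one bar per expenditure category.
bar_mids <- barplot(USPersonalExpenditure,
                    beside = TRUE,
                    legend.text = rownames(USPersonalExpenditure),
                    args.legend = list(x = "topleft"),
                    main = "US Personal Expenditure by Year",
                    xlab = "Year",
                    ylab = "Expenditures")
```

    barplot() returns the bar midpoints, which can be reused to place text or error bars on the plot.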

  • Copyright 2014, Simplilearn, All rights reserved.

    boxplot(Sepal.Length ~ Species,
            data = iris,
            main = "Iris Data Set",
            xlab = "Species type",
            ylab = "Sepal Length")

    Box Plot
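
    boxplot() also returns the numbers behind the picture. A small sketch of how they might be inspected (the variable name bp is illustrative):

```r
bp <- boxplot(Sepal.Length ~ Species,
              data = iris,
              main = "Iris Data Set",
              xlab = "Species type",
              ylab = "Sepal Length")

# One column per species; rows are lower whisker, lower hinge,
# median, upper hinge, and upper whisker.
print(bp$stats)
print(bp$names)
```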

  • Copyright 2014, Simplilearn, All rights reserved.

    Histograms depict a frequency distribution

    islands dataset

    hist(iris$Sepal.Length,
         main = "Iris data",
         xlab = "Sepal Length",
         ylab = "Frequency")

    Histogram

    [Figure: histogram of sepal length; x-axis: Sepal length, y-axis: Frequency]
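
    The bin boundaries and counts drawn by hist() can be read from its return value, as in this sketch (the variable name h is illustrative):

```r
h <- hist(iris$Sepal.Length,
          main = "Iris data",
          xlab = "Sepal Length",
          ylab = "Frequency")

# Bin boundaries and the number of observations in each bin.
print(h$breaks)
print(h$counts)

# The tallest bar corresponds to the largest count.
highest_frequency <- max(h$counts)
```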

  • Copyright 2014, Simplilearn, All rights reserved.

    Correlation is a class of statistical relationships between variables

    cor.test() uses Pearson's correlation by default

    cor.test(column1, column2)

    Correlation
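
    A minimal sketch using two iris columns; the choice of petal length and petal width is illustrative.

```r
# Pearson correlation test between two numeric columns.
result <- cor.test(iris$Petal.Length, iris$Petal.Width)
print(result)

# The estimated correlation coefficient and the p-value
# can be pulled out of the result object.
r_estimate <- unname(result$estimate)
p_value <- result$p.value
```

    Other coefficients can be requested through the method argument, for example method = "spearman".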

  • Copyright 2014, Simplilearn, All rights reserved.

    aov() - function to fit an Analysis of Variance model

    Analysis of variance
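
    The slide gives no example call, so the following is a sketch under the assumption that the question is whether mean sepal length differs between the three iris species.

```r
# One-way ANOVA: response on the left of the tilde, grouping factor on the right.
fit <- aov(Sepal.Length ~ Species, data = iris)

# The summary table holds degrees of freedom, sums of squares,
# the F statistic, and the p-value.
anova_table <- summary(fit)[[1]]
print(anova_table)
```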

  • Copyright 2014, Simplilearn, All rights reserved.

    Implement and print the result of a chi-squared test for goodness of fit

    margin.table(HairEyeColor, 1)

    chisq.test(variable) or chisq.test(variable, p = probabilities)

    Chi-squared test
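
    A sketch of both forms of the call on the hair colour totals. The probabilities in the second test are hypothetical values chosen for illustration; they are not taken from the slide or from any study.

```r
# Total count of each hair colour, summed over eye colour and sex.
hair <- margin.table(HairEyeColor, 1)
print(hair)

# Goodness of fit against equal proportions (the default).
test_equal <- chisq.test(hair)
print(test_equal)

# Goodness of fit against user-supplied proportions (hypothetical values).
expected_p <- c(Black = 0.2, Brown = 0.5, Red = 0.1, Blond = 0.2)
test_custom <- chisq.test(hair, p = expected_p)
print(test_custom)
```

    The probabilities must be passed by name as p, because the second positional argument of chisq.test() is a second data vector.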

  • Copyright 2014, Simplilearn, All rights reserved.

    Paired t-tests

    data(anorexia, package = "MASS")

    Columns: Treat (treatment), Prewt (weight before treatment), Postwt (weight after treatment)

    T-test
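
    A minimal sketch of a paired test on this data set, assuming the MASS package is installed (it ships with standard R distributions).

```r
# Load the data set without attaching the whole package.
data(anorexia, package = "MASS")

# Each patient has a weight before and after treatment,
# so the two columns are compared as pairs.
paired_result <- t.test(anorexia$Prewt, anorexia$Postwt, paired = TRUE)
print(paired_result)
```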

  • Copyright 2014, Simplilearn, All rights reserved.

    T test

    Independent t-tests
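
    The slide names the test without an example. The sketch below assumes a comparison of sepal length between two iris species, which are independent groups of flowers.

```r
# Two independent samples.
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]
virginica  <- iris$Sepal.Length[iris$Species == "virginica"]

# Without paired = TRUE, t.test() treats the samples as independent
# and applies the Welch correction for unequal variances by default.
independent_result <- t.test(versicolor, virginica)
print(independent_result)
```

    Setting var.equal = TRUE would give the classical pooled-variance test instead.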

  • Copyright 2014, Simplilearn, All rights reserved.

    Here is a quick recap of what we have learned in this lesson:

    A basic introduction to R

    Data exploration using R

    Data visualizations in R

    Pie charts

    Bar plots

    Box plots

    Histogram

    Diagnostic analytics using R

    Chi-squared test

    T-tests

    Analysis of Variance

    Summary

  • Copyright 2014, Simplilearn, All rights reserved.

    Quiz

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    1. Which of the following is not a functionality of R?

    a. Linear modeling

    b. Nonlinear modeling

    c. Developing web applications

    d. Time series analysis

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    1. Which of the following is not a functionality of R?

    Answer: c. Developing web applications

    Explanation: R is a statistical analysis and data mining tool.

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    2. Which of the following is the function to display the first 5 rows of data?

    a. top(data,5)

    b. first(data,5)

    c. first(5, data)

    d. head(data,5)

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    2. Which of the following is the function to display the first 5 rows of data?

    Answer: d. head(data,5)

    Explanation: The head() function is used to display the first few rows of data.

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    3. What will be the result of the command class(iris)?

    a. integer

    b. matrix

    c. data.frame

    d. vector

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    3. What will be the result of the command class(iris)?

    Answer: c. data.frame

    Explanation: iris is a sample data set that is stored as a data frame.

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    4. What is the significance of the dot symbol before the tilde in the aggregate function?

    a. Aggregate no columns

    b. Aggregate the first column

    c. Aggregate the last column

    d. Aggregate all columns

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    4. What is the significance of the dot symbol before the tilde in the aggregate function?

    Answer: d. Aggregate all columns

    Explanation: The dot symbol in the formula is used to specify aggregation on all columns.

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    5. Create a histogram of the islands dataset. What is the highest frequency of the dataset?

    a. 40

    b. 30

    c. 45

    d. 55

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    5. Create a histogram of the islands dataset. What is the highest frequency of the dataset?

    Answer: a. 40

    Explanation: Plot the histogram using hist(islands). The highest frequency read from the graph is about 40.

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    6. The heights of a sample population are recorded before and after taking a height-increasing drug. Which of the following commands would BEST check whether the drug has an effect on a person's height?

    a. t.test(preheight, postheight)

    b. t.test(preheight, postheight, paired = TRUE)

    c. chisq.test(preheight, postheight)

    d. aov(preheight~postheight)

  • Copyright 2014, Simplilearn, All rights reserved.

    QUIZ

    6. The heights of a sample population are recorded before and after taking a height-increasing drug. Which of the following commands would BEST check whether the drug has an effect on a person's height?

    Answer: b. t.test(preheight, postheight, paired = TRUE)

    Explanation: The measurements are paired (the same people before and after), so a paired t-test is the best way to check for an effect.

  • Copyright 2014, Simplilearn, All rights reserved.

    Thank You