
BIO5312: R Session 1: An Introduction to R and Descriptive Statistics

Yujin Chung

August 30th, 2016

Fall, 2016


Introduction to R

R software

R is both open source and open development.

You can look at the source code and you can propose changes

You can write R functions and publish them.

R is available for many platforms:
  • Unix of many flavors, including Linux, Solaris, FreeBSD, AIX
  • Windows 95 and later
  • Mac OS X

Binaries and source code are available from www.r-project.org

The R Console

Basic interaction with R is by typing in the console, a.k.a. the terminal or command line

You type in commands, R gives back answers (or errors)


RStudio

RStudio allows the user to run R in a more user-friendly environment. It is open-source (i.e. free) and available at http://www.rstudio.com/

R console

R script/editor

Environment/Workspace: all the active objects

History: a list of commands used so far

Files: shows all the files and folders in your default working directory

  • Changing the working directory: More → Set As Working Directory

Plots, Packages, Help


Quick start

• Mathematical operators/functions

> log(64) # natural logarithm

[1] 4.158883

> sqrt(2) # square root

[1] 1.414214

Q) What are the R outputs of the following?

7+5, 7/5, 7*5, 7-2, 7^2, 7%%5, 7%/%5

• Comparisons are also binary operators: they take two objects, like numbers, and give a Boolean (TRUE or FALSE)

> 7 == 5 # is 7 equal to 5?

[1] FALSE

> 7 > 5 # is 7 larger than 5?

[1] TRUE

> 7 != 5 # is 7 not equal to 5?

[1] TRUE


Quick start II

• Boolean operators: & (and), | (or), ! (not)

Q) What are the R outputs of the following?

(5>7) & (6*7 == 42)

(5>7) | (6*7 == 42)

!(5>7) | (6*7 == 42)

• R help

> help(log) # or

> ?log


Operators

Arithmetic operators

Operator     Description
+            addition
-            subtraction
*            multiplication
/            division
^ or **      exponentiation
x %% y       remainder
x %/% y      quotient

Logical operators

Operator     Description
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           exactly equal to
!=           not equal to
!x           NOT x
x | y        x OR y
x & y        x AND y
isTRUE(x)    test if x is TRUE
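The operators in the two tables can be tried together in a short script. The following is a minimal sketch; the variable names (a, b, quotient, and so on) are chosen here for illustration and do not come from the slides.

```r
# Illustrative values (not from the slides' data)
a <- 7
b <- 5

quotient  <- a %/% b   # integer quotient of 7 divided by 5
remainder <- a %% b    # what is left over after that division

# The quotient and remainder reassemble the original number
reassembled <- quotient * b + remainder

# ^ and ** are two spellings of the same exponentiation operator
same_power <- (2^10 == 2**10)

# Logical operators combine the results of comparisons
both    <- (a > b) & (a != b)    # TRUE AND TRUE
either  <- !(a > b) | (a == b)   # (NOT TRUE) OR FALSE
checked <- isTRUE(a > b)

print(c(quotient = quotient, remainder = remainder, reassembled = reassembled))
print(c(same_power = same_power, both = both, either = either, checked = checked))
```

Note that ! binds more tightly than | and &, so !(a > b) | (a == b) negates only the first comparison.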


Data type

We can give names to data objects; these give us variables! Variables are created with the assignment operator: = or <- (arrow)

Numeric: numbers, either floating point or integer

> x=5 # or x <- 5

character : a character string

> x = "I like chocolate ice cream"

logical : TRUE or FALSE

> x = (1 > 2)

built-in variables. E.g. TRUE (or T), FALSE (or F)
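One way to check which data type an object has is the base function class(). The sketch below uses illustrative variable names that are not part of the slides, and it also shows why spelling out TRUE and FALSE is usually safer than T and F: the short forms are ordinary variables and can be overwritten.

```r
# class() reports the data type of an object
num_type <- class(5)       # a number typed without a suffix is "numeric"
int_type <- class(5L)      # the L suffix asks for an integer
chr_type <- class("I like chocolate ice cream")
lgl_type <- class(1 > 2)

print(c(num_type, int_type, chr_type, lgl_type))

# T is an ordinary variable that happens to hold TRUE, so it can be masked
T <- 0
shadowed <- isTRUE(T)      # our T is now 0, so this is FALSE
rm(T)                      # remove our copy; T refers to the built-in again
restored <- isTRUE(T)
print(c(shadowed = shadowed, restored = restored))
```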


Data structures

Data structures group data values into one object. Types include:

vector

data frame

list

matrix

factors

tables

Some R packages have their own data structure


Data structures: vectors

Vectors: a sequence of values, all of the same type (numeric, character or logical). The function c() returns a vector containing all its arguments in order.

numeric vector

> x = c(1,2,3)

> x

[1] 1 2 3

> length(x) # the length of x

[1] 3

character

> x=c("a","b")

> x

[1] "a" "b"

> length(x)

[1] 2


Data structures: vectors II

Sequence generators

> x = seq(from=1, to=3, by=1) # or seq(1,3,1)

> x

[1] 1 2 3

> x = 1:3 # same as seq(1,3,1)

Extracting sub-vectors

> x[2] # return the 2nd element

[1] 2

> x[2:3] # extracting subset from the 2nd to 3rd elements

[1] 2 3

> x[c(2,3)] # same as x[2:3]

> x[-2] # drop the 2nd element

[1] 1 3
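The same indexing rules can be seen more clearly on a vector whose values differ from their positions. This is a small sketch with made-up values; the last line uses a logical condition as the index, which the slides do not cover but which follows the same bracket syntax.

```r
# Values chosen so that positions and values cannot be confused
x <- seq(from = 10, to = 50, by = 10)

second     <- x[2]        # the element in position 2
middle     <- x[2:3]      # positions 2 through 3
not_second <- x[-2]       # everything except position 2
large      <- x[x > 25]   # elements for which the condition is TRUE

print(x)
print(second)
print(middle)
print(not_second)
print(large)
```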


Data structures: vectors III

Element-wise arithmetic

> x = 1:5

> x+1

[1] 2 3 4 5 6

> x <= 3

[1] TRUE TRUE TRUE FALSE FALSE

Pairwise arithmetic

> x = 1:5

> y = c(-1, 0, 3:5)

> x+y

[1] 0 2 6 8 10

> x == y

[1] FALSE FALSE TRUE TRUE TRUE
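Because comparisons return a logical vector, the result can be counted or used to pick out elements. The sketch below reuses the x and y from the slide; the names plus_one, n_equal and matching are illustrative.

```r
x <- 1:5
y <- c(-1, 0, 3:5)

# Adding a single number behaves as if it were repeated to the length of x
plus_one  <- x + 1
same_as_1 <- identical(plus_one, x + rep(1, 5))

# TRUE counts as 1 and FALSE as 0, so sum() counts the matching positions
n_equal  <- sum(x == y)

# A logical vector used as an index keeps only the TRUE positions
matching <- x[x == y]

print(plus_one)
print(n_equal)
print(matching)
```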


Data structure: data frames

Data frames: a data set that can be represented as a set of observations (rows) on several variables (columns).

Example: Fisher's or Anderson's iris data set (built-in)

> iris

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

## Basic information of data frame

> names(iris) # variable names

[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

> dim(iris) # the numbers of rows and columns

[1] 150 5

> nrow(iris) # the number of rows

[1] 150

> ncol(iris) # the number of columns

[1] 5


Data structure: data frames II

Extracting subset

> iris$Sepal.Length # extracting variable (1st column)

[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1

[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0

> iris[,1] # extracting the 1st column

[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1

[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0

> iris[2,] # extracting the 2nd row

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

2 4.9 3 1.4 0.2 setosa

> iris[1,1] # the element of the 1st column & the 1st row

[1] 5.1
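Rows and columns can also be selected by name or by a condition instead of by position. The following sketch uses the built-in iris data; the object names (first_rows, setosa, same_column) are illustrative and not from the slides.

```r
# Select rows 1 to 3 and two columns by name
first_rows <- iris[1:3, c("Sepal.Length", "Species")]
print(first_rows)

# Keep only the rows where a condition on a column holds
setosa <- iris[iris$Species == "setosa", ]
n_setosa <- nrow(setosa)
print(n_setosa)

# $ and [, 1] are two ways to reach the same first column
same_column <- identical(iris$Sepal.Length, iris[, 1])
print(same_column)
```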


Data structure: data frames III

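Each column is a vector; a single row such as iris[2,] is returned as a one-row data frame, because its columns may have different types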

> sepal = iris$Sepal.Length # extracting variable (1st column)

> length(sepal)

[1] 150

> temp = iris[1:10,2]

> length(temp)

[1] 10
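The difference between extracting a column and extracting a row can be checked with is.vector() and is.data.frame(). This is a minimal sketch on the built-in iris data; col1, row2 and row2_values are names made up for the example.

```r
col1 <- iris[, 1]    # one column
row2 <- iris[2, ]    # one row

col_is_vector <- is.vector(col1)
row_is_df     <- is.data.frame(row2)

# For a data frame, length() counts columns, not values in a row
row_length <- length(row2)

# The numeric part of a row can be flattened into a named vector
row2_values <- unlist(iris[2, 1:4])

print(c(col_is_vector = col_is_vector, row_is_df = row_is_df))
print(row_length)
print(row2_values)
```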


Writing data

Writing a data set in a text or CSV file

write.table(x,file,quote,sep,row.names,col.names,...)

x: data object to save
file: file name
quote: if TRUE, character or factor values are surrounded by double quotes; if FALSE, nothing is quoted
sep: the field separator string; values within each row are separated by this string
row.names: a logical value indicating whether the row names of x are to be written along with x
col.names: a logical value indicating whether the column names of x are to be written along with x

> write.table(iris, file = "iris.txt", quote = FALSE, sep = " ",
+             row.names = FALSE, col.names = TRUE)


Reading data

Reading a data set in a text or CSV file

read.table(file, header, sep, ...)

file: data file to read
header: a logical value indicating whether the file contains the names of the variables as its first line
sep: the field separator string; values within each row are separated by this string

> iris2 = read.table("iris.txt", header = TRUE, sep = " ")

> lead = read.table("LEAD.DAT.txt",header=T)
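Writing and reading can be checked together with a round trip: write a data frame, read it back, and compare. The sketch below assumes a temporary file from tempfile() rather than "iris.txt", so that nothing is left in the working directory; the names tmp and iris_back are illustrative.

```r
# A throwaway file path (assumption: used instead of "iris.txt")
tmp <- tempfile(fileext = ".txt")

# Write the built-in iris data with a header line and no row names
write.table(iris, file = tmp, quote = FALSE, sep = " ",
            row.names = FALSE, col.names = TRUE)

# Read it back; header = TRUE takes the variable names from the first line
iris_back <- read.table(tmp, header = TRUE, sep = " ")

print(dim(iris_back))
print(names(iris_back))

# The numeric columns should match the original up to rounding
print(all.equal(iris$Sepal.Length, iris_back$Sepal.Length))

unlink(tmp)   # delete the temporary file
```

The separator passed to read.table() has to match the one used by write.table(); otherwise each line is read as a single field.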


Functions

Built-in functions: write.table, read.table, etc. Packages provide more, e.g. read.cross (from the qtl package)

> names(lead) ## variable names of "lead"

[1] "id" "area" "ageyrs" "sex" "iqv_inf" "iqv_comp"

[7] "iqv_ar" "iqv_ds" "iqv_raw" "iqp_pc" "iqp_bd" "iqp_oa"

[13] "iqp_cod" "iqp_raw" "hh_index" "iqv" "iqp" "iqf"

[19] "iq_type" "lead_grp" "Group" "ld72" "ld73" "fst2yrs"

[25] "totyrs" "pica" "colic" "clumsi" "irrit" "convul"

[31] "X_2plat_r" "X_2plar_l" "visrea_r" "visrea_l" "audrea_r" "audrea_l"

[37] "fwt_r" "fwt_l" "hyperact" "maxfwt"

> dim(lead)

[1] 124 40


min(), max() and range()

min(x): the minimum of the argument x
max(x): the maximum of the argument x
range(x): the minimum and maximum of the argument x

> fwt = lead$maxfwt # extracting "maxfwt" and creating a new variable

> fwt

[1] 72 61 49 48 51 49 50 58 50 51 59 65 57 53 74 50 84 46 52 64 59 55 99 46 52

[26] 63 52 42 57 23 65 38 59 26 53 50 56 49 76 68 60 46 57 45 46 64 40 62 13 79

> min(fwt)

[1] 13

> max(fwt)

[1] 99

> range(fwt)

[1] 13 99

> diff(range(fwt)) # the "range" of fwt

[1] 86


mean()

mean(x): the arithmetic mean of the argument x
sum(x): the sum of the argument x

> mean(fwt) # arithmetic mean of fwt

[1] 61.44355

> sum(fwt)/length(fwt)

[1] 61.44355

> colMeans(lead) # the mean of each column

id area ageyrs sex iqv_inf iqv_comp iqv_ar

240.233871 1.717742 8.935000 1.387097 6.766129 7.532258 8.306452

See also: colMeans(), rowMeans(), colSums(), rowSums()
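colMeans() and its relatives need every column to be numeric. The sketch below uses the built-in iris data instead of the lead data (an assumption, so that it runs without the course file): the Species column is a factor, so it is dropped first. The names num, col_means, manual and row_means are illustrative.

```r
# Keep only the four numeric columns; colMeans(iris) itself would stop with
# an error because Species is a factor
num <- iris[, 1:4]

col_means <- colMeans(num)

# The same means computed column by column from the definition sum / n
manual <- sapply(num, function(v) sum(v) / length(v))

print(col_means)
print(all.equal(col_means, manual))

# rowMeans() averages across the columns within each row
row_means <- rowMeans(num)
print(head(row_means, 3))
```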


median() and quantile()

median(x): the median of the argument x
quantile(x): returns the minimum, Q1, Q2 (median), Q3 and maximum of x
IQR(x): the interquartile range (Q3 - Q1) of x

> median(fwt) # the median of fwt

[1] 56

> quantile(fwt)

0% 25% 50% 75% 100%

13 48 56 72 99

> IQR(fwt)

[1] 24

> quantile(fwt,probs=.25) # Q1, the 25th percentile

25%

48

> quantile(fwt,probs=.75) - quantile(fwt,probs=.25) # IQR

75%

24
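The relation between median(), quantile() and IQR() is easier to verify by hand on a very small vector. This sketch uses five made-up values, not the lead data; unname() only strips the percentage labels that quantile() attaches.

```r
# Five illustrative values, already sorted
v <- c(1, 3, 5, 7, 9)

med <- median(v)
q1  <- unname(quantile(v, probs = 0.25))
q2  <- unname(quantile(v, probs = 0.50))
q3  <- unname(quantile(v, probs = 0.75))

# The IQR is the distance between the third and first quartiles
iqr_builtin <- IQR(v)
iqr_manual  <- q3 - q1

print(c(Q1 = q1, median = med, Q3 = q3))
print(c(iqr_builtin = iqr_builtin, iqr_manual = iqr_manual))
```

quantile() interpolates between data points by default, so on other data sets the quartiles need not be observed values.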


var() and sd()

var(x): the (sample) variance of the argument x
sd(x): the (sample) standard deviation of the argument x

> var(fwt)

[1] 490.4114

> sd(fwt)

[1] 22.14523

> n = length(fwt)

> n

[1] 124

> sum( (fwt - mean(fwt))^2 )/ (n-1) # variance

[1] 490.4114

> sqrt( sum( (fwt - mean(fwt))^2 )/(n-1) ) # standard deviation

[1] 22.14523


summary()

summary(x): the minimum, Q1, median, mean, Q3 and maximum of the argument x

> summary(fwt)

Min. 1st Qu. Median Mean 3rd Qu. Max.

13.00 48.00 56.00 61.44 72.00 99.00


Writing and calling functions

The structure of a function

<function name> = function(arg1, arg2, ... ){

statements

return(object)

}

We write another summary function, called mySummary, that returns the mean and standard deviation of an argument variable after removing missing values.

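> mySummary = function(dat){ # define a function
+   # na.rm = TRUE drops missing values before each statistic is computed
+   res = c(mean(dat, na.rm = TRUE), sd(dat, na.rm = TRUE))
+   return(res)
+ }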

> mySummary(fwt) # calling a function

[1] 61.44355 22.14523
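Missing values only matter when the data contain them, so the effect of removing them is easiest to see on a small vector with an NA. The sketch below defines a hypothetical variant, mySummaryNA, that passes na.rm = TRUE to mean() and sd() and labels its results; the name and the test vector are made up for illustration.

```r
# Hypothetical variant of the summary function that skips missing values
mySummaryNA <- function(dat) {
  res <- c(mean = mean(dat, na.rm = TRUE),
           sd   = sd(dat, na.rm = TRUE))
  return(res)
}

# Illustrative data with one missing value
v <- c(2, 4, NA, 6)

plain_mean <- mean(v)        # without na.rm the result is itself missing
res <- mySummaryNA(v)        # computed from 2, 4 and 6 only

print(plain_mean)
print(res)
```

Naming the elements of the returned vector makes the output self-describing, which helps once a function returns more than one number.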


More intro

Some R resources
  • The official intro, “An Introduction to R”, available online at http://cran.r-project.org/doc/manuals/R-intro.pdf
  • Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design
  • Phil Spector, Data Manipulation with R
  • Paul Teetor, The R Cookbook
