R Introduction and descriptive statistics · 2020-06-11 · Introduction and descriptive statistics...


R

Introduction and descriptive statistics

tutorial 1


what is R

R is a free software programming language and software environment for statistical computing and graphics. (Wikipedia)

R is open source.


what is R

R is an object-oriented programming language.

Everything in R is an object.

R objects are stored in memory, and are acted upon by functions (and operators).

how to get R

Homepage: http://www.r-project.org/

CRAN: Comprehensive R Archive Network

how to edit R

Editor: RStudio

http://www.rstudio.com


RStudio

http://www.rstudio.com

how R works

using R as a calculator

Users type expressions to the R interpreter.

R responds by computing and printing the answers.


arithmetic operators

type operator action performed

arithmetic

results in numeric value(s)

+ addition

- subtraction

* multiplication

/ division

^ raise to power


logical operators

type operator action performed

comparison

results in logical

value(s):

TRUE FALSE

< less than

> greater than

== equal to

!= not equal to

<= less than or equal to

>= greater than or equal to

connectors

& boolean intersection operator (logical and)

| boolean union operator (logical or)


arithmetic operators

exponentiation

> 3 ^ 2

> 2 ^ (-2)

Note: > 100 ^ (1/2) is equivalent to > sqrt(100)

addition / subtraction

> 5 + 5

> 10 - 2

multiplication / division

> 10 * 10

> 25 / 5


logical operators

> 4 < 3

[1] FALSE

> 2^3 == 9

[1] FALSE

> (3 + 1) != 3

[1] TRUE

> (3 >= 1) & (4 == (3+1))

[1] TRUE
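
The operators <=, >= and | from the operator table work the same way. A small sketch with made-up values (the names a and b are chosen only for illustration):

```r
a <- 3
b <- 5
a <= b              # TRUE: 3 is less than or equal to 5
a >= b              # FALSE: 3 is not greater than or equal to 5
(a > b) | (b > 4)   # TRUE: the second comparison holds
c(1, 5, 9) <= 5     # comparisons are element-wise: TRUE TRUE FALSE
```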

assignment

Values are stored by assigning them a name.

The statements

> z = 17

> z <- 17

> 17 -> z

all store the value 17 under the name z in the workspace. Assignment operators are: <- , = , ->


data types

There are three basic types or modes of variables:

numeric (numbers: integers, real)

logical (TRUE, FALSE)

character (text strings, in "")

The type is shown by the mode() function. Note: A general missing value indicator is NA.


data types

> a = 49 # numeric

> sqrt(a)

[1] 7

> mode(a)

[1] "numeric"

> a = "The dog ate my homework" # character

> a

[1] "The dog ate my homework"

> mode(a)

[1] "character"

> a = (1 + 1 == 3) # logical

> a

[1] FALSE

> mode(a)

[1] "logical"


data structures

Elements (numeric, logical, character) are organized in:

vectors: ordered sets of elements of one type

data.frames: ordered sets of vectors (different vector types)

matrices: ordered sets of vectors (all of one vector type)

lists: ordered sets of anything
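
As a rough illustration of the four data structures, the following sketch builds one object of each kind. All object names and values are made up for this example.

```r
v <- c(1.5, 2.0, 3.5)                                # vector: elements of one type
m <- matrix(1:6, nrow = 2)                           # matrix: 2 rows, 3 columns, all integer
d <- data.frame(id = 1:3, name = c("a", "b", "c"))   # data frame: columns of different types
l <- list(v, m, d, "any text")                       # list: anything, including other structures
str(l)                                               # compact overview of the nested contents
```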


creating vectors

> x = c(1, 3, 5, 7, 8, 9) # numerical vector

> x

[1] 1 3 5 7 8 9

> z = c("I","am","Ironman") # character vector

> z

[1] "I" "am" "Ironman"

> x = c(TRUE,FALSE,NA) # logical vector

> x

[1] TRUE FALSE NA

The function c( ) can combine several elements into vectors.

combining vectors

The function c( ) can be used to combine both vectors and elements into larger vectors.

> x = c(1, 2, 3, 4)

> c(x, 10)

[1] 1 2 3 4 10

> c(x, x)

[1] 1 2 3 4 1 2 3 4

In fact, R stores elements like 10 as vectors of length one, so that both arguments in the expression above are vectors.
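
That a single number is itself a vector of length one can be checked directly. A minimal sketch using only base R functions (the name y is chosen for illustration):

```r
length(10)             # 1: a single number is a vector with one element
is.vector(10)          # TRUE
x <- c(1, 2, 3, 4)
y <- c(x, 10)          # combining a vector of length 4 with one of length 1
length(y)              # 5
identical(c(10), 10)   # TRUE: wrapping a single value in c() changes nothing
```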

sequences

A useful way of generating vectors is using the sequence operator. The expression n1:n2 generates the sequence of integers ranging from n1 to n2.

> 1:15

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13

[14] 14 15

> 5:-5

[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5

> y = 1:11

> y

[1] 1 2 3 4 5 6 7 8 9 10 11


extracting elements

> x = c(1, 3, 5, 7, 8, 9)

> x[3] # extract 3rd position

[1] 5

> x[1:3] # extract positions 1-3

[1] 1 3 5

> x[-2] # without 2nd position

[1] 1 5 7 8 9

> x[x<7] # select values < 7

[1] 1 3 5

> x[x!=5] # select values not equal to 5

[1] 1 3 7 8 9

data frame

Data frames provide a way of grouping a number of related vectors into a single data object. The function data.frame() takes a number of vectors of the same length and returns a single object containing all the variables.

df = data.frame(var1, var2, ...)


data frame

In a data frame the column labels are the vector names. Note: Vectors can be of different types in a data frame (numeric, logical, character).

Data frames can be created in a number of ways:

Binding together vectors by the function data.frame( ).

Reading in data from an external file.


data frame

> time = c("early","mid","late","late","early")

> type <- c("G", "G", "O", "O", "G")

> counts <- c(20, 13, 8, 34, 7)

> data <- data.frame(time,type,counts)

> data

time type counts

1 early G 20

2 mid G 13

3 late O 8

4 late O 34

5 early G 7

> fix(data)


example data: low birth weight

name: description (variable type)

low: low birth weight of the baby (nominal: 0 'no, >=2500g', 1 'yes, <2500g')

age: age of mother (continuous: years)

lwt: mother's weight at last menstrual period (continuous: pounds)

race: ethnicity (nominal: 1 'white', 2 'black', 3 'other')

smoke: smoking status (nominal: 0 'no', 1 'yes')

ptl: number of previous premature labors (discrete)

ht: hypertension (nominal: 0 'no', 1 'yes')

ui: presence of uterine irritability (nominal: 0 'no', 1 'yes')

ftv: number of physician visits in the first trimester (discrete)

bwt: birth weight of the baby (continuous: g)

The birthweight data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass during 1986.


example data: low birth weight

loading: library(MASS), the data frame is called birthwt. Functions that give an overview of a data frame:

dim(birthwt)

summary(birthwt)

head(birthwt)

str(birthwt)
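
A short sketch of such an overview, assuming the MASS package is installed (it ships with standard R installations). names() is not mentioned above; it is a base R function that lists the column names.

```r
library(MASS)      # provides the birthwt data frame
dim(birthwt)       # number of rows and columns
names(birthwt)     # column (variable) names
head(birthwt, 3)   # first three rows only
```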

extracting vectors

data$vectorlabel gives the vector named vectorlabel of the data frame named data. Elements of this vector are extracted as usual.

> birthwt$age

> birthwt$age[33]

> birthwt$age[1:10]


some functions in R

name function

summary(x) summary statistics of the elements of x

max(x) maximum of the elements of x

min(x) minimum of the elements of x

sum(x) sum of the elements of x

mean(x) mean of the elements of x

sd(x) standard deviation of the elements of x

median(x) median of the elements of x

quantile(x, probs=…) quantiles of the elements of x

sort(x) ordering the elements of x


some functions in R

> mean(birthwt$age)

[1] 23.2381

> max(birthwt$age)

[1] 45

> min(birthwt$age)

[1] 14
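
The other functions from the table can be applied to the same variable. A minimal sketch, again assuming the MASS package is available (the name age is introduced here only as a shorthand):

```r
library(MASS)
age <- birthwt$age                           # shorthand for the age column
median(age)                                  # median age
sd(age)                                      # standard deviation
quantile(age, probs = c(0.25, 0.5, 0.75))    # quartiles
summary(age)                                 # minimum, quartiles, mean and maximum in one call
```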


getting help

To get help on the sd() function you can type either of:

> help(sd)

> ?sd

sorting vectors

> help(sort)

> x=sort(birthwt$age, decreasing=FALSE)

> x[1:10]

[1] 14 14 14 15 15 15 16 16 16 16

> x=sort(birthwt$age, decreasing=TRUE)

> x[1:10]

[1] 45 36 36 35 35 34 33 33 33 32

> x[25] # 25th highest age

[1] 30

Sorting / ordering of data in vectors with the function sort()

graphics

R has extensive graphics facilities.

Graphics functions are divided into

high-level graphics functions

low-level graphics functions

The quality of the graphs produced by R is often cited as a major reason for using it in preference to other statistical software systems.


high-level graphics

name function

plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)

hist(x) histogram of the frequencies of x

barplot(x) bar chart of the values of x; use horiz=TRUE for horizontal bars

dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)

pie(x) circular pie-chart

boxplot(x) box-and-whiskers plot

stripchart(x) plot of the values of x on a line (an alternative to boxplot() for small sample sizes)

mosaicplot(x) mosaic plot from frequencies in a contingency table

qqnorm(x) quantiles of x with respect to the values expected under a normal law


high-level graphics

> hist(birthwt$age)

> boxplot(birthwt$age)


hands-on example

loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2. Load the dataframe into your workspace with the data("PimaIndiansDiabetes2") command. Get an overview with the functions dim, head.

Calculate the mean and median of the variable insulin. Remove NAs for the calculation with the na.rm = TRUE option in mean and median functions.

Plot a histogram of the variable insulin.
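
The effect of the na.rm option can be tried on a small made-up vector before working with the real data. The vector v below is purely illustrative and not part of the exercise data.

```r
v <- c(4, 8, NA, 15)        # made-up values, one of them missing
mean(v)                     # NA: the missing value propagates
mean(v, na.rm = TRUE)       # 9: mean of 4, 8 and 15
median(v, na.rm = TRUE)     # 8
sum(is.na(v))               # 1: number of missing values
```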


R

Graphics and probability theory

tutorial 2


plot function


The core R graphics command is plot(). This is an all-in-one function which carries out a number of actions:

It opens a new graphics window.

It plots the content of the graph (points, lines etc.).

It plots x and y axes and boxes around the plot and produces the axis labels and title.

….

plot function

Parameters in the plot() function are:

x: x-coordinate(s)

y: y-coordinates (optional, depends on x)


plot function

To plot points with x and y coordinates or two random variables for a data set (one on the x axis, the other on the y axis; called a scatterplot), type:

> a = c(1,2,3,4)

> b = c(4,4,0,5)

> plot(x=a,y=b)

> plot(a,b) # the same


plot function

To plot points with x and y coordinates or two random variables for a data set (one on the x axis, the other on the y axis; called a scatterplot), type:

> library(MASS)

> plot(x=birthwt$age,y=birthwt$lwt)

# lwt: mother's weight in pounds

> plot(x=birthwt$age[1:10],y=birthwt$lwt[1:10])

# first 10 mothers


plot function

Another example:

> a = seq(-5, +5, by=0.2)

# generates a sequence from -5 to +5 with increment 0.2

[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2

...

[45] 3.8 4.0 4.2 4.4 4.6 4.8 5.0

> b = a^2 # squares all components of a

> plot(a,b)

plot function

Parameters in the plot() function are (see help(plot) and help(par)):

x: x-coordinate(s)

y: y-coordinates (optional, depends on x)

main, sub: title and subtitle

xlab, ylab: axes labels

xlim, ylim: range of values for x and y

type: type of plot

lty: type of lines

pch: plot symbol

cex: scale factor

col: color of points etc.

...
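
Several of these parameters can be combined in one call. The sketch below reuses the quadratic example; the particular values (axis ranges, symbol, color, subtitle text) are arbitrary choices for illustration.

```r
a <- seq(-5, 5, by = 0.2)
b <- a^2
plot(a, b,
     main = "quadratic function",        # title
     sub  = "b = a^2",                   # subtitle
     xlab = "a", ylab = "b",             # axis labels
     xlim = c(-6, 6), ylim = c(0, 30),   # axis ranges
     type = "b",                         # both points and lines
     lty  = 2,                           # dashed line
     pch  = 16,                          # filled circle as plot symbol
     cex  = 0.8,                         # slightly smaller symbols
     col  = "blue")                      # color of points and lines
```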

plot symbol / line type

plot symbol: pch=

line type: lty=

plot type: type=

"p" points

"l" lines

"b" both

"s" steps

"h" vertical lines

"n" nothing

…


plot function

> a = seq(-5, +5, by=0.2)

> b = a^2

> plot(a, b)

> plot(a,b,main="quadratic function")

> plot(a,b,main="quadratic function",cex=2)

> plot(a,b,main="quadratic function",col="blue")


plot function

> a = seq(-5, +5, by=0.2)

> b = a^2

> plot(a,b,main="quadratic function",type="l")

> plot(a,b,main="quadratic function",type="b")

> plot(a,b,main="quadratic function",pch=2)

[Figure: plot titled "quadratic function" with axes x and y]

probability theory, factorials

Binomial coefficients can be computed by choose(n,k):

> choose(8,5)

[1] 56
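
The binomial coefficient can also be written out with factorials, n! / (k! (n-k)!). A minimal sketch using the base R function factorial(), which is not shown above:

```r
factorial(5)                                       # 5! = 120
choose(8, 5)                                       # 56
factorial(8) / (factorial(5) * factorial(8 - 5))   # 56, from n! / (k! (n-k)!)
```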


functions for random variables

Distributions can be easily calculated or simulated using R. The functions are named such that the first letter states what the function calculates or simulates

d = density function (probability function)

p = distribution function

q = quantile (inverse distribution)

r = random number generation

and the last part of the name of the function specifies the type of distribution, e.g.

binom = binomial distribution

norm = normal distribution
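
The naming scheme can be tried directly at the prompt. A minimal sketch with the standard normal distribution and one binomial call; the seed passed to set.seed() is an arbitrary choice that only makes the random draw repeatable.

```r
dnorm(0)                          # d: density of N(0,1) at 0
pnorm(1.96)                       # p: P(X <= 1.96), about 0.975
qnorm(0.975)                      # q: inverse of pnorm, about 1.96
set.seed(1)                       # arbitrary seed so the draw can be repeated
rnorm(3)                          # r: three random draws from N(0,1)
dbinom(2, size = 4, prob = 0.5)   # same scheme with the suffix binom: 0.375
```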

binomial distribution

Probability function:

• dbinom(x, size, prob)

x: k

size: n

prob: π

$$f(k) = P(X = k) = \binom{n}{k} \pi^k (1 - \pi)^{n-k}$$

normal distribution

Density function:

• dnorm(x, mean, sd)

with mean = μ and sd = σ

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

normal distribution

Calculating the probability density function:

> dnorm(x=2, mean=6, sd=2)

[1] 0.02699548

[Figure: density f(x) plotted against x]

normal distribution

Distribution function:

• pnorm(q, mean, sd)

q: b

$$F(b) = \int_{-\infty}^{b} f(x)\,dx$$

[Figure: density f(x) plotted against x, with b marked on the x axis]

normal distribution

Distribution function of the N(10,25) distribution:

[Figure: density f(x) plotted against x, with 13 marked on the x axis]

> pnorm(q=13, mean=10, sd=5)

[1] 0.7257469

binomial distribution

Probability function:

> dbinom(x=5, size=50, prob=0.15)

# Probability of having exactly 5 successes in 50 independent observations/measurements with a success probability of 0.15 each

[1] 0.1072481

> dbinom(5, 50, 0.15) # the same
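
As a cross-check, the value returned by dbinom() can be compared with the binomial probability function written out with choose(). This is only a sketch; the variable names n, k, p, by_formula and by_function are chosen for illustration.

```r
n <- 50; k <- 5; p <- 0.15
by_formula  <- choose(n, k) * p^k * (1 - p)^(n - k)   # probability function written out
by_function <- dbinom(k, size = n, prob = p)          # built-in function
all.equal(by_formula, by_function)                    # TRUE
sum(dbinom(0:n, size = n, prob = p))                  # probabilities over all k sum to 1
```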

normal distribution

Plotting the density of a N(5,49) distribution:

> x_values=seq(-15, 25, by=0.5)

> y_values=dnorm(x_values, mean=5, sd=7)

> plot(x_values,y_values,type="l")


hands-on example

loading: library(mlbench), the data frame is called PimaIndiansDiabetes2.

Make a scatter plot for the variables glucose and insulin.

What are the possible realizations of a random variable X distributed according to Bin(4,0.85)?

Calculate all possible values of the probability function of X.

Plot the probability function of X with the possible realizations of X on the x axis and the corresponding values of the probability function on the y axis.


R

Random numbers and factors

tutorial 3


functions for random variables

Distributions can be easily calculated or simulated using R. The functions are named such that the first letter states what the function calculates or simulates

d = density function (probability function)

p = distribution function

q = quantile (inverse distribution)

r = random number generation

and the last part of the name of the function specifies the type of distribution, e.g.

binom = binomial distribution

norm = normal distribution

t = t distribution

binomial distribution

Generating random realizations:

• rbinom(n, size, prob)

n: number of samples to draw

size: n (the number of trials in the formula)

prob: π

output: number of successes

$$f(k) = P(X = k) = \binom{n}{k} \pi^k (1 - \pi)^{n-k}$$

normal distribution

Generating random realizations:

• rnorm(n, mean, sd)

n: number of samples to draw

with mean = μ and sd = σ

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

t distribution

Quantiles:

• qt(p, df)

p: quantile probability

df: degrees of freedom

binomial distribution

Generating random realizations:

> rbinom(n=1, size=50, prob=0.15)

# Generating one sample of 50 independent observations/measurements with a success probability of 0.15 each

[1] 14 # 14 successes in this simulation

> rbinom(n=1, size=50, prob=0.15)

[1] 7

binomial distribution

Generating random realizations:

> rbinom(n=10, size=50, prob=0.15)

# Generating 10 samples

[1] 14 10 6 12 8 6 7 10 5 9

# The number of successes for all samples
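
Because the output is random, each call gives different numbers. If repeatable results are wanted, the base R function set.seed() (not used in the slides) can fix the state of the random number generator. A minimal sketch with an arbitrary seed value:

```r
set.seed(42)                                        # the seed value is arbitrary
first  <- rbinom(n = 10, size = 50, prob = 0.15)
set.seed(42)                                        # reset to the same state
second <- rbinom(n = 10, size = 50, prob = 0.15)
identical(first, second)                            # TRUE: same seed, same draws
```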

normal distribution

> values=rnorm(10, mean=0, sd=1)

> values

[1] -0.56047565 -0.23017749 1.55870831 0.07050839

0.12928774 1.71506499 0.46091621 -1.26506123

-0.68685285 -0.44566197

# 10 simulations from a N(0,1) distribution

> mean(values)

[1] 0.07462565

t distribution

Quantiles:

• qt(p, df)

p: quantile probability

df: degrees of freedom

> qt(p=0.95,df=9)

[1] 1.833113

> qt(p=0.95,df=99)

[1] 1.660391

> qnorm(p=0.95,mean=0,sd=1)

[1] 1.644854

> qt(p=0.975,df=99)

[1] 1.984217

The last value is $t_{1-\alpha/2,\,n-1}$ for α=0.05, n=100.
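
The comparison between qt() and qnorm() can be extended to see how the t quantile approaches the normal quantile as the degrees of freedom grow. A small sketch; the chosen df values and variable names are arbitrary.

```r
dfs <- c(9, 99, 999, 9999)                  # increasing degrees of freedom
t_quantiles <- qt(p = 0.95, df = dfs)       # qt() is vectorized over df
z_quantile  <- qnorm(p = 0.95)              # standard normal quantile
data.frame(df = dfs,
           t = t_quantiles,
           diff_to_normal = t_quantiles - z_quantile)
```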


object classes

All objects in R have a class. The class attribute allows R to treat objects differently (e.g. for summary() or plot()).

Possible classes are:

numeric

logical

character

list

matrix

data.frame

array

factor

The class is shown by the class() function.
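
A quick sketch of class() applied to a few small objects, all made up for illustration:

```r
class(3.14)                      # "numeric"
class(TRUE)                      # "logical"
class("text")                    # "character"
class(list(1, "a"))              # "list"
class(data.frame(x = 1:2))       # "data.frame"
class(factor(c("no", "yes")))    # "factor"
```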

factors

Categorical variables in R are often specified as factors.

Factors have a fixed number of categories, called levels.

summary(factor) displays the frequency of the factor levels.

Functions in R for creating factors:

factor(), as.factor()

levels() displays and sets levels.

factors

• factor(x,levels,labels)

• as.factor(x)

x: vector of data, usually small number of values

levels: specifies the values (categories) of x

labels: labels the levels

factors

> smoke = c(0,0,1,1,0,0,0,1,0,1)

> smoke

[1] 0 0 1 1 0 0 0 1 0 1

> summary(smoke)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.0 0.0 0.0 0.4 1.0 1.0

> class(smoke)

[1] "numeric"

factors

> smoke_new=factor(smoke)

> smoke_new

[1] 0 0 1 1 0 0 0 1 0 1

Levels: 0 1

> summary(smoke_new)

0 1

6 4

> class(smoke_new)

[1] "factor"

factors

> smoke_new=factor(smoke,levels=c(0,1))

> smoke_new

[1] 0 0 1 1 0 0 0 1 0 1

Levels: 0 1

> smoke_new=factor(smoke,levels=c(0,1,2))

> smoke_new

[1] 0 0 1 1 0 0 0 1 0 1

Levels: 0 1 2

> summary(smoke_new)

0 1 2

6 4 0

factors

> smoke_new=factor(smoke,levels=c(0,1),

labels=c("no", "yes"))

> smoke_new

[1] no no yes yes no no no yes no yes

Levels: no yes

> summary(smoke_new)

no yes

6 4

factors

> library(MASS)

> summary(birthwt$race)

Min. 1st Qu. Median Mean 3rd Qu. Max.

1.000 1.000 1.000 1.847 3.000 3.000

> race_new=as.factor(birthwt$race)

> summary(race_new)

1 2 3

96 26 67

> levels(race_new)

[1] "1" "2" "3"

> levels(race_new)=c("white","black","other")

> summary(race_new)

white black other

96 26 67


hands-on example

Sample 20 realizations of a N(0,1) distribution.

Calculate mean and standard deviation.

What is the formula for the confidence interval for the mean for unknown σ?

For a 90% confidence interval and the above sample: What are the parameters α and n? What is the value of $t_{1-\alpha/2,\,n-1}$?

Calculate the 90% confidence interval for our example.


R

Reading data from files,

frequency tables

tutorial 4


functions for random variables

Distributions can easily be calculated or simulated using R. The first letter of the function name states what the function calculates or simulates:

d = density function (probability function)
p = distribution function
q = quantile function (inverse distribution function)
r = random number generation

The rest of the function name specifies the type of distribution, e.g.

binom = binomial distribution
norm = normal distribution
t = t distribution
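The naming scheme can be tried out with a few calls. This is only a small sketch; the argument values are arbitrary examples.

```r
p3  <- dbinom(3, size = 10, prob = 0.5)  # d + binom: P(X = 3) for X ~ Bin(10, 0.5)
cdf <- pnorm(1.96)                       # p + norm: P(Z <= 1.96) for Z ~ N(0,1)
z   <- qnorm(0.975)                      # q + norm: 97.5% quantile of N(0,1)
tq  <- qt(0.975, df = 19)                # q + t: 97.5% quantile of the t distribution, 19 df

set.seed(42)                             # arbitrary seed for reproducibility
draws <- rnorm(5)                        # r + norm: five random draws from N(0,1)
```

Note that the t quantile is larger than the corresponding normal quantile, which reflects the heavier tails of the t distribution.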


Quantiles: normal distribution

• qnorm(p, mean, sd)

p: quantile probability

> qnorm(p=0.95,mean=0,sd=1)

[1] 1.644854

= z_{0.95}

> qnorm(p=0.975,mean=0,sd=1)

[1] 1.959964

= z_{1-α/2} for α=0.05


reading data: working directory

For reading or saving files, a simple file name identifies a file in the working directory. Files in other places can be specified by the path name.

getwd() gives the current working directory.

setwd("path") sets a specific directory as your working directory.

Use setwd("path") to load and save data in the directory of your choice.
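A minimal sketch of how the working directory affects simple file names. It uses R's temporary directory instead of a real project folder, and the file name example.txt is a made-up placeholder.

```r
old_wd <- getwd()                     # remember the current working directory

setwd(tempdir())                      # switch to a temporary directory
writeLines("a;b", "example.txt")      # a simple file name refers to the working directory
found <- file.exists("example.txt")   # TRUE: the file was written there

setwd(old_wd)                         # restore the previous working directory
```

Restoring the old working directory at the end keeps the rest of a session unaffected.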


The standard way of storing statistical data is to put them in a rectangular form with rows corresponding to observations and columns corresponding to variables.

Spreadsheets are often used to store and manipulate data in this way, e.g. EXCEL.

The function read.table() can be used to read

data which has been stored in this way.

The first argument to read.table() identifies the

file to be read.

reading data


Optional arguments to read.table() can be used to change its behaviour.

Setting header=TRUE indicates to R that the first row of the data file contains names for each of the columns.

The argument skip= makes it possible to skip the specified number of lines at the top of the file.

The argument sep= can be used to specify the character which separates columns. (Use sep=";" for csv files that use a semicolon as separator, as in the example data below; csv files in the narrower sense use sep=",".)

The argument dec= can be used to specify the character used as decimal point.
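The following sketch shows these arguments working together. The file content is invented for illustration (one comment line, a header row, semicolon as separator, comma as decimal point) and is written to a temporary file first.

```r
csv_path <- tempfile(fileext = ".csv")          # temporary file, made-up content
writeLines(c("exported by a hypothetical program",
             "id;height;weight",
             "1;172,5;70,1",
             "2;180,0;82,4"),
           csv_path)

dat <- read.table(csv_path,
                  skip   = 1,      # skip the comment line at the top
                  header = TRUE,   # next line contains the column names
                  sep    = ";",    # columns are separated by semicolons
                  dec    = ",")    # comma is the decimal point
str(dat)
```

Without dec="," the columns height and weight would be read as text rather than as numbers.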


example data: infarct

(case/control study)

name          label                 values / unit
nro           identifier
grp           group (string)        'control', 'infarct'
code          coded group           0 'control', 1 'infarct'
sex           sex                   1 'male', 2 'female'
age           age                   years
height        body height           cm
weight        body weight           kg
blood sugar   blood sugar level     mg/100ml
diabet        diabetes              0 'no', 1 'yes'
chol          cholesterol level     mg/100ml
trigl         triglyceride level    mg/100ml
cig           cigarettes            number


> setwd("C:/Users/Präsentation/MLS")

> mi = read.table("infarct data.csv")

Error in scan(file, what,...: line 2 did not have 2

elements # wrong separator

> mi = read.table("infarct data.csv",sep=";")

> summary(mi) # no variable names

> mi = read.table("infarct data.csv",sep=";",

header=TRUE)

> summary(mi) # with variable names

example data: infarct


frequency tables

table(var1, var2) gives a table of the absolute frequencies of all combinations of var1 and var2 (frequency table, cross classification table, contingency table). var1 and var2 must take only a finite number of values. var1 defines the rows, var2 the columns.

addmargins(table) adds the sums of rows and columns.

prop.table(table) gives the relative frequencies, overall or with respect to rows or columns.


frequency tables

> grp_sex=table(mi$grp,mi$sex)

> grp_sex

1 2

control 25 15

infarct 28 12

> addmargins(grp_sex)

1 2 Sum

control 25 15 40

infarct 28 12 40

Sum 53 27 80


frequency tables

> prop.table(grp_sex)

1 2

control 0.3125 0.1875

infarct 0.3500 0.1500

> prop.table(grp_sex,margin=1)

1 2

control 0.625 0.375

infarct 0.700 0.300 # rows sum to 1

> prop.table(grp_sex,margin=2)

1 2

control 0.4716981 0.5555556

infarct 0.5283019 0.4444444 # columns sum to 1


hands-on example

Load the dataset from the file bdendo.csv into the workspace. Generate a table of the variables d (case-control status) and dur (categorical duration of oestrogen therapy). Generate a table of the variables d (case-control status) and agegr (age group). Compare the two tables.


R

Installing packages,

the package "pROC"

tutorial 5


R packages


R consists of a base level of functionality together with a set of contributed libraries which provide extended capabilities.

The key idea is that of a package which provides a related set of software components, documentation and data sets.

Packages can be installed into R. Installing into the system-wide library may need administrator rights; without them, R can install into a personal library.


Package: pROC

Type: Package

Title: display and analyze ROC curves

Version: 1.7.1

Date: 2014-02-20

Encoding: UTF-8

Depends: R (>= 2.13)

Imports: plyr, utils, methods, Rcpp (>= 0.10.5)

Suggests: microbenchmark, tcltk, MASS, logcondens, doMC,

doSNOW

LinkingTo: Rcpp

Author: Xavier Robin, Natacha Turck, Alexandre Hainard,

Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez

and Markus Müller.

Maintainer: Xavier Robin <[email protected]>

pROC – diagnostic testing


installing packages


You can install R packages using the install.packages() command.

> install.packages("pROC")

Installing package(s) into

‘C:/Users/Amke/Documents/R/win-library/2.15’

(as ‘lib’ is unspecified)

downloaded 827 Kb

package ‘pROC’ successfully unpacked and MD5 sums

checked

The downloaded binary packages are in

C:\Users\Amke\AppData\Local\Temp\RtmpUJPoia\downloaded_packages


Installing R packages using the menu:


using installed packages


When R is running, simply type:

> library(pROC)

This adds the R functions in the library to the search path. You can now use the functions and datasets in the package and inspect the documentation.
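In scripts it can be convenient to check whether a package is installed before attaching it. The helper attach_if_installed() below is a hypothetical name introduced only for this sketch; requireNamespace() and library() are the base R functions doing the work.

```r
# Hypothetical helper (not part of R): attach a package only if it is installed
attach_if_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed; ",
            "it could be installed with install.packages().")
    return(invisible(FALSE))
  }
  library(pkg, character.only = TRUE)   # adds the package to the search path
  invisible(TRUE)
}

ok <- attach_if_installed("MASS")       # MASS is normally shipped with R
if (ok) print(search())                 # the search path now lists "package:MASS"
```

After a package is attached, help(package = "MASS") lists its documentation.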


cite packages

To cite the package pROC in publications use:

> citation("pROC")

...

Xavier Robin, Natacha Turck, Alexandre Hainard,

Natalia Tiberti, Frédérique

Lisacek, Jean-Charles Sanchez and Markus Müller

(2011). pROC: an open-source

package for R and S+ to analyze and compare ROC

curves. BMC Bioinformatics, 12,

p. 77. DOI: 10.1186/1471-2105-12-77

<http://www.biomedcentral.com/1471-2105/12/77/>

...


package pROC

The main function is roc(response, predictor). It creates the values necessary for an ROC curve.

response: disease status (as provided by gold standard)

predictor: continuous test result

(to be dichotomized)

For an roc object the plot(roc_obj) function produces an ROC curve.


package pROC

The function coords(roc_obj,x,best.method,ret) calculates measures of test performance.

x: value for which measures are calculated (default: threshold); x="best" gives the optimal threshold

best.method: if x="best", the method to determine the best threshold (e.g. "youden")

ret: Measures calculated. One or more of "threshold", "specificity", "sensitivity", "accuracy", "tn" (true negative count), "tp" (true positive count), "fn" (false negative count), "fp" (false positive count), "npv" (negative predictive value), "ppv" (positive predictive value)

(default: threshold, specificity, sensitivity)
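To see what these measures mean, the sketch below computes specificity, sensitivity and the Youden index (sensitivity + specificity - 1) by hand in base R, without pROC. The data are made up, and it is an assumption of this sketch that a test result above the threshold counts as positive.

```r
# Toy data, invented for illustration: disease status (1 = present) and test result
disease <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
marker  <- c(0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.5, 0.7, 0.8, 0.9)

# Dichotomize the continuous test result at a threshold and count the four outcomes
perf_at <- function(threshold) {
  test_pos <- marker > threshold
  tp <- sum(test_pos  & disease == 1)   # true positives
  fn <- sum(!test_pos & disease == 1)   # false negatives
  tn <- sum(!test_pos & disease == 0)   # true negatives
  fp <- sum(test_pos  & disease == 0)   # false positives
  c(threshold   = threshold,
    specificity = tn / (tn + fp),
    sensitivity = tp / (tp + fn))
}

thresholds <- c(0.25, 0.45, 0.75)
perf   <- t(sapply(thresholds, perf_at))
youden <- perf[, "sensitivity"] + perf[, "specificity"] - 1
best   <- perf[which.max(youden), ]     # threshold with the largest Youden index
best
```

In this toy example only three candidate thresholds are compared; coords() with x="best" searches over all thresholds of the ROC curve.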


example data: aSAH

(aneurysmal subarachnoid haemorrhage)

name      label                                                                 values / unit
gos6      Glasgow Outcome Score (GOS) at 6 months                               1-5
outcome   prediction of development (to be diagnosed)                           'good', 'poor'
gender    sex                                                                   'male', 'female'
age       age                                                                   years
wfns      World Federation of Neurological Surgeons Score                       1-5
s100b     S100 calcium binding protein B (biomarker, continuous test result)    μg/l
ndka      Nucleoside diphosphate kinase A (biomarker, continuous test result)   μg/l


> data(aSAH) # loads the data set "aSAH"

> head(aSAH)

> rocobj = roc(aSAH$outcome, aSAH$s100b)

> plot(rocobj)

> coords(rocobj, 0.55)

threshold specificity sensitivity

0.5500000 1.0000000 0.2682927

> coords(rocobj, x="best",best.method="youden")

threshold specificity sensitivity

0.2050000 0.8055556 0.6341463

# Youden threshold is 0.205, with the corresponding specificity and sensitivity

package pROC


Measures of Test Performance

Outcomes of a diagnostic study for a dichotomous test result

                      test result
disease      positive          negative
present      true positive     false negative
absent       false positive    true negative


> coords(rocobj,x="best",best.method="youden",

ret=c("threshold","specificity","sensitivity",

"tn","tp","fn","fp"))

threshold specificity sensitivity

0.2050000 0.8055556 0.6341463

tn tp fn fp

58.0000000 26.0000000 15.0000000 14.0000000

package pROC

                  test result
disease      positive    negative
present      tp: 26      fn: 15
absent       fp: 14      tn: 58


R

Statistical testing 1

tutorial 6


statistical test functions


name              function
t.test( )         Student's t-test
wilcox.test( )    Wilcoxon rank sum test and signed rank test
ks.test( )        Kolmogorov-Smirnov test
chisq.test( )     Pearson's chi-squared test for count data
mcnemar.test( )   McNemar test


One sample t test

The function t.test() performs different Student's t tests.

The arguments for the one sample t test are t.test(x, mu, alternative):

x: numeric vector of values which shall be tested (assumed to follow a normal distribution)

mu: reference value µ0

alternative: "two.sided" (two sided alternative, default), "less" (alternative: expectation of x is less than µ0), "greater" (alternative: expectation of x is larger than µ0)


Blood Sugar Level and Myocardial Infarction

A study was carried out to assess whether the expected blood sugar level (BSL) of patients with myocardial infarction, µ, is higher than the expected BSL of control individuals, namely µ0=100 mg/100ml.

H0: µ ≤ µ0   HA: µ > µ0


example data: infarct

(case/control study)

name label

nro identifier

grp group (string) 'control', 'infarct'

code coded group 0 'control', 1 'infarct'

sex sex 1 'male', 2 'female'

age age years

height body height cm

weight body weight kg

blood sugar blood sugar level mg/100ml

diabet diabetes 0 'no', 1 'yes'

chol cholesterol level mg/100ml

trigl triglyceride level mg/100ml

cig cigarettes number of


> setwd("C:/Users/Präsentation/MLS")

> mi = read.table("infarct data.csv",sep=";",

dec=",", header=TRUE)

> summary(mi$blood.sugar)

> summary(as.factor(mi$code))

> bloods_infarct=mi$blood.sugar[mi$code==1]

# Note: the comparison needs "==" (two equals signs), not "=".

# Extracts the blood sugar levels of only the cases.

> summary(bloods_infarct)

One sample t test


> t.test(bloods_infarct,mu=100,alternative="greater")

One Sample t-test

data: bloods_infarct

t = -0.7824, df = 39, p-value = 0.7807

alternative hypothesis: true mean is greater than 100

95 percent confidence interval:

90.14572 Inf

sample estimates:

mean of x

96.875

# Blood sugar level of infarct patients is not

# significantly higher than 100 mg/100ml.
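The test statistic and the one-sided p-value of a one sample t test can be reproduced by hand, which shows what t.test() computes. The sketch below uses simulated values, not the infarct data; the seed, the sample size and the distribution parameters are arbitrary assumptions.

```r
set.seed(123)                                # arbitrary seed
x   <- rnorm(40, mean = 97, sd = 25)         # simulated values, not the infarct data
mu0 <- 100                                   # reference value

res <- t.test(x, mu = mu0, alternative = "greater")

# By hand: t = (mean - mu0) / (s / sqrt(n)), compared with a t distribution, n - 1 df
n        <- length(x)
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_manual <- pt(t_manual, df = n - 1, lower.tail = FALSE)   # P(T > t) for "greater"

c(t = t_manual, p = p_manual)
```

For alternative="less" the lower tail would be used instead, and for "two.sided" twice the smaller tail probability.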

One sample t test


hands-on example

Load the dataset from the file infarct data.csv into the workspace. Perform a two-sided one-sample t-test for cholesterol level in infarct patients. The reference value for the population is 180 mg/100ml. What is the result of the test?


R

Statistical testing 2

tutorial 7


statistical test functions


name              function
t.test( )         Student's t-test
wilcox.test( )    Wilcoxon rank sum test and signed rank test
ks.test( )        Kolmogorov-Smirnov test
chisq.test( )     Pearson's chi-squared test for count data
mcnemar.test( )   McNemar test


The function t.test() performs different Student's t tests.

The arguments for the two sample t test are:

t.test(x, y, alternative, var.equal)

x, y: numeric vectors of values which shall be compared (assumed to follow a normal distribution)

alternative: "two.sided" (two sided alternative, default), "less" (alternative: expectation of x is less than expectation of y), "greater" (alternative: expectation of x is larger than expectation of y)

var.equal: Are the variances of x and y assumed to be equal? (TRUE or FALSE (default); TRUE gives the t test of the lecture, FALSE gives the Welch test)
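The effect of var.equal can be seen by running both variants on the same data. This is a sketch on simulated values; group sizes, means and standard deviations are arbitrary assumptions.

```r
set.seed(7)                                   # arbitrary seed
x <- rnorm(40, mean = 100, sd = 20)           # simulated group 1
y <- rnorm(40, mean = 95,  sd = 20)           # simulated group 2

# Classical two sample t test with pooled variance
pooled <- t.test(x, y, var.equal = TRUE, alternative = "greater")

# Default: Welch test, which does not assume equal variances
welch  <- t.test(x, y, alternative = "greater")

pooled$parameter   # degrees of freedom: n1 + n2 - 2 = 78
welch$parameter    # Welch degrees of freedom, at most 78 and usually not an integer
```

With var.equal=TRUE the degrees of freedom are n1 + n2 - 2, as in the example output for the infarct data (df = 78).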


Two sample t test


The function wilcox.test() performs the Wilcoxon rank sum test and the Wilcoxon signed rank test.

Parameters for the Wilcoxon rank sum test are: wilcox.test(x, y, alternative)

x, y: numeric vectors of values which shall be compared

(need not follow a normal distribution)

alternative: similar to t.test
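As an illustration, the sketch below applies the rank sum test to simulated, clearly non-normal data and recomputes the statistic W by hand. The exponential distributions and their rates are arbitrary assumptions; for data without ties, W is the number of pairs (x_i, y_j) with x_i > y_j.

```r
set.seed(21)                               # arbitrary seed
x <- rexp(30, rate = 1 / 100)              # skewed simulated values, group 1
y <- rexp(30, rate = 1 / 80)               # skewed simulated values, group 2

res <- wilcox.test(x, y, alternative = "greater")

# W counts how often a value of x exceeds a value of y (valid when there are no ties)
w_manual <- sum(outer(x, y, ">"))

c(W = unname(res$statistic), W_by_hand = w_manual, p = res$p.value)
```

Because only the ordering of the values enters the statistic, the test does not depend on a normality assumption.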


Wilcoxon rank sum test


Blood Sugar Level and Myocardial Infarction

A case-control study was carried out to assess whether the expected blood sugar level (BSL) of patients with myocardial infarction µ1 is higher than the expected BSL of control individuals µ2.

H0: µ1 ≤ µ2   HA: µ1 > µ2


example data: infarct

(case/control study)

name label

nro identifier

grp group (string) 'control', 'infarct'

code coded group 0 'control', 1 'infarct'

sex sex 1 'male', 2 'female'

age age years

height body height cm

weight body weight kg

blood sugar blood sugar level mg/100ml

diabet diabetes 0 'no', 1 'yes'

chol cholesterol level mg/100ml

trigl triglyceride level mg/100ml

cig cigarettes number of


> setwd("C:/Users/Präsentation/MLS")

> mi = read.table("infarct data.csv", sep=";",

dec=",", header=TRUE)

> summary(mi$blood.sugar)

> summary(as.factor(mi$code))

> bloods_infarct=mi$blood.sugar[mi$code==1]

> bloods_control=mi$blood.sugar[mi$code==0]

# Extracts the blood sugar levels of the cases

# and of the controls.

Two sample t test


> t.test(bloods_infarct, bloods_control,

var.equal=TRUE, alternative="greater")

Two Sample t-test

data: bloods_infarct and bloods_control

t = 0.0305, df = 78, p-value = 0.4879

alternative hypothesis: true difference in means is

greater than 0

95 percent confidence interval:

-13.39077 Inf

sample estimates:

mean of x mean of y

96.875 96.625

# Expected BSL of infarct patients is not

# significantly higher than expected BSL of controls.

Two sample t test


> wilcox.test(bloods_infarct, bloods_control,

alternative="greater")

Wilcoxon rank sum test with continuity correction

data: bloods_infarct and bloods_control

W = 867.5, p-value = 0.2576

alternative hypothesis: true location shift is greater

than 0

# The Wilcoxon test can be applied if the BSL does not

# follow a normal distribution. Then the t test is not

# valid.

Wilcoxon rank sum test


Pearson's chi-squared test

The function chisq.test() performs Pearson's chi-squared test for count data:

chisq.test(x)

x: n x m table (matrix) to be tested


example data: low birth weight

name text variable type

low low birth weight of the baby nominal: 0 'no >=2500g' 1 'yes <2500g'

age age of mother continuous: years

lwt mother's weight at last period continuous: pounds

race ethnicity nominal: 1 'white' 2 'black' 3 'other'

smoke smoking status nominal: 0 'no' 1 'yes'

ptl premature labor discrete: number of

ht hypertension nominal: 0 'no' 1 'yes'

ui presence of uterine irritability nominal: 0 'no' 1 'yes'

ftv physician visits in first trimester discrete: number of

bwt birthweight of the baby continuous: g


> library(MASS)

> tab_bw_smok=table(birthwt$low, birthwt$smoke)

> tab_bw_smok

0 1

0 86 44

1 29 30

> chisq.test(tab_bw_smok)

Pearson's Chi-squared test with Yates'

continuity correction

data: tab_bw_smok

X-squared = 4.2359, df = 1, p-value = 0.03958

# The probability of having a baby with low birth

# weight is significantly higher for smoking mothers.
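The statistic reported above can be recomputed from the counts in the table, which makes the role of the expected frequencies and of Yates' continuity correction visible. The sketch enters the four counts from the output by hand, so it does not need the MASS package; the object names are arbitrary.

```r
# Counts from the table above: rows = low (0/1), columns = smoke (0/1)
tab <- matrix(c(86, 29, 44, 30), nrow = 2,
              dimnames = list(low = c("0", "1"), smoke = c("0", "1")))

# Expected counts under independence: row sum * column sum / total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Chi-squared statistic with Yates' continuity correction (2 x 2 table)
x2_yates <- sum((abs(tab - expected) - 0.5)^2 / expected)
p_yates  <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

res <- chisq.test(tab)     # applies the same correction by default for 2 x 2 tables
c(by_hand = x2_yates, chisq.test = unname(res$statistic), p = p_yates)
```

Calling chisq.test(tab, correct=FALSE) would give the uncorrected Pearson statistic instead.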

Pearson's chi-squared test


hands-on example

Load the package with library(mlbench); the data frame is called PimaIndiansDiabetes2. Plot a histogram of the variable insulin. Compare the insulin values between cases and controls (variable diabetes) using an appropriate test.


R

Correlation and linear regression,

low level graphics

tutorial 8


Correlation

The function cor(x, y, method) computes the correlation between two paired random variables.

x, y: numeric vectors of values for which the correlation shall be calculated (must have the same length)

method: "pearson", "spearman" or "kendall"
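The relation between the methods can be checked on a small example: the Pearson coefficient is the covariance divided by the product of the standard deviations, and the Spearman coefficient is the Pearson coefficient of the ranks. The data below are simulated; the variable names and parameters are assumptions made for illustration.

```r
set.seed(11)                                          # arbitrary seed
height <- rnorm(30, mean = 172, sd = 9)               # simulated heights
weight <- -51 + 0.75 * height + rnorm(30, sd = 8)     # simulated weights

r_pearson  <- cor(height, weight, method = "pearson")
r_manual   <- cov(height, weight) / (sd(height) * sd(weight))

r_spearman <- cor(height, weight, method = "spearman")
r_rank     <- cor(rank(height), rank(weight))         # Pearson correlation of the ranks

c(pearson = r_pearson, by_hand = r_manual,
  spearman = r_spearman, via_ranks = r_rank)
```

Because it only uses ranks, the Spearman coefficient is less sensitive to outliers and to monotone transformations of the variables.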


Test of correlation

The function cor.test(x, y, alternative, method) tests for correlation between paired random variables.

x, y: numeric vectors of values for which the correlation shall be tested (must have the same length)

alternative:

"two.sided" (alternative: correlation coefficient ≠ 0,

default),

"less" (alternative: negative correlation),

"greater" (alternative: positive correlation)

method: "pearson", "spearman" or "kendall"


Linear regression (simple)

The function lm(formula, data) fits a linear model to data.

formula: y~x with y response variable and x explanatory variable (x and y must have the same length)

data: optional; the data frame containing x and y, if they are not fully specified in the formula (e.g. as mi$weight~mi$height)
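A short sketch of the data argument: with data=d, the formula can refer to the columns by name. The data frame d is simulated, and its column names and the true coefficients are assumptions made for illustration. The fitted coefficients are compared with the least squares formulas b = cov(x, y)/var(x) and a = mean(y) - b · mean(x).

```r
set.seed(3)                                            # arbitrary seed
d <- data.frame(height = runif(50, min = 155, max = 195))
d$weight <- -51 + 0.75 * d$height + rnorm(50, sd = 5)  # simulated linear relation

fit <- lm(weight ~ height, data = d)    # columns are looked up in the data frame d
coef(fit)                               # intercept a and slope b

# Least squares estimates by hand
b <- cov(d$height, d$weight) / var(d$height)
a <- mean(d$weight) - b * mean(d$height)
c(a = a, b = b)
```

summary(fit) additionally reports standard errors, t values and p-values for both coefficients.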


example data: infarct

(case/control study)

name label

nro identifier

grp group (string) 'control', 'infarct'

code coded group 0 'control', 1 'infarct'

sex sex 1 'male', 2 'female'

age age years

height body height cm

weight body weight kg

blood sugar blood sugar level mg/100ml

diabet diabetes 0 'no', 1 'yes'

chol cholesterol level mg/100ml

trigl triglyceride level mg/100ml

cig cigarettes number of


> setwd("C:/Users/Präsentation/MLS")

> mi = read.table("infarct data.csv", sep=";",

dec=",", header=TRUE)

> plot(x=mi$height, y=mi$weight)

> cor(mi$height, mi$weight, method="pearson")

[1] 0.6307697

> cor(mi$height, mi$weight, method="spearman")

[1] 0.6281738

Correlation


> cor.test(mi$height,mi$weight,method="pearson")

Pearson's product-moment correlation

data: mi$height and mi$weight

t = 7.1792, df = 78, p-value = 3.586e-10

alternative hypothesis: true correlation is not

equal to 0

95 percent confidence interval:

0.4771865 0.7469643

sample estimates:

cor

0.6307697

# Significant correlation between body height and

# body weight

Correlation


> lm(mi$weight~mi$height)

Call:

lm(formula = mi$weight ~ mi$height)

Coefficients:

(Intercept) mi$height

-51.2910 0.7477

# Y = a + b × x + E # with Y: body weight, x: body height,

# a=-51.29, b=0.75

Linear regression


graphics


R has extensive graphics facilities.

Graphics functions are divided into

high-level graphics functions

low-level graphics functions

The quality of the graphs produced by R is often cited as a major reason for using it in preference to other statistical software systems.


low-level graphics


Plots produced by high-level graphics facilities can be modified by low-level graphics commands.


low-level functions

name function

points(x, y) adds points (the option type= can be used)

lines(x, y) adds lines (the option type= can be used)

text(x, y, labels, ...) adds text given by labels at coordinates (x,y); a typical use is: plot(x, y, type="n"); text(x, y, names)

abline(a, b) draws a line of slope b and intercept a

abline(h=y) draws a horizontal line at ordinate y

abline(v=x) draws a vertical line at abscissa x

rect(x1, y1, x2, y2) draws a rectangle whose left, right, bottom, and top limits are x1, x2, y1, and y2, respectively

polygon(x, y) draws a polygon with coordinates given by x and y

title( ) adds a title and optionally a sub-title


> plot(x=mi$height, y=mi$weight)

> abline(a=-51.29, b=0.75, col="blue")

# Adds the regression line to the scatter plot.

> title("Regression of weight and height")

> text(x=185, y=65, labels="Kieler Woche",

col="green")
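Instead of typing rounded coefficients, the fitted model object can be passed to abline() directly. The sketch below uses simulated data (names and parameters are assumptions) and draws on a null PDF device so that it also runs without a display; in an interactive session the pdf(NULL) and dev.off() lines can be dropped.

```r
set.seed(5)                                          # arbitrary seed
height <- runif(40, min = 155, max = 195)            # simulated heights
weight <- -51 + 0.75 * height + rnorm(40, sd = 6)    # simulated weights
fit <- lm(weight ~ height)

pdf(NULL)                              # null device: nothing is written to disk
plot(height, weight)                   # high-level function creates the plot
abline(fit, col = "blue")              # low-level: regression line from the model object
abline(h = mean(weight), lty = 2)      # low-level: horizontal line at the mean weight
points(mean(height), mean(weight),     # low-level: mark the point of means
       pch = 19, col = "red")
title("Regression of weight on height")
invisible(dev.off())
```

The regression line passes through the point of means, which is why the red point lies on the blue line.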

low-level functions

[Figure: scatter plot of weight against height with the regression line in blue and the text "Kieler Woche" in green]

hands-on example

Load the dataset from the file correlation.csv into the workspace. Calculate the Pearson correlation coefficient between the variables x and y and test whether this coefficient is significantly different from 0. Generate a scatter plot.


R

Regression models

tutorial 9


Linear regression (simple)

The function lm(formula, data) fits a linear model to data.

formula: y~x with y response variable and x explanatory variable (x and y must have the same length)

data: optional; the data frame containing x and y, if they are not fully specified in the formula (e.g. as mi$weight~mi$height)


Linear regression (multiple)

The function lm(formula, data) fits a linear model to data.

formula: y~x1+x2+…+xk, with y the response variable and x1,…,xk the explanatory variables (all must have the same length)

data: optional; the data frame containing x1,…,xk and y. It can be omitted when the formula refers to the vectors directly.
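A short sketch of a multiple regression with two explanatory variables, and of how to pull the coefficient table out of the summary. The built-in dataset mtcars is an assumption used as stand-in data.

```r
# Stand-in data: 'mtcars' ships with R.
# Response: mpg; explanatory variables: wt (weight) and hp (horsepower).
model <- lm(mpg ~ wt + hp, data = mtcars)

# One row per term, with estimate, standard error, t value and p-value.
coef_table <- summary(model)$coefficients
print(coef_table)

# p-values only, e.g. to decide which variable to drop first.
print(coef_table[, "Pr(>|t|)"])
```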


Generalised linear model

The function glm(formula, family) fits a generalised linear model to data.

formula: y~x1+x2+…+xk, with y the response variable and x1,…,xk the explanatory variables (all must have the same length)

family: specifies the error distribution and the link function; choose family=binomial for logistic regression
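A minimal sketch of a logistic regression with glm(). The built-in dataset mtcars is an assumption used as stand-in data; its column am is coded 0/1, like the variable code in the infarct example that follows.

```r
# Stand-in data: 'mtcars' ships with R; am is 0 (automatic) or 1 (manual).
model <- glm(am ~ wt, data = mtcars, family = binomial)

print(model$family$link)              # link used by family=binomial
print(summary(model)$coefficients)    # estimates on the log-odds scale

# Fitted probabilities of am == 1 for each car.
p <- predict(model, type = "response")
print(round(head(p), 3))
```

The estimates are on the log-odds scale; exp(coef(model)) turns them into odds ratios.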


example data: infarct (case/control study)

name         label                values / unit
nro          identifier
grp          group (string)       'control', 'infarct'
code         coded group          0 'control', 1 'infarct'
sex          sex                  1 'male', 2 'female'
age          age                  years
height       body height          cm
weight       body weight          kg
blood sugar  blood sugar level    mg/100ml
diabet       diabetes             0 'no', 1 'yes'
chol         cholesterol level    mg/100ml
trigl        triglyceride level   mg/100ml
cig          cigarettes           number


Generalised linear model

> setwd("C:/Users/Präsentation/MLS")

> mi = read.table("infarct data.csv", sep=";", dec=",", header=TRUE)

> model_mi=glm(mi$code~mi$sex+mi$age+mi$height+mi$weight+
      mi$blood.sugar+mi$diabet+mi$chol+mi$trigl+mi$cig,
      family=binomial)

> summary(model_mi)


Generalised linear model

Coefficients:

                Estimate Std. Error z value Pr(>|z|)
(Intercept)    -34.60297   12.51757  -2.764 0.005704 **
mi$sex           0.23048    0.90885   0.254 0.799810
mi$age           0.10734    0.04161   2.580 0.009883 **
mi$height        0.14930    0.07838   1.905 0.056799 .
mi$weight       -0.11508    0.06304  -1.826 0.067916 .
mi$blood.sugar  -0.02246    0.01399  -1.605 0.108425
mi$diabet        2.05732    2.15947   0.953 0.340743
mi$chol          0.07294    0.02188   3.334 0.000855 ***
mi$trigl        -0.01936    0.01227  -1.578 0.114638
mi$cig           0.07686    0.04695   1.637 0.101603
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Generalised linear model

# Model after backward selection

> model_mi=glm(mi$code~mi$age+mi$chol, family=binomial)

> summary(model_mi)

Coefficients:

             Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.13858    3.78005  -4.269 1.96e-05 ***
mi$age        0.08404    0.03255   2.582 0.009827 **
mi$chol       0.05564    0.01569   3.546 0.000391 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
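The slides do not say how the backward selection was carried out. One automated option in base R is step(), sketched below. Note that step() drops terms by AIC, not by the p-values in the summary table, so it may keep a different set of variables than a manual p-value-based selection. The built-in dataset mtcars and the chosen variables are assumptions for illustration; step() is called the same way on a glm object.

```r
# Stand-in data: 'mtcars' ships with R. Start from a full model.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))                 # terms that survived the selection
print(summary(reduced)$coefficients)
```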


hands-on example

Load the package with library(mlbench) and the data frame with data(PimaIndiansDiabetes2). Perform a linear regression with the variable insulin as response and the variables glucose, pressure, mass and triceps as explanatory variables. Apply backward selection to obtain a reduced model.


R

Survival analysis

tutorial 10


Survival object

Before any analysis, a survival object has to be created with the function Surv(time, event).

time: time since the start of the study (survival time)

if the event occurred: time of the event

if no event occurred: last observation time

event:

1: event

0: no event

Important: event has to be numeric, not a character string.
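A minimal sketch of building a survival object. The six observations and the variable names are hypothetical, made up for illustration, and the "yes"/"no" coding is an assumed example of a non-numeric event column.

```r
library(survival)

# Hypothetical follow-up data for six patients (not from the tutorial).
time  <- c(5, 8, 12, 12, 20, 24)    # time since start of study
event <- c(1, 0, 1, 1, 0, 0)         # 1 = event, 0 = no event (censored)

s <- Surv(time = time, event = event)
print(s)                             # censored times carry a trailing "+"

# If the event column is stored as text, recode it to 0/1 first.
event_chr <- c("yes", "no", "yes", "yes", "no", "no")
s2 <- Surv(time = time, event = as.numeric(event_chr == "yes"))
```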


Survival curves

The function survfit(formula, data) estimates a survival curve. Afterwards, draw it with the plot command.

formula: Let y be a Surv object.

y~1 for a single Kaplan-Meier curve

y~x for several Kaplan-Meier curves stratified by x

data: optional; the data frame containing x and the variables used to build y. It can be omitted when the formula refers to the vectors directly.
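A short sketch of both formula variants, using the dataset aml that ships with the survival package as stand-in data (an assumption; it is not the tutorial's dataset). Its column x is the treatment group.

```r
library(survival)

# One curve for all patients: y ~ 1
km_all <- survfit(Surv(time, status) ~ 1, data = aml)

# One curve per group: y ~ x
km_grp <- survfit(Surv(time, status) ~ x, data = aml)

plot(km_grp, lty = 1:2, xlab = "time", ylab = "survival probability")
legend("topright", legend = names(km_grp$strata), lty = 1:2)

print(km_grp)    # group sizes, number of events, median survival
```

The legend labels are taken from the fitted object, so they follow the same order as the curves.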


Log-Rank Test

The function survdiff(formula, rho, data) tests whether two or more survival curves differ.

formula: y~x with

y: Surv object

x: group or stratifying variable

rho: a scalar parameter that controls the type of test; rho=0 (default) gives the Log-Rank Test (a modification of the test in the lecture)

data: optional; the data frame containing x and the variables used to build y. It can be omitted when the formula refers to the vectors directly.
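A minimal sketch of the Log-Rank Test, again using the dataset aml from the survival package as stand-in data (an assumption). The p-value is recomputed from the chi-square statistic stored in the result, with degrees of freedom equal to the number of groups minus one.

```r
library(survival)

# Compare the survival curves of the two groups in column x.
test <- survdiff(Surv(time, status) ~ x, data = aml, rho = 0)
print(test)

# p-value from the chi-square statistic.
df      <- length(test$n) - 1
p_value <- 1 - pchisq(test$chisq, df = df)
print(p_value)
```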


example data: survival

name     description
therapy  two chemotherapies: C1 and C2
time     if death occurred: time of death; if no death occurred: last observation time
event    1: death, 0: no death


Survival analysis

> install.packages("survival")

> library(survival)

> setwd("C:/Users/Präsentation/MLS")

> cancer=read.table("survival.csv", dec=",", sep=";", header=TRUE)

> head(cancer)

> surv_object=Surv(time=cancer$time, event=cancer$event)

> curve=survfit(surv_object~1)

> summary(curve)

> plot(curve)

# One Kaplan-Meier curve for both therapies combined (with confidence bands)


Survival analysis

> surv_object=Surv(time=cancer$time, event=cancer$event)

> curve=survfit(surv_object~cancer$therapy)

> summary(curve)

> plot(curve, lty=1:2)

# factor() is needed from R 4.0 on, where read.table no longer converts strings to factors

> legend("topright", levels(factor(cancer$therapy)), lty=1:2)

# Two Kaplan-Meier curves, one for each therapy group (without confidence bands)


Survival analysis

> survdiff(surv_object~cancer$therapy, rho=0)

Call:
survdiff(formula = surv_object ~ cancer$therapy, rho = 0)

                   N Obs Expected (O-E)^2/E (O-E)^2/V
cancer$therapy=C1 10   6     4.07     0.919      1.56
cancer$therapy=C2 10   6     7.93     0.471      1.56

Chisq= 1.6 on 1 degrees of freedom, p= 0.211

# No significant difference between the survival functions of the two therapies


hands-on example

Load the package with library(survival); the data frame is called retinopathy. In this data frame, futime is the time variable and status is the event variable. Plot a Kaplan-Meier curve.