Application of Statistical Methods in R
-- Intro in R & Descriptive Statistics --

Barbara Kollerits & Claudia Lamina
Medical University of Innsbruck, Division of Genetic Epidemiology

Molekulare Medizin, SS 2016

The Basic principles of R / Rcmdr


The basic principles of R as our „ideal“ statistics program

Why?

For free

Open source

Every researcher can implement new methods in R and provide them to others as „packages“

Big community and fast help on the internet

R is object-oriented


■ Download at: http://www.r-project.org/

■ Select the „base“ version for the correct operating system (e.g. „Windows“) and install it

■ Start R by clicking on rgui.exe. The R console window then opens.


■ The R console is used to enter your commands

■ The commands are evaluated immediately after „Enter“

■ All results are also shown there

Example: R used as a simple calculator
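A few console inputs can illustrate the calculator use. This is only a sketch: the numbers and the object name `total` are arbitrary examples and are not taken from the slide.

```r
# Basic arithmetic is evaluated as soon as the line is entered
2 + 3          # addition
7 * 6          # multiplication
10 / 4         # division
2^3            # exponentiation
sqrt(16)       # built-in functions work the same way

# A result can be stored under a name for later use
total <- (2 + 3) * 4
total
```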


Some principles of R:

1. R is object-oriented: Almost everything that is created during one session can be saved as an object. The character string „<-“ assigns a name to an object.

Example: Save the result of the sum as an object:

> Summe<-sum(1,2,3,4,5,6)

> Summe

[1] 21

> Summe*2

[1] 42

Showing all the objects you created in your session:

> objects()

In the example above, Summe is the name of the object and sum is an R function.


2. Functions in R:

Most functions in R have arguments (mandatory and/or optional).

If the function testFunction() has one mandatory and one optional argument, it can be called in any of these ways:

testFunction(argument1=value1)

testFunction(value1)

testFunction(argument1=value1, argument2=value2)

Example: The function mean() can be called with only one vector x:

mean(x)

or with the optional argument na.rm, which tells R how to deal with missing values:

mean(x, na.rm=TRUE)

E.g.

> x <- c(1,2,3,4,5,6)

> mean(x)

[1] 3.5
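The calling conventions for mandatory and optional arguments can be tried out with a small self-written function. This is a minimal sketch: `round_mean` and its argument `digits` are hypothetical names made up for illustration, only `mean()` and `na.rm` come from the slides.

```r
# Hypothetical function with one mandatory argument (x) and one
# optional argument (digits, which has a default value)
round_mean <- function(x, digits = 1) {
  round(mean(x), digits)
}

values <- c(1.25, 2.5, 3.75)

round_mean(values)               # mandatory argument by position
round_mean(x = values)           # mandatory argument by name
round_mean(values, digits = 0)   # optional argument overrides the default

# Optional argument of mean(): na.rm = TRUE drops missing values first
y <- c(1, 2, NA, 4)
mean(y)                 # NA, because one value is missing
mean(y, na.rm = TRUE)   # mean of the three observed values
```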

3. Getting Help in R:

> ?mean

> help(mean)

Both commands open a window that explains the mandatory and optional arguments of the function and gives background, references and examples of usage.

> help.start() # general help

help.search("linear models") or ??"linear models" can be used if the exact wording is not known.

Getting help on the internet: there is a big R community

http://www.rseek.org/ (R mailing list archive)

http://r-statistik.foren-city.de/ (a German R forum)


4. The R Workspace:

► All objects that are created during one session can (and should) be saved in an R workspace.

► Save at the end of one session:

> save.image("C:\\...\\Test.RData")

► Retrieve at the beginning of the next session:

> load("C:\\... \\Test.RData")

► Saved are: everything that has been saved as an object (datasets, results from e.g. linear models etc.)

► Not saved in the workspace: the commands used! -> Save your commands in an editor.
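The save and load cycle can be tried without touching your own folders. In this sketch a temporary file stands in for the path on the slide; the file location is an assumption, and the code is meant to be run at the top level of an R session.

```r
# A temporary file stands in for a path such as "C:\\...\\Test.RData"
workspace_file <- file.path(tempdir(), "Test.RData")

Summe <- sum(1, 2, 3, 4, 5, 6)   # an object created during the session

save.image(workspace_file)       # save all objects of the workspace

rm(Summe)                        # simulate a fresh session
exists("Summe")                  # the object is gone

load(workspace_file)             # retrieve the saved workspace
Summe                            # the object is available again
```

Only the objects come back after load(); the commands that created them do not, which is why the script should be saved separately.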


5. Packages in R:

► Functions in R are provided in packages. Basic packages are preinstalled:

base, datasets, graphics, methods, stats,...

► To add other packages:

a. „Pakete“ (packages) -> „Installiere Pakete“ (install packages) -> choose a server -> choose the desired package -> OK

b. Load the package, e.g.: > library(genetics)

c. See http://cran.r-project.org/ for all available packages

► Some useful packages:

genetics, haplo.stats, nlme, gam, multtest, pwr, Hmisc, survival, design

► Packages consist of functions and sometimes example data sets.
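The same steps can also be typed as commands instead of using the menu. This is a sketch under the assumption that the preinstalled packages stats and datasets are used as stand-ins, so that it runs without an internet connection; the install line is commented out for that reason.

```r
# Installing from CRAN needs an internet connection, so it is commented out:
# install.packages("genetics")

# Loading works the same for every installed package; 'stats' and
# 'datasets' are preinstalled, so they serve as stand-ins here
library(stats)
library(datasets)

# Functions provided by a loaded package
stats_functions <- ls("package:stats")
head(stats_functions)

# Example data sets shipped with a package
data(package = "datasets")
```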


We are using the package Rcmdr (R Commander) as a Graphical User Interface (GUI):

After calling library(Rcmdr), the R Commander window pops up with the following parts:


■ Tabs (menu bar): call functions, manage the data etc.

■ Script window: everything you do with the GUI is „translated“ into R commands, which can be managed (e.g. adding comments), repeated and saved.

■ Output window: all output, e.g. the results of a statistical model, is shown in this window and can be saved.

■ Messages: notes, warnings and errors are reported here


Read in and save your data


Read in your data

How to get your data into R:

■ Enter new data in a spreadsheet

■ Load a saved R dataset

■ Import data from other data sources and formats (e.g. txt-file, Excel, SPSS etc.)
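The three ways listed above have command-line counterparts. The following sketch writes a small text file first so that it is self-contained; the file names and their content are made up, and importing Excel or SPSS files needs additional packages that are not shown here.

```r
# Create a small tab-separated text file so the example is self-contained
# (hypothetical content, not one of the course data sets)
txt_file <- file.path(tempdir(), "example_data.txt")
writeLines(c("ID\tage\tsex", "1\t52\tm", "2\t23\tm", "3\t79\tf"), txt_file)

# Import from a text file: the first line holds the variable names
mydata <- read.table(txt_file, header = TRUE, sep = "\t",
                     stringsAsFactors = TRUE)
mydata

# Save one dataset in R format and load it again later
rdata_file <- file.path(tempdir(), "mydata.RData")
save(mydata, file = rdata_file)
load(rdata_file)

# Enter or change data in a spreadsheet-like editor (interactive sessions only)
# mydata <- edit(mydata)
```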


Edit your dataset:

■ Simply look at the data

■ Edit the data by hand, e.g. change or delete single values

■ Re-structure the dataset: edit or re-categorize variables


Example: Read in and edit a dataset

Example: Typing in new data in R/Rcmdr and editing this dataset:

■ Write the data from 7 patients in the dataset „patientdata“:

ID  age  sex
1   52   m
2   23   m
3   79   f
4   64   f
5   55   m
6   50   f
7   32   f

■ There is an error in the data: Patient 4 is not 64, but 46 years old -> change the value

■ Read in data from new patients from the file „new_patients.txt“

■ Combine the old and new dataset into one dataset (dataset: „alldata“)
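The same example could be done with commands roughly as follows. The content of „new_patients.txt“ is not part of this transcript, so the two new patients below are made up and written to a temporary file; everything else follows the patient table.

```r
# The 7 patients from the table
patientdata <- data.frame(
  ID  = 1:7,
  age = c(52, 23, 79, 64, 55, 50, 32),
  sex = factor(c("m", "m", "f", "f", "m", "f", "f"))
)

# Correct the error: patient 4 is 46, not 64 years old
patientdata$age[patientdata$ID == 4] <- 46

# Stand-in for "new_patients.txt": the real file is not available here,
# so two made-up patients are written to a temporary file
new_file <- file.path(tempdir(), "new_patients.txt")
writeLines(c("ID age sex", "8 41 f", "9 67 m"), new_file)
newdata <- read.table(new_file, header = TRUE, stringsAsFactors = TRUE)

# Combine old and new patients; both need the same variable names
alldata <- rbind(patientdata, newdata)
alldata
```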


■ One analysis should be restricted to women: Create the subset patientdataF
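As a command, the subset can be created with subset() or with bracket indexing. The sketch rebuilds the (already corrected) patient table so that it runs on its own; `patientdataF2` is just a second name chosen for the alternative.

```r
patientdata <- data.frame(
  ID  = 1:7,
  age = c(52, 23, 79, 46, 55, 50, 32),
  sex = factor(c("m", "m", "f", "f", "m", "f", "f"))
)

# Keep only the rows where sex is "f"
patientdataF <- subset(patientdata, sex == "f")
patientdataF

# Equivalent with bracket indexing
patientdataF2 <- patientdata[patientdata$sex == "f", ]
```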


■ Create a new age category variable „agecat“: 20-30 / 30-50 / 50-80

► This new categorical variable is stored as a factor-variable

► Here, the values "20-30", "30-50" and "50-80" are the levels of the factor

► Internally, factors are stored as numbers with a table holding the information of the levels.

► Some methods explicitly require categorical variables to be stored as factor variables, e.g. contingency tables or if you want to separate one analysis by a grouping variable
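How a factor looks from the inside can be seen with a short sketch. It uses base R's cut(), which is not necessarily the command that Rcmdr generates for recoding; the ages are those of the patient example.

```r
age <- c(52, 23, 79, 46, 55, 50, 32)

# cut() sorts each value into an interval and returns a factor.
# With the default right = TRUE the intervals are (20,30], (30,50], (50,80]
agecat <- cut(age, breaks = c(20, 30, 50, 80),
              labels = c("20-30", "30-50", "50-80"))

agecat               # values shown with their level labels
levels(agecat)       # the three levels of the factor
as.integer(agecat)   # internal integer codes behind the labels
table(agecat)        # factors can be tabulated directly
```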


■ To which category do the „borders“ of the categories belong? Depending on the definition, either

► 30 is categorized into category „0“, or

► 30 is categorized into category „1“

■ Other possibility for categorization of numeric variables (fixed, not user-defined)
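The border question comes down to whether the comparison includes the border value. The sketch below assumes a cut-off at 30 and a 0/1 coding as in the slide annotations; the exact commands shown on the original slide are not known.

```r
age <- c(29, 30, 31)

# Border value 30 goes to category 0 when the comparison is strict
cat_strict <- as.integer(age > 30)
cat_strict      # 0 0 1

# Border value 30 goes to category 1 when the comparison includes it
cat_incl <- as.integer(age >= 30)
cat_incl        # 0 1 1

# cut() has the same choice: 'right' decides which end of an interval is closed
cut(age, breaks = c(20, 30, 50), right = TRUE)    # 30 falls in (20,30]
cut(age, breaks = c(20, 30, 50), right = FALSE)   # 30 falls in [30,50)
```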


Save your data

After a session (and in between): Save your work!

■ Save the output: everything that has been created in the output window is saved in a txt-file

■ Save the script file („Skriptdatei“): all commands that have been created in this session are saved. The file can be re-opened in any text editor and comments can be added -> the session can be repeated!

■ Save the workspace („Datendatei speichern“): this saves not only the active dataset, but all datasets and all other objects (e.g. statistical models etc.) that have been created in this session -> when you open it next time, you can continue directly without having to rerun anything!


Descriptive Statistics in R


Overview of the complete dataset

■ Import the dataset „Germansample.txt“ as „sample1“

■ This dataset is a random subsample of a German study that was originally carried out to estimate risk factors for cardiovascular diseases. The subsample contains parameters of obesity, behavior (smoking, alcohol consumption) and lipids.

■ Get a first overview of the data: How many variables, how many individuals?

■ Summary of the complete dataset:

► Statistik (statistics) -> Deskriptive Statistik (descriptive statistics) -> Aktive Datenmatrix (active data matrix)

■ You get a summary for each variable: location measures for numeric variables and frequency tables for factor variables -> possibility to check whether each variable is treated correctly (here: look at the variable smoking)

■ Missing values are denoted as „NA“
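On the command line, the overview corresponds roughly to the calls below. „Germansample.txt“ is not available here, so the sketch builds a made-up stand-in for sample1; the variable names and all values are hypothetical.

```r
# Made-up stand-in for "sample1"; variable names and values are hypothetical
set.seed(1)
sample1 <- data.frame(
  sex     = factor(sample(c("m", "f"), 50, replace = TRUE)),
  WHR     = round(rnorm(50, mean = 0.9, sd = 0.08), 2),
  smoking = sample(1:3, 50, replace = TRUE)   # coded 1,2,3: treated as numeric
)
sample1$WHR[c(5, 17)] <- NA   # two missing values

dim(sample1)       # number of individuals (rows) and variables (columns)
names(sample1)     # variable names
str(sample1)       # storage type of each variable
summary(sample1)   # location measures, frequency tables, number of NAs
```

In the summary, smoking appears with a mean and quartiles, which is the hint that it is still treated as numeric.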


Questions we first want to answer about a quantitative variable:

1. Is there a difference in WHR between e.g. men and women?

► Descriptive statistics of a quantitative variable

2. Is a variable normally distributed? Checking the normality assumption

► Histograms and Q-Q plots

► Transformation of a variable

3. How are two quantitative variables related?

► Scatterplots and correlations


Descriptive Statistics for a quantitative variable

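■ For the numeric variable waist-hip ratio (WHR):

► Statistik (statistics) -> Deskriptive Statistik (descriptive statistics) -> Zusammenfassung numerischer Variablen (summary of numeric variables)

► If „Zusammenfassung nach Gruppen“ (summary by groups) is chosen, the summary is given separately for each group, e.g. by sex or by agecat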
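A grouped summary can be sketched with base R functions. These are not necessarily the commands that Rcmdr writes into the script window, and the WHR values below are simulated, not taken from the German sample.

```r
set.seed(2)
sample1 <- data.frame(
  sex = factor(rep(c("m", "f"), each = 25)),
  WHR = c(rnorm(25, mean = 0.95, sd = 0.06),   # hypothetical values for men
          rnorm(25, mean = 0.82, sd = 0.06))   # hypothetical values for women
)

# Summary of WHR for all individuals
summary(sample1$WHR)
sd(sample1$WHR, na.rm = TRUE)

# The same summary separately for each group
tapply(sample1$WHR, sample1$sex, summary)

# Mean and standard deviation by group in one table
whr_by_sex <- aggregate(WHR ~ sex, data = sample1,
                        FUN = function(v) c(mean = mean(v), sd = sd(v)))
whr_by_sex
```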


■ Graphical presentations: Boxplots for WHR, separated for gender and agecat

► Grafiken (graphs) -> Boxplot (for different groups, if wanted)

[Figure: boxplots of WHR, separated by sex and by agecat]
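A grouped boxplot can be drawn with the base function boxplot(). The data below are simulated stand-ins; for agecat the formula would be WHR ~ agecat instead.

```r
set.seed(3)
sample1 <- data.frame(
  sex = factor(rep(c("m", "f"), each = 25)),
  WHR = c(rnorm(25, 0.95, 0.06), rnorm(25, 0.82, 0.06))   # hypothetical values
)

# One box per group; the formula reads "WHR separated by sex"
bp <- boxplot(WHR ~ sex, data = sample1, xlab = "sex", ylab = "WHR")

# The returned object holds the five numbers drawn for each box
# (lower whisker, lower hinge, median, upper hinge, upper whisker)
bp$stats
```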


Checking the normality assumption

■ Graphical presentations: Histogram for WHR

► Grafiken (graphs) -> Histogram

► Can also be used to check the distribution (normality assumption) of the variable

[Figure: two histograms of WHR (x-axis: WHR, y-axis: frequency), one with the number of breaks chosen automatically and one with number of breaks = 50]
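The effect of the number of breaks can be reproduced with hist(). The WHR values are simulated here, so the pictures will not match the slide.

```r
set.seed(4)
WHR <- rnorm(200, mean = 0.9, sd = 0.08)   # hypothetical values

# Number of breaks chosen automatically
h_auto <- hist(WHR, xlab = "WHR", ylab = "frequency", main = "automatic breaks")

# About 50 breaks; R treats the number as a suggestion and picks "pretty" cut points
h_50 <- hist(WHR, breaks = 50, xlab = "WHR", ylab = "frequency",
             main = "breaks = 50")

length(h_auto$counts)   # number of bars with automatic breaks
length(h_50$counts)     # number of bars with breaks = 50
```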


■ Graphical presentations: Q-Q plot. Compare the distribution of a variable (WHR) with the normal distribution

► Grafiken (graphs) -> Quantile comparison plot -> choose the distribution with which you want to compare
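A comparison with the normal distribution can be sketched with the base functions qqnorm() and qqline(), which may differ from the plot function Rcmdr calls. The variable is simulated as right-skewed on purpose, to show what a transformation can do.

```r
set.seed(5)
WHR <- rlnorm(200, meanlog = log(0.9), sdlog = 0.3)   # hypothetical, right-skewed

# Q-Q plot against the normal distribution; points close to the
# reference line suggest an approximately normal variable
qq <- qqnorm(WHR, main = "WHR")
qqline(WHR)

# A log transformation often makes right-skewed variables more symmetric
qqnorm(log(WHR), main = "log(WHR)")
qqline(log(WHR))
```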


Relation of two quantitative variables

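■ Relate two quantitative variables to each other: scatterplot and correlation

► Scatterplot: Grafiken (graphs) -> Streudiagramm (scatterplot); groups can be marked

► Correlation: Statistik (statistics) -> Deskriptive Statistik (descriptive statistics) -> Korrelationsmatrix (correlation matrix); choose Pearson or Spearman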
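Scatterplot and both correlation coefficients can be sketched as follows. The second variable BMI and all values are assumptions made for illustration; the transcript does not say which variable was plotted against WHR.

```r
set.seed(6)
WHR <- rnorm(100, mean = 0.9, sd = 0.08)
BMI <- 10 + 18 * WHR + rnorm(100, sd = 2)   # hypothetical second variable
sex <- factor(sample(c("m", "f"), 100, replace = TRUE))

# Scatterplot; groups are marked by colour
plot(WHR, BMI, col = ifelse(sex == "m", "blue", "red"))
legend("topleft", legend = c("m", "f"), col = c("blue", "red"), pch = 1)

# Pearson measures linear association, Spearman works on ranks
r_pearson  <- cor(WHR, BMI, method = "pearson")
r_spearman <- cor(WHR, BMI, method = "spearman")

# Correlation matrix for several variables; use = "complete.obs" drops rows with NA
cor_mat <- cor(cbind(WHR, BMI), method = "spearman", use = "complete.obs")
cor_mat
```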


A question we want to answer about a qualitative variable:

Does the smoking behavior differ between men and women?

► Define a variable as a factor variable

► Descriptive statistics of a qualitative variable: cross tables & barplots


Descriptive Statistics for a qualitative variable

■ So far, the variable smoking is not recognized as a factor variable.

► Smoking is coded 1,2,3 -> R „thinks“ this variable is numeric -> change to factor:

Datenmanagement (data management) -> Variablen bearbeiten (edit variables) -> Konvertiere numerische Variablen in Faktoren (convert numeric variables to factors)

► Use labels („Etiketten“) for the factor levels
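As a command, the conversion can be sketched with factor(). The label texts below are made up, since the transcript does not say what the codes 1, 2 and 3 stand for.

```r
smoking <- c(1, 3, 2, 1, 1, 2, 3, 1)   # numeric codes as in the raw data
is.numeric(smoking)                    # TRUE: R treats the codes as numbers

# Convert to a factor and attach labels to the levels.
# The label texts are hypothetical
smoking_f <- factor(smoking, levels = c(1, 2, 3),
                    labels = c("never", "former", "current"))

smoking_f
summary(smoking_f)   # now a frequency table instead of mean and quartiles
```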


■ For the categorical variables smoking, sex and agecat: output the absolute and relative frequencies:

► Statistik (statistics) -> Deskriptive Statistik (descriptive statistics) -> Häufigkeitsverteilung (frequency distribution)

► Mark smoking, sex and agecat and click OK

E.g.: absolute & relative frequencies for smoking
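Absolute and relative frequencies can be sketched with table() and prop.table(). The eight values and the level names are hypothetical.

```r
smoking <- factor(c("never", "current", "former", "never",
                    "never", "former", "current", "never"),
                  levels = c("never", "former", "current"))   # hypothetical data

abs_freq <- table(smoking)           # absolute frequencies
rel_freq <- prop.table(abs_freq)     # relative frequencies (sum to 1)

abs_freq
round(100 * rel_freq, 1)             # in percent
```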


■ Barplot to illustrate the frequencies of smoking:

► Grafiken (graphs) -> Balkendiagramm (bar chart) -> mark the variable smoking
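A barplot of the frequencies can be sketched with the base function barplot(); the data are again hypothetical.

```r
smoking <- factor(c("never", "current", "former", "never",
                    "never", "former", "current", "never"),
                  levels = c("never", "former", "current"))   # hypothetical data

# barplot() needs the frequencies, so the factor is tabulated first
bar_mid <- barplot(table(smoking), xlab = "smoking", ylab = "frequency")
bar_mid   # x positions of the bar midpoints, useful for adding labels
```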


■ Two-dimensional contingency tables: sex x smoking

► Statistik (statistics) -> Kontingenztabelle (contingency table) -> Kreuztabelle (cross table)

► The table shows the absolute numbers. If you want to compare the smoking behavior between men and women, look at the row percentages.
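Absolute numbers and row percentages can be sketched with table() and prop.table(). The 20 individuals below are made up; sex is placed in the rows so that the row percentages compare men and women.

```r
# Hypothetical data: 10 men and 10 women
sex     <- factor(rep(c("m", "f"), each = 10))
smoking <- factor(c(rep("never", 3), rep("former", 3), rep("current", 4),
                    rep("never", 6), rep("former", 2), rep("current", 2)),
                  levels = c("never", "former", "current"))

tab <- table(sex, smoking)   # absolute numbers, sex in the rows
tab

# Row percent: each row (sex) sums to 100
row_pct <- 100 * prop.table(tab, margin = 1)
round(row_pct, 1)
```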


■ Two-dimensional contingency tables: sex x smoking, for different ages

► Repeat the contingency table for a specific subset (e.g. age<40)

► or: Statistik (statistics) -> Kontingenztabelle (contingency table) -> Mehrdimensionale Kreuztabelle (multi-dimensional cross table) with agecat as control variable („Kontrollvariable“); this gives one sex x smoking contingency table for each agecat category

Smoking behavior does not differ that much if you only look at people < 40 years
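Both options can be sketched with table(). The data are generated at random, so the tables only show the layout and say nothing about real smoking behavior.

```r
# Hypothetical data, generated at random
set.seed(7)
n <- 120
sex     <- factor(sample(c("m", "f"), n, replace = TRUE))
smoking <- factor(sample(c("never", "former", "current"), n, replace = TRUE),
                  levels = c("never", "former", "current"))
age     <- sample(21:80, n, replace = TRUE)
agecat  <- cut(age, breaks = c(20, 30, 50, 80),
               labels = c("20-30", "30-50", "50-80"))

# Option 1: the table for a specific subset
tab_young <- table(sex[age < 40], smoking[age < 40])
tab_young

# Option 2: one sex x smoking table per level of the control variable
tab3 <- table(sex, smoking, agecat)
tab3
ftable(tab3, row.vars = c("agecat", "sex"))   # compact layout
```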
