Application of Statistical Methods in R -- Intro in R & Descriptive Statistics --
Barbara Kollerits & Claudia Lamina, Medical University of Innsbruck, Division of Genetic Epidemiology
Molekulare Medizin, SS 2016
The Basic principles of R / Rcmdr
The basic principles of R
as our „ideal“ statistics program
Why?
For free
Open source
Every researcher can implement new methods in R
and provide them to others as „packages“
Big community and fast help on the internet
R is object-oriented
■ Download at: http://www.r-project.org/
■ Select the „base“ version for your operating system (e.g. „Windows“) and install it
■ Start R by clicking on rgui.exe. Then you see:
■ The R console is used to enter your commands
■ The commands are evaluated immediately after „Enter“
■ All results are also shown there
Example: R used as a simple calculator
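A few lines typed at the console illustrate the calculator use. The numbers are arbitrary examples and are not taken from the original screenshot.

```r
# Basic arithmetic typed directly at the R prompt
2 + 3 * 4      # 14: multiplication is evaluated before addition
(2 + 3) * 4    # 20: parentheses change the order of evaluation
10 / 4         # 2.5
2^10           # 1024
sqrt(16)       # 4: built-in mathematical function
```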
Some principles of R:
1. R is object-oriented: Almost everything that is created during a session can be saved as an object. The assignment operator „<-“ assigns a name to an object.
Example: Save the result of the sum as an object:
> Summe<-sum(1,2,3,4,5,6)
> Summe
[1] 21
> Summe*2
[1] 42
Showing all the objects you created in your session:
> objects()
In this example, Summe is the name of the object and sum() is an R function.
2. Functions in R:
Most functions in R have arguments (mandatory and/or optional).
If the function testFunction() has one mandatory and one optional argument,
it can be called in any of these ways:
testFunction(argument1=value1) or testFunction(value1)
testFunction(argument1=value1, argument2=value2)
Example: The function mean() can be called with only one vector x:
mean(x)
or with the optional argument na.rm, which tells R how to deal with missing values:
mean(x, na.rm=TRUE)
E.g.
> x <- c(1,2,3,4,5,6)
> mean(x)
[1] 3.5
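The effect of the optional argument becomes visible as soon as a vector contains a missing value. This is a minimal sketch with made-up numbers:

```r
x <- c(1, 2, 3, NA, 5, 6)      # NA marks a missing value

mean(x)                        # NA: missing values are not removed by default
mean(x, na.rm = TRUE)          # 3.4: the mean of the five observed values
mean(na.rm = TRUE, x = x)      # named arguments can be given in any order
```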
3. Getting Help in R:
> ?mean
> help(mean)
Both commands open a window explaining the mandatory and optional arguments of the function and giving background, references and examples of usage.
> help.start() # general help
> help.search("linear models") # or: ??"linear models"
help.search() can be used if the exact wording is not known.
Getting help on the internet: there is a big R community
http://www.rseek.org/ (R mailing list archive)
http://r-statistik.foren-city.de/ (a German R forum)
4. The R Workspace:
► All objects that are created during a session can (and should) be saved
in an R workspace.
► Save at the end of a session:
> save.image("C:\\...\\Test.RData")
► Retrieve at the beginning of the next session:
> load("C:\\...\\Test.RData")
► Saved: everything that has been stored as an object (datasets,
results from e.g. linear models etc.)
► Not saved in the workspace: the commands used!
→ Save your commands separately in an editor.
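The save and load cycle can be tried out without touching your own files. In this sketch the workspace is written to R's temporary directory; the file location is an assumption made only for the demonstration.

```r
# Hypothetical file location inside R's temporary directory
ws_file <- file.path(tempdir(), "Test.RData")

Summe <- sum(1, 2, 3, 4, 5, 6)   # an object created during the session
save.image(ws_file)              # end of session: store all objects

rm(Summe)                        # simulate a fresh session
exists("Summe")                  # FALSE: the object is gone

load(ws_file)                    # next session: restore the workspace
Summe                            # the object is available again
```

Note that only the objects come back. The commands that created them have to be kept in a separate script file.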
5. Packages in R:
► Functions in R are provided in packages. Basic packages are preinstalled:
base, datasets, graphics, methods, stats,...
► To add other packages:
a. Menu „Pakete“ (Packages) → „Installiere Pakete“ (Install packages) → choose a server → choose the desired package → OK
b. Load the package, e.g.: > library(genetics)
c. See http://cran.r-project.org/ for all available packages
► Some useful packages:
genetics, haplo.stats, nlme, gam, multtest, pwr, Hmisc, survival, design
► Packages consist of functions and sometimes example data sets.
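The same steps can also be typed as commands. In the sketch below the installation lines are commented out because they need internet access; the executable part only uses packages that ship with R.

```r
# Installing from CRAN is needed only once per computer (requires internet access):
# install.packages("genetics")
# Loading is needed in every new session:
# library(genetics)

# The basic packages are already installed, which can be checked like this:
base_pkgs <- c("base", "datasets", "graphics", "methods", "stats")
installed <- rownames(installed.packages())
base_pkgs %in% installed

# Packages can also contain example data sets, e.g. "women" in datasets:
library(datasets)
head(women)
```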
We are using the package Rcmdr (R Commander) as a graphical user interface (GUI):
After calling library(Rcmdr), the following window pops up:
■ Tabs: call functions, manage the data etc.
■ Script window: everything you do with the GUI is „translated“ into R commands, which can be edited (e.g. by adding comments), repeated and saved.
■ Output window: all output, e.g. the results of a statistical model, is shown in this window and can be saved.
■ Notes, warnings and errors are reported in a separate area of the window.
Read in and save your data
Read in your data
How to get your data into R:
■ Enter new data in a spreadsheet
■ Load a saved R dataset
■ Import data from other data sources and formats (e.g. txt-file, Excel, SPSS etc.)
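Outside the menus, two of these routes can be sketched with base R commands. The file names and the three example rows below are placeholders created in a temporary directory, not files from the course.

```r
# Create a small tab-separated text file to import (placeholder content)
txt_file <- file.path(tempdir(), "example_data.txt")
writeLines(c("ID\tage\tsex", "1\t52\tm", "2\t23\tm", "3\t79\tf"), txt_file)

# Import data from a text file: first line holds the variable names
imported <- read.table(txt_file, header = TRUE, sep = "\t")
imported
str(imported)

# Save the dataset in R format and load it again
rdata_file <- file.path(tempdir(), "example_data.RData")
save(imported, file = rdata_file)
rm(imported)
load(rdata_file)
```

Excel or SPSS files need additional packages, which is why the menu-based import in Rcmdr is convenient for them.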
Edit your dataset:
■ Simply look at the data:
■ Edit the data by hand:
e.g.: Change or delete single values
■ Re-structure the dataset:
Edit or re-categorize variables
Example: Read in and edit a dataset
Example: Typing in new data in R/Rcmdr & edit this dataset:
■ Write the data from 7 patients into the dataset „patientdata“:
ID  age  sex
1   52   m
2   23   m
3   79   f
4   64   f
5   55   m
6   50   f
7   32   f
■ There is an error in the data:
Patient 4 is not 64, but 46 years old → change the value
■ Read in data from new patients from the file „new_patients.txt“
■ Combine the old and the new dataset into one dataset („alldata“)
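The same example can be written as R commands. The seven patients are taken from the table above; the content of new_patients.txt is not shown in the slides, so the two additional patients below are invented stand-ins written to a temporary file.

```r
# Type in the data from the 7 patients
patientdata <- data.frame(
  ID  = 1:7,
  age = c(52, 23, 79, 64, 55, 50, 32),
  sex = c("m", "m", "f", "f", "m", "f", "f")
)

# Correct the error: patient 4 is 46, not 64 years old
patientdata$age[patientdata$ID == 4] <- 46

# Stand-in for new_patients.txt (invented content, for illustration only)
new_file <- file.path(tempdir(), "new_patients.txt")
writeLines(c("ID age sex", "8 41 m", "9 67 f"), new_file)
newdata <- read.table(new_file, header = TRUE)

# Combine old and new patients; both datasets need the same variable names
alldata <- rbind(patientdata, newdata)
alldata
```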
■ One analysis should be restricted to women: Create the subset patientdataF
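As a command, the subset can be created with subset(). The sketch repeats the seven patients (with the corrected age of patient 4) so that it runs on its own.

```r
patientdata <- data.frame(
  ID  = 1:7,
  age = c(52, 23, 79, 46, 55, 50, 32),
  sex = c("m", "m", "f", "f", "m", "f", "f")
)

# Keep only the rows for which the condition is TRUE
patientdataF <- subset(patientdata, sex == "f")
patientdataF
```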
■ Create a new age category variable „agecat“: 20-30 / 30-50 / 50-80
► This new categorical variable is stored as a factor-variable
► Here, the values "20-30", "30-50" and "50-80" are the levels of the factor
► Internally, factors are stored as numbers with a table holding the information of the levels.
► Some methods explicitly require categorical variables to be stored as factor variables, e.g. contingency tables or if you want to separate one analysis by a grouping variable
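One way to build such a factor in plain R is cut(). This is a sketch using the seven ages from the patient example; whether Rcmdr generates exactly this command is not claimed here.

```r
age <- c(52, 23, 79, 46, 55, 50, 32)

# Cut the numeric variable at 20, 30, 50 and 80 and label the three intervals
agecat <- cut(age, breaks = c(20, 30, 50, 80),
              labels = c("20-30", "30-50", "50-80"))

is.factor(agecat)    # TRUE
levels(agecat)       # the table of levels
as.integer(agecat)   # the internal numeric codes: 3 1 3 2 3 2 2
table(agecat)        # a frequency table works directly on the factor
```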
■ To which category does a value on the „border“ belong?
Depending on how the categories are defined, 30 is categorized into category „0“ or into category „1“.
■ Another possibility for categorizing numeric variables (fixed, not user-defined):
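In cut(), the argument right decides on which side a border value falls. The sketch below uses a single border at 30 and the category names „0“ and „1“ as an assumed illustration of the two cases.

```r
age <- c(20, 30, 31, 50)

# Intervals closed on the right (default): (0,30] and (30,80]
right_closed <- cut(age, breaks = c(0, 30, 80), labels = c("0", "1"))
right_closed                     # 30 falls into category "0"

# Intervals closed on the left: [0,30) and [30,80)
left_closed <- cut(age, breaks = c(0, 30, 80), labels = c("0", "1"),
                   right = FALSE)
left_closed                      # 30 falls into category "1"

# Fixed categories: a single number asks for that many equal-width intervals
cut(age, breaks = 3)
```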
Save your data
After a session (and in between): Save your work!
■ Save the output: everything that has been produced in the output window is saved in a txt file.
■ Save the script file („Skriptdatei“): all commands that have been generated in this session are saved. The file can be re-opened in any text editor, and comments can be added.
→ The session can be repeated!
■ Save the workspace („Datendatei speichern“, i.e. save data file): this saves not only the active dataset, but all datasets and all other objects (e.g. statistical models etc.) that have been created in this session!
→ When you open it next time, you can continue directly without having to rerun anything!
Descriptive Statistics in R
Overview of the complete dataset
■ Import the dataset „Germansample.txt“ as „sample1“.
■ This dataset is a random subsample of a German study that was originally carried out to estimate risk factors for cardiovascular diseases. The subsample contains parameters of obesity, behavior (smoking, alcohol consumption) and lipids.
■ Get a first overview of the data: How many variables, how many individuals?
■ Use „sample1“
■ Summary of the complete dataset:
Statistik (Statistics) → Deskriptive Statistik (Descriptive statistics) → Aktive Datenmatrix (Active dataset)
■ You get a summary for each variable: location measures for numeric variables and frequency tables
for factor variables → this makes it possible to check whether each variable is treated correctly (here: look at the variable smoking)
■ Missing values are denoted as „NA“
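The overview can also be requested with a few base R commands. Germansample.txt is not part of this transcript, so the sketch works on a simulated stand-in; the variable names sex, age, WHR and smoking are assumptions, and the numbers are random and carry no meaning.

```r
# With the real file, something like this would be used (file format assumed):
# sample1 <- read.table("Germansample.txt", header = TRUE)

# Simulated stand-in data, for illustration only
set.seed(1)
n <- 200
sample1 <- data.frame(
  sex     = factor(sample(c("f", "m"), n, replace = TRUE)),
  age     = sample(25:74, n, replace = TRUE),
  WHR     = rnorm(n, mean = 0.9, sd = 0.08),
  smoking = sample(1:3, n, replace = TRUE)   # coded 1, 2, 3
)
sample1$WHR[5] <- NA   # one artificial missing value

dim(sample1)       # number of individuals (rows) and variables (columns)
names(sample1)     # variable names
str(sample1)       # how each variable is stored
summary(sample1)   # smoking gets location measures, so R treats it as numeric
```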
First questions we want to answer about a quantitative variable:
1. Is there a difference in WHR between e.g. men and women?
→ Descriptive statistics of a quantitative variable
2. Is a variable normally distributed? Checking the normality assumption
→ Histograms and Q-Q plots
→ Transformation of a variable
3. How are two quantitative variables related?
→ Scatterplots and correlations
Descriptive Statistics for a quantitative variable
■ For the numeric variable waist-hip ratio (WHR):
► Statistik (Statistics) → Deskriptive Statistik (Descriptive statistics) → Zusammenfassung numerischer Variablen (Summary of numeric variables)
► If „Zusammenfassung nach Gruppen“ (summary by groups) is chosen, e.g. by sex:
► by agecat:
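A base R sketch of a summary by groups is given below. It is not the command that Rcmdr itself writes into the script window, and the data are simulated stand-ins with assumed variable names.

```r
# Simulated stand-in data, for illustration only
set.seed(1)
n <- 200
sample1 <- data.frame(
  sex = factor(sample(c("f", "m"), n, replace = TRUE)),
  WHR = rnorm(n, mean = 0.9, sd = 0.08)
)

summary(sample1$WHR)   # minimum, quartiles, median, mean, maximum

# The same statistic separately for each group
whr_mean <- tapply(sample1$WHR, sample1$sex, mean, na.rm = TRUE)
whr_sd   <- tapply(sample1$WHR, sample1$sex, sd, na.rm = TRUE)
rbind(mean = whr_mean, sd = whr_sd)
```

Replacing sample1$sex by another factor, such as an age category, gives the summary for those groups instead.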
■ Graphical presentations: Boxplots for WHR, separated by sex and agecat
► Grafiken (Graphs) → Boxplot (for different groups, if wanted)
[Figure: two boxplots of WHR (y-axis), one grouped by sex and one grouped by agecat]
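Grouped boxplots can be drawn in base R with the formula interface of boxplot(). The data below are simulated stand-ins, and the age limits used for agecat are an assumption.

```r
# Simulated stand-in data, for illustration only
set.seed(1)
n <- 200
sample1 <- data.frame(
  sex = factor(sample(c("f", "m"), n, replace = TRUE)),
  age = sample(25:74, n, replace = TRUE),
  WHR = rnorm(n, mean = 0.9, sd = 0.08)
)
sample1$agecat <- cut(sample1$age, breaks = c(20, 40, 60, 80))  # assumed limits

# "WHR ~ sex" reads: WHR separated by sex
bp_sex <- boxplot(WHR ~ sex, data = sample1, xlab = "sex", ylab = "WHR")
bp_age <- boxplot(WHR ~ agecat, data = sample1, xlab = "agecat", ylab = "WHR")
```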
Checking the normality assumption
■ Graphical presentations: Histogram for WHR
► Grafiken (Graphs) → Histogram
[Figure: two histograms of WHR (x-axis: WHR, y-axis: frequency), one with the number of breaks chosen automatically and one with number of breaks = 50]
Histograms can also be used to check the distribution (normality assumption) of the variable
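The two variants can be reproduced with hist(). The WHR values here are simulated stand-ins, so the shape of the plots says nothing about the real data.

```r
# Simulated stand-in values for WHR, for illustration only
set.seed(1)
WHR <- rnorm(200, mean = 0.9, sd = 0.08)

# Number of breaks chosen automatically
h_auto <- hist(WHR, xlab = "WHR", main = "Automatic number of breaks")

# About 50 breaks; R treats the number as a suggestion and picks round limits
h_50 <- hist(WHR, breaks = 50, xlab = "WHR", main = "Number of breaks = 50")
```

More breaks show more detail but also more random noise, so it is worth trying a few settings before judging the shape.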
■ Graphical presentations: Q-Q plot → compare the distribution of a variable (WHR) with the normal distribution
► Grafiken (Graphs) → Quantile comparison plot → choose the distribution with which you want to compare
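For the comparison with the normal distribution, base R offers qqnorm() and qqline(). This is a sketch on simulated stand-in values and not necessarily the command that the Rcmdr menu generates.

```r
# Simulated stand-in values for WHR, for illustration only
set.seed(1)
WHR <- rnorm(200, mean = 0.9, sd = 0.08)

# Sample quantiles against theoretical normal quantiles
qq <- qqnorm(WHR, main = "Normal Q-Q plot of WHR")
qqline(WHR)   # reference line through the first and third quartiles
```

If the points follow the line closely, the normality assumption is plausible; systematic curvature suggests trying a transformation such as log(WHR).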
Relation of two quantitative variables
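■ Relate two quantitative variables to each other: scatterplot and correlation
► Grafiken (Graphs) → Streudiagramm (Scatterplot) (groups can be marked)
► Statistik (Statistics) → Deskriptive Statistik (Descriptive statistics) → Korrelationsmatrix (Correlation matrix)
► Choose Pearson or Spearman as the type of correlation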
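A base R sketch of the scatterplot and both correlation coefficients follows. The second variable BMI is a hypothetical example, since the slides do not name the variables used, and both variables are simulated.

```r
# Simulated stand-in data; BMI is a hypothetical second variable
set.seed(1)
n <- 200
WHR <- rnorm(n, mean = 0.9, sd = 0.08)
BMI <- 27 + 25 * (WHR - 0.9) + rnorm(n, sd = 3)

plot(BMI, WHR, xlab = "BMI", ylab = "WHR")   # scatterplot

# Pearson measures linear association, Spearman uses the ranks of the values
r_pearson  <- cor(BMI, WHR, method = "pearson")
r_spearman <- cor(BMI, WHR, method = "spearman")
c(pearson = r_pearson, spearman = r_spearman)

# With missing values in the data, add: use = "complete.obs"
```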
A question we want to answer about a qualitative variable:
Does the smoking behavior differ between men and women?
→ Define a variable as a factor variable
→ Descriptive statistics of a qualitative variable: cross tables & barplots
Descriptive Statistics for a qualitative variable
■ So far, the variable smoking is not recognized as a factor variable.
► Smoking is coded 1,2,3 → R „thinks“ this variable is numeric → change it to a factor:
Datenmanagement (Data management) → Variablen bearbeiten (Edit variables) → Konvertiere numerische Variablen in Faktoren (Convert numeric variables to factors)
► Use „Etiketten“ (labels) for the factor levels:
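In plain R, the conversion is done with factor(). The meaning of the codes 1, 2, 3 is not given in the slides, so the labels below are hypothetical and only show the mechanism.

```r
smoking <- c(1, 3, 2, 1, 1, 2, 3, 1)   # made-up codes
is.numeric(smoking)                    # TRUE: R treats the codes as numbers

# Hypothetical labels for the codes 1, 2, 3
smoking_f <- factor(smoking, levels = 1:3,
                    labels = c("never", "former", "current"))

is.factor(smoking_f)   # TRUE
summary(smoking_f)     # now a frequency table instead of mean and quartiles
```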
■ For the categorical variables smoking, sex and agecat: output the absolute and relative frequencies:
► Statistik (Statistics) → Deskriptive Statistik (Descriptive statistics) → Häufigkeitsverteilung (Frequency distribution)
► Select smoking, sex and agecat and click OK
E.g.: absolute & relative frequencies for smoking
■ Barplot to illustrate the frequencies of smoking:
► Grafiken (Graphs) → Balkendiagramm (Bar chart) → select the variable smoking
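Frequencies and the bar plot can be produced with table(), prop.table() and barplot(). The data are simulated and the smoking labels are hypothetical.

```r
# Simulated stand-in data with hypothetical labels, for illustration only
set.seed(1)
smoking <- factor(sample(1:3, 200, replace = TRUE), levels = 1:3,
                  labels = c("never", "former", "current"))

abs_freq <- table(smoking)                           # absolute frequencies
rel_freq <- round(100 * prop.table(abs_freq), 1)     # relative frequencies in %
abs_freq
rel_freq

# Bar plot of the absolute frequencies
barplot(abs_freq, xlab = "smoking", ylab = "Frequency")
```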
■ Two-dimensional contingency tables: sex x smoking
► Statistik (Statistics) → Kontingenztabelle (Contingency table) → Kreuztabelle (Cross table)
The output shows the absolute numbers. If you want to compare the smoking behavior between men and women, look at the row percentages.
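A cross table with row percentages can be sketched in base R as follows. The data are simulated and the smoking labels are hypothetical, so the percentages carry no meaning.

```r
# Simulated stand-in data with hypothetical labels, for illustration only
set.seed(1)
n <- 200
sex     <- factor(sample(c("f", "m"), n, replace = TRUE))
smoking <- factor(sample(1:3, n, replace = TRUE), levels = 1:3,
                  labels = c("never", "former", "current"))

tab <- table(sex = sex, smoking = smoking)   # absolute numbers
tab

# Row percentages: each row (each sex) adds up to 100 %
row_pct <- prop.table(tab, margin = 1)
round(100 * row_pct, 1)
```

With sex in the rows, margin = 1 gives the smoking distribution within each sex, which is the comparison asked for.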
■ Two-dimensional contingency tables: sex x smoking, for different ages
► Repeat the contingency table for a specific subset (e.g. age<40)
► or: Statistik (Statistics) → Kontingenztabelle (Contingency table) → Mehrdimensionale Kreuztabelle (Multi-dimensional cross table)
with agecat as control variable (Kontrollvariable)
→ this outputs one sex x smoking contingency table for each category of agecat
Smoking behavior does not differ that much if you only look at people < 40 years
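Both routes can be sketched in base R. The data are simulated, the smoking labels and the age limits of agecat are assumptions, and the resulting counts say nothing about the real study.

```r
# Simulated stand-in data, for illustration only
set.seed(1)
n <- 200
sample1 <- data.frame(
  sex     = factor(sample(c("f", "m"), n, replace = TRUE)),
  age     = sample(25:74, n, replace = TRUE),
  smoking = factor(sample(1:3, n, replace = TRUE), levels = 1:3,
                   labels = c("never", "former", "current"))
)
sample1$agecat <- cut(sample1$age, breaks = c(20, 40, 60, 80))  # assumed limits

# Route 1: repeat the table for a subset
young <- subset(sample1, age < 40)
tab_young <- table(sex = young$sex, smoking = young$smoking)
tab_young

# Route 2: three-way table, one sex x smoking table per agecat category
tab3 <- table(sex = sample1$sex, smoking = sample1$smoking,
              agecat = sample1$agecat)
tab3
```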