Post on 25-Feb-2016
description
R INSTALLATIONR is an open source software package for statistical data analysis
R Installation • R download: http://www.r-project.org/
• Germany, Stefan Drees Bonn: http://cran.r-mirror.de/• This is the main program!
• RStudio (Auxiliary program for editing R files): http://www.rstudio.com/ide/download/
• Install the programs (R should be there already).
Working with the R command prompt (without RStudio)
• Practical session
Working with RStudio • Start RStudio
• „File -> New -> R Script“ Lets you edit a R command or R script (= small programme = several consecutive commands)
Working with RStudio
New R files
The command prompt
Select• Files• Plots• Packages (for advanced analyses)• Help
Working with RStudio• Commands and programs can be stored in R files
• Execute one command line:• Ctrl+Enter or• Button „Run“
• Execute several lines:• Mark lines and use „Ctrl+Enter“ or „Run“ button
WHAT IS STATISTICS?
What is statistics?• Statistics is a means to connect empirical knowledge and
theory and is constituted as follows:
• Data representation (Empirics)
• Methods for description, analysis, and interpretation of data, in order to allow predictions, conclusions and decisions (Statistical Theory)
What is statistics?
I. Descriptive Statistics
II. Probability theory
III. Test theory
DESCRIPTIVE STATISTICS
Descriptive Statistics• Basic concepts:
• Population: Collection of objects for which a conclusion shall be made (can be human beings but also a collection of atoms when applied in physics)
• Sample: a representative part/sub-set of the population
• Random sample: elements of the population drawn randomly and independently of each other
• Example: „Mietspiegel“ (= statistics of rents) for the city of Bonn• Population: all rooms, flats etc. for rent in Bonn ( too many to investigate all)• Sample: selected part; all flats from Poppelsdorf• Random sample: Investigation of n = 100, 200,… random objects from Bonn
Descriptive statistics
quantitative
Observational objects
discretecontinuous
Attributes / traits
Values
qualitative
patients, blood samples, DNA samples, houses, atoms
blood pressure, weight, age, blood group, number of siblings,
marital status, rent
Blood group, marital status
Number of siblings
blood pressure, weight, age, rent
Descriptive statistics• Scaling:
• Nominal scale: attribute values that are not directly comparable (sex, subject of studies, country of origin) (qualitative)
• Ordinal scale: attribute values that have a „natural“ order (grades, font sizes: tiny-small-medium-large-huge)
• Interval scale: difference between attribute values is interpretable (temperature in °C) (quantitative)
• To be distinguished:• Discrete attributes: Attribute values can be counted• Continuous attributes: All real numbers, or at least all numbers from an interval, are
possible
Descriptive statistics• Frequencies:
• Absolute frequency ni:• Number of obersvations with attribute value i (counts)
• Relative frequency hi:• Portion of elements with attribute value i • To be computed as absolute frequency devided by total number of
objects N: ni / N• Relative frequencies lie between 0 and 1• Relative frequencies have to add up to 1 (<- can be used to check
computation)
Descriptive statistics
AB0 blood group
1.00N = 50
0
1
2
5
6
19
17
1.00
0.00
0.01
0.03
0.09
0.10
0.38
0.39
Köln
other
A2B
A1B
B
A2
A1
0
value
7
6
5
4
3
2
1
absolute frequency ni
200
0
2
6
18
20
76
78
KölnBonn
relative frequency hi
Bonn
0.00
0.02
0.04
0.10
0.12
0.38
0.34
Nnk
1ii
Nnh i
i
1h0 i
1hk
1ii
tally sheet
Descriptive statistics• Frequencies:
• Cumulative frequency:
• Sum of all frequencies up to a given value i. • Denoted as for absolute frequencies and denoted asi for relative frequencies• Often used when values are subdivided into classes
• Classification:
• Arrangement of attribute values into disjoint groups, so called „classes“
• Classes are disjoint, i.e. non-overlapping, and neighbouring intervals of attribute values, which are defined by a lower and an upper bound. Neighbouring values implies that each value belongs to a class and does not lie outiside (completeness of the classification).
Descriptive statisticsheight
classification: • complete
• disjoint (each value belongs to only one class)
class limits:
• (160; 170] contains all values, that are > 160 but 170.
150 160 170 180 190 200
150 200height [cm]
height [cm]
( ]( ] ( ]( ]( ] (
Descriptive statisticsheight [cm]
relative hi
1,00
0.00
0.05
0.25
0.35
0.30
0.05
0.00
frequency
absolute ni
N=100
0
5
25
35
30
5
0
Cumulative frequency
relative Hi
1.00
1.00
0.95
0.70
0.35
0.05
0.00
absolute Ni
100
100
95
70
35
5
0
7
6
5
4
3
2
1
Class number i
> 200
(190; 200]
(180; 190]
(170; 180]
(160; 170]
(150; 160]
150
Class limits (ai-1; ai] Tally sheet
i
1kki nN
NNH i
i
DESCRIPTIVE STATISTICSGraphical representation
Descriptive statistics• Pie chart (R function: pie() )
• Shows absolute frequencies• Example: blood groups
Descriptive statistics• Bar chart (R function: barplot() )
• Shows relative frequencies• Example: blood groups
Descriptive statistics• Representation of cumulative frequencies with empirical
distribution function F
• Discrete trait: Number of Children
1.00
0.96
0.90
0.80
0.50
0.10
>4
4
3
2
1
0
Number of children
6
5
4
3
2
1
N = 50
2
3
5
15
20
5
relative hi
absoluteni
Frequencies
50
48
45
40
25
5
Cumulative frequencies
absoluteNi
relativeHi
0.04
0.06
0.10
0.30
0.40
0.10
1.00
Tally sheet
Descriptive statistics
H
0.0
0.2
0.4
0.6
0.8
1.0
0 1 2 3 4 >4
i
Number of children
hi
hi
hi
0.0
0.2
0.4
0.6
0.8
1.0
0 1 2 3 4 >4
Bar chart
F
F: Empirical distribution function
Since the attribute is quantitative discrete, we obtain a step
function
Descriptive statistics• Histogramms (R function: hist() )
• Construction:• Data is subdevided into classes
• Surface area of columns is proportional to the respective frequencies
• Columns are neighbouring since classes are neighbouring
Descriptive statisticsExample: Height [cm]
0
0,2
0,4
0,6
0,8
1
150 160 170 180 190 200height [cm]
hiHistogram
Descriptive statistics
••
•
•
••
0
0,2
0,4
0,6
0,8
150 160 170 180 190 200height [cm]
F
0
0,2
0,4
0,6
0,8
150 160 170 180 190 200height [cm]
f
empirical density function f
empirical distribution function F
(for continuous trait)
Descriptive statistics
0
0,2
0,4
0,6
0,8
150 160 170 180 190 200height [cm]
f
empirical density function f
••
•
•
••
0
0,2
0,4
0,6
0,8
150 160 170 180 190 200height [cm]
F
empirical distribution function F
Ff
fF
hi
hi
Descriptive Statistics• Note: Slides 23 and 26 both show empirical distribution
functions. In the first case, we obtain a step function since the trait under investigation is discrete.
DESCRIPTIVE STATISTICSMeasures of central tendency, dispersion and spread
Descriptive statistics• Measures of central tendency:
• A number to characterize the „center“ of the data
• Most important:• Mean• Median
Descriptive statistics• Median (R function: median() )
• Sample: Order according to: Ordered sample: • Median
x8=7
x7=6
x6=4
x5=19
x4=8
x3=3
x2 =9
x1=5
sample ranks
x(8)=19
x(1)=3
x(2)=4
x(3)=5
x(4)=6
x(5)=7
x(6)=8
x(7)=9
x7=6
x6=4
x5=19
x4=8
x3=3
x2 =9
x1=5
sample ranks
x(1)=3
x(2)=4
x(3)=5
x(4)=6
x(5)=8
x(6)=9
x(7)=19
n = 7 odd: n = 8 even:
Descriptive statistics• Mean (R function: mean() )
• Sample: • Sample size: n
• Mean
Descriptive statistics• Comparison of median and
mean:• Both samples have median 2500• and are the mean values• Mean can strongly be influenced by
a single value
Þ Median is more robust against extreme values („outliers“)
Nevertheless, the mean is more often used in practice since it has other desirable properties (see later).
i ordered and 1 2000 2000 1500
2 5000 15000 2000
3 4000 4000 2500
4 1500 1500 4000
5 2500 2500 5000 / 15000
Descriptive statistics
xx outlier
How to treat outliers?
Yes!2) Check value and correct
No!1) Discard
Descriptive statistics
• Measure the amount of variation of the data!
x
sample A
An
21
nx
xx
A
Ax
Bx
sample B
Bn
21
nx
xx
B
x
The mean (or median) is not sufficent to describe a sample
Descriptive statistics• Measures of dispersion and spread:
• Numbers to characterize the amount variation around the center (= mean)
• Most important:• Minimum, maximum, range (dispersion)• Empirical variance (spread)• Empirical standard deviation (spread)
Descriptive statistics
ranks
n=7n=7
x7=6
x6=4
x5=19
x4=8
x3=3
x2 =9
x1=5
samplex(1)=3
x(2)=4
x(3)=5
x(4)=6
x(5)=8
x(6)=9
x(7)=19
minimum: min = x(1)
maximum: max = x(n)
range: R = x(n) – x(1)
• range:
Descriptive statistics• Variance (R function: var() ):
• A measure to express the spread around the center (mean) by a single value
• The squared deviation of each attribute value from the mean is considered.
• Formula for the empirical variance from a sample of elements:
• The empircal standard deviation is just the square root of the variance, . (R function: sd() ).
Why devide by instead of ?
=
=n
=x4
=x3
=x2
=x1
xxnx
xx
n
2
1
Example:
100
4
270
2
75
532702751004
x4 =53 is not free,but given by other values when the mean is known.
s2 has (n-1) degrees of freedom (f)
n
1i
2i
2 xxf1s
Data • If you have data you want to analyse, please bring it
along!