R Installation


Transcript of R Installation

Page 1: R Installation

R INSTALLATION

R is an open-source software package for statistical data analysis.

Page 2: R Installation

R Installation

• R download: http://www.r-project.org/
• Germany, Stefan Drees, Bonn: http://cran.r-mirror.de/
• This is the main program!
• RStudio (auxiliary program for editing R files): http://www.rstudio.com/ide/download/
• Install the programs (R should be there already).

Page 3: R Installation

Working with the R command prompt (without RStudio)

• Practical session

Page 4: R Installation

Working with RStudio

• Start RStudio
• „File -> New -> R Script“ lets you edit an R command or an R script (= small programme = several consecutive commands)

Page 5: R Installation

Working with RStudio

New R files

The command prompt

Select:
• Files
• Plots
• Packages (for advanced analyses)
• Help

Page 6: R Installation

Working with RStudio

• Commands and programs can be stored in R files
• Execute one command line: Ctrl+Enter or the „Run“ button
• Execute several lines: mark the lines and use Ctrl+Enter or the „Run“ button
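As a minimal sketch of what such an R file could contain: the file name and the height values below are made up for illustration and are not part of the course material. Each line is one command that can be run with Ctrl+Enter.

```r
# first_script.R (hypothetical file name)
# Made-up body heights in cm, used only to have something to compute with.
heights <- c(172, 181, 165, 190)

# Each of the following lines is one command.
n <- length(heights)        # number of values
average <- mean(heights)    # arithmetic mean of the values
print(n)
print(average)
```

Marking all lines and pressing Ctrl+Enter runs the whole script in one go.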

Page 7: R Installation

WHAT IS STATISTICS?

Page 8: R Installation

What is statistics?

• Statistics is a means to connect empirical knowledge and theory. It consists of:
• Data representation (Empirics)
• Methods for description, analysis, and interpretation of data, in order to allow predictions, conclusions and decisions (Statistical Theory)

Page 9: R Installation

What is statistics?

I. Descriptive Statistics

II. Probability theory

III. Test theory

Page 10: R Installation

DESCRIPTIVE STATISTICS

Page 11: R Installation

Descriptive Statistics

• Basic concepts:
  • Population: collection of objects about which a conclusion is to be drawn (can be human beings, but also a collection of atoms when applied in physics)
  • Sample: a representative part (subset) of the population
  • Random sample: elements of the population drawn randomly and independently of each other
• Example: „Mietspiegel“ (= statistics of rents) for the city of Bonn
  • Population: all rooms, flats etc. for rent in Bonn (too many to investigate all)
  • Sample: selected part, e.g. all flats from Poppelsdorf
  • Random sample: investigation of n = 100, 200, … random objects from Bonn

Page 12: R Installation

Descriptive statistics

• Observational objects: patients, blood samples, DNA samples, houses, atoms
• Attributes / traits: blood pressure, weight, age, blood group, number of siblings, marital status, rent
• Values:
  • qualitative: blood group, marital status
  • quantitative, discrete: number of siblings
  • quantitative, continuous: blood pressure, weight, age, rent

Page 13: R Installation

Descriptive statistics

• Scaling:
  • Nominal scale: attribute values that are not directly comparable (sex, subject of studies, country of origin) (qualitative)
  • Ordinal scale: attribute values that have a „natural“ order (grades, font sizes: tiny-small-medium-large-huge)
  • Interval scale: difference between attribute values is interpretable (temperature in °C) (quantitative)
• To be distinguished:
  • Discrete attributes: attribute values can be counted
  • Continuous attributes: all real numbers, or at least all numbers from an interval, are possible

Page 14: R Installation

Descriptive statistics

• Frequencies:
  • Absolute frequency $n_i$: number of observations with attribute value $i$ (counts)
  • Relative frequency $h_i$: proportion of elements with attribute value $i$
    • Computed as the absolute frequency divided by the total number of objects $N$: $h_i = n_i / N$
    • Relative frequencies lie between 0 and 1
    • Relative frequencies have to add up to 1 (this can be used to check the computation)

Page 15: R Installation

Descriptive statistics

AB0 blood group (tally sheet)

                 Bonn                          Köln
 i   value    absolute n_i   relative h_i   absolute n_i   relative h_i
 1   0            17            0.34            78            0.39
 2   A1           19            0.38            76            0.38
 3   A2            6            0.12            20            0.10
 4   B             5            0.10            18            0.09
 5   A1B           2            0.04             6            0.03
 6   A2B           1            0.02             2            0.01
 7   other         0            0.00             0            0.00
     total      N = 50          1.00          N = 200         1.00

• $\sum_{i=1}^{k} n_i = N$
• $h_i = \frac{n_i}{N}$
• $0 \le h_i \le 1$
• $\sum_{i=1}^{k} h_i = 1$
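The frequency rules can be checked in R. The sketch below uses the counts of the N = 50 sample from the AB0 tally sheet (read here as the Bonn column, which is an interpretation of the slide layout); the variable names are chosen for illustration only.

```r
# Absolute frequencies n_i of the N = 50 sample from the AB0 tally sheet
blood_groups <- c("0", "A1", "A2", "B", "A1B", "A2B", "other")
n_bonn <- c(17, 19, 6, 5, 2, 1, 0)
names(n_bonn) <- blood_groups

N <- sum(n_bonn)        # total number of objects
h_bonn <- n_bonn / N    # relative frequencies h_i = n_i / N
print(h_bonn)
print(sum(h_bonn))      # check: relative frequencies add up to 1

# With raw observations instead of counts, table() does the counting.
# The factor levels keep the category order and the empty class "other".
raw <- factor(rep(blood_groups, times = n_bonn), levels = blood_groups)
counts <- table(raw)
print(counts)
print(prop.table(counts))   # relative frequencies from the raw data
```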

Page 16: R Installation

Descriptive statistics

• Frequencies:
  • Cumulative frequency:
    • Sum of all frequencies up to a given value $i$
    • Denoted as $N_i$ for absolute frequencies and as $H_i$ for relative frequencies
    • Often used when values are subdivided into classes
  • Classification:
    • Arrangement of attribute values into disjoint groups, so-called „classes“
    • Classes are disjoint (i.e. non-overlapping) and neighbouring intervals of attribute values, each defined by a lower and an upper bound. Neighbouring means that there are no gaps: each value belongs to a class and none lies outside (completeness of the classification).

Page 17: R Installation

Descriptive statistics

Example: height [cm]

Classification:
• complete
• disjoint (each value belongs to only one class)

Class limits:
• (160; 170] contains all values that are > 160 but ≤ 170.

[Figure: height axis from 150 to 200 cm, divided into the half-open classes (150; 160], (160; 170], (170; 180], (180; 190], (190; 200], followed by an open class above 200]
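Half-open classes of this kind can be produced in R with cut(). The sketch below uses a few made-up heights (not from the slides) chosen to sit on and next to the class limits; right = TRUE makes each interval closed on the right, as in (160; 170].

```r
# Made-up heights in cm, chosen to sit on and next to class limits
heights <- c(160, 160.5, 170, 170.1, 185)

# Class limits as on the slide
breaks <- c(150, 160, 170, 180, 190, 200)

# right = TRUE: intervals are open on the left and closed on the right
classes <- cut(heights, breaks = breaks, right = TRUE)
print(classes)
print(table(classes))   # absolute frequency per class
```

Note that R prints the classes with a comma, e.g. (160,170], where the slides use a semicolon.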

Page 18: R Installation

Descriptive statistics

Height [cm] (tally sheet)

                                 Frequency                    Cumulative frequency
 Class number i   Class limits   absolute n_i   relative h_i   absolute N_i   relative H_i
       1          ≤ 150               0            0.00             0            0.00
       2          (150; 160]          5            0.05             5            0.05
       3          (160; 170]         30            0.30            35            0.35
       4          (170; 180]         35            0.35            70            0.70
       5          (180; 190]         25            0.25            95            0.95
       6          (190; 200]          5            0.05           100            1.00
       7          > 200               0            0.00           100            1.00
                  total            N = 100         1.00

Class limits are written as $(a_{i-1}; a_i]$.

• $N_i = \sum_{k=1}^{i} n_k$
• $H_i = \frac{N_i}{N}$
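The cumulative columns of the height tally sheet can be recomputed with cumsum(). This is a small sketch using the class counts from the slide; the object names are chosen here for illustration.

```r
# Absolute frequencies n_i of the seven height classes (N = 100)
classes <- c("<=150", "(150;160]", "(160;170]", "(170;180]",
             "(180;190]", "(190;200]", ">200")
n <- c(0, 5, 30, 35, 25, 5, 0)

N <- sum(n)           # total number of objects
h <- n / N            # relative frequencies h_i
N_cum <- cumsum(n)    # cumulative absolute frequencies N_i
H_cum <- N_cum / N    # cumulative relative frequencies H_i

print(data.frame(class = classes, n = n, h = h, N_cum = N_cum, H_cum = H_cum))
```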

Page 19: R Installation

DESCRIPTIVE STATISTICS

Graphical representation

Page 20: R Installation

Descriptive statistics

• Pie chart (R function: pie() )
  • Shows absolute frequencies
  • Example: blood groups
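A minimal sketch of a pie() call for the blood group example, assuming the counts of the N = 50 sample from the AB0 tally sheet; the title text is made up.

```r
# Absolute frequencies of the AB0 blood groups (N = 50 sample)
n_groups <- c("0" = 17, "A1" = 19, "A2" = 6, "B" = 5,
              "A1B" = 2, "A2B" = 1, "other" = 0)

# Classes with a count of zero would only add an empty label, so drop them
n_shown <- n_groups[n_groups > 0]

# The names of the vector are used as slice labels
pie(n_shown, main = "AB0 blood group (N = 50)")
```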

Page 21: R Installation

Descriptive statistics

• Bar chart (R function: barplot() )
  • Shows relative frequencies
  • Example: blood groups
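A corresponding sketch for barplot(), again assuming the counts of the N = 50 sample from the AB0 tally sheet and converting them to relative frequencies first. The axis labels are illustrative choices.

```r
# Absolute frequencies of the AB0 blood groups (N = 50 sample)
n_groups <- c("0" = 17, "A1" = 19, "A2" = 6, "B" = 5,
              "A1B" = 2, "A2B" = 1, "other" = 0)

# Relative frequencies h_i = n_i / N
h_groups <- n_groups / sum(n_groups)

# One bar per blood group; the bar height is the relative frequency
barplot(h_groups, ylim = c(0, 1),
        xlab = "AB0 blood group", ylab = "relative frequency")
```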

Page 22: R Installation

Descriptive statistics

• Representation of cumulative frequencies with the empirical distribution function F
• Discrete trait: number of children

Tally sheet

                            Frequencies                   Cumulative frequencies
 i   Number of children   absolute n_i   relative h_i   absolute N_i   relative H_i
 1          0                  5            0.10             5            0.10
 2          1                 20            0.40            25            0.50
 3          2                 15            0.30            40            0.80
 4          3                  5            0.10            45            0.90
 5          4                  3            0.06            48            0.96
 6         >4                  2            0.04            50            1.00
          total             N = 50          1.00

Page 23: R Installation

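Descriptive statistics

[Figure, two panels for the number of children (0, 1, 2, 3, 4, >4), both with a vertical axis from 0.0 to 1.0: a bar chart of the relative frequencies $h_i$, and the empirical distribution function F built from the cumulative relative frequencies $H_i$]

F: Empirical distribution function

Since the attribute is quantitative and discrete, we obtain a step function.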
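The step function can be reproduced with ecdf(). The sketch below rebuilds raw observations from the tally sheet for the number of children; coding the open class ">4" as the value 5 is an assumption made only so that the data can be plotted.

```r
# Absolute frequencies for 0, 1, 2, 3, 4 and ">4" children (N = 50)
n_children <- c(5, 20, 15, 5, 3, 2)
values <- 0:5   # assumption: 5 stands in for the open class ">4"

# Rebuild the raw observations from the counts
raw <- rep(values, times = n_children)

# ecdf() returns the empirical distribution function F as an R function
F_emp <- ecdf(raw)
print(F_emp(2))     # proportion of families with at most 2 children
print(F_emp(1.5))   # between two values F stays constant (step function)

plot(F_emp, xlab = "Number of children", ylab = "F",
     main = "Empirical distribution function")
```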

Page 24: R Installation

Descriptive statistics

• Histograms (R function: hist() )
  • Construction:
    • Data is subdivided into classes
    • The area of each column is proportional to the frequency of its class
    • Columns are adjacent since the classes are neighbouring
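A sketch of hist() with the height classes used in this lecture. The raw heights are not given on the slides, so the example places every observation at the midpoint of its class, which is an assumption made purely for illustration. With freq = FALSE the column areas equal the relative frequencies.

```r
# Class midpoints and class counts for (150;160], ..., (190;200]
midpoints <- c(155, 165, 175, 185, 195)
counts <- c(5, 30, 35, 25, 5)

# Assumption: each observation sits at the midpoint of its class
heights <- rep(midpoints, times = counts)

# Class limits; hist() uses right-closed intervals by default
breaks <- seq(150, 200, by = 10)

# freq = FALSE plots densities, so area = relative frequency
res <- hist(heights, breaks = breaks, freq = FALSE,
            xlab = "height [cm]", main = "Histogram of height")

print(res$counts)                        # absolute frequency per class
print(sum(res$density * diff(breaks)))   # total area of all columns
```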

Page 25: R Installation

Descriptive statistics

Example: Height [cm]

[Figure: histogram of the relative frequencies $h_i$ against height [cm], horizontal axis from 150 to 200, vertical axis from 0 to 1]

Page 26: R Installation

Descriptive statistics

[Figure, two panels against height [cm], horizontal axis from 150 to 200: the empirical density function f and the empirical distribution function F (for a continuous trait)]

Page 27: R Installation

Descriptive statistics

[Figure: the same two panels against height [cm], the empirical density function f and the empirical distribution function F, with annotations linking f, F and the relative frequencies $h_i$]

Page 28: R Installation

Descriptive Statistics

• Note: Slides 23 and 26 both show empirical distribution functions. In the first case, we obtain a step function since the trait under investigation is discrete.

Page 29: R Installation

DESCRIPTIVE STATISTICS

Measures of central tendency, dispersion and spread

Page 30: R Installation

Descriptive statistics

• Measures of central tendency:
  • A number to characterize the „center“ of the data
  • Most important:
    • Mean
    • Median

Page 31: R Installation

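Descriptive statistics

• Median (R function: median() )
  • Sample: $x_1, x_2, \dots, x_n$
  • Order according to size. Ordered sample: $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$
  • Median: the middle value of the ordered sample, i.e. $x_{((n+1)/2)}$ for odd $n$ and $\frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$ for even $n$

n = 7 (odd):
  sample: x1=5, x2=9, x3=3, x4=8, x5=19, x6=4, x7=6
  ranks:  x(1)=3, x(2)=4, x(3)=5, x(4)=6, x(5)=8, x(6)=9, x(7)=19
  median = x(4) = 6

n = 8 (even):
  sample: x1=5, x2=9, x3=3, x4=8, x5=19, x6=4, x7=6, x8=7
  ranks:  x(1)=3, x(2)=4, x(3)=5, x(4)=6, x(5)=7, x(6)=8, x(7)=9, x(8)=19
  median = (x(4) + x(5)) / 2 = 6.5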
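Both median examples can be checked in R. The sketch uses the two samples from the slide; sort() shows the ordered sample, and median() handles the odd and the even case.

```r
# Sample with n = 7 (odd)
x7 <- c(5, 9, 3, 8, 19, 4, 6)
print(sort(x7))      # ordered sample x(1), ..., x(7)
m7 <- median(x7)     # middle value x(4)
print(m7)

# Sample with n = 8 (even): the same values plus x8 = 7
x8 <- c(x7, 7)
print(sort(x8))      # ordered sample x(1), ..., x(8)
m8 <- median(x8)     # mean of the two middle values x(4) and x(5)
print(m8)
```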

Page 32: R Installation

Descriptive statistics

• Mean (R function: mean() )
  • Sample: $x_1, x_2, \dots, x_n$
  • Sample size: $n$
  • Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
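As a short sketch, the mean can be computed from its definition (sum divided by sample size) and compared with mean(). The data are the n = 7 sample from the median example.

```r
# The n = 7 sample from the median example
x <- c(5, 9, 3, 8, 19, 4, 6)

n <- length(x)          # sample size
x_bar <- sum(x) / n     # mean computed from its definition
print(x_bar)
print(mean(x))          # built-in function, same value
```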

Page 33: R Installation

Descriptive statistics

• Comparison of median and mean:
  • Both samples have median 2500
  • The mean values are 3000 (sample 1) and 5000 (sample 2)
  • The mean can be strongly influenced by a single value

⇒ The median is more robust against extreme values („outliers“)

Nevertheless, the mean is more often used in practice since it has other desirable properties (see later).

 i   sample 1   sample 2   ordered (sample 1 / sample 2)
 1     2000       2000       1500
 2     5000      15000       2000
 3     4000       4000       2500
 4     1500       1500       4000
 5     2500       2500       5000 / 15000
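The comparison can be reproduced in R with the two samples from the table, which differ only in their second value. The names sample_1 and sample_2 are chosen here for illustration.

```r
# The two samples differ only in the second value (5000 vs. 15000)
sample_1 <- c(2000, 5000, 4000, 1500, 2500)
sample_2 <- c(2000, 15000, 4000, 1500, 2500)

# The median is the same for both samples
print(median(sample_1))
print(median(sample_2))

# The mean is pulled upwards by the single extreme value
print(mean(sample_1))
print(mean(sample_2))
```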

Page 34: R Installation

Descriptive statistics

[Figure: a sample on an axis with its measures of central tendency and an outlier marked]

How to treat outliers?

1) Discard: No!
2) Check value and correct: Yes!

Page 35: R Installation

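Descriptive statistics

• Measure the amount of variation of the data!

[Figure: two samples plotted along the x axis, sample A ($x_1, x_2, \dots, x_{n_A}$) with mean $\bar{x}_A$ and sample B ($x_1, x_2, \dots, x_{n_B}$) with mean $\bar{x}_B$]

The mean (or median) is not sufficient to describe a sample.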

Page 36: R Installation

Descriptive statistics

• Measures of dispersion and spread:
  • Numbers to characterize the amount of variation around the center (= mean)
  • Most important:
    • Minimum, maximum, range (dispersion)
    • Empirical variance (spread)
    • Empirical standard deviation (spread)

Page 37: R Installation

Descriptive statistics

• Range:

n = 7
  sample: x1=5, x2=9, x3=3, x4=8, x5=19, x6=4, x7=6
  ranks:  x(1)=3, x(2)=4, x(3)=5, x(4)=6, x(5)=8, x(6)=9, x(7)=19

minimum: min = x(1)
maximum: max = x(n)
range: R = x(n) - x(1)
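For the n = 7 sample this gives min = 3, max = 19 and R = 16. A short sketch in R; note that range() returns the pair (minimum, maximum), so the range R in the sense of the slide is their difference.

```r
# The n = 7 sample from the slide
x <- c(5, 9, 3, 8, 19, 4, 6)

x_min <- min(x)          # x(1)
x_max <- max(x)          # x(n)
R <- x_max - x_min       # range R = x(n) - x(1)

print(range(x))          # returns c(min, max), not the difference
print(diff(range(x)))    # the same value as R
```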

Page 38: R Installation

Descriptive statistics

• Variance (R function: var() ):
  • A measure to express the spread around the center (mean) by a single value
  • The squared deviation of each attribute value from the mean is considered.
  • Formula for the empirical variance from a sample of $n$ elements: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  • The empirical standard deviation is just the square root of the variance, $s = \sqrt{s^2}$ (R function: sd() ).
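A sketch that computes the empirical variance step by step, with the divisor n - 1, and compares it with var() and sd(). The data are the n = 7 sample used in the earlier examples.

```r
# The n = 7 sample used in the earlier examples
x <- c(5, 9, 3, 8, 19, 4, 6)

n <- length(x)
x_bar <- mean(x)

# Squared deviation of each value from the mean
squared_dev <- (x - x_bar)^2

# Empirical variance: sum of squared deviations divided by n - 1
s2 <- sum(squared_dev) / (n - 1)
s <- sqrt(s2)            # empirical standard deviation

print(c(s2, var(x)))     # both entries agree
print(c(s, sd(x)))       # both entries agree
```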

Page 39: R Installation

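Why divide by $n-1$ instead of $n$?

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \;\Rightarrow\; n \bar{x} = x_1 + x_2 + \dots + x_n$

Example:

  $\bar{x} = 100$, $n = 4$
  $x_1 = 75$, $x_2 = 2$, $x_3 = 270$
  $x_4 = 4 \cdot 100 - 75 - 2 - 270 = 53$

$x_4 = 53$ is not free, but given by the other values when the mean is known.

$s^2$ has $(n-1)$ degrees of freedom ($f$):

$s^2 = \frac{1}{f} \sum_{i=1}^{n} (x_i - \bar{x})^2$, with $f = n - 1$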
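The degrees-of-freedom argument can be replayed in R. The sketch uses the values as read from the example (mean 100, n = 4, and the three free values 75, 2 and 270; the assignment of these three values to x1, x2, x3 is an assumption, which does not affect the result).

```r
# Known quantities from the example
n <- 4
x_bar <- 100
x_free <- c(75, 2, 270)   # three values that can be chosen freely

# The fourth value is fixed once the mean is known: n * x_bar = sum of all x_i
x4 <- n * x_bar - sum(x_free)
print(x4)

# The deviations from the mean always sum to zero,
# so only n - 1 of them carry independent information
x <- c(x_free, x4)
print(sum(x - mean(x)))

# Variance with f = n - 1 degrees of freedom, compared with var()
f <- n - 1
s2 <- sum((x - mean(x))^2) / f
print(c(s2, var(x)))
```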

Page 40: R Installation

Data

• If you have data you want to analyse, please bring it along!