Statistics I Chapter 2: Analysis of univariate data

Numerical summary

- Central tendency: mean, median, mode
- Location: quartiles, percentiles
- Spread: range, interquartile range, variance, standard deviation, coefficient of variation
- Form: coefficient of asymmetry, coefficient of kurtosis

Descriptive statistics

- What are they useful for?

- Can we calculate them for all types of variables?

- Which are the most useful in each case?

- How can we compute them with a calculator or Excel?

Measures of central tendency

- The mean

- The median

- The mode

Central tendency: the (arithmetic) mean

The mean is the average of all the data:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$$

- It is the most common measure of location

- It is the center of gravity of the data

- It can be calculated only for quantitative variables

The mean: example

For the experience of the 46 professionals of a computer company, what is the mean?

$$\bar{x} = \frac{1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + \dots + 17 + 20}{46} = 7.5 \text{ years}$$

How can we calculate it using the absolute frequency table? And using the relative one?

Experience, x_i    Absolute freq., n_i    Relative freq., f_i
 1                 5                      0.109
 2                 4                      0.087
 3                 4                      0.087
 4                 4                      0.087
 5                 3                      0.065
 6                 4                      0.087
 7                 1                      0.022
 8                 4                      0.087
10                 4                      0.087
11                 2                      0.043
12                 2                      0.043
13                 2                      0.043
14                 1                      0.022
15                 1                      0.022
16                 3                      0.065
17                 1                      0.022
20                 1                      0.022
Total              46                     1
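
One way to answer the two questions above is sketched below in Python (the slides themselves only mention a calculator or Excel, so the language and the variable names are assumptions made for illustration). It weights each value by its absolute frequency and divides by n, or equivalently weights each value by its relative frequency.

```python
# Values and absolute frequencies copied from the experience table.
experience = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 20]
abs_freq = [5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1]

n = sum(abs_freq)  # total number of professionals

# Mean from absolute frequencies: sum(x_i * n_i) / n
mean_from_abs = sum(x * ni for x, ni in zip(experience, abs_freq)) / n

# Mean from relative frequencies: sum(x_i * f_i), with f_i = n_i / n
rel_freq = [ni / n for ni in abs_freq]
mean_from_rel = sum(x * fi for x, fi in zip(experience, rel_freq))

print(n, mean_from_abs, round(mean_from_rel, 6))
```

Both routes give the same mean as the raw-data formula, because the frequency table only groups repeated values.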

The mean with grouped data

This is the same formula, but using the center of each interval. For the salary of the 46 professionals of a computer company, what is the mean?

Note: the mean salary using the raw data equals 17250.413

The mean: properties

- Linearity: if Y = a + bX, then $\bar{y} = a + b\bar{x}$

If the 46 professionals' salaries are increased by 2 %, how does the mean salary change?

Afterwards the salary is reduced by 100 dollars. What is the final mean salary?

- Disadvantage: it is affected by extreme values (outliers)

Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 50, 4, 2

$$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3, \qquad \bar{y} = \frac{3 + 1 + 50 + 4 + 2}{5} = 12$$

Its value has been multiplied by 4! When the data is skewed, an alternative robust measure of central tendency is more appropriate.
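
The two properties can be checked numerically. The sketch below is only an illustration in Python: the helper `mean` and the choice to apply the salary transformation (a = −100, b = 1.02) to the small data set X are assumptions made for the example, not part of the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the data divided by the number of data."""
    return sum(values) / len(values)

x = [3, 1, 5, 4, 2]
y = [3, 1, 50, 4, 2]   # same data with one outlier (50 instead of 5)

mean_x = mean(x)
mean_y = mean(y)

# Linearity: a 2 % increase followed by a reduction of 100 is Y = a + bX
a, b = -100, 1.02
mean_of_transformed = mean([a + b * xi for xi in x])
transformed_mean = a + b * mean_x

print(mean_x, mean_y)
print(mean_of_transformed, transformed_mean)
```

The last two printed values agree up to floating-point rounding, which is the linearity property; the first two show how a single outlier moves the mean from 3 to 12.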

Central tendency: the median

...is the most central datum

1 1 1 3 3 5 5 7 8 8 9 ⇒ M = 5

1. Order the data from smallest to largest

2. Include repetitions

3. The median is the physical centre

1 1 1 3 3 5 5 7 8 8 ⇒ M = (3 + 5)/2 = 4

Median

Ordered list from smallest to largest: $x_{(1)}, x_{(2)}, \dots, x_{(n)}$

$$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$
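
The definition above can be sketched as a short Python function (the function name is hypothetical, and Python indexes from 0, so position (n+1)/2 becomes index n // 2 for odd n).

```python
def median(values):
    """Median following the odd/even rule for the ordered list x(1), ..., x(n)."""
    ordered = sorted(values)          # step 1: order, keeping repetitions
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # x((n+1)/2) in 1-based notation
    # average of x(n/2) and x(n/2 + 1) in 1-based notation
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]))  # odd n
print(median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8]))     # even n
```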

The median via the table of frequencies

Experience, x_i    n_i    f_i      N_i    F_i
 1                 5      0.109     5     0.109
 2                 4      0.087     9     0.196
 3                 4      0.087    13     0.283
 4                 4      0.087    17     0.370
 5                 3      0.065    20     0.435   (< 0.5)
 6                 4      0.087    24     0.522   (> 0.5, so M = 6)
 7                 1      0.022    25     0.543
 8                 4      0.087    29     0.630
 9                 0      0        29     0.630
10                 4      0.087    33     0.717
11                 2      0.043    35     0.761
12                 2      0.043    37     0.804
13                 2      0.043    39     0.848
14                 1      0.022    40     0.870
15                 1      0.022    41     0.891
16                 3      0.065    44     0.957
17                 1      0.022    45     0.978
18                 0      0        45     0.978
19                 0      0        45     0.978
20                 1      0.022    46     1.000
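
A possible Python sketch of this table lookup follows: it accumulates the absolute frequencies and returns the first value whose cumulative relative frequency exceeds 0.5. The function name is hypothetical, and the handling of the tie case (F_i exactly 0.5, where the two central data differ) is an assumption added so that the result agrees with the even-n median formula.

```python
values = list(range(1, 21))
abs_freq = [5, 4, 4, 4, 3, 4, 1, 4, 0, 4, 2, 2, 2, 1, 1, 3, 1, 0, 0, 1]

def median_from_table(values, abs_freq):
    """Median from a frequency table via cumulative frequencies N_i."""
    n = sum(abs_freq)
    cumulative = 0
    for i, (x, ni) in enumerate(zip(values, abs_freq)):
        cumulative += ni
        if 2 * cumulative > n:        # F_i > 0.5: x is the median
            return x
        if 2 * cumulative == n:       # F_i == 0.5: average with next observed value
            next_x = next(v for v, m in zip(values[i + 1:], abs_freq[i + 1:]) if m > 0)
            return (x + next_x) / 2

print(median_from_table(values, abs_freq))
```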

The median: properties

- Linearity: if Y = a + bX, then $M_y = a + b M_x$

If the 46 professionals' salaries are increased by 2 %, how does the median salary change?

Afterwards the salary is reduced by 100 dollars. What is the final median salary?

- Can we calculate the median with the education level data?

Can we calculate the median with the 0-1 position of responsibility variable?

- Advantage: it is not affected by outliers

Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 50, 4, 2

$M_x = 3$, $M_y = 3$

When the data is skewed, it is a better measure of central tendency than the mean.

The median and the mean for asymmetric data

Annual gross salary in 2014, Encuesta de Estructura Salarial 2014, I.N.E.

"The difference between the mean salary and the median salary is explained by the fact that very high salaries strongly influence the calculation of the mean, even though they correspond to few workers." (INE press release of 28 October 2016)

Central tendency: the mode

...is the most frequent value

The mode of the variable experience in the 46 professionals example is 1 year, with an absolute frequency of 5 employees.

The values 2, 3, 4, 6, 8 and 10 each have an absolute frequency of 4 employees.
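
As an illustration, the mode can be found by counting occurrences. The sketch below uses Python's `collections.Counter`; the function name `modes` and the decision to return every value tied for the highest frequency are assumptions made for the example.

```python
from collections import Counter

# Rebuild the 46 raw observations from the experience frequency table.
experience = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 20]
abs_freq = [5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1]
data = [x for x, ni in zip(experience, abs_freq) for _ in range(ni)]

def modes(values):
    """Return (list of most frequent values, their absolute frequency)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top

print(modes(data))
```

Because it only counts labels, the same function also works for qualitative data such as a list of education levels.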

Central tendency: the mode

Does this definition make sense with the education level data?

Does this definition make sense with the 0-1 position of responsibility variable?

Central tendency: the mode

Does this definition make sense with continuous data? ⇒ modal interval

The mode: properties

- It can be calculated for both qualitative and quantitative variables. Indeed, it is the only one of these descriptive measurements (mean, median, mode) that makes sense for nominal qualitative variables.

- Not affected by outliers

- There can be no mode.

- There can be more than one mode: bimodal, trimodal, plurimodal

What can this indicate?

Location measures

- Quartiles

- Percentiles

Location measures: quartiles and percentiles

- Quartiles split the ranked data into four segments with an equal number of values per segment.

- Percentiles split the ranked data into a hundred segments with an equal number of values per segment.

1. Order the data from smallest to largest

2. Include repetitions

3. Select each quartile (percentile) according to:
   - The first quartile Q1 has position (1/4)(n + 1).
   - The second quartile Q2 (= median) has position (1/2)(n + 1).
   - The third quartile Q3 has position (3/4)(n + 1).
   - The k-th percentile Pk has position k(n + 1)/100, k = 1, ..., 99.
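
The position rule above can be sketched in Python as follows. The slides do not say what to do when the position is not a whole number; the linear interpolation between the two neighbouring data used here is an assumption, as are the function names.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position of an ordered list."""
    n = len(ordered)
    position = min(max(position, 1), n)   # keep the position inside 1..n
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    # Assumption: interpolate linearly between neighbouring data
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def percentile(values, k):
    """k-th percentile using position k(n + 1)/100."""
    ordered = sorted(values)              # order, keeping repetitions
    return value_at_position(ordered, k * (len(ordered) + 1) / 100)

data = [1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]  # n = 11, so positions are 3, 6 and 9
q1, q2, q3 = percentile(data, 25), percentile(data, 50), percentile(data, 75)
print(q1, q2, q3)
```

With n = 11 the three positions are whole numbers, so the result does not depend on the interpolation choice.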

Quartiles: example

Percentiles: example

Measures of spread

- The range and the interquartile range

- The variance and the standard deviation

- The coefficient of variation

Variation: range and interquartile range (IQR)

- The range is the simplest measure of variation:

$$R = x_{\max} - x_{\min}$$

- It ignores the way the data is distributed

- It is sensitive to outliers

Example: given observations 3, 1, 5, 4, 2, R = 5 − 1 = 4
Example: given observations 3, 1, 5, 4, 100, R = 100 − 1 = 99

- The interquartile range (IQR) can eliminate some outlier problems: discard the high and low observations and calculate the range of the middle 50 % of the data:

$$\mathrm{IQR} = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$$
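
A tiny Python check of the two range examples (the function name is hypothetical):

```python
def data_range(values):
    """Range R = largest observation minus smallest observation."""
    return max(values) - min(values)

print(data_range([3, 1, 5, 4, 2]))     # no outlier
print(data_range([3, 1, 5, 4, 100]))   # one outlier dominates the range
```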

Variation: Interquartile range and boxplot

- Outliers are observations that fall
  - below the value of Q1 − 1.5 · IQR
  - above the value of Q3 + 1.5 · IQR

- For extreme outliers, replace 1.5 by 3 in the above definition

[Boxplot: x_min = 12, Q1 = 24, median (Q2) = 31, Q3 = 42, x_max = 58; each of the four segments contains 25 % of the data; IQR = 18]
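
Using the quartiles read off the boxplot (Q1 = 24, Q3 = 42), the outlier limits can be computed as in this Python sketch; the function name and the `factor` parameter are hypothetical names chosen for the example.

```python
def outlier_limits(q1, q3, factor=1.5):
    """Lower and upper limits Q1 - factor*IQR and Q3 + factor*IQR."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

lower, upper = outlier_limits(24, 42)                   # ordinary outliers
ext_lower, ext_upper = outlier_limits(24, 42, factor=3) # extreme outliers
print(lower, upper)
print(ext_lower, ext_upper)
```

Since x_min = 12 and x_max = 58 both lie inside the ordinary limits, this data set would have no outliers under the rule.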

Measure of variation: variance

- Average of squared deviations of values from the mean

- Population variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

- Sample variance (divided by n); the second form is faster to calculate

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$$

- Sample quasi-variance (corrected sample variance; divided by n − 1)

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$

- They are related via

$$\hat{\sigma}^2 = \frac{n - 1}{n}\, s^2$$

- If a, b (b ≠ 0) are real numbers and y = a + bx, then $s_y^2 = b^2 s_x^2$
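
The "faster to calculate" form follows from expanding the square and using $\sum_{i=1}^{n} x_i = n\bar{x}$. A short derivation, added here as a reading aid:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$

Dividing both sides by n or by n − 1 gives the two shortcut formulas.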

Measure of variation: standard deviation (SD)

- The most commonly used measure of spread

- Population standard deviation, sample standard deviation (divided by n) and sample quasi-standard deviation (divided by n − 1) are, respectively,

$$\sigma = \sqrt{\sigma^2}, \qquad \hat{\sigma} = \sqrt{\hat{\sigma}^2}, \qquad s = \sqrt{s^2}$$

- Shows variation about the mean

- Has the same units as the original data, whilst variance is in squared units

- Variance and SD are both affected by outliers

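Calculating variance and standard deviation

Example: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20

$$\bar{x} = \frac{124}{8} = 15.5, \qquad \bar{y} = \frac{124}{8} = 15.5, \qquad \bar{z} = \frac{124}{8} = 15.5$$

$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000$$

$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928$$

$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068$$

$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429 \;\Rightarrow\; s_x = 3.3381$$

$$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571 \;\Rightarrow\; s_y = 0.9258$$

$$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571 \;\Rightarrow\; s_z = 4.5670$$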
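
These hand calculations can be reproduced with a few lines of Python. This is a sketch using the shortcut formula for the quasi-variance; the function name is hypothetical.

```python
from math import sqrt

def quasi_variance(values):
    """s^2 = (sum of x_i^2 - n * mean^2) / (n - 1), the shortcut formula."""
    n = len(values)
    mean = sum(values) / n
    return (sum(v * v for v in values) - n * mean ** 2) / (n - 1)

samples = {
    "X": [11, 12, 13, 16, 16, 17, 18, 21],
    "Y": [14, 15, 15, 15, 16, 16, 16, 17],
    "Z": [11, 11, 11, 12, 19, 20, 20, 20],
}

for name, values in samples.items():
    s2 = quasi_variance(values)
    print(name, round(s2, 4), round(sqrt(s2), 4))
```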

Comparing standard deviations

Example cont.: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20

[Dot plots of X, Y and Z on a common axis from 11 to 21]

$\bar{x} = 15.5$, $s_x = 3.3$

$\bar{y} = 15.5$, $s_y = 0.9$

$\bar{z} = 15.5$, $s_z = 4.6$

Measure of variation: coefficient of variation (CV)

- Measures relative variation and is defined as

$$CV = \frac{s}{|\bar{x}|}$$

- Is a unitless number (sometimes given as a percentage)

- Shows variation relative to the mean

Example:
Stock A: average price last year = 50, standard deviation = 5
Stock B: average price last year = 100, standard deviation = 5

$$CV_A = \frac{5}{50} = 0.10, \qquad CV_B = \frac{5}{100} = 0.05$$

Both stocks have the same SD, but stock B is less variable relative to its mean price.
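
The stock comparison in Python, as a minimal sketch (the function name is hypothetical):

```python
def coefficient_of_variation(mean, sd):
    """CV = s / |mean|; undefined when the mean is zero."""
    return sd / abs(mean)

cv_a = coefficient_of_variation(50, 5)
cv_b = coefficient_of_variation(100, 5)
print(cv_a, cv_b)
```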

Numerical summaries and frequency tables. Standardization.

- If the data is discrete, then

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} \qquad \text{and} \qquad s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$$

- If the data is continuous, we replace $x_i$ in the above definition by the mid-points of the class intervals

- To standardize variable x means to calculate

$$\frac{x - \bar{x}}{s}$$

- If you apply this formula to all observations $x_1, \dots, x_n$ and call the transformed ones $z_1, \dots, z_n$, then the z's have mean zero and standard deviation one

- Standardization = finding the z-score
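
The claim about the z's can be verified numerically. The sketch below standardizes the sample X: 11, 12, 13, 16, 16, 17, 18, 21 from the variance example, using the quasi-standard deviation s; the function names are hypothetical.

```python
from math import sqrt

def quasi_sd(values):
    """Quasi-standard deviation s (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def standardize(values):
    """z_i = (x_i - mean) / s for every observation."""
    mean = sum(values) / len(values)
    s = quasi_sd(values)
    return [(v - mean) / s for v in values]

x = [11, 12, 13, 16, 16, 17, 18, 21]
z = standardize(x)
z_mean = sum(z) / len(z)
z_sd = quasi_sd(z)
print(round(z_mean, 10), round(z_sd, 10))
```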

Measures of form

- Fisher's coefficient of asymmetry

- Fisher's coefficient of kurtosis

- Empirical rule

Shape: comparing mode, mean and median

Three types of distributions:

- Skewed to the left: Mean < Median < Mode

- Symmetric: Mean = Median = Mode

- Skewed to the right: Mode < Median < Mean

[Figure: three density curves. Left-skewed: $\bar{x} < M$. Symmetric: $\bar{x} = M$. Right-skewed: $M < \bar{x}$.]

Note: The distribution in the middle is known as bell-shaped or normal

Measures of form: Asymmetry

- Fisher's coefficient of asymmetry:

$$\gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{S^3}$$

The data is skewed to the right (positive) if $\gamma_1 > 0$, and vice versa.

[Histogram "Right asymmetry" (vertical axis: frequency): $\gamma_1 = 2.236$]

[Histogram "Left asymmetry" (vertical axis: frequency): $\gamma_1 = -1.401$]
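
A Python sketch of this coefficient follows. The slide does not say which standard deviation S denotes; the code assumes the sample standard deviation that divides by n, and the function name is hypothetical.

```python
from math import sqrt

def fisher_asymmetry(values):
    """gamma_1 = (1/n) * sum((x_i - mean)^3) / S^3.

    Assumption: S is the standard deviation computed with divisor n.
    """
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / n)
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / s ** 3

symmetric = fisher_asymmetry([1, 2, 3, 4, 5])
right_skewed = fisher_asymmetry([3, 1, 50, 4, 2])     # one large value
left_skewed = fisher_asymmetry([-3, -1, -50, -4, -2]) # mirror image
print(symmetric, right_skewed, left_skewed)
```

The sign of the result matches the direction of the long tail: zero for the symmetric sample, positive with a large outlier on the right, negative for its mirror image.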

Measures of form: kurtosis

- Fisher's coefficient of kurtosis:

$$\gamma_2 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4}{S^4} - 3$$

- For the standard normal, $\gamma_2 = 0$. The distribution is leptokurtic (sharper than the standard normal) if $\gamma_2 > 0$ and platykurtic if $\gamma_2 < 0$

[Density plot: leptokurtic distribution]

[Density plot: platykurtic distribution]
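
The kurtosis coefficient can be sketched in the same way. As with the asymmetry coefficient, the code assumes that S is the standard deviation with divisor n, which the slide does not state; the function name and the two-point test sample are made up for illustration.

```python
def fisher_kurtosis(values):
    """gamma_2 = (1/n) * sum((x_i - mean)^4) / S^4 - 3.

    Assumption: S^2 is the variance computed with divisor n.
    """
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((v - mean) ** 2 for v in values) / n   # S^2
    fourth_moment = sum((v - mean) ** 4 for v in values) / n
    return fourth_moment / second_moment ** 2 - 3

# A sample concentrated on two points has no peak and no tails: platykurtic.
flat = fisher_kurtosis([-1, 1, -1, 1])
print(flat)
```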

Empirical rule

If the data is bell-shaped (normal), that is, symmetric and with light tails, the following rule holds approximately:

- 68 % of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$

- 95 % of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$

- 99.7 % of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$

Note: This rule is also known as the 68-95-99.7 rule.

Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data is bell-shaped, give the limits of an interval that captures 95 % of the observations.

95 % of the $x_i$'s are in: $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$
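
The same calculation as a small Python sketch (the function name and the dictionary of coverage levels are hypothetical conveniences):

```python
def empirical_interval(mean, s, k):
    """Interval (mean - k*s, mean + k*s) used by the empirical rule."""
    return mean - k * s, mean + k * s

coverage = {1: "68 %", 2: "95 %", 3: "99.7 %"}
for k, share in coverage.items():
    print(share, empirical_interval(40, 5, k))
```

With a sample of 100 observations, the k = 2 interval would be expected to contain about 95 of them, provided the bell-shape assumption holds.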