Transcript of Statistical Analysis, Harry R. Erwin, PhD, School of Computing and Technology, University of Sunderland

Page 1

Statistical Analysis

Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland

Page 2

Resources

• Rowntree, D. (1981) Statistics Without Tears. Harmondsworth: Penguin.

• Hinton, P.R. (1995) Statistics Explained. London: Routledge.

• Hatch, E.M. and Farhady, H. (1982) Research Design and Statistics for Applied Linguistics. Rowley, Mass.: Newbury House.

• Crawley, M.J. (2005) Statistics: An Introduction Using R. Wiley.

• Gonick, L. and Smith, W. (1993) The Cartoon Guide to Statistics. HarperResource (for fun).

Page 3

Module Outline

• Day One Lectures
  – Introduction
  – Using R
  – Probability (the laws of chance)

• Day Two Lectures
  – Data analysis (the gathering, display, and summarisation of data)
  – Experimental design (planning and sampling)
  – Statistical inference (the drawing of conclusions from your data, knowing probability)
  – Data modelling (regression, ANOVA, and ANCOVA)

Page 4

Why the second day is important

• You don’t know which tests to use unless you know how your data are structured, so you do data analysis.

• Your experimental design is based on what you know beforehand of the data.

• Inference is the drawing of conclusions for your research: what you can prove.

• Modelling tells you what more detailed conclusions are supportable. This involves throwing out the factors that are not important.

Page 5

Data Analysis

• Central tendency
• Degrees of freedom
• Variance
• A worked example
• Confidence intervals
• Single sample

Page 6

Measures of Central Tendency

• yvals <- read.table("yvalues.txt", header=T)
• attach(yvals)
• Create a histogram of the data: hist(y)
• Observe the mode, the most common value.
• Arithmetic mean is (sum of data values)/(number of values):
• total <- sum(y)
• n <- length(y)
• ybar <- total/n
• ybar
• mean(y)

Page 7

Median

• The ‘middle value’
• ysorted <- sort(y)
• middleIndex <- ceiling(length(y)/2)
• ysorted[middleIndex]
• median(y)
• set <- c(1,10,1000,10,1)
• Geometric mean: exp(mean(log(set)))
• Harmonic mean: 1/mean(1/set)
• detach(yvals)
• ls()
• rm(any variables you don’t need)

Page 8

Measures of Spread

• In addition to describing the central point of a data set, we’re concerned with the data spread.

• Two measures:
  – Interquartile spread
  – Standard deviation/variance

Page 9

Interquartile Range

• Break the data into four equal groups:
  – First through third quartiles

– The median is the second quartile, Q2

– The median of the low group is the first quartile or Q1

– The median of the high group is Q3

– The IQR is Q3-Q1
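
A quick way to check these numbers in R (a sketch using base functions; y stands for whatever sample vector you are working with, and R's default quantile rule may differ slightly from the split-into-halves recipe above):

    quantile(y)   # minimum, Q1, median (Q2), Q3, maximum
    IQR(y)        # Q3 - Q1
    fivenum(y)    # Tukey's five-number summary, as used by boxplot()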

Page 10

Box and Whiskers Plot (Tukey)

[Figure: Tukey box-and-whisker plot. The box spans Q1 to Q3 (the IQR) with the median marked inside; the whiskers extend to the furthest non-outlier values in both directions; points more than 1.5 IQR beyond the box are drawn as outliers.]

Page 11

Standard Deviation and Variance

• Standard measure of spread (called sd in R).

• Defined as the typical distance by which a value differs from the mean, using squared distances. (Remember geometry?) The square of the standard deviation is the variance (called var in R).

• When sample data (count = N) are used to compute estimates of both the mean and the variance, the latter is computed by dividing by N-1. If the variance is estimated by dividing by N, the result is biased low.

• The sample mean and standard deviation describe a bell-shaped curve very well if N is at least 30.

• For N<30, the t distribution applies.
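
To see the N-1 divisor at work, var() can be reproduced by hand (a sketch; y is a numeric sample):

    n <- length(y)
    ss <- sum((y - mean(y))^2)  # sum of squares about the mean
    ss / (n - 1)                # unbiased sample variance, identical to var(y)
    sqrt(ss / (n - 1))          # sample standard deviation, identical to sd(y)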

Page 12

Using R for this

• data <- c(3,5,7,7,38)
• mean(data)
• sd(data)
• var(data)
• median(data)
• quantile(data)
• fivenum(data)
• boxplot(data)

Page 13

Random Variables

• Imagine an experiment repeated many times. The notation for a random variable is X.

• The notation for a single value of X is x.

• You can define central tendency and spread just like you can for sample data. You can also predict their values.

• R gives you basic functions to compute these.

Page 14

Plotting a random variable

• hist(rbinom(10000,2,0.5)) (coin flips)

• die <- c(1,2,3,4,5,6)
  a <- numeric(10000)
  for (i in 1:10000) { a[i] <- sample(die,1,replace=TRUE,c(1,1,1,1,1,1)) }

• hist(a, breaks=0:6+0.5) (die roll)

• for (i in 1:10000) { a[i] <- sample(die,1,replace=TRUE,c(1,1,1,1,1,1)) + sample(die,1,replace=TRUE,c(1,1,1,1,1,1)) }

• hist(a, breaks=0:12+0.5) (two-dice roll)

Page 15

Mean and Variance of Random Variables

• µ = the sum over all possible values x of (x times the probability of x). This is the mean.

• Note that this involves area: for a continuous distribution such as the normal, the sum becomes an integral.

• The variance, σ², is the sum over all x of ((x − µ)² times the probability of x).

• The standard deviation is σ.
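
These two definitions can be evaluated directly for a small discrete case, a fair die (my illustration, not from the slides):

    x <- 1:6
    p <- rep(1/6, 6)               # equal probabilities for each face
    mu <- sum(x * p)               # mean: 3.5
    sigma2 <- sum((x - mu)^2 * p)  # variance: 35/12, about 2.92
    sqrt(sigma2)                   # standard deviation: about 1.71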

Page 16

Some Continuous Distributions

• The density function, multiplied by a small increment ∆x, gives the probability of a sample X lying between x and x+∆x.

• The density is labelled d<name>, where <name> is the distribution's name in R, for example dbinom or dnorm. The integral of the density is called the cumulative probability distribution. So you get:
  – dnorm, the density function
  – pnorm, the cumulative probability function
  – qnorm, the inverse of the cumulative probability function
  – rnorm, to draw random numbers from the distribution
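
For the normal distribution, the four functions look like this (values rounded):

    dnorm(0)      # density at x = 0: about 0.399
    pnorm(1.96)   # cumulative probability up to 1.96: about 0.975
    qnorm(0.975)  # inverse of the above: about 1.96
    rnorm(5)      # five random draws from the standard normal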

Page 17

Excursus: Degrees of Freedom

• Suppose you have a sample of five numbers (2,7,4,0,7) and their mean is 4. What is the sum of the five numbers?

• If you know the mean and four of the numbers, how many values can the fifth one have?

• The five numbers must sum to 5 × 4 = 20, so the fifth number is forced: if you are calculating the sample standard deviation and you have the sample mean, you have one less data point than you think you do.

• df = sample size minus the number of parameters, p, you’ve estimated from the data. (Memorize!)

• variance = (sum of squares)/(degrees of freedom)

Page 18

A Worked Example

• gardens.txt in Data.

• Note that you can test whether two samples probably come from the same distribution (the null hypothesis). You do this by calculating the ratio of the variances and applying the F test.

• In R, this is handled by applying var.test() (sketched below).

• The t and ANOVA tests comparing means assume equal variance, so you must check this first! If the F test tells you that you don't have equal variance, don't go any further.
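
A minimal sketch of that check in R; the column names gardenA and gardenB are my assumption about how gardens.txt is laid out, so adjust them to the actual file:

    gardens <- read.table("gardens.txt", header = TRUE)
    var.test(gardens$gardenA, gardens$gardenB)  # F test of equal variances
    # Only if this is not significant should you go on to compare the means.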

Page 19

Confidence Intervals

• Variance is used for testing hypotheses and for establishing confidence intervals (measures of unreliability)

• You want your measure of unreliability to
  – Go up if variance increases
  – Go down if the sample size increases

• SE (standard error) = √(s²/n) has those properties (see the one-liner below).

• You write this as:
  – “the mean ozone concentration in Garden A was 3.0 +/- 0.365 pphm (1 s.e., n=10)”
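
Base R has no ready-made standard-error function, but the formula above is a one-liner (a sketch):

    se <- function(x) sqrt(var(x) / length(x))  # standard error of the mean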

Page 20

More on Confidence Intervals

• You can use the assumption of a normal distribution if n>= 30, but if you have a smaller sample, you usually use Student’s t-distribution.

• For the quantiles of this distribution, use qt().
• For a 95% confidence interval, use the t value at cumulative probability 0.975 (two tails of 0.025 each): qt(0.975,9) = 2.262 standard errors. Similarly, qt(0.995,9) = 3.249836 and qt(0.9975,9) = 3.689662.

• For Garden B (small sample)
  – “the mean ozone concentration in Garden B was 5.0 +/- 0.826 (95% C.I., n = 10).”

• There is a better way—bootstrapping—but it’s complex.
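
Putting the pieces together for a small-sample 95% confidence interval (a sketch using the se() helper defined above; t.test(x) reports the same interval):

    ci95 <- function(x) {
      halfwidth <- qt(0.975, df = length(x) - 1) * se(x)
      mean(x) + c(-1, 1) * halfwidth  # lower and upper limits
    }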

Page 21

Single Sample

• Questions to answer:
  – What is the mean value?
  – Is the mean value significantly different from expectation or theory?

– What is the level of uncertainty associated with our estimate of the mean?

• To be reasonably certain, we need to know if the data are normally distributed, have outliers, or show serial correlation.

Page 22

Worked Example

• Load das.txt and follow me.
• summary()
• plot()
• boxplot()
• hist()

Page 23

Normal Distribution

• According to the central limit theorem, if you take a large set of samples from a population and take their means, the means will be approximately normally distributed.

• Why this is so is deep math.
• The quantiles of the normal distribution are calculated by qnorm().

• Examples from book (55ff)

Page 24

Testing Normality

• A normal distribution is very easy to use, but you need to check first.

• Use qqnorm() and qqline().
• Examples (y)
• Examples (speed)
• Note non-normality. To test a mean when the distribution is non-normal, you don’t use Student’s t. Instead you use Wilcoxon’s signed-rank test.

• library(ctest)
• wilcox.test(speed, mu=990)

Page 25

Student’s t

• Use if sample sizes are <30 and the data are normally distributed.

• Use pt instead of pnorm; qt instead of qnorm

• Examples from book (67ff)

Page 26

Test Statistics for the Mean

• If you have 30 or more samples (n), the distribution of (X̄-µ)/(s/√n) is approximately normal. You can test whether the mean you computed (X̄) is significantly different from µ by calculating that probability.

• If you have fewer than 30 samples, (X̄-µ)/(s/√n) follows Student’s t distribution, and you need to use that instead (see the sketch below).

• Guess why ‘30’ is important…
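
By hand, the calculation looks as follows (a sketch; x is the sample and mu0 the hypothesised mean; in practice t.test(x, mu = mu0) does the same work):

    tstat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
    2 * pt(-abs(tstat), df = length(x) - 1)  # two-sided p-value, small n
    2 * pnorm(-abs(tstat))                   # normal approximation, n >= 30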

Page 27

Comparing Two Samples

• To compare two variances, use Fisher’s F test, var.test(). Do this first!

• For comparing sample means with normal errors, Student’s t test, t.test() (can be used for paired data)

• For comparing sample means with non-normal errors, Wilcoxon’s rank test, wilcox.test()

• For proportions, use the binomial test, binom.test() (binary data) or prop.test() (binomial proportions)

• For independence in contingency tables, chi-square test, chisq.test(), or Fisher’s exact test, fisher.test()

• For two correlated variables, cor.test()
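
In use, each of these takes the two samples directly. For example (A and B standing for two numeric vectors of measurements):

    var.test(A, B)               # equal variances? do this first
    t.test(A, B)                 # compare means, normal errors
    t.test(A, B, paired = TRUE)  # paired version of the same test
    wilcox.test(A, B)            # compare means, non-normal errors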

Page 28

Two Sample Examples

• Follow me on these. (73ff)

Page 29

Using χ²

• Lots of statistical data are in the form of counts

• Contingency tables show all the possible occurrences in a sample.

             Blue eyes   Brown eyes
Fair hair        38           11
Dark hair        14           51

The question is whether these two traits are statistically independent.

Page 30

Completing the table

               Blue eyes   Brown eyes   Row totals
Fair hair          38           11           49
Dark hair          14           51           65
Column totals      52           62          114

Page 31

Computing the probability of fair hair and blue eyes

• If and only if the two traits are independent, the probability of the combination will equal the product of the probabilities of the individual cases.

• The expected count is then 114 × (49/114) × (52/114) = 49 × 52/114, about 22 cases.

• Since the observed cell value is 38, the assumption of independence is at risk.

• What is the chance of the observed frequencies occurring by chance?

Page 32

The χ² Test

• The degrees of freedom in a contingency table equal (r-1)×(c-1), where r and c are the numbers of rows and columns.

• Here, df = 1.
• What certainty level do you want? 95% is typical.
• qchisq(0.95,1) = 3.841459
• count <- matrix(c(38,14,11,51), nrow=2)
• The data should be entered columnwise (like before).
• To test: chisq.test(count) (the full calculation is sketched below).
• Here, the association between fair hair and blue eyes is highly significant.

• If the expected frequencies are <= 5, use Fisher’s exact test instead, fisher.test(count) or combine cells.
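
The whole calculation, end to end, with the slide's counts:

    count <- matrix(c(38, 14, 11, 51), nrow = 2)  # entered columnwise
    chisq.test(count)$expected  # about 22.4 expected fair-hair/blue-eyes cases
    chisq.test(count)           # p-value far below 0.001: reject independence
    fisher.test(count)          # exact alternative for small expected counts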

Page 33

Summary

• We have seen ways of
  – Describing data
  – Testing single sample data against null hypotheses

– Testing two sample data against null hypotheses

Page 34

Experimental Design

Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland

Page 35

Lecture Outline

• Experimental Design
  – The process of defining how to collect data that will allow you to falsify a hypothesis.
  – How to do it.
  – Replication
  – Randomization

Page 36

Categorical variables

• These take discrete values.

• A complete experimental design investigates every combination. This is called a factorial design, and it is required for reliable results.

• For example, if you have two categorical variables, A and B, with two states, 1 and 2, each, you have to explore A1B1, A1B2, A2B1, and A2B2.

Page 37

Continuous Variables

• You have to sample at multiple values.

• For example, if an explanatory variable ranges between 1 and 10, you should run an experiment at 1, another at 10, and a few in between.

• This converts the continuous variable into a categorical variable.

Page 38

Sampling

• You may not be able to control the values of the categorical and continuous variables. In natural experiments, you need to sample randomly.

• The goal of random sampling is to move systematic response into the error term

• Take care to avoid systematic sampling. If necessary, flip a coin or generate a random number.

Page 39

Replication

• This means you repeat a measurement with a specific value of a categorical and/or continuous explanatory variable.

• This allows you to assess natural variability and measurement error.

• In many experiments, 30 replications is about the maximum necessary. Fewer may have to be accepted, but then take care in your analysis.

Page 40

Randomization

• You randomize to eliminate systematic errors.

• Avoid correlating your measurements in time and space.

• Avoid doing things that might introduce systematic effects.

• Avoid allowing your judgment to affect when, where, and with what/whom you do a given experiment. Assign treatments randomly.

Page 41

The Design

• The elements of an experimental design are the experimental units.

• The treatments are assigned to the units. (Note that this translates continuous variables to categorical ones).

• The objective of the design is to compare the treatments.

Page 42

Local Control

• Consider ways to reduce natural variability.

• One way is to group similar experimental units into blocks.

• Running all treatments on all blocks produces a randomised complete block design.

• If you have enough subjects, you can repeat the design. This increases replication.

Page 43

Analyzing the Results

• You allocate total variability among the different sources: each factor, systematic effects, and natural variability/measurement error.

• This is done using analysis of variance.

Page 44

Statistical Inference

Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland

Page 45

Statistical Inference

• Statistical inference is the drawing of conclusions from specific data knowing probability.

• Basically, you are assessing the probability of a hypothesis given your data.

• A null hypothesis is plausible but probably not true.

• You show this by demonstrating that the probability of the data you collected being generated if the null hypothesis were true is very small.

• This is called ‘falsifying a hypothesis’.

Page 46

How Do You Falsify a Hypothesis?

• Discuss

Page 47

The Null Hypothesis

• You start with a null hypothesis—a statistical statement that you intend to show is very unlikely.

• This is usually that the observations are due to chance

• Testing can involve the mean, the variance, or a comparison between two (or more) samples where one has a treatment and the other doesn’t.

Page 48

The Test Statistic

• This will be a statistic that assesses the evidence against the null hypothesis.

• The test statistic may follow a normal distribution (continuous data) or a binomial distribution (coin flipping), or the test may be a comparison to a second experiment with the treatment missing.

Page 49

Calculating the p value

• This is the probability, if the null hypothesis were true, of obtaining results at least as extreme as yours.

Page 50

Compare the p-value to a fixed significance level, α

• α is the probability of a false conclusion (rejecting a true null hypothesis) that you’re accepting. 0.05, 0.01, and 0.001 are typical.

• Choose α before calculating p. Otherwise you’re cheating.

Page 51

Large sample significance test for proportions

• The null hypothesis corresponds to a binomial distribution with some probability p0 of the coin coming up heads.

• The alternative hypothesis depends on the direction of the effect (p greater than p0, smaller than p0, or either).

• The test statistic is
  – z = (p̂ − p0)/√(p0(1 − p0)/n)

• This has the standard normal distribution.
• You can use qnorm() to calculate the values that correspond to your significance level.
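
As a sketch, with heads successes out of n tosses and null probability p0 (the variable names are mine):

    phat <- heads / n
    z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
    2 * pnorm(-abs(z))  # two-sided p-value; use one tail for a directional test
    # prop.test(heads, n, p = p0) is a ready-made large-sample test of the same hypothesis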

Page 52

Tests for the population mean

• The standardised distance of the sample mean from µ0 is
  – (X̄-µ0)/(s/√n)

• For n>= 30, this is normally distributed

• Apply qnorm() to calculate the values for the significance level.

• If n<30, use Student’s t, qt().

Page 53

Comparing Samples

• Discussed earlier:
  – First compare the variances using Fisher’s F test.

– If they’re not significantly different, then compare the means.

• Back to the gardening example

Page 54

Data Modelling

Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland

Page 55

Modelling

• This is the process of defining a minimal model for the data.

• There are five kinds of models:
  – Null model
  – Minimal adequate model
  – Current model
  – Maximal model
  – Saturated model

Page 56

Parsimony

• Prefer:
  – A model with fewer parameters
  – A model with fewer explanatory variables
  – A linear model
  – A model without a hump
  – A model without interactions

Page 57

Regression Analysis

• Handles continuous data.
• Fits a linear combination of explanatory variables to the data:
  – y = ax + b

• x is the independent or predictor variable

• y is the dependent or response variable.

• This says the value of y is equal to ax+b plus an error term.
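
In R the fit is a single call to lm(); the data frame and variable names here are placeholders:

    fit <- lm(y ~ x, data = mydata)  # fits y = ax + b plus an error term
    summary(fit)                     # slope, intercept, significance, R-squared
    plot(mydata$x, mydata$y)
    abline(fit)                      # overlay the fitted line on the scatterplot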

Page 58

Example

• We’ll work an example using R. (125ff)

Page 59

ANOVA

• When we’re working with categorical variables, we use analysis of variance. The best model for the data is the one that minimizes the average error term. You minimize that by minimizing the error variance.

• Worked example (155ff)
• One categorical variable results in one-way ANOVA.

• N variables result in N-way ANOVA, because we consider interactions between variables.
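
A sketch of both cases with aov(); the factor names are placeholders:

    summary(aov(y ~ A, data = mydata))      # one-way ANOVA
    summary(aov(y ~ A * B, data = mydata))  # two-way, including the A:B interaction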

Page 60

ANCOVA

• A mix of variables results in a mixed approach called analysis of covariance.

• Example. (187ff)
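
Mixing one continuous and one categorical explanatory variable (again with placeholder names):

    fit <- lm(y ~ x * group, data = mydata)  # separate slope and intercept per group
    summary(fit)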

Page 61

Simplifying the Model

• You start with a model containing all variables and interactions, then remove, one by one, the terms that aren’t significant. If a deletion results in an insignificant increase in deviance, leave the term out; otherwise put it back.

• Example (103ff)
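
One step of that process in R, continuing the ANCOVA sketch above:

    fit2 <- update(fit, . ~ . - x:group)  # delete the interaction term
    anova(fit2, fit)  # a large p-value means the deletion cost nothing,
                      # so keep the simpler model fit2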

Page 62

Other Actions

• You can transform the response and explanatory variables.

• You need to consider:
  – Constancy of variance
  – Normality of errors
  – Additivity

• Check your models!

Page 63

Summary

• There’s a lot more to statistics. We’d need about four times as much time to cover introductory statistics adequately

• Crawley is a good reference if you’re planning to do a statistical analysis.

• If your analysis has any complexity, consult a working statistician. I am an experimental scientist, not a working statistician, but I do run a weekly statistical surgery. This semester, it meets in DGIC 109 from 2-3 pm on Wednesdays.

• Good luck!