Introduction to Statistics Harry R. Erwin, PhD School of Computing and Technology University of Sunderland

Introduction to Statistics

Harry R. Erwin, PhD

School of Computing and Technology

University of Sunderland

Resources

• Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.

• Gentle, J (2002) Elements of Computational Statistics. Springer.

• Gonick, L, and Smith, W (1993) The Cartoon Guide to Statistics. HarperResource (for fun).

Who am I?

• Dr. Harry Erwin BS MA PhD MIET MBCS

• My PhD was awarded in bioinformatics. Although my research interests are in neuroscience, I’ve had the coursework and understand current research directions in computational biology and statistics. I’ve also had the coursework for a PhD in mathematics.

• I teach computing and neuroscience here at the University of Sunderland.

Doing Statistics

• Usually you do statistics to explore the structure of data. The questions you might ask are rather open-ended. Your understanding is facilitated by a model.

• A model embodies what you currently know about the data. You can formulate it either as a data-generating process or a set of rules for processing the data.

• We’ll look at modelling in detail later.

Statistical Models

• Often expressed as a set of equations relating data elements.

• Can include probability distributions for the elements. If this is the case, you have a stochastic model.

• The model should be free to evolve based on data mining.

Common Stochastic Models

• Parameterized statistical distributions, such as the normal distribution, binomial distribution, or the chi-squared distribution.

• Sometimes more complicated, where you might need to use simulation, resampling, and visualization to determine the parameters of the model.
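
As a rough sketch of the resampling idea mentioned above, the following Python snippet estimates an interval for a mean by bootstrapping. The sample values, the number of resamples, and the function name `bootstrap_mean_ci` are assumptions made for illustration; they do not come from the slides.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = random.Random(seed)
    # Resample the data with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    # Read the interval off the sorted resampled means.
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical sample, invented for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.0, 5.6]
low, high = bootstrap_mean_ci(sample)
print(f"mean = {statistics.fmean(sample):.2f}, interval = ({low:.2f}, {high:.2f})")
```

The percentile method is only one of several bootstrap intervals; it is used here because it needs no distributional assumptions beyond the data themselves.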

Structure-in-the-data

• Of most interest…, for example:

– Modes

– Gaps

– Clusters

– Symmetry

– Shape

– Deviations from normality

Visualization

• Multiple views are necessary, particularly for multivariate data.

• Be able to zoom in on the data as a few points can obscure the interesting structure.

• Scaling of the axes may be necessary, since our eyes are not perfect tools for detecting structure.

• Watch out for time-ordered or location-ordered data, particularly if time or location is not explicitly reported.

Plots

• Use simple plots to start with.

• Watch for rounded data—shown by horizontal strata in the data. That often signals other problems.

• There are a number of plotting tutorials; consult them.

Statistical Activities

• Data collection (ideally the statistician has a say in how the data are collected)

• Description of a dataset

– Averages

– Spreads

– Extreme points

• Inference within a model or collection of models

• Model selection
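
The description step (averages, spreads, extreme points) can be sketched with Python's standard `statistics` module. The helper name `describe` and the data values are hypothetical, chosen only to illustrate the three kinds of summary.

```python
import statistics

def describe(values):
    """Return averages, spreads, and extreme points for a list of numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "mean": statistics.fmean(ordered),    # average
        "median": median,                     # robust average
        "stdev": statistics.stdev(ordered),   # spread
        "iqr": q3 - q1,                       # robust spread
        "min": ordered[0],                    # extreme points
        "max": ordered[-1],
    }

# Hypothetical measurements, invented for illustration only.
data = [2.0, 3.5, 3.9, 4.2, 4.4, 4.8, 5.1, 5.5, 9.7]
summary = describe(data)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Reporting a robust measure alongside the mean and standard deviation makes it easier to notice when a single extreme point is driving the summary.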

How to Do It

• Start by determining what sort of statistical analysis you will be doing. You need to know:

– Which variable is the response variable?

– Which are the explanatory variables?

– What kind are the explanatory variables?

– What kind of response variable do you have?

• If you have multiple response variables, you need to do multivariate analysis (more advanced).

Basic Methods

• If all explanatory variables are continuous, plan on a regression analysis.

• If all explanatory variables are categorical, plan for an analysis of variance (ANOVA).

• If you have a mix, plan for an analysis of covariance (ANCOVA).
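
The three rules above amount to a small decision table. The sketch below encodes it in Python; the function name `choose_method` and the string labels "continuous" and "categorical" are assumptions made for this illustration.

```python
def choose_method(explanatory_kinds):
    """Map the kinds of explanatory variables to a basic method.

    `explanatory_kinds` is a list of strings, each either "continuous"
    or "categorical" (labels assumed for this sketch).
    """
    kinds = set(explanatory_kinds)
    if not kinds or not kinds <= {"continuous", "categorical"}:
        raise ValueError("expected 'continuous' and/or 'categorical'")
    if kinds == {"continuous"}:
        return "regression"
    if kinds == {"categorical"}:
        return "ANOVA"
    return "ANCOVA"  # a mix of both kinds

print(choose_method(["continuous", "continuous"]))
print(choose_method(["categorical"]))
print(choose_method(["continuous", "categorical"]))
```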

Effect of the Response Variable

• If the response variable is continuous, then plan on a normal regression, ANOVA, or ANCOVA.

• If the response variable is a proportion, do a logistic regression.

• If a count, you need a log-linear model.

• If binary, you need a binary logistic analysis.

• If time to event or time at death, you will be doing a survival analysis.
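
To make the difference between these models concrete, here is a minimal sketch of the transformations usually paired with them: the logit for proportions and binary responses, and the log for counts. The coefficient values are hypothetical and are not fitted to any data; the point is only that the predictions stay within the valid range for each kind of response.

```python
import math

def logit(p):
    """Log-odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Map any real number back to a proportion between 0 and 1."""
    return 1 / (1 + math.exp(-x))

# Hypothetical coefficients, invented for illustration only.
intercept, slope = -2.0, 0.5

def predicted_proportion(x):
    """Logistic model: the log-odds are linear in x."""
    return inv_logit(intercept + slope * x)

def predicted_count(x):
    """Log-linear model: the log of the expected count is linear in x."""
    return math.exp(intercept + slope * x)

for x in (0, 4, 8):
    print(x, round(predicted_proportion(x), 3), round(predicted_count(x), 3))
```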

Variation

• You want to understand how the response is dependent on variation in the explanatory variables, but you are also interested in lack of dependence.

• Design the simplest model that explains the data adequately.

Significance

• You have to determine what the probability of a false alarm will be—that is, the chance that you will think something is significant which really is not.

• Typical values are 5%, 1%, and 0.1%.

• Don’t test every hypothesis. Some will appear significant purely by chance.
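
A short calculation shows why testing many hypotheses is risky. Assuming the tests are independent and every null hypothesis is true (both simplifying assumptions for this sketch), the chance of at least one false alarm grows quickly with the number of tests. The function name is hypothetical.

```python
def chance_of_false_alarm(n_tests, alpha=0.05):
    """Probability of at least one false alarm among independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {chance_of_false_alarm(k):.2f}")
```

With 20 independent tests at the 5% level, the chance of at least one false alarm is already about 64%.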

Good and Bad Hypotheses

• ‘There are vultures in the local park.’

• ‘There are no vultures in the local park.’

• Which is testable?

• Discuss…

Answer

• The ‘null hypothesis’ is testable.

• ‘There are no vultures in the local park.’

• You test it by taking measurements and showing that if the null hypothesis were true, the chance of those measurements would be close to zero.

• Discuss further…
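
The logic of the test can be sketched with a simpler example than vultures. Suppose the null hypothesis is that a coin is fair, and you observe 18 heads in 20 tosses (numbers invented for illustration). The code computes the chance of a result at least that extreme if the null hypothesis were true; the function name `binomial_tail` is an assumption of this sketch.

```python
from math import comb

def binomial_tail(successes, trials, p_null=0.5):
    """P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical observation: 18 heads in 20 tosses of a supposedly fair coin.
p_value = binomial_tail(18, 20)
print(f"chance under the null hypothesis: {p_value:.6f}")
```

Because that chance is far below the usual 5% threshold, the null hypothesis of a fair coin would be rejected.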

Experimental Design

• Replication

– Increases reliability, so be thorough. Often the answer is ‘30’.

– Discuss why.

• Randomization

– Reduces systematic bias, so do it properly.

– Almost never done properly.

– Discuss why.
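
One way to do randomization properly is to let a random number generator, not the experimenter, decide which units receive the treatment. The following is a minimal sketch; the unit labels, the even split, and the function name `randomize` are assumptions for illustration.

```python
import random

def randomize(units, seed=None):
    """Randomly split `units` into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(units)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical experimental units, invented for illustration only.
plots = [f"plot-{i:02d}" for i in range(1, 13)]
groups = randomize(plots, seed=42)
print(groups["treatment"])
print(groups["control"])
```

Fixing the seed makes the allocation reproducible, which helps when the assignment has to be documented or audited later.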

Controls

• “No controls, no conclusions.”

• A ‘control experiment’ is one where you don’t apply the treatment or don’t enable the part of your experiment that is supposed to produce the different outcome.

• You’re comparing the results when the treatment is applied to the results with no treatment.

Replication

• Must be independent

• Not part of a time series

• Not grouped together in space

• Of an appropriate spatial scale

• Covers the normal variation in initial conditions.

Error Types

                       | Null hypothesis actually true                                 | Null hypothesis actually false
Accept null hypothesis | Correct (no paper but no embarrassment)                       | Type II (β) error (further experiments can change this)
Reject null hypothesis | Type I (α) error (can result in a paper you have to withdraw) | Correct (a publishable paper)

Typical α and β values

• You usually want the probability of rejecting the null hypothesis (α) when it is true to be less than 5%.

• You usually want the probability of accepting the null hypothesis (β) when it is false to be less than 20%.

• The power of a test is 1 − β, or greater than 80% in this case.

• Rule of Thumb: the number of replicates to reject the null hypothesis with probability 80% is about 8s²/d², where s² is the variance in the response and d is the size of the difference to be detected in a single sample.
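
The rule of thumb is easy to apply in code. The sketch below also compares it with a normal-approximation calculation for a two-sided test at the 5% level with 80% power, which is one plausible origin for the factor of 8; that comparison is an assumption of this example, not something stated on the slide. The function names and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def replicates_rule_of_thumb(variance, difference):
    """Number of replicates from the 8 * s^2 / d^2 rule of thumb."""
    return math.ceil(8 * variance / difference**2)

def replicates_normal_approx(variance, difference, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided, one-sample test."""
    z = NormalDist()
    # (z for alpha/2 plus z for the power) squared is roughly 7.85.
    factor = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    return math.ceil(factor * variance / difference**2)

# Hypothetical inputs: variance of 4 and a difference of 1 to be detected.
print(replicates_rule_of_thumb(4.0, 1.0))
print(replicates_normal_approx(4.0, 1.0))
```

Halving the difference to be detected quadruples the number of replicates required, which is why small effects are expensive to demonstrate.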

Inference

• Strong inference

– A clear hypothesis

– An acceptable test

• Weak inference

– Natural experiments

• Conclusions from natural experiments are hypotheses. They can still produce good papers.

• Discuss

How Long to Go On?

• Do you stop the experiment as soon as a pleasing result is obtained?

• Do you keep going until the theoretically correct result is obtained?

• Discuss.

• Gregor Mendel’s experiments with peas.
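
The cost of stopping as soon as a pleasing result appears can be explored by simulation. In this sketch every experiment is run with no real effect, so any 'significant' result is a false alarm. The setup (standard normal data with known standard deviation, a z-test at the 5% level, up to 100 observations, checking from the 10th onward) is an assumption chosen for illustration.

```python
import math
import random

def false_alarm_rate(peek, n_max=100, n_min=10, n_experiments=2000, seed=1):
    """Fraction of null experiments declared significant at the 5% level.

    Data are standard normal with true mean 0, so every rejection is a
    false alarm. With peek=True the z-test is run after each new
    observation from n_min onward and the experiment stops at the first
    'significant' result; with peek=False it is run once, at n_max.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            z = total / math.sqrt(n)  # known standard deviation of 1 assumed
            check_now = (peek and n >= n_min) or n == n_max
            if check_now and abs(z) > 1.96:
                rejections += 1
                break
    return rejections / n_experiments

print("test once at the end:      ", false_alarm_rate(peek=False))
print("stop at first 'significant':", false_alarm_rate(peek=True))
```

Testing once at a planned sample size keeps the false alarm rate near the nominal 5%, while stopping at the first pleasing result pushes it well above that.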