Statistics in MATLAB COMM2M Harry R. Erwin, PhD University of Sunderland.

Date posted: 21-Dec-2015

Statistics in MATLAB

COMM2M

Harry R. Erwin, PhD

University of Sunderland

Resources

• http://www.mathworks.com/access/helpdesk/help/pdf_doc/stats/stats.pdf This can be found in the COMM2M Lectures folder as STATS.PDF.

• Higham and Higham, 2000, MATLAB Guide, SIAM.

• James E. Gentle, 2002, Elements of Computational Statistics, Springer.

• Wendy L. Martinez & Angel R. Martinez, 2002, Computational Statistics Handbook with MATLAB, Chapman & Hall/CRC.

• Michael J. Crawley, 2005, Statistics: An Introduction Using R, Wiley. Our Statistics Study Group is working through this.

Doing Computational Statistics

• Usually you do computational statistics to explore the structure of data. The questions you might ask are rather open-ended. Your understanding is facilitated by a model.

• A model embodies what you currently know about the data. You can formulate it either as a data-generating process or a set of rules for processing the data.

Statistical Models

• Often expressed as a set of equations relating data elements.

• Can include probability distributions for the elements. If this is the case, you have a stochastic model.

• The model should be free to evolve based on data mining.

Common Stochastic Models

• Parameterized statistical distributions, such as the normal distribution, binomial distribution, or the chi-squared distribution.

• Sometimes the model is more complicated, and you need simulation, resampling, and visualization to determine its parameters.

Structure-in-the-data

• Of most interest are, for example:
  – Modes
  – Gaps
  – Clusters
  – Symmetry
  – Shape
  – Deviations from normality

Visualization

• Multiple views are necessary.

• Be able to zoom in on the data, since a few points can obscure the interesting structure.

• Scaling of the axes may be necessary, since our eyes are not perfect tools for detecting structure.

• Watch out for time-ordered or location-ordered data, particularly if the time or location is not explicitly reported.

Plots

• Use simple plots to start with.

• Watch for rounded data, which shows up as horizontal strata in the plot. Rounding often signals other problems.

Statistical Activities

• Data collection (ideally the statistician has a say in how the data are collected)

• Description of a dataset
  – Averages
  – Spreads
  – Extreme points

• Inference within a model or collection of models

• Model selection

How to Do It

• Start by determining what sort of statistical analysis you should do. You need to know:
  – Which variable is the response variable?
  – Which are the explanatory variables?
  – What kind are the explanatory variables?
  – What kind of response variable do you have?

Basic Method of Analysis

• If all explanatory variables are continuous, plan on a regression analysis.

• If all explanatory variables are categorical, plan for an analysis of variance (ANOVA).

• If you have a mix, plan for an analysis of covariance (ANCOVA)

Effect of Response Variable

• If the response variable is continuous, then plan on a normal regression, ANOVA, or ANCOVA.

• If the response variable is a proportion, do a logistic regression.

• If the response is a count, you need a log-linear model.

• If the response is binary, you need a binary logistic analysis.

• If time to event or time at death, you will be doing a survival analysis.
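
The two sets of rules above (kind of explanatory variables, kind of response variable) amount to a small lookup. The sketch below encodes them in Python, since the slides give no code of their own. The function name `choose_analysis` and the string labels are hypothetical, chosen only for illustration.

```python
def choose_analysis(explanatory_kinds, response_kind):
    """Suggest an analysis from the kinds of variables involved.

    explanatory_kinds: iterable of "continuous" / "categorical" labels.
    response_kind: "continuous", "proportion", "count", "binary"
                   or "time-to-event".
    Returns (method for the explanatory side, model for the response side).
    """
    kinds = set(explanatory_kinds)
    if not kinds or not kinds <= {"continuous", "categorical"}:
        raise ValueError("explanatory kinds must be 'continuous' or 'categorical'")

    # Rule 1: the explanatory variables decide the basic method.
    if kinds == {"continuous"}:
        method = "regression"
    elif kinds == {"categorical"}:
        method = "ANOVA"
    else:
        method = "ANCOVA"  # a mix of continuous and categorical

    # Rule 2: the response variable decides the kind of model.
    models = {
        "continuous": "normal regression, ANOVA, or ANCOVA",
        "proportion": "logistic regression",
        "count": "log-linear model",
        "binary": "binary logistic analysis",
        "time-to-event": "survival analysis",
    }
    if response_kind not in models:
        raise ValueError("unknown response kind: %r" % (response_kind,))
    return method, models[response_kind]
```

For example, `choose_analysis(["continuous", "categorical"], "count")` suggests an ANCOVA-style design fitted as a log-linear model.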

Variation

• You want to understand how the response is dependent on variation in the explanatory variables, but you are also interested in lack of dependence.

• Design the simplest model that explains the data adequately.

Significance

• You have to decide what probability of a false alarm you will accept, that is, the probability of judging something significant when it really is not.

• Typical values are 5%, 1%, and 0.1%.

• Don’t test every hypothesis. Some will appear significant purely by chance.
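
The warning about testing every hypothesis can be checked by simulation. The sketch below, written in Python with hypothetical names and arbitrary default settings, runs many tests on data for which the null hypothesis is true by construction and counts how many come out "significant" at the 5% level. It assumes a two-sided z-test with known unit variance, chosen only to keep the example short.

```python
import random
from statistics import NormalDist, fmean

def false_alarms(n_tests=200, n_obs=30, alpha=0.05, seed=1):
    """Count 'significant' results when every null hypothesis is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    count = 0
    for _ in range(n_tests):
        # Data drawn from N(0, 1), so the true mean really is zero.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        # z statistic for "mean = 0" with known standard deviation 1.
        z = fmean(sample) * n_obs ** 0.5
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            count += 1
    return count

print(false_alarms(), "false alarms out of 200 true null hypotheses")
```

On average about `alpha * n_tests` of the tests raise a false alarm, so with 200 tests at 5% roughly ten "discoveries" are expected even though nothing is there.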

Good and Bad Hypotheses

• ‘There are vultures in the local park.’

• ‘There are no vultures in the local park.’

• Which is testable?

• The null hypothesis is testable. You test it by taking measurements and showing that, if the null hypothesis were true, the chance of obtaining those measurements would be nearly zero.

Experimental Design

• Replication
  – Increases reliability, so be thorough.
  – The usual answer to "how many replicates?" is ‘30’.

• Randomization
  – Reduces bias, so do it properly
  – Almost never done properly
  – Discuss
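
One way to randomize properly is to let a random number generator, rather than the experimenter, decide which unit receives which treatment. The Python sketch below assumes a balanced design (equal numbers per treatment). The function name and the plot labels are hypothetical.

```python
import random

def randomize_treatments(units, treatments, seed=None):
    """Assign treatments to experimental units at random, in equal numbers."""
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)
    # Build a balanced list of labels, then shuffle it so that the
    # assignment does not depend on the order the units were listed in.
    labels = list(treatments) * (len(units) // len(treatments))
    rng.shuffle(labels)
    return dict(zip(units, labels))

plots = ["plot1", "plot2", "plot3", "plot4", "plot5", "plot6"]
print(randomize_treatments(plots, ["control", "treated"], seed=0))
```

Recording the seed makes the allocation reproducible without making it predictable in advance.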

Controls

• “No controls, no conclusions.”

• A control experiment is one where you don’t apply the treatment or don’t enable the part of your experiment that is supposed to produce the different outcome.

Replication

• Replicates must be independent.

• They must not form part of a time series.

• They must not be grouped together in space.

• They should be of an appropriate spatial scale.

• They should cover the normal variation in initial conditions.

Error Types

                         | Null hypothesis actually true | Null hypothesis actually false
Accept null hypothesis   | Correct                       | Type II (β) error
Reject null hypothesis   | Type I (α) error              | Correct

Typical α and β values

• You usually want the probability of rejecting the null hypothesis when it is true (α) to be less than 5%.

• You usually want the probability of accepting the null hypothesis when it is false (β) to be less than 20%.

• The power of a test is 1 − β, or greater than 80% in this case.

• Rule of thumb: the number of replicates needed to reject the null hypothesis with probability 80% is about $n \approx 8s^2/d^2$, where $s^2$ is the variance in the response and $d$ is the size of the difference to be detected in a single sample.
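
The rule of thumb above is easy to turn into a small calculator. The Python sketch below simply evaluates 8 × variance / difference² as stated on the slide and rounds up to a whole number of replicates. The function name is hypothetical, and the constant 8 is taken from the slide rather than derived here.

```python
import math

def replicates_needed(variance, difference):
    """Rule-of-thumb replicate count: about 8 * s^2 / d^2, rounded up."""
    if variance < 0:
        raise ValueError("variance must be non-negative")
    if difference == 0:
        raise ValueError("difference to detect must be non-zero")
    return math.ceil(8.0 * variance / difference ** 2)

# Variance 4 and a difference of 2 to detect: 8 * 4 / 4 = 8 replicates.
print(replicates_needed(4.0, 2.0))
# Halving the detectable difference quadruples the requirement: 32.
print(replicates_needed(4.0, 1.0))
```

Because the difference enters as a square, detecting small effects quickly becomes expensive, while noisier responses raise the cost only linearly in the variance.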

Inference

• Strong inference
  – A clear hypothesis
  – An acceptable test

• Weak inference
  – Natural experiments

• Conclusions from natural experiments are hypotheses.

How Long to Go On?

• Should you stop the experiment as soon as a pleasing result is obtained?

• Or should you keep going until the theoretically correct result is obtained?

• Discuss.
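
As a starting point for that discussion, the Python sketch below simulates what happens to the false alarm rate when an experimenter checks the data after every batch and stops at the first "significant" result. All names and settings are hypothetical, the null hypothesis is true by construction, and a two-sided z-test with known unit variance is assumed for brevity.

```python
import random
from statistics import NormalDist

def false_positive_rate(peek, n_experiments=2000, max_n=100, batch=10,
                        alpha=0.05, seed=2):
    """Fraction of null experiments declared significant.

    peek=True  : test after every batch and stop at the first significant result.
    peek=False : test once, after all max_n observations are in.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for _ in range(n_experiments):
        total, n, z = 0.0, 0, 0.0
        significant = False
        while n < max_n:
            for _ in range(batch):
                total += rng.gauss(0.0, 1.0)  # true mean is zero
            n += batch
            z = total / n ** 0.5  # sample mean times sqrt(n), sigma = 1
            if peek and abs(z) > z_crit:
                significant = True
                break
        if not peek:
            significant = abs(z) > z_crit
        if significant:
            hits += 1
    return hits / n_experiments

print("fixed sample size:", false_positive_rate(peek=False))
print("stop when pleased:", false_positive_rate(peek=True))
```

With a fixed sample size the false alarm rate stays near the nominal 5%, whereas stopping at the first pleasing result pushes it well above that, which is one argument for fixing the number of replicates before the experiment starts.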

Statistics in MATLAB

• MATLAB has some useful statistical tools you can use to do all this (although most computational statistics is done using FORTRAN, SAS, R, or S-Plus).

• Supports the usual range of statistical tasks, including both analysis and visualization.

• What follows is an overview of the capabilities of the MATLAB Statistics Toolbox.

Statistics Capabilities

• Probability distributions
• Descriptive statistics
• Linear and non-linear models
• Hypothesis testing
• Multivariate statistics
• Plotting
• Statistical process control
• Design of experiments
• Hidden Markov models

Random number generators

• There are functions in the Statistics Toolbox that return random output.

• These allow the user to observe probability distributions, evaluate statistical tests, and use resampling techniques.
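
To illustrate how a random number generator supports resampling, the sketch below computes a percentile bootstrap interval for a mean. It is written in Python with the standard library rather than with the toolbox functions, and the function name, default settings, and data values are all made up for illustration.

```python
import random
from statistics import fmean

def bootstrap_mean_interval(data, n_resamples=2000, level=0.95, seed=3):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, recording the mean of each resample.
    means = sorted(fmean(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Read off the lower and upper percentiles of the resampled means.
    lo_index = int((1.0 - level) / 2.0 * n_resamples)
    hi_index = int((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lo_index], means[hi_index]

measurements = [4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.7, 5.0]  # illustrative values
low, high = bootstrap_mean_interval(measurements)
print("approximate 95% interval for the mean:", low, high)
```

The same pattern works for statistics that have no convenient formula for their standard error, such as a median or a ratio.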

Probability distributions

• These are used to display possible probability distributions and create histograms.

• MATLAB provides the pdf, cdf, inverse cdf (cdf⁻¹), a random number generator, and mean and variance estimators for each distribution.
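
The same five operations can be tried out on a single distribution. The sketch below uses Python's standard `statistics.NormalDist` class as a stand-in for the toolbox functions, with a normal distribution whose mean of 100 and standard deviation of 15 are arbitrary illustrative values.

```python
from statistics import NormalDist

dist = NormalDist(mu=100.0, sigma=15.0)

density_at_mean = dist.pdf(100.0)     # pdf: height of the density curve
prob_below_115 = dist.cdf(115.0)      # cdf: P(X <= 115), about 0.841
median = dist.inv_cdf(0.5)            # inverse cdf: the 50% quantile
draws = dist.samples(1000, seed=4)    # random number generator

# Estimate the parameters back from the random draws.
fitted = NormalDist.from_samples(draws)
print(prob_below_115, median, fitted.mean, fitted.variance)
```

The inverse cdf undoes the cdf, so `dist.inv_cdf(dist.cdf(x))` returns `x` up to rounding, and the fitted mean and variance should land close to 100 and 225 for a sample of this size.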

Continuous Distributions Provided

• Beta
• Exponential
• Extreme value
• Gamma
• Lognormal
• Normal
• Rayleigh
• Uniform
• Weibull

Continuous Statistical Distributions

• Chi-square
• Non-central Chi-square
• F
• Non-central F
• t
• Non-central t

Discrete distributions

• Binomial
• Discrete uniform
• Geometric
• Hypergeometric
• Negative binomial
• Poisson

Descriptive statistics

• mean
• median
• variance
• standard deviation
• Grouped data
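
A minimal sketch of these summaries, including the grouped case, is given below in Python using the standard `statistics` module rather than the toolbox itself. The helper names `describe` and `describe_by_group` and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, median, stdev, variance

def describe(values):
    """Mean, median, sample variance and sample standard deviation."""
    return {
        "mean": fmean(values),
        "median": median(values),
        "variance": variance(values),  # divides by n - 1
        "std": stdev(values),
    }

def describe_by_group(records):
    """Summarise (group label, value) pairs separately for each group."""
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return {label: describe(values) for label, values in groups.items()}

records = [("a", 2), ("a", 4), ("a", 6), ("b", 10), ("b", 14)]
print(describe_by_group(records))
```

Each group needs at least two values here, because the sample variance is undefined for a single observation.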

Linear and non-linear models

• ANOVA
• Covariance analysis (ANCOVA)
• Multiple linear regression
• Quadratic response surface models
• Stepwise regression
• GLM
• Robust and nonparametric methods
• Nonlinear least squares
• Regression and Classification Trees (CART)

Hypothesis testing

• Null hypothesis
• Alternative hypotheses
• Significance level
• p-value
• Confidence intervals
• A number of tests are provided (this is a hard area)
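
To make the terms above concrete, the sketch below computes a p-value with a two-sample permutation test. This test was picked for illustration because it needs nothing beyond a random number generator; the slides do not say it is among the tests the toolbox provides. The function name and the data are hypothetical.

```python
import random
from statistics import fmean

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=5):
    """Two-sided permutation p-value for a difference in means.

    Null hypothesis: the group labels are irrelevant, so any relabelling
    of the pooled data is as likely as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Count the observed labelling itself so the estimate is never zero.
    return (extreme + 1) / (n_permutations + 1)

p = permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print("p-value:", p, "significant at 5%:", p < 0.05)
```

The null hypothesis is rejected when the p-value falls below the chosen significance level; here the two groups do not overlap at all, so the p-value is small.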

Multivariate statistics

• Principal components analysis
• Factor analysis
• MANOVA
• Cluster analysis
• Multidimensional scaling

Plotting and Visualization

• Box plots
• Distribution plots
• Scatter plots

Statistical process control

• Quality of manufactured goods
  – Control charts
  – Capability studies

Design of experiments

• Full factorial designs
• Fractional factorial designs
• Response surface designs
• D-optimal designs

Hidden Markov Models

• Concepts
• Markov chains
• Analysis of hidden Markov models (HMMs)

Conclusions

• MATLAB provides a basic engineering toolkit for these statistical activities.

• Not as broad as R or S-Plus, but compatible with data collected or generated by other toolkits.

• Supports all activities well.

• More specialized work (e.g., Bayesian analysis) requires either your own extensions or more specialized toolkits.