Outliers Chapter 5.3 Data Screening. Outliers can Bias a Parameter Estimate.

Outliers

Chapter 5.3 Data Screening

Outliers can Bias a Parameter Estimate

…and the Error associated with that Estimate

Outliers

• Outlier – a case with an extreme value on one variable or on multiple variables

• Why?
– Data input error
– Not a population you meant to sample
– From the population, but the distribution has really long tails and very extreme values

Outliers

• Outliers – Two Types
• Univariate – for basic univariate statistics
– Use these when you have ONE DV or Y variable.

• Multivariate – for some univariate statistics and all multivariate statistics
– Use these when you have multiple continuous variables or lots of DVs.

Outliers

• Univariate
• In a standard normal (z) distribution, fewer than 0.3% of cases have a z-score beyond +/- 3.

• Therefore, we want to eliminate people whose scores are SO far away from the mean that they are very strange.
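
The z-score rule can be sketched in a few lines of R. This is a minimal illustration on made-up numbers (the `scores` vector is hypothetical, not data from this chapter), using `scale()` to standardize and a +/- 3 cut off to flag cases.

```r
# Hypothetical scores: the integers 1..30 plus one planted extreme value
scores <- c(1:30, 200)

# Share of a standard normal distribution beyond +/- 3 (about 0.0027)
tail_share <- 2 * pnorm(-3)

# Standardize, then flag any case more than 3 SDs from the mean
z <- as.numeric(scale(scores))
flagged <- abs(z) > 3

scores[flagged]   # the planted value, 200
```

Note that this has to be repeated for every column you care about, which is the motivation for moving to a multivariate approach.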

Outliers

• Univariate outliers are fine and dandy, but you may have lots of data and don’t want to do each column one at a time.
– Plus, the multivariate outlier analysis works just as well if it’s one column or 500, so let’s just do that.

Outliers

• Multivariate
– Now we need some way to measure distance from the mean (because z-scores are distances from the mean), but from the mean of means (that is, all the means at once!)

• Mahalanobis distance
– Creates a distance from the centroid (mean of means)

Outliers

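• Mahalanobis
• The centroid is the point where the means of all the variables meet (picture it in 3D for three variables); each case’s distance from that point is then measured
– Similar to Euclidean distance, but it also takes the variances and covariances of the variables into account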
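
To make the "distance from the centroid" idea concrete, here is a small sketch that computes the distance by hand and compares it with R's built-in `mahalanobis()`. The `toy` data frame is simulated purely for illustration. One detail worth knowing: `mahalanobis()` returns the squared distance.

```r
set.seed(1)
# Simulated stand-in data: three numeric columns, 50 cases
toy <- data.frame(x = rnorm(50), y = rnorm(50), z = rnorm(50))

center <- colMeans(toy)   # the centroid: one mean per column
S <- cov(toy)             # variances and covariances of the columns

# Deviation of every case from the centroid
diffs <- sweep(as.matrix(toy), 2, center)

# Squared Euclidean distance ignores the covariance structure
d2_euclid <- rowSums(diffs^2)

# Squared Mahalanobis distance rescales the deviations by the inverse covariance
d2_manual <- rowSums((diffs %*% solve(S)) * diffs)

# Built-in version (also squared)
d2_builtin <- mahalanobis(toy, center, S)

all.equal(unname(d2_manual), unname(d2_builtin))
```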

Outliers

• Mahalanobis
• No set cut off rule
– Use a chi-square table.
– DF = # of variables (DVs, variables that you used to calculate Mahalanobis)
– Use p<.001

NOTE: DF here has NOTHING to do with the DF for hypothesis testing.
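
For reference, the values you would read off a chi-square table at p < .001 can be generated directly. The sketch below lists the cut off for 1 to 5 variables; the range of 1 to 5 is just an illustrative choice.

```r
# Number of variables used to calculate Mahalanobis (this is the DF)
n_vars <- 1:5

# Critical chi-square value that leaves .001 in the upper tail
cutoffs <- qchisq(1 - .001, df = n_vars)

# Approximately 10.83, 13.82, 16.27, 18.47, 20.52
round(data.frame(df = n_vars, cutoff = cutoffs), 2)
```

The cut off grows with the number of variables, so a Mahalanobis score that is extreme for two variables may be unremarkable for ten.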

Outliers

• So do I delete them?
• Yes: they are far away from the middle!
• No: they may not affect your analysis!
• It depends: I need the sample size!
• SO?!
– Try it with and without them. See what happens.

FISH!
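
A "with and without" comparison might look like the sketch below. Everything here is simulated: two unrelated columns plus one planted extreme case, with a correlation standing in for whatever analysis you actually plan to run.

```r
set.seed(42)
# Simulated data: 40 ordinary cases with no built-in relationship
clean <- data.frame(x = rnorm(40), y = rnorm(40))

# Add one planted extreme case (hypothetical)
with_outlier <- rbind(clean, data.frame(x = 8, y = -8))

# Screen using the same recipe as the rest of this section
mahal <- mahalanobis(with_outlier, colMeans(with_outlier), cov(with_outlier))
cutoff <- qchisq(1 - .001, df = ncol(with_outlier))
noout <- with_outlier[mahal < cutoff, ]

# Run the same statistic both ways and compare
r_with <- cor(with_outlier$x, with_outlier$y)
r_without <- cor(noout$x, noout$y)
c(with = r_with, without = r_without)
```

If the two results tell the same story, the outliers are not driving the conclusion; if they differ, report that and explain which version you chose and why.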

Outliers

• Important side notes:
– For ANOVA, t-tests, correlation: you will use a fake regression analysis – it’s considered fake because it’s not the real analysis, just a way to get the information you need to do data screening.
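
The slides do not show the fake regression itself, so the following is only one guess at what such a stand-in might look like. It regresses an arbitrary random outcome (the `random` column and its `df = 7` are assumptions made for this sketch) on the screening variables, and shows that the leverage values from that model carry the same information as Mahalanobis, because leverage depends only on the predictors.

```r
set.seed(123)
# Simulated screening variables
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))

# Arbitrary made-up outcome; its values do not matter for screening
screen <- cbind(toy, random = rchisq(nrow(toy), df = 7))

# The "fake" regression: random outcome predicted by every screening variable
fake <- lm(random ~ ., data = screen)

# Leverage converts to squared Mahalanobis distance: D2 = (n - 1) * (h - 1/n)
n <- nrow(toy)
mahal_from_lm <- (n - 1) * (hatvalues(fake) - 1 / n)

# Direct calculation for comparison
mahal_direct <- mahalanobis(toy, colMeans(toy), cov(toy))

all.equal(unname(mahal_from_lm), unname(mahal_direct))
```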

Outliers

• Important side notes:
– For regression based tests: you can run the real regression analysis to get the same information. The rules are altered slightly, so make sure you make notes in the regression section on what’s different.
• You will also use other regression based values for this analysis.

Outliers

• Important side note:
– Many functions in R have their own data screening options. This guide is for global screening not specific to one analysis.

Outliers

• First, figure out which columns are factors and leave them out, as all columns need to be int or num.
– filledin_none[ , -c(1,2)]
– Use that dataset code in the next function.
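
If you are not sure which columns are factors, R can tell you. The sketch below uses a made-up `survey` data frame (not the chapter's `filledin_none`) whose first two columns are factors, and shows that selecting the numeric columns gives the same result as dropping columns 1 and 2 by position.

```r
# Hypothetical data: two factor columns followed by two numeric columns
survey <- data.frame(
  id    = factor(c("a", "b", "c", "d")),
  group = factor(c("ctl", "trt", "ctl", "trt")),
  q1    = c(3, 4, 2, 5),
  q2    = c(1L, 2L, 2L, 4L)
)

str(survey)   # shows the type of every column

# TRUE for int or num columns, FALSE for factors
numeric_cols <- sapply(survey, is.numeric)

# Keep only the numeric columns
survey_numeric <- survey[ , numeric_cols]

# Same result as dropping the factor columns by position
by_position <- survey[ , -c(1, 2)]
```

Selecting by type is a little safer than selecting by position if the column order ever changes.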

Outliers

• Mahalanobis function
• mahalanobis(
– dataset name,
– colMeans(dataset name, na.rm = TRUE),
– cov(dataset name, use = "pairwise.complete.obs")
– )

Outliers

• mahal = mahalanobis(filledin_none[ , -c(1,2)],
    colMeans(filledin_none[ , -c(1,2)], na.rm = TRUE),
    cov(filledin_none[ , -c(1,2)], use = "pairwise.complete.obs"))

Outliers

• Now, let’s get rid of people with bad scores
– But what is a bad score?
– Use a chi-square table.
– DF = # of variables (DVs, variables that you used to calculate Mahalanobis)
– Use p<.001

• Oh, let’s make R do it.

Outliers

• Use the qchisq function, which finds the cut off score for you.
– qchisq(1-pvalue, Number of columns)

• cutoff = qchisq(.999,ncol(dataset))
• cutoff = qchisq(.999,ncol(filledin_none[ , -c(1,2)]))

Outliers

• So, let’s see how many are bad
– summary(mahal < cutoff)

• Let’s get rid of those peeps
– noout = filledin_none[ mahal < cutoff, ]
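
One thing to watch for when subsetting this way: if a row still has a missing value, its Mahalanobis score is NA, and indexing with an NA produces a row made entirely of NAs. The sketch below demonstrates this on simulated data (the `toy` data frame, the planted NA, and the planted extreme case are all made up), and shows `which()` as one possible way around it. Whether rows with missing scores should be kept or dropped is a judgment call that this sketch does not settle.

```r
set.seed(7)
# Simulated data with one missing value and one planted extreme case
toy <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
toy$b[5] <- NA
toy <- rbind(toy, data.frame(a = 9, b = -9, c = 9))

# Same recipe as above
mahal <- mahalanobis(toy,
                     colMeans(toy, na.rm = TRUE),
                     cov(toy, use = "pairwise.complete.obs"))
cutoff <- qchisq(1 - .001, ncol(toy))

summary(mahal < cutoff)   # counts FALSE, TRUE, and NA separately

# Logical indexing with an NA returns an extra all-NA row
noout_naive <- toy[mahal < cutoff, ]

# which() keeps only rows known to be under the cut off
noout_safe <- toy[which(mahal < cutoff), ]

c(naive = nrow(noout_naive), safe = nrow(noout_safe))
```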