STAT 101: Day 5 Descriptive Statistics II 1/30/12

STAT 101: Day 5 Descriptive Statistics II 1/30/12 • One Quantitative Variable (continued) • Quantitative with a Categorical Variable • Two Quantitative Variables Section 2.3, 2.4, 2.5 Professor Kari Lock Morgan Duke University

Page 1: STAT 101: Day 5 Descriptive Statistics II 1/30/12

STAT 101: Day 5
Descriptive Statistics II

1/30/12

• One Quantitative Variable (continued)
• Quantitative with a Categorical Variable
• Two Quantitative Variables

Section 2.3, 2.4, 2.5
Professor Kari Lock Morgan
Duke University

Page 2: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Clicker Registration

To register your clicker, just press the letter that appears next to your name, then press the second letter that appears next to your name

Page 3: STAT 101: Day 5 Descriptive Statistics II 1/30/12

What Are the Odds That Stats Would Be This Popular? - New York Times, 1/26/12

There are billions of bytes generated daily, not just from the Internet but also from sciences like genetics and astronomy. Companies like Google and Facebook, as well as product marketers, risk analysts, spies, natural philosophers and gamblers are all scouring the info, desperate to find a new angle on what makes us and the world tick. …

What no one has are enough people to figure out the valuable patterns that lie inside the data. …

Page 4: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Measures of Center

[Figure: NHL2010Salaries dot plot; horizontal axis: Salary, ticks 2 to 10]

Median: m = $1,250,000

Mean: x̄ = $2,210,000

Mean is “pulled” in the direction of skewness

Page 5: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Standard Deviation

• The sample standard deviation, s, measures the spread of a distribution. The larger s is, the more spread out the distribution is.

• Standard deviation is always ≥ 0.

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$

R: sd()
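As a rough illustration of the formula, the sketch below computes s step by step and compares it with sd(). The data vector is made up for this example and is not from the course.

```r
# Hypothetical data: hours of sleep for six students (made up for illustration)
x <- c(6, 7, 7.5, 8, 8.5, 9)

n <- length(x)
x_bar <- mean(x)

# Sum the squared deviations from the mean, divide by n - 1, take the square root
s_by_hand <- sqrt(sum((x - x_bar)^2) / (n - 1))

# Built-in sample standard deviation
s_builtin <- sd(x)

print(c(by_hand = s_by_hand, builtin = s_builtin))
```

The two values should agree, since sd() also divides by n - 1.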

Page 6: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Standard Deviation

[Figure: two histograms (vertical axis: Frequency; horizontal axis: -15 to 15), one with s = 1 and one with s = 4]

Both of these distributions are bell-shaped

Page 7: STAT 101: Day 5 Descriptive Statistics II 1/30/12

The 95% Rule

• If a distribution is symmetric and bell-shaped, then approximately 95% of the data values will lie within 2 standard deviations of the mean.

[Figure: two histograms (vertical axis: Frequency), one with s = 1 (horizontal axis: -3 to 3) and one with s = 4 (horizontal axis: -15 to 15)]
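One way to check the 95% rule is to simulate bell-shaped data. The sketch below assumes normally distributed values with s near 4; the seed and sample size are arbitrary choices for this example.

```r
set.seed(101)  # arbitrary seed so the simulation is repeatable

# Simulated bell-shaped data (an assumption for illustration, not course data)
x <- rnorm(10000, mean = 0, sd = 4)

# Flag the values that lie within 2 standard deviations of the mean
within_2s <- abs(x - mean(x)) <= 2 * sd(x)
prop_within <- mean(within_2s)

# Proportion within 2 standard deviations for an exactly normal distribution
prop_normal <- pnorm(2) - pnorm(-2)

print(c(simulated = prop_within, normal = prop_normal))
```

Both proportions should be close to 0.95. For skewed or otherwise non-bell-shaped data the rule may not hold.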

Page 8: STAT 101: Day 5 Descriptive Statistics II 1/30/12

The 95% Rule

The standard deviation for hours of sleep per night is closest to

a) ½
b) 1
c) 2
d) 4
e) I have no idea

s = 2.03

Page 9: STAT 101: Day 5 Descriptive Statistics II 1/30/12

z-score

• A z-score is a unit-free measure of how extreme a data point is. It tells us how many standard deviations away from the mean a value is.

$$ z = \frac{x - \bar{x}}{s} $$

• Values farther from 0 are more extreme.

• For symmetric, bell-shaped distributions, about 95% of all z-scores fall between -2 and 2.

Page 10: STAT 101: Day 5 Descriptive Statistics II 1/30/12

z-score

Which is better, an ACT score of 28 or a combined SAT score of 2100?

• ACT: mean = 21, sd = 5
• SAT: mean = 1500, sd = 325
• Assume ACT scores and SAT scores have approximately symmetric and bell-shaped distributions

(a) ACT score of 28
(b) SAT score of 2100
(c) I don’t know
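The comparison can be sketched in R by converting both scores to z-scores with the means and standard deviations given above. The helper name z_score is made up for this example.

```r
# Hypothetical helper: number of standard deviations a value lies from the mean
z_score <- function(x, xbar, s) (x - xbar) / s

z_act <- z_score(28, xbar = 21, s = 5)         # (28 - 21) / 5
z_sat <- z_score(2100, xbar = 1500, s = 325)   # (2100 - 1500) / 325

print(c(ACT = z_act, SAT = z_sat))
```

The score with the larger z-score is farther above its own mean, measured in standard deviations.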

Page 11: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Other Measures of Location

Maximum = largest data value

Minimum = smallest data value

Quartiles:
Q1 = median of the values below m (the median).
Q3 = median of the values above m.

Page 12: STAT 101: Day 5 Descriptive Statistics II 1/30/12

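Five Number Summary

• Five Number Summary: Min, Q1, m, Q3, Max

[Diagram: number line marked Min, Q1, m, Q3, Max; each of the four intervals between consecutive markers holds 25% of the data]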

R: summary()
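A minimal sketch of the five number summary, using the "median of the values below/above m" definition of the quartiles. The data are made up. Note that summary() interpolates between data values by default (quantile type 7), so its quartiles can differ slightly from this definition on small data sets.

```r
# Hypothetical data, made up for illustration
x <- c(2, 4, 5, 7, 8, 10, 12, 15, 21)

xs <- sort(x)
n <- length(xs)
m <- median(xs)

# Values below and above the median (the median itself is excluded when n is odd)
lower_half <- xs[seq_len(floor(n / 2))]
upper_half <- xs[(ceiling(n / 2) + 1):n]

q1 <- median(lower_half)
q3 <- median(upper_half)

five_num <- c(Min = min(xs), Q1 = q1, m = m, Q3 = q3, Max = max(xs))
print(five_num)

# Built-in summary for comparison (also reports the mean)
print(summary(x))
```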

Page 13: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Percentile

• The Pth percentile is the value of a quantitative variable which is greater than P percent of the data.

• We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better.

• We could also have used percentiles:
– ACT score of 28: 91st percentile
– SAT score of 2100: 97th percentile
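If the score distributions are treated as exactly normal (an assumption made only for this sketch), approximate percentiles can be computed with pnorm(). The results will not match published percentile tables exactly, since real score distributions are only approximately bell-shaped.

```r
# Approximate percentiles under a normal model (an assumption for illustration)
pct_act <- 100 * pnorm(28, mean = 21, sd = 5)
pct_sat <- 100 * pnorm(2100, mean = 1500, sd = 325)

print(round(c(ACT = pct_act, SAT = pct_sat), 1))
```

Both values land close to the percentiles quoted above, and the SAT score again comes out higher.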

Page 14: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Five Number Summary

• Five Number Summary:

Min = 0th percentile
Q1 = 25th percentile
m = 50th percentile
Q3 = 75th percentile
Max = 100th percentile

Each interval between consecutive values holds 25% of the data.

Page 15: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Five Number Summary

The distribution of number of hours you spend studying each week is

(a) Symmetric
(b) Right-skewed
(c) Left-skewed
(d) Impossible to tell

> summary(study_hours)
   Min. 1st Qu.  Median 3rd Qu.    Max.
   2.00   10.00   15.00   20.00   69.00

Page 16: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Measures of Spread

• Range = Max - Min

• Interquartile Range (IQR) = Q3 - Q1

• Is the range resistant to outliers?
a) Yes
b) No

• Is the IQR resistant to outliers?
a) Yes
b) No
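One way to explore these two questions is to change a single value into an extreme one and see which measure moves. Both data vectors below are made up for this sketch, and spread is a hypothetical helper name.

```r
# Hypothetical data: the second vector replaces the largest value with an extreme one
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
x_outlier <- c(1, 2, 3, 4, 5, 6, 7, 8, 90)

# Hypothetical helper returning both measures of spread
spread <- function(v) c(range = max(v) - min(v), IQR = IQR(v))

print(rbind(original = spread(x), with_outlier = spread(x_outlier)))
```

Comparing the two rows shows which measure reacts to the extreme value and which does not.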

Page 17: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Outliers

• Outliers can be informally identified by looking at a plot, but one rule of thumb for identifying outliers is data values more than 1.5 IQRs beyond the quartiles.

• A data value is an outlier if it is

Smaller than Q1 - 1.5(IQR)

or

Larger than Q3 + 1.5(IQR)
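The rule can be applied to the study_hours summary shown earlier (Min 2, Q1 10, Q3 20, Max 69). The sketch below uses only those summary values; the names lower_fence, upper_fence, and is_outlier are made up for this example.

```r
# Quartiles taken from the summary(study_hours) output
q1 <- 10
q3 <- 20
iqr <- q3 - q1

# Cutoffs from the 1.5 * IQR rule of thumb
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Hypothetical helper: TRUE for values beyond either cutoff
is_outlier <- function(x) x < lower_fence | x > upper_fence

print(c(lower = lower_fence, upper = upper_fence))
print(is_outlier(c(2, 69)))  # check the reported minimum and maximum
```

With these quartiles, anything below -5 or above 35 hours is flagged, so the maximum of 69 counts as an outlier and the minimum of 2 does not.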

Page 18: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Boxplot

[Figure: boxplot with labels Median, Q1, Q3; the box spans the middle 50% of data; outliers are plotted as individual points]

• Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier

R: boxplot(study_hours, ylab="Hours spent studying")
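The numbers behind a boxplot can be inspected without drawing it by calling boxplot.stats(). The data below are made up for this sketch. Note that boxplot.stats() uses Tukey's hinges for the box edges, which can differ slightly from the quartiles reported by summary().

```r
# Hypothetical study hours (made up for illustration)
hours <- c(2, 8, 10, 12, 15, 15, 18, 20, 22, 69)

bs <- boxplot.stats(hours)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
print(bs$stats)

# Values beyond the whiskers, plotted as individual outlier points
print(bs$out)
```

Here the upper whisker stops at 22, the most extreme value that is not an outlier, and 69 is reported separately.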

Page 19: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Boxplot

Which boxplot goes with the histogram of waiting times for the bus?

[Figure: "Histogram of Bus" (horizontal axis: Bus, 0 to 20; vertical axis: Frequency), followed by three candidate boxplots labeled (a), (b), (c)]

Page 20: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Summary: One Quantitative Variable

• Summary Statistics
– Center: mean, median
– Spread: standard deviation, range, IQR
– Percentiles
– 5 number summary

• Visualization
– Dotplot
– Histogram
– Boxplot

• Other concepts
– Shape: symmetric, skewed, bell-shaped
– Outliers, resistance
– z-scores

Page 21: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Quantitative and Categorical Relationships

• Boxplots are particularly useful for comparing distributions of a quantitative variable across different levels of a categorical variable

Page 22: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Side-by-Side Boxplots

boxplot(gpa~parent_degree, ylab="GPA", xlab="Parents' Highest Degree")

Do students whose parents had more of an education have higher GPAs?

Page 23: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Side-by-Side Boxplots

Does GPA differ by major?

Page 24: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Side-by-Side Boxplots

Do students who’ve had AP statistics do better in STAT 101?

NO!

Page 25: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Side-by-Side Boxplots

Page 26: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Quantitative Statistics by a Categorical Variable

• Any of the statistics we use for a quantitative variable can be looked at separately for each level of a categorical variable

• Mean hours per week spent studying by major:
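A breakdown like this can be computed with tapply(). The sketch below uses a small made-up data frame; the majors and hours are hypothetical, not the class survey results.

```r
# Hypothetical survey data (made up for illustration)
survey <- data.frame(
  major = c("Biology", "Economics", "Biology",
            "Statistics", "Economics", "Statistics"),
  study_hours = c(12, 10, 18, 15, 14, 21)
)

# Mean study hours computed separately for each major
mean_by_major <- tapply(survey$study_hours, survey$major, mean)
print(mean_by_major)
```

Swapping mean for median, sd, or IQR gives the same breakdown for any other summary statistic.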

Page 27: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Summary: One Quantitative and One Categorical

• Summary Statistics
– Any summary statistics for quantitative variables, broken down by each level of the categorical variable

• Visualization
– Side-by-side boxplots

Page 28: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Scatterplot

• A scatterplot is a graph of the relationship between two quantitative variables. Each dot represents one case.

R: plot(study_hours, gpa)

Page 29: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Direction of Association

• A positive association means that values of one variable tend to be higher when values of the other variable are higher.

• A negative association means that values of one variable tend to be lower when values of the other variable are higher.

• Two variables are not associated if knowing the value of one variable does not give you any information about the value of the other variable.

Page 30: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Cars Data - Handout

• Quantitative Variables:
– Weight (pounds)
– City MPG
– Fuel capacity (gallons)
– Page number (in Consumer Reports)
– Time to go ¼ mile (in seconds)
– Acceleration time from 0 to 60 mph

• Relationships
– Weight vs. CityMPG
– Weight vs. FuelCapacity
– PageNum vs. FuelCapacity
– Weight vs. QtrMile
– Acc060 vs. QtrMile
– CityMPG vs. QtrMile

Page 31: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Correlation

• The sample correlation, r, measures the strength and direction of linear association between two quantitative variables.

$$ r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_X} \right) \left( \frac{y_i - \bar{y}}{s_Y} \right) $$

s_X : sample standard deviation of X
s_Y : sample standard deviation of Y

R: cor(X,Y)
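As a sketch of how the formula works, the code below standardizes each variable, multiplies the paired z-scores, and divides the sum by n - 1, then compares the result with cor(). The five weight and MPG values are made up and are not the Cars99 data.

```r
# Hypothetical data (made up for illustration)
x <- c(2200, 2600, 3000, 3400, 3900)  # weight in pounds
y <- c(31, 27, 24, 21, 18)            # city MPG

n <- length(x)

# z-scores for each variable
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum of the products of paired z-scores, divided by n - 1
r_by_hand <- sum(z_x * z_y) / (n - 1)

# Built-in sample correlation
r_builtin <- cor(x, y)

print(c(by_hand = r_by_hand, builtin = r_builtin))
```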

Page 32: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Car Correlations

[Figure: six Cars99 scatterplots, each labeled with its correlation r]

• Weight vs. CityMPG (r = -.91)
• Weight vs. FuelCap (r = .89)
• PageNum vs. FuelCap (r = -.08)
• Weight vs. QtrMile (r = -.45)
• Acc060 vs. QtrMile (r = .99)
• CityMPG vs. QtrMile (r = .51)

What are the properties of correlation?

Page 33: STAT 101: Day 5 Descriptive Statistics II 1/30/12

Correlation

• -1 ≤ r ≤ 1

• positive association: r > 0
• negative association: r < 0
• no linear association: r ≈ 0

• The closer r is to ±1, the stronger the linear association

• r does not depend on the units of measurement

• The correlation between X and Y is the same as the correlation between Y and X
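Two of these properties are easy to check numerically. The sketch below uses made-up weight and MPG values and an approximate pounds-to-kilograms factor; neither comes from the handout.

```r
# Hypothetical data (made up for illustration)
weight_lb <- c(2200, 2600, 3000, 3400, 3900)
city_mpg <- c(31, 27, 24, 21, 18)

r <- cor(weight_lb, city_mpg)

# Swapping X and Y leaves the correlation unchanged
r_swapped <- cor(city_mpg, weight_lb)

# Changing units (pounds to kilograms, approximate factor) leaves it unchanged
weight_kg <- weight_lb * 0.4536
r_kg <- cor(weight_kg, city_mpg)

print(c(r = r, swapped = r_swapped, kilograms = r_kg))
```

All three values should match, and each lies between -1 and 1.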