Describing Relationships: Scatter Plots and Correlation

Post on 24-Dec-2015



● The world is an indivisible whole (butterfly effect and chaos theory; quantum entanglement, etc.) Things (and people) exist in relationships

● Here we study relationships between two quantitative variables (e.g., IQ test score and school GPA.)

● Graphical description: scatter plots

– Look for direction, form, strength, and outliers

● Numerical measure: correlation coefficient

– Definition

– Direction and strength of linear relationships

Scatter Plots

Convention: Dependent variable on the vertical axis, independent (explanatory) variable on the horizontal axis.

Scatter Plot Example: IQ score and GPA

Scatter Plot Example: Wealth and Health

What to Look For in a Scatter Plot

● Functional form: Nonlinear? Linear?

● Direction: Positive? Negative?

● Strength: How clear is the pattern?

What to Look For in a Scatter Plot: Direction

Outliers

How Strong Is the Relationship?

Two scatter plots of the same data, using different scales. Visual impressions are not very reliable!
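Unlike the visual impression, the correlation coefficient r (defined in the slides that follow) does not depend on the units or scale used on either axis. A minimal Python sketch of this point, assuming made-up study-time and test-score values (the data and the helper name `pearson_r` are hypothetical, not from the slides):

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data: hours studied and test score (0-100) for six students.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 60, 57, 70, 68, 80]

# The same data in different units: minutes instead of hours,
# and a fraction of the maximum score instead of points.
minutes = [60 * h for h in hours]
fraction = [s / 100 for s in score]

r_original = pearson_r(hours, score)
r_rescaled = pearson_r(minutes, fraction)
print(round(r_original, 3), round(r_rescaled, 3))  # the two values agree
```

Stretching or compressing an axis changes how steep or tight the cloud of points looks, but each z-score is unchanged, so r is unchanged as well.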

The Correlation Coefficient

● Numerical measure of direction and strength of a linear relationship

● Linear relationships are particularly important

– Simplest, easiest to understand

– Some nonlinear relationships can be made linear by transforming variables (e.g., using squared terms for curvilinear patterns, such as age and voting)

– A “first order approximation” of arbitrary relationships.
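The point about transforming variables can be illustrated with a small Python sketch. It assumes an idealized U-shaped pattern, y = x squared on made-up x values (not the age-and-voting data mentioned above), and a hypothetical helper `pearson_r`:

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical curvilinear pattern: y rises as x moves away from 0 in either direction.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

# Raw x against y: the relationship is perfectly predictable but not linear.
r_raw = pearson_r(x, y)

# Squared x against y: after the transformation the relationship is linear.
x_squared = [xi ** 2 for xi in x]
r_transformed = pearson_r(x_squared, y)

print(round(r_raw, 3), round(r_transformed, 3))  # essentially 0 and essentially 1
```

A correlation near zero on the raw variables therefore does not rule out a strong relationship; it only rules out a strong linear one.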

Measuring Linear Correlation with r

● Direction: Does the scatter plot slope upward or downward?

– Positive r indicates a positive relationship, negative r indicates a negative relationship.

● Strength: How strong is the association? How closely does a non-horizontal straight line fit the points of a scatter plot?

– The stronger the relationship, the larger the magnitude of r.

● Formula:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
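The formula can be followed step by step: standardize each x and y using the sample means and sample standard deviations, multiply the paired standardized values, add them up, and divide by n - 1. A short Python sketch, assuming five made-up data points (the values are hypothetical and chosen only to keep the arithmetic small):

```python
from statistics import mean, stdev

# Hypothetical paired observations (x_i, y_i).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)  # sample means
s_x, s_y = stdev(x), stdev(y)    # sample standard deviations (n - 1 denominator)

# Standardize each observation, multiply the paired z-scores,
# sum the products, and divide by n - 1.
z_x = [(xi - x_bar) / s_x for xi in x]
z_y = [(yi - y_bar) / s_y for yi in y]
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

print(round(r, 3))  # 0.775
```

Points where x and y are both above or both below their means contribute positive products, which is why an upward-sloping cloud yields a positive r.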

Strength of linear correlation

Strength and Statistical Significance

● A strong relationship seen in the sample may indicate a strong relationship in the population, or the sample result may be due to chance, with the population relationship weak or zero. (We'll test whether the relationship is “significant” in the context of linear regression.)

● “Statistical significance” does not imply the relationship is strong enough to be considered “practically important”. (“Non-zero” is not necessarily “big” in size.)

– Even weak relationships may be labeled statistically significant if the sample size is very large.

– Even very strong relationships may not be labeled statistically significant if the sample size is very small.
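The two sample-size effects above can be made concrete. The slides defer the actual test to linear regression, so the sketch below is an assumption: it uses a common test statistic for a zero population correlation, t = r·sqrt(n − 2) / sqrt(1 − r²) with n − 2 degrees of freedom. The critical values are two-sided 5% values from a standard t table, and the r and n values are made up.

```python
from math import sqrt

def t_statistic(r, n):
    """t statistic for testing zero population correlation (n - 2 degrees of freedom)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Two-sided 5% critical values taken from a standard t table (assumed, not computed here).
CRITICAL_LARGE_DF = 1.96  # approximate value for very large degrees of freedom
CRITICAL_DF_3 = 3.182     # 3 degrees of freedom

t_weak = t_statistic(0.05, 10_000)  # weak relationship, very large sample
t_strong = t_statistic(0.80, 5)     # strong relationship, very small sample

print(t_weak > CRITICAL_LARGE_DF)  # True: weak r, yet statistically significant
print(t_strong > CRITICAL_DF_3)    # False: strong r, yet not statistically significant
```

In this illustration, r = 0.05 passes the test with 10,000 observations, while r = 0.80 fails it with 5.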

Properties of the Correlation Coefficient

● r is always between -1 and 1

● r > 0: as one variable changes, the other variable tends to change in the same direction

● r < 0: as one variable changes, the other variable tends to change in the opposite direction

● r=+1: A perfect positive linear relationship: y=a+bx, b>0

● r=-1: A perfect negative linear relationship: y=a+bx, b<0

● r=0: No linear relationship (the scatter plot points are best fit by a horizontal line)

● Limitations:

– Outliers can inflate or deflate correlations

– Correlation can be spurious due to confounding
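The confounding limitation can be sketched with a small simulation. Everything here is an assumption made for illustration: a hypothetical third variable z drives both x and y, x has no effect on y at all, and the noise levels and sample size are arbitrary.

```python
import random
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

random.seed(0)  # fixed seed so the simulation is repeatable
n = 2000

# z is a confounder that drives both x and y; x has no direct effect on y.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)  # sizeable, although x does not influence y

# Remove the confounder's contribution (possible only because the data are simulated).
x_net = [xi - zi for xi, zi in zip(x, z)]
y_net = [yi - zi for yi, zi in zip(y, z)]
r_net = pearson_r(x_net, y_net)  # close to zero

print(round(r_xy, 2), round(r_net, 2))
```

With real data the confounder is usually not known this precisely, which is why a large r on its own does not establish that one variable affects the other.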

The Effect of Outliers on the Correlation

In this figure, which graphs the relationship between the length of a leg bone and an upper arm bone in 6 fossil specimens of an extinct beast, moving one point changes r from .994 to .64!
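The fossil measurements themselves are not reproduced in this transcript, so the following Python sketch uses six made-up pairs of bone lengths to show the same kind of effect. The numbers and the helper name `pearson_r` are hypothetical and will not reproduce the values .994 and .64.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical bone lengths for six specimens, almost exactly on a straight line.
leg = [40, 50, 56, 61, 68, 75]
arm = [43, 55, 61, 66, 74, 82]
r_before = pearson_r(leg, arm)

# Move a single point: the last specimen's arm bone is recorded as 50 instead of 82.
arm_moved = arm[:-1] + [50]
r_after = pearson_r(leg, arm_moved)

print(round(r_before, 3), round(r_after, 3))  # r drops sharply after one point moves
```

With only six observations, each point carries a large share of the sum in the formula for r, so a single unusual value can dominate the result.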

The Effect of Outliers on the Correlation

Using Software

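● Stata:

– For scatter plot: Graphics --> Two way graph

– For r: “correlate”, “pwcorr” (pair-wise)

– Example (data set “lifeexp”): relationship between safe water access and life expectancy, country-level data.

* load the example data set and show its notes
sysuse lifeexp
notes
* scatter plot of life expectancy against safe water access
twoway (scatter lexp safewater)
* correlation for the pair, then pair-wise correlations for all variables
correlate lexp safewater
pwcorr
pwcorr, sig

● “Correlation and Regression Demo” applet at the book's website (explore the effect of outliers, for example.)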

Statistical versus Deterministic Relationships

● y=a+bx is a deterministic relationship: knowing the value of x means knowing the value of y (assuming we know the intercept and the slope). The correlation coefficient is 1 (or -1) in this case.

– e.g. Distance traveled = speed x time. Fixing speed, there's a deterministic relationship between distance and time.

● In social science data, deterministic relationships are rare (e.g., time studying and exam grade are related, but not deterministically).

● y=a+bx+ε describes such imperfect linear relationships between two variables: the value of y is not completely determined by x (and the parameters a and b), but is also affected by something else, represented by the error term ε.

● We discuss the fitting of straight lines to imperfect scatter plot data next time.
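The contrast between the two kinds of relationship can be sketched in Python. The speed, intercept, slope, and the list of deviations standing in for the "something else" are all made-up values, and `pearson_r` is a hypothetical helper.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

hours = [1, 2, 3, 4, 5, 6]

# Deterministic: distance = speed x time, with speed fixed at a hypothetical 60.
distance = [60 * h for h in hours]
r_deterministic = pearson_r(hours, distance)

# Statistical: grade = a + b * hours + something else (hypothetical a = 50, b = 5).
other_influences = [3, -4, 6, -5, 2, -2]  # made-up deviations from the line
grade = [50 + 5 * h + e for h, e in zip(hours, other_influences)]
r_statistical = pearson_r(hours, grade)

print(round(r_deterministic, 3), round(r_statistical, 3))  # 1.0 and a value below 1
```

The larger the deviations are relative to the spread produced by the line itself, the further r falls below 1.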