Statistics for Science Journalists
![Page 1: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/1.jpg)
STEVE DOIG
CRONKITE SCHOOL OF JOURNALISM
Statistics for Science Journalists
![Page 2: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/2.jpg)
Journalists hate math
Definition of journalist: A do-gooder who hates math.
“Word person, not a numbers person.”
1936 JQ article noting habitual numerical errors in newspapers
Japanese 6th graders more accurate on math test than applicants to Columbia’s Graduate School of Journalism
20% of journalists got more than half wrong on 25-question “math competency test” (Maier)
18% of 5,100 stories examined by Phil Meyer had math errors
![Page 3: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/3.jpg)
Bad examples abound
Paulos: 300% decrease in murders
Detroit Free Press (2006): Compared ACS to Census data to get false drop in median income
KC Star (2000): Priests dying of AIDS at 4 times the rate of all Americans
Delaware ZIP Code of infant death
NYT: 51% of women without spouses
![Page 4: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/4.jpg)
Common problems
Numbers that don’t add up
Making the reader do the math
Failure to ask “Does this make sense?”
Over-precision
Ignoring sampling error margins
Implying that correlation equals causation
![Page 5: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/5.jpg)
Dangers of journalistic innumeracy
Misleads math-challenged readers/viewers
Hurts credibility among math-capable readers/viewers
Leads to charges of bias, even when cause is ignorance
Makes reporters vulnerable to being used for the agendas of others
![Page 6: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/6.jpg)
Common Research Methods
Randomized experiments: Measure deliberate manipulation of the environment
Observational studies: Measure the differences that occur naturally
Meta-analyses: Quantitative review of multiple studies
Case Study: Descriptive in-depth examination of one or a few individuals
![Page 7: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/7.jpg)
Simple Measures...
...don’t exist!
![Page 8: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/8.jpg)
Measurement Variability
Variable measurements include unpredictable errors or discrepancies that aren’t easily explained.
Natural variability is the result of the fact that individuals and other things are different.
![Page 9: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/9.jpg)
Reasons for variable measures
Measurement error
Natural variability between individuals
Natural variability over time in a single individual
![Page 10: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/10.jpg)
Some Pitfalls in Studies
![Page 11: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/11.jpg)
Deliberate Bias?
If you found a wallet with $20, would you:
“Keep it?” (23% would keep it)
“Do the honest thing and return it?” (13% would keep it)
![Page 12: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/12.jpg)
Unintentional Bias?
“Do you use drugs?”
“Are you religious?”
![Page 13: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/13.jpg)
Desire to Please?
People routinely say they have voted when they actually haven’t, that they don’t smoke when they do, and that they aren’t prejudiced.
One study six months after an election:
96% of actual voters said they voted.
40% of non-voters said they voted.
![Page 14: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/14.jpg)
Asking the uninformed?
Washington Post poll: “Some people say the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?”
24% said yes
19% said no
rest had no opinion
![Page 15: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/15.jpg)
Asking the uninformed?
Later Washington Post poll: “President Clinton says the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?”
36% of Democrats agreed
16% of Republicans agreed
rest had no opinion
![Page 16: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/16.jpg)
Unnecessary Complexity?
“Do you support our soldiers in Iraq so that terrorists won’t strike the U.S. again?”
![Page 17: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/17.jpg)
Question Order
“About how many times a month do you normally go out on a date?”
“How happy are you with life in general?”
![Page 18: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/18.jpg)
Sampling
![Page 19: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/19.jpg)
Margin of Error
95% of the time, a random sample’s characteristics will differ from the population’s by no more than about
$$\frac{1}{\sqrt{N}}$$
where N = sample size
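The rule of thumb above is easy to check for a few sample sizes. This is a minimal sketch, not part of the original slides; the function name `margin_of_error` is made up for illustration, and the result is a proportion (multiply by 100 for percentage points).

```python
import math


def margin_of_error(sample_size: int) -> float:
    """Approximate 95% margin of sampling error, as a proportion: 1 / sqrt(N)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return 1 / math.sqrt(sample_size)


# Quadrupling the sample only halves the error margin.
for n in (100, 400, 900, 1600):
    print(f"N={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

The same function applies to any subsample: a subgroup of 225 respondents inside a larger poll carries the error margin of a 225-person sample, not that of the full poll.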
![Page 20: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/20.jpg)
Two Important Concepts about Error Margin
The larger the sample, the smaller the margin of sampling error.
The size of the population being surveyed doesn’t matter.*
*Unless the sample is a significant fraction of the population.
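The footnote can be made concrete with the standard finite population correction from survey statistics. The slides do not name this correction, so treat the sketch below as an assumption about what the footnote means; `margin_of_error_fpc` is a hypothetical name.

```python
import math


def margin_of_error_fpc(sample_size: int, population_size: int) -> float:
    """1/sqrt(n) rule of thumb, scaled by the finite population correction."""
    if population_size < 2 or not 0 < sample_size <= population_size:
        raise ValueError("need 0 < sample_size <= population_size and population_size >= 2")
    correction = math.sqrt((population_size - sample_size) / (population_size - 1))
    return correction / math.sqrt(sample_size)


# Same sample of 400, very different population sizes.
for population in (800, 10_000, 1_000_000, 100_000_000):
    moe = margin_of_error_fpc(400, population)
    print(f"population={population:>11,}: +/- {moe * 100:.2f} percentage points")
```

For the two large populations the margin is essentially the uncorrected 5 points; it shrinks noticeably only when the 400 respondents are half of a population of 800.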
![Page 21: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/21.jpg)
Sampling realities
Bigger sample means more cost (money and/or time)
Diminishing return on error margin improvement as sample increases.
N=100: +/- 10 percentage points
N=400: +/- 5 percentage points
N=900: +/- 3.3 percentage points
Sample needs only to be large enough to give a reasonable answer.
Sampling error affects subsamples, too.
![Page 22: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/22.jpg)
Describing data sets
![Page 23: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/23.jpg)
Three Useful Features of a Set of Data
The Center
The Variability
The Shape
![Page 24: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/24.jpg)
The Center
Mean (average): Total of the values, divided by the number of values
Median: The middle value of an ordered list of values
Mode: The most common value
Outliers: Atypical values far from the center
![Page 25: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/25.jpg)
Yankees’ Baseball Salaries
Average: $7,404,762
Median: $2,500,000
Mode: $500,000 (also the minimum)
Outlier: $27.5 million (Alex Rodriguez)
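The pull of one outlier on the mean can be seen with Python's `statistics` module. The full Yankees payroll is not reproduced in the slides, so the salary list below is invented for illustration; only the $500,000 minimum and the $27.5 million outlier echo the figures above.

```python
import statistics

# Hypothetical salaries (not the real payroll); one value is far above the rest.
salaries = [500_000, 500_000, 500_000, 1_200_000, 2_500_000, 4_000_000, 27_500_000]

mean_salary = statistics.mean(salaries)      # total divided by count
median_salary = statistics.median(salaries)  # middle value of the ordered list
mode_salary = statistics.mode(salaries)      # most common value

print(f"Mean:   ${mean_salary:,.0f}")
print(f"Median: ${median_salary:,.0f}")
print(f"Mode:   ${mode_salary:,.0f}")

# Dropping the outlier moves the mean a lot and the median only a little.
without_outlier = [s for s in salaries if s < 27_500_000]
print(f"Mean without outlier:   ${statistics.mean(without_outlier):,.0f}")
print(f"Median without outlier: ${statistics.median(without_outlier):,.0f}")
```

When the mean sits far above the median, as here, the median is usually the fairer description of a "typical" value.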
![Page 26: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/26.jpg)
The Variability
Some measures of variability:
Maximum and minimum: Largest and smallest values
Range: The distance between the largest and smallest values
Quartiles: The medians of each half of the ordered list of values
Standard deviation: Think of it as the average distance of all the values from the mean.
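These measures can be computed in a few lines. The sketch below uses made-up values and follows the slide's definition of quartiles as medians of each half; `quartiles` is a hypothetical helper, and other software may use slightly different quartile conventions.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3), taking Q1 and Q3 as medians of the lower and upper halves."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the middle
    upper = ordered[-half:]   # values above the middle (middle value excluded if count is odd)
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)


data = [2, 4, 4, 5, 7, 9, 11]  # invented example values

value_range = max(data) - min(data)
q1, q2, q3 = quartiles(data)
std_dev = statistics.pstdev(data)  # population standard deviation

# The literal "average distance from the mean" is the mean absolute deviation.
center = statistics.fmean(data)
mean_abs_dev = statistics.fmean(abs(x - center) for x in data)

print(f"Range: {value_range}, quartiles: {q1}, {q2}, {q3}")
print(f"Standard deviation: {std_dev:.2f}, mean absolute deviation: {mean_abs_dev:.2f}")
```

The two spread figures are close but not identical, which is why "average distance from the mean" is best read as a mental picture of the standard deviation rather than its formula.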
![Page 27: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/27.jpg)
What is “normal”?
Don’t consider the average to be “normal”
Variability is normal
Anything within about 3 standard deviations of the mean is “normal”
![Page 28: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/28.jpg)
Bell-Shaped “Normal” Curve
![Page 29: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/29.jpg)
Some Characteristics of a Normal Distribution
Symmetrical (not skewed)
One peak in the middle, at the mean
The wider the curve, the greater the standard deviation
Area under the curve is 1 (or 100%)
![Page 30: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/30.jpg)
Percentiles
Your percentile for a particular measure (like height or IQ) is the percentage of the population that falls below you.
Compared to other American males:
My height (5’ 11”): 75th percentile
My weight (230 lbs.): 85th percentile
My age (66): 88th percentile
Therefore, I am older and heavier than I am tall.
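The definition of a percentile translates directly into code. This is a small sketch with an invented list of heights standing in for "the population"; `percentile_rank` is a hypothetical name.

```python
def percentile_rank(population, value):
    """Percentage of the population that falls strictly below the given value."""
    if not population:
        raise ValueError("population must not be empty")
    below = sum(1 for member in population if member < value)
    return 100 * below / len(population)


# Hypothetical heights in inches, not real survey data.
heights = [64, 66, 67, 68, 69, 69, 70, 71, 72, 74]

print(percentile_rank(heights, 71))  # share of the list shorter than 71 inches
```

Because percentiles are unit-free, they allow the kind of cross-measure comparison the slide jokes about: a height percentile and an age percentile can be set side by side even though inches and years cannot.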
![Page 31: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/31.jpg)
Standardized Scores
A standardized score (also called the z-score) is simply the number of standard deviations a particular value is either above or below the mean.
The standardized score is:
Positive if above the mean
Negative if below the mean
Useful for defining data points as outliers.
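As a worked example of the definition, the sketch below standardizes two scores on a scale with an assumed mean of 100 and standard deviation of 15. Those two numbers are illustrative assumptions, not figures from the slides.

```python
def z_score(value: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (value - mean) / std_dev


ASSUMED_MEAN = 100.0     # hypothetical scale mean
ASSUMED_STD_DEV = 15.0   # hypothetical scale standard deviation

print(z_score(130, ASSUMED_MEAN, ASSUMED_STD_DEV))  # two standard deviations above
print(z_score(85, ASSUMED_MEAN, ASSUMED_STD_DEV))   # one standard deviation below
```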
![Page 32: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/32.jpg)
The Empirical Rule
For any normal curve, approximately:
68% of values within one StdDev of the mean
95% of values within two StdDevs of the mean
99.7% of values within three StdDevs of the mean
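The three percentages can be checked against the normal curve itself. This sketch uses `statistics.NormalDist` from the Python standard library (3.8 or later); `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(num_std_devs: float) -> float:
    """Share of a normal distribution lying within +/- num_std_devs of the mean."""
    return standard_normal.cdf(num_std_devs) - standard_normal.cdf(-num_std_devs)


for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {share_within(k) * 100:.1f}%")
```

The rule holds only for data that are roughly bell-shaped; skewed data such as salaries can behave very differently.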
![Page 33: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/33.jpg)
![Page 34: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/34.jpg)
Outlier
A value that is more than three standard deviations above or below the mean.
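Applied to a list of numbers, this definition becomes a simple filter. The data below are invented, and `find_outliers` is a hypothetical helper that uses the population standard deviation; it is a sketch of the three-standard-deviation rule, not a full outlier analysis.

```python
import statistics


def find_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    center = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]


# Nineteen invented readings near 50 and one far away.
readings = [48, 52, 50, 49, 51, 50, 47, 53, 50, 49,
            51, 50, 52, 48, 50, 49, 51, 50, 50, 120]

print(find_outliers(readings))
```

An extreme value inflates both the mean and the standard deviation it is judged against, so a flagged value deserves a closer look rather than automatic deletion.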
![Page 35: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/35.jpg)
Correlation
![Page 36: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/36.jpg)
Strength of Relationship
Correlation (also called the correlation coefficient or Pearson’s r) is the measure of strength of the linear relationship between two variables.
Think of strength as how closely the data points come to falling on a line drawn through the data.
![Page 37: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/37.jpg)
Features of Correlation
Correlation can range from +1 to -1
Positive correlation: As one variable increases, the other increases
Negative correlation: As one variable increases, the other decreases
Zero correlation means the best line through the data is horizontal
Correlation isn’t affected by the units of measurement
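These features can be verified with a hand-rolled Pearson's r. The height and weight figures are invented for illustration, and `pearson_r` is a hypothetical helper written out so that no particular library version is assumed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one variable never changes")
    return co_variation / (spread_x * spread_y)


# Hypothetical heights (inches) and weights (pounds).
heights_in = [63, 65, 67, 69, 71, 73]
weights_lb = [127, 140, 139, 158, 170, 175]

# The same people measured in centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)
print(f"r (inches, pounds) = {r_imperial:.3f}")
print(f"r (cm, kg)         = {r_metric:.3f}")
```

Changing units rescales both axes but leaves r unchanged, which is what makes correlations comparable across studies that measure things differently.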
![Page 38: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/38.jpg)
Positive Correlations
[Four scatterplots: r = +.1, r = +.4, r = +.8, r = +1]
![Page 39: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/39.jpg)
Negative Correlations
[Four scatterplots: r = -.1, r = -.4, r = -.8, r = -1]
![Page 40: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/40.jpg)
Zero correlation
[Two scatterplots, each with r = 0]
![Page 41: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/41.jpg)
Number of Points Doesn’t Matter
[Two scatterplots, each with r = .8]
![Page 42: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/42.jpg)
Important!
Correlation does not imply causation.
![Page 43: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/43.jpg)
Correlation of variables
When considering relationships between measurement variables, there are two kinds:
Explanatory (or independent) variable: The variable that attempts to explain or is purported to cause (at least partially) differences in the…
Response (or dependent or outcome) variable
Often, chronology is a guide to distinguishing them (examples: baldness and heart attacks, poverty and test scores)
![Page 44: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/44.jpg)
Some reasons why two variables could be related
The explanatory variable is the direct cause of the response variable
Example: pollen counts and percent of population suffering allergies, intercourse and babies
![Page 45: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/45.jpg)
Some reasons two variables could be related
The response variable is causing a change in the explanatory variable
Example: hotel occupancy and advertising spending, divorce and alcohol abuse
![Page 46: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/46.jpg)
Some reasons two variables could be related
The explanatory variable is a contributing -- but not sole -- cause
Example: birth complications and violence, gun in home and homicide, hours studied and grade, diet and cancer
![Page 47: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/47.jpg)
Some reasons two variables could be related
Both variables may result from a common cause
Example: SAT score and GPA, hot chocolate and tissues, storks and babies, fire losses and firefighters, WWII fighter opposition and bombing accuracy
![Page 48: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/48.jpg)
Some reasons two variables could be related
Both variables are changing over time
Example: divorces and drug offenses, divorces and suicides
![Page 49: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/49.jpg)
Some reasons two variables could be related
The association may be nothing more than coincidence
Example: clusters of disease, brain cancer from cell phones
![Page 50: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/50.jpg)
So how can we confirm causation?
The only way to confirm is with a designed (randomized double-blind) experiment.
But non-statistical evidence of a possible connection may include:
A reasonable explanation of cause and effect.
A connection that happens under varying conditions.
Potential confounding variables ruled out.
![Page 51: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/51.jpg)
Regression
![Page 52: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/52.jpg)
Linear Regression
In addition to figuring the strength of the relationship, we can create a simple equation that describes the best-fit line (also called the “least-squares” line) through the data.
This equation will help us predict one variable, given the other.
![Page 53: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/53.jpg)
Best-fit (“least-squares”) Line
![Page 54: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/54.jpg)
Best-fit Line??? (much variance)
![Page 55: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/55.jpg)
Best-fit Line! (least variance)
![Page 56: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/56.jpg)
Remember 9th Grade Algebra?
x = horizontal axis
y = vertical axis
Equation for a line:
y = slope * x + intercept
or as it often is stated:
y = mx + b
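The slope and intercept of the best-fit line come from the standard least-squares formulas. The sketch below uses invented data points; `least_squares_line` and `predict` are hypothetical helper names.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    if sum_xx == 0:
        raise ValueError("x values must not all be identical")
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted y for a given x on the fitted line."""
    return slope * x + intercept


# Invented points scattered around the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

m, b = least_squares_line(xs, ys)
print(f"y = {m:.2f} * x + {b:.2f}")
print(f"Predicted y at x = 6: {predict(6, m, b):.2f}")
```

Predictions are most trustworthy inside the range of the observed x values; extending the line far beyond the data is an extrapolation the data cannot vouch for.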
![Page 57: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/57.jpg)
Regression in data journalism
Public school test scores
Cheating in school test scores
Tenure of white vs. black coaches in NBA
Racial bias in picking jurors
Racial profiling in traffic stops
![Page 58: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/58.jpg)
Confusion of the inverse
![Page 59: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/59.jpg)
Confusion of the Inverse
Confusing these two:
Probability of actually having a condition, given a positive test for it
Probability of having a positive test, given actually having the condition
When the incidence of some disease or condition is very low, and the test for it is not perfect, there will be a high probability that a positive test result is a false positive.
![Page 60: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/60.jpg)
Definitions
Base rate: The probability that someone has a disease or condition, without knowing any test results.
Test Sensitivity: Proportion of people who correctly test positive when they have the disease or condition (true positive)
Test Specificity: Proportion of people who correctly test negative when they don’t have the disease or condition (true negative)
![Page 61: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/61.jpg)
Drug Tests
Consider this scenario:
Base rate: 1% of population to be tested uses dangerous drugs
You use a test that’s 99% accurate in both sensitivity and specificity
10,000 people are tested
![Page 62: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/62.jpg)
Drug Tests
| | Test Positive | Test Negative | Total |
|---|---|---|---|
| Users | | | 100 |
| Not | | | 9,900 |
| Total | | | 10,000 |
![Page 63: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/63.jpg)
Drug Tests
| | Test Positive | Test Negative | Total |
|---|---|---|---|
| Users | 99 | 1 | 100 |
| Not | | | 9,900 |
| Total | | | 10,000 |
![Page 64: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/64.jpg)
Drug Tests
| | Test Positive | Test Negative | Total |
|---|---|---|---|
| Users | 99 | 1 | 100 |
| Not | | 9,801 | 9,900 |
| Total | | 9,802 | 10,000 |
![Page 65: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/65.jpg)
Drug Tests
| | Test Positive | Test Negative | Total |
|---|---|---|---|
| Users | 99 | 1 | 100 |
| Not | ??? | 9,801 | 9,900 |
| Total | | 9,802 | 10,000 |
![Page 66: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/66.jpg)
Drug Tests
| | Test Positive | Test Negative | Total |
|---|---|---|---|
| Users | 99 | 1 | 100 |
| Not | 99 | 9,801 | 9,900 |
| Total | 198 | 9,802 | 10,000 |

(50% of positives are FALSE!)
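The table can be rebuilt from the three inputs of the scenario: base rate, sensitivity and specificity. The sketch below is one way to do the arithmetic; the function names are hypothetical, and counts are rounded to whole people.

```python
def confusion_counts(num_tested, base_rate, sensitivity, specificity):
    """Split a tested group into true/false positives and negatives."""
    users = round(num_tested * base_rate)
    non_users = num_tested - users
    true_positives = round(users * sensitivity)      # users correctly flagged
    false_negatives = users - true_positives         # users the test misses
    true_negatives = round(non_users * specificity)  # non-users correctly cleared
    false_positives = non_users - true_negatives     # non-users wrongly flagged
    return true_positives, false_negatives, false_positives, true_negatives


def share_of_positives_that_are_true(true_positives, false_positives):
    """Probability of actually being a user, given a positive test."""
    return true_positives / (true_positives + false_positives)


# Scenario from the slides: 1% base rate, 99% sensitivity and specificity, 10,000 tested.
tp, fn, fp, tn = confusion_counts(10_000, 0.01, 0.99, 0.99)
print(f"Users:     {tp} positive, {fn} negative")
print(f"Non-users: {fp} positive, {tn} negative")
print(f"Share of positives that are real: {share_of_positives_that_are_true(tp, fp):.0%}")
```

Raising the base rate changes the picture quickly: with the same test and 10% of the group using drugs, the large majority of positives are real, which is why the base rate belongs in any story about test accuracy.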
![Page 67: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/67.jpg)
Confidence intervals and p-values
![Page 68: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/68.jpg)
Confidence Intervals
Like the error margin around poll results
A confidence interval is a tradeoff between certainty and accuracy, like shooting at targets of different sizes
The bigger the sample, the smaller the confidence interval at the 95% level
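When comparing results, if confidence intervals overlap, treat the difference as NOT statistically significant (a cautious rule of thumb; a formal test can sometimes find significance despite a small overlap)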
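For a poll percentage, a 95% confidence interval can be sketched with the usual normal approximation. The poll figures below are invented, and `proportion_ci` and `intervals_overlap` are hypothetical helper names; the approximation itself is an assumption that works best for reasonably large samples and percentages not too close to 0 or 100.

```python
import math

Z_95 = 1.96  # standard normal multiplier for a 95% confidence level


def proportion_ci(share: float, sample_size: int):
    """Approximate 95% confidence interval (low, high) for a sample proportion."""
    if not 0 <= share <= 1 or sample_size <= 0:
        raise ValueError("share must be in [0, 1] and sample_size positive")
    half_width = Z_95 * math.sqrt(share * (1 - share) / sample_size)
    return share - half_width, share + half_width


def intervals_overlap(first, second) -> bool:
    """True if two (low, high) intervals share any values."""
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical polls: the same 52% result from two sample sizes.
small_poll = proportion_ci(0.52, 400)
large_poll = proportion_ci(0.52, 1600)
print(f"N=400:  {small_poll[0]:.3f} to {small_poll[1]:.3f}")
print(f"N=1600: {large_poll[0]:.3f} to {large_poll[1]:.3f}")

# Hypothetical race: 48% vs. 52%, each measured with N=400.
rival = proportion_ci(0.48, 400)
print("Intervals overlap:", intervals_overlap(rival, small_poll))
```

Quadrupling the sample halves the width of the interval, and the 48% vs. 52% race with 400 respondents each is the kind of result better described as too close to call.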
![Page 69: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/69.jpg)
P-values
P-value is the probability of getting a result at least as extreme as the sample’s if there were really no effect (i.e., by chance alone)
The 95% confidence level (p < 0.05) is the most commonly used threshold in social science research
Hard science, particularly medicine, often needs tighter confidence intervals and smaller p-values, like p<0.01
Even so, about 1 in 20 studies testing an effect that isn’t real will report one at p < 0.05 (and you won’t know which)
On the other hand, they probably won’t be very wrong.
![Page 70: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/70.jpg)
How to read a research study
Pay attention to the method: Observational, randomized double-blind experiment, meta-analysis, case study
Note the sample size
Don’t ignore the confidence intervals
Treat the p-value as a rough gauge of the risk that you’re writing about a chance result
Remember correlation doesn’t necessarily mean causation.
Consider the quality of the journal (peer reviewed?)
Who paid for the research?
![Page 71: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/71.jpg)
Newsroom math bibliography
“Numbers in the Newsroom,” by Sarah Cohen, IRE
“News and Numbers,” by Victor Cohn and Lewis Cope
“Precision Journalism” (4th edition), by Phil Meyer
“Innumeracy,” by John Allen Paulos
“A Mathematician Reads the Newspaper,” by John Allen Paulos
“Damned Lies and Statistics,” by Joel Best
![Page 72: Statistics for Science Journalists](https://reader036.fdocuments.in/reader036/viewer/2022062315/56816683550346895dda296b/html5/thumbnails/72.jpg)
Questions?