Post on 26-Dec-2015
1
TR 555 Statistics “Refresher”
Lecture 1: Probability Concepts
References:
– Penn State University, Dept. of Statistics, Statistical Education Resource Kit: a collection of resources used by faculty in Penn State's Department of Statistics in teaching introductory statistics courses. Page maintained by Laura J. Simon, Sept. 2003
– Statistics: Making Sense of Data (MIT), William Stout, John Marden and Kenneth Travers, http://www.introductorystatistics.com/, Sept. 2003
– Tom Maze, stat course prepared for KDOT, 2003
2
Outline
Overview of statistics
Types of data
Describing data numerically and graphically
Probability and random variables
3
Probability and Statistics
Probability is the likelihood of an event occurring relative to all other events
– Example: If a coin is flipped, what is the probability of getting heads? – 0.5
– Given that the last flip was heads, what is the probability that the next will be heads? – 0.5
Statistics is the measurement and modeling of random variables
– Example: If our state averages 200 fatal crashes per year, what is the probability of having one crash today?
– Poisson distribution: $\lambda$ = average per time period = 200/365 = 0.55 per day
– $P(x = 1) = \frac{(\lambda t)^x}{x!} e^{-\lambda t} = \frac{(0.55 \cdot 1)^1}{1!} e^{-0.55(1)} = 0.32$
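The Poisson arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the function name `poisson_pmf` and the variable names are illustrative and not part of the original slides.

```python
import math

def poisson_pmf(x: int, rate: float, t: float = 1.0) -> float:
    """P(X = x) for a Poisson count whose mean is rate * t."""
    mean = rate * t
    return (mean ** x) / math.factorial(x) * math.exp(-mean)

daily_rate = 200 / 365                 # average fatal crashes per day (about 0.55)
p_one_today = poisson_pmf(1, daily_rate)
print(round(p_one_today, 2))           # 0.32
```

The same function gives the probability of any other daily count, for example `poisson_pmf(0, daily_rate)` for a day with no fatal crashes.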
4
Data Collection
Designing experiments
– Does aspirin help reduce the risk of heart attacks?
Observational studies
– Polls - Clinton’s approval rating
5
Variable Types
Deterministic
– Assume away variation and randomness
– Known with certainty
– One-to-one mapping of independent variable to dependent variable
[Figure: relationship curve mapping X1 to a single value Y1]
6
Variable Types Continued
Random or Stochastic
– Recognized uncertainty of an event
– One-to-one distribution mapping of independent variable to dependent variable
[Figure: distribution of possible values ("Probability that it could be any of these values"); the center is most likely, both tails are less likely]
7
Population
The set of data (numerical or otherwise) corresponding to the entire collection of units about which information is sought
8
Sample
A subset of the population data that are actually collected in the course of a study.
9
WHO CARES?
In most studies, it is difficult to obtain information from the entire population. We rely on samples to make estimates or inferences related to the population.
10
Organization and Description of Data
Qualitative vs. Quantitative data
Discrete vs. Continuous Data
Graphical Displays
Measures of Center
Measures of Variation
11
Qualitative (Categorical) Data
The raw (unsummarized) data are merely labels or categories
Quantitative (Numerical) Data
The raw (unsummarized) data are numerical
12
Qualitative Data Examples
Class Standing (Fr, So, Ju, Sr)
Section # (1,2,3,4,5,6)
Automobile Make (Ford, Chevrolet, Nissan)
Questionnaire response (disagree, neutral, agree)
13
Quantitative Data Examples (measures)
Voltage
Height
Weight
SAT Score
Number of students arriving late for class
Time to complete a task
14
Discrete Data
Only certain values are possible (there are gaps between the possible values)
Continuous Data
Theoretically, any value within an interval is possible with a fine enough measuring device
15
Discrete Data Examples
Number of students late for class
Number of crimes reported to SC police
Number of times the word number is used
(generally, discrete data are counts)
16
Discrete Variable Model
Poisson Distribution
$P(x) = \frac{(0.55t)^x}{x!} e^{-0.55t}$
[Figure: bar chart, Probability of # of Fatals per one day; x-axis: # of Fatal Crashes (0–15); y-axis: Probability (0–0.7)]
17
Continuous Data Examples
Voltage
Height
Weight
Time to complete a homework assignment
18
Continuous Variable Model
Exponential Distribution
[Figure: Fatality Probability Density Function; x-axis: Time till the first fatal accident (0–5.5); y-axis: Probability (0–0.6)]
Probability density of first fatal at time t: $f(t) = \lambda e^{-\lambda t}$
19
Continuous Probability Function
[Figure: Cumulative Probability till first fatal; x-axis: Days (0–5.5); y-axis: Cumulative Probability (0–1.2)]
Cumulative probability of time till first fatal: $F(t) = 1 - e^{-\lambda t}$
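The exponential model for the waiting time can be sketched in Python as follows. This is only an illustration: it assumes the rate of 0.55 fatal crashes per day carried over from the Poisson example, and the function names `exp_pdf` and `exp_cdf` are hypothetical.

```python
import math

def exp_pdf(t: float, rate: float) -> float:
    """Exponential density: rate * exp(-rate * t)."""
    return rate * math.exp(-rate * t)

def exp_cdf(t: float, rate: float) -> float:
    """Probability that the first event happens by time t: 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t)

rate = 0.55                        # assumed: fatal crashes per day, from the Poisson slides
p_within_one_day = exp_cdf(1.0, rate)
p_within_five_days = exp_cdf(5.0, rate)
print(round(p_within_one_day, 3), round(p_within_five_days, 3))
```

Note that `1 - exp_cdf(1.0, rate)` equals the Poisson probability of zero crashes in one day, which is why the two models describe the same process.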
20
Nominal Data
A type of categorical data in which objects fall into unordered categories, for example:
– Hair color: blonde, brown, red, black, etc.
– Race: Caucasian, African-American, Asian, etc.
– Smoking status: smoker, non-smoker
21
Ordinal Data
A type of categorical data in which order is important. For example …
– Class: freshman, sophomore, junior, senior, super senior
– Degree of illness: none, mild, moderate, severe, …, going, going, gone
– Opinion of students about riots: ticked off, neutral, happy
22
Binary Data
A type of categorical data in which there are only two categories.
Binary data can be either nominal or ordinal, for example …
– Smoking status: smoker, non-smoker
– Attendance: present, absent
– Class: lower classman, upper classman
23
Interval and Ratio Data
Interval
– Interval is important, but no meaningful zero
– e.g., temperature in Fahrenheit
Ratio
– Has a meaningful zero value
– e.g., temperature in Kelvin, crash rate
24
Who Cares?
The type(s) of data collected in a study determine the type of statistical analysis used.
25
Proportions
Categorical data are commonly summarized using “percentages” (or “proportions”).
– 11% of students have a tattoo
– 2%, 33%, 39%, and 26% of the students in class are, respectively, freshmen, sophomores, juniors, and seniors
26
Averages
Measurement data are typically summarized using “averages” (or “means”).
– Average number of siblings Fall 1998 Stat 250 students have is 1.9.
– Average weight of male Fall 1998 Stat 250 students is 173 pounds.
– Average weight of female Fall 1998 Stat 250 students is 138 pounds.
27
Descriptive statistics
Describing data with numbers: measures of location
28
Mean
Another name for average.
If describing a population, denoted as μ, the Greek letter “mu”.
If describing a sample, denoted as $\bar{x}$, called “x-bar”.
Appropriate for describing measurement data.
Seriously affected by unusual values called “outliers”.
29
Calculating Sample Mean
Formula: $\bar{X} = \frac{\sum X_i}{n}$
That is, add up all of the data points and divide by the number of data points.
Data (# of classes skipped): 2 8 3 4 1
Sample Mean = (2+8+3+4+1)/5 = 3.6
Do not round! Mean need not be a whole number.
30
Population Mean
The mean of a random variable X is called the population mean and is denoted μ.
It is also called the expected value of X or the expectation of X and is denoted E(X).
$E(X) = \sum x_i f(x_i)$
31
Median
Another name for 50th percentile.
Appropriate for describing measurement data.
“Robust to outliers,” that is, not affected much by unusual values.
32
Calculating Sample Median
Order data from smallest to largest.
If odd number of data points, the median is the middle value.
Data (# of classes skipped): 2 8 3 4 1
Ordered Data: 1 2 3 4 8
Median = 3
33
Calculating Sample Median
Order data from smallest to largest.
If even number of data points, the median is the average of the two middle values.
Data (# of classes skipped): 2 8 3 4 1 8
Ordered Data: 1 2 3 4 8 8
Median = (3+4)/2 = 3.5
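The mean and median calculations for the classes-skipped data can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

odd_data = [2, 8, 3, 4, 1]        # number of classes skipped, five students
even_data = [2, 8, 3, 4, 1, 8]    # same data with one more student

sample_mean = statistics.mean(odd_data)        # (2+8+3+4+1)/5
median_odd = statistics.median(odd_data)       # middle value of the sorted data
median_even = statistics.median(even_data)     # average of the two middle values

print(sample_mean, median_odd, median_even)    # 3.6 3 3.5
```

Replacing the 8 with a much larger value changes the mean but leaves the median at 3, which is the sense in which the median is robust to outliers.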
34
Mode
The value that occurs most frequently.
One data set can have many modes.
Appropriate for all types of data, but most useful for categorical data or discrete data with only a few possible values.
35
Most appropriate measure of location
Depends on whether or not data are “symmetric” or “skewed”.
Depends on whether or not data have one (“unimodal”) or more (“multimodal”) modes.
36
Symmetric and Unimodal
37
Symmetric and Bimodal
38
Skewed Right
[Figure: histogram, Number of Music CDs of Spring 1998 Stat 250 Students; x-axis: Number of Music CDs (0–400); y-axis: Frequency (0–20)]
39
Skewed Left
40
Choosing Appropriate Measure of Location
If data are symmetric, the mean, median, and mode will be approximately the same.
If data are multimodal, report the mean, median and/or mode for each subgroup.
If data are skewed, report the median.
41
Descriptive statistics
Describing data with numbers: measures of variability
42
Range
The difference between largest and smallest data point.
Highly affected by outliers.
Best for symmetric data with no outliers.
[Figure: histogram, GPAs of Spring 1998 Stat 250 Students; x-axis: GPA (2.0–4.0); y-axis: Frequency (0–20)]
43
Interquartile range
The difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile). So, the “middle-half” of the values.
IQR = Q3 - Q1
Robust to outliers or extreme observations.
Works well for skewed data.
[Figure: histogram, GPAs of Spring 1998 Stat 250 Students; x-axis: GPA (2.0–4.0); y-axis: Frequency (0–20)]
44
Variance
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
1. Find difference between each data point and mean.
2. Square the differences, and add them up.
3. Divide by one less than the number of data points.
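The three steps can be sketched in Python as follows. The data set is the classes-skipped sample from the mean slides, reused here as an assumption for illustration; the variable names are hypothetical.

```python
import statistics

data = [2, 8, 3, 4, 1]                      # assumed example data (classes skipped)
mean = sum(data) / len(data)

# Step 1: difference between each data point and the mean.
deviations = [x - mean for x in data]

# Step 2: square the differences and add them up.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 3: divide by one less than the number of data points.
sample_variance = sum_of_squares / (len(data) - 1)
sample_sd = sample_variance ** 0.5

print(round(sample_variance, 2), round(sample_sd, 2))
```

The result agrees with `statistics.variance(data)`, which uses the same n - 1 divisor.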
45
Variance
If measuring variance of population, denoted by $\sigma^2$ (“sigma-squared”).
If measuring variance of sample, denoted by $s^2$ (“s-squared”).
Measures average squared deviation of data points from their mean.
Highly affected by outliers. Best for symmetric data.
Problem is units are squared.
46
Population Variance
The variance of a random variable X is called the population variance and is denoted $\sigma^2$.
$\sigma^2 = \sum (x_i - \mu)^2 f(x_i)$
47
Standard deviation
Sample standard deviation is square root of sample variance, and so is denoted by s.
Units are the original units.
Measures average deviation of data points from their mean.
Also, highly affected by outliers.
48
Population Standard Deviation
The population standard deviation is the square root of the population variance and is denoted σ.
$\sigma = \sqrt{\sum (x_i - \mu)^2 f(x_i)}$
49
What is the variance or standard deviation?
(MPH)
50
Variance or standard deviation
Sex      N    Mean    Median  TrMean  StDev  SE Mean
female   126   91.23   90.00   90.83  11.32  1.01
male     100  106.79  110.00  105.62  17.39  1.74

Sex     Minimum  Maximum  Q1     Q3
female   65.00   120.00   85.00   98.25
male     75.00   162.00   95.00  118.75

Females: s = 11.32 mph and $s^2 = 11.32^2 = 128.1$ mph$^2$
Males: s = 17.39 mph and $s^2 = 17.39^2 = 302.4$ mph$^2$
51
Coefficient of Variation (COV) – not covariance!
Ratio of sample standard deviation to sample mean multiplied by 100.
Measures relative variability, that is, variability relative to the magnitude of the data.
Unitless, so good for comparing variation between two groups.
52
Coefficient of variation (MPH)
Sex      N    Mean    Median  TrMean  StDev  SE Mean
female   126   91.23   90.00   90.83  11.32  1.01
male     100  106.79  110.00  105.62  17.39  1.74

Sex     Minimum  Maximum  Q1     Q3
female   65.00   120.00   85.00   98.25
male     75.00   162.00   95.00  118.75

Females: CV = (11.32/91.23) x 100 = 12.4
Males: CV = (17.39/106.79) x 100 = 16.3
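The coefficient of variation for the two groups can be reproduced with a short Python sketch. The function name `coefficient_of_variation` is illustrative; the inputs are the standard deviations and means from the summary table.

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """Sample standard deviation as a percentage of the sample mean."""
    return std_dev / mean * 100

cv_female = coefficient_of_variation(11.32, 91.23)
cv_male = coefficient_of_variation(17.39, 106.79)

print(round(cv_female, 1), round(cv_male, 1))   # 12.4 16.3
```

Because the units cancel, the two values can be compared directly: the male speeds vary more relative to their mean than the female speeds do.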
53
Choosing Appropriate Measure of Variability
If data are symmetric, with no serious outliers, use range and standard deviation.
If data are skewed, and/or have serious outliers, use IQR.
If comparing variation across two data sets, use coefficient of variation.
54
Descriptive Statistics
Summarizing data using graphs
55
Which graph to use?
Depends on type of data Depends on what you want to illustrate Depends on available statistical software
56
Bar Chart
Summarizes categorical data.
Horizontal axis represents categories, while vertical axis represents either counts (“frequencies”) or percentages (“relative frequencies”).
Used to illustrate the differences in percentages (or counts) between categories.
[Figure: bar chart, Birth Order of Spring 1998 Stat 250 Students, n=92 students; x-axis: Birth Order (Middle, Oldest, Only, Youngest); y-axis: Percent (10–40)]
57
Histogram
Divide measurement up into equal-sized categories.
Determine number (or percentage) of measurements falling into each category.
Draw a bar for each category so bars’ heights represent number (or percent) falling into the categories.
Label and title appropriately.
[Figure: histogram, Age of Spring 1998 Stat 250 Students, n=92 students; x-axis: Age (in years), 18–27; y-axis: Frequency (Count), 0–50]
58
Use common sense in determining number of categories to use.
(Trial-and-error works fine, too.)
Number of ranges (see Tufte)
[Figure: histogram with few categories, Age of Spring 1998 Stat 250 Students, n=92 students; x-axis: Age (in years), 18–28; y-axis: Frequency (Count), 0–60]
[Figure: histogram with many categories, GPAs of Spring 1998 Stat 250 Students, n=92 students; x-axis: GPA (2–4); y-axis: Frequency (Count), 0–7]
59
Dot Plot
Summarizes measurement data.
Horizontal axis represents measurement scale.
Plot one dot for each data point.
[Figure: dot plots, Fastest Ever Driving Speed, 226 Stat 100 Students, Fall '98; x-axis: Speed (70–160); groups: Women (n = 126), Men (n = 100)]
60
Stem-and-Leaf Plot
Summarizes measurement data.
Each data point is broken down into a “stem” and a “leaf.”
First, “stems” are aligned in a column.
Then, “leaves” are attached to the stems.
61
Boxplot
smallest observation = 3.20
Q1 = 43.645
Q2 (median) = 60.345
Q3 = 84.96
largest observation = 124.27
[Figure: boxplot drawn on a scale from 0 to 130]
62
Box Plot
“Whiskers” are drawn to the most extreme data points that are not more than 1.5 times the length of the box beyond either quartile.
– Whiskers are useful for identifying outliers.
“Outliers,” or extreme observations, are denoted by asterisks.
– Generally, data points falling beyond the whiskers are considered outliers.
Useful for comparing two distributions
[Figure: boxplot, Amount of sleep in past 24 hours of Spring 1998 Stat 250 Students; y-axis: Hours of sleep (0–10)]
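The whisker rule can be sketched in Python. This example assumes the five-number summary from the earlier Boxplot slide (Q1 = 43.645, Q3 = 84.96, extremes 3.20 and 124.27); the function name `whisker_fences` is hypothetical.

```python
def whisker_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Limits beyond which a data point is flagged as an outlier."""
    box_length = q3 - q1                    # the interquartile range
    return q1 - k * box_length, q3 + k * box_length

low_fence, high_fence = whisker_fences(43.645, 84.96)

# Only the extremes need checking: if they are inside the fences, every point is.
extremes = [3.20, 124.27]
outliers = [x for x in extremes if x < low_fence or x > high_fence]

print(round(low_fence, 4), round(high_fence, 4), outliers)
```

Under these assumptions neither extreme falls outside the fences, so the whiskers would simply extend to the smallest and largest observations and no asterisks would be drawn.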
63
Using Box Plots to Compare
[Figure: side-by-side boxplots, Fastest Ever Driving Speed, 226 Stat 100 Students, Fall 1998; x-axis: Gender (female, male); y-axis: Fastest Speed (mph), 60–160]
64
Scatter Plots
Summarizes the relationship between two measurement variables.
Horizontal axis represents one variable and vertical axis represents second variable.
Plot one point for each pair of measurements.
[Figure: scatter plot, Foot sizes of Spring 1998 Stat 250 students, n=88 students; x-axis: Left foot (in cm), 22–31; y-axis: Right foot (in cm), 22–31]
65
No relationship
[Figure: scatter plot, Lengths of left forearms and head circumferences of Spring 1998 Stat 250 Students, n=89 students; x-axis: Head circumference (in cm), 52–62; y-axis: Left forearm (in cm), 22–32]
66
Closing comments
Many possible types of graphs.
Use common sense in reading graphs.
When creating graphs, don’t summarize your data too much or too little.
When creating graphs, label everything for others.
Remember you are trying to communicate something to others!
67
Probability
You’ll probably like it!
68
Before we begin …
What is the probability that 2 or more people share the same birthday if …
– 5 people are in the sample?
– 23 people?
– 50 people?
– This class?
69
Probability Properties
The probability of an event “A” (the proportion of times the event is expected to occur in repeated experiments), is denoted P(A).
All probabilities are between 0 and 1 (i.e., 0 ≤ P(A) ≤ 1).
The sum of the probabilities of all possible outcomes must be 1.
70
Probability Basics
Given that a crash has occurred, what is the probability that it is a fatal crash?
– Possible events – Fatal, injury, and property damage only (PDO)

Fatal             37,000   P(F) = 0.58%
Injury         2,026,000   P(I) = 32.16%
PDO            4,226,000   P(D) = 67.08%
Total Crashes  6,300,000
71
Complement
The complement of an event A, denoted by $\bar{A}$, is the set of outcomes that are not in A
$\bar{A}$ means A does not occur
$P(\bar{A}) = 1 - P(A)$
Some texts use $A^c$ to denote the complement of A
72
Union
The union of two events A and B, denoted by A ∪ B, is the set of outcomes that are in A, or B, or both
If A ∪ B occurs, then either A or B or both occur
73
Intersection
The intersection of two events A and B, denoted by A ∩ B, is the set of outcomes that are in both A and B.
If A ∩ B occurs, then both A and B occur
74
Combinations of Events
[Venn diagram: within All Fatal Crashes (37,795), the Single Vehicle Crash circle (21,052) overlaps the Speed Related Crashes circle (13,357); callouts mark the union of fatal speed related and run-off the road crashes and the intersection of fatal and run-off the road crashes]
75
Addition Law
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
(The probability of the union of A and B is the probability of A plus the probability of B minus the probability of the intersection of A and B)
76
Mutually Exclusive Events
Two events are mutually exclusive if their intersection is empty.
Two events, A and B, are mutually exclusive if and only if P(A ∩ B) = 0
For mutually exclusive events the addition law reduces to P(A ∪ B) = P(A) + P(B)
77
Conditional Probability
The probability of event A occurring, given that event B has occurred, is called the conditional probability of event A given event B, denoted P(A|B)
78
Multiplication Rule
General form: P(A|B) = P(A,B)/P(B), or equivalently P(A,B) = P(A|B)P(B)
e.g., what is the probability of a single vehicle accident given that it was speed related?
79
Conditional Probability Example
Total fatal crashes – 37,795
Total speed related crashes – 13,357
Total single vehicle crashes – 21,052
Total single vehicle, speed related crashes – 8,600
If the crash was speed related, what is the probability that it was a single vehicle crash?
– P(sv|sp) = 8,600/13,357 = 64.38%
If the crash was speed related, what is the probability that it was not a single vehicle crash?
– P(not sv|sp) = 1 – 0.6438 = 35.62%
[Venn diagram: All Fatal Crashes (37,795); Single Vehicle Crashes (21,052); Speed Related Crashes (13,357); overlap SR+SV (8,600)]
80
Conditional Probability Example (Cont)
Probability that a fatal crash was speed related = P(sp)
– 13,357/37,795 = 35.34%
Probability that a fatal crash was a single vehicle crash = P(sv)
– 21,052/37,795 = 55.70%
Probability that a fatal crash is both speeding related and a single vehicle crash = P(sv,sp)
– 8,600/37,795 = 22.75%
[Venn diagram: All Fatal Crashes (37,795); Single Vehicle Crashes (21,052); Speed Related Crashes (13,357); overlap SR+SV (8,600)]
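The marginal, joint, and conditional probabilities in this example follow directly from the four crash counts. The Python sketch below recomputes them; the variable names are illustrative, and the union at the end is an extra application of the addition law that does not appear on the slide.

```python
total_fatal = 37_795       # all fatal crashes
speed = 13_357             # speed related
single = 21_052            # single vehicle
both = 8_600               # single vehicle and speed related

p_sp = speed / total_fatal              # P(sp), about 35.34%
p_sv = single / total_fatal             # P(sv), about 55.70%
p_sv_and_sp = both / total_fatal        # P(sv,sp), about 22.75%

p_sv_given_sp = p_sv_and_sp / p_sp      # P(sv|sp), about 0.644 (the slide reports 64.38%)
p_not_sv_given_sp = 1 - p_sv_given_sp   # complement within speed related crashes

p_sv_or_sp = p_sv + p_sp - p_sv_and_sp  # addition law: single vehicle or speed related

print(round(p_sv_given_sp, 4), round(p_sv_or_sp, 4))
```

Dividing the joint probability by P(sp) gives the same number as dividing the counts directly (8,600/13,357), because the total cancels.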
81
Bayes’ Theorem
P(A|B)P(B) = P(B|A)P(A)
P(B|A) = P(A|B)P(B)/P(A)
P(sv) = 55.70%
P(sp) = 35.34%
P(sv|sp) = 64.38%
P(sp|sv) = ?
P(sp|sv) = ((0.6438)*(0.3534))/0.5570 = 0.4085
[Venn diagram: All Fatal Crashes (37,795); Single Vehicle Crashes (21,052); Speed Related Crashes (13,357); overlap SR+SV (8,600)]
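Bayes' theorem for the crash data can be checked numerically. The sketch below, with illustrative variable names, reverses the conditional probability and confirms that the result matches the direct ratio of counts, 8,600/21,052.

```python
total_fatal = 37_795
speed = 13_357
single = 21_052
both = 8_600

p_sp = speed / total_fatal
p_sv = single / total_fatal
p_sv_given_sp = both / speed

# Bayes' theorem: P(sp|sv) = P(sv|sp) * P(sp) / P(sv)
p_sp_given_sv = p_sv_given_sp * p_sp / p_sv

# Direct check from the counts.
p_sp_given_sv_direct = both / single

print(round(p_sp_given_sv, 4), round(p_sp_given_sv_direct, 4))
```

Both routes give roughly 0.41, that is, about 41% of single vehicle fatal crashes were speed related.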
82
Bayes’ Theorem Problem
Given
– There were 11,696 off-road fixed object fatal crashes involving a single vehicle
– There were 13,357 fatal crashes involving a speeding vehicle
– There were 8,600 fatal crashes involving speeding and single vehicles
– There were 5,400 fatal crashes involving single vehicles, speeding, and off-road fixed object crashes
– The total number of fatal crashes is 37,795
– Given that a crash is speeding related, what is the probability that it will be an off-road single vehicle crash?
83
Bayes’ Problem Answer
What we need to know: P(or,sv|sp)
What we know:
– P(or,sv) = 30.95%
– P(sp) = 35.34%
– P(sv) = 55.70%
– P(sv,sp) = 22.75%
– P(sp,or,sv) = 14.29%
– P(or,sv|sv) = 55.56%
84
Answer Continued
Multiplication Rule
– P(sp|or,sv)P(or,sv) = P(sp,or,sv)
– P(sp|or,sv) = P(sp,or,sv)/P(or,sv)
– 46.17% = 0.1429/0.3095
Bayes’ Theorem
– P(or,sv|sp) = (P(sp|or,sv)*P(or,sv))/P(sp)
– 40.43% = (0.4617*0.3095)/0.3534
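The two-step answer can be reproduced from the raw counts in the problem statement. This Python sketch uses illustrative variable names and also shows the shortcut of dividing the counts directly.

```python
total_fatal = 37_795
off_road_sv = 11_696         # off-road fixed object, single vehicle
speed = 13_357               # speed related
speed_off_road_sv = 5_400    # speed related, off-road fixed object, single vehicle

p_or_sv = off_road_sv / total_fatal
p_sp = speed / total_fatal
p_sp_or_sv = speed_off_road_sv / total_fatal

# Multiplication rule: P(sp|or,sv) = P(sp,or,sv) / P(or,sv)
p_sp_given_or_sv = p_sp_or_sv / p_or_sv

# Bayes' theorem: P(or,sv|sp) = P(sp|or,sv) * P(or,sv) / P(sp)
p_or_sv_given_sp = p_sp_given_or_sv * p_or_sv / p_sp

print(round(p_sp_given_or_sv * 100, 2), round(p_or_sv_given_sp * 100, 2))   # 46.17 40.43
```

The final value equals 5,400/13,357, so Bayes' theorem is not strictly needed when the joint count is known; it matters when only the reversed conditional probability is available.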
85
Independence
Two events A and B are independent if
P(A|B) = P(A)
or
P(B|A) = P(B)
or
P(A ∩ B) = P(A)P(B)
86
Probability Concepts
Randomness
Independence
87
Thought Question 1
What does it mean to say that a deck of cards is “randomly” shuffled?
Every ordering of the cards is equally likely
There are about 8 followed by 67 zeros (52! ≈ 8 × 10^67) possible orderings of a 52 card deck
Every card has the same probability to end up in any specified location
The question continued
A 52 card deck is randomly shuffled
How often will the tenth card down from the top be a Club?
1/4 of the time
Every card has the same chance to end up 10th.
There are 13 clubs and 13/52 = 1/4
89
Law of Large Numbers
Relative frequency of an event gets closer to true probability as number of trials gets larger
90
Probability values
Probabilities are between 0 and 1
Total probabilities of all possible outcomes = 1
Probability = 1 means an event always happens
Probability = 0 means an event never happens
91
Does a prior event matter?
A fair coin is flipped four times.
First three flips are heads
What’s the probability that the fourth flip is heads?
1/2 assuming flips are independent
Results of first three flips don’t matter
92
Independence
The chance that B happens is not affected by whether A had happened.
93
Does prior event matter?
Ten cards are drawn without replacement from a 52 card deck.
2 Aces are among these 10 cards
What’s the probability the eleventh card is an Ace?
2/42 = 1/21
After ten draws, 42 cards remain, 2 of them are Aces
94
Dependence
The chance that B happens is affected by whether A has happened.
95
Sequence of Events
You guess at five True-False questions.
What’s the probability you get them all right?
96
Five right in five guesses
For each question, Pr(correct) = 1/2
Multiply probabilities
(1/2) x (1/2) x (1/2) x (1/2) x (1/2) = 1/32 = 0.031
97
Card Example
Two cards are taken from normal 52 card deck.
What’s the probability that both are Hearts?
Note - there’s dependence between the two cards
Answer = (13/52) x (12/51) = 1/17 = 0.059
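Both sequence-of-events calculations, the independent true-false guesses and the dependent card draws, can be verified with exact fractions. This is a small Python sketch with illustrative variable names.

```python
from fractions import Fraction

# Independent events: five true-false guesses, each correct with probability 1/2.
p_five_right = Fraction(1, 2) ** 5

# Dependent events: the second draw is made from 51 cards, 12 of them hearts,
# given that the first card drawn was a heart.
p_first_heart = Fraction(13, 52)
p_second_heart_given_first = Fraction(12, 51)
p_both_hearts = p_first_heart * p_second_heart_given_first

print(p_five_right, p_both_hearts)   # 1/32 1/17
```

The only difference between the two cases is whether the probability at each step changes based on what has already happened.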
98
The Birthday Problem
What is the probability that at least two people in this class share the same birthday?
99
Assumptions
Only 365 days each year.
Birthdays are evenly distributed throughout the year, so that each day of the year has an equal chance of being someone’s birthday.
100
Take group of 5 people….
Let A = event no one in group shares same birthday.
Then $A^C$ = event at least 2 people share same birthday.
P(A) = 365/365 × 364/365 × 363/365 × 362/365 × 361/365 = 0.973
P($A^C$) = 1 - 0.973 = 0.027
That is, about a 3% chance that in a group of 5 people at least two people share the same birthday.
101
Take group of 23 people….
Let A = event no one in group shares same birthday.
Then $A^C$ = event at least 2 people share same birthday.
P(A) = 365/365 × 364/365 × … × 343/365 = 0.493
P($A^C$) = 1 - 0.493 = 0.507
That is, about a 50% chance that in a group of 23 people at least two people share the same birthday.
102
Take group of 50 people….
Let A = event no one in group shares same birthday.
Then $A^C$ = event at least 2 people share same birthday.
P(A) = 365/365 × 364/365 × … × 316/365 = 0.03
P($A^C$) = 1 - 0.03 = 0.97
That is, “virtually certain” that in a group of 50 people at least two people share the same birthday.
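The three birthday calculations share one pattern: multiply the chances that each successive person avoids all earlier birthdays, then take the complement. A minimal Python sketch under the same assumptions (365 equally likely days), with a hypothetical function name:

```python
def p_shared_birthday(n_people: int, days: int = 365) -> float:
    """Probability that at least two of n_people share a birthday."""
    p_all_differ = 1.0
    for k in range(n_people):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_differ *= (days - k) / days
    return 1.0 - p_all_differ

for n in (5, 23, 50):
    print(n, round(p_shared_birthday(n), 3))
# 5 0.027
# 23 0.507
# 50 0.97
```

Calling `p_shared_birthday` with the size of the class answers the last question from the "Before we begin" slide.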
103
Two-way Tables
And various probabilities...
104
Two-way table of counts
Rows: gender   Columns: pierced ears

         N      Y    All
M       71     19     90
F        4     84     88
All     75    103    178

Cell Contents -- Count
105
Joint (“∩”) probabilities
Rows: gender   Columns: pierced ears

           N        Y      All
M         71       19       90
       39.89    10.67    50.56
F          4       84       88
        2.25    47.19    49.44
All       75      103      178
       42.13    57.87   100.00

Cell Contents -- Count
                 % of Tbl
106
Row conditional probabilities
Rows: gender   Columns: pierced ears

           N        Y      All
M         71       19       90
       78.89    21.11   100.00
F          4       84       88
        4.55    95.45   100.00
All       75      103      178
       42.13    57.87   100.00

Cell Contents -- Count
                 % of Row
107
Column conditional probabilities
Rows: gender   Columns: pierced ears

           N        Y      All
M         71       19       90
       94.67    18.45    50.56
F          4       84       88
        5.33    81.55    49.44
All       75      103      178
      100.00   100.00   100.00

Cell Contents -- Count
                 % of Col
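The three kinds of percentages in these tables differ only in the denominator: the grand total for joint probabilities, the row total for row conditionals, and the column total for column conditionals. A Python sketch using the counts from the table (the dictionary layout and names are illustrative):

```python
counts = {
    "M": {"N": 71, "Y": 19},
    "F": {"N": 4, "Y": 84},
}

total = sum(sum(row.values()) for row in counts.values())
row_totals = {g: sum(row.values()) for g, row in counts.items()}
col_totals = {c: sum(counts[g][c] for g in counts) for c in ("N", "Y")}

# Joint: P(gender and pierced), divide by the grand total.
joint = {(g, c): counts[g][c] / total for g in counts for c in counts[g]}
# Row conditional: P(pierced | gender), divide by the row total.
row_cond = {(g, c): counts[g][c] / row_totals[g] for g in counts for c in counts[g]}
# Column conditional: P(gender | pierced), divide by the column total.
col_cond = {(g, c): counts[g][c] / col_totals[c] for g in counts for c in counts[g]}

print(round(100 * joint[("F", "Y")], 2))      # 47.19
print(round(100 * row_cond[("F", "Y")], 2))   # 95.45
print(round(100 * col_cond[("F", "Y")], 2))   # 81.55
```

The same cell (female with pierced ears) therefore answers three different questions depending on which total it is compared against.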
108
Expected Value
Coincidences
109
Roulette Color Bet
18 black, 18 red, and 2 green numbers
Bet on one of black or red
If correct, win $1
If wrong, lose $1
110
Is the bet fair?
Fair game: expected value is 0
Expected value = sum of (outcome x prob)
Exp. Val. = (+1)(18/38) + (-1)(20/38) = -2/38
Not fair since expected value is not 0.
111
Color Bet versus Number bet
Both have same expected value
How are the bets the same?
Long run result is same
How are they different?
Short run results can be quite different
112
Prob of Five Straight Losses
Color Bet = (20/38)^5 = 0.04, 4%
Number Bet = (37/38)^5 = 0.88, 88%
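The comparison between the two bets can be sketched in Python with exact fractions. The slides do not state the payout for a single-number bet; the 35-to-1 payout below is an assumption (the usual roulette payout), and the function name `expected_value` is illustrative.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of payoff times probability over all outcomes."""
    return sum(payoff * prob for payoff, prob in outcomes)

color_bet = [(1, Fraction(18, 38)), (-1, Fraction(20, 38))]
number_bet = [(35, Fraction(1, 38)), (-1, Fraction(37, 38))]   # assumed 35-to-1 payout

ev_color = expected_value(color_bet)
ev_number = expected_value(number_bet)

# Probability of losing five bets in a row.
p5_color = float(Fraction(20, 38) ** 5)
p5_number = float(Fraction(37, 38) ** 5)

print(ev_color, ev_number)                       # -1/19 -1/19
print(round(p5_color, 2), round(p5_number, 2))   # 0.04 0.88
```

Under that payout assumption both bets lose 2/38 of a dollar per play on average, yet a losing streak of five is rare on the color bet and the norm on the number bet.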
113
A Spectacular Coincidence ?
Many states draw four digit lottery numbers
Several years ago Mass. and N.H. both drew the same number on the same night
Associated Press wrote that this was a spectacular 1 in 100 million coincidence
114
Was Associated Press Right ?
Only if number picked is specified in advance of the draws.
Chance both pick the same pre-specified number, for example 2963, is (1/10,000)(1/10,000)
This is 1 in 100 million
But the match could have been on any of 10,000 possibilities
115
The correct analysis
First state could have picked any number
Chance the second state matches is 1/10,000
Answer for two specific states is 1/10,000
But there were 15 states doing this almost every night.
116
The prob that the 15 states all differ
First state can be any number
Prob second state differs = 9,999/10,000
Prob third state is unique = 9,998/10,000
And so on, for 15 states
Multiply these probabilities to get the probability that all 15 differ
Answer is about 0.99 that all picked different numbers
117
Prob at least two states are same
Opposite from all different
Prob at least two the same = 1 - Prob(all differ)
1 - 0.99 = 0.01
About 1 in 100; a far cry from 1 in 100 million
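The lottery calculation has the same structure as the birthday problem, with 10,000 possible numbers in place of 365 days. The Python sketch below (function and variable names are illustrative) reproduces the three probabilities discussed.

```python
def p_all_differ(n_states: int, n_numbers: int = 10_000) -> float:
    """Probability that n_states independent draws are all different."""
    p = 1.0
    for k in range(n_states):
        # State k+1 must avoid the k numbers already drawn.
        p *= (n_numbers - k) / n_numbers
    return p

p_prespecified = (1 / 10_000) ** 2          # both states hit one number named in advance
p_two_states_match = 1 - p_all_differ(2)    # two specific states match on any number
p_some_match_15 = 1 - p_all_differ(15)      # at least two of 15 states match

print(p_prespecified, p_two_states_match, round(p_some_match_15, 2))
```

The jump from 1 in 100 million to about 1 in 100 comes from two changes: any matching number counts, and any pair among the 15 states counts.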