Data sampling and probability
DATA SAMPLING AND PROBABILITY Avjinder Singh Kaler and Kristi Mai
Multiplication Rule: Complements and Conditional Probability
Counting
Types of Sampling Methods
Summarizing Data
Statistical Graphs
Probability Distributions
Normal and Standard Normal Distribution
A conditional probability of an event is a probability obtained with the additional information that some other event has already occurred.
$P(B \mid A)$ denotes the conditional probability of event B occurring, given that event A has already occurred. It can be found by dividing the probability of events A and B both occurring by the probability of event A:
$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$$
Refer to Table 4-1 to find the following:
a) If 1 of the 1000 test subjects is randomly selected, find the probability that the subject had a positive test result, given that the subject actually uses drugs. That is, find P(positive test result | subject uses drugs).
b) If 1 of the 1000 test subjects is randomly selected, find the probability that the subject actually uses drugs, given that he or she had a positive test result. That is, find P(subject uses drugs | positive test result).
Solution:
a) $$P(\text{positive test result} \mid \text{subject uses drugs}) = \frac{P(\text{subject uses drugs and had a positive test result})}{P(\text{subject uses drugs})} = \frac{44/1000}{50/1000} = \frac{44}{50} = 0.88$$
b) $$P(\text{subject uses drugs} \mid \text{positive test result}) = \frac{P(\text{subject uses drugs and had a positive test result})}{P(\text{positive test result})} = \frac{44}{134} = 0.328$$
Table 4-1 Pre-Employment Drug Screening Results
                              Positive Test Result    Negative Test Result
Subject Uses Drugs            44 (True Positive)      6 (False Negative)
Subject Is Not a Drug User    90 (False Positive)     860 (True Negative)
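The two conditional probabilities can be checked directly from the counts in Table 4-1. The sketch below is illustrative only; the variable names and the helper `conditional` are not from the slides.

```python
# Counts taken from Table 4-1 (1000 test subjects in total).
true_positive = 44     # uses drugs, positive test
false_negative = 6     # uses drugs, negative test
false_positive = 90    # not a drug user, positive test
true_negative = 860    # not a drug user, negative test

total = true_positive + false_negative + false_positive + true_negative


def conditional(p_a_and_b, p_a):
    """Return P(B | A) = P(A and B) / P(A)."""
    return p_a_and_b / p_a


p_user = (true_positive + false_negative) / total       # P(subject uses drugs)
p_positive = (true_positive + false_positive) / total   # P(positive test result)
p_user_and_positive = true_positive / total             # P(both)

p_positive_given_user = conditional(p_user_and_positive, p_user)
p_user_given_positive = conditional(p_user_and_positive, p_positive)

print(round(p_positive_given_user, 3))
print(round(p_user_given_positive, 3))
```

The two results differ because the conditioning event changes the denominator: 50 drug users in part (a), 134 positive results in part (b).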
For a sequence of two events in which the first event can occur 𝑚
ways and the second event can occur 𝑛 ways, the events together
can occur a total of 𝑚 ∗ 𝑛 ways.
Example:
For a two-character code consisting of a letter followed by a digit, the
number of different possible codes is 26 ∗ 10 = 260.
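As a quick check of the counting rule, the letter-digit codes can be listed exhaustively. This is a small sketch that assumes uppercase letters A to Z and digits 0 to 9.

```python
from itertools import product
import string

letters = string.ascii_uppercase   # 26 possible first characters
digits = string.digits             # 10 possible second characters

# Every (letter, digit) pair is one possible code.
codes = ["".join(pair) for pair in product(letters, digits)]

print(len(codes), len(letters) * len(digits))
```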
The factorial symbol ! denotes the product of decreasing positive
whole numbers.
For example, 4! = 4 ∙ 3 ∙ 2 ∙ 1 = 24.
By special definition, 0! = 1.
n! = the number of different permutations (order counts) of n different items
when all n of them are selected. (This factorial rule reflects the
fact that the first item may be selected in n different ways, the second item
may be selected in n – 1 ways, and so on.)
Example:
The number of ways that the five letters {a, b, c, d, e} can be arranged is as
follows: 5! = 5 ∙ 4 ∙ 3 ∙ 2 ∙ 1 = 120
Requirements:
1. There are n different items available. (This rule does not apply if some of
the items are identical to others.)
2. We select r of the n items (without replacement).
3. We consider rearrangements of the same items to be different sequences.
(The permutation of ABC is different from CBA and is counted separately.)
If the preceding requirements are satisfied, the number of permutations (or
sequences) of r items selected from n available items (without replacement) is
$${}_{n}P_{r} = \frac{n!}{(n-r)!}$$
If the five letters {a, b, c, d, e} are available and three of them are to be selected without replacement, the number of different permutations is as follows:
$${}_{n}P_{r} = \frac{n!}{(n-r)!} = \frac{5!}{(5-3)!} = 60$$
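The same count can be reproduced three ways in Python: from the factorial formula, with the standard-library function `math.perm`, and by listing every ordered selection. This is a sketch for checking the arithmetic, not part of the original slides.

```python
import math
from itertools import permutations

letters = ["a", "b", "c", "d", "e"]
n, r = len(letters), 3

by_formula = math.factorial(n) // math.factorial(n - r)   # n! / (n - r)!
by_library = math.perm(n, r)                              # built-in nPr
by_listing = len(list(permutations(letters, r)))          # explicit enumeration

print(by_formula, by_library, by_listing)
```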
Requirements:
1. There are n items available, and some items are identical to others.
2. We select all of the n items (without replacement).
3. We consider rearrangements of distinct items to be different sequences.
If the preceding requirements are satisfied, and if there are n1 alike, n2 alike,
. . . nk alike, the number of permutations (or sequences) of all items selected
without replacement is
$$\frac{n!}{n_1!\, n_2! \cdots n_k!}$$
If the 10 letters {a, a, a, a, b, b, c, c, d, e} are available and all 10 of them are to be selected without replacement, the number of different permutations is as follows:
$$\frac{n!}{n_1!\, n_2! \cdots n_k!} = \frac{10!}{4!\, 2!\, 2!} = \frac{3{,}628{,}800}{24 \cdot 2 \cdot 2} = 37{,}800$$
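A minimal sketch of the rule for permutations when some items are identical. The function name `distinct_arrangements` is hypothetical; it divides n! by the factorial of each group size, and groups of size 1 contribute a factor of 1! = 1.

```python
import math
from collections import Counter


def distinct_arrangements(items):
    """Number of distinguishable orderings of all items, some of which may repeat."""
    result = math.factorial(len(items))
    for group_size in Counter(items).values():
        result //= math.factorial(group_size)   # remove orderings within each group
    return result


letters = list("aaaabbccde")   # the 10 letters from the example
print(distinct_arrangements(letters))
```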
Requirements:
1. There are n different items available.
2. We select r of the n items (without replacement).
3. We consider rearrangements of the same items to be the same. (The
combination of ABC is the same as CBA.)
If the preceding requirements are satisfied, the number of combinations of r
items selected from n different items is
$${}_{n}C_{r} = \frac{n!}{(n-r)!\, r!}$$
In the Pennsylvania Match 6 Lotto, winning the jackpot requires you select six different numbers from 1 to 49. The winning numbers may be drawn in any order. Find the probability of winning if one ticket is purchased.
Number of combinations:
$${}_{n}C_{r} = \frac{n!}{(n-r)!\, r!} = \frac{49!}{43!\, 6!} = 13{,}983{,}816$$
$$P(\text{winning}) = \frac{1}{13{,}983{,}816}$$
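The lottery count can be verified with the standard-library function `math.comb`. The sketch below also keeps the winning probability as an exact fraction; the variable names are illustrative.

```python
import math
from fractions import Fraction

n, r = 49, 6   # choose 6 different numbers from 1 to 49, order ignored

n_combinations = math.comb(n, r)       # n! / ((n - r)! r!)
p_win = Fraction(1, n_combinations)    # one ticket covers one combination

print(n_combinations)
print(float(p_win))
```

Because order does not matter here, `math.comb` is the right tool; `math.perm(49, 6)` would count each winning set 6! = 720 times.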
When different orderings of the same items are to be counted separately, we have a permutation problem, but when different orderings are not to be counted separately, we have a combination problem.
Permutations are for lists (order matters) and combinations are for groups (order doesn’t matter).
Data – collections of observations, such as measurements, genders,
or survey responses
Population – the complete collection of all individuals to be studied
Sample – sub-collection of population the data comes from
Census – the collection of data from every member of the population
Statistics – the science of planning studies, designing experiments, and
obtaining data, and then organizing, summarizing, analyzing, interpreting,
drawing conclusions about, and presenting data
The Gallup corporation collected data from 1013 adults in the United States. Results showed that 66% of the respondents worried about identity theft.
The population consists of all 241,472,385 adults in the United States.
The sample consists of the 1013 polled adults.
The objective is to use the sample data as a basis for drawing a conclusion about the whole population.
Simple random sample – A sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.
Random sample – Members from the population are selected in such a way that each individual member in the population has an equal chance of being selected.
Systematic sampling – Select some starting point and then select every kth element in the population.
Convenience sampling – Use results that are easy to get.
Stratified sampling – Subdivide the population into at least two different subgroups that share the same characteristics, then draw a sample from each subgroup (or stratum).
Cluster sampling – Divide the population area into sections (or clusters). Then randomly select some of those clusters. Now choose all members from selected clusters.
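The sampling methods can be sketched with Python's `random` module. Everything here is illustrative: the population of 100 numbered members, the two strata, the ten clusters, and all function names are assumptions made for the example, not part of the slides.

```python
import random


def simple_random_sample(population, n, rng):
    """Every possible sample of size n is equally likely."""
    return rng.sample(population, n)


def systematic_sample(population, k, rng):
    """Pick a random starting point, then take every kth element."""
    start = rng.randrange(k)
    return population[start::k]


def stratified_sample(strata, n_per_stratum, rng):
    """Draw a sample from each subgroup (stratum)."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


rng = random.Random(0)                 # fixed seed so runs are repeatable
population = list(range(1, 101))       # hypothetical population of 100 members
strata = {"low": population[:50], "high": population[50:]}
clusters = {f"block{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 2, rng)

print(srs, systematic, stratified, clustered, sep="\n")
```

Convenience sampling is omitted on purpose: it involves no random mechanism, which is exactly why it is the weakest of the methods listed.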
When working with large data sets, it is often helpful to
organize and summarize data by constructing a table called
a frequency distribution.
A frequency distribution shows how a data set is partitioned among all of
several categories (or classes) by listing all of the categories along
with the number (frequency) of data values in each of them.
IQ Score Frequency
50-69 2
70-89 33
90-109 35
110-129 7
130-149 1
Lower Class Limits are the smallest numbers that can actually belong to different classes: 50, 70, 90, 110, 130.
Upper Class Limits are the largest numbers that can actually belong to different classes: 69, 89, 109, 129, 149.
Class Boundaries are the numbers used to separate classes, but without the gaps created by class limits: 49.5, 69.5, 89.5, 109.5, 129.5, 149.5.
Class Midpoints are the values in the middle of the classes and can be found by adding the lower class limit to the upper class limit and dividing the sum by 2: 59.5, 79.5, 99.5, 119.5, 139.5.
$$\text{class midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2}$$
Class Width is the difference between two consecutive lower class limits or two consecutive lower class boundaries. Every class in the IQ score table has a width of 20.
A relative frequency distribution includes the same class limits as a frequency distribution, but the frequency of each class is replaced with a relative frequency (a proportion) or a percentage frequency (a percent).
$$\text{relative frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}}$$
$$\text{percentage frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}} \times 100\%$$
IQ Score Frequency Relative Frequency
50-69 2 2.6%
70-89 33 42.3%
90-109 35 44.9%
110-129 7 9.0%
130-149 1 1.3%
Cumulative Frequencies
IQ Score Frequency Cumulative Frequency
50-69 2 2
70-89 33 35
90-109 35 70
110-129 7 77
130-149 1 78
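The quantities defined for the IQ score table (midpoints, boundaries, width, percentage frequencies, cumulative frequencies) can all be derived from the class limits and frequencies. The sketch below assumes equal-width classes, as in the table; the variable names are illustrative.

```python
from itertools import accumulate

# (lower class limit, upper class limit) and frequency for each class.
classes = [(50, 69), (70, 89), (90, 109), (110, 129), (130, 149)]
frequencies = [2, 33, 35, 7, 1]

total = sum(frequencies)

# Midpoint: (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Width: difference between two consecutive lower class limits.
width = classes[1][0] - classes[0][0]

# Boundaries sit halfway across the gap between adjacent classes.
gap = classes[1][0] - classes[0][1]
boundaries = [classes[0][0] - gap / 2] + [upper + gap / 2 for _, upper in classes]

# Percentage frequency: class frequency / sum of all frequencies * 100%.
percentages = [round(100 * f / total, 1) for f in frequencies]

# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(frequencies))

print(midpoints, boundaries, width, percentages, cumulative, sep="\n")
```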
The frequencies start low, then increase to higher frequencies until reaching a maximum, and then decrease to low again.
The distribution is approximately symmetric, with frequencies preceding the maximum being roughly a mirror image of those that follow the maximum.
Quantitative data
• Numerical in nature
• Consists of numbers representing counts or measurements
• Has a unit and can be used arithmetically
Quantitative data can be further described by distinguishing between discrete and continuous types.
Examples:
• The weights of supermodels
• The ages of respondents
Discrete data – the number of possible values is either a finite number or a
‘countable’ number (i.e. the number of possible values is 0,
1, 2, 3, . . .).
Example:
The number of eggs that a hen lays
Continuous data – infinitely many possible values that correspond to some
continuous scale that covers a range of values without gaps,
interruptions, or jumps
Example:
The amount of milk that a cow produces;
e.g. 2.343115 gallons per day
Categorical (qualitative) data – consists of names or labels (representing categories)
Example:
• The gender (male/female) of professional athletes.
• Shirt numbers on professional athletes' uniforms, which are substitutes for names.
Bar graph
• Uses bars of equal width to show frequencies of categorical, or qualitative, data
• Vertical scale represents frequencies or relative frequencies.
• Horizontal scale identifies the different categories of qualitative data.
A multiple bar graph has two or more sets of bars and is used to
compare two or more data sets.
Pareto chart – a bar graph for qualitative data, with the bars arranged in descending order according to frequencies
Pie chart – a graph depicting qualitative data as slices of a circle, in which the size of each slice is proportional to frequency count
Random variable – a variable (typically represented by x) that has a single numerical value, determined by chance, for each outcome of a given procedure. It can be discrete or continuous, just like data.
Discrete random variable – has either a finite number of values or a countable number of values, where “countable” refers to the fact that there might be infinitely many values, but that they result from a counting process.
Continuous random variable – has infinitely many values, and those values can be associated with measurements on a continuous scale without gaps or interruptions.
Probability distribution – a description that gives the probability for each value of the random variable, often expressed in the format of a graph, table, or formula.
Note:
If a probability is very small, it is represented as 0+ in tables
(i.e. it is very small, yet positive)
1. There is a numerical random variable x and its values are associated with corresponding probabilities.
2. The sum of all probabilities must be 1: $\sum P(x) = 1$.
3. Each probability value must be between 0 and 1 inclusive: $0 \le P(x) \le 1$.
The probability histogram is very similar to a relative frequency histogram, but the vertical scale shows probabilities.
According to the range rule of thumb, most values should lie within 2 standard deviations of the mean.
We can therefore identify “unusual” values by determining if they lie outside these limits:
Maximum usual value = $\mu + 2\sigma$
Minimum usual value = $\mu - 2\sigma$
We found for families with two children, the mean number of girls is 1.0 and the standard deviation is 0.7 girls.
Use those values to find the maximum and minimum usual values for the number of girls.
Solution:
maximum usual value = $\mu + 2\sigma = 1.0 + 2(0.7) = 2.4$
minimum usual value = $\mu - 2\sigma = 1.0 - 2(0.7) = -0.4$
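The mean, standard deviation, and usual-value limits can be reproduced from a probability distribution. The distribution below is an assumption made for this sketch (boys and girls equally likely, births independent); the slides only state the resulting mean of 1.0 and standard deviation of 0.7.

```python
import math

# Assumed probability distribution for the number of girls among two children.
distribution = {0: 0.25, 1: 0.50, 2: 0.25}

# Requirements: probabilities sum to 1 and each lies between 0 and 1.
assert abs(sum(distribution.values()) - 1) < 1e-12
assert all(0 <= p <= 1 for p in distribution.values())

mean = sum(x * p for x, p in distribution.items())
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
sd = math.sqrt(variance)

# Range rule of thumb: usual values lie within 2 standard deviations of the mean.
max_usual = mean + 2 * sd
min_usual = mean - 2 * sd

print(round(mean, 1), round(sd, 1), round(max_usual, 1), round(min_usual, 1))
```

With these limits, 0, 1, and 2 girls all fall inside the usual range, so none of the possible outcomes is unusual.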
Rare Event Rule for Inferential Statistics
If, under a given assumption (such as the assumption that a coin is fair), the probability of a particular observed event (such as 992 heads in 1000 tosses of a coin) is extremely small, we conclude that the assumption is probably not correct.
Using Probabilities to Determine When Results Are Unusual
Unusually high # of successes: x successes among n trials is an unusually high number of successes if $P(x \text{ or more}) \le 0.05$.
Unusually low # of successes: x successes among n trials is an unusually low number of successes if $P(x \text{ or fewer}) \le 0.05$.
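A sketch of the 0.05 criterion for the coin example. It assumes the tosses are independent with a fixed success probability (a binomial model, which the slides do not name explicitly), and the helper names are hypothetical. Exact fractions are used so the very small tail probability does not underflow.

```python
from fractions import Fraction
from math import comb


def binomial_term(k, n, p):
    """P(exactly k successes in n trials), as an exact fraction."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_at_least(x, n, p):
    """P(x or more successes among n trials)."""
    p = Fraction(p)
    return sum(binomial_term(k, n, p) for k in range(x, n + 1))


def prob_at_most(x, n, p):
    """P(x or fewer successes among n trials)."""
    p = Fraction(p)
    return sum(binomial_term(k, n, p) for k in range(0, x + 1))


THRESHOLD = Fraction(5, 100)
fair = Fraction(1, 2)

# 992 heads in 1000 tosses of a supposedly fair coin.
p_high = prob_at_least(992, 1000, fair)
print(p_high <= THRESHOLD)   # unusually high under the fair-coin assumption?

# A milder, hypothetical case: 14 heads in 20 tosses.
p_mild = prob_at_least(14, 20, fair)
print(float(p_mild), p_mild <= THRESHOLD)
```

Under the rare event rule, a tail probability this small for 992 heads suggests that the fair-coin assumption is probably not correct.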
A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:
1. The total area under the curve must equal 1.
2. Every point on the curve must have a vertical height that is 0 or greater. (That is, the curve cannot fall below the x-axis.)
Because the total area under the density curve is equal to 1, there is a correspondence between area and probability.
A continuous random variable has a uniform distribution if its values are spread evenly over the range of possibilities. The graph of a uniform distribution results in a rectangular shape.
Given the uniform distribution illustrated, find the probability that a randomly selected voltage level is greater than 124.5 volts.
[Figure: the shaded area represents voltage levels greater than 124.5 volts.]
A continuous random variable has a normal distribution if it has a graph that is symmetric and bell-shaped and if the random variable can be described by the following equation:
$$f(x) = \frac{e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}}{\sigma\sqrt{2\pi}}$$
The standard normal distribution is a normal probability distribution with μ = 0 and σ = 1. The total area under its density curve is equal to 1.
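The normal density and the total-area property can be checked numerically. In this sketch `normal_pdf` is a hand-written version of the standard normal density formula, compared against `statistics.NormalDist` from the Python standard library; the mean of 100 and standard deviation of 15 are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Height of the normal density curve at x."""
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 100.0, 15.0   # hypothetical parameters

# Approximate the area under the curve with a simple Riemann sum
# over mu +/- 8 standard deviations.
n_steps = 24000
step = 16 * sigma / n_steps
xs = [mu - 8 * sigma + i * step for i in range(n_steps + 1)]
area = sum(normal_pdf(x, mu, sigma) for x in xs) * step

print(round(area, 6))
print(normal_pdf(0, 0, 1), NormalDist().pdf(0))
```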
Z-score – represents how much a given value, x, deviates from the center of a set of data.
This value can help to assess how “extreme” a particular data value is, based on the distribution the value is supposed to follow.
This score can also be used to convert sample data (sample statistics) to a measure of relative standing so that we can compare samples to one another.
Basic “Idea” Behind Formulas for Z-Scores:
$$Z = \frac{\text{the value} - \text{the center of the data}}{\text{a deviation measure}}$$
If the z-score is positive (+), the specific value falls above the center value.
If the z-score is negative (-), the specific value falls below the center value.
“Usual” values have z-scores between -2 and 2.
“Unusual” values have z-scores less than -2 or greater than 2.
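A minimal sketch of the z-score idea and the usual/unusual cutoffs. The function names and the example numbers (center 100, deviation 15) are hypothetical.

```python
def z_score(value, center, deviation):
    """(the value - the center of the data) / a deviation measure."""
    return (value - center) / deviation


def is_unusual(z):
    """Unusual values have z-scores less than -2 or greater than 2."""
    return z < -2 or z > 2


center, deviation = 100, 15   # hypothetical mean and standard deviation

for value in (70, 85, 100, 130, 135):
    z = z_score(value, center, deviation)
    side = "above" if z > 0 else "below" if z < 0 else "at"
    print(value, round(z, 2), side, "unusual" if is_unusual(z) else "usual")
```

Because z-scores are unit-free, values from different data sets can be compared once each is converted with its own center and deviation measure.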
We can find areas (probabilities) for different regions under a normal model using StatCrunch.
A bone mineral density test can be helpful in identifying the presence of osteoporosis.
The result of the test is commonly measured as a z score, which has a normal distribution with a mean of 0 and a standard deviation of 1.
A randomly selected adult undergoes a bone density test.
Find the probability that the result is a reading less than 1.27.
The probability of a randomly selected adult having a bone density reading less than 1.27 is 0.8980:
$P(z < 1.27) = 0.8980$
Using the same bone density test, find the probability that a randomly selected person has a result above –1.00 (which is considered to be in the “normal” range of bone density readings).
The probability of a randomly selected adult having a bone density above –1 is 0.8413.
A bone density reading between –1.00 and –2.50 indicates the subject has osteopenia. Find this probability.
The probability of a randomly selected adult having osteopenia is 0.1525.
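The slides use StatCrunch for these areas; as an alternative sketch, the same three probabilities can be computed with `statistics.NormalDist` from the Python standard library. The exact area for the osteopenia range comes out near 0.1524; the 0.1525 quoted above matches the difference of the four-decimal table values 0.1587 and 0.0062.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)   # standard normal distribution

# P(z < 1.27): area to the left of 1.27.
p_below = z.cdf(1.27)

# P(z > -1.00): area to the right is 1 minus the area to the left.
p_above = 1 - z.cdf(-1.00)

# P(-2.50 < z < -1.00): difference of two left-tail areas.
p_between = z.cdf(-1.00) - z.cdf(-2.50)

print(round(p_below, 4), round(p_above, 4), round(p_between, 4))
```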
$P(a < z < b)$ denotes the probability that the z score is between a and b.
$P(z > a)$ denotes the probability that the z score is greater than a.
$P(z < a)$ denotes the probability that the z score is less than a.
Finding the 95th Percentile: with an area of 5% (or 0.05) to its right, the z score is 1.645 (the z score will be positive).
Using the same bone density test, find the bone density score that separates the bottom 2.5% and the score that separates the top 2.5%.
For the standard normal distribution, a critical value is a z score separating unlikely values from those that are likely to occur.
Notation:
The expression zα denotes the z score with an area of α to its right.
Find the value of z0.025.
The notation z0.025 is used to represent the z score with an area of 0.025 to its right.
Referring back to the bone density example,
z0.025 = 1.96.
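Going from an area back to a z score uses the inverse of the cumulative distribution. The sketch below relies on `NormalDist.inv_cdf` from the Python standard library; `critical_value` is a hypothetical helper that follows the zα notation (area α to the right).

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1


def critical_value(alpha):
    """z score with an area of alpha to its right."""
    return z.inv_cdf(1 - alpha)


percentile_95 = z.inv_cdf(0.95)      # area of 0.95 to the left
bottom_cutoff = z.inv_cdf(0.025)     # separates the bottom 2.5%
top_cutoff = critical_value(0.025)   # separates the top 2.5%

print(round(percentile_95, 3), round(bottom_cutoff, 2), round(top_cutoff, 2))
```

By symmetry of the standard normal curve, the bottom and top cutoffs differ only in sign.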
• Complete HW1 and HW2 on MLP