Data sampling and probability

DATA SAMPLING AND PROBABILITY Avjinder Singh Kaler and Kristi Mai

Multiplication Rule: Complements and Conditional Probability

Counting

Types of Sampling Methods

Summarizing Data

Statistical Graphs

Probability Distributions

Normal and Standard Normal Distribution

A conditional probability of an event is a probability obtained with the additional information that some other event has already occurred.

P(B | A) denotes the conditional probability of event B occurring, given that event A has already occurred. It can be found by dividing the probability of events A and B both occurring by the probability of event A:

$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$$

Refer to Table 4-1 to find the following:

a) If 1 of the 1000 test subjects is randomly selected, find the probability that the subject had a positive test result, given that the subject actually uses drugs. That is, find P(positive test result | subject uses drugs).

b) If 1 of the 1000 test subjects is randomly selected, find the probability that the subject actually uses drugs, given that he or she had a positive test result. That is, find P(subject uses drugs | positive test result).

Solution:

a) $$P(\text{positive test result} \mid \text{subject uses drugs}) = \frac{P(\text{subject uses drugs and had a positive test result})}{P(\text{subject uses drugs})} = \frac{44/1000}{50/1000} = \frac{44}{50} = 0.88$$

b) $$P(\text{subject uses drugs} \mid \text{positive test result}) = \frac{P(\text{subject uses drugs and had a positive test result})}{P(\text{positive test result})} = \frac{44}{134} = 0.328$$

Table 4-1 Pre-Employment Drug Screening Results

                            | Positive Test Result | Negative Test Result
Subject Uses Drugs          | 44 (True Positive)   | 6 (False Negative)
Subject Is Not a Drug User  | 90 (False Positive)  | 860 (True Negative)
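The two conditional probabilities can also be checked directly from the counts in Table 4-1. The sketch below is only an illustration: the dictionary layout and the helper name `conditional_probability` are assumptions, not part of the original slides.

```python
# Counts from Table 4-1, keyed by (actual status, test result).
counts = {
    ("uses_drugs", "positive"): 44,
    ("uses_drugs", "negative"): 6,
    ("no_drugs", "positive"): 90,
    ("no_drugs", "negative"): 860,
}


def conditional_probability(counts, event, given):
    """Return P(event | given) = count(event and given) / count(given)."""
    both = sum(n for key, n in counts.items() if event(key) and given(key))
    condition = sum(n for key, n in counts.items() if given(key))
    return both / condition


def uses_drugs(key):
    return key[0] == "uses_drugs"


def positive(key):
    return key[1] == "positive"


p_pos_given_user = conditional_probability(counts, positive, uses_drugs)
p_user_given_pos = conditional_probability(counts, uses_drugs, positive)

print(round(p_pos_given_user, 3))  # 0.88
print(round(p_user_given_pos, 3))  # 0.328
```

Note that the two results differ: P(B | A) is generally not the same as P(A | B).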

For a sequence of two events in which the first event can occur m ways and the second event can occur n ways, the events together can occur a total of m · n ways.

Example:

For a two-character code consisting of a letter followed by a digit, the number of different possible codes is 26 · 10 = 260.

The factorial symbol ! denotes the product of decreasing positive whole numbers.

For example, 4! = 4 · 3 · 2 · 1 = 24.

By special definition, 0! = 1.

n! = the number of different permutations (order counts) in which n different items can be arranged when all n of them are selected. (This factorial rule reflects the fact that the first item may be selected in n different ways, the second item may be selected in n – 1 ways, and so on.)

Example:

The number of ways that the five letters {a, b, c, d, e} can be arranged is as follows: 5! = 5 · 4 · 3 · 2 · 1 = 120
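Both the counting rule and the factorial rule can be confirmed by brute-force listing. This is a minimal sketch that assumes the letter in the two-character code is one of the 26 uppercase letters A to Z. The variable names are illustrative.

```python
import math
from itertools import permutations, product
from string import ascii_uppercase, digits

# Counting rule: a letter (26 ways) followed by a digit (10 ways).
codes = list(product(ascii_uppercase, digits))
num_codes = len(codes)

# Factorial rule: arrangements of five different letters.
letters = ["a", "b", "c", "d", "e"]
num_by_formula = math.factorial(len(letters))
num_by_listing = sum(1 for _ in permutations(letters))

print(num_codes)                       # 260
print(num_by_formula, num_by_listing)  # 120 120
print(math.factorial(0))               # 1
```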

Requirements:

1. There are n different items available. (This rule does not apply if some of the items are identical to others.)

2. We select r of the n items (without replacement).

3. We consider rearrangements of the same items to be different sequences. (The permutation of ABC is different from CBA and is counted separately.)

If the preceding requirements are satisfied, the number of permutations (or sequences) of r items selected from n available items (without replacement) is

$$_nP_r = \frac{n!}{(n - r)!}$$

If the five letters {a, b, c, d, e} are available and three of them are to be selected without replacement, the number of different permutations is as follows:

$$_nP_r = \frac{n!}{(n - r)!} = \frac{5!}{(5 - 3)!} = 60$$
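As a quick check of the permutations rule, the sketch below compares the formula with Python's built-in `math.perm` (available from Python 3.8) and with an explicit listing. The variable names are illustrative.

```python
import math
from itertools import permutations

letters = ["a", "b", "c", "d", "e"]
n, r = len(letters), 3

by_formula = math.factorial(n) // math.factorial(n - r)  # n! / (n - r)!
by_builtin = math.perm(n, r)
by_listing = sum(1 for _ in permutations(letters, r))

print(by_formula, by_builtin, by_listing)  # 60 60 60
```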

Requirements:

1. There are n items available, and some items are identical to others.

2. We select all of the n items (without replacement).

3. We consider rearrangements of distinct items to be different sequences.

If the preceding requirements are satisfied, and if there are n1 alike, n2 alike, . . . nk alike, the number of permutations (or sequences) of all items selected without replacement is

$$\frac{n!}{n_1! \, n_2! \cdots n_k!}$$

If the 10 letters {a, a, a, a, b, b, c, c, d, e} are available and all 10 of them are to be selected without replacement, the number of different permutations is as follows:

$$\frac{n!}{n_1! \, n_2! \cdots n_k!} = \frac{10!}{4! \, 2! \, 2!} = \frac{3{,}628{,}800}{24 \cdot 2 \cdot 2} = 37{,}800$$
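The same count can be reproduced by tallying how many times each letter repeats. This is a sketch only, and the function name `permutations_with_identical_items` is hypothetical.

```python
import math
from collections import Counter


def permutations_with_identical_items(items):
    """Return n! / (n1! * n2! * ... * nk!) for a list with repeated items."""
    total = math.factorial(len(items))
    for repeat_count in Counter(items).values():
        total //= math.factorial(repeat_count)  # each division is exact
    return total


letters = list("aaaabbccde")
num_permutations = permutations_with_identical_items(letters)
print(num_permutations)  # 37800
```

Letters that appear only once (d and e here) contribute a factor of 1! = 1, which is why they do not show up in the denominator of the worked example.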

Requirements:

1. There are n different items available.

2. We select r of the n items (without replacement).

3. We consider rearrangements of the same items to be the same. (The combination of ABC is the same as CBA.)

If the preceding requirements are satisfied, the number of combinations of r items selected from n different items is

$$_nC_r = \frac{n!}{(n - r)! \, r!}$$

In the Pennsylvania Match 6 Lotto, winning the jackpot requires you select six different numbers from 1 to 49. The winning numbers may be drawn in any order. Find the probability of winning if one ticket is purchased.

Number of combinations:

$$_nC_r = \frac{n!}{(n - r)! \, r!} = \frac{49!}{43! \, 6!} = 13{,}983{,}816$$

$$P(\text{winning}) = \frac{1}{13{,}983{,}816}$$
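The lottery count is easy to verify with `math.comb` (Python 3.8 or later). The sketch keeps the probability as an exact fraction. The variable names are illustrative.

```python
import math
from fractions import Fraction

n, r = 49, 6
num_combinations = math.comb(n, r)          # n! / ((n - r)! r!)
p_winning = Fraction(1, num_combinations)   # one ticket, one winning combination

print(num_combinations)  # 13983816
print(p_winning)         # 1/13983816
```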

When different orderings of the same items are to be counted separately, we have a permutation problem, but when different orderings are not to be counted separately, we have a combination problem.

Permutations are for lists (order matters) and combinations are for groups (order doesn’t matter).

Data – collections of observations, such as measurements, genders, or survey responses

Population – the complete collection of all individuals to be studied

Sample – the subcollection of the population that the data come from

Census – the collection of data from every member of the population

Statistics – the science of planning studies, designing experiments, and obtaining data, and then organizing, summarizing, analyzing, interpreting, drawing conclusions about, and presenting data

The Gallup corporation collected data from 1013 adults in the United States. Results showed that 66% of the respondents worried about identity theft.

The population consists of all 241,472,385 adults in the United States.

The sample consists of the 1013 polled adults.

The objective is to use the sample data as a basis for drawing a conclusion about the whole population.

Simple random sample – a sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.

Random sample – members from the population are selected in such a way that each individual member in the population has an equal chance of being selected.

Systematic sampling – select some starting point and then select every kth element in the population.

Convenience sampling – use results that are easy to get.

Stratified sampling – subdivide the population into at least two different subgroups that share the same characteristics, then draw a sample from each subgroup (or stratum).

Cluster sampling – divide the population area into sections (or clusters), then randomly select some of those clusters and choose all members from the selected clusters.
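The sampling methods above might be sketched in code as follows. Everything here is illustrative: the population of 100 numbered subjects, the two strata, the ten clusters, and all function names are assumptions. The systematic sampler picks its starting point at random, which is one common choice that the slides do not specify.

```python
import random


def simple_random_sample(population, n, rng):
    """Every possible sample of size n has the same chance of being chosen."""
    return rng.sample(population, n)


def systematic_sample(population, k, rng):
    """Pick a starting point among the first k elements, then every kth element."""
    start = rng.randrange(k)
    return population[start::k]


def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample from each subgroup (stratum)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every member of each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))
strata = {"low": population[:50], "high": population[50:]}
clusters = {f"block_{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 2, rng)
```

Convenience sampling is left out on purpose: it has no random mechanism to simulate.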

When working with large data sets, it is often helpful to organize and summarize data by constructing a table called a frequency distribution.

A frequency distribution shows how a data set is partitioned among all of several categories (or classes) by listing all of the categories along with the number (frequency) of data values in each of them.

It lists all categories/classes and the number of observations in each category/class.

IQ Score Frequency

50-69 2

70-89 33

90-109 35

110-129 7

130-149 1

Lower class limits are the smallest numbers that can actually belong to the different classes (here 50, 70, 90, 110, and 130).


Upper class limits are the largest numbers that can actually belong to the different classes (here 69, 89, 109, 129, and 149).


Class boundaries are the numbers used to separate classes, but without the gaps created by class limits: 49.5, 69.5, 89.5, 109.5, 129.5, 149.5.


Class midpoints are the values in the middle of the classes and can be found by adding the lower class limit to the upper class limit and dividing the sum by 2: 59.5, 79.5, 99.5, 119.5, 139.5.

$$\text{class midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2}$$


Class width is the difference between two consecutive lower class limits or two consecutive lower class boundaries. Here every class has a width of 20.
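The class boundaries, midpoints, and widths for the IQ table can be derived from the class limits alone. The sketch below assumes equally spaced classes, with each boundary placed halfway across the gap between adjacent limits. The variable names are illustrative.

```python
# (lower limit, upper limit) for each class in the IQ score table
classes = [(50, 69), (70, 89), (90, 109), (110, 129), (130, 149)]

lower_limits = [low for low, _ in classes]
upper_limits = [high for _, high in classes]

# Gap between one class's upper limit and the next class's lower limit.
gap = classes[1][0] - classes[0][1]

# Boundaries sit in the middle of each gap, plus one at each end.
boundaries = [lower_limits[0] - gap / 2] + [high + gap / 2 for high in upper_limits]

midpoints = [(low + high) / 2 for low, high in classes]

# Width: difference between consecutive lower class limits.
widths = [b - a for a, b in zip(lower_limits, lower_limits[1:])]

print(boundaries)  # [49.5, 69.5, 89.5, 109.5, 129.5, 149.5]
print(midpoints)   # [59.5, 79.5, 99.5, 119.5, 139.5]
print(widths)      # [20, 20, 20, 20]
```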

A relative frequency distribution includes the same class limits as a frequency distribution, but the frequency of each class is replaced with a relative frequency (a proportion) or a percentage frequency (a percent).

$$\text{relative frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}}$$

$$\text{percentage frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}} \times 100\%$$

IQ Score Frequency Relative Frequency

50-69 2 2.6%

70-89 33 42.3%

90-109 35 44.9%

110-129 7 9.0%

130-149 1 1.3%

Cumulative Frequencies

IQ Score Frequency Cumulative Frequency

50-69 2 2

70-89 33 35

90-109 35 70

110-129 7 77

130-149 1 78
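The relative and cumulative frequency columns can be recomputed from the raw frequencies. This is a small sketch with illustrative variable names.

```python
from itertools import accumulate

classes = ["50-69", "70-89", "90-109", "110-129", "130-149"]
frequencies = [2, 33, 35, 7, 1]

total = sum(frequencies)

# Percentage frequency = class frequency / sum of all frequencies * 100%
percentages = [round(100 * f / total, 1) for f in frequencies]

# Cumulative frequency = running total of the class frequencies
cumulative = list(accumulate(frequencies))

for row in zip(classes, frequencies, percentages, cumulative):
    print(row)
```

The last cumulative frequency always equals the total number of observations, 78 in this table.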

The frequencies start low, then increase to higher frequencies until reaching a maximum, and then decrease to low again.

The distribution is approximately symmetric, with the frequencies preceding the maximum being roughly a mirror image of those that follow the maximum.

Quantitative (numerical) data consists of numbers representing counts or measurements. The values have units and can be used arithmetically.

Quantitative data can be further described by distinguishing between discrete and continuous types.

Examples:

• The weights of supermodels

• The ages of respondents

Discrete data – the number of possible values is either a finite number or a “countable” number (i.e. the number of possible values is 0, 1, 2, 3, . . .).

Example: The number of eggs that a hen lays

Continuous data – infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions, or jumps

Example: The amount of milk that a cow produces, e.g. 2.343115 gallons per day

Categorical (qualitative) data consists of names or labels (representing categories).

Example:

• The gender (male/female) of professional athletes.

• Shirt numbers on professional athletes' uniforms, which are substitutes for names.

Bar graph:

• Uses bars of equal width to show frequencies of categorical, or qualitative, data

• Vertical scale represents frequencies or relative frequencies.

• Horizontal scale identifies the different categories of qualitative data.

A multiple bar graph has two or more sets of bars and is used to compare two or more data sets.

Pareto chart – a bar graph for qualitative data, with the bars arranged in descending order according to frequencies

Pie chart – a graph depicting qualitative data as slices of a circle, in which the size of each slice is proportional to the frequency count

Random variable – a variable (typically represented by x) that has a single numerical value, determined by chance, for each outcome of a given procedure. It can be discrete or continuous, just like data.

Discrete random variable – has either a finite number of values or a countable number of values, where “countable” refers to the fact that there might be infinitely many values, but they result from a counting process.

Continuous random variable – has infinitely many values, and those values can be associated with measurements on a continuous scale without gaps or interruptions.

Probability distribution – a description that gives the probability for each value of the random variable, often expressed in the format of a graph, table, or formula.

Note: If a probability is very small, it is represented as 0+ in tables (i.e. it is very small, yet positive).

1. There is a numerical random variable x and its values are associated with corresponding probabilities.

2. The sum of all probabilities must be 1: $\sum P(x) = 1$.

3. Each probability value must be between 0 and 1 inclusive: $0 \le P(x) \le 1$.

The probability histogram is very similar to a relative frequency histogram, but the vertical scale shows probabilities.

According to the range rule of thumb, most values should lie within 2 standard deviations of the mean.

We can therefore identify “unusual” values by determining if they lie outside these limits:

Maximum usual value = μ + 2σ

Minimum usual value = μ − 2σ

We found that for families with two children, the mean number of girls is 1.0 and the standard deviation is 0.7 girls.

Use those values to find the maximum and minimum usual values for the number of girls.

Solution:

maximum usual value = μ + 2σ = 1.0 + 2(0.7) = 2.4

minimum usual value = μ − 2σ = 1.0 − 2(0.7) = −0.4
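The sketch below checks the requirements for a probability distribution and applies the range rule of thumb. The distribution used for the number of girls in two births (0.25, 0.50, 0.25) is an assumption, namely that each birth is equally likely to be a girl or a boy. It is consistent with the mean of 1.0 and standard deviation of about 0.7 quoted above. The function names are hypothetical.

```python
import math


def is_valid_distribution(dist, tol=1e-9):
    """Each P(x) lies in [0, 1] and the probabilities sum to 1."""
    probabilities = dist.values()
    return (all(0 <= p <= 1 for p in probabilities)
            and abs(sum(probabilities) - 1) <= tol)


def mean_and_sd(dist):
    """Mean and standard deviation of a discrete probability distribution."""
    mu = sum(x * p for x, p in dist.items())
    variance = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, math.sqrt(variance)


# Assumed distribution: number of girls among two children.
girls = {0: 0.25, 1: 0.50, 2: 0.25}

valid = is_valid_distribution(girls)
mu, sigma = mean_and_sd(girls)
max_usual = mu + 2 * sigma
min_usual = mu - 2 * sigma

print(valid)                                      # True
print(round(mu, 1), round(sigma, 1))              # 1.0 0.7
print(round(max_usual, 1), round(min_usual, 1))   # 2.4 -0.4
```

Since every possible value (0, 1, or 2 girls) lies between −0.4 and 2.4, none of them counts as unusual under this rule.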

Rare Event Rule for Inferential Statistics

If, under a given assumption (such as the assumption that a coin is fair), the probability of a particular observed event (such as 992 heads in 1000 tosses of a coin) is extremely small, we conclude that the assumption is probably not correct.

Using Probabilities to Determine When Results Are Unusual

Unusually high number of successes: x successes among n trials is an unusually high number of successes if P(x or more) ≤ 0.05.

Unusually low number of successes: x successes among n trials is an unusually low number of successes if P(x or fewer) ≤ 0.05.
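One way to evaluate P(x or more) and P(x or fewer) is with a binomial model. The sketch assumes independent trials with a fixed success probability p, which the slides do not state explicitly. The function names are hypothetical. Exact fractions are used so that very small tail probabilities, such as the coin example, do not underflow.

```python
import math
from fractions import Fraction


def binomial_pmf(k, n, p):
    """Exact P(exactly k successes in n trials) for success probability p."""
    p = Fraction(p)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_x_or_more(x, n, p):
    return sum(binomial_pmf(k, n, p) for k in range(x, n + 1))


def prob_x_or_fewer(x, n, p):
    return sum(binomial_pmf(k, n, p) for k in range(0, x + 1))


fair = Fraction(1, 2)

# 992 or more heads in 1000 tosses of a fair coin: unusually high?
p_992 = prob_x_or_more(992, 1000, fair)
print(p_992 <= 0.05)   # True

# Smaller illustration: 8 or more versus 9 or more heads in 10 tosses.
p_8_of_10 = prob_x_or_more(8, 10, fair)
p_9_of_10 = prob_x_or_more(9, 10, fair)
print(float(p_8_of_10) <= 0.05)  # False
print(float(p_9_of_10) <= 0.05)  # True
```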

A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

1. The total area under the curve must equal 1.

2. Every point on the curve must have a vertical height that is 0 or greater. (That is, the curve cannot fall below the x-axis.)

Because the total area under the density curve is equal to 1, there is a correspondence between area and probability.

A continuous random variable has a uniform distribution if its values are spread evenly over the range of possible values. The graph of a uniform distribution results in a rectangular shape.

Given the uniform distribution illustrated, find the probability that a randomly selected voltage level is greater than 124.5 volts.

[Figure: the shaded area represents voltage levels greater than 124.5 volts.]

A continuous random variable has a normal distribution if it has a graph that is symmetric and bell-shaped and if the random variable can be described by the following equation:

$$f(x) = \frac{e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}}{\sigma\sqrt{2\pi}}$$

The standard normal distribution is a normal probability distribution with μ = 0 and σ = 1. The total area under its density curve is equal to 1.

A z-score represents how much a given value, x, deviates/varies from the center of a set of data.

This value can help to assess how “extreme” a particular data value is based on the distribution the value is supposed to follow.

This score can also be used to convert sample data (sample statistics) to a measure of relative standing so that we can compare samples to one another.

Basic “Idea” Behind Formulas for Z-Scores:

$$Z = \frac{\text{the value} - \text{the center of the data}}{\text{a deviation measure}}$$

If the z-score is positive (+), the specific value falls above the center value.

If the z-score is negative (-), the specific value falls below the center value.

“Usual” values have z-scores between -2 and 2.

“Unusual” values have z-scores less than -2 or greater than 2.
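The z-score idea and the "usual" range can be put into a couple of lines of code. The sketch reuses the earlier two-children example (mean 1.0, standard deviation 0.7) as input. The function names are hypothetical.

```python
def z_score(value, center, deviation):
    """(the value - the center of the data) / a deviation measure."""
    return (value - center) / deviation


def is_unusual(z):
    """Unusual values have z-scores less than -2 or greater than 2."""
    return z < -2 or z > 2


# Two girls in a family of two children, with mean 1.0 and SD 0.7.
z = z_score(2, 1.0, 0.7)
print(round(z, 2))    # 1.43
print(is_unusual(z))  # False
```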

We can find areas (probabilities) for different regions under a normal model using StatCrunch.

A bone mineral density test can be helpful in identifying the presence of osteoporosis.

The result of the test is commonly measured as a z score, which has a normal distribution with a mean of 0 and a standard deviation of 1.

A randomly selected adult undergoes a bone density test.

Find the probability that the result is a reading less than 1.27.

The probability of a randomly selected adult having a bone density reading less than 1.27 is 0.8980.

$$P(z < 1.27) = 0.8980$$

Using the same bone density test, find the probability that a randomly selected person has a result above –1.00 (which is considered to be in the “normal” range of bone density readings).

The probability of a randomly selected adult having a bone density above –1 is 0.8413.

A bone density reading between –1.00 and –2.50 indicates the subject has osteopenia. Find this probability.

The probability of a randomly selected adult having osteopenia is 0.1525.
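The slides use StatCrunch for these areas. As an alternative sketch, the same three probabilities can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The first two agree with 0.8980 and 0.8413 to four decimal places. The third comes out near 0.1524, one unit off in the last digit from the 0.1525 obtained by subtracting rounded table values.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# P(z < 1.27): area to the left of 1.27
p_below_127 = standard_normal.cdf(1.27)

# P(z > -1.00): area to the right of -1.00
p_above_neg1 = 1 - standard_normal.cdf(-1.00)

# P(-2.50 < z < -1.00): area between the two readings (osteopenia)
p_osteopenia = standard_normal.cdf(-1.00) - standard_normal.cdf(-2.50)

print(f"{p_below_127:.4f}")
print(f"{p_above_neg1:.4f}")
print(f"{p_osteopenia:.4f}")
```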

P(a < z < b) denotes the probability that the z score is between a and b.

P(z > a) denotes the probability that the z score is greater than a.

P(z < a) denotes the probability that the z score is less than a.

Finding the 95th Percentile

The 95th percentile is the z score with an area of 5% (0.05) to its right: z = 1.645 (the z score will be positive).

Using the same bone density test, find the bone density score that separates the bottom 2.5% and the score that separates the top 2.5%.

For the standard normal distribution, a critical value is a z score separating unlikely values from those that are likely to occur.

Notation:

The expression zα denotes the z score with an area of α to its right.

Find the value of z0.025.

The notation z0.025 is used to represent the z score with an area of 0.025 to its right.

Referring back to the bone density example,

z0.025 = 1.96.
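Percentiles and critical values go the other way: from an area back to a z score. A minimal sketch using `NormalDist.inv_cdf` from the Python standard library is shown below. The helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def critical_value(alpha):
    """z score with an area of alpha to its right under the standard normal curve."""
    return standard_normal.inv_cdf(1 - alpha)


z_95th_percentile = critical_value(0.05)       # 5% of the area to the right
z_top_cutoff = critical_value(0.025)           # separates the top 2.5%
z_bottom_cutoff = standard_normal.inv_cdf(0.025)  # separates the bottom 2.5%

print(f"{z_95th_percentile:.3f}")
print(f"{z_top_cutoff:.2f}")
print(f"{z_bottom_cutoff:.2f}")
```

By symmetry of the standard normal curve, the bottom cutoff is the negative of the top cutoff.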

• Complete HW1 and HW2 on MLP
