Some definitions In Statistics. A sample: Is a subset of the population.
Some definitions
In Statistics
A sample:
Is a subset of the population
In statistics:
One draws conclusions about the population based on data collected from a sample
Reasons:
Cost
It is less costly to collect data from a sample than from the entire population
Accuracy
Data from a sample sometimes leads to more accurate conclusions than data from the entire population
Costs saved from using a sample can be directed to obtaining more accurate observations on each case in the sample
Types of Samples
Different types of samples are determined by how the sample is selected.
Convenience Samples
In a convenience sample the subjects that are most convenient to the researcher are selected as objects in the sample.
This is not a very good procedure for inferential Statistical Analysis but is useful for exploratory preliminary work.
Quota samples
In quota samples subjects are chosen conveniently until quotas are met for different subgroups of the population.
This also is useful for exploratory preliminary work.
Random Samples
Random samples of a given size are selected in such a way that all possible samples of that size have the same probability of being selected.
Convenience samples and quota samples are useful for preliminary studies. It is, however, difficult to assess the accuracy of estimates based on these sampling schemes.
Sometimes, however, one has to be satisfied with a convenience sample and assume that it is equivalent to a random sample.
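The definition of a random sample can be illustrated with a short sketch. The population of 100 numbered cases, the sample size of 10 and the helper name simple_random_sample are assumptions made for this illustration only; random.sample from the Python standard library draws without replacement so that every subset of the requested size is equally likely.

```python
import random

def simple_random_sample(population, size, seed=None):
    """Draw `size` distinct cases; every subset of that size is equally likely."""
    rng = random.Random(seed)          # a seed only makes the draw repeatable
    return rng.sample(population, size)

# Hypothetical population: cases numbered 1 to 100
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=1)
print(sorted(sample))
```

A convenience sample, by contrast, would correspond to something like population[:10], where the cases that are easiest to reach are always the ones selected.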
[Figure: diagram labelled Population, Sample, Case and Variables (X, Y, Z)]
Some other definitions
A population statistic (parameter):
Any quantity computed from the values of variables for the entire population.
A sample statistic:
Any quantity computed from the values of variables for the cases in the sample.
Since only cases from the sample are observed
– only sample statistics are computed
– These are used to make inferences about population statistics
– It is important to be able to assess the accuracy of these inferences
To download lectures
1. Go to the stats 244 web site
a) Through PAWS or
b) by going to the website of the department of Mathematics and Statistics -> people -> faculty -> W.H. Laverty -> Stats 244 -> Lectures.
2. Then
a) select the lecture
b) Right click and choose Save as
To print lectures
1. Open the lecture using MS Powerpoint
2. Select the menu item File -> Print
The Print dialogue box appears.
In the Print what box, select handouts
Set Slides per page to 6 or 3.
6 slides per page will result in the least amount of paper being printed
[Figure: handout layout with slides 1 to 6 in three rows of two]
3 slides per page leaves room for notes.
[Figure: handout layout with slides 1 to 3 in a single column]
Organizing and describing Data
Techniques for continuous variables
The Grouped frequency table
The Histogram
To Construct
• A Grouped frequency table
• A Histogram
1. Find the maximum and minimum of the observations.
2. Choose non-overlapping intervals of equal width (The Class Intervals) that cover the range between the maximum and the minimum.
3. The endpoints of the intervals are called the class boundaries.
4. Count the number of observations in each interval (The cell frequency - f).
5. Calculate the relative frequency: relative frequency = f/N
Data Set #3
The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program

         Verbal  Math  Initial Reading  Final Reading
Student  IQ      IQ    Achievement      Achievement
1        86      94    1.1              1.7
2        104     103   1.5              1.7
3        86      92    1.5              1.9
4        105     100   2.0              2.0
5        118     115   1.9              3.5
6        96      102   1.4              2.4
7        90      87    1.5              1.8
8        95      100   1.4              2.0
9        105     96    1.7              1.7
10       84      80    1.6              1.7
11       94      87    1.6              1.7
12       119     116   1.7              3.1
13       82      91    1.2              1.8
14       80      93    1.0              1.7
15       109     124   1.8              2.5
16       111     119   1.4              3.0
17       89      94    1.6              1.8
18       99      117   1.6              2.6
19       94      93    1.4              1.4
20       99      110   1.4              2.0
21       95      97    1.5              1.3
22       102     104   1.7              3.1
23       102     93    1.6              1.9
Grouped frequency table

Class interval  Verbal IQ  Math IQ
70 to 80        1          1
80 to 90        6          2
90 to 100       7          11
100 to 110      6          4
110 to 120      3          4
120 to 130      0          1
In this example the upper endpoint is included in the interval. The lower endpoint is not.
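The construction steps can be sketched in code. This is a minimal illustration, not the lecture's own procedure: the function name grouped_frequency and its parameters are chosen here, and the intervals are taken as (lower, upper] to match the endpoint rule stated above. The data are the Verbal IQ scores from Data Set #3.

```python
import math

verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]

def grouped_frequency(values, start, width, n_classes):
    """Count observations in intervals (lower, upper], upper endpoint included."""
    counts = [0] * n_classes
    for v in values:
        # ceil sends a value lying exactly on a boundary to the interval below it
        k = math.ceil((v - start) / width) - 1
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

counts = grouped_frequency(verbal_iq, start=70, width=10, n_classes=6)
n = len(verbal_iq)
for k, f in enumerate(counts):
    lower = 70 + 10 * k
    print(f"{lower} to {lower + 10}: f = {f}, relative frequency = {f / n:.3f}")
```

The cell frequencies come out as 1, 6, 7, 6, 3, 0, which agrees with the Verbal IQ column of the grouped frequency table.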
Histogram – Verbal IQ
[Figure: histogram of frequency (axis 0 to 8) over the class intervals 70 to 80, 80 to 90, 90 to 100, 100 to 110, 110 to 120, 120 to 130]
Histogram – Math IQ
[Figure: histogram of frequency (axis 0 to 12) over the class intervals 70 to 80, 80 to 90, 90 to 100, 100 to 110, 110 to 120, 120 to 130]
Example
• In this example we are comparing (for two drugs A and B) the time to metabolize the drug.
• 120 cases were given drug A.
• 120 cases were given drug B.
• Data on time to metabolize each drug is given on the next two slides
Drug A
22.6  17.8  18.8  10.5  6.5   11.8
31.5  6.3   7.2   3.5   4.7   5.1
7.2   11.4  12.9  12.7  5.3   18.0
13.0  6.4   6.3   20.1  7.4   4.1
11.2  8.1   13.6  25.3  2.5   9.0
6.4   5.7   4.3   11.2  18.7  6.5
4.8   3.2   7.5   2.0   5.6   15.4
3.5   13.4  14.1  1.8   2.3   3.9
11.9  7.8   21.9  22.0  7.9   4.8
4.1   16.8  7.4   5.1   6.8   6.3
6.7   9.0   8.8   20.1  12.3  4.3
6.7   8.9   10.5  7.0   10.1  17.4
6.0   10.5  12.6  6.0   14.9  11.3
7.7   13.1  14.9  8.0   19.2  2.7
11.7  6.4   6.2   6.0   10.8  30.0
11.7  21.9  2.9   3.8   9.3   3.1
8.5   6.3   5.2   13.6  14.9  10.9
30.0  6.2   3.8   8.5   11.8  3.3
7.2   5.4   9.7   9.8   12.7  28.3
10.0  17.2  19.6  33.5  1.5   6.4
Drug B
4.2   12.8  3.2   7.8   3.2   8.8
10.4  5.4   5.0   5.1   5.1   14.1
8.2   6.0   4.9   5.9   17.0  2.5
13.4  4.3   2.7   10.3  20.9  15.3
10.5  6.0   14.3  12.4  8.1   5.2
5.6   7.3   9.6   4.7   4.8   7.8
19.0  5.9   10.6  6.3   9.3   11.4
4.5   10.2  2.8   9.4   24.1  9.2
25.9  10.4  12.9  4.5   2.6   10.6
3.2   2.7   4.2   3.3   13.7  3.7
5.5   4.6   2.7   7.5   5.1   5.0
7.8   3.5   5.4   12.6  8.8   8.5
6.0   2.9   4.4   4.1   5.0   12.1
5.3   3.0   5.7   3.0   9.7   8.5
4.8   4.6   7.7   4.8   4.1   6.9
10.8  13.4  5.8   5.3   7.7   12.1
5.4   8.3   4.1   9.3   8.3   8.0
25.2  2.9   11.5  8.8   5.9   4.1
6.6   15.1  12.3  10.9  6.0   2.3
5.1   4.0   5.1   7.4   16.0  2.8
Grouped frequency tables

Class interval  Drug A  Drug B
0 to 4          15      19
4 to 8          43      54
8 to 12         26      26
12 to 16        15      15
16 to 20        9       2
20 to 24        6       1
24 to 28        1       3
28 to 32        4       0
32 to 36        1       0
36 to 40        0       0
40 to 44        0       0
44 to 48        0       0
Histogram – drug A (time to metabolize)
[Figure: histogram with frequency axis 0 to 60]
Histogram – drug B (time to metabolize)
[Figure: histogram with frequency axis 0 to 60]
Some comments about histograms
• The width of the class intervals should be chosen so that the number of intervals with a frequency less than 5 is small.
• This means that the width of the class intervals can decrease as the sample size increases
• If the width of the class intervals is too small, the frequency in each interval will be either 0 or 1, and the histogram becomes a scatter of isolated bars of height 1.
• If the width of the class intervals is too large, one class interval will contain all of the observations, and the histogram becomes a single bar.
• Ideally one wants the histogram to appear as seen below.
• This will be achieved by making the width of the class intervals as small as possible while only allowing a few intervals to have a frequency less than 5.
[Figure: histogram with class intervals of width 5 running from 60 to 155; frequency axis 0 to 80]
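The effect of the class width can be checked numerically. The sketch below is an illustration under assumptions: it reuses the Verbal IQ scores from Data Set #3, starts the intervals at 70, and tries widths of 1, 10 and 60, none of which are prescribed by the lecture. The helper name class_frequencies is also chosen here.

```python
from collections import Counter
import math

verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]

def class_frequencies(values, start, width):
    """Frequency of each non-empty interval (lower, lower + width], keyed by lower endpoint."""
    return Counter(start + width * (math.ceil((v - start) / width) - 1)
                   for v in values)

for width in (1, 10, 60):
    freq = class_frequencies(verbal_iq, 70, width)
    small = sum(1 for f in freq.values() if f < 5)
    print(f"width {width}: {len(freq)} non-empty intervals, "
          f"largest frequency {max(freq.values())}, "
          f"{small} with frequency below 5")
```

With only 23 observations, width 1 leaves every interval nearly empty and width 60 puts all observations in a single interval; width 10 reproduces the grouped frequency table for Verbal IQ.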
• As the sample size increases the histogram will approach a smooth curve.
• This is the histogram of the population
[Figure: histogram with class intervals of width 5 running from 60 to 155; frequency axis 0 to 80]
N = 25
[Figure: histogram with class intervals of width 10 running from 60 to 150; frequency axis 0 to 10]
N = 100
[Figure: histogram with class intervals of width 10 running from 60 to 150; frequency axis 0 to 30]
N = 500
[Figure: histogram with class intervals of width 5 running from 60 to 155; frequency axis 0 to 80]
N = 2000
[Figure: histogram with class intervals of width 2 (axis labels from 62 - 64 to 142 - 144); frequency axis 0 to 140]
N = ∞
[Figure: smooth curve over the range 50 to 150; vertical axis 0 to 0.03]
Comment: the proportion of area under a histogram between two points estimates the proportion of cases in the sample (and the population) between those two values.
Example: The following histogram displays the birth weight (in kg) of n = 100 births
[Figure: histogram with 14 class intervals with boundaries 0.085, 0.113, 0.142, 0.17, 0.198, 0.227, 0.255, 0.283, 0.312, 0.34, 0.369, 0.397, 0.425, 0.454, 0.482; frequency axis 0 to 25; the bar labels, read left to right, are 1, 1, 3, 10, 11, 19, 17, 20, 12, 4, 1, 1]
Find the proportion of births that have a birthweight less than 0.34 kg.
Proportion = (1+1+3+10+11+19+17)/100 = 0.62
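The same calculation as a short sketch. The split of the bar frequencies into those below 0.34 kg and the rest is taken from the calculation above and from the bar labels of the histogram; the variable names are chosen for this illustration. Because the class intervals have equal width, the proportion of area equals the proportion of frequency.

```python
# Bar frequencies read left to right from the histogram
below_034 = [1, 1, 3, 10, 11, 19, 17]   # bars lying below 0.34 kg
remaining = [20, 12, 4, 1, 1]           # bars at 0.34 kg and above

n = sum(below_034) + sum(remaining)     # total number of births
proportion = sum(below_034) / n
print(n, proportion)
```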
The Characteristics of a Histogram
• Central Location (average)
• Spread (Variability, Dispersion)
• Shape
Central Location
[Figure: distribution curve over the range 0 to 25]
Spread, Dispersion, Variability
[Figure: distribution curve over the range 0 to 25]
Shape – Bell Shaped (Normal)
[Figure: bell-shaped curve over the range 0 to 25]
Shape – Positively skewed
[Figure: curve over the range 0 to 25 with a long right tail]
Shape – Negatively skewed
[Figure: curve over the range 0 to 25 with a long left tail]
Shape – Platykurtic
[Figure: curve over the range -3 to 3, flatter than the normal curve]
Shape – Leptokurtic
[Figure: curve over the range -3 to 3, more peaked than the normal curve]
Shape – Bimodal
[Figure: curve over the range -3 to 3 with two peaks]
The Stem-Leaf Plot
An alternative to the histogram
Each number in a data set can be broken into two parts
– A stem
– A Leaf
Example
Verbal IQ = 84
– Stem = tens digit = 8
– Leaf = units digit = 4
Example
Verbal IQ = 104
– Stem = tens digits = 10
– Leaf = units digit = 4
To Construct a Stem-Leaf diagram
• Make a vertical list of “all” stems
• Then behind each stem make a horizontal list of the leaves belonging to that stem
Example
The data on N = 23 students
Variables
• Verbal IQ
• Math IQ
• Initial Reading Achievement Score
• Final Reading Achievement Score
We now construct a stem-leaf diagram of Verbal IQ
A vertical list of the stems
8
9
10
11
12
We now list the leaves behind each stem. The 23 Verbal IQ values are:
86 104 86 105 118 96 90 95 105 84
94 119 82 80 109 111 89 99 94 99
95 102 102
Writing each leaf behind its stem, in the order the values appear, gives:
8 6 6 4 2 0 9
9 6 0 5 4 9 4 9 5
10 4 5 5 9 2 2
11 8 9 1
12
The leaves may be arranged in order
8 0 2 4 6 6 9
9 0 4 4 5 5 6 9 9
10 2 2 4 5 5 9
11 1 8 9
12
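The construction can be sketched as follows. This is a minimal illustration for whole-number data; the function name stem_leaf is chosen here, and stems with no observations (such as 12 above) are simply not listed by this version.

```python
from collections import defaultdict

verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]

def stem_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (units digits)."""
    diagram = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)     # 104 -> stem 10, leaf 4
        diagram[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(diagram.items())}

diagram = stem_leaf(verbal_iq)
for stem, leaves in diagram.items():
    print(f"{stem:>2} " + " ".join(str(leaf) for leaf in leaves))
```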
The stem-leaf diagram is equivalent to a histogram
Rotating the stem-leaf diagram so that the stems 80, 90, 100, 110, 120 run along the horizontal axis, each row of leaves becomes a bar whose height is the frequency for that stem.
[Figure: the rotated stem-leaf diagram of Verbal IQ over the axis 80, 90, 100, 110, 120]
The two-part stem-leaf diagram
Sometimes you want to break the stems into two parts
* for leaves 0,1,2,3,4
(no symbol) for leaves 5,6,7,8,9
Stem-leaf diagram for Initial Reading Achievement
1. 0124444455556666677789
2. 0
This diagram as it stands does not give an accurate picture of the distribution
We try breaking the stems into two parts
1.* 01244444
1. 55556666677789
2.* 0
2.
The five-part stem-leaf diagram
If the two-part stem-leaf diagram is not adequate you can break the stems into five parts
* for leaves 0,1
t for leaves 2,3
f for leaves 4,5
s for leaves 6,7
(no symbol) for leaves 8,9
We try breaking the stems into five parts
1.* 01
1.t 2
1.f 444445555
1.s 66666777
1. 89
2.* 0
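Splitting the stems can be sketched in code as well. This is an illustration under assumptions: the data are the Initial Reading Achievement scores from the Data Set #3 table, the function name split_stem_leaf is chosen here, and rows are identified by a pair (stem, part) rather than by the row symbols used on the slides.

```python
initial_ra = [1.1, 1.5, 1.5, 2.0, 1.9, 1.4, 1.5, 1.4, 1.7, 1.6, 1.6, 1.7,
              1.2, 1.0, 1.8, 1.4, 1.6, 1.6, 1.4, 1.4, 1.5, 1.7, 1.6]

def split_stem_leaf(values, parts):
    """Stem-leaf diagram for one-decimal data, each stem split into `parts` rows.

    Rows are keyed by (stem, part); use parts = 1, 2 or 5 so that the ten
    possible leaves divide evenly among the rows of a stem.
    """
    per_row = 10 // parts
    rows = {}
    for v in values:
        stem, leaf = divmod(round(v * 10), 10)   # 1.4 -> stem 1, leaf 4
        rows.setdefault((stem, leaf // per_row), []).append(leaf)
    return {key: sorted(leaves) for key, leaves in sorted(rows.items())}

for (stem, part), leaves in split_stem_leaf(initial_ra, 5).items():
    print(stem, part, "".join(map(str, leaves)))
```

Calling split_stem_leaf(initial_ra, 2) gives the two-part version, with part 0 holding leaves 0 to 4 and part 1 holding leaves 5 to 9.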
Stem leaf Diagrams
Verbal IQ, Math IQ, Initial Reading Achievement (RA), Final Reading Achievement (RA)
Some Conclusions
• Math IQ, Verbal IQ seem to have approximately the same distribution
• “bell shaped” centered about 100
• Final RA seems to be larger than initial RA and more spread out
• Improvement in RA
• Amount of improvement quite variable
Numerical Measures
• Measures of Central Tendency (Location)
• Measures of Non Central Location
• Measure of Variability (Dispersion, Spread)
• Measures of Shape
Measures of Central Tendency (Location)
• Mean
• Median
• Mode
[Figure: Central Location, shown on a distribution curve over the range 0 to 25]
Measures of Non-central Location
• Quartiles, Mid-Hinges
• Percentiles
[Figure: Non-central Location, shown on a distribution curve over the range 0 to 25]
Measure of Variability (Dispersion, Spread)
• Variance, standard deviation
• Range
• Inter-Quartile Range
[Figure: Variability, shown on a distribution curve over the range 0 to 25]
Measures of Shape
• Skewness
• Kurtosis
[Figures: example distribution curves illustrating skewness and kurtosis]
Measures of Central Location (Mean)
Summation Notation
Let x1, x2, x3, … xn denote a set of n numbers.
Then the symbol
\[ \sum_{i=1}^{n} x_i \]
denotes the sum of these n numbers
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n \]
Example
Let x1, x2, x3, x4, x5 denote the set of 5 numbers in the following table.
i   1   2   3   4   5
xi  10  15  21  7   13
Then the symbol
\[ \sum_{i=1}^{5} x_i \]
denotes the sum of these 5 numbers
\[ \sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 10 + 15 + 21 + 7 + 13 = 66 \]
Meaning of parts of summation notation
\[ \sum_{i=m}^{n} (\text{expression in } i) \]
• i: the quantity changing in each term of the sum
• m: starting value for i
• n: final value for i
• expression in i: each term of the sum
Example
Again let x1, x2, x3, x4, x5 denote the set of 5 numbers in the following table.
i   1   2   3   4   5
xi  10  15  21  7   13
Then the symbol
\[ \sum_{i=2}^{4} x_i^3 \]
denotes the sum of these 3 numbers
\[ \sum_{i=2}^{4} x_i^3 = x_2^3 + x_3^3 + x_4^3 = 15^3 + 21^3 + 7^3 = 3375 + 9261 + 343 = 12979 \]
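Summation notation translates directly into a loop over the index i. In the sketch below the function name summation and the use of a dictionary to hold x1 to x5 are choices made for illustration; the starting value m and the final value n are both included in the sum.

```python
x = {1: 10, 2: 15, 3: 21, 4: 7, 5: 13}   # x[i] is the i-th number in the table

def summation(term, m, n):
    """Sum term(i) for i = m, m + 1, ..., n (both limits included)."""
    return sum(term(i) for i in range(m, n + 1))

total = summation(lambda i: x[i], 1, 5)        # x1 + x2 + x3 + x4 + x5
cubes = summation(lambda i: x[i] ** 3, 2, 4)   # x2^3 + x3^3 + x4^3
print(total, cubes)
```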
Mean
Let x1, x2, x3, … xn denote a set of n numbers.
Then the mean of the n numbers is defined as:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_{n-1} + x_n}{n} \]
Example
Again let x1, x2, x3, x4, x5 denote the set of 5 numbers in the following table.
i   1   2   3   4   5
xi  10  15  21  7   13
Then the mean of the 5 numbers is:
\[ \bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{10 + 15 + 21 + 7 + 13}{5} = \frac{66}{5} = 13.2 \]
Interpretation of the Mean
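Let x1, x2, x3, … xn denote a set of n numbers.
Then the mean, $\bar{x}$, is the centre of gravity of those n numbers.
That is, if we drew a horizontal line and placed a weight of one at each value of xi, then the balancing point of that system of masses is at the point $\bar{x}$.
[Figure: number line with unit weights at x1, x2, x3, x4, …, xn, balancing at $\bar{x}$]
In the Example
[Figure: number line from 0 to 20 with unit weights at 7, 10, 13, 15 and 21, balancing at $\bar{x}$ = 13.2]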
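The balancing-point interpretation can be checked numerically: with unit weights, the deviations from the mean on the left and on the right cancel. A minimal sketch using the five numbers of the example (the function name mean is chosen here):

```python
x = [10, 15, 21, 7, 13]

def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

x_bar = mean(x)
# With a weight of one at each value, the signed distances from the mean sum to zero
balance = sum(v - x_bar for v in x)
print(x_bar, round(balance, 10))
```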
The mean, $\bar{x}$, is also approximately the center of gravity of a histogram
[Figure: histogram with class intervals of width 10 running from 60 to 150 and frequency axis 0 to 30, with $\bar{x}$ marked at the balancing point]
The Median
Let x1, x2, x3, … xn denote a set of n numbers.
Then the median of the n numbers is defined as the number that splits the numbers into two equal parts.
To evaluate the median we arrange the numbers in increasing order.
If the number of observations is odd there will be one observation in the middle.
This number is the median.
If the number of observations is even there will be two middle observations.
The median is the average of these two observations
Example
Again let x1, x2, x3, x4, x5 denote the set of 5 numbers in the following table.
i   1   2   3   4   5
xi  10  15  21  7   13
The numbers arranged in order are:
7 10 13 15 21
Unique “middle” observation: 13, the median
Example 2
Let x1, x2, x3, x4, x5, x6 denote the 6 numbers:
23 41 12 19 64 8
Arranged in increasing order these observations would be:
8 12 19 23 41 64
Two “middle” observations: 19 and 23
Median = average of the two “middle” observations
\[ \text{Median} = \frac{19 + 23}{2} = \frac{42}{2} = 21 \]
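The two cases (odd and even number of observations) can be sketched as a small function. The function name median is chosen for this illustration; it follows the rule stated above, not any particular library's implementation.

```python
def median(values):
    """Middle observation if n is odd, average of the two middle ones if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([10, 15, 21, 7, 13]))       # odd number of observations
print(median([23, 41, 12, 19, 64, 8]))   # even number of observations
```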
Example
The data on N = 23 students
Variables
• Verbal IQ
• Math IQ
• Initial Reading Achievement Score
• Final Reading Achievement Score
Data Set #3
Totals and means of the four variables for the 23 students

       Verbal IQ  Math IQ  Initial Reading Achievement  Final Reading Achievement
Total  2244       2307     35.1                         48.3
Mean   97.57      100.30   1.526                        2.100
Computing the Median
Stem leaf Diagrams
Median = middle observation = 12th observation (since N = 23)
Summary

        Verbal IQ  Math IQ  Initial Reading Achievement  Final Reading Achievement
Mean    97.57      100.30   1.526                        2.100
Median  96         97       1.5                          1.9
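The summary values can be reproduced from the Data Set #3 table. The sketch below uses the mean and median functions of the Python standard library's statistics module; the dictionary layout and variable names are choices made for this illustration.

```python
from statistics import mean, median

data = {
    "Verbal IQ": [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
                  82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102],
    "Math IQ": [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
                91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93],
    "Initial Reading Achievement": [1.1, 1.5, 1.5, 2.0, 1.9, 1.4, 1.5, 1.4,
                                    1.7, 1.6, 1.6, 1.7, 1.2, 1.0, 1.8, 1.4,
                                    1.6, 1.6, 1.4, 1.4, 1.5, 1.7, 1.6],
    "Final Reading Achievement": [1.7, 1.7, 1.9, 2.0, 3.5, 2.4, 1.8, 2.0,
                                  1.7, 1.7, 1.7, 3.1, 1.8, 1.7, 2.5, 3.0,
                                  1.8, 2.6, 1.4, 2.0, 1.3, 3.1, 1.9],
}

for name, values in data.items():
    # With 23 observations the median is the 12th value in increasing order
    print(f"{name}: total = {sum(values):.1f}, "
          f"mean = {mean(values):.3f}, median = {median(values)}")
```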
Some Comments
• The mean is the centre of gravity of a set of observations: the balancing point.
• The median splits the observations into two parts, each containing approximately 50% of them.
• The median splits the area under a histogram into two parts of 50% each.
• The mean is the balancing point of a histogram.
[Figure: skewed distribution over the range 0 to 25; the median splits the area into two parts of 50%, and the mean $\bar{x}$ is marked separately]
• For symmetric distributions the mean and the median will be approximately the same value
[Figure: symmetric distribution over the range 0 to 25 with the median and $\bar{x}$ at the same point and 50% of the area on each side]
• For positively skewed distributions the mean exceeds the median
• For negatively skewed distributions the median exceeds the mean
[Figure: skewed distribution over the range 0 to 25 with the median, $\bar{x}$ and the 50% areas marked]
• An outlier is a “wild” observation in the data
• Outliers occur because of
– errors (typographical and computational)
– extreme cases in the population
• The mean is altered to a significant degree by the presence of outliers
• Outliers have little effect on the value of the median
• This is a reason for using the median in place of the mean as a measure of central location
• Alternatively the mean is the best measure of central location when the data is Normally distributed (Bell-shaped)
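The different sensitivity of the mean and the median to outliers can be seen in a small sketch. The outlier here is hypothetical: it assumes the value 21 from the earlier five-number example was mistyped as 210.

```python
from statistics import mean, median

x = [10, 15, 21, 7, 13]
with_outlier = [10, 15, 210, 7, 13]   # hypothetical typo: 21 entered as 210

print(mean(x), median(x))                        # mean and median of the clean data
print(mean(with_outlier), median(with_outlier))  # the mean moves, the median does not
```

The mean jumps from 13.2 to 51 while the median stays at 13, because only the middle position in the ordered list matters for the median.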