Statistics four

Statistics “Four” Mohamed Ahmed Hefny, MD.


Describing data with numeric summary values


Learning objectives

1. Explain what prevalence and incidence are.

2. Explain what a summary measure of location is, and show that you understand the meaning of, and the difference between, the mode, the median and the mean.

3. Be able to calculate the mode, median and mean for a set of values.

4. Explain what a percentile is, and calculate any given percentile value.

5. Explain what a summary measure of spread is, and show that you understand the difference between, and can calculate, the range, the interquartile range and the standard deviation.


Numbers, percentages and proportions

• When you present the results of an investigation, you will almost certainly need to give the number of subjects involved, and perhaps also the corresponding percentages.

• It is usually categorical data that are summarized with a value for percentage or proportion.


Prevalence and the incidence rate

When suitable, we can also summarize data by providing a value for the prevalence or the incidence rate of some condition.

• The prevalence of a disease is the number of existing cases in some population at a given time (point prevalence). In practice, the period prevalence, which counts the cases existing at any time during a stated period, is more often used.

• For example, the prevalence of breast cancer in women in a place in 2010 was 3.1%. This figure includes all existing cases: those who contracted the disease before 2010 and still had it, as well as those who first got the disease in 2010.


The incidence (or inception) rate of a disease is the number of new cases occurring per 1000, or per 10 000, of the population during a given period, usually 12 months.
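The two definitions can be sketched as simple ratios. The function names and all figures below are hypothetical, invented only to illustrate the arithmetic; they are not taken from any study.

```python
def prevalence_percent(existing_cases: int, population: int) -> float:
    """Existing cases (old and new) as a percentage of the population."""
    return 100 * existing_cases / population


def incidence_rate(new_cases: int, population: int, per: int = 1000) -> float:
    """New cases arising in the period, expressed per `per` people."""
    return per * new_cases / population


# Hypothetical figures for one town over one year (assumed, not real data).
population = 20_000
existing_cases = 620  # everyone who had the disease at any time in the year
new_cases = 90        # only those who first got the disease during the year

print(prevalence_percent(existing_cases, population))  # 3.1 (percent)
print(incidence_rate(new_cases, population))            # 4.5 (per 1000)
```

Note that the new cases are also counted among the existing cases, which is why the period prevalence is never smaller than the incidence over the same period.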


Summary measures of location

A summary measure of location is a value around which most of the data values tend to congregate or center. There are three measures of location:

• Mode

• Median

• Mean


Mode

• The mode is that category or value in the data that has the highest frequency (i.e. occurs the most often). In this sense, the mode is a measure of common-ness or typical-ness.

• The mode is not particularly useful with metric continuous data where no two values may be the same. The other deficiency of this measure is that there may be more than one mode in a set of data.

Patient   Number of inhaler uses in last 24 hours
A         5
B         12
C         10
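As a rough illustration of finding the mode, and of the two deficiencies mentioned above, here is a minimal sketch using Python's standard `statistics.multimode`. The data lists are hypothetical and are not the values from the slide.

```python
from statistics import multimode

# Hypothetical inhaler-use counts for seven patients (assumed data).
inhaler_uses = [5, 12, 10, 5, 8, 12, 5]
print(multimode(inhaler_uses))  # [5]  (5 occurs three times)

# Deficiency 1: a data set can have more than one mode.
tied = [5, 12, 10, 12, 5]
print(multimode(tied))  # [5, 12]

# Deficiency 2: with metric continuous data no two values may coincide,
# so every value is returned as a "mode" and the measure tells us nothing.
birth_weights_kg = [3.21, 3.48, 2.97, 4.02]
print(multimode(birth_weights_kg))
```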


Median

• If we arrange the data in ascending order of size, the median is the middlemost number in the set. Thus, half of the values will be equal to or less than the median value, and half equal to or above it. The median is thus a measure of central-ness.

• For example, take the ages (in ascending order, in years) of 5 individuals: 30 31 32 33 35. The middle value is 32, so the median age for these 5 people is 32 years.


• Another way of determining the median: if you have n values arranged in ascending order, then the median is the ½(n + 1)th value.

• For example, if the ages of six people are 30 31 32 33 35 36, then n = 6, and therefore ½(n + 1) = ½ × (6 + 1) = ½ × 7 = 3.5.

• The median is then the 3.5th value, that is, the value halfway between the 3rd value (32) and the 4th value (33), or 32.5 years.

• An advantage of the median is that it is not much affected by skewness in the distribution, or by the presence of outliers. However, it discards a lot of information, because it ignores most of the values, apart from those in the center of the distribution.
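The ½(n + 1) rule can be sketched in a few lines. The function name `median_by_position` is a hypothetical helper chosen for this illustration; the standard library's `statistics.median` is used only as a cross-check.

```python
import math
from statistics import median


def median_by_position(values):
    """Median as the ½(n + 1)th value of the ordered data."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                     # 1-based, may end in .5
    lower = ordered[math.floor(position) - 1]  # value just below the position
    upper = ordered[math.ceil(position) - 1]   # value just above the position
    return (lower + upper) / 2                 # halfway between the two


five_ages = [30, 31, 32, 33, 35]
six_ages = [30, 31, 32, 33, 35, 36]

print(median_by_position(five_ages))  # 32.0 (the 3rd value)
print(median_by_position(six_ages))   # 32.5 (the "3.5th" value)
print(median(six_ages))               # 32.5, the library agrees
```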


Mean

• The mean, or the arithmetic mean to give it its full name, is more commonly known as the average.

• One advantage of the mean over the median is that it uses all of the information in the data set.

• However, it is affected by skewness in the distribution, and by the presence of outliers in the data.

• This may, on occasion, produce a mean that is not very representative of the general mass of the data.

• Moreover, it cannot be used with ordinal data.
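To see how an outlier pulls the mean but leaves the median almost untouched, consider this small sketch. The first list reuses the five ages from the median example; the second list, with one age changed to 95, is a hypothetical variation added for illustration.

```python
from statistics import mean, median

ages = [30, 31, 32, 33, 35]
ages_with_outlier = [30, 31, 32, 33, 95]  # hypothetical: 35 replaced by 95

print(mean(ages), median(ages))                            # 32.2 32
print(mean(ages_with_outlier), median(ages_with_outlier))  # 44.2 32
```

The mean moves from 32.2 to 44.2 years, a value larger than four of the five ages, while the median stays at 32.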


Percentiles

• A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

• Percentiles are the values which divide an ordered set of data into 100 equal-sized groups.

• Notice that this makes the median the 50th percentile, since it divides the data values into two equal halves, 50 per cent above the median and 50 per cent below.
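One possible way to calculate a percentile is sketched below. Several conventions exist and statistical packages differ; this sketch assumes the rule that the p-th percentile is the (p/100)(n + 1)th ordered value, with linear interpolation between neighbours, chosen here because it agrees with the ½(n + 1) rule for the median. The function name `percentile` is hypothetical.

```python
def percentile(values, p):
    """p-th percentile using the (p/100)(n + 1)th-value convention."""
    if not values:
        raise ValueError("percentile of an empty data set is undefined")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100         # 1-based, may be fractional
    position = min(max(position, 1), n)  # keep inside the data
    whole = int(position)                # rank of the value just below
    fraction = position - whole          # how far towards the next value
    if whole >= n:                       # already at the largest value
        return float(ordered[-1])
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)


ages = [30, 31, 32, 33, 35, 36]
print(percentile(ages, 50))  # 32.5, the median
print(percentile(ages, 25))  # 30.75, the first quartile under this rule
```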


Choosing the most appropriate measure

• How do you choose the most appropriate measure of location for some given set of data?

• The main thing to remember is that the mean cannot be used with ordinal data (because they are not real numbers), and that the median can be used for both ordinal and metric data (particularly when the latter is skewed).

Type of variable     Summary measure of location
                     Mode    Median                                     Mean
Nominal              Yes     No                                         No
Ordinal              Yes     Yes                                        No
Metric discrete      Yes     Yes, if distribution is markedly skewed    Yes
Metric continuous    No      Yes, if distribution is markedly skewed    Yes

Choosing an appropriate measure of location


Summary measures of spread

As well as a summary measure of location, a summary measure of spread or dispersion can also be very useful. There are three main measures in common use:

• Range

• Interquartile range

• Standard deviation


Range

• The range is the distance from the smallest value to the largest. The range is not affected by skewness, but is sensitive to the addition or removal of an outlier value. For example, the range of the 30 birth weights is (2.86 to 4.49 kg).

• The range is best written like this, as the two end values, rather than as the single-valued difference (about 1.6 kg in this example), which is much less informative.

• The range can sometimes be misleading when there are extremely high or low values.


The interquartile range (iqr)

• One solution to the problem of the sensitivity of the range to extreme values (outliers) is to remove a quarter (25%) of the values from each end of the distribution (which removes any troublesome outliers), and then measure the range of the remaining values. This distance is called the interquartile range, or iqr.

• The interquartile range is not affected either by outliers or skewness, but it does not use all of the information in the data set since it ignores the bottom and top quarter of values.
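The contrast between the range and the iqr can be sketched as follows. The birth weights are hypothetical (the 30 original values are not reproduced here); only the smallest and largest are chosen to match the range quoted earlier. The quartiles come from the standard library's `statistics.quantiles`, whose default method is one of several conventions in use.

```python
from statistics import quantiles


def value_range(values):
    """Range written as (smallest, largest)."""
    return min(values), max(values)


def interquartile_range(values):
    """Distance between the first and third quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    return q3 - q1


# Hypothetical birth weights in kg (assumed data for illustration).
weights = [2.86, 3.10, 3.25, 3.40, 3.52, 3.60, 3.75, 3.90, 4.10, 4.30, 4.49]
with_outlier = weights + [9.00]  # one implausibly heavy baby added

print(value_range(weights), round(interquartile_range(weights), 2))
print(value_range(with_outlier), round(interquartile_range(with_outlier), 2))
```

Adding the single outlier stretches the range from (2.86, 4.49) to (2.86, 9.0), while the iqr changes only slightly.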


Standard Deviation

The Standard Deviation is a measure of how spread out numbers are. Its symbol is σ (the Greek letter sigma). The formula is easy: it is the square root of the Variance. So now you ask, "What is the Variance?"

Variance

The Variance is defined as the average of the squared differences from the Mean.


You and your friends have just measured the heights of your dogs (in millimeters). The heights (at the shoulders) are: 600 mm, 470 mm, 170 mm, 430 mm and 300 mm. Find out the Mean, the Variance, and the Standard Deviation.

Your first step is to find the Mean:

Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394


So the mean (average) height is 394 mm. Let's plot this on the chart:


To calculate the Variance, first calculate each dog's difference from the Mean: 206, 76, −224, 36 and −94 mm. Then square each difference and average the results:

Variance = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108,520 / 5 = 21,704

So, the Variance is 21,704.


And the Standard Deviation is just the square root of the Variance, so:

Standard Deviation: σ = √21,704 = 147.32... = 147 (to the nearest mm)

And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one Standard Deviation (147 mm) of the Mean, that is, between 247 mm and 541 mm: the heights 300 mm, 430 mm and 470 mm are, while 170 mm and 600 mm are not.

So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
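The dog example can be retraced step by step in a short sketch. It divides by n, as the definition above does (the "population" variance); be aware that many packages default to the sample version, which divides by n − 1 and gives a somewhat larger value. The variable names are illustrative.

```python
import math
from statistics import pstdev, pvariance

heights_mm = [600, 470, 170, 430, 300]

mean_height = sum(heights_mm) / len(heights_mm)                  # 394.0
differences = [h - mean_height for h in heights_mm]              # 206, 76, ...
variance = sum(d ** 2 for d in differences) / len(heights_mm)    # 21704.0
std_dev = math.sqrt(variance)                                    # about 147.32

print(mean_height, variance, round(std_dev))  # 394.0 21704.0 147

# Heights lying within one Standard Deviation of the Mean.
within_one_sd = [h for h in heights_mm if abs(h - mean_height) <= std_dev]
print(within_one_sd)  # [470, 430, 300]

# Cross-check against the standard library's population formulas.
assert pvariance(heights_mm) == variance
assert math.isclose(pstdev(heights_mm), std_dev)
```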


• The smaller this mean distance is, the narrower the spread of values must be, and vice versa.

• This idea is the basis for what is known as the standard deviation, or SD.


Type of variable     Summary measure of spread
                     Range    Interquartile range    Standard deviation
Nominal              No       No                     No
Ordinal              Yes      Yes                    No
Metric               Yes      Yes, if skewed         Yes

Choosing an appropriate measure of spread


Thank You