Introduction to Statistics. Objectives: Understand certain statistical concepts terminology...

Post on 18-Jan-2018


Introduction to Statistics

Objectives:

• Understand certain statistical concepts & terminology

• Describe types of measurement scales

• Differentiate between descriptive and inferential statistics

• Identify measures of central tendency and understand their uses

• Identify measures of dispersion

Introduction

• Statistics - a set of concepts, rules, and procedures that help us to:

– organize numerical information in the form of tables, graphs, and charts;

– understand statistical techniques underlying decisions that affect our lives and well-being; and

– make informed decisions.

Descriptive Statistics

• Statistics is a branch of mathematics designed to allow people to accomplish two goals:

1. The first is to accurately describe data and trends in data (descriptive statistics).

Descriptive statistics: Collection, classification, analysis, and interpretation of data.

- Any method or formula that yields a number summarizing a set of data is referred to as a descriptive statistic.

2. The second is to make predictions about future behavior, based on current data (predictive statistics).

Predictive statistics: Using statistics generated from a sample in order to make predictions. This is also often called inferential statistics.

Terminology

• Data - facts, observations, and information that come from investigations. There are two types of data:

1. Measurement data, sometimes called quantitative data -- the result of using some instrument to measure something (e.g., test score, weight).

2. Categorical data, also referred to as frequency or qualitative data. Things are grouped according to some common property (or properties), and the number of members in each group is recorded (e.g., males/females, vehicle type).

Variable

• A property of an object or event that can take on different values. For example, college major is a variable that takes on values like mathematics, computer science, English, psychology, etc.

• Discrete Variable - a variable with a limited number of values (e.g., gender (male/female), employee level (junior/senior)).

• Continuous Variable - a variable that can take on many different values; in theory, any value between the lowest and highest points on the measurement scale.

• Independent Variable - a variable that is manipulated, measured, or selected by the researcher as an antecedent condition to an observed behavior. In a hypothesized cause-and-effect relationship, the independent variable is the cause and the dependent variable is the outcome or effect.

• Dependent Variable - a variable that is not under the experimenter's control -- the data. It is the variable that is observed and measured in response to the independent variable.

• Qualitative Variable - a variable based on categorical data.

• Quantitative Variable - a variable based on quantitative data.

Types of Measurement Scales

1. Nominal:

For qualitative data with distinct categories. For example, German, French, and Italian are categories but are not ordered in any way.

2. Ordinal: For data with distinct categories in which ordering (or ranking) is implied, but the distance between categories is not defined. A good example is the Likert scale that you see on many surveys:

1=Strongly disagree; 2=Disagree; 3=Neutral; 4=Agree; 5=Strongly agree.

Rank in the military is another example. Clearly a major is higher ranked than a captain, but how much higher? Does a major have twice the authority of a captain? It is impossible to say. You can only say the major is higher ranked.

3. Interval: For quantitative data with an ordered scale in which the interval between data values is meaningful, but which has no true zero point. Temperature in degrees Celsius is an example: the difference between 10 and 20 degrees is the same as the difference between 20 and 30 degrees, but 20 degrees is not twice as hot as 10 degrees.

4. Ratio: For quantitative data which have an inherently defined zero and the ratio of data values is meaningful. Weight in kilograms is a very good example since it has a definite ratio from one weight to another. 50kg is indeed twice as heavy as 25 kg.

Two Types of Statistics

• Descriptive statistics of a POPULATION

• Relevant notation (Greek):

– $\mu$ mean

– $N$ population size

– $\Sigma$ sum

• Inferential statistics of SAMPLES from a population.

– Assumptions are made that the sample reflects the population in an unbiased form.

• Relevant notation (Roman):

– $\bar{X}$ mean

– $n$ sample size

– $\Sigma$ sum

Measures of Central Tendency

• These measures describe the center of a set of scores or values in the data, that is, a typical or average value.

– Mean

– Median

– Mode

What is “Mean”?

The “mean” of some data is the average score or value, such as the average age of an MPA student or the average weight of professors who like to eat donuts.

Mean of a sample (inferential): $\bar{X} = \frac{\Sigma X}{n}$

Mean of a population: $\mu = \frac{\Sigma X}{N}$

Mean

• The mean is the most common measure of central tendency and the one that can be mathematically manipulated. It is defined as the sum of the scores divided by the number of scores, $\Sigma X / N$. Simply, the mean is computed by summing all the scores in the distribution ($\Sigma X$) and dividing that sum by the total number of scores ($N$).

• The mean is the balance point in a distribution: if you subtract the mean from each value in the distribution and sum all of these deviation scores, the result will be zero.

Example: 2, 5, 8, 10, 12, 17

Mean = 54/6 = 9

Deviations from the mean: -7, -4, -1, 1, 3, 8. Their sum is zero.
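The balance-point property can be checked with a few lines of Python. This is only a sketch using the six example scores; the variable names are illustrative.

```python
# Scores from the example above
scores = [2, 5, 8, 10, 12, 17]

# Mean = sum of the scores divided by the number of scores
mean = sum(scores) / len(scores)

# Deviation of each score from the mean (score minus mean)
deviations = [x - mean for x in scores]

print(mean)             # 9.0
print(deviations)       # [-7.0, -4.0, -1.0, 1.0, 3.0, 8.0]
print(sum(deviations))  # 0.0
```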

Problem of being “mean”

• The main problem associated with the mean value of some data is that it is sensitive to outliers (extreme values).

• For example, the average weight of 10 students would be pulled upward if one of them weighed 200 kg.

The Median

• Because the mean can be sensitive to extreme values, the median is sometimes a more useful and more representative measure of the center.

• The median is simply the middle value among the ordered scores of a variable (it is found by position rather than by an arithmetic formula).
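To see the effect of an outlier, here is a small sketch in Python. The ten weights are made up for illustration (they are not from the slides); only the 200 kg value echoes the example in the text.

```python
import statistics

# Hypothetical weights (kg) of 10 students; the last one is the outlier
weights = [55, 58, 60, 62, 63, 65, 67, 70, 72, 200]
without_outlier = weights[:-1]

mean_without = statistics.mean(without_outlier)   # mean of the other 9 students
mean_with = statistics.mean(weights)              # mean including the 200 kg student
median_with = statistics.median(weights)          # median including the 200 kg student

print(round(mean_without, 1), mean_with, median_with)
```

In this made-up sample the single extreme value raises the mean by more than 13 kg, while the median stays close to the typical weights.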

• The median is the score that divides the distribution into halves; half of the scores are above the median and half are below it when the data are arranged in numerical order. The median is also referred to as the score at the 50th percentile in the distribution.

• The median location is given by the formula $(N + 1)/2$, where $N$ is the number of observations. When we have an odd number of observations, the formula yields an integer that gives the position of the median in the numerically ordered distribution. For example, in the distribution of numbers (3 1 5 4 9 9 8), $N = 7$ and the median location is the 4th value.

• When applied to the ordered distribution (1 3 4 5 8 9 9), the value 5 is the median; three scores are above 5 and three are below 5.

The Median

• If there were only 6 values (1 3 4 5 8 9), the median location is halfway between the 3rd and 4th scores (4 and 5), so the median is 4.5.

What is the Median?

Boxer            Weight
Schmuggles       165
Bopsey           213
Pallitto         189
Homer            187
Schnickerson     165
Levin            148
Honkey-Doorey    251
Zingers          308
Boehmer          151
Queenie          132
Googles-Boop     199
Calzone          227
Mean             194.6

Weights in rank order: 132, 148, 151, 165, 165, 187, 189, 199, 213, 227, 251, 308

Rank order and choose the middle value.

If the number of values is even, then the median is the average of the two values in the middle.
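One way to work the boxer example is with Python's standard `statistics` module. This is a sketch, not part of the original slides; the manual calculation is included to show the "average of the two middle values" rule.

```python
import statistics

# Boxer weights from the table above
weights = [165, 213, 189, 187, 165, 148, 251, 308, 151, 132, 199, 227]

ordered = sorted(weights)
n = len(ordered)

# n is even (12), so average the two middle values (the 6th and 7th)
manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(manual_median)               # 188.0
print(statistics.median(weights))  # 188.0
```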

The Mode

• Mode - The mode of a distribution is simply defined as the most frequent or common response or value for a variable.

• Multiple modes are possible: bimodal or multimodal.

Figuring the Mode

Boxer            Weight
Schmuggles       165
Bopsey           213
Pallitto         189
Homer            187
Schnickerson     165
Levin            148
Honkey-Doorey    251
Zingers          308
Boehmer          151
Queenie          132
Googles-Boop     199
Calzone          227

What is the mode?

Answer: ??
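One way to check your answer is to count how often each weight occurs. The sketch below uses `collections.Counter` and `statistics.multimode` from the Python standard library (Python 3.8 or later is assumed for `multimode`).

```python
from collections import Counter
import statistics

# Boxer weights from the table above
weights = [165, 213, 189, 187, 165, 148, 251, 308, 151, 132, 199, 227]

# Count occurrences of each weight; the most common one is the mode
counts = Counter(weights)
mode_value, mode_count = counts.most_common(1)[0]

# multimode returns every value tied for most frequent (useful for bimodal data)
modes = statistics.multimode(weights)

print(mode_value, mode_count)  # 165 2
print(modes)                   # [165]
```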

Percentiles

• If we know the median, then we can go up or down and rank the data as being above or below certain thresholds.

• You may be familiar with percentiles from standardized tests: if you scored at the 90th percentile, your score was higher than 90% of the rest of the sample.

To calculate the kth percentile (where k is any number between zero and one hundred), do the following steps:

1. Order all the values in the data set from smallest to largest.

2. Multiply k percent by the total number of values, n. This number is called the index.

3. If the index obtained in Step 2 is not a whole number, round it up to the nearest whole number and go to Step 4a. If the index obtained in Step 2 is a whole number, go to Step 4b.

4a. (Index is not a whole number) Count the values in your data set from left to right (from the smallest to the largest value) until you reach the number indicated by Step 3. The corresponding value in your data set is the kth percentile.

4b. (Index is a whole number) Count the values in your data set from left to right until you reach the number indicated by Step 2. The kth percentile is the average of that corresponding value in your data set and the value that directly follows it.

For example, suppose you have 25 test scores, and in order from lowest to highest they look like this: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. To find the 90th percentile for these (ordered) scores, start by multiplying 90% times the total number of scores, which gives 90% × 25 = 0.90 × 25 = 22.5 (the index). Rounding up to the nearest whole number, you get 23.

Counting from left to right (from the smallest to the largest value in the data set), you go until you find the 23rd value in the data set. That value is 98, and it is the 90th percentile for this data set.

Check: 22 of the 25 scores fall below 98, and 22/25 = 0.88, so 98 is the first score that reaches the 90% mark.

Now say you want to find the 20th percentile. Start by taking 0.20 x 25 = 5 (the index); this is a whole number, so proceed from Step 3 to Step 4b, which tells you the 20th percentile is the average of the 5th and 6th values in the ordered data set (62 and 66). The 20th percentile then comes to (62 + 66) ÷ 2 = 64.

The median (the 50th percentile) for the test scores is the 13th score: 77.
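The steps above can be sketched as a short Python function. The function name `kth_percentile` is made up for this illustration, and the sketch only handles k strictly between 0 and 100. Other definitions of percentiles exist, so library functions may return slightly different values.

```python
import math

def kth_percentile(values, k):
    """Return the kth percentile using the round-up / average rule from the steps above."""
    if not 0 < k < 100:
        raise ValueError("this sketch only handles 0 < k < 100")
    ordered = sorted(values)            # Step 1: order the values
    index = k * len(ordered) / 100      # Step 2: k percent of n
    if index != int(index):             # Step 3 -> 4a: not a whole number, round up
        return ordered[math.ceil(index) - 1]
    position = int(index)               # Step 3 -> 4b: whole number, average two values
    return (ordered[position - 1] + ordered[position]) / 2

scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

print(kth_percentile(scores, 90))  # 98
print(kth_percentile(scores, 20))  # 64.0
print(kth_percentile(scores, 50))  # 77
```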

Measures of Dispersion

• Measures of dispersion tell us about variability in the data.

• Basic question: how much do the values of a variable differ, from the minimum to the maximum, and how far apart are the scores in between? We use:

– Range

– Standard Deviation

– Variance

• Remember that in order to assemble information from data, i.e. to make an inference, we need to see variability in our variables.

• Measures of dispersion give us information about how much the values of our variables vary around the mean; if they do not vary, it is difficult to infer anything from the data. Dispersion is also known as the spread or range of variability.

The Range

• Range = highest - lowest

$r = h - l$, where $h$ is the highest value and $l$ is the lowest value.

• In other words, the range gives us the distance between the minimum and maximum values of a variable.

• Understanding this statistic is important in understanding your data, especially for management and diagnostic purposes.

Example:

Problem: Cheryl took 7 math tests in one marking period. What is the range of her test scores?

89, 73, 84, 91, 87, 77, 94

Solution: Ordering the test scores from least to greatest, we get:

73, 77, 84, 87, 89, 91, 94

highest - lowest = 94 - 73 = 21
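The same calculation can be written in Python as a minimal sketch. Sorting is not actually required, because `max` and `min` find the extremes directly.

```python
# Cheryl's test scores from the problem above
test_scores = [89, 73, 84, 91, 87, 77, 94]

# Range = highest - lowest
score_range = max(test_scores) - min(test_scores)

print(sorted(test_scores))  # [73, 77, 84, 87, 89, 91, 94]
print(score_range)          # 21
```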

The Standard Deviation

• A standardized measure of distance from the mean.

• Very useful, and something you will often read about when predictions or other statements are made about the data.

Standard Deviation

• most popular and important measure of variability

• a measure of how far all of the individual scores in the distribution are from a standard (mean)

[Figure: distributions around the mean. Low variability gives a small SD; high variability gives a large SD.]

Formula for Standard Deviation

$S = \sqrt{\frac{\Sigma (X - \bar{X})^2}{n - 1}}$

where:

$\sqrt{\ }$ = square root

$\Sigma$ = sum (sigma)

$X$ = score for each point in the data

$\bar{X}$ = mean of scores for the variable

$n$ = sample size (number of observations or cases)

Example: Calculate the SD for the following values: 4, 2, 5, 8, 6.

1. Calculate the mean: $\bar{X} = (4 + 2 + 5 + 8 + 6)/5 = 25/5 = 5$

2. Calculate the deviation from the mean for each value in the sample: 4-5=-1, 2-5=-3, 5-5=0, 8-5=3, 6-5=1

3. Square each deviation and sum the squares: 1 + 9 + 0 + 9 + 1 = 20

4. Calculate the standard deviation: $S = \sqrt{\frac{20}{5 - 1}} = \sqrt{5} \approx 2.24$
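The worked example can be reproduced step by step in Python. This sketch follows the sample formula with $n - 1$ in the denominator and then compares the result with `statistics.stdev` from the standard library, which uses the same definition.

```python
import math
import statistics

values = [4, 2, 5, 8, 6]

# Step 1: mean
mean = sum(values) / len(values)

# Steps 2 and 3: squared deviations from the mean, then their sum
squared_deviations = [(x - mean) ** 2 for x in values]
sum_of_squares = sum(squared_deviations)

# Step 4: divide by n - 1 and take the square root
sample_sd = math.sqrt(sum_of_squares / (len(values) - 1))

print(mean, sum_of_squares)      # 5.0 20.0
print(round(sample_sd, 2))       # 2.24
print(statistics.stdev(values))  # library result for comparison
```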

Variance

$S^2 = \frac{\Sigma (X - \bar{X})^2}{n - 1}$

• Note that this is the same equation as for the standard deviation, except that no square root is taken.

• It is not often reported directly in research but instead is a building block for other statistical methods.

Standard Deviation of the Mean or the standard error (SE)

• It is the variation in the means of repeated samples.

• SE = standard deviation divided by the square root of n: $SE = S / \sqrt{n}$

Coefficient of Variation

• It measures variability in relation to the mean (or average).

• Used to compare the relative dispersion of more than one data set. The data to be compared may be in the same units or in different units, and may have similar or different means.

CV = standard deviation divided by the mean, multiplied by 100 to express it as a percentage.

$CV = \frac{S}{\bar{X}} \times 100\%$
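As a rough illustration, the variance, standard error, and coefficient of variation can all be computed for the small sample used in the standard deviation example (4, 2, 5, 8, 6). The variable names are illustrative only.

```python
import math
import statistics

values = [4, 2, 5, 8, 6]
n = len(values)

mean = statistics.mean(values)          # 5
variance = statistics.variance(values)  # sample variance, n - 1 in the denominator
sd = statistics.stdev(values)           # square root of the variance

standard_error = sd / math.sqrt(n)      # SE = S / sqrt(n)
cv_percent = sd / mean * 100            # CV = (S / mean) x 100%

print(variance)                  # 5
print(round(standard_error, 2))  # 1.0
print(round(cv_percent, 1))      # 44.7
```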

Goal of Graphing?

1. Presentation of Descriptive Statistics

2. Presentation of Evidence

3. Some people understand subject matter better with visual aids.

4. Provide a sense of the underlying data generating process (data pattern).

Normal Distribution

• Most widely used continuous distribution

• Also known as the Gaussian distribution

• Symmetric

Graphing Data: Histograms

Graphing Data: Bar Graph

Pie Charts: Proportions of Donut-Eating Professors by Weight Class

[Pie chart; legend of weight classes: 130-150, 151-185, 186-210, 211-240, 241-270, 271-310, 311+]

Line Graphs: A Time Series

Frequency Distribution Table

VAR00003

Valid    Frequency   Percent   Valid Percent   Cumulative Percent
1.00     2           7.7       7.7             7.7
2.00     3           11.5      11.5            19.2
3.00     3           11.5      11.5            30.8
4.00     5           19.2      19.2            50.0
5.00     4           15.4      15.4            65.4
6.00     2           7.7       7.7             73.1
7.00     4           15.4      15.4            88.5
8.00     2           7.7       7.7             96.2
9.00     1           3.8       3.8             100.0
Total    26          100.0     100.0
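The percent and cumulative percent columns follow directly from the frequencies. Here is a minimal sketch in Python that rebuilds them from the frequency column of the table; the dictionary layout is just one convenient way to hold the counts.

```python
# Value -> frequency, taken from the table above
frequencies = {1: 2, 2: 3, 3: 3, 4: 5, 5: 4, 6: 2, 7: 4, 8: 2, 9: 1}
total = sum(frequencies.values())

rows = []
running_count = 0
for value, count in frequencies.items():
    running_count += count
    percent = round(100 * count / total, 1)               # share of all cases
    cumulative = round(100 * running_count / total, 1)    # share at or below this value
    rows.append((value, count, percent, cumulative))

for row in rows:
    print(row)
print("Total", total)
```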

Properties of a Distribution

• Shape

– symmetric vs. skewed

– unimodal vs. multimodal

• Central Tendency

– where most of the data are

– mean, median, and mode

• Variability (spread)

– how similar the scores are

– range, variance, and standard deviation

Representing a Distribution

• Often it is helpful to visually represent distributions in various ways.

• Graphs

– continuous variables (histogram, line graph)

– categorical variables (pie chart, bar chart)

• Tables

– frequency distribution table

Shape of a Distribution

• Symmetrical (e.g., normal)

– scores are equally distributed about the central tendency (i.e., mean)

• Skewed

– extreme high or low scores can skew the distribution in either direction

[Figure: a negatively skewed and a positively skewed distribution]

• Unimodal

• Multimodal

[Figure: a multimodal distribution with a minor mode and a major mode]

Central Tendency

• Mode: the most frequent score

– good for nominal scales (eye color)

• Median: the middle score

– separates the bottom 50% and the top 50% of the distribution

– good for skewed distributions (net worth)

• Mean: the arithmetic average

– add all of the scores and divide by the total number of scores

– This is the preferred measure of central tendency (takes all of the scores into account)

population: $\mu = \frac{\Sigma X}{N}$

sample: $\bar{X} = \frac{\Sigma X}{n}$

Central Tendency

• Is the mean always the best measure of central tendency?

• No, skew pulls the mean in the direction of the skew

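Central Tendency and Skew

If negative skew: [Figure: negatively skewed distribution with the positions of the mode, median, and mean marked]

If positive skew: [Figure: positively skewed distribution with the positions of the mode, median, and mean marked]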

Normal Distribution

• Gives us a picture of the variability and central tendency.

$P(Y \le \mu) = 0.50$

$P(\mu - \sigma \le Y \le \mu + \sigma) \approx 0.68$

$P(\mu - 2\sigma \le Y \le \mu + 2\sigma) \approx 0.95$
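These probabilities can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later is assumed). The sketch uses the standard normal distribution (mean 0, SD 1); the proportions are the same for any normal distribution.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling below the mean
below_mean = standard_normal.cdf(0)

# Probability of falling within 1 and within 2 standard deviations of the mean
within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(below_mean)             # 0.5
print(round(within_1_sd, 2))  # 0.68
print(round(within_2_sd, 2))  # 0.95
```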

Standard Deviation

• In a normal distribution, about 2/3 of the scores (roughly 68%) will fall within +/- 1 standard deviation of the mean (suppose SD = 3.27).

[Figure: normal curve with mean 6.4; -1 SD falls at 3.13 and +1 SD falls at 9.67]