AN OVERVIEW OF STATISTICS. WHAT IS STATISTICS? What does a statistician do? Player Games Minutes...

Post on 11-Jan-2016


AN OVERVIEW OF STATISTICS

WHAT IS STATISTICS?

What does a statistician do?

Player    Games   Minutes   Points   Rebounds   FG%
Bob        34      32.7       24       7.6      .552
Andy       36      31.5       21       8.4      .465
Larry      30      33.0       18       5.6      .493
Michael    31      35.1       29       6.1      .422

JOB OF A STATISTICIAN

• Collects numbers or data
• Systematically organizes or arranges the data
• Analyzes the data: extracts relevant information to provide a complete numerical description
• Infers general conclusions about the problem using this numerical description

POLITICS

Forecasting and predicting winners of elections

Where to concentrate campaign appearances, advertising and $$…

If the election for president of the United States were held today, who would you be more likely to vote for?

Rudy Giuliani 45%
Hillary Clinton 43%
Someone else 2%
Wouldn’t vote 4%
Unsure 6%

INDUSTRY

• To market a product…
• Interested in the average length of life of a light bulb
• Cannot test all the bulbs

USES OF STATISTICS

• Statistics is a theoretical discipline in its own right

• Statistics is a tool for researchers in other fields

• Used to draw general conclusions in a large variety of applications

COMMON PROBLEM

Decision or prediction about a large body of measurements which cannot be totally enumerated.

Examples

• Light bulbs (to enumerate population is destructive)

• Forecasting the winner of an election (population too big; people change their minds)

Solutions

Collect a smaller set of measurements that will (hopefully) be representative of the larger set.

DATA AND STATISTICS

Data consists of information coming from observations, counts, measurements, or responses.

Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.

A population is the collection of all outcomes, responses, measurements, or counts that are of interest.

A sample is a subset of a population.

Introduction to Probability and Statistics

Thirteenth Edition

Chapter 1

Describing Data with Graphs

Introduction to Statistical Terms

Variable
o Something that can assume some type of value

Data
o Consists of information coming from observations, counts, measurements, or responses

Data Set
o A collection of data values

Observation
o The value, at a particular period, of a particular variable

An experimental unit is the individual or object on which a variable is measured.

A measurement results when a variable is actually measured on an experimental unit.

A set of measurements, called data, can be either a sample or a population.

Example
• Variable – Time until a light bulb burns out
• Experimental unit – Light bulb
• Typical measurements – 1500 hours, 1535.5 hours, etc.

Populations and Samples

• A Population is the set of all items or individuals of interest
– Examples:
  All likely voters in the next election
  All parts produced today
  All sales receipts for November

• A Sample is a subset of the population
– Examples:
  1000 voters selected at random for interview
  A few parts selected for destructive testing
  Every 100th receipt selected for audit

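[Diagram: sampling techniques draw a sample from the population; statistical procedures support inference from the sample back to the population. The population is described by parameters and the sample by statistics.]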

Parameters & Statistics

A parameter is a numerical description of a population characteristic.

A statistic is a numerical description of a sample characteristic.

Parameter ↔ Population
Statistic ↔ Sample

Univariate data: One variable is measured on a single experimental unit.

Bivariate data: Two variables are measured on a single experimental unit.

Multivariate data: More than two variables are measured on a single experimental unit.

Nominal
o For things that are mutually exclusive/non-overlapping
o There is no order or ranking
o For example: gender (male or female), religion

Ordinal
o Can be ordered, but the differences between ranks are not precise
o For example: health quality (excellent, good, adequate, bad, terrible)

Interval
o Involves measurements with precise differences, but there is no meaningful zero
o For example: temperature in degrees Celsius or Fahrenheit

Ratio
o Involves measurements that can be ranked, with precise differences between the ranks and a meaningful zero
o For example: height, time, or weight

Types of Variables

[Diagram: variables are either qualitative or quantitative; quantitative variables are either discrete or continuous.]

• Qualitative variables measure a quality or characteristic on each experimental unit.
• Examples:
– Hair color (black, brown, blonde…)
– Make of car (Dodge, Honda, Ford…)
– Gender (male, female)
– State of birth (California, Arizona,….)

• Quantitative variables measure a numerical quantity on each experimental unit.
– Discrete if it can assume only a finite or countable number of values.
– Continuous if it can assume the infinitely many values corresponding to the points on a line interval.

Examples

• For each orange tree in a grove, the number of oranges is measured.
– Quantitative discrete

• For a particular day, the number of cars entering a college campus is measured.
– Quantitative discrete

• Time until a light bulb burns out
– Quantitative continuous

Statistical Methods

Statistical methods divide into descriptive statistics and inferential statistics.

Descriptive Statistics

• Utilizes numerical and graphical methods to look for patterns in the data set.
• The data can either be a representation of the entire population or a sample.

Descriptive methods are either graphical or numerical, and the choice depends on the type of variable:

• Graphical, qualitative data:
– Bar Chart
– Pie Chart

• Graphical, quantitative data:
– Bar/Pie Chart
– Line Plot (Time Series)
– Dotplot
– Stem-and-Leaf Plot
– Histogram
– Ogive
– Boxplot

Note: Some graphs require a tabular representation (frequency distribution)

• Numerical, qualitative data:
– Tables, frequency, percentage, cumulative percentage
– Cross tabulation

• Numerical, quantitative data:
– Central Tendency
– Dispersion (Variability)

Graphing Qualitative Variables

• Use a data distribution to describe:
– What values of the variable have been measured
– How often each value has occurred

• “How often” can be measured 3 ways:
– Frequency
– Relative frequency = Frequency/n
– Percent = 100 x Relative frequency

• Graphs for qualitative data:
– Bar Chart
– Pie Chart

Example

• A bag of M&Ms contains 25 candies. Tallying the raw data by color gives the statistical table:

Color     Frequency   Relative Frequency   Percent
Red           3         3/25 = .12           12%
Blue          6         6/25 = .24           24%
Green         4         4/25 = .16           16%
Orange        5         5/25 = .20           20%
Brown         3         3/25 = .12           12%
Yellow        4         4/25 = .16           16%
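The three measures of "how often" can be sketched in a few lines of Python. This is an illustrative sketch only: the raw list of colors is rebuilt from the frequencies in the table (the individual candies are not recoverable from the slides), and the name `frequency_table` is a hypothetical helper.

```python
from collections import Counter

# Frequencies as given in the statistical table (25 candies in total).
frequencies = {"Red": 3, "Blue": 6, "Green": 4, "Orange": 5, "Brown": 3, "Yellow": 4}

# Rebuild a raw list of colors so the tally step is shown explicitly.
raw_data = [color for color, count in frequencies.items() for _ in range(count)]


def frequency_table(values):
    """Return {value: (frequency, relative frequency, percent)}."""
    n = len(values)
    counts = Counter(values)
    return {v: (f, f / n, 100 * f / n) for v, f in counts.items()}


table = frequency_table(raw_data)
for color, (f, rel, pct) in table.items():
    print(f"{color:<8}{f:>3}{rel:>8.2f}{pct:>7.0f}%")
```

The relative frequencies always sum to 1 and the percents to 100, which is a quick check on any table built this way.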


Graphs

[Figure: Bar chart of Frequency (vertical axis 0 to 6) by Color: Green, Orange, Blue, Red, Yellow, Brown.]

[Figure: Pie chart by Color: Green 16.0%, Orange 20.0%, Blue 24.0%, Red 12.0%, Yellow 16.0%, Brown 12.0%.]

Graphing Quantitative Variables

• Bar/Pie Chart
• Line Plot (Time Series)
• Dotplot
• Stem-and-Leaf Plot
• Histogram
• Ogive
• Boxplot

Graphing Quantitative Variables (1)

• A single quantitative variable measured for different population segments or for different categories of classification can be graphed using a bar or pie chart.

Example: A Big Mac hamburger costs $4.90 in Switzerland, $2.90 in the U.S. and $1.86 in South Africa.

[Figure: Bar chart of Cost of a Big Mac ($), vertical axis 0 to 5, by Country: South Africa, U.S., Switzerland.]

Graphing Quantitative Variables (2)

• A single quantitative variable measured over time is called a time series. It can be graphed using a line or bar chart.

CPI: All Urban Consumers, Seasonally Adjusted

Sept     Oct      Nov      Dec      Jan      Feb      Mar
178.10   177.60   177.50   177.30   177.60   178.00   178.60

Graphing Quantitative Variables (3) - Dotplot

• The simplest graph for quantitative data
• Plots the measurements as points on a horizontal axis, stacking the points that duplicate existing points.
• Example: The set 4, 5, 5, 7, 6

[Figure: dotplot on a horizontal axis from 4 to 7, with one dot above 4, 6 and 7 and two stacked dots above 5.]
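The stacking rule for a dotplot can be sketched as a small text renderer. This is a minimal sketch, assuming single-digit integer values so the columns line up; `text_dotplot` is a hypothetical helper name.

```python
from collections import Counter


def text_dotplot(values):
    """Return a text dotplot: one column per integer, duplicates stacked."""
    counts = Counter(values)
    axis = list(range(min(values), max(values) + 1))
    rows = []
    # Draw from the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = " ".join("*" if counts[v] >= level else " " for v in axis)
        rows.append(row.rstrip())
    rows.append(" ".join(str(v) for v in axis))
    return "\n".join(rows)


plot = text_dotplot([4, 5, 5, 7, 6])
print(plot)
```

For the set 4, 5, 5, 7, 6 the output has a second dot stacked above 5 and a single dot above each of 4, 6 and 7.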

Stem and Leaf Plots (4)

• A simple graph for quantitative data
• Uses the actual numerical values of each data point.
– Divide each measurement into two parts: the stem and the leaf.
– List the stems in a column, with a vertical line to their right.
– For each measurement, record the leaf portion in the same row as its matching stem.
– Order the leaves from lowest to highest in each stem.
– Provide a key to your coding.

Example: Stem-and-Leaf Plot

The prices ($) of 18 brands of walking shoes:

90 70 70 70 75 70 65 68 60
74 70 95 75 70 68 65 40 65

4 | 0
5 |
6 | 0 5 5 5 8 8
7 | 0 0 0 0 0 0 4 5 5
8 |
9 | 0 5

Key: 4 | 0 = $40
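The construction steps for a stem-and-leaf plot can be sketched as follows. This is an illustrative sketch, assuming two-digit values so that the tens digit is the stem and the ones digit is the leaf; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

prices = [90, 70, 70, 70, 75, 70, 65, 68, 60,
          74, 70, 95, 75, 70, 68, 65, 40, 65]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (ones digits), in order."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting first keeps the leaves ordered
        leaves[v // 10].append(v % 10)
    # Keep empty stems so gaps in the data stay visible.
    return {stem: leaves.get(stem, [])
            for stem in range(min(leaves), max(leaves) + 1)}


plot = stem_and_leaf(prices)
for stem, leaf_list in plot.items():
    print(f"{stem} | {' '.join(map(str, leaf_list))}")
print("Key: 4 | 0 = $40")
```

Keeping the empty stems 5 and 8 matters: dropping them would hide the gap between the $40 shoe and the rest of the prices.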

Relative Frequency Histograms (5)

• A relative frequency histogram for a quantitative data set is a bar graph in which the height of the bar shows “how often” (measured as a proportion or relative frequency) measurements fall in a particular class or subinterval.

• Divide the range of the data into 5 to 12 subintervals of equal length.
• Calculate the approximate width of the subinterval as Range/number of subintervals.
• Round the approximate width up to a convenient value.
• Use the method of left inclusion, including the left endpoint, but not the right, in your tally.
• Create a statistical table including the subintervals, their frequencies and relative frequencies.
• Draw the relative frequency histogram, plotting the subintervals on the horizontal axis and the relative frequencies on the vertical axis.

• The height of the bar represents
– The proportion of measurements falling in that class or subinterval.
– The probability that a single measurement, drawn at random from the set, will belong to that class or subinterval.

Example 1

The ages of 50 tenured faculty at a state university:

34 48 70 63 52 52 35 50 37 43 53 43 52 44
42 31 36 48 43 26 58 62 49 34 48 53 39 45
34 59 34 66 40 59 36 41 35 36 62 34 38 28
43 50 30 43 32 44 58 53

• We choose to use 6 intervals.
• Minimum class width = Range/6 = (70 – 26)/6 = 7.33
• Convenient class width = 8
• Use 6 classes of length 8, starting at 25.

Age          Tally               Frequency   Relative Frequency   Percent
25 to < 33   |||||                   5         5/50 = .10           10%
33 to < 41   ||||| ||||| ||||       14        14/50 = .28           28%
41 to < 49   ||||| ||||| |||        13        13/50 = .26           26%
49 to < 57   ||||| ||||              9         9/50 = .18           18%
57 to < 65   ||||| ||                7         7/50 = .14           14%
65 to < 73   ||                      2         2/50 = .04            4%
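The tally with left inclusion can be checked with a short sketch. The function name `left_inclusive_counts` and its parameters are hypothetical; the data, starting point (25), class width (8) and number of classes (6) are those of Example 1.

```python
ages = [34, 48, 70, 63, 52, 52, 35, 50, 37, 43, 53, 43, 52, 44,
        42, 31, 36, 48, 43, 26, 58, 62, 49, 34, 48, 53, 39, 45,
        34, 59, 34, 66, 40, 59, 36, 41, 35, 36, 62, 34, 38, 28,
        43, 50, 30, 43, 32, 44, 58, 53]


def left_inclusive_counts(values, start, width, n_classes):
    """Count values per class [lower, lower + width): left endpoint included."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the chosen classes")
        counts[index] += 1
    return counts


counts = left_inclusive_counts(ages, start=25, width=8, n_classes=6)
n = len(ages)
for i, f in enumerate(counts):
    lower = 25 + 8 * i
    print(f"{lower} to < {lower + 8}: {f:>2}  {f / n:.2f}  {100 * f / n:.0f}%")
```

Because the index uses floor division, a value equal to a boundary (for example 33) lands in the class that starts there, which is exactly the left-inclusion rule.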

[Figure: Relative frequency histogram of Ages. Horizontal axis: class boundaries 25, 33, 41, 49, 57, 65, 73. Vertical axis: relative frequency from 0 to 14/50.]

Class      Class Boundaries   Midpoint   Frequency   Relative Frequency   Percent
25 to 33     24.5 – 33.5         29          5         5/50 = .10           10%
34 to 42     33.5 – 42.5         38         16        16/50 = .32           32%
43 to 51     42.5 – 51.5         47         14        14/50 = .28           28%
52 to 60     51.5 – 60.5         56         10        10/50 = .20           20%
61 to 69     60.5 – 69.5         65          4         4/50 = .08            8%
70 to 78     69.5 – 78.5         74          1         1/50 = .02            2%

Describing the Distribution

Shape? Skewed right

Outliers? No.

What proportion of the tenured faculty are younger than 42.5?
(16 + 5)/50 = 21/50 = .42 = 42%

What is the probability that a randomly selected faculty member is 52 or older?
(10 + 4 + 1)/50 = 15/50 = .30
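Both questions can also be answered directly from the 50 raw ages rather than from the class frequencies. The sketch below is one way to check the arithmetic; `proportion` is a hypothetical helper, and exact fractions are used so the result is not affected by rounding.

```python
from fractions import Fraction

ages = [34, 48, 70, 63, 52, 52, 35, 50, 37, 43, 53, 43, 52, 44,
        42, 31, 36, 48, 43, 26, 58, 62, 49, 34, 48, 53, 39, 45,
        34, 59, 34, 66, 40, 59, 36, 41, 35, 36, 62, 34, 38, 28,
        43, 50, 30, 43, 32, 44, 58, 53]


def proportion(values, condition):
    """Exact share of values for which condition(value) is true."""
    hits = sum(1 for v in values if condition(v))
    return Fraction(hits, len(values))


younger = proportion(ages, lambda a: a < 42.5)   # classes 25 to 33 and 34 to 42
older = proportion(ages, lambda a: a >= 52)      # classes 52 to 60 and above
print(younger, float(younger))
print(older, float(older))
```

Counting the raw data gives 21 of 50 faculty younger than 42.5 and 15 of 50 aged 52 or older, matching the sums of the class frequencies 16 + 5 and 10 + 4 + 1.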

How Many Class Intervals?

• Many (narrow class intervals)
– May yield a very jagged distribution with gaps from empty classes
– Can give a poor indication of how frequency varies across classes

• Few (wide class intervals)
– May compress variation too much and yield a blocky distribution
– Can obscure important patterns of variation

[Figures: two frequency histograms of Temperature. One uses a few wide classes (upper endpoints 0, 30, 60, More; frequency axis 0 to 12). The other uses many narrow classes (upper endpoints 4, 8, 12, …, 60, More; frequency axis 0 to 3.5).]

(X axis labels are upper class endpoints)

Example 2

A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature:

24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27

• Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Find range: 58 - 12 = 46

• Select number of classes: 5 (usually between 5 and 12)

• Compute class interval (width): 10 (46/5 then round up)

• Determine class boundaries (limits): 10, 20, 30, 40, 50, 60

• Compute class midpoints: 15, 25, 35, 45, 55

• Count observations & assign to classes
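The steps above can be sketched in Python. Rounding the width up to the next multiple of 10 and starting the first class at the multiple of the width at or below the minimum are assumptions chosen to reproduce the "convenient" values used in this example; other data may call for a different rounding rule.

```python
import math

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

ordered = sorted(temps)
n_classes = 5
data_range = max(temps) - min(temps)          # 58 - 12 = 46
raw_width = data_range / n_classes            # 46 / 5 = 9.2
width = math.ceil(raw_width / 10) * 10        # assumption: round up to a multiple of 10
start = (min(temps) // width) * width         # assumption: first boundary at or below the minimum

boundaries = [start + i * width for i in range(n_classes + 1)]
classes = list(zip(boundaries, boundaries[1:]))
midpoints = [(lo + hi) / 2 for lo, hi in classes]
# Left inclusion: lo <= t < hi.
frequencies = [sum(lo <= t < hi for t in temps) for lo, hi in classes]

for (lo, hi), mid, f in zip(classes, midpoints, frequencies):
    print(f"{lo} <= X < {hi}  midpoint {mid:.0f}  frequency {f}")
```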

Example 2: Solution (Frequency Distribution)

Data in ordered array:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class          Frequency   Relative Frequency   Percentage
10 ≤ X < 20        3             .15                15
20 ≤ X < 30        6             .30                30
30 ≤ X < 40        5             .25                25
40 ≤ X < 50        4             .20                20
50 ≤ X < 60        2             .10                10
Total             20            1.00               100

Histogram: Example 2

Class          Class Midpoint   Frequency
10 ≤ X < 20         15              3
20 ≤ X < 30         25              6
30 ≤ X < 40         35              5
40 ≤ X < 50         45              4
50 ≤ X < 60         55              2

[Figure: Histogram: Daily High Temperature. Frequency (0 to 7) on the vertical axis against class midpoints (5, 15, 25, 35, 45, 55, 65) on the horizontal axis. No gaps between bars.]

Ogive (6)

• An ogive is a curve drawn for the cumulative frequency distribution. Dots are marked above the upper boundaries of the classes, at heights equal to the cumulative frequencies of the respective classes, and joined with straight lines.

Two types of ogive:

(i) ogive less than

(ii) ogive greater than

First, build a table of cumulative frequency.

Cumulative Frequency

Class          Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 ≤ X < 20        3           15                3                     15
20 ≤ X < 30        6           30                9                     45
30 ≤ X < 40        5           25               14                     70
40 ≤ X < 50        4           20               18                     90
50 ≤ X < 60        2           10               20                    100
Total             20          100
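The cumulative columns are running totals of the frequency column. A minimal sketch using the class frequencies from Example 2:

```python
from itertools import accumulate

classes = ["10 <= X < 20", "20 <= X < 30", "30 <= X < 40",
           "40 <= X < 50", "50 <= X < 60"]
frequencies = [3, 6, 5, 4, 2]
total = sum(frequencies)

percentage = [100 * f / total for f in frequencies]
cumulative_frequency = list(accumulate(frequencies))       # running total of counts
cumulative_percentage = [100 * c / total for c in cumulative_frequency]

for row in zip(classes, frequencies, percentage,
               cumulative_frequency, cumulative_percentage):
    print("{:<14}{:>4}{:>6.0f}{:>5}{:>6.0f}".format(*row))
```

The last cumulative frequency must equal the number of observations (20) and the last cumulative percentage must equal 100, which gives a quick check on the table.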

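Graphing Cumulative Frequencies: The Ogive

[Figure: Ogive: Daily High Temperature. Cumulative percentage (0 to 100) on the vertical axis against class boundaries (not midpoints), 10 to 60, on the horizontal axis.]

Class          Lower class boundary   Cumulative Percentage
< 10                    0                      0
10 ≤ X < 20            10                     15
20 ≤ X < 30            20                     45
30 ≤ X < 40            30                     70
40 ≤ X < 50            40                     90
50 ≤ X < 60            50                    100

Each cumulative percentage is the share of observations below the upper boundary of its class, so the ogive passes through 0% at 10, 15% at 20, 45% at 30, 70% at 40, 90% at 50 and 100% at 60.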
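The points to plot for both types of ogive can be sketched as follows. The "greater than" convention used here (percentage of observations at or above each boundary) is an assumption, since the slides name the two types without defining them.

```python
from itertools import accumulate

boundaries = [10, 20, 30, 40, 50, 60]
frequencies = [3, 6, 5, 4, 2]
total = sum(frequencies)

# Running totals with a leading 0: the number of observations below each boundary.
below = list(accumulate([0] + frequencies))

# "Less than" ogive: rises from 0% at the first boundary to 100% at the last.
less_than = [(b, 100 * c / total) for b, c in zip(boundaries, below)]

# "Greater than" ogive (assumed convention): falls from 100% to 0%.
greater_than = [(b, 100 * (total - c) / total) for b, c in zip(boundaries, below)]

print(less_than)
print(greater_than)
```

Joining each list of points with straight line segments gives the corresponding ogive; at every boundary the two percentages add up to 100.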

Interpreting Graphs: Location and Spread

• Where is the data centered on the horizontal axis, and how does it spread out from the center?


Interpreting Graphs: Shapes

Mound shaped and symmetric (mirror images)

Skewed right: a few unusually large measurements

Skewed left: a few unusually small measurements

Bimodal: two local peaks

Interpreting Graphs: Outliers

Are there any strange or unusual measurements that stand out in the data set?

[Figure: two plots, one labeled "Outlier" and one labeled "No Outliers".]

Example

• A quality control process measures the diameter of a gear being made by a machine (cm). The technician records 16 diameters, but inadvertently makes a typing mistake on the second entry.

1.991 1.891 1.991 1.988 1.993 1.989 1.990 1.988
1.988 1.993 1.991 1.989 1.989 1.993 1.990 1.994

The second entry, 1.891, sits well below the other measurements, which all fall between 1.988 and 1.994.
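One informal way to locate the suspicious entry in the gear data is to find the measurement farthest from the median. This is only an illustrative screen, not a formal outlier rule and not a method given in the slides; the variable names are hypothetical.

```python
from statistics import median

diameters = [1.991, 1.891, 1.991, 1.988, 1.993, 1.989, 1.990, 1.988,
             1.988, 1.993, 1.991, 1.989, 1.989, 1.993, 1.990, 1.994]

center = median(diameters)
deviations = [abs(d - center) for d in diameters]

# Index of the measurement that lies farthest from the median.
suspect_index = max(range(len(diameters)), key=deviations.__getitem__)
suspect = diameters[suspect_index]

print(f"median = {center:.3f}")
print(f"entry {suspect_index + 1} = {suspect} deviates by {deviations[suspect_index]:.3f}")
```

The median is used as the center because, unlike the mean, it is barely affected by the single mistyped value.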