M670_0(Data)

Data and Introduction to Probability. Krannert School of Management, Purdue University. Pre-Session

description

Coursework for business analytics in management, Purdue

Transcript of M670_0(Data)

Page 1: M670_0(Data)

Data and Introduction to Probability

Krannert School of Management, Purdue University

Pre-Session

Page 2: M670_0(Data)

Using Data for Decisions

We deal with data every day to make decisions.

• Dow Jones averages for investment decisions
• Sales and inventory data for production planning
• Customer preference surveys for new product introduction

Very large data sets are not uncommon.

• Wal-Mart: over 20M transactions per day in a 10-terabyte database (1 terabyte = 10^12 bytes).
• AT&T: 100M customers and 300M long-distance calls per day.

Page 3: M670_0(Data)

Statistics: The Science of Data

Collecting Data
• Surveys
• Experiments

Presenting Data
• Charts & Tables

Characterizing Data
• Averages
• Variances

Data Analysis

Decision-Making

© 1984-1994 T/Maker Co.

Why?

Page 4: M670_0(Data)

Categorical and Numerical Variables

Categorical variables:
• divide data into two or more categories
• also called qualitative variables
• Examples: gender (male/female), hair color (brown, blonde, brunette, …), economic status (low, medium, high), satisfaction (1-10)

Numerical variables:
• indicate how much or how many using numbers
• also called quantitative variables
• Examples: temperature, age, income

Why distinguish?
• To determine what computation makes sense
• Examples: What is the average hair color? Is 80°F twice as hot as 40°F?

Page 5: M670_0(Data)

Cross-Sectional and Time Series Data

Cross-sectional data are collected at the same or approximately the same point in time.
• Example: data detailing the number of building permits issued in June 2004 in each of the counties of Indiana

Time series data are collected over several time periods.
• Example: data detailing the number of building permits issued in Tippecanoe County, Indiana, in each of the last 36 months

Page 6: M670_0(Data)

[Figure: Industrial Production versus Year. Time series plot for 1995 to 2004 with one line each for Japan, France, Germany, Italy, the UK, the U.S., and Canada; vertical axis from 90 to 125, adjusted so that 1997 = 100. Source: Department of Commerce and Council of Economic Advisors.]

Page 7: M670_0(Data)

Example: Firestone Tires

Reported accidents that involved Firestone tires

DATEA      STATE  INJURIES  TIRE_MODEL     TIRE_SIZE       VEH_MFR  VEH_MODEL
2/28/2000  TX     NA        WILDERNESS AT  NA              FORD     EXPLORER
3/30/2000  TX     3         ATX            P235/75R15      FORD     EXPLORER
3/22/2000  TX     1         ATX            P235/75R15      FORD     EXPLORER
5/22/2000  CA     3         ATX            P235/75R15      FORD     EXPLORER
8/15/2000  LA     0         WILDERNESS     P225/70R14      FORD     RANGER
7/11/2000  FL     0         ATX            P235/75R15      FORD     EXPLORER
7/17/2000  MI     0         WILDERNESS     P235/75R15      FORD     EXPLORER
7/25/2000  FL     1         WILDERNESS     P235/75R15      FORD     EXPLORER
8/2/2000   FL     2         ATX            NA              FORD     EXPLORER
8/1/2000   FL     0         WILDERNESS AT  UKN             FORD     EXPLORER
8/2/2000   FL     (blank)   FIRESTONE      N/A             NISSAN   FRONTIER
8/2/2000   IL     1         FIRESTONE      NA              NISSAN   4X4
8/2/2000   CA/AR  1         ATX            P235/75R15      FORD     EXPLORER
8/2/2000   TX     1         ATX            P235/75R15      FORD     EXPLORER
8/2/2000   CO     4         ATX            30XR9.50R15LT   FORD     BRONCO II

Page 8: M670_0(Data)

Frequency Distribution

Tabular summary of data showing the frequency (or number) of items in each of several nonoverlapping classes

Provides insights about the data that cannot be quickly obtained by looking only at the original data.

Use COUNTIF function in Excel.
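Outside Excel, the same tally can be sketched in a few lines of Python. The example below is only an illustration: it counts the tire models in the 15 accident reports listed in the Firestone example, and the variable names are made up for this sketch.

```python
from collections import Counter

# Tire models from the 15 accident reports listed in the Firestone example.
tire_models = [
    "WILDERNESS AT", "ATX", "ATX", "ATX", "WILDERNESS",
    "ATX", "WILDERNESS", "WILDERNESS", "ATX", "WILDERNESS AT",
    "FIRESTONE", "FIRESTONE", "ATX", "ATX", "ATX",
]

# Counter tallies each distinct label, much like one COUNTIF per category.
frequency = Counter(tire_models)
total = sum(frequency.values())

# Print count and percent frequency for each class, most frequent first.
for model, count in frequency.most_common():
    print(f"{model:<15}{count:>3}{100 * count / total:>7.1f}%")
```

Each report falls into exactly one class, so the counts add up to the number of reports.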

Page 9: M670_0(Data)

Example: Firestone Tires

Frequency Distribution

Tire Model            Count  Percent
ATX                     554     18.7
Firehawk                 38      1.3
Firestone                29      1.0
Firestone ATX           106      3.6
Firestone Wilderness    131      4.4
Radial ATX               48      1.6
Wilderness             1246     42.0
Wilderness AT           709     23.9
Wilderness HT           108      3.6
Total                  2969    100

ATX, Wilderness, and Wilderness AT together account for 84.6% of the reports (18.7 + 42.0 + 23.9).

Page 10: M670_0(Data)

Bar Graph

A graphical device for depicting categorical data that have been summarized in a frequency distribution.

On the horizontal axis we specify the labels that are used for each of the classes.

A frequency, relative frequency, or percent frequency scale can be used for the vertical axis.

Page 11: M670_0(Data)

Example: Firestone Tires

Pie Chart

Page 12: M670_0(Data)

Pareto Chart

A bar graph whose categories are ordered from most frequent to least frequent.

Identifies “vital few” categories that contain most of the observations.
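As a rough sketch of the ordering a Pareto chart uses, the Python below sorts the Firestone tire-model counts from most to least frequent and tracks the cumulative percent. The counts come from the Firestone frequency distribution; the variable names are illustrative only.

```python
# Counts taken from the Firestone frequency distribution in this section.
counts = {
    "ATX": 554, "Firehawk": 38, "Firestone": 29, "Firestone ATX": 106,
    "Firestone Wilderness": 131, "Radial ATX": 48, "Wilderness": 1246,
    "Wilderness AT": 709, "Wilderness HT": 108,
}

# Pareto ordering: most frequent category first.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
total = sum(counts.values())

running = 0
cumulative_percent = []
for model, count in ordered:
    running += count
    cumulative_percent.append(100 * running / total)
    print(f"{model:<22}{count:>6}{cumulative_percent[-1]:>8.1f}%")
```

The first three rows of the output are the "vital few" here: together they cover roughly 85% of the reports.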

Page 13: M670_0(Data)

Example: Unemployment Rates

Unemployment rates for each of the 50 states and Puerto Rico in December 2000

Page 14: M670_0(Data)

Frequency Distribution

Divide the range of data into classes of equal width. The number of classes is usually between 5 and 20.

Count the number of elements in each class.

Use the FREQUENCY function in Excel.

Unemployment in States

Class       Frequency
1.0 to 1.9   2
2.0 to 2.9  13
3.0 to 3.9  21
4.0 to 4.9  10
5.0 to 5.9   3
6.0 to 6.9   1
7.0 to 7.9   0
8.0 to 8.9   1

Page 15: M670_0(Data)

Histogram

A graphical presentation of numerical data.

The variable of interest is placed on the horizontal axis and the frequency is placed on the vertical axis.

Page 16: M670_0(Data)

Cumulative Distribution

Shows the number of items with values less than or equal to the upper limit of each class.

Cumulative percent frequency distribution: unemployment rate in states

Class       Frequency  Cumulative Frequency  Cumulative Percent Frequency
1.0 to 1.9   2          2                      3.9%
2.0 to 2.9  13         15                     29.4%
3.0 to 3.9  21         36                     70.6%
4.0 to 4.9  10         46                     90.2%
5.0 to 5.9   3         49                     96.1%
6.0 to 6.9   1         50                     98.0%
7.0 to 7.9   0         50                     98.0%
8.0 to 8.9   1         51                    100.0%
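The cumulative columns follow mechanically from the class frequencies. A minimal Python sketch, using the unemployment class frequencies from this example (variable names are illustrative):

```python
from itertools import accumulate

classes = [
    "1.0 to 1.9", "2.0 to 2.9", "3.0 to 3.9", "4.0 to 4.9",
    "5.0 to 5.9", "6.0 to 6.9", "7.0 to 7.9", "8.0 to 8.9",
]
frequencies = [2, 13, 21, 10, 3, 1, 0, 1]

# Running total of the class frequencies.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Cumulative percent frequency, rounded to one decimal place.
cumulative_percent = [round(100 * c / total, 1) for c in cumulative]

for label, f, c, p in zip(classes, frequencies, cumulative, cumulative_percent):
    print(f"{label:<12}{f:>4}{c:>5}{p:>8.1f}%")
```

Plotting `cumulative_percent` against the upper class limits would give the points of an ogive.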

Page 17: M670_0(Data)

Ogive

A graph of a cumulative distribution.

The data values are shown on the horizontal axis. Shown on the vertical axis are the:

• cumulative frequencies, or
• cumulative percent frequencies

[Figure: Unemployment Rates. Horizontal axis: classes 1.0 to 1.9 through 8.0 to 8.9. Vertical axes: Frequency (0 to 25) and percent (0.0% to 120.0%).]

Page 18: M670_0(Data)

Time Plot

Plots a variable against the time at which it was measured.

Always put time on the horizontal axis.

Connecting the data points by lines helps emphasize any change over time.

[Figure: Johnson & Johnson EPS. Horizontal axis: Year, 1980 to 1990. Vertical axis: EPS, from -0.5 to 1.5.]

Page 19: M670_0(Data)

Summary of Tabular and Graphical Methods

Categorical Data

Tabular Methods
• Frequency Distribution
• Rel. Freq. Dist.

Graphical Methods
• Bar Graph
• Pie Chart
• Pareto Chart

Numerical Data

Tabular Methods
• Frequency Distribution
• Rel. Freq. Dist.
• Cum. Freq. Dist.
• Cum. Rel. Freq. Distribution
• Stem-and-Leaf Display

Graphical Methods
• Histogram
• Ogive
• Time Plot

Page 20: M670_0(Data)

The Five-Number Summary and Box-and-Whisker Plot

The five-number summary of a distribution consists of:

Minimum, Q1, Median, Q3, Maximum

A box-and-whisker plot is a graph of the five-number summary.
• A central box spans the quartiles.
• A line in the box marks the median.
• Lines extend from the box out to the smallest and largest observations.

Page 21: M670_0(Data)

Example: Apartment Rents

For the following data on 70 apartment rents, compute the five-number summary.

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Five-Number Summary

Lowest Value = 425
First Quartile = 450
Median = 475
Third Quartile = 525
Largest Value = 615

Page 22: M670_0(Data)

Coefficient of Variation

• Measure of relative dispersion
• Always a %
• Shows variation relative to mean
• Used to compare two or more groups

Sample coefficient of variation:

$CV = \left(\frac{s}{\bar{x}}\right)(100)$

Population coefficient of variation:

$CV = \left(\frac{\sigma}{\mu}\right)(100)$

Page 23: M670_0(Data)

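Example: Coefficient of Variation

Data
Group 1: 1 2 3
Group 2: 100 200 300

Coefficients of Variation

Group 1:

$CV = \left(\frac{s}{\bar{x}}\right)(100) = \left(\frac{1}{2}\right)(100) = 50\%$

Group 2:

$CV = \left(\frac{s}{\bar{x}}\right)(100) = \left(\frac{100}{200}\right)(100) = 50\%$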
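The two results can be checked with a short Python sketch. The helper name `coefficient_of_variation` is made up for this illustration; it applies the sample formula (s divided by the mean, times 100) using the standard library.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample coefficient of variation in percent: (s / x-bar) * 100."""
    return stdev(values) / mean(values) * 100

group_1 = [1, 2, 3]
group_2 = [100, 200, 300]

print(coefficient_of_variation(group_1))
print(coefficient_of_variation(group_2))
```

Both groups give the same relative dispersion even though their standard deviations differ by a factor of 100, which is why the measure is useful for comparing groups on different scales.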

Page 24: M670_0(Data)

Using Minitab: Descriptive Statistics Tool

Page 25: M670_0(Data)

Chebyshev’s Theorem

At least $(1 - 1/k^2)$ of the items in any data set will be within k standard deviations of the mean, where k > 1.

Examples:
• At least 75% of the items must be within k = 2 standard deviations of the mean.
• At least 89% of the items must be within k = 3 standard deviations of the mean.
• At least 94% of the items must be within k = 4 standard deviations of the mean.

Page 26: M670_0(Data)

Example: Apartment Rents

Chebyshev’s Theorem

Let k = 2.0 with $\bar{x}$ = 490.80 and s = 54.74.

At least $(1 - 1/2^2)$ = 0.75 or 75% of the rent values must be between

$\bar{x} - ks$ = 490.80 – 2.0(54.74) = 381.32 and $\bar{x} + ks$ = 490.80 + 2.0(54.74) = 600.28
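To see how conservative the bound is, the sketch below recomputes the mean and sample standard deviation from the 70 apartment rents listed earlier and counts how many values fall within k = 2 standard deviations. It is only an illustration; the variable names are not from the original slides.

```python
from statistics import mean, stdev

# The 70 apartment rents from the five-number summary example.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

k = 2.0
x_bar = mean(rents)
s = stdev(rents)  # sample standard deviation (n - 1 in the denominator)
lower, upper = x_bar - k * s, x_bar + k * s

# Chebyshev's guaranteed minimum share versus the share actually observed.
guaranteed = 1 - 1 / k**2
observed = sum(lower <= r <= upper for r in rents) / len(rents)

print(f"interval: {lower:.2f} to {upper:.2f}")
print(f"guaranteed at least {guaranteed:.0%}, observed {observed:.1%}")
```

For this data set the observed share is well above the guaranteed 75%: only the two rents of 615 lie outside the interval.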

Page 27: M670_0(Data)

Examining Relationships

Thus far we have focused on methods that are used to summarize the data for a single variable.

Often a manager is interested in the relationship between two or more variables.

Methods of examining relationships
• Scatter Diagrams
• Correlation and Covariance
• Least-Squares Regression
• Crosstabulations

Page 28: M670_0(Data)

Example: Heating a House

One degree-day is accumulated for each degree a day’s average temperature falls below 65°F; e.g., an average temperature of 20°F corresponds to 45 degree-days.
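The degree-day rule is simple enough to state as a one-line function. In this sketch the function name and the choice to return zero for days at or above 65°F are assumptions read off the definition above, which only accumulates degree-days when the temperature falls below the base.

```python
BASE_TEMPERATURE_F = 65  # base temperature from the definition above

def degree_days(average_temperature_f):
    """Degree-days accumulated on one day with the given average temperature."""
    # Assumption: days at or above the base temperature contribute nothing.
    return max(0, BASE_TEMPERATURE_F - average_temperature_f)

print(degree_days(20))
```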

Page 29: M670_0(Data)

Example: Heating a House

Scatter Diagram

Page 30: M670_0(Data)

Adding Categorical Variables

Natural-gas consumption against degree-days before (red dots) and after (blue dots) installing solar panels

Page 31: M670_0(Data)

Lurking Variables

Lurking variables sit in the background. If they are not measured, they can
• falsely suggest a strong relationship between two variables, or
• hide a relationship that is really there.

Page 32: M670_0(Data)

Example: False Association

Studies show that men who complain of chest pain are more likely to get detailed tests and aggressive treatment than are women with similar complaints. Is this association due to discrimination?

Perhaps not. Women develop heart problems, on average, 10 to 15 years later in life than men do. Aggressive treatments are riskier for older patients, so doctors may hesitate to recommend them.

Lurking variables (the patient’s age and condition) may explain the false association.

Page 33: M670_0(Data)

Example: Hidden Association

Correlation between x and y = 0.08

Page 34: M670_0(Data)

Crosstabulations

Jointly show observations of two categorical variables (or categorized numerical variables).

Often referred to as contingency tables.

Help us to find the relationship between the two variables.

Page 35: M670_0(Data)

Examining Relationships in Crosstabulations

No single graph (such as a scatter plot) or single numerical measure (such as the correlation) summarizes the strength of the association between categorical variables.

Describing relationships
• Marginal distribution: distribution of each variable alone.
• Conditional distribution: distribution of one variable for a given class of the other variable.

Page 36: M670_0(Data)

Example: Marital Status and Job Level

Marginal distribution of the job level

Page 37: M670_0(Data)

Example: Marital Status and Job Level

Conditional distributions of the job level

Page 38: M670_0(Data)

Pivot Table in Excel

Allows us to construct crosstabulations in a variety of ways.

Provides more variety and flexibility than most other statistical software packages.

Page 39: M670_0(Data)

Example: Movie Stars

Amount the stars ask for a movie

Note: Rows 9-67 are not shown.

     A                   B       C                                       D
1    Name                Gender  Amount asking for a movie ($ million)
2    Angela Bassett      F       2.5
3    Jessica Lange       F       2.5
4    Winona Ryder        F       4
5    Michelle Pfeiffer   F       10
6    Whoopi Goldberg     F       10
7    Emma Thompson       F       3
8    Julia Roberts       F       12

Page 40: M670_0(Data)

Example: Movie Stars

PivotTable

     A                                               B       C    D
3    Count of Amount asking for a movie ($ million)  Gender
4    Amount asking for a movie ($ million)           F       M    Grand Total
5    2-5                                             10      9    19
6    5-8                                             1       17   18
7    8-11                                            4       8    12
8    11-14                                           3       3    6
9    14-17                                                   3    3
10   17-20                                                   8    8
11   Grand Total                                     18      48   66
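The marginal and conditional distributions described earlier can be read off this crosstabulation. The Python below is a sketch under one assumption: the blank cells in the F column are treated as zero counts. Variable names are illustrative.

```python
# Counts from the PivotTable: asking-price class -> (female, male).
# Assumption: blank cells in the F column mean zero observations.
table = {
    "2-5": (10, 9), "5-8": (1, 17), "8-11": (4, 8),
    "11-14": (3, 3), "14-17": (0, 3), "17-20": (0, 8),
}
grand_total = sum(f + m for f, m in table.values())

# Marginal distribution of the asking-price class (gender ignored).
marginal = {cls: (f + m) / grand_total for cls, (f, m) in table.items()}

# Conditional distributions of the asking-price class given gender.
female_total = sum(f for f, _ in table.values())
male_total = sum(m for _, m in table.values())
given_female = {cls: f / female_total for cls, (f, _) in table.items()}
given_male = {cls: m / male_total for cls, (_, m) in table.items()}

for cls in table:
    print(f"{cls:<6}{marginal[cls]:>7.1%}{given_female[cls]:>8.1%}{given_male[cls]:>8.1%}")
```

Comparing the two conditional columns row by row is what reveals an association: if the columns were nearly identical, asking price would show little relationship with gender.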

Page 41: M670_0(Data)

Simpson’s Paradox

An association that holds within each of several groups can reverse direction when the data are combined to form a single group.

This reversal is called Simpson’s paradox.

Page 42: M670_0(Data)

Promotion of Managers

Manager promotions in MDP Inc.

Who gets more promotions?

                       MBA Degree  No MBA Degree
Promoted               125         155
Not Promoted           875         845
Total                  1000        1000
Promoted (in percent)  12.5%       15.5%

Page 43: M670_0(Data)

Lurking Variable: Experience

Data broken down by fresh hire/experienced

MBA promotions are higher in each category.

Experience is the lurking variable.

             MBA Degree                          Without MBA Degree
             Promoted  Not Promoted  % Promoted  Promoted  Not Promoted  % Promoted
Fresh        90        810           10.0%       15        285           5.0%
Experienced  35        65            35.0%       140       560           20.0%
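The reversal can be reproduced directly from the counts. The following Python sketch uses the MDP Inc. figures from the tables above; the dictionary layout and the helper `rate` are choices made for this illustration.

```python
# (promoted, not promoted) counts from the MDP Inc. tables.
data = {
    "Fresh": {"MBA": (90, 810), "No MBA": (15, 285)},
    "Experienced": {"MBA": (35, 65), "No MBA": (140, 560)},
}

def rate(promoted, not_promoted):
    """Share of managers promoted."""
    return promoted / (promoted + not_promoted)

# Within each experience group, MBAs are promoted at the higher rate.
for group, by_degree in data.items():
    print(group, {d: f"{rate(*c):.1%}" for d, c in by_degree.items()})

# Combining the groups reverses the comparison.
combined = {}
for degree in ("MBA", "No MBA"):
    promoted = sum(data[g][degree][0] for g in data)
    not_promoted = sum(data[g][degree][1] for g in data)
    combined[degree] = rate(promoted, not_promoted)
print("Combined", {d: f"{r:.1%}" for d, r in combined.items()})
```

The reversal arises because most MBAs are fresh hires (900 of 1000), a group with low promotion rates overall, while most non-MBAs are experienced (700 of 1000).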

Page 44: M670_0(Data)

Example: Flight Arrivals

One month’s data for flights from several western cities for two airlines:

Which airline is on time?

           Alaska Airline  America West
On time    3274            6438
Delayed    501             787
Total      3775            7225
% Delayed  13.3%           10.9%

Page 45: M670_0(Data)

Example: Flight Arrivals

Data broken down by cities

Alaska Airline beats America West in all 5 cities.

City of origin is a lurking variable.

               Alaska Airline             America West
               On time  Delayed  % Del.   On time  Delayed  % Del.
Los Angeles    497      62       11.1%    694      117      14.4%
Phoenix        221      12       5.2%     4840     415      7.9%
San Diego      212      20       8.6%     383      65       14.5%
San Francisco  503      102      16.9%    320      129      28.7%
Seattle        1841     305      14.2%    201      61       23.3%

Page 46: M670_0(Data)

Introduction to Probability

Page 47: M670_0(Data)

Experiment, Outcome, and Sample Space

Experiment: process of obtaining an observation, outcome, or information.
• Example: roll a die

Outcome: result of an experiment.
• Example: number that appears on top

Sample space: collection (or set) of all possible outcomes, defined by the experimenter.
• Example: {1, 2, 3, 4, 5, 6}

Event: any collection (or set) of outcomes.
• Example: get an odd number

Page 48: M670_0(Data)

Examples of Experiments and Their Sample Spaces

Experiment                                                              Sample Space
Toss a coin, and note a face.                                           {Head, Tail}
Inspect a part, and note quality.                                       {Defective, Good}
Toss a coin twice, and note faces.                                      {HH, HT, TH, TT}
Purchase a soft drink twice, and note brand, C (Coke), O (all others).  {CC, CO, OC, OO}

Page 49: M670_0(Data)

Probability of an Event

Numerical measure of the likelihood that an event will occur.

Lies between 0 (impossible) and 1 (certain).

A probability of 0.5 indicates the occurrence of the event is just as likely as it is unlikely.

P(Sample Space) = 1

Page 50: M670_0(Data)

Event Properties

Mutually exclusive events: the events cannot occur at the same time.

Example: Suppose we want to assess the probability that the Federal Reserve Bank will change the prime rate this month. Define the following events:

A: the prime rate is raised
B: the prime rate is lowered
C: the prime rate is unchanged

Events A, B, and C are mutually exclusive.

[Figure: Venn diagram of events A, B, and C.]

Page 51: M670_0(Data)

Collectively exhaustive
• One out of a set of events must occur
• Constitute the sample space

Example: In the prime rate example, events A, B, and C are collectively exhaustive.

[Figure: Venn diagram of events A, B, and C.]

Example: Consider two tosses of a fair coin:
A: at least one heads
B: at least one tails

Are A and B collectively exhaustive? Yes.
Are they mutually exclusive? No. Consider HT or TH.
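Both properties are easy to check by listing the sample space. A small Python sketch for the two-coin-toss example, using sets (the variable names are illustrative):

```python
from itertools import product

# Sample space for two coin tosses: HH, HT, TH, TT.
sample_space = {"".join(tosses) for tosses in product("HT", repeat=2)}

A = {outcome for outcome in sample_space if "H" in outcome}  # at least one heads
B = {outcome for outcome in sample_space if "T" in outcome}  # at least one tails

# Collectively exhaustive: together the events cover the whole sample space.
collectively_exhaustive = (A | B) == sample_space

# Mutually exclusive: the events share no outcome.
mutually_exclusive = not (A & B)

print(collectively_exhaustive, mutually_exclusive, sorted(A & B))
```

The shared outcomes printed at the end are exactly the HT and TH cases that rule out mutual exclusivity.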

Page 52: M670_0(Data)

Complement of an Event

The complement of event A is the event consisting of all outcomes that are not in A.

The complement of A is denoted by $A^c$.

The Venn diagram below illustrates the concept of a complement.

[Figure: Venn diagram of the sample space split into Event A and its complement $A^c$.]

Page 53: M670_0(Data)

Union of Two Events

The union of events A and B is the event containing all sample points that are in A or B.

The union is denoted by $A \cup B$.

The union of A and B is illustrated below.

[Figure: Venn diagram of Event A and Event B within the sample space.]

Page 54: M670_0(Data)

Intersection of Two Events

The intersection of events A and B is the set of all sample points that are in both A and B.

The intersection is denoted by $A \cap B$.

The intersection of A and B is illustrated below.

[Figure: Venn diagram of Event A and Event B within the sample space, with their overlap labeled Intersection.]

Page 55: M670_0(Data)

Addition Law: Relating Probabilities of Unions and Intersections

Provides a way to compute the probability of the union of events. The law is written as:

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

When the events are mutually exclusive:

$P(A \cup B) = P(A) + P(B)$

In rolling a die, what is the probability of getting an odd number or a number less than 5?

• Odd: {1,3,5}
• < 5: {1,2,3,4}
• Odd and < 5: {1,3}
• Odd or < 5: {1,2,3,4,5}

P(Odd or < 5) = 0.5 + 2/3 - 1/3 = 5/6

[Figure: Venn diagram of Event A and Event B within the sample space.]
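The die example can be verified by counting outcomes. This Python sketch assumes a fair die, so every face is equally likely; the helper `prob` is defined only for this illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
odd = {1, 3, 5}
less_than_5 = {1, 2, 3, 4}

def prob(event):
    """Probability of an event, assuming equally likely outcomes (fair die)."""
    return Fraction(len(event), len(sample_space))

# Addition law: P(A or B) = P(A) + P(B) - P(A and B).
via_addition_law = prob(odd) + prob(less_than_5) - prob(odd & less_than_5)

# Direct count of the union for comparison.
direct = prob(odd | less_than_5)

print(via_addition_law, direct)
```

Subtracting the intersection matters here because the outcomes 1 and 3 belong to both events and would otherwise be counted twice.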