M670_0(Data)

Post on 11-Feb-2016


description

Course work for Business analytics in management, Purdue


Data and Introduction to Probability

Krannert School of Management, Purdue University

Pre-Session

2

Using Data for Decisions

We deal with data every day to make decisions.
• Dow Jones averages for investment decisions
• Sales and inventory data for production planning
• Customer preference survey for new product introduction

Very large data sets are not uncommon.
• Wal-Mart: over 20M transactions per day in a 10-terabyte database (1 terabyte = $10^{12}$ bytes).
• AT&T: 100M customers and 300M long-distance calls per day.

3

Statistics: The Science of Data

Collecting Data
• Surveys
• Experiments

Presenting Data
• Charts & Tables

Characterizing Data
• Averages
• Variances

Data Analysis

Decision-Making

© 1984-1994 T/Maker Co.

Why?


4

Categorical and Numerical Variables

Categorical variables:
• divide data into two or more categories
• also called qualitative variables

Examples:
• gender (male/female)
• hair color (brown, blonde, brunette, …)
• economic status (low, medium, high)
• satisfaction (1-10)

Numerical variables:
• indicate how much or how many using numbers
• also called quantitative variables

Examples:
• temperature
• age
• income

Why distinguish?
• To determine what computation makes sense

Examples:
• What is average hair color?
• Is 80F twice as hot as 40F?

5

Cross-Sectional and Time Series Data

Cross-sectional data are collected at the same or approximately the same point in time.
• Example: data detailing the number of building permits issued in June 2004 in each of the counties of Indiana

Time series data are collected over several time periods.
• Example: data detailing the number of building permits issued in Tippecanoe County, Indiana in each of the last 36 months

6

[Figure: Industrial Production versus Year. Industrial production (vertical axis, 90 to 125) plotted by year (1995 to 2004) for Japan, France, Germany, Italy, UK, U.S., and Canada. Source: Department of Commerce and Council of Economic Advisors. Adjusted so that 1997 = 100.]

7

Example: Firestone Tires

Reported accidents that involved Firestone tires

DATEA      STATE  INJURIES  TIRE_MODEL     TIRE_SIZE      VEH_MFR  VEH_MODEL
2/28/2000  TX     NA        WILDERNESS AT  NA             FORD     EXPLORER
3/30/2000  TX     3         ATX            P235/75R15     FORD     EXPLORER
3/22/2000  TX     1         ATX            P235/75R15     FORD     EXPLORER
5/22/2000  CA     3         ATX            P235/75R15     FORD     EXPLORER
8/15/2000  LA     0         WILDERNESS     P225/70R14     FORD     RANGER
7/11/2000  FL     0         ATX            P235/75R15     FORD     EXPLORER
7/17/2000  MI     0         WILDERNESS     P235/75R15     FORD     EXPLORER
7/25/2000  FL     1         WILDERNESS     P235/75R15     FORD     EXPLORER
8/2/2000   FL     2         ATX            NA             FORD     EXPLORER
8/1/2000   FL     0         WILDERNESS AT  UKN            FORD     EXPLORER
8/2/2000   FL               FIRESTONE      N/A            NISSAN   FRONTIER
8/2/2000   IL     1         FIRESTONE      NA             NISSAN   4X4
8/2/2000   CA/AR  1         ATX            P235/75R15     FORD     EXPLORER
8/2/2000   TX     1         ATX            P235/75R15     FORD     EXPLORER
8/2/2000   CO     4         ATX            30XR9.50R15LT  FORD     BRONCO II

8

Frequency Distribution

Tabular summary of data showing the frequency (or number) of items in each of several nonoverlapping classes

Provides insights about the data that cannot be quickly obtained by looking only at the original data.

Use COUNTIF function in Excel.
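As a rough Python counterpart to COUNTIF, the sketch below tallies the tire models from the 15 accident records listed on the earlier Firestone slide. The function name `frequency_distribution` is a hypothetical helper introduced only for illustration; the course itself uses Excel.

```python
from collections import Counter

# Tire models from the 15 accident records on the earlier Firestone slide.
tire_models = [
    "WILDERNESS AT", "ATX", "ATX", "ATX", "WILDERNESS",
    "ATX", "WILDERNESS", "WILDERNESS", "ATX", "WILDERNESS AT",
    "FIRESTONE", "FIRESTONE", "ATX", "ATX", "ATX",
]


def frequency_distribution(values):
    """Return {category: (count, percent)} ordered by category name."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (count, round(100 * count / total, 1))
        for category, count in sorted(counts.items())
    }


# Print one row per nonoverlapping class: label, count, percent.
for model, (count, percent) in frequency_distribution(tire_models).items():
    print(f"{model:<15}{count:>4}{percent:>7.1f}")
```

Each record falls into exactly one class, so the counts add up to the number of records.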

9

Example: Firestone Tires

Frequency Distribution

Tire Model             Count  Percent
ATX                      554     18.7
Firehawk                  38      1.3
Firestone                 29      1.0
Firestone ATX            106      3.6
Firestone Wilderness     131      4.4
Radial ATX                48      1.6
Wilderness              1246     42.0
Wilderness AT            709     23.9
Wilderness HT            108      3.6
Total                   2969    100

Wilderness, Wilderness AT, and ATX together account for 84.6% of the reported accidents.

10

Bar Graph

A graphical device for depicting categorical data that have been summarized in a frequency distribution.

On the horizontal axis we specify the labels that are used for each of the classes.

A frequency, relative frequency, or percent frequency scale can be used for the vertical axis.

11

Example: Firestone Tires

Pie Chart

12

Pareto Chart

A bar graph whose categories are ordered from most frequent to least frequent.

Identifies “vital few” categories that contain most of the observations.
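A minimal sketch of the ordering behind a Pareto chart, using the counts from the Firestone tire-model frequency distribution. The helper name `pareto_table` is an assumption made for this example, and the cumulative-percent column is added here to make the "vital few" visible; it is not stated on the slide.

```python
# Counts from the Firestone tire-model frequency distribution.
counts = {
    "ATX": 554, "Firehawk": 38, "Firestone": 29, "Firestone ATX": 106,
    "Firestone Wilderness": 131, "Radial ATX": 48, "Wilderness": 1246,
    "Wilderness AT": 709, "Wilderness HT": 108,
}


def pareto_table(counts):
    """Sort categories from most to least frequent and add cumulative percent."""
    total = sum(counts.values())
    rows, running = [], 0
    for category, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((category, count, round(100 * running / total, 1)))
    return rows


for category, count, cum_pct in pareto_table(counts):
    print(f"{category:<22}{count:>6}{cum_pct:>8.1f}%")
```

The first three rows (Wilderness, Wilderness AT, ATX) are the "vital few" here. Computed from raw counts their cumulative share is about 84.5%; adding the rounded percentages from the table gives 84.6%.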

13

Example: Unemployment Rates

Unemployment rates for each of the 50 states and Puerto Rico in December 2000

14

Frequency Distribution

Divide the range of data into classes of equal width. The number of classes is usually between 5 and 20.

Count the number of elements in each class.

Use FREQUENCY function in Excel.

Unemployment in States

Class        Frequency
1.0 to 1.9        2
2.0 to 2.9       13
3.0 to 3.9       21
4.0 to 4.9       10
5.0 to 5.9        3
6.0 to 6.9        1
7.0 to 7.9        0
8.0 to 8.9        1

15

Histogram

A graphical presentation of numerical data.

The variable of interest is placed on the horizontal axis and the frequency is placed on the vertical axis.

16

Cumulative Distribution

Shows the number of items with values less than or equal to the upper limit of each class.

Cumulative percent frequency distribution

Unemployment rate in states

Class        Frequency  Cumulative Frequency  Cumulative Percent Frequency
1.0 to 1.9        2              2                       3.9%
2.0 to 2.9       13             15                      29.4%
3.0 to 3.9       21             36                      70.6%
4.0 to 4.9       10             46                      90.2%
5.0 to 5.9        3             49                      96.1%
6.0 to 6.9        1             50                      98.0%
7.0 to 7.9        0             50                      98.0%
8.0 to 8.9        1             51                     100.0%
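The cumulative columns can be derived from the class frequencies with a running sum. The sketch below is one way to do it in Python; the function name `cumulative_distribution` is hypothetical, and the frequencies are the ones in the unemployment table.

```python
from itertools import accumulate

# Class frequencies from the state unemployment-rate table (51 observations).
classes = ["1.0 to 1.9", "2.0 to 2.9", "3.0 to 3.9", "4.0 to 4.9",
           "5.0 to 5.9", "6.0 to 6.9", "7.0 to 7.9", "8.0 to 8.9"]
frequencies = [2, 13, 21, 10, 3, 1, 0, 1]


def cumulative_distribution(frequencies):
    """Return (cumulative frequencies, cumulative percent frequencies)."""
    cumulative = list(accumulate(frequencies))
    total = cumulative[-1]
    percents = [round(100 * c / total, 1) for c in cumulative]
    return cumulative, percents


cumulative, percents = cumulative_distribution(frequencies)
for label, f, c, p in zip(classes, frequencies, cumulative, percents):
    print(f"{label:<12}{f:>4}{c:>5}{p:>8.1f}%")
```

Plotting the last column against the upper class limits would give the ogive described on the next slide.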

17

Ogive

A graph of a cumulative distribution.

The data values are shown on the horizontal axis.

Shown on the vertical axis are the:
• cumulative frequencies, or
• cumulative percent frequencies

[Figure: Unemployment Rates. Horizontal axis: classes 1.0 to 1.9 through 8.0 to 8.9. Vertical axes: Frequency (0 to 25) and cumulative percent frequency (0.0% to 120.0%).]

18

Time Plot

Plots a variable against time at which it was measured.

Always put time on the horizontal axis.

Connecting the data points by lines helps emphasize any change over time.

[Figure: Johnson & Johnson EPS. Time plot of EPS (vertical axis, -0.5 to 1.5) by year, 1980 to 1990.]

19

Summary of Tabular and Graphical Methods

Categorical Data

Tabular Methods
• Frequency Distribution
• Rel. Freq. Dist.

Graphical Methods
• Bar Graph
• Pie Chart
• Pareto Chart

Numerical Data

Tabular Methods
• Frequency Distribution
• Rel. Freq. Dist.
• Cum. Freq. Dist.
• Cum. Rel. Freq. Distribution
• Stem-and-Leaf Display

Graphical Methods
• Histogram
• Ogive
• Time Plot

20

The Five-Number Summary and Box-and-Whisker Plot

The five-number summary of a distribution consists of:

Minimum Q1 Median Q3 Maximum

A box-and-whisker plot is a graph of the five-number summary.
• A central box spans the quartiles.
• A line in the box marks the median.
• Lines extend from the box out to the smallest and largest observations.

21

Example: Apartment Rents

For the following data on 70 apartment rents, compute the five-number summary.

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Five-Number Summary
Lowest Value = 425
First Quartile = 450
Median = 475
Third Quartile = 525
Largest Value = 615

22

Coefficient of Variation

• Measure of relative dispersion
• Always a %
• Shows variation relative to mean
• Used to compare two or more groups

Sample coefficient of variation:

$CV = \frac{s}{\bar{x}}(100)\%$

Population coefficient of variation:

$CV = \frac{\sigma}{\mu}(100)\%$

23

Example: Coefficient of Variation

Data
Group 1: 1 2 3
Group 2: 100 200 300

Coefficients of Variation

Group 1: $CV = \frac{s}{\bar{x}}(100)\% = \frac{1}{2}(100)\% = 50\%$

Group 2: $CV = \frac{s}{\bar{x}}(100)\% = \frac{100}{200}(100)\% = 50\%$
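The two groups can be checked with Python's standard `statistics` module. This is only a sketch: `coefficient_of_variation` is a hypothetical helper name, and it uses the sample standard deviation (divisor n - 1), matching the sample formula above.

```python
from statistics import mean, stdev


def coefficient_of_variation(sample):
    """Sample coefficient of variation in percent: (s / x-bar) * 100."""
    return stdev(sample) / mean(sample) * 100


group_1 = [1, 2, 3]
group_2 = [100, 200, 300]

print(coefficient_of_variation(group_1))
print(coefficient_of_variation(group_2))
```

Both groups have the same relative dispersion even though the standard deviations differ by a factor of 100, which is why the CV is useful for comparing groups measured on different scales.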

24

Using Minitab Descriptive Statistics Tool

25

Chebyshev’s Theorem

At least $(1 - 1/k^2)$ of the items in any data set will be within $k$ standard deviations of the mean, where $k > 1$.

Examples:
• At least 75% of the items must be within k = 2 standard deviations of the mean.
• At least 89% of the items must be within k = 3 standard deviations of the mean.
• At least 94% of the items must be within k = 4 standard deviations of the mean.

26

Example: Apartment Rents

Chebyshev’s Theorem

Let $k = 2.0$ with $\bar{x} = 490.80$ and $s = 54.74$.

At least $(1 - 1/2^2) = 0.75$, or 75%, of the rent values must be between

$\bar{x} - ks = 490.80 - 2.0(54.74) = 381.32$ and $\bar{x} + ks = 490.80 + 2.0(54.74) = 600.28$
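As an illustration, the sketch below applies Chebyshev's bound to the 70 apartment rents from the earlier slide and counts how many rents actually fall inside the interval. It takes the mean and standard deviation quoted on this slide as given rather than recomputing them, and `chebyshev_check` is a hypothetical helper name.

```python
# The 70 apartment rents from the five-number-summary example.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]
X_BAR, S = 490.80, 54.74  # sample mean and standard deviation quoted on the slide


def chebyshev_check(data, mean, sd, k):
    """Return (guaranteed share, lower limit, upper limit, observed share)."""
    bound = 1 - 1 / k**2
    lower, upper = mean - k * sd, mean + k * sd
    within = sum(lower <= x <= upper for x in data)
    return bound, lower, upper, within / len(data)


bound, lower, upper, observed = chebyshev_check(rents, X_BAR, S, k=2.0)
print(f"at least {bound:.0%} within [{lower:.2f}, {upper:.2f}]; observed {observed:.1%}")
```

Only the two rents of 615 fall outside the interval, so the observed share is well above the guaranteed 75%. Chebyshev's theorem gives a lower bound that holds for any data set, so it is often conservative.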

27

Examining Relationships

Thus far we have focused on methods that are used to summarize the data for a single variable.

Often a manager is interested in the relationship between two or more variables.

Methods of examining relationships
• Scatter Diagrams
• Correlation and Covariance
• Least-Squares Regression
• Crosstabulations

28

Example: Heating a House

One degree-day is accumulated for each degree a day's average temperature falls below 65°F; e.g., an average temperature of 20°F corresponds to 45 degree-days.
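The degree-day rule can be written as a one-line function. This is a sketch under one assumption not spelled out on the slide: a day at or above 65°F accumulates zero degree-days rather than a negative number. The name `degree_days` is hypothetical.

```python
BASE_TEMP_F = 65  # base temperature given on the slide


def degree_days(avg_temp_f):
    """Degree-days accumulated in one day; assumed zero at or above 65°F."""
    return max(0, BASE_TEMP_F - avg_temp_f)


print(degree_days(20))  # the slide's example day
print(degree_days(70))  # a warm day
```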

29

Example: Heating a House

Scatter Diagram

30

Adding Categorical Variables

Natural-gas consumption against degree-days before (red dots) and after (blue dots) installing solar panels

31

Lurking Variables

Lurking variables stay in the background. If not measured, they can
• falsely suggest a strong relationship between two variables, or
• hide a relationship that is really there.

32

Example: False Association

Studies show that men who complain of chest pain are more likely to get detailed tests and aggressive treatment than are women with similar complaints.

Is this association due to discrimination?

Perhaps not. On average, women develop heart problems 10 to 15 years later in life than men do. Aggressive treatments are riskier for older patients, so doctors may hesitate to recommend them.

Lurking variables (the patient's age and condition) may explain the false association.

33

Example: Hidden Association

Correlation between x and y = 0.08

34

Crosstabulations

Jointly show observations of two categorical variables (or categorized numerical variables).

Often referred to as contingency tables.

Help us to find the relationship between the two variables.

35

Examining Relationship in Crosstabulations

No single graph (such as a scatter plot) or single numerical measure (such as the correlation) summarizes the strength of the association between categorical variables.

Describing relationships
• Marginal Distribution: Distribution of each variable alone.
• Conditional Distribution: Distribution of one variable for a given class of the other variable.

36

Example: Marital Status and Job Level

Marginal distribution of the job level

37

Example: Marital Status and Job Level

Conditional distributions of the job level

38

Pivot Table in Excel

Allows us to construct crosstabulations in a variety of ways.

Provides more variety and flexibility than most other statistical software packages.

39

Example: Movie Stars

Amount the stars ask for a movie

Note: Rows 9-67 are not shown.

     A                  B       C                                      D
1    Name               Gender  Amount asking for a movie ($ million)
2    Angela Bassett     F       2.5
3    Jessica Lange      F       2.5
4    Winona Ryder       F       4
5    Michelle Pfeiffer  F       10
6    Whoopi Goldberg    F       10
7    Emma Thompson      F       3
8    Julia Roberts      F       12

40

Example: Movie Stars

PivotTable

     A                                               B       C    D
1
2
3    Count of Amount asking for a movie ($ million)  Gender
4    Amount asking for a movie ($ million)           F       M    Grand Total
5    2-5                                             10      9    19
6    5-8                                             1       17   18
7    8-11                                            4       8    12
8    11-14                                           3       3    6
9    14-17                                                   3    3
10   17-20                                                   8    8
11   Grand Total                                     18      48   66
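The PivotTable counts can be turned into the marginal and conditional distributions described on the crosstabulation slides. The sketch below is one way to do that in Python; the function names are hypothetical, and blank PivotTable cells are treated as zero counts, which is an assumption.

```python
# Counts from the PivotTable: asking-price class ($ million) by gender.
# Blank cells in the PivotTable are assumed to be zero.
pivot = {
    "2-5":   {"F": 10, "M": 9},
    "5-8":   {"F": 1,  "M": 17},
    "8-11":  {"F": 4,  "M": 8},
    "11-14": {"F": 3,  "M": 3},
    "14-17": {"F": 0,  "M": 3},
    "17-20": {"F": 0,  "M": 8},
}


def marginal_distribution(table):
    """Percent of all observations in each price class, ignoring gender."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {cls: round(100 * sum(row.values()) / grand_total, 1)
            for cls, row in table.items()}


def conditional_distribution(table, column):
    """Percent of one gender's observations that fall in each price class."""
    column_total = sum(row[column] for row in table.values())
    return {cls: round(100 * row[column] / column_total, 1)
            for cls, row in table.items()}


print("Marginal:", marginal_distribution(pivot))
print("Given F: ", conditional_distribution(pivot, "F"))
print("Given M: ", conditional_distribution(pivot, "M"))
```

Comparing the two conditional distributions shows the relationship: more than half of the actresses are in the lowest class, while the two highest classes contain only actors.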

41

Simpson’s Paradox

An association that holds for all of several groups can reverse direction when the data are combined to form a single group.

This reversal is called Simpson’s paradox.

42

Promotion of Managers

Manager promotions in MDP Inc.

Who gets more promotions?

                       MBA Degree  No MBA Degree
Promoted                   125          155
Not Promoted               875          845
Total                     1000         1000
Promoted (in percent)     12.5%        15.5%

43

Lurking Variable: Experience

Data broken down by fresh hire/experienced

MBA promotions are higher in each category.

Experience is the lurking variable.

             MBA Degree                          Without MBA Degree
             Promoted  Not Promoted  % Promoted  Promoted  Not Promoted  % Promoted
Fresh           90         810         10.0%        15         285          5.0%
Experienced     35          65         35.0%       140         560         20.0%
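The reversal can be verified directly from the counts. The sketch below recomputes the promotion rates within each experience group and for the combined data; the dictionary layout and function names are assumptions made for this illustration.

```python
# (promoted, not promoted) counts from the MDP Inc. tables.
promotions = {
    "Fresh":       {"MBA": (90, 810), "No MBA": (15, 285)},
    "Experienced": {"MBA": (35, 65),  "No MBA": (140, 560)},
}


def promotion_rate(promoted, not_promoted):
    """Percent promoted within one group."""
    return 100 * promoted / (promoted + not_promoted)


def combined_rate(table, degree):
    """Percent promoted for one degree status after pooling all groups."""
    promoted = sum(group[degree][0] for group in table.values())
    not_promoted = sum(group[degree][1] for group in table.values())
    return promotion_rate(promoted, not_promoted)


for group, by_degree in promotions.items():
    mba = promotion_rate(*by_degree["MBA"])
    no_mba = promotion_rate(*by_degree["No MBA"])
    print(f"{group:<12} MBA {mba:.1f}%  No MBA {no_mba:.1f}%")

print(f"{'Combined':<12} MBA {combined_rate(promotions, 'MBA'):.1f}%  "
      f"No MBA {combined_rate(promotions, 'No MBA'):.1f}%")
```

The pooled rates are weighted averages. Most MBA holders are fresh hires (900 of 1000), where promotion rates are low, while most managers without an MBA are experienced (700 of 1000), so the weights pull the combined comparison in the opposite direction.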

44

Example: Flight Arrivals

One month’s data for flights from several western cities for two airlines:

Which airline is on time?

            Alaska Airline  America West
On time          3274           6438
Delayed           501            787
Total            3775           7225
% Delayed        13.3%          10.9%

45

Example: Flight Arrivals

Data broken down by cities

Alaska Airline beats America West in all 5 cities.

City of origin is a lurking variable.

               Alaska Airline             America West
               On time  Delayed  % Del.   On time  Delayed  % Del.
Los Angeles       497       62   11.1%       694      117   14.4%
Phoenix           221       12    5.2%      4840      415    7.9%
San Diego         212       20    8.6%       383       65   14.5%
San Francisco     503      102   16.9%       320      129   28.7%
Seattle          1841      305   14.2%       201       61   23.3%

Introduction to Probability

47

Experiment, Outcome, and Sample Space

Experiment: process of obtaining an observation, outcome, or information.
• Example: roll a die

Outcome: result of an experiment
• Example: number that appears on top

Sample space: collection (or set) of all possible outcomes, defined by experimenter
• Example: {1, 2, 3, 4, 5, 6}

Event: any collection (or set) of outcomes
• Example: get an odd number

48

Examples of Experiments and Their Sample Spaces

Experiment                                                              Sample Space
Toss a coin, and note a face.                                           {Head, Tail}
Inspect a part, and note quality.                                       {Defective, Good}
Toss a coin twice, and note faces.                                      {HH, HT, TH, TT}
Purchase a soft drink twice, and note brand, C (Coke), O (all others).  {CC, CO, OC, OO}

49

Probability of An Event

Numerical measure of the likelihood that an event will occur.

Lies between 0 (impossible) and 1 (certain).

A probability of 0.5 indicates the occurrence of the event is just as likely as it is unlikely.

P(Sample Space) = 1

50

Event Properties

Mutually exclusive events: the events cannot occur at the same time.

Example: Suppose we want to assess the probability that the Federal Reserve Bank will change the prime rate this month. Define the following events:

A: the prime rate is raised
B: the prime rate is lowered
C: the prime rate is unchanged

Events A, B, and C are mutually exclusive.

51

Collectively exhaustive
• One out of a set of events must occur
• Constitute the sample space

Example: In the prime rate example, Events A, B, and C are collectively exhaustive.

Example: Consider two tosses of a fair coin:
A: at least one heads
B: at least one tails

Are A and B collectively exhaustive? Yes.
Are they mutually exclusive? No. Consider HT or TH.
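Both properties can be checked mechanically by treating events as sets of outcomes. The sketch below enumerates the two-toss sample space and tests the coin example; the function names `mutually_exclusive` and `collectively_exhaustive` are hypothetical helpers.

```python
from itertools import product

# Sample space for two coin tosses: HH, HT, TH, TT.
sample_space = {"".join(tosses) for tosses in product("HT", repeat=2)}
A = {outcome for outcome in sample_space if "H" in outcome}  # at least one heads
B = {outcome for outcome in sample_space if "T" in outcome}  # at least one tails


def mutually_exclusive(*events):
    """True when no outcome belongs to more than one event."""
    seen = set()
    for event in events:
        if seen & event:
            return False
        seen |= event
    return True


def collectively_exhaustive(space, *events):
    """True when the events together cover the whole sample space."""
    return set().union(*events) == space


print(collectively_exhaustive(sample_space, A, B))
print(mutually_exclusive(A, B))
print(sorted(A & B))  # the outcomes that belong to both events
```

The outcomes shared by A and B are exactly HT and TH, which is why the two events are not mutually exclusive.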

52

Complement of an Event

The complement of event A is the event consisting of all outcomes that are not in A.

The complement of A is denoted by $A^c$.

The Venn diagram below illustrates the concept of a complement.

[Venn diagram: Event A and its complement $A^c$ within the sample space]

53

Union of Two Events

The union of events A and B is the event containing all sample points that are in A or B or both.

The union is denoted by $A \cup B$.

The union of A and B is illustrated below.

[Venn diagram: Event A and Event B within the sample space]

54

Intersection of Two Events

The intersection of events A and B is the set of all sample points that are in both A and B.

The intersection is denoted by $A \cap B$.

The intersection of A and B is illustrated below.

[Venn diagram: Event A and Event B within the sample space, with the overlap labeled Intersection]

55

Addition Law: Relating Probabilities of Unions and Intersections

Provides a way to compute the probability of the union of events. The law is written as:

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

When the events are mutually exclusive:

$P(A \cup B) = P(A) + P(B)$

In rolling a die, what is the probability of getting an odd number or a number less than 5?

Odd: {1,3,5}
< 5: {1,2,3,4}
Odd and < 5: {1,3}
Odd or < 5: {1,2,3,4,5}

0.5 + 2/3 - 1/3 = 5/6

[Venn diagram: Event A and Event B within the sample space]
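The die example can be confirmed by computing the union probability two ways: with the addition law and by counting outcomes directly. This is a sketch that assumes a fair six-sided die (all outcomes equally likely); `prob` is a hypothetical helper.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
odd = {1, 3, 5}
less_than_5 = {1, 2, 3, 4}


def prob(event):
    """Probability of an event, assuming all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


# Addition law: P(A or B) = P(A) + P(B) - P(A and B)
by_addition_law = prob(odd) + prob(less_than_5) - prob(odd & less_than_5)
# Direct count of the union {1, 2, 3, 4, 5}
by_counting = prob(odd | less_than_5)

print(by_addition_law, by_counting)
```

Subtracting the intersection corrects for the outcomes 1 and 3, which would otherwise be counted once in each event.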