Management 600: Practical Research Methods for...

MGMT 600 Notes 4 1

Management 600: Practical Research Methods for Managers
Chapters 12 and 13

These notes deal with the matter of actually doing data analysis. Some of the analytical steps will be shown in Excel, although numerous other packages are often used for data analysis, including Stata, SPSS and SAS. We choose to demonstrate these things in Excel, despite its limitations, because it is so readily available.

Chapter 12: Planning for Data Analysis

Selecting Analytical Software

There are numerous statistical analysis packages available. The amount of analytical power, in terms of what the package can do and how much data it can handle, is roughly correlated with the amount of training and experience necessary to become reasonably proficient with the package. Here is a rundown of a few and the names of some others.

1. Microsoft Excel

This is the most ubiquitous and probably the easiest program to use. It is also the most limited and, for lack of a better term, the clunkiest. Excel is a spreadsheet program rather than a dedicated statistical analysis package, so it is set up differently from the other programs mentioned here. Excel can calculate means and variances, can generate a wide variety of charts and tables and can even do t-tests and linear regression analysis. Excel’s abilities are greatly enhanced by the installation of the Analysis ToolPak, which comes standard with the program but, oddly, is not always installed.

While Excel will do a lot for you, some of its procedures can be a bit cumbersome and can require that the data be organized in a particular fashion, and there are some things that Excel cannot do. One last drawback of Excel is that it is limited to about 64,000 observations and about 230 variables. This may not seem like a tremendous constraint, but data can get pretty expansive and, as long as the quality of the data is good, bigger is better.

2. SPSS

This dedicated statistical package is widely used in academia and to some extent in business and can do just about anything. It takes a little bit of training to learn, but most of what it does can be accomplished through very straightforward dropdown menus. This is the software package used in the undergraduate regression analysis course here at Metropolitan State. There is no limit on the number of variables or observations, although there is a student version available that is limited to 50 variables and 5,000 observations.

3. Stata


This dedicated statistical package is very powerful, though less popular than SPSS and, not coincidentally, requires more training to use. Stata tends to be more command-driven than SPSS, but this makes it easy to execute a large group of procedures very quickly.

4. SAS

The granddaddy of statistical analysis packages, this is very widely used in business and in academia and is probably the most powerful tool for data management and statistical analysis. While it is more challenging to learn, even rudimentary knowledge of SAS can make a resume stand out in a crowd.

There are also a variety of other dedicated statistical packages about which we have less knowledge. These include Shazam, EViews and Minitab.

The Preanalytical Process

This is the process by which you get your data ready for analysis. This might mean making sure that the primary data you’re collecting is going to be good food for your analysis, or it might mean looking through some secondary data you’ve got to assure yourself of and improve its quality.

1. Data Editing

If you’re doing your own collection of data, the preanalytical process begins when you start collecting data. You should be sure that collection and measurement are consistent across observations and that what you think you’re measuring is what you’re really measuring.

Once data is collected, the editing process (often also called data cleaning) means several things. First, it means dropping observations that are missing important variables. Which observations these are may vary depending on which variables from the data set you’re planning on using for your analysis. Second, it means eliminating observations whose variable values are incredible, sometimes referred to as outliers. For example, if you’re collecting data on people’s weight, and you had an observation for one person that read 12000 pounds, it would be clear that this observation is an error, perhaps an error in data entry or a dishonest answer by the respondent. In any case, that observation should be eliminated.

Then again, drawing a line between the credible and the incredible is a difficult task. For example, would an adult male with a weight of 75 pounds be reasonable to include, or should you exclude the observation, or should you conclude that the respondent probably weighs 175 pounds and there was just an error somewhere in the data process that resulted in dropping the 1?

When dealing with secondary data, you should know what the error codes for the data are. Very often, negative values or values like 99 are reserved for non-answers, either because a respondent refuses to answer or because the question is not applicable for that respondent (such as asking a man whether he is pregnant).


To put this in perspective, one number that gets floated around for error rates in data entry is 3%. That is, something like 3% of the numbers input in a data set are likely to be in error.

2. Variable Development

If you are starting with some variables but need others that are derived from them, the derived variables have to be built before the analysis begins. This is just a matter of being sure that you have all of the variables you will need and then assembling them. For example, if you want to do a study of body mass index, but you collected respondents’ height and weight, then these need to be combined to generate body mass index. As another example, if you have several measures of employee satisfaction, you might want to combine them in some way to get an aggregate measure of employee satisfaction.

3. Data Coding

This means putting data into a format that makes sense to you. For example, if you have a variable called “gender” that takes a value of 1 if a person is female and 2 if they are male, this might be better recoded as a “female” dummy variable, meaning that you make a new variable that takes the value 0 if someone is male and 1 if they are female. As another example, if you are examining the effect of ethnicity on income and have data that includes hundreds of ethnic divisions, this may be more categories than you really want to deal with, so you would have to collapse these hundreds of categories into a more manageable number.

In addition, you should generate a codebook which shows what each value means for each variable. If you are working with secondary data, you should be sure that you have a codebook for the data and that you are familiar with the variables and their values.

4. Error Check

Look through the data again and make sure that there aren’t any errors. Make sure that there aren’t any numbers that don’t make sense.

5. Data Structure Generation

The data should be organized into the form shown in Table 12.4, where each row is an observation and each column represents a different variable. This should also include making sure that the codebook for the data is accurate and complete.
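The editing, variable development and coding steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the -9 "refused" code and the plausibility limits of 50 and 700 pounds are all made up for the example, and real limits and error codes should come from your own judgment and the codebook. The body mass index uses the standard formula for pounds and inches.

```python
# Hypothetical raw survey records; None marks a missing answer and -9 is an
# assumed "refused to answer" code (check the real codebook for actual codes).
raw = [
    {"id": 1, "gender": 1, "height_in": 64, "weight_lb": 130},
    {"id": 2, "gender": 2, "height_in": 70, "weight_lb": 12000},  # incredible value
    {"id": 3, "gender": 2, "height_in": 72, "weight_lb": None},   # missing
    {"id": 4, "gender": 1, "height_in": 66, "weight_lb": -9},     # error code
    {"id": 5, "gender": 2, "height_in": 69, "weight_lb": 180},
]

MIN_WEIGHT_LB, MAX_WEIGHT_LB = 50, 700  # assumed plausibility limits


def is_usable(record):
    """Keep a record only if weight is present, not an error code, and credible."""
    weight = record["weight_lb"]
    if weight is None or weight < 0:
        return False
    return MIN_WEIGHT_LB <= weight <= MAX_WEIGHT_LB


def develop_and_code(record):
    """Add a BMI variable and recode gender (1 = female, 2 = male) as a dummy."""
    bmi = 703 * record["weight_lb"] / record["height_in"] ** 2
    return {
        "id": record["id"],
        "female": 1 if record["gender"] == 1 else 0,
        "bmi": round(bmi, 1),
    }


clean = [develop_and_code(r) for r in raw if is_usable(r)]
for row in clean:
    print(row)
```

Only observations 1 and 5 survive the editing step here. Note that the sketch simply drops doubtful records; whether to drop, keep or correct a borderline value (like the 75-pound adult) remains a judgment call.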


6. Preanalytical Computer Check

This is basically a final inspection to be sure that the data are free of errors and inconsistencies.

7. Tabulation and Summarization

Tabulation is done for qualitative variables and means creating descriptive tables showing how many respondents fell into which data ranges or categories. For example, you might tabulate the number of women and the number of men. If employees’ satisfaction with their benefits was rated on a scale of 1 through 5, you might tabulate the number of people who responded with each answer. While these tables shouldn’t be the entire analysis, they can help to familiarize the analyst with the data so that she knows what’s in there.

Summarization is done for quantitative variables and means calculating means (averages), variances and standard deviations and finding the minimum and maximum for each quantitative variable. As with tabulation, this shouldn’t be the entire analysis, but it will let you know what you’ve got with your data and might help to guide the analysis that follows.

You should tabulate all of the qualitative variables and calculate summary statistics for (summarize) all of the quantitative variables to acquaint yourself with your data.

Some Useful Summary Formulas

Here are the formulas for some basic summary statistics. It makes sense to calculate these for quantitative variables, such as a person’s height, weight, age, income or test score, but doesn’t make sense to calculate these for qualitative variables like gender, religion, occupation or eye color. Imagine that you have the following data on test scores:

Obs   Name    Score
1     Bob        98
2     Deb        70
3     Scott      65
4     Tom        78
5     James      84

The average or mean test score is the sum of these divided by the total number of scores, five. This is:


$$\bar{X} = \frac{1}{5}\sum_{i=1}^{5} x_i = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{98 + 70 + 65 + 78 + 84}{5} = \frac{395}{5} = 79$$

The median test score is the one in the middle when the scores are ranked from lowest to highest. This is:

65 70 78 84 98

So 78 is the median. If there is an even number of observations, then the median is the average of the two observations in the middle. So, if the observations were:

60 65 70 78 84 98

the median would be the average of 70 and 78, or 74.

The range is the difference between the lowest and highest values, or might also be a restatement of the lowest and highest values. In the data set with five observations shown above, the range might be stated as 98-65=33, or you might say that the observations range from a minimum of 65 to a maximum of 98.

The variance is a measure of how spread out or widely dispersed the values of a variable are. The more widely dispersed they are, the larger the variance will be. The less widely spread out they are, the smaller the variance will be. The variance is equal to the average of the squared differences between the observations and the mean. Perhaps an example will help. From the five-observation group of test scores above, we had a mean of 79. The variance, which goes by the symbol σ², will be:

$$
\begin{aligned}
\sigma^2 &= \tfrac{1}{5}\left[(65-79)^2 + (70-79)^2 + (78-79)^2 + (84-79)^2 + (98-79)^2\right] \\
         &= \tfrac{1}{5}\left[(-14)^2 + (-9)^2 + (-1)^2 + (5)^2 + (19)^2\right] \\
         &= \tfrac{1}{5}\left[196 + 81 + 1 + 25 + 361\right] \\
         &= \tfrac{1}{5}\left[664\right] = 132.8
\end{aligned}
$$

The standard deviation, which goes by the symbol σ, is just the square root of the variance:

$$\sigma = \sqrt{\sigma^2} = \sqrt{132.8} \approx 11.524$$

For purposes of comparison, imagine that the five test scores were:

77 78 79 80 81

The mean would still be 79. The variance would be:

$$
\begin{aligned}
\sigma^2 &= \tfrac{1}{5}\left[(77-79)^2 + (78-79)^2 + (79-79)^2 + (80-79)^2 + (81-79)^2\right] \\
         &= \tfrac{1}{5}\left[(-2)^2 + (-1)^2 + (0)^2 + (1)^2 + (2)^2\right] \\
         &= \tfrac{1}{5}\left[4 + 1 + 0 + 1 + 4\right] \\
         &= \tfrac{1}{5}\left[10\right] = 2
\end{aligned}
$$

It should be fairly clear why you would care about the mean, but why would you care about the standard deviation or the variance?

First, if you know the mean and the standard deviation, you know where most of the observations of a variable lie. At least 75% of the observations and, more often, about 95% of the observations of a variable will lie within two standard deviations of the mean. For example, if the mean test score was 100 and the standard deviation was 20, this would mean that most of the observations lie within the range of 60 to 140. In addition, almost all of the observations will be within three standard deviations of the mean.

Second, if you are comparing two groups that have approximately the same mean, the standard deviation tells you which is more spread out. Imagine that you were trying to assess employee satisfaction at two companies and you found that while the two companies had similar mean salaries, salaries at the two companies had dramatically different standard deviations. What would this mean? As another example, if you were assessing living conditions in different countries and found a set of countries with similar mean incomes but very different standard deviations of income, what would this tell you?
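As a cross-check on the hand calculations in this chapter, the summary statistics for the five test scores can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are just illustrative. It uses `pvariance` and `pstdev`, which divide by the number of observations, because that is the convention these notes use.

```python
import statistics

scores = [98, 70, 65, 78, 84]  # Bob, Deb, Scott, Tom, James

mean = statistics.mean(scores)
median = statistics.median(scores)
score_range = max(scores) - min(scores)
# pvariance/pstdev divide by n, matching the (1/5) factor used in these notes;
# statistics.variance/stdev would divide by n - 1 instead.
variance = statistics.pvariance(scores)
std_dev = statistics.pstdev(scores)

print("mean:", mean)
print("median:", median)
print("range:", score_range)
print("variance:", round(variance, 1))
print("standard deviation:", round(std_dev, 3))

# Count the observations within two standard deviations of the mean.
within_two = [s for s in scores if abs(s - mean) <= 2 * std_dev]
print("within 2 SD:", len(within_two), "of", len(scores))
```

A similar distinction exists in Excel: =VARP() and =STDEVP() divide by n, while =VAR() and =STDEV() divide by n - 1, so the two pairs give slightly different answers on small samples.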


Chapter 13: Basic Analytical Methods

The purpose of this chapter is to introduce you to some basic analytical methods. These methods will be explained and screen captures will illustrate how to do these things in Microsoft Excel.

The procedures shown here will be broken down into two groups. Descriptive procedures let you get acquainted with your data and offer general, descriptive accounts of it. Inferential procedures let you infer some things about the population from the sample that you are examining. Each procedure will be introduced and then screen captures will show you how to do it in Excel. To do this, we’ll use some fake data showing gender, hat size (small, medium or large), age and income for fifteen people.

Doing some of this will require the use of the Analysis ToolPak. If you look in Excel under the Tools menu and don’t see Data Analysis as an option, you should install this ToolPak. Here’s how you do it:

If you don’t see Data Analysis:


From the Tools menu, choose Add-Ins.


Then choose the Analysis ToolPak and click OK.

Descriptives

These are procedures that let you analyze your data in a fairly casual way.

Descriptives for Qualitative Variables

For qualitative variables, you can generate frequency tables. For example, we could figure out how many of the people in the data set are male and how many are female.


To do this, we want to highlight the area of the spreadsheet containing the gender variable and then go to the Data menu and choose Pivot Tables:


Then specify that the data you wish to use is in an Excel workbook and choose Next:


The wizard will guess that the data you have highlighted is the data you wish to use and you can click Next:

Then it will ask you whether you want to put the resulting table on the same worksheet or on a new worksheet. This is up to you, but if you want to put it on the existing worksheet, you’ll need to choose a cell to put it in.


Important!!!!! Before you proceed, click on the Layout button to tell Excel how you want to do this table.


Click on the Gender box and slide it over to the Row box, then click on the Gender box again and slide it over to the Data box in the table in the window and you’ll be all set. Click OK and then Finish.


What you will get should look like this, from which you can conclude that the data contain observations on seven females and eight males.


If you like, you can create a chart of this by copying the numbers in the table, choosing Edit/Paste Special/Values to paste the numbers into the worksheet, and then choosing Insert/Chart and whatever sort of chart you like for the data:


I did a pie chart and this is how it turned out:


You can also do frequency tables that let you look at two qualitative variables simultaneously and casually examine whether there seems to be a relationship between them. This involves doing another pivot table similar to the one we did before, except that you should highlight the two contiguous columns that contain the qualitative variables whose relationship you hope to examine:


Drag the Gender box over to Columns and the Size box over to Rows and then drag one or the other (it doesn’t matter which) to Data and you’re all set:


And here’s the final result:

This tells you that of the people with large heads, two are female and three are male, of the people with medium heads, two are female and three are male and of the people with small heads, three are female and two are male. The conclusion might be that there doesn’t seem to be a strong relationship between gender and hat size; the genders seem to have similar distributions of hat sizes.
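For readers who want to see the counting that a pivot table does behind the scenes, here is a minimal Python sketch of a one-way frequency table and a two-way table. The fifteen records are made up so that they reproduce the counts reported above; they are not the actual worksheet data, and the variable names are only illustrative.

```python
from collections import Counter

# Made-up (gender, hat size) records chosen to reproduce the counts reported above.
people = (
    [("F", "Large")] * 2 + [("M", "Large")] * 3
    + [("F", "Medium")] * 2 + [("M", "Medium")] * 3
    + [("F", "Small")] * 3 + [("M", "Small")] * 2
)

# One-way frequency table for gender.
gender_counts = Counter(gender for gender, _ in people)

# Two-way table (crosstab): count each (size, gender) combination.
crosstab = Counter((size, gender) for gender, size in people)

print("Gender counts:", dict(gender_counts))
print(f"{'Size':<8}{'F':>3}{'M':>3}")
for size in ("Large", "Medium", "Small"):
    print(f"{size:<8}{crosstab[(size, 'F')]:>3}{crosstab[(size, 'M')]:>3}")
```

Each row of the printed table corresponds to a row of the pivot table, with one column per gender.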


Descriptives for Quantitative Variables

These are descriptive techniques that are relevant for quantitative variables. The first is to calculate the mean. We can calculate the mean age by using the =average() function in Excel and specifying the cells in which the observations of Age lie. This can be done through the Insert/Function menu option:


Then choose =average() from the list of many different functions that you will see:


Then specify the cells that contain the observations of Age:


The result will be the average value for Age:


You can also find the variance, standard deviation, minimum and maximum through the Insert/Function menu option:


A histogram is a type of chart that shows how a quantitative variable is distributed. To get this sort of picture for the variable income, for example, we should first type into cells the range and divisions over which we would like to see this chart constructed. In the case of Income, it looks like the incomes are all between zero and 1000, so I’m going to try dividing it up into five blocks, each $200 wide:


Now, to do the histogram, go to Tools/Data Analysis and choose Histogram from the list of ToolPak procedures:


Tell Excel where the data are, where the bin ranges are and where you would like it to put the resulting table:


The resulting table tells you that there was one observation of income below 200, five between 200 and 400, six between 400 and 600 and so on.
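The counting behind a histogram table like this one can be sketched in Python. The incomes below are hypothetical (they are not the data in the screenshots), and the rule that a value exactly on a bin edge falls in the bin ending at that edge is an assumption made for this sketch.

```python
from bisect import bisect_left

# Hypothetical incomes for fifteen people (not the data in the screenshots).
incomes = [150, 220, 260, 310, 350, 390, 410, 450, 480, 520, 560, 590, 640, 720, 910]
bin_tops = [200, 400, 600, 800, 1000]  # upper edge of each $200-wide block

counts = [0] * len(bin_tops)
for income in incomes:
    # bisect_left finds the first bin whose upper edge is >= income, so a value
    # exactly on an edge (e.g. 400) is counted in the bin that ends at that edge.
    index = bisect_left(bin_tops, income)
    if index >= len(bin_tops):
        raise ValueError(f"income {income} is above the last bin edge")
    counts[index] += 1

# Print a small text histogram: one row per bin, one asterisk per observation.
low = 0
for top, count in zip(bin_tops, counts):
    print(f"{low:>4} to {top:<5}{count:>3}  {'*' * count}")
    low = top
```

Changing `bin_tops` shows how much the picture depends on the choice of divisions, which is why it is worth trying more than one set of bins.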


You can make this into a chart by highlighting the relevant numbers and choosing Insert/Chart/Column:


The result tells you that most people have incomes that are between 200 and 600, with one person below that range and three people above, but no one over 1000:

Relationships Between Pairs of Variables

Instead of just looking at one variable at a time, it is sometimes interesting to look at relationships between pairs of variables. There are three types.

The first is relationships between pairs of qualitative variables. This may be assessed through a crosstab table, as was done with qualitative variables above. The table will have separate rows for the values of one variable and separate columns for the values of the other and will show any patterns between the two variables. For example, if you looked at a table and saw this:

             Gender
Hat Size     Male   Female
Small          63       43
Medium         58       50
Large          70       47


It would suggest that hat size isn’t too closely related to gender. However, if the data looked like this:

             Gender
Hat Size     Male   Female
Small          13       73
Medium         58       50
Large         120       17

It would suggest that there is a definite relationship between gender and hat size and that men tend to have larger heads.

The second is relationships between pairs of quantitative variables. There are two main ways of assessing relationships between pairs of quantitative variables. The first is through a type of graph called a scatterplot. The second is through calculation of a correlation coefficient.

A scatterplot shows, graphically, the relationship between pairs of variables. To obtain a scatterplot in Excel, highlight the columns containing the variables whose relationship you want to investigate and then choose Insert/Chart and choose XY (Scatter):


If there is a positive relationship between the two variables (that is, when one variable is higher than average the other tends to be higher than average) the dots in the scatterplot will show a generally upward sloping pattern.

[Figure: scatterplot of Y showing a generally upward-sloping pattern]


If there is a negative relationship between the two variables the dots in the scatterplot will show a generally downward sloping pattern. Scatterplots might also reveal more complex relationships between the variables:

[Figure: scatterplot of Y showing a more complex, nonlinear pattern]

The relationship between two quantitative variables may also be represented by a correlation coefficient. This is a number between -1 and +1 that represents the extent to which there is a linear relationship between the two variables. A correlation coefficient of -1 means that there is a perfectly linear negative relationship between the variables, and a correlation coefficient of +1 means that there is a perfectly linear positive relationship between the two variables. For the first of the two graphs shown above, the correlation coefficient is 0.702. For the second of the two graphs, the correlation coefficient is -0.089. In the second case, there is definitely a relationship between the variables; it just isn’t linear.
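To make the point about linearity concrete, here is a short Python sketch that computes the correlation coefficient from its definition. The data are hypothetical and chosen so the answers are easy to anticipate: two perfectly straight lines and one curve.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of deviations from each variable's mean.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


xs = [-2, -1, 0, 1, 2]                 # hypothetical values
line_up = [2 * x + 1 for x in xs]      # perfectly linear, positive
line_down = [-3 * x for x in xs]       # perfectly linear, negative
curve = [x ** 2 for x in xs]           # strong relationship, but not linear

print(correlation(xs, line_up))
print(correlation(xs, line_down))
print(correlation(xs, curve))
```

The first two results are +1 and -1 (up to floating-point rounding), while the curve gives zero even though Y is completely determined by X. A correlation near zero therefore rules out a linear relationship, not a relationship of any kind, which is why it pays to look at the scatterplot as well.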


To get a correlation coefficient, choose Tools/Data Analysis/Correlation and highlight the columns with the relevant data:


The result will be the correlation coefficient between the variables:


The correlation coefficient of 0.338 suggests a moderate positive relationship between age and income, as suggested by the scatterplot between age and income. There is a slight upward pattern.

[Figure: scatterplot titled “Income Against Age”, with Income on the vertical axis and Age on the horizontal axis]

The third is relationships between quantitative variables and qualitative variables. For example, how is age related to gender, how is income related to gender, or how is income related to hat size? The best way to demonstrate a relationship between a quantitative variable and a qualitative variable is to calculate the mean of the quantitative variable for the different values of the qualitative variable. For example, you might calculate the average income for men and the average income for women and then compare the two. You might calculate mean income for people with small heads, with medium heads and with large heads and compare these means.


To compare the means of two groups (men and women or people with small heads and people with large heads, for example), first sort the data: highlight the data and choose Data/Sort to sort by the qualitative variable.


With the data sorted according to the qualitative variable that will define the two groups (gender or hat size, for example), choose Tools/Data Analysis and then t-Test: Two-Sample Assuming Unequal Variances:


Highlight the data from the first group and then the data from the second group and tell Excel where you want the results to go:


The results are:

This says that women (Gender=F) have an average income of 437.429 and men have an average income of 482.125.

A Bit About Inferential Statistics

Inferential statistics means taking what you know about a sample and using it to infer something about the population from which it was drawn. There are a lot of techniques that can be used to do this. We’re going to briefly discuss two of them.

The first is used to compare the mean of a variable between two groups. This was done above when we compared the income of men with the income of women. For the sample we had, men had an average income of 482.125 and women had an average income of 437.429. The question is: based on this information, is there convincing evidence that the population of men from which this sample was drawn has mean income greater than that of the population of women from which the female sample was drawn?

The answer to this question depends on the calculation of a t-statistic, some details of which are presented in the textbook. Basically, it depends on the size of the two samples, how different their means were, and how large the standard deviations were. However, the answer is contained in the Excel output shown above.


The question of whether or not the male average is equal to the female average is answered by the values of the t-statistic (t Stat) and the associated p-value (P(T<=t) two-tail). If the t-stat is larger than 2.00 in absolute value (that is, ignoring its sign) or if the p-value is less than 0.05, this generally means that the two groups have significantly different means. If, however, the t-stat is smaller than 2.00 in absolute value or the p-value is greater than 0.05, this means that the difference between the two samples is not statistically significant. To summarize:

Large t-stat, small p-value => significantly different
Small t-stat, large p-value => not significantly different

In this case, the t-stat was -0.3738 and the associated two-tailed p-value was 0.7146. This is a small t-stat (in absolute value) and a large p-value, so the difference between male incomes and female incomes is not statistically significant. That is, the difference between the men and women in the sample isn’t big enough to let us conclude with a great degree of confidence that there is a difference in the population.
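To show what goes into a t-statistic of this kind, here is a minimal Python sketch of the unequal-variance (Welch) two-sample calculation, using only the standard library. The incomes are hypothetical and will not reproduce the Excel output above, and the function name `welch_t` is just an illustrative choice. The sketch stops at the t-statistic and degrees of freedom because the standard library has no t-distribution for computing the p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t-statistic, approximate degrees of freedom) for two samples."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (divides by n - 1).
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    t_stat = (mean(group_a) - mean(group_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Hypothetical incomes, not the data behind the Excel output.
women = [310, 380, 420, 450, 470, 500, 540]
men = [250, 390, 430, 480, 510, 560, 600, 640]

t_stat, df = welch_t(women, men)
print(f"mean women = {mean(women):.1f}, mean men = {mean(men):.1f}")
print(f"t = {t_stat:.3f}, df = {df:.1f}")
# Rule of thumb from the notes: |t| above about 2 suggests a significant difference.
print("significant" if abs(t_stat) > 2 else "not significant")
```

The formula makes the three ingredients visible: the difference in means is on top, while the sample sizes and the variances (squared standard deviations) determine the denominator. A bigger gap between the means, larger samples or smaller standard deviations all push the t-statistic further from zero.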
