Correlations and Regression


This will be a brief tutorial, since very little is required to calculate correlations and linear regressions. To calculate a simple correlation matrix, use [Statistics => Correlate => Bivariate...]; to calculate a linear regression, use [Statistics => Regression => Linear...].

For this section, the analyses presented in the computer section of the Correlation and Regression chapter will be replicated. To begin, enter the data as follows,

    IQ GPA

102 2.75

108 4.00

    109 2.25

    118 3.00

    79 1.67

    88 2.25

    ... ...

    ... ...

    85 2.50

    Simple Correlation

• Click on [Statistics => Correlate => Bivariate...], then select and move "IQ" and "GPA" to the Variables: list. [Explore the options presented on this controlling dialog box.]

• Click on [OK] to generate the requested statistics. The results in the output window should look like the following,


As you can see, r = 0.702 and p = .000 (SPSS rounds to three decimals, so this means p < .0005). The results suggest that the correlation is significant.

Note: In the above example we only created a correlation matrix based on two variables. The process of generating a matrix based on more than two variables is no different. That is, if the dataset consisted of 10 variables, they could all have been placed in the Variables: list. The resulting matrix would include all the possible pairwise correlations.
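The same matrix can also be requested from a syntax window. A minimal sketch, assuming the variables are named IQ and GPA as above:

* Pearson correlations, two-tailed significance, for all pairs.
CORRELATIONS
  /VARIABLES=IQ GPA
  /PRINT=TWOTAIL SIG.

Pasting these lines into a syntax window and running them produces the same output as the menu route.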

Linear Regression

With linear regression, it is possible to obtain the regression coefficients necessary to predict one variable from the other - the coefficients that minimize error. To do so, one must select the [Statistics => Regression => Linear...] option. Further, there is a need to know which variable will be used as the dependent variable and which will be used as the independent variable(s). In our current example, GPA will be the dependent variable, and IQ will act as the independent variable.

    Specifically,

• Initiate the procedure by clicking on [Statistics => Regression => Linear...]

• Select and move GPA into the Dependent: variable box

• Select and move IQ into the Independent(s): variable box

• Click on [OK] to generate the statistics.

Note: A variety of options can be accessed via the buttons on the bottom half of this controlling dialog box (e.g., Statistics, Plots,...). Many more statistics can be generated by exploring the additional options via the Statistics button.

    Some of the results of this analysis are presented below,


The correlation is still 0.702, and the p value is still 0.000. The additional statistics are the "Constant", or a from the text, and the "Slope", or b from the text. If you recall, the dependent variable is GPA in this case. As such, one can predict GPA with the following,

    GPA = -1.777 + 0.0448*IQ
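The same model can be fit from a syntax window. A minimal sketch, again assuming the variable names IQ and GPA:

* Simple linear regression predicting GPA from IQ.
REGRESSION
  /DEPENDENT GPA
  /METHOD=ENTER IQ.

The Coefficients table of the resulting output contains the Constant and Slope reported above.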

The next sections will discuss the calculation of the Chi-Square statistic, t-tests, and the ANOVA.

    Chi-Square & T-Test

The computation of the Chi-Square statistic can be accomplished by clicking on [Statistics => Summarize => Crosstabs...]. This particular procedure will be your first introduction to the coding of data in the data editor. To this point, data have been entered in a column format - that is, one variable per column. However, that method is not sufficient in a number of situations, including the calculation of Chi-Square, independent t-tests, and any factorial ANOVA design with between-subjects factors. I'm sure there are many other cases, but they will not be covered in this tutorial. Essentially, the data have to be entered in a specific format that makes the analysis possible. The format typically reflects the design of the study, as will be demonstrated in the examples.

In your text, the following data appear in section 6.????. Please read the text for a description of the study. Essentially, the table below includes the observed data and the expected data in parentheses.

    Fault Guilty Not Guilty Total


Low 153 (127.559) 24 (49.441) 177

High 105 (130.441) 76 (50.559) 181

Total 258 100 358

In the hopes of minimizing the load time for the remaining pages, I will make use of the built-in table facility of HTML to simulate the Data Editor in SPSS. This will reduce the number of images/screen captures to be loaded.

For the Chi-Square statistic, the table of data can be coded by indexing the column and row of the observations. For example, the count for being Guilty with Low fault is 153. This specific cell can be indexed as coming from row=1 and column=1. Similarly, Not Guilty with High fault is coded as row=2 and column=2. For each observation, four in this instance, there is a unique code for its location in the table. These can be entered as follows,

    Row Column Count

    1 1 153

    1 2 24

    2 1 105

    2 2 76

• So, 2 rows * 2 columns equals 4 observations. That should be clear.

• For each of the rows, there are 2 corresponding columns, and that is reflected in the Count column. The Count column represents the number of times each unique combination of Row and Column occurs.

The above presents the data in an unambiguous manner. Once entered, the analysis is a matter of selecting the desired menu items, and perhaps selecting additional options for that statistic. [Don't forget to use the labelling facilities, as mentioned earlier, to meaningfully identify the columns/variables. The labels that are chosen will appear in the output window.]

    To perform the analysis,

• The first step is to inform SPSS that the COUNT variable represents the frequency for each unique coding of ROW and COLUMN, by invoking the WEIGHT command. To do this, click on [Data => Weight Cases]. In the resultant dialog box, enable the Weight cases by option, then move the COUNT variable into the Frequency Variable: box. If this step is forgotten, the count for each cell of the table will be 1.


• Now that the COUNT variable has been processed as a weighted variable, select [Statistics => Summarize => Crosstabs...] to launch the controlling dialog box.

• At the bottom of the dialog box are three buttons, with the most important being the [Statistics...] button. You must click on the [Statistics...] button and then select the Chi-square option, otherwise the statistic will not be calculated. Exploring this dialog box makes it clear that SPSS can be forced to calculate a number of other statistics in conjunction with Chi-square. For example, one can select the various measures of association (e.g., contingency coefficient, phi and Cramér's V,...), among others.

• Move the ROW variable into the Row(s): box and the COLUMN variable into the Column(s): box, then click [OK] to perform the analysis. A subset of the output looks like the following,


Although simple, the calculation of the Chi-square statistic is very particular about all the required steps being followed. More generally, as we enter hypothesis testing, the user should be very careful and should make use of the manuals for the programme and textbooks for statistics.
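For reference, the whole procedure - weighting included - reduces to a few lines of syntax. A sketch, assuming the variables are named ROW, COLUMN, and COUNT as above:

* Weight each cell by its observed frequency.
WEIGHT BY Count.
* Crosstabulate and request the Chi-square statistic.
CROSSTABS
  /TABLES=Row BY Column
  /STATISTICS=CHISQ.

Note that WEIGHT BY stays in effect until it is turned off (WEIGHT OFF.), which mirrors the warning about the Weight Cases dialog above.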

    T-tests

By now, you should know that there are two forms of the t-test: one for dependent and one for independent observations. To inform SPSS, or any stats package for that matter, of the type of design, it is necessary to have two different ways of laying out the data. For the dependent design, the two variables in question must be entered in two columns. For independent t-tests, the observations for the two groups must be uniquely coded with a Group variable. Like the calculation of the Chi-square statistic, these calculations will reinforce the practice of thinking about, and laying out, the data in the correct format.

    Dependent T-Test

To calculate this statistic, one must select [Statistics => Compare Means => Paired-Samples T Test...] after entering the data. For this analysis, we'll use the data from Table 7.3 in Howell.

• Enter the data into a new datafile. Your data should look a bit like the following. That is, the two variables should occupy separate columns...


    Mnths_6 Mnths_24

    124 114

94 88

115 102

... ...

... ...

    123 132

Note that the variable names start with a letter and are no more than 8 characters long. This is a bit constraining; however, one can use the variable label option to label the variable with a longer name. This more descriptive name will then be reproduced in the output window.

• To calculate the t statistic, click on [Statistics => Compare Means => Paired-Samples T Test...], then select the two variables of interest. To select the two variables, hold the [Shift] key down while using the mouse for selection. You will note that the selection box requires that variables be selected two at a time. Once the two variables have been selected, move them to the Paired Variables: list. This procedure can be repeated for each pair of variables to be analyzed. In this case, select MNTHS_6 and MNTHS_24 together, then move them to the Paired Variables: list. Finally, click the [OK] button.
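The equivalent syntax is short. A sketch, assuming the variables are named MNTHS_6 and MNTHS_24:

* Paired-samples t-test comparing the 6- and 24-month scores.
T-TEST PAIRS=Mnths_6 WITH Mnths_24 (PAIRED).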

    The critical result for the current analysis will appear in the output window as follows,


As you can see, an exact t-value is provided along with an exact p-value, and this p-value is greater than the cutoff of 0.025 for a two-tailed assessment. Closer examination indicates several other statistics are presented in the output window.

    Quite simply, such calculations require very little effort!

    Independent T-tests

When calculating an independent t-test, the only difference involves the way the data are formatted in the datasheet. The datasheet must include both the raw data and a group code for each observation. For this example, the data from Table 7.5 will be used. As an added bonus, the numbers of observations are unequal for this example.

    Take a look at the following table to get a feel for how to code the data.

    Group Exp_Con

    1 96

    1 127

    1 127

    1 119

    1 109

    1 143

    1 ...

    1 ...

    1 106

    1 109

    2 114

    2 88


    2 104

    2 104

    2 91

    2 96

    2 ...

    2 ...

    2 114

    2 132

From the above you can see that we used the "Group" variable to code for the two groups. The value of 1 was used to code for "LBW-Experimental", while a value of 2 was used to code for "LBW-Control". If you're confused, please study the table above.

    To generate the t-statistic,

• Click on [Statistics => Compare Means => Independent-Samples T Test] to launch the appropriate dialog box.

• Select "exp_con" - the dependent variable - and move it to the Test Variable(s): box.

• Select "group" - the grouping variable - and move it to the Grouping Variable: box.

• The final step requires that the groups be defined. That is, one must specify that Group1 - the experimental group in this case - is coded as 1, and Group2 - the control group in this case - is coded as 2. To do this, click on the [Define Groups...] button and enter the two codes. Click on the [Continue] button to return to the controlling dialog box.

• Run the analysis by clicking on the [OK] button. The output for the current analysis, extracted from the output window, looks like the following.


The p-value of .004 is well below the cutoff of 0.025, which suggests that the means are significantly different. Further, a Levene's Test is performed to ensure that the correct results are used. In this case the variances are equal; however, the calculations for unequal variances are also presented, among some other statistics - some not shown here.
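Again, the menu steps map onto a short piece of syntax. A sketch, assuming the variables GROUP (coded 1 and 2) and EXP_CON as above:

* Independent-samples t-test; (1 2) defines the two groups.
T-TEST GROUPS=group(1 2)
  /VARIABLES=exp_con.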

In the next section we will briefly demonstrate the calculation of correlations and regressions, as discussed in Chapter 9 of Howell. In truth, you should be able to work through many statistics with your current knowledge base and the help files, including correlations and regressions. Most statistics can be calculated with a few clicks of the mouse.

    One-Way ANOVA

As in the independent t-test datasheet, the data must be coded with a group variable. The data that will be used for the first part of this section are from Table 11.2 of Howell. There are 5 groups of 10 observations each, resulting in a total of 50 observations. The group variable will be coded from 1 to 5, one code per group. Take a look at the following to get an idea of the coding.

    Groups Scores

1 9

1 8

    1 6

    ... ...

    1 7

    2 7

    2 9

    2 6

    ... ...

    ... ...

    ... ...

    5 10

    5 19

    ... ...


    5 11

    The coding scheme uniquely identifies the origin of each observation.

    To complete the analysis,

• Select [Statistics => Compare Means => One-Way ANOVA...] to launch the controlling dialog box.

• Select and move "Scores" into the Dependent List: box

• Select and move "Groups" into the Factor: box

• Click on [OK]

The preceding is a complete specification of the design for this one-way ANOVA. The simple presentation of the results, as taken from the output window, will look like the following,

The analysis that was just performed provides minimal details with regard to the data. If you take a look at the controlling dialog box, you will find 3 additional buttons on the bottom half - [Contrasts...], [Post Hoc...], and [Options...].


Selecting [Options...], you will find the following.

If Descriptive is enabled, then the descriptive statistics for each condition will be generated. Making Homogeneity-of-variance active forces a Levene's test on the data. The statistics from both of these analyses will be reproduced in the output window.

Selecting [Post Hoc...] will launch the following dialog box,


One can activate one or more post hoc tests to be performed. The results will then be placed in the output window. For example, performing an R-E-G-W F test on the current data would produce the following,

Finally, one can use the [Contrasts...] option to specify linear and/or orthogonal sets of contrasts. One can also perform trend analysis via this option. For example, we may wish to contrast the third condition with the fifth.
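In syntax, such a contrast is written as a list of coefficients, one per group in order. A sketch combining the options discussed above (FREGW is my reading of the keyword for the R-E-G-W F test - treat it as an assumption):

* One-way ANOVA with descriptives, Levene's test,
* the R-E-G-W F post hoc, and a 3rd-vs-5th contrast.
ONEWAY Scores BY Groups
  /STATISTICS=DESCRIPTIVES HOMOGENEITY
  /POSTHOC=FREGW ALPHA(.05)
  /CONTRAST=0 0 1 0 -1.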


    Factorial ANOVA

To conduct a factorial ANOVA, one need only extend the logic of the one-way design. Table 13.2 presents the data for a 2 by 5 factorial ANOVA. The first factor, AGE, has two levels, and the second factor, CONDITION, has five levels. So, once again, each observation can be uniquely coded.

    AGE CONDITION

    Old = 1 Counting = 1

    Young = 2 Rhyming = 2

    Adjective = 3

    Imagery = 4

    Intentional = 5

For each pairing of AGE and CONDITION, there are 10 observations. That is, 2*5 conditions by 10 observations per condition results in 100 observations, which can be coded as follows. [Note that the names for the factors are meaningful.]

    AGE CONDITIO Scores

    1 1 9

    1 1 8

    1 1 6

    1 ... ...

    1 1 7

    1 2 7

    1 2 9

    1 2 6

    1 ... ...

1 ... ...

1 ... ...

    1 5 10

    1 5 19

    1 ... ...

    1 5 11


    2 1 8

    2 1 6

    2 1 4

    2 ... ...

    2 1 7

    2 2 10

    2 2 7

    2 2 8

    2 ... ...

    2 ... ...

    2 ... ...

    2 5 21

2 5 19

2 ... ...

    2 5 21

Examine the table carefully until you understand how the coding has been implemented. Note: one can enhance the readability of the output by using Value Labels for the two factors.

    To compute the relevant statistics - a simple approach,

• Select [Statistics => General Linear Model => Simple Factorial...]

• Select and move "Scores" into the Dependent: box

• Select and move "Age" into the Factor(s): box.


• Click on [Define Range...] to specify the range of coding for the Age factor. Recall that 1 is used for Old and 2 is used for Young. So, the Minimum: value is 1, and the Maximum: value is 2. Click on [Continue].

• Select and move "Conditio" into the Factor(s): box

• Click on [Define Range...] to specify the range of the Condition factor. In this case the Minimum: value is 1 and the Maximum: value is 5.

By clicking on the [Options...] button, one has the opportunity to select the Method used. According to the online help,

"Method: Allows you to choose an alternate method for decomposing sums of squares. Method selection controls how the effects are assessed."

For our purposes, selecting the Hierarchical or the Experimental method will make available the option to output Means and counts. --- Note: I don't know the details of these methods; however, they are probably documented.

• Under [Options...], activate Hierarchical or Experimental, then activate Means and counts - click [Continue].

• Click on [OK] to generate the output.
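The Simple Factorial dialog appears to map onto the classic ANOVA command - treat the mapping as an assumption. A sketch, using AGE (coded 1-2), CONDITIO (coded 1-5), and SCORES:

* 2 x 5 between-subjects factorial ANOVA with cell means.
ANOVA VARIABLES=Scores BY Age(1,2) Conditio(1,5)
  /METHOD=HIERARCHICAL
  /STATISTICS=MEAN.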


As you can see, the use of the Means and counts option produces a nice summary table, with all the Variable Labels and Value Labels that were incorporated into the datasheet. Again, the use of those options makes the output a great deal more readable.

The output is a complete source table with the factors identified by their Variable Labels.

As noted earlier, the analysis that was just conducted is the simplest approach to performing a factorial ANOVA. If one uses [Statistics => General Linear Model => GLM - General Factorial...], then more options become available. The specification of the dependent variable and the independent factors is the same as the method used for the Simple Factorial analysis. Beyond that, the options include,

• By selecting [Model...], one can specify a Custom model. The default is a Full Factorial model; however, with the Custom option one can explicitly determine the effects to look at.

• The [Contrasts...] option allows one to "test the differences among the levels of a factor" (see the manual for greater detail).

• Various graphs can be specified with the [Plots...] option. For example, one can plot "Conditio" on the Horizontal Axis: and "Age" on Separate Lines: to generate a simple "conditio*age" plot (see the dialog box for [Plots...]).


• The standard post-hoc tests for each factor can be calculated by selecting the desired options under [Post Hoc...]. All one has to do is select the factors to analyze and the appropriate post-hoc test(s).

• The [Options...] dialog box provides a number of diagnostic and descriptive features. One can generate descriptive statistics, estimates of effect size, and tests for homogeneity of variance, among others. An example source table using some of these options would look like the following,


The use of the GLM - General Factorial procedure offers a great deal more than the Simple Factorial. Depending on your needs, the former procedure may provide greater insight into your data. Explore these options!
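For the GLM route, a syntax sketch using the UNIANOVA command; the subcommands reflect the options discussed above, and the exact keywords may vary across versions:

* 2 x 5 factorial via GLM, with descriptives, effect sizes,
* Levene's test, and a Conditio-by-Age profile plot.
UNIANOVA Scores BY Age Conditio
  /PRINT=DESCRIPTIVE ETASQ HOMOGENEITY
  /PLOT=PROFILE(Conditio*Age)
  /DESIGN=Age Conditio Age*Conditio.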

Higher-order factorial designs are carried out in the same manner as the two-factor analysis presented above. One need only code the factors appropriately and enter the corresponding observations.

    Descriptive Statistics

We can replicate the frequency analyses that are described in Chapter 2 of the text by using the file that was read into the data editor - tab2-1.dat. These analyses were conducted on the reaction time data. Recall that we have labelled these data as RT.

To begin, click on [Statistics => Summarize => Frequencies...].


The result is a new dialog box that allows the user to select the variables of interest. Also, note the other clickable buttons along the border of the dialog box. The buttons labelled [Statistics...] and [Charts...] are of particular importance. Since we're interested in the reaction time data, click on rt, followed by a mouse click on the arrow pointing right. The consequence of this action is a transference of the rt variable to the Variable(s): list. At this point, clicking on the [OK] button would spawn an output window with the frequency information for each of the reaction times. However, more information can be gathered by exploring the options offered by [Statistics...] and [Charts...].

    [Statistics...] offers a number of summary statistics. Any statistic that is selected will be

    summarized in the output window.


As for the options under [Charts...], click on Bar Charts to replicate the graph in the text.

Once the options have been selected, click on [OK] to run the procedure. The results are then displayed in an output window. In this particular instance, the window will include summary statistics for the variable RT, the frequency distribution, and the bar chart. You can see all of this by scrolling down the window. The results should also be identical to those in the text.
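The whole frequency analysis can also be expressed as syntax. A sketch, assuming the variable is named RT; the statistics listed are just examples:

* Frequency table, summary statistics, and a bar chart for RT.
FREQUENCIES VARIABLES=RT
  /STATISTICS=MEAN MEDIAN MODE STDDEV
  /BARCHART.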


You may have gathered from the above that calculating summary statistics requires nothing more than selecting variables, and then selecting the desired statistics. The frequency example allowed us to generate frequency information plus measures of central tendency and dispersion. These statistics can also be had by clicking directly on [Statistics => Summarize => Descriptives...]. Not surprisingly, another dialog box is attached to this procedure. To control the type of statistics produced, click on the [Options...] button. Once again, the options include the typical measures of central tendency and dispersion.
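A matching syntax sketch - again, the statistics listed are only illustrative:

* Basic descriptive statistics for RT.
DESCRIPTIVES VARIABLES=RT
  /STATISTICS=MEAN STDDEV MIN MAX.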

Each time a statistical procedure is run, like [Frequencies...] and [Descriptives...], the results are posted to an Output Window. If several procedures are run during one session, the results will be appended to the same window. However, greater organization can be achieved by opening a new Output window before running each procedure - [File => New => Output]. Further, the contents of each of these windows can be saved for later review, or, in the case of charts, saved to be later included in formatted documents. [Explore by left mouse clicking on any of the output objects (e.g., a frequency table, a chart, ...) followed by a right button click. The left button click will highlight the desired object, while the right button click will pop up a new menu. The next step is to click on the Copy option. This action will store the object on the clipboard so that it can be pasted into a Word for Windows document, for example.....]

    Data

To begin the process of adding data, just click on the first cell, located in the upper left corner of the datasheet. It's just like a spreadsheet. You can enter your data as shown. Enter each datapoint, then hit [Enter]. Once you're done with one column of data, you can click on the first cell of the next column.

These data are taken from Table 2.1 in Howell's text. The first column represents "Reaction Time in 100ths of a second" and the second column indicates "Frequency".


If you're entering data for the first time, as in the above example, the variable names will be automatically generated (e.g., var00001, var00002,....). They are not very informative. To change these names, double click on the variable name button - for example, the "var00001" button. Once you have done that, a dialog box will appear. The simplest option is to change the name to something meaningful. For instance, replace "var00001" in the textbox with "RT" (see figure below).

In addition to changing the variable name, one can make changes specific to [Type], [Labels], [Missing Values], and [Column Format].

• [Type] One can specify whether the data are in numeric or string format, in addition to a few more formats. The default is numeric format.


• [Labels] Using the labels option can enhance the readability of the output. A variable name is limited to a length of 8 characters; however, by using a variable label the length can be as much as 256 characters. This provides the ability to have very descriptive labels that will appear in the output.

Often, there is a need to code categorical variables in numeric format. For example, male and female can be coded as 1 and 2, respectively. To reduce confusion, it is recommended that one use value labels. For the example of gender coding, Value: 1 would have a corresponding Value Label: male. Similarly, Value: 2 would be coded with Value Label: female. (Click on the [Labels] button to verify the above.)

• [Missing Values] See the accompanying help. This option provides a means to code for various types of missing values.

• [Column Format] The column format dialog provides control over several features of each column (e.g., width of column).
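Both kinds of labels can also be assigned with syntax. A sketch using the RT variable, plus the gender example from above (the GENDER variable is hypothetical, for illustration only):

* Attach a descriptive label to the RT variable.
VARIABLE LABELS RT 'Reaction Time in 100ths of a second'.
* Label the numeric codes of a hypothetical GENDER variable.
VALUE LABELS Gender 1 'male' 2 'female'.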

    The next image reflects the variable name change.

  • 8/7/2019 Correlations and Regression

    27/30

Once data have been entered or modified, it is advisable to save. In fact, save as often as possible [File => Save As].

SPSS offers a large number of possible formats, including its own. A list of the available formats can be viewed and selected by clicking on the Save as type: list on the Save As dialog box. If your intention is to work only in SPSS, then there may be some benefit to saving in the SPSS (*.sav) format. I assume that this format allows for faster reading and writing of the datafile. However, if your data will be analyzed or viewed by other packages (e.g., a spreadsheet), it would be advisable to save in a more universal format (e.g., Excel (*.xls), 1-2-3 Rel 3.0 (*.wk3)).

Once the type of file has been selected, enter a filename, minus the extension (e.g., sav, xls). You should also save the file in a meaningful directory, on your harddrive or floppy. That is, for any given project a separate directory should be created. You don't want your data to get mixed up.
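In syntax, saving is a single command. A sketch - the path is hypothetical:

* Save the working data file in SPSS format.
SAVE OUTFILE='C:\myproject\tab2-1.sav'.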

The process of reading already-saved data can be painless if the saved format is the SPSS or a spreadsheet format. All one has to do is,

    o click on [File => New => Data]


    o type "obs" in the Name boxo the Data Type is set to Numeric by default. If "obs" was a string variable, then

    one would have to click on Stringo click on the Add button to include this variable to the list.o repeat the above procedure with new names and data types for each of the

    remaining variables. It is important that all variables be added to the list.Otherwise, the data will be scrambled.

    (Please explore the various options by clicking on any accessible menu item.)

The resulting data file appears in the data editor like the following.
