0AP03: Methods and models in behavioral research Part 2: Understanding statistics using SPSS (Field)...

0AP03: Methods and models in behavioral research

Part 2: Understanding statistics using SPSS (Field)

Chris Snijders

[email protected]

www.tue-tm.org/moodle

2

EXAMPLE: NETFLIX DVD RENTAL

3

Example: The Netflix Prize

www.netflixprize.com

$1,000,000

4

Example: the Netflix prize (3)

input = kind of previous rentals
input = number of previous rentals
input = day of the week
input = ...

output = extent to which you like a movie

[Diagram: several inputs feeding into one output]

5

Example: The Netflix Prize (2)

• Predict the extent to which a person will like a movie, from previous ratings by others.

• NB
– Measurement
– Root Mean Square Error
– Large prizes!
– You have about 2 GB of data to work on ...
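As a rough illustration of the Root Mean Square Error criterion, the sketch below computes it for a handful of made-up ratings. The ratings and the function name `rmse` are assumptions for illustration only; they are not Netflix data or contest code.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings on a 1-5 scale (invented for this example).
actual_ratings = [4, 3, 5, 2, 1]
predicted_ratings = [3.5, 3.0, 4.0, 2.5, 2.0]

print(round(rmse(predicted_ratings, actual_ratings), 3))
```

A lower value means the predictions sit closer to the actual ratings; large misses are penalized more heavily than small ones because the errors are squared.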

6

0AP03: two parts

Part 1: Blumberg et al. (Gerrit Rooks), Blocks A and B

Part 2: Field (Chris Snijders), Blocks B and C {LANGUAGE=ENGLISH}

7

Understanding statistics using SPSS

http://www.sagepub.co.uk/field/field.htm

-CD-ROM material
-- data sets
-- some software (G*Power)

-answers to (some) assignments in the book

-test banks (note: not identical to exam)

8

www.tue-tm.org/moodle

enrolment key = "fieldspss"

9

http://www.tue-tm.org/moodle

Course home-page

10

Let’s get acquainted …

• Technische InnovatieWetenschappen
– Bachelor’s
– Pre-Master program

• Technische Bedrijfskunde
– Bachelor’s
– Pre-Master program

= = =

Some key concepts:

- Stochastic variables, distributions, normal distribution
- SPSS usage (StatGraphics users?)
- Mean, median, skewness, kurtosis
- Correlation
- Simple regression: Y = a + b X
- Factor analysis
- A chi2 test

A: never heard of it

B: was a topic in previous lectures, but don’t ask me what it is or how to do it

C: was covered and understood

11

Understanding statistics using SPSS

About: Style

About: Content

1 About statistics        8 ANOVA
2 SPSS                    9-12, 14 More ANOVAs
3 Exploring data          13 Non-param. tests
4 Correlation             15 Factor analysis
5 Multiple regression     16 Chi2-tests etc.
6 Logistic regression
7 The t-test

12

T-test, chi2-test

• We have two groups of students, one group that started early and worked regularly, one group that started late (in the last three lectures or later)

Are the grades of the students in the regular group higher? (t-test)

           Average   Max
REGULAR    6.3       8.4
LATE       3.0       3.9

Are the regular students more likely to pass the course? (chi-2 test)

           Pass   No
REGULAR    30     10
LATE       2      38
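The pass/no-pass counts are enough to compute the chi-square test by hand. The sketch below does this for the 2x2 table using only the standard library; it computes the plain Pearson statistic without a continuity correction, and the function name `chi_square_2x2` is made up for this example. (The t-test on the grades would need the individual grades, which the slide does not give.)

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (df = 1) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # For one degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: REGULAR, LATE. Columns: Pass, No.
counts = [[30, 10], [2, 38]]
chi2, p_value = chi_square_2x2(counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}")
```

Under independence each group would be expected to have 16 passes and 24 non-passes, so the observed 30 versus 2 passes is a very large departure.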

13

Exam for the Field-part

[tentative: check the course website later]

Chapters 1, 2, 3, 4, 7: assumed to be common knowledge

Chapters 5, 8, 15, <probably 9-12, 14, perhaps 6>

+ additional material supplied with the course (such as PS – software)

= = =

Exam on laptop:

1 – multiple choice questions
2 – you are given data and must be able to handle the data sensibly

14

The average (quantitative) paper …

• Problem formulation
– What are sensible questions?

• Theory-development and hypotheses
– “What do I expect to be the answer to my question, and what are the implications from the theory that I want to test” (nb: different in exploratory work)

• Choice of research design
– Experiment
– Survey
– Case study
– Participant observation
– …

• Data collection
– Designing questionnaires
– Designing experimental procedures
– Finding your respondents. Sampling (how and how many?)
– …

• Analysis of results
– Measurement: from raw data to measured constructs
– Relational claims: X → Y ?

• Conclusions
– What can we conclude, given our analyses?

15

About the course setup

• Mainly on the Moodle site; studyweb is only used to send mail to you

• “Do-it-yourself course”: mastering SPSS, getting up to speed with SPSS, keeping up with the material is up to you
– Extra material and links on the website
– Practice material for the exam

If you do not practice in between, you will not be able to pass the exam.

• Part 1-Rooks: “Think, then do”
  Part 2-me: “Do, then think”

• We have data, now what do we do? (and partly we collect these data from you)

• Hybrid setup:
– English/Dutch
– business administration / social sciences

16

THE ART OF SAMPLING

17

Sampling

We want conclusions about the population, but we only have (enough time and money to collect) data from part of the population, a sample.

From sample data to population statement: STATISTICAL INFERENCE

[Diagram: a sample drawn from the population]

18

Two parts to every analysis

• Calculate some property of the sample
– Mean (mean height of soccer players)
– Difference between the means of two groups (difference in height between soccer players)
– Correlation between two things measured (correlation between height and number of goals you score)

• Calculate a confidence interval around the property, creating a statement about the property in the population

[Diagram: a sample drawn from the population]

19

On sampling "analog cheese"

Analog cheese = palm oil + starch (zetmeel)

"Keuringsdienst van waarde" took a sample of 11 products and found 5 to contain "analog cheese"

Estimate of the percentage of products containing analog cheese = 5/11 = 45%

What is the (approximate) confidence interval?

A 40 – 50 %
B 32 – 58 %
C 25 – 65 %
D 17 – 77 %
E 9 – 81 %
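One way to get a feel for the width of such an interval is the usual normal approximation for a proportion. The sketch below applies it to 5 out of 11; this is only a rough guide, since the approximation is shaky for a sample this small, and the function name `proportion_ci` is an assumption for illustration.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    return p - z * standard_error, p + z * standard_error


low, high = proportion_ci(5, 11)
print(f"estimate = {5 / 11:.0%}, interval = {low:.0%} to {high:.0%}")
```

The interval comes out roughly 30 percentage points wide on either side of the estimate, which shows how little a sample of 11 pins down.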

20

Applying the 1/sqrt(n) rule

You want to predict how many seats in parliament a certain Dutch political party will get. You allow for a range of plus or minus 2 seats. Say you expect the number of seats to be around 50.

You intend to call a representative sample of people. About how many do you need?

A 50
B 100
C 500
D 5,000
E 50,000
F more than 50,000
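The 1/sqrt(n) rule says that the 95% margin of error of a percentage is roughly 1/sqrt(n), so the required n is roughly 1/margin². The sketch below applies it to the seats question. It assumes a chamber of 150 seats (the size of the Dutch lower house, which the slide does not state); the names `TOTAL_SEATS` and `sample_size_for_margin` are made up for this example.

```python
TOTAL_SEATS = 150  # assumption: size of the Dutch lower house, not given on the slide


def sample_size_for_margin(margin):
    """Rule-of-thumb sample size so that the 95% margin is about `margin`."""
    return round((1 / margin) ** 2)


# Plus or minus 2 seats, expressed as a fraction of all seats.
margin = 2 / TOTAL_SEATS
n_needed = sample_size_for_margin(margin)
print(f"margin = {margin:.2%}, sample size of about {n_needed}")
```

Halving the allowed margin would quadruple the required sample, which is why tight seat predictions get expensive quickly.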

21

Some more sampling

Suppose you want to know, say, the percentage of people in The Netherlands who support the recent foreign policy of the US-government. The Netherlands has 12,000,000 voters.

According to your (correct) calculations you need a sample of 2,000 people.

Now you want to do the same, but in France (population = 36,000,000 voters).

How large should your sample size be in France?

A less than 2,000
B about 2,000
C about 6,000
D more than 6,000
E you need more information

Rule of thumb: For large populations, the required sample size is independent of the population size

22

Explanation: Mean and variance of the mean

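We measure x and get measurements x1, …, xn, where x_i is the measurement of x for unit i, in a sample of size n taken from a population of size N.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$\mathrm{Var}(\bar{x}) = \frac{N-n}{N}\cdot\frac{s_x^2}{n}$$

$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2 \quad \text{(the variance in the sample)}$$

Expectation and variance give the 95%-confidence-interval:

$$\left(\bar{x} - 1.96\sqrt{\mathrm{Var}(\bar{x})}\ ,\ \ \bar{x} + 1.96\sqrt{\mathrm{Var}(\bar{x})}\right)$$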
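The calculation on this slide can be sketched in a few lines: the sample mean, the sample variance with n - 1 in the denominator, the variance of the mean with the factor (N - n)/N, and the interval of plus or minus 1.96 standard errors. The heights and the population size below are invented for illustration, and with a sample this small a t-based multiplier would normally replace 1.96.

```python
import math


def mean_confidence_interval(xs, population_size, z=1.96):
    """95% interval for the mean, with finite population factor (N - n) / N."""
    n = len(xs)
    mean = sum(xs) / n
    sample_variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    variance_of_mean = (population_size - n) / population_size * sample_variance / n
    half_width = z * math.sqrt(variance_of_mean)
    return mean - half_width, mean + half_width


# Hypothetical heights in cm and a hypothetical population of 736 players.
heights = [178, 182, 185, 175, 190, 181, 179, 184]
low, high = mean_confidence_interval(heights, population_size=736)
print(f"mean = {sum(heights) / len(heights):.2f}, interval = {low:.1f} to {high:.1f}")
```

When N is much larger than n the factor (N - n)/N is close to 1 and the interval depends almost entirely on the sample size and the spread in the sample.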

23

Sample size determined by:

Are white soccer players shorter?

• How precisely do you want to measure your statistic? [What is the height difference you would find interesting enough to report?]

• What probability of a Type I error will you allow (rejecting the H0-hypothesis when in fact it is true)? Usually 5%. [How small do you want the probability to be that you reject “(on average) black and non-black players are equally tall” when in fact it is true?]

• How likely do you want it to be that you will find an effect, assuming that it exists in the population? Power, usually 80% or 90%.

• One-sided or two-sided tests?

You need special purpose software for this, for instance G*Power (on the disc), or PS
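To see how these four choices interact, the sketch below uses the standard normal-approximation formula for comparing two group means, n per group = 2 (z_alpha + z_power)² σ² / δ². This is not what G*Power or PS compute exactly (they work with the t-distribution and give slightly larger numbers), and the height difference of 2 cm and standard deviation of 7 cm are invented for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size per group for detecting a mean difference `delta`."""
    standard_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = standard_normal.inv_cdf(1 - tail)   # Type I error threshold
    z_power = standard_normal.inv_cdf(power)      # desired power
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Hypothetical inputs: detect a 2 cm difference when heights have sd 7 cm.
print(n_per_group(delta=2.0, sigma=7.0))                    # two-sided, 80% power
print(n_per_group(delta=2.0, sigma=7.0, power=0.90))        # higher power
print(n_per_group(delta=2.0, sigma=7.0, two_sided=False))   # one-sided test
```

Smaller differences, stricter error rates, and higher power all push the required sample up; a one-sided test lowers it.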

24

X → Y

25

All the same, but different

• Problem formulation
– What are sensible questions?

• Theory-development and hypotheses
– “What do I expect to be the answer to my question, and what are the implications from the theory that I want to test” (nb: different in exploratory work)

• Choice of research design
– Experiment
– Survey
– Case study
– Participant observation
– …

• Data collection
– Designing questionnaires
– Designing experimental procedures
– Finding your respondents. Sampling (how and how many?)
– …

• Analysis of results
– Measurement: from raw data to measured constructs
– Relational claims: X → Y ?

• Conclusions
– What can we conclude?

[Diagrams: the pairs X1 → Y1, X2 → Y2, … shown three times, the second annotated “HOW?” and the third annotated “AND?”]

26

About $80 / hour

27

It is all about X → Y:

X                         Y
“white soccer player”     “height”
“being a woman”           “sensitive to alcohol”
“being bald”              “prob. of a heart-attack”
“left handed”             “die early”
“listen to Mozart”        “score higher on IQ-test”

Y = dependent variable, response variable, target variable, Y-variable, explanandum

X = independent variable, X-variable, predictor variable, explanans

Usually we want to say something like “X causes Y”, but often we have to settle for “X is related to Y”.

28

Survey vs experiment (Milgram)

Y = which voltage do you apply?

measured X's:
– subject is male
– subject is young

manipulated X's:
– experimenter wears white coat
– experimenter is older (vs young)

Experiment: researcher determines X
Survey: researcher measures X

29

X → Y

30

Kinds of variables (in case you forgot)

Categorical / Nominal
Two or more categories, without intrinsic ordering (ex.: “kind of movie”: action/drama/...). When only two categories, also called a binary variable (ex.: gender, “age over 40”, etc.)

Ordinal
Two or more categories, with intrinsic ordering (ex.: 5-point ratings such as never/sometimes/often/always, …)

Interval
Ordinal + intervals between values are evenly spaced (age, income, number of movies rented).

NB Not always easy to classify.

Categorical and interval are the most important (often ordinal are treated as either categorical or interval).

31

Statistics at UCLA {http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm}

Y X

32

Dealing with data

1. Import SPSS file

2. Check your data
• To get acquainted with it
• For outliers and coding errors

3. Determine the kind of analysis

4. Recode your data so that you have the variables in the appropriate format

5. Check the assumptions for the analysis of choice (1)

6. Run your analysis

7. Check the assumptions for the analysis of choice (2)

8. If necessary, back to 3. until CONCLUSION

33

Fact and fiction

Are white soccer players shorter?

34

Example data: soccer players

File: soccer_0AP03.sav. All players from WC2002.

Let’s see what the data looks like:

<to SPSS>

Variable view vs Data view
Run a “Frequencies”
Check histograms
Create new variables (Transform > Compute)
Recode variables (Transform > Recode)
Run analyses

USE SYNTAX FILES (*.SPS)!
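For readers who want to see the logic of these steps outside SPSS, the sketch below mimics a frequency table, a computed variable, and a recode in plain Python. It is not SPSS syntax, and the records and field names (`position`, `height_cm`, `tall`) are invented; they are not the variables in soccer_0AP03.sav.

```python
from collections import Counter

# Hypothetical records; the real file has different variables and values.
players = [
    {"name": "A", "position": "defender", "height_cm": 188},
    {"name": "B", "position": "forward", "height_cm": 176},
    {"name": "C", "position": "defender", "height_cm": 183},
    {"name": "D", "position": "keeper", "height_cm": 191},
]

# "Frequencies": count how often each category occurs.
position_counts = Counter(player["position"] for player in players)

for player in players:
    # "Compute": derive a new variable from an existing one.
    player["height_m"] = player["height_cm"] / 100
    # "Recode": collapse an interval variable into a binary one.
    player["tall"] = 1 if player["height_cm"] >= 185 else 0

print(position_counts)
print([(player["name"], player["tall"]) for player in players])
```

Keeping such steps in a script, like keeping SPSS commands in a syntax file, means the whole analysis can be rerun and checked later.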

35

Weekly not-on-the-exam fact

Suppose: You have a handful of numerical inputs and want to use these to predict some output.

For instance: chance of survival of a firm based on firm characteristics, probability of job success based on credentials, probability of surgery survival based on medical records, …

We compare experts in the field with computer models (both have the same amount of data).

Out of 160 studies of this kind, how often do the experts perform significantly better?

(sources: see “Super Crunchers” by Ayres)

[Diagram: several inputs feeding into one output]

36

To Do

Get familiar with SPSS: reading data, recoding variables, and running a t-test or a correlation. Especially recoding variables and the syntax window are important. You should be able to do the assignments on the web page fairly quickly.

Check chapters 1 through 4 (up to 4.5.4) of the Field-book for anything that looks unfamiliar to you.

Don’t wait until the last couple of weeks!

Add to the WIKIs
