0AP03: Methods and models in
behavioral research
Part 2: Understanding statistics
using SPSS (Field)
Chris Snijders
www.tue-tm.org/moodle
2
EXAMPLE: NETFLIX DVD RENTAL
3
Example: The Netflix Prize
www.netflixprize.com
$1,000,000
4
Example: the Netflix prize (3)
input = kind of previous rentals
input = number of previous rentals
input = day of the week
input = ...
output = extent to which you like a movie
5
Example: The Netflix Prize (2)
• Predict the extent to which a person will like a movie, from previous ratings by others.
• NB
– Measurement: Root Mean Square Error
– Large prizes!
– You have about 2 GB of data to work on ...
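As an illustration of the Root Mean Square Error named on this slide, here is a minimal sketch. The function name `rmse` and the ratings are made up for the example; they are not Netflix data.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings on a 1-5 scale (illustrative only).
predicted_ratings = [3.5, 4.0, 2.0, 5.0]
actual_ratings = [4, 4, 1, 5]
print(round(rmse(predicted_ratings, actual_ratings), 3))
```

A lower value means the predictions are closer to the actual ratings. Because the errors are squared, large misses weigh more heavily than small ones.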
6
0AP03: two parts
Blumberg et al.: Gerrit Rooks, Blocks A and B
Field: Chris Snijders, Blocks B and C (language: English)
7
Understanding statistics using SPSS
http://www.sagepub.co.uk/field/field.htm
- CD-ROM material
-- data sets
-- some software (G*Power)
- answers to (some) assignments in the book
- test banks (note: not identical to exam)
8
www.tue-tm.org/moodle
enrolment key = "fieldspss"
9
http://www.tue-tm.org/moodle
Course home-page
10
Let’s get acquainted …
• Technische InnovatieWetenschappen
– Bachelor’s
– Pre-Master program
• Technische Bedrijfskunde
– Bachelor’s
– Pre-Master program
= = =
Some key concepts:
- Stochastic variables, distributions, normal distribution
- SPSS usage (StatGraphics users?)
- Mean, median, skewness, kurtosis
- Correlation
- Simple regression: Y = a + b X
- Factor analysis
- A chi2 test
A: never heard of it
B: was a topic in previous lectures, but don’t ask me what it is or how to do it
C: was covered and understood
11
Understanding statistics using SPSS
About: Style
About: Content
1 About statistics
2 SPSS
3 Exploring data
4 Correlation
5 Multiple regression
6 Logistic Regression
7 The t-test
8 ANOVA
9-12, 14 More ANOVAs
13 Non-param. tests
15 Factor analysis
16 Chi2-tests etc
12
T-test, chi2-test
• We have two groups of students, one group that started early and worked regularly, one group that started late (in the last three lectures or later)
Are the grades of the students in the regular group higher? (t-test)
          Average   Max
REGULAR   6.3       8.4
LATE      3.0       3.9
Are the regular students more likely to pass the course? (chi-2 test)
          Pass   No
REGULAR   30     10
LATE      2      38
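The chi-square question can be worked through by hand for the pass/fail counts on this slide (30/10 for the regular group, 2/38 for the late group). This is a minimal sketch of a Pearson chi-square test without continuity correction, so SPSS output that applies a correction will differ slightly. The function name is illustrative.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (df = 1) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: REGULAR, LATE. Columns: pass, no pass (counts from the slide).
chi2, p_value = chi_square_2x2([[30, 10], [2, 38]])
print(round(chi2, 2), p_value)
```

With these counts the expected number of passes is 16 in each group, so the observed 30 versus 2 is far from what independence would predict.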
13
Exam for the Field-part
[tentative: check the course website later]
Chapters 1, 2, 3, 4, 7: assumed to be common knowledge
Chapters 5, 8, 15, <probably 9-12, 14, perhaps 6>
+ additional material supplied with the course (such as the PS software)
= = =
Exam on laptop:
1 – multiple choice questions
2 – you are given data and must be able to handle the data sensibly
14
The average (quantitative) paper …
• Problem formulation
– What are sensible questions?
• Theory-development and hypotheses
– “What do I expect to be the answer to my question, and what are the implications from the theory that I want to test” (nb: different in exploratory work)
• Choice of research design
– Experiment
– Survey
– Case study
– Participant observation
– …
• Data collection
– Designing questionnaires
– Designing experimental procedures
– Finding your respondents. Sampling (how and how many?)
– …
• Analysis of results
– Measurement: from raw data to measured constructs
– Relational claims: X → Y ?
• Conclusions
– What can we conclude, given our analyses?
15
About the course setup
• Mainly on the Moodle site; StudyWeb is only used to send mail to you
• “Do-it-yourself course”: getting up to speed with SPSS and keeping up with the material is up to you
– Extra material and links on the website
– Practice material for the exam
If you do not practice in between, you will not be able to pass the exam.
• Part 1 (Rooks): “Think, then do”
  Part 2 (me): “Do, then think”
• We have data, now what do we do? (and partly we collect these data from you)
• Hybrid setup:
– English/Dutch
– business administration / social sciences
16
THE ART OF SAMPLING
17
Sampling
We want conclusions about the population, but we only have (enough time and money to collect) data from part of the population, a sample.
From sample data to population statement: STATISTICAL INFERENCE
[Diagram: a sample drawn from the population]
18
Two parts to every analysis
• Calculate some property of the sample
– Mean (mean height of soccer players)
– Difference between the means of two groups (difference in height between two groups of soccer players)
– Correlation between two things measured (correlation between height and number of goals scored)
• Calculate a confidence interval around the property, creating a statement about the property in the population
[Diagram: a sample drawn from the population]
19
On sampling "analog cheese"
Analog cheese = palm oil + starch
"Keuringsdienst van Waarde" took a sample of 11 products and found 5 to contain "analog cheese"
Estimate of the percentage of products containing analog cheese = 5/11 = 45%
What is the (approximate) confidence interval?
A 40 – 50 %
B 32 – 58 %
C 25 – 65 %
D 17 – 77 %
E 9 – 81 %
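One way to check the answer options is to compute the interval directly. The sketch below uses two approximations: the usual normal approximation for a proportion and the cruder 1/sqrt(n) margin. With only 11 products both are rough, and the function names are illustrative.

```python
import math

def normal_approx_interval(successes, n, z=1.96):
    """Approximate 95% interval: p +/- z * sqrt(p * (1 - p) / n)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

def rule_of_thumb_interval(successes, n):
    """Cruder interval: p +/- 1 / sqrt(n)."""
    p = successes / n
    half_width = 1 / math.sqrt(n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# 5 of the 11 sampled products contained analog cheese.
for interval in (normal_approx_interval(5, 11), rule_of_thumb_interval(5, 11)):
    low, high = interval
    print(f"{low:.0%} to {high:.0%}")
```

Both calculations give an interval roughly 30 percentage points wide on either side of 45%, which shows how little a sample of 11 pins down.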
20
Applying the 1/sqrt(n) rule
You want to predict how many seats in parliament a certain Dutch political party will get. You allow for a range of plus or minus 2 seats. Say you expect the number of seats to be around 50.
You intend to call a representative sample of people. About how many do you need?
A 50
B 100
C 500
D 5,000
E 50,000
F more than 50,000
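The 1/sqrt(n) rule can be turned around to get a sample size: if the margin of error is about 1/sqrt(n), then n is about 1/margin². The sketch below assumes a parliament of 150 seats, which the slide does not state, and compares the rule of thumb with the normal approximation for a proportion.

```python
TOTAL_SEATS = 150      # assumption: total number of seats, not given on the slide
EXPECTED_SEATS = 50    # from the slide
ALLOWED_SEATS = 2      # plus or minus 2 seats, from the slide

# Express the allowed error and the expected result as shares of the vote.
margin = ALLOWED_SEATS / TOTAL_SEATS
share = EXPECTED_SEATS / TOTAL_SEATS

# Rule of thumb: margin ~ 1 / sqrt(n), so n ~ 1 / margin^2.
n_rule_of_thumb = 1 / margin ** 2

# Normal approximation: margin = 1.96 * sqrt(share * (1 - share) / n).
n_normal_approx = 1.96 ** 2 * share * (1 - share) / margin ** 2

print(round(n_rule_of_thumb), round(n_normal_approx))
```

Under the 150-seat assumption both estimates land in the same order of magnitude, a few thousand respondents.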
21
Some more sampling
Suppose you want to know, say, the percentage of people in The Netherlands who support the recent foreign policy of the US government. The Netherlands has 12,000,000 voters.
According to your (correct) calculations you need a sample of 2,000 people.
Now you want to do the same, but in France (population = 36,000,000 voters).
How large should your sample size be in France?
A less than 2,000
B about 2,000
C about 6,000
D more than 6,000
E you need more information

Rule of thumb: For large populations, the required sample size is independent of the population size
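The rule of thumb can be checked numerically. The sketch below computes the margin of error for a proportion with a finite population correction factor (N − n)/N, for the same sample of 2,000 in both countries. It assumes the worst case p = 0.5, and the function name is illustrative.

```python
import math

def margin_of_error(n, population_size, p=0.5, z=1.96):
    """Margin of error for a proportion, with finite population correction."""
    correction = (population_size - n) / population_size
    return z * math.sqrt(correction * p * (1 - p) / n)

netherlands = margin_of_error(2_000, 12_000_000)
france = margin_of_error(2_000, 36_000_000)
print(f"Netherlands: {netherlands:.5f}  France: {france:.5f}")
```

The correction factor is almost exactly 1 in both cases, so tripling the population leaves the margin of error practically unchanged.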
22
Explanation: Mean and variance of the mean
We measure x and get measurements x1, …, xn
Expectation and variance give the 95%-confidence-interval:
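\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

\[ \mathrm{Var}(\bar{x}) = \frac{N - n}{N} \cdot \frac{s_x^2}{n} \]

\[ s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad \text{(variance in the sample)} \]

x_i = measurement of x for unit i
population of size N
sample of size n

\[ \left[\, \bar{x} - 1.96 \sqrt{\mathrm{Var}(\bar{x})} \;,\; \bar{x} + 1.96 \sqrt{\mathrm{Var}(\bar{x})} \,\right] \]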
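The interval on this slide, the sample mean plus or minus 1.96 times the square root of the variance of the mean, can be computed step by step. The sketch below uses made-up heights and a made-up population size. For a sample this small a t-based multiplier would be more appropriate than 1.96, so treat it as an illustration of the formulas only.

```python
import math

def mean_confidence_interval(values, population_size, z=1.96):
    """Mean +/- z * sqrt(Var(mean)), with Var(mean) = (N - n)/N * s^2 / n."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with n - 1 in the denominator.
    s_squared = sum((x - mean) ** 2 for x in values) / (n - 1)
    var_of_mean = (population_size - n) / population_size * s_squared / n
    half_width = z * math.sqrt(var_of_mean)
    return mean - half_width, mean + half_width

# Hypothetical heights in cm and a hypothetical population of 736 players.
heights = [178, 182, 175, 190, 185, 180, 177, 183]
low, high = mean_confidence_interval(heights, population_size=736)
print(f"{low:.1f} to {high:.1f}")
```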
23
Sample size determined by:
Are white soccer players shorter?
• How precisely do you want to measure your statistic? [What height difference would you find interesting enough to report?]
• What probability of a Type I error will you allow (rejecting the H0-hypothesis when in fact it is true)? Usually 5%.
  [How small do you want the probability to be that you reject “(on average) black and non-black players are equally tall” when in fact it is true?]
• How likely do you want it to be that you will find an effect, assuming that it exists in the population? Power, usually 80% or 90%.
• One-sided or two-sided tests?
You need special-purpose software for this, for instance G*Power (on the disc), or PS
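To see how the four ingredients above interact, here is a rough sketch based on the normal approximation for comparing two group means: n per group ≈ 2 · ((z_alpha + z_power) · σ / δ)². This is not how G*Power or PS compute their answers (they use exact t-distributions), and the standard deviation and height difference below are invented for the example.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size per group for detecting a mean difference delta."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)   # critical value for the test
    z_power = NormalDist().inv_cdf(power)      # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

# Hypothetical: detect a 2 cm difference when heights have a 7 cm standard deviation.
print(n_per_group(delta=2, sigma=7))
print(n_per_group(delta=2, sigma=7, two_sided=False))
print(n_per_group(delta=2, sigma=7, power=0.90))
```

Smaller differences, stricter error rates, and higher power all push the required sample size up; a one-sided test pushes it down.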
24
X → Y
25
All the same, but different
• Problem formulation
– What are sensible questions?
• Theory-development and hypotheses
– “What do I expect to be the answer to my question, and what are the implications from the theory that I want to test” (nb: different in exploratory work)
• Choice of research design
– Experiment
– Survey
– Case study
– Participant observation
– …
• Data collection
– Designing questionnaires
– Designing experimental procedures
– Finding your respondents. Sampling (how and how many?)
– …
• Analysis of results
– Measurement: from raw data to measured constructs
– Relational claims: X → Y ?
• Conclusions
– What can we conclude?
[Diagram annotations next to the steps: "X1 → Y1, X2 → Y2, …", then "X1 → Y1, X2 → Y2: HOW?", then "X1 → Y1, X2 → Y2: AND?"]
26
About $80 / hour
27
It is all about X → Y:
X → Y
“white soccer player” → “height”
“being a woman” → “sensitive to alcohol”
“being bald” → “prob. of a heart-attack”
“left handed” → “die early”
“listen to Mozart” → “score higher on IQ-test”
Y = dependent variable / response variable / target variable / Y-variable / explanandum
X = independent variable / X-variable / predictor variable / explanans
Usually we want to say something like “X causes Y”, but often we have to settle for “X is related to Y”.
28
Survey vs experiment (Milgram)
Y = which voltage do you apply?
measured X's:
– subject is male
– subject is young
manipulated X's:
– experimenter wears white coat
– experimenter is older (vs young)
Experiment: researcher determines X
Survey: researcher measures X
29
X → Y
30
Kinds of variables (in case you forgot)
Categorical / Nominal
Two or more categories, without intrinsic ordering (e.g. “kind of movie”: action/drama/...)
When only two categories, also called a binary variable (e.g. gender, “age over 40”, etc.)
Ordinal
Two or more categories, with intrinsic ordering (e.g. 5-point ratings such as never/sometimes/often/always, …)
Interval
Ordinal + intervals between values are evenly spaced (age, income, number of movies rented).
NB Not always easy to classify.
Categorical and interval are the most important (ordinal variables are often treated as either categorical or interval).
31
Statistics at UCLA {http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm}
Y X
32
Dealing with data
1. Import SPSS file
2. Check your data
• To get acquainted with it
• For outliers and coding errors
3. Determine the kind of analysis
4. Recode your data so that you have the variables in the appropriate format
5. Check the assumptions for the analysis of choice (1)
6. Run your analysis
7. Check the assumptions for the analysis of choice (2)
8. If necessary, back to 3. until CONCLUSION
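Steps 2 and 4 of this workflow (checking for coding errors, then recoding) can be sketched outside SPSS as well. The records, field names, plausibility range, and cutoff below are all invented for illustration; in the course itself you would do this with SPSS syntax.

```python
# Hypothetical records; field names and values are illustrative only.
players = [
    {"id": 1, "height_cm": 181},
    {"id": 2, "height_cm": 176},
    {"id": 3, "height_cm": 999},  # looks like a missing-value code left in the data
    {"id": 4, "height_cm": 18},   # looks like a typing error
    {"id": 5, "height_cm": 190},
]

PLAUSIBLE_HEIGHTS = range(150, 221)  # assumed plausible range in cm

def split_valid(records):
    """Step 2: separate plausible records from suspect ones."""
    valid = [r for r in records if r["height_cm"] in PLAUSIBLE_HEIGHTS]
    suspect = [r for r in records if r["height_cm"] not in PLAUSIBLE_HEIGHTS]
    return valid, suspect

def recode_tall(records, cutoff_cm=180):
    """Step 4: add a binary variable 'tall' (1 = at or above the cutoff)."""
    return [{**r, "tall": int(r["height_cm"] >= cutoff_cm)} for r in records]

valid, suspect = split_valid(players)
recoded = recode_tall(valid)
print([r["id"] for r in suspect])
print([(r["id"], r["tall"]) for r in recoded])
```

Inspecting the suspect records before discarding them matters: a value like 999 is usually a missing-value code, while 18 may be a recoverable typo.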
33
Fact and fiction
Are white soccer players shorter?
34
Example data: soccer players
File: soccer_0AP03.sav. All players from WC2002.
Let’s see what the data looks like:
<to SPSS>
Variable view vs Data view
Run a “Frequencies”
Check histograms
Create new variables (Transform > Compute)
Recode variables (Transform > Recode)
Run analyses
USE SYNTAX FILES (*.SPS)!
35
Weekly not-on-the-exam fact
Suppose: You have a handful of numerical inputs and want to use these to predict some output.
For instance: chance of survival of a firm based on firm characteristics, probability of job success based on credentials, probability of surgery survival based on medical records, …
We compare experts in the field with computer models (both have the same amount of data).
Out of 160 studies of this kind, how often do the experts perform significantly better?
(sources: see “Super Crunchers” by Ayres)
36
To Do
Get familiar with SPSS: reading data, recoding variables, and running a t-test or a correlation. Especially recoding variables and the syntax window are important. You should be able to do the assignments on the web page fairly quickly.
Check chapters 1 through 4 (up to 4.5.4) of the Field-book for anything that looks unfamiliar to you.
Don’t wait until the last couple of weeks!
Add to the WIKIs