Predictive Tests

Overview

• Introduction
• Some theoretical issues
– The failings of human intuitions in prediction
– Issues in formal prediction
– Inference from class membership: The individual versus group problem (and its only solution)
• Some predictive tests

Predictive Tests

• Many tests are used to make predictions: of levels of achievement or success, of likelihood of recidivism, or of diagnostic category

• Two kinds of predictions:
– Categorical: predict which category a subject will fall into (diagnosis, occupation)
– Numerical: predict the value of a relevant numerical variable (GPA, economic return to company)

The failings of human intuition

• We have already seen many ways in which humans succumb to errors in numerical reasoning

• Kahneman & Tversky: Asked subjects about areas of graduate specialization: base rate estimation, estimates (from a description) of similarity to other students in each field, and predictive estimate (also from a description)

Results

• Results:
– Similarity and prediction correlate at 0.97
– Similarity and base rates correlate at -0.65
– What does this result remind you of?
– What do these subjects need to be taught?

6 Errors discussed by Kahneman & Tversky

• Representativeness error: Assumes predictions are not different from assessments of similarity

• Insufficient regression error: People fail to take into account that when predictive validity is less than perfect (the correlation between predictor and performance is < 1), predictions should be regressed toward the mean

• Central tendency error: Subjects making judgments tend to avoid extremes, and compress their judgments into a smaller range than the phenomenon being judged
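The insufficient regression error can be made concrete with the standard least-squares rule for standardized scores: the predicted criterion z-score is the predictor z-score multiplied by the validity coefficient. The sketch below assumes that rule; the function name and the example numbers are hypothetical and do not come from Kahneman & Tversky's data.

```python
def regressed_prediction(z_predictor: float, validity: float) -> float:
    """Least-squares prediction of a criterion z-score from a predictor z-score.

    validity is the correlation between predictor and criterion.
    """
    if not -1.0 <= validity <= 1.0:
        raise ValueError("validity must be a correlation in [-1, 1]")
    return validity * z_predictor


# Hypothetical case: a score 2 SDs above the mean on a test with validity 0.4.
# The best prediction is well short of 2 SDs above the mean on the criterion.
print(regressed_prediction(2.0, 0.4))
```

Only with a validity of 1.0 would the predicted score be as extreme as the observed one; with a validity of 0 the best prediction is the mean.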

6 Errors discussed by Kahneman & Tversky

• Discounting of prior probabilities: Human predictors will throw out base rate info for almost any reason

• Overweighting of coherence: There is greater confidence in predictions based on consistent input than on inconsistent input with the same average (e.g. two B's are treated as a better basis than an A & a C for predicting a B average)

• Overweighting of extremes: Confidence in judgment is over-weighted at extremes, especially positive extremes (= j-shaped confidence function)

What do we need to make good predictions?

• We need three pieces of information:
– 1.) Base rates
– 2.) Relevant predictors in the individual case
– 3.) Bounds on accuracy (cutting scores)
– Kahneman & Tversky's experimental evidence (previous slides) shows that subjects usually fail to weight any of these three properly
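One way to see how base rates and individual evidence combine is Bayes' rule. The sketch below is an illustration under assumed numbers: the function name, the 5% base rate, and the two likelihoods are hypothetical and are not taken from the study.

```python
def posterior(base_rate: float, p_desc_given_field: float,
              p_desc_given_other: float) -> float:
    """P(field | description) by Bayes' rule for a two-way split (field vs. not)."""
    numerator = base_rate * p_desc_given_field
    denominator = numerator + (1.0 - base_rate) * p_desc_given_other
    return numerator / denominator


# Hypothetical: 5% of students are in the field, and the description is
# four times as likely for a student in the field (0.8) as for one outside it (0.2).
print(round(posterior(0.05, 0.8, 0.2), 3))
```

Even though the description fits the field well, the low base rate keeps the probability under 20%, which is the information subjects discard when they predict from similarity alone.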

Measuring validation error [Repeat slide]

• Coefficient of alienation: $k = \sqrt{1 - r^2}$, where r is the correlation of the test score with some predicted performance

• k = the proportion of the error you would make by guessing that remains in your estimate

• If k = 1.0, you have 100% of the error you’d have had if you just guessed (since this means your r was 0)

• If k = 0, you have achieved perfection = your r was 1, and there was no error at all*

• If k = 0.6, you have 60% of the error you’d have had if you guessed

* N.B. This never happens.
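The coefficient of alienation is easy to tabulate. This is a minimal sketch of the formula k = (1 - r^2)^0.5; the function name is an assumption chosen for illustration.

```python
import math


def alienation(r: float) -> float:
    """Coefficient of alienation k for a validity correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be a correlation in [-1, 1]")
    return math.sqrt(1.0 - r * r)


# Tabulate k for a range of validity coefficients.
for r in (0.0, 0.6, 0.7, 0.95, 1.0):
    print(f"r = {r:.2f} -> k = {alienation(r):.2f}")
```

Because k depends on r squared, it falls slowly at first: r has to be quite high before the remaining error shrinks much.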

Why should we care? [Repeat slide]

• We care because k is useful in interpreting the accuracy of an individual’s scores
– r = 0.6 (good), k = 0.80 (not good)
– r = 0.7 (great), k = 0.71 (not so great)
– r = 0.95 (fantastic!), k = 0.31 (so so)
– Since even high values of r give us fairly large error margins, the prediction of any individual’s criterion score is always accompanied by a wide margin of error
– The moral: Predicting individual performance is really hard to do!
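To put the margin of error in criterion units, k can be multiplied by the criterion's standard deviation, which gives the standard error of estimate. The sketch below assumes roughly normal prediction errors for the 95% band; the predicted GPA of 3.0 and the SD of 0.5 are hypothetical values chosen only for illustration.

```python
import math


def standard_error_of_estimate(sd_criterion: float, r: float) -> float:
    """Spread of actual criterion scores around the predicted score."""
    return sd_criterion * math.sqrt(1.0 - r * r)


def prediction_interval(predicted: float, sd_criterion: float, r: float,
                        z: float = 1.96) -> tuple[float, float]:
    """Approximate band around a prediction (z = 1.96 for about 95%)."""
    half_width = z * standard_error_of_estimate(sd_criterion, r)
    return predicted - half_width, predicted + half_width


# Hypothetical: predicted GPA of 3.0, GPA standard deviation of 0.5, validity 0.6.
low, high = prediction_interval(3.0, 0.5, 0.6)
print(f"{low:.2f} to {high:.2f}")
```

With these assumed numbers the band covers most of the usable GPA range, which is the sense in which a "good" r still leaves an individual prediction very uncertain.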

What can we infer from class membership?

• Some commentators have suggested that inference from class membership is inherently fallacious
– e.g. 25% of first-degree relatives of those diagnosed with malignant melanoma (skin cancer) will also develop melanoma
– I am a first-degree relative of a person diagnosed with melanoma, so I take my odds of developing the disease to be 25%
– Critics of the inference say: No, it is either 0% (I don't develop the disease) or 100% (I do), i.e. group probabilities don't apply to individuals

Do group probabilities apply to individuals?

• Meehl's response: "If nothing is rationally inferable from membership in a class, no empirical prediction is ever possible"
– The argument is a re-statement of the necessity of inference: even in the case of predicting individual behavior from that individual's data, we need to consider the pattern over past data
– Moreover, the claim of 'certainty' is philosophical, not real: in the absence of knowing which group you are in, there is only probability, not knowledge
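The point can be checked numerically. The sketch below scores a forecast by its expected squared error (the Brier score), which is an assumption introduced here for illustration and not part of Meehl's argument; the 25% rate is the melanoma example from the previous slide.

```python
def expected_brier(forecast: float, true_rate: float) -> float:
    """Expected squared error of one fixed forecast applied to every group member.

    A member develops the condition (outcome 1) with probability true_rate,
    otherwise the outcome is 0. Lower is better.
    """
    return true_rate * (1.0 - forecast) ** 2 + (1.0 - true_rate) * forecast ** 2


# Without knowing who will develop the disease, compare three forecasts
# for a group whose rate is 25%.
for forecast in (0.0, 0.25, 1.0):
    print(forecast, expected_brier(forecast, 0.25))
```

Stating the group rate gives a lower expected error than insisting on 0% or 100%, so for someone who does not know their own outcome the group probability is the most accurate statement available.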

Some Predictive Tests

• Scholastic Aptitude Tests (SAT, GREs)
– Highly reliable tests (0.9) developed to painstaking psychometric standards
– One strength is that they have a specific regression equation for each college, i.e. they can predict future performance at a particular college independently

Some Predictive Tests

• Scholastic Aptitude Tests (SAT, GREs)
– How well do they do?
  • SAT: r = 0.4 with university GPA
    – By comparison, high school grade r = 0.48
    – Together, r = 0.55
  • GRE: correlations between various combinations of GRE scores and grad school performance are only between 0.25 and 0.35, and only marginally better (0.4) if you include undergraduate grades
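Two predictors rarely add their validities because they overlap with each other. The sketch below uses the standard formula for the multiple correlation with two predictors. The intercorrelation between SAT and high school grades is not given in the slides, so the value 0.3 is purely an assumption, chosen to show that a modest overlap is enough to produce a combined r near 0.55.

```python
import math


def multiple_r(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r1, r2: each predictor's correlation with the criterion.
    r12: correlation between the two predictors.
    """
    if abs(r12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)
    return math.sqrt(r_squared)


# r1 = 0.4 (SAT) and r2 = 0.48 (high school grades) are from the slide;
# r12 = 0.3 is an assumed value, not a reported one.
print(round(multiple_r(0.4, 0.48, 0.3), 2))
# With fully independent predictors (r12 = 0) the combined r would be higher.
print(round(multiple_r(0.4, 0.48, 0.0), 2))
```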

Can you beat the standards?

• Notwithstanding the huge industry waiting to take money from anxious high school students, studying for the SAT doesn't help much
– SAT coaching increases scores by about 15 points, which is 0.15 SDs
– Repeat testing increases it a little less, about 12 points or 0.12 SDs

Some Predictive Tests

• Professional school tests (MCAT, GRE subject tests, LSAT)
– MCAT: r = low 0.80s (reliability)
– LSAT: r > 0.9 (reliability)
– There is relatively little evidence of validity
  • They predict performance about as well as undergraduate GPA alone: r = 0.25 - 0.3

Some Predictive Tests

• The Strong (1927) Interest Inventory (Strong-Campbell, 1981): widely used test of interests as predictors of professional aptitude
– Empirically constructed with concurrent validity, comparing each vocational group to the overall average
– Has 325 items, 162 scales covering 85 occupations
– Reliability is high
  • 0.9+ test/retest over weeks; 0.6-0.7 over years, unless test-takers were old (= 25+ years!) at first test, in which case 0.8+ even after 20 years
– Does not predict success or satisfaction in a profession
– Does predict likelihood of entering and remaining in a profession: chances are 50% that a person will end up in a profession most strongly predicted (A score), and only 12% that they will end up in one least predicted (C score)

Prediction in scientific psychology

• Prediction & scientific explanation are related
– We admire Newton's laws precisely because they are accurate in predicting real phenomena
– Many cognitive models in psychology are purely descriptive: they fail to make an effort to predict how a person will perform on unseen stimuli
– There are many ways to do so, if you have sufficient variation in predictors: multiple regression, neural networks, 'cheap' methods (e.g. best single predictor)

Predicting lexical decision RTs

• Lexical decision (= time to decide if a string is a word or not) is a simple task to perform

• Many well-specified variables can be calculated for words: frequency, similarity to other words, frequency of components
– This allows for predictive testing: How well can we predict how long it will take (average reaction time = RT) to reach a decision about wordness?

• We used 35 predictors, and a non-linear method of combining them (genetic programming) to predict average RTs

Predictors: correlation with fitness RTs, and use by the models (one * per model that includes the predictor; the model columns are MULTIPLE REGRESSION, ANN CI, GP EXP. 1, GP EXP. 2, GP EXP. 2 SIMPLIFIED, GP EXP. 2 SIMPLER)

PREDICTOR        CORRELATION   MODELS
DISTANCE          0.43         * * * * * *
PNSIZEHIGHER      0.31         * * *
ONSIZELOWER      -0.30         * * *
PNSIZELOWER      -0.23         *
B1NSIZELOWER     -0.21         *
ONFREQLOWER      -0.21         * * *
OFREQ            -0.20         * * * * *
NOSELFONSIZE     -0.13         * *
NODUP-ONSIZE     -0.13         * * * *
PNFREQLOWER      -0.12         *
ONFREQHIGHER     -0.10         *
B1NFREQLOWER     -0.09         *
PFREQ            -0.08
B1NFREQ          -0.08         *
B1NFREQHIGHER    -0.07         *
BG-FREQ           0.05         *
NODUPPNFREQ      -0.04         * *
B1NSIZE          -0.04         *
PN               -0.04         * *

Model performance

MEASURE                    MULTIPLE REGRESSION   ANN CI     GP EXP. 1   GP EXP. 2   GP EXP. 2 SIMPLIFIED   GP EXP. 2 SIMPLER
FITNESS CORRELATION        0.51                  0.87       0.77        0.56        0.54                   0.52
TEST CORRELATION           0.54                  0.42       0.57        0.52        0.54                   0.59
0.5 Z-SCORE BIN HIT RATE   22 (15%)              34 (24%)   40 (27%)    38 (26%)    44 (30%)               44 (30%)
0.5 Z-SCORE BIN R          0.42                  0.71       0.85        0.93        0.92                   0.92

Some lessons about scientific prediction

• Models can 'cheat' by using variance in the input data set that does not transfer to unseen data, so you must test your predictions on unseen data
– Some models that are very good may be very good precisely because they are very good at using this 'within-set' variation

• Very simple (3-variable) non-linear models may do as well as or better than much more complex models, especially linear models, and may exclude highly-correlated variables

• Different measures of successful prediction may yield quite different results (e.g. test correlation versus 0.5 SD correlation)
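The first lesson can be demonstrated with pure noise. The sketch below is not the genetic-programming study; it is a hypothetical simulation in which 35 random predictors have no real relation to the target, the one that correlates best with the target in a small fitting set is selected, and that predictor is then checked on unseen cases. All names, sample sizes, and the seed are assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def best_noise_predictor(n_fit=30, n_test=2000, n_predictors=35, seed=0):
    """Select the best of several pure-noise predictors, then test it on unseen data."""
    rng = random.Random(seed)
    n = n_fit + n_test
    target = [rng.gauss(0, 1) for _ in range(n)]
    predictors = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_predictors)]
    # Correlation of each predictor with the target on the fitting cases only.
    fit_rs = [pearson(p[:n_fit], target[:n_fit]) for p in predictors]
    best = max(range(n_predictors), key=lambda i: abs(fit_rs[i]))
    # Correlation of the selected predictor with the target on the held-out cases.
    test_r = pearson(predictors[best][n_fit:], target[n_fit:])
    return fit_rs[best], test_r


fit_r, test_r = best_noise_predictor()
print(f"fitting-set r = {fit_r:.2f}, unseen-data r = {test_r:.2f}")
```

The selected predictor looks useful on the fitting set only because it was chosen for fitting that set; on unseen cases its correlation falls back toward zero, which is why the fitness and test correlations have to be reported separately.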
