Categories and Concepts: Computational models (part 1)
(transcript of lecture slides)
PSYCH-GA 2207
Brenden Lake
“Verbally expressed statements are sometimes flawed by internal inconsistencies, logical contradictions, theoretical weaknesses and gaps. A running computational model, on the other hand, can be considered as a sufficiency proof of the internal coherence and completeness of the ideas it is based upon.” (Fum, Del Missier, & Stocco, 2007)
Why build a computational model?
Some famous psychological theories…
• Attention is like a spotlight
• A child learning about the world is like a scientist theorizing about science
• Language influences thought
• Working memory has 7 +/- 2 slots to store information
• Categorization happens by comparing novel instances to past exemplars
• Categories influence perception
Each of these theories benefits from formalization with a computational model to…
• Make predictions explicit
• Implications often defy expectations
• Aid communication between scientists
• Support cumulative progress
My view echoes Richard Feynman’s “What I cannot create, I do not understand.”
Agenda for the next 2 classes
Two influential computational models
• ALCOVE — “ALCOVE: An exemplar-based connectionist model of category learning”, Kruschke (1992)
• Rational Model — “The adaptive nature of human categorization”, Anderson (1991)
And some comments on category learning in modern AI
Like the Context Model, ALCOVE is an exemplar model.
[Figure: a transfer item y (“Bird?”) compared against stored exemplars x ∈ C (“Birds You’ve Seen”); example stimuli from Category A and Category B]
Review of the “Context Model”: similarity between two items x and y
(Medin & Schaffer, 1978)

[Figure: a transfer item y and training exemplars from Category A and Category B, each defined over four binary dimensions D1–D4; y mismatches exemplar x on two of the four dimensions]

sim(y, x) = ∏_{D_i} m^{1[x_i ≠ y_i]} = m · 1 · m · 1 = 0.09

where m is a “mismatch” free parameter, here m = 0.3.
Review: Context model

Item similarity: sim(y, x) = ∏_{D_i} m^{1[x_i ≠ y_i]}, with “mismatch” free parameter m

Category similarity: sim(y, C) = ∑_{x ∈ C} sim(y, x)

Probability of classification: P(y ∈ C) = sim(y, C) / ∑_{C′} sim(y, C′)

“classification is based on similarity to all exemplars in a class” (Medin & Schaffer, 1978)

[Figure: a transfer item y (“Bird?”) compared against the stored exemplars x ∈ C (“Birds You’ve Seen”)]
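The Context model’s three equations can be sketched in a few lines of Python (a minimal illustration; the example stimuli and the value m = 0.3 follow the worked example above, while the function names are my own):

```python
import numpy as np

def item_sim(y, x, m=0.3):
    # Context model: multiply in one factor of m for every mismatching dimension
    return float(np.prod(np.where(np.array(y) != np.array(x), m, 1.0)))

def p_in_category(y, categories, m=0.3):
    # categories: dict mapping label -> list of stored exemplars
    cat_sim = {c: sum(item_sim(y, x, m) for x in xs)
               for c, xs in categories.items()}
    total = sum(cat_sim.values())
    return {c: s / total for c, s in cat_sim.items()}

# Hypothetical four-dimensional binary stimuli
categories = {"A": [(1, 1, 1, 1), (1, 1, 1, 0)],
              "B": [(0, 0, 0, 0), (0, 0, 1, 1)]}
y = (1, 1, 0, 0)
print(item_sim(y, (1, 1, 1, 1)))  # two mismatches: 0.3 * 0.3 = 0.09
print(p_in_category(y, categories))
```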
ALCOVE’s similarity between two items x and y

Context model’s item similarity (assuming we have a different mismatch parameter m_i for each dimension):
sim(y, x) = ∏_{D_i} m_i^{1[x_i ≠ y_i]}

Recall this identity: ∏_{i=1}^N z_i = e^{∑_{i=1}^N log z_i}

ALCOVE’s item similarity (also works for continuous features):
sim(y, x) = e^{∑_{D_i} log(m_i) 1[x_i ≠ y_i]} = e^{−∑_{D_i} α_i |x_i − y_i|}

with α_i = −log m_i (equivalent for binary dimensions, by the identity above; parameters c, q, and r not shown).
[Figure: transfer item y and exemplars from Category A and Category B over binary dimensions D1–D4; y mismatches exemplar x on two of the four dimensions]

ALCOVE’s similarity between two items x and y:
sim(y, x) = e^{∑_{D_i} log(m_i) |x_i − y_i|} = e^{−1.20 + 0 − 1.20 + 0} = 0.09

which matches the Context model’s sim(y, x) = ∏_{D_i} m^{1[x_i ≠ y_i]} = m · 1 · m · 1 = 0.09 with mismatch parameter m = 0.3 (note log 0.3 ≈ −1.20).
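A quick numeric check (assuming m = 0.3 and two mismatching dimensions, as in the example) confirms that the product form and the exponential form give the same similarity:

```python
import math

m = 0.3
alpha = -math.log(m)          # about 1.20
mismatches = [1, 0, 1, 0]     # |x_i - y_i| on four binary dimensions

context = m ** sum(mismatches)                          # product form
alcove = math.exp(-sum(alpha * d for d in mismatches))  # exponential form
print(context, alcove)  # both 0.09 (up to floating point)
```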
ALCOVE

Item similarity (discrete or continuous):
sim(y, x) = e^{−∑_{D_i} α_i |x_i − y_i|}
key parameter: attention weights α_i

Category similarity:
sim(y, C) = ∑_{x ∈ X} w_{cx} sim(y, x)
key parameter: association weights w_{cx} between exemplar x and category c; X is the set of all exemplars across all categories

Probability of classification:
P(y ∈ C) = e^{φ sim(y,C)} / ∑_{C′} e^{φ sim(y,C′)}
φ: response “temperature”

[Figure: a transfer item y (“Bird?”) compared against all stored exemplars x ∈ X (“Birds You’ve Seen”)]
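A minimal sketch of this forward pass in Python (hypothetical stimuli and weights; the specificity parameter c and exponents q, r of the full model are fixed at 1 here):

```python
import numpy as np

def alcove_probs(y, exemplars, alpha, w, phi=2.0):
    """One ALCOVE forward pass (sketch)."""
    sims = np.exp(-np.abs(exemplars - y) @ alpha)   # sim(y, x) for every exemplar
    cat_sim = {c: float(w[c] @ sims) for c in w}    # sim(y, C), weighted over ALL exemplars
    exps = {c: np.exp(phi * s) for c, s in cat_sim.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}      # softmax response rule

# Hypothetical 3-dimensional stimuli (color, texture, slash)
exemplars = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
alpha = np.array([1.0, 0.2, 0.2])                   # attention concentrated on dim 1
w = {"A": np.array([1.0, 1.0, -1.0, -1.0]),         # association weights w_cx
     "B": np.array([-1.0, -1.0, 1.0, 1.0])}
print(alcove_probs(np.array([1.0, 1.0, 1.0]), exemplars, alpha, w))
```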
Context model:
Item similarity: sim(y, x) = ∏_{D_i} m^{1[x_i ≠ y_i]} (“mismatch” free parameter m)
Category similarity: sim(y, C) = ∑_{x ∈ C} sim(y, x) (only considers exemplars x assigned to C)
Probability of classification: P(y ∈ C) = sim(y, C) / ∑_{C′} sim(y, C′)

ALCOVE:
Item similarity: sim(y, x) = e^{−∑_{D_i} α_i |x_i − y_i|} (attention weights α_i)
Category similarity: sim(y, C) = ∑_{x ∈ X} w_{cx} sim(y, x) (association weights w_{cx} between exemplar x and category c)
Response rule (category A vs. category B): P(y ∈ A) = e^{φ sim(y,A)} / (e^{φ sim(y,A)} + e^{φ sim(y,B)})

y denotes the current stimulus; x denotes one of the exemplars.
How does ALCOVE learn?
Learning is incremental fitting of the attention weights and association weights, by optimizing an objective function.

Let’s use log-likelihood as our objective function:
F(w, α) = ∑_{x ∈ A} log P_{w,α}(x ∈ A) + ∑_{x ∈ B} log P_{w,α}(x ∈ B)

Then the optimal parameters are given by:
argmax_{w,α} F(w, α)

(“Maximum likelihood” is also the objective function for linear regression, logistic regression, and many other stats/machine learning algorithms.)
(Note that this is a different objective function than the one used in the original ALCOVE paper.)
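As a sketch, the log-likelihood objective can be computed directly from the model’s classification probabilities (the helper name and example numbers are illustrative, not from the slides):

```python
import numpy as np

def log_likelihood(probs_A, labels):
    """F = sum of log P(A) over category-A items plus log P(B) over category-B items.

    probs_A: the model's P(x in A) for each training item
    labels:  True if the item belongs to category A, else it belongs to B
    """
    probs_A = np.asarray(probs_A)
    labels = np.asarray(labels)
    return float(np.sum(np.where(labels, np.log(probs_A), np.log(1 - probs_A))))

print(log_likelihood([0.9, 0.8, 0.3], [True, True, False]))
```

F is always ≤ 0, and approaches 0 as the model assigns probability near 1 to every correct label.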
ALCOVE applied to SHJ Type I category
(Shepard, Hovland, & Jenkins, 1961)

Stimuli vary on three binary dimensions (color, texture, slash) and belong to category A or category B.

Network BEFORE training:
[Figure: network diagram with current stimulus y at the input, an exemplar layer gated by attention weights (α_color, α_texture, α_slash) = (0.33, 0.33, 0.33), association weights w_Ax all initialized to 0.0, exemplar similarities sim(y, x) ranging from 1.0 down to about 0.01, and response P(y ∈ A) = 0.5]

sim(y, x) = e^{−∑_{D_i} α_i |x_i − y_i|}
sim(y, C) = ∑_{x ∈ X} w_{cx} sim(y, x)
Maximizing likelihood via gradient ascent
(the workhorse of modern machine learning, and often cognitive modeling too; equivalently, gradient descent on −F)

F(w, α) = ∑_{x ∈ A} log P_{w,α}(x ∈ A) + ∑_{x ∈ B} log P_{w,α}(x ∈ B)

Computing the gradient tells us which direction to go for steepest ascent:
w ← w + γ ∂F/∂w
γ: learning rate (small constant)

[Figure: the objective F plotted against a weight w, with the gradient ∂F/∂w at the initial weight pointing uphill toward the likelihood maximum]
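A minimal sketch of the update rule, using a finite-difference gradient on a toy one-parameter log-likelihood rather than ALCOVE itself (the model and data counts are made up for illustration):

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

# Toy log-likelihood: 7 successes and 3 failures under p = sigmoid(w)
def F(w):
    p = sigmoid(w)
    return 7 * math.log(p) + 3 * math.log(1 - p)

w, gamma, eps = 0.0, 0.1, 1e-6
for _ in range(200):
    grad = (F(w + eps) - F(w - eps)) / (2 * eps)  # finite-difference gradient
    w += gamma * grad                             # ascent step: w <- w + gamma * dF/dw
print(sigmoid(w))  # converges toward the maximum-likelihood value, p = 0.7
```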
ALCOVE after one gradient step: the response is a little better, since y is in A!
[Figure: same network. Attention weights still (0.33, 0.33, 0.33); a few association weights w_Ax nudged away from zero (e.g., 0.015, 0.002); response P(y ∈ A) = 0.503]

sim(y, x) = e^{−∑_{D_i} α_i |x_i − y_i|}
sim(y, C) = ∑_{x ∈ X} w_{cx} sim(y, x)
ALCOVE after many more gradient steps: the response is now nearly perfect.
[Figure: same network. Attention weights (α_color, α_texture, α_slash) = (0.544, 0.0, 0.0), so attention only looks at color; association weights w_Ax = ±1.14; exemplar similarities sim(y, x) of 1.0 or 0.03; response P(y ∈ A) = 0.99]

sim(y, x) = e^{−∑_{D_i} α_i |x_i − y_i|}
sim(y, C) = ∑_{x ∈ X} w_{cx} sim(y, x)
Solutions found for SHJ problems

Types I and II:
[Figure: learned network weights over dimensions color, texture, slash. Type I: attention weights (α_color, α_texture, α_slash) = (0.544, 0.0, 0.0); association weights w_Ax = ±1.14. Type II: attention weights (0.61, 0.0, 0.61); association weights ±1.51]

Types III and VI:
[Figure: learned network weights. Type III: attention weights ≈ (0.399, 0.385, …); association weights ≈ ±1.6 to ±1.7. Type VI: attention weights (0.62, 0.62, 0.62); association weights ±1.59]
[Scanned excerpts from Nosofsky, Gluck, Palmeri, McKinley, & Glauthier (1994). Method: the procedure followed Shepard et al. (1961); learning continued until a criterion of four consecutive sub-blocks of 8 trials with no errors, or for a maximum of 400 trials (25 blocks of 16 trials). Results: Table 1 and Figure 2 report the observed average error probabilities for problem Types I–VI in each block of 16 trials; Figure 6 shows the learning curves for the six problem types predicted by ALCOVE; Figure 7 shows the learning curves predicted by the configural cue model.]
Comparing learning curves for people and ALCOVE
(from Nosofsky et al.’s (1994) SHJ replication in Memory & Cognition)
Attentional learning is critical
[Figure: learning curves for ALCOVE with attention weights vs. ALCOVE without attention weights, which gets the Type II order wrong]
What about a standard neural network?
[Figure: a standard multi-layer network maps the inputs (color, texture, slash) through a hidden layer (not exemplars) to a Category A vs. B output, in contrast to ALCOVE’s exemplar layer]
ALCOVE with attention weights vs. a standard multi-layer neural network (which gets the Type II order wrong and the Type IV order very wrong)
What about a standard neural network?
[Figure: activation profile for an ALCOVE hidden unit vs. activation profile for a standard neural net hidden unit, over the dimensions color, texture, slash]
Standard nets prefer linearly separable categories, while ALCOVE prefers strong dimensional alignment.
Only Type I and Type IV are linearly separable, and thus they are learned first by the standard net.
As predicted by ALCOVE, people don’t necessarily favor linearly separable categories:
• linearly separable condition: people had 39.5 errors on average
• non-linearly separable condition: people had 38.0 errors on average
(criterion was two error-free passes through the stimuli)
Interlude: Bayes’ rule for updating beliefs in light of data

Data (D): John is coughing
Hypotheses:
h1 = John has a cold
h2 = John has emphysema
h3 = John has a stomach flu

Which hypotheses should we believe, and with what certainty? We want to calculate the posterior probabilities P(h1|D), P(h2|D), and P(h3|D).

Bayes’ rule:
P(h_i | D) = P(D | h_i) P(h_i) / ∑_j P(D | h_j) P(h_j)
(posterior = likelihood × prior, normalized)

This example is from Josh Tenenbaum.
Bayesian inference
Data (D): John is coughing
Hypotheses: h1 = John has a cold; h2 = John has emphysema; h3 = John has a stomach flu

P(h_i | D) = P(D | h_i) P(h_i) / ∑_j P(D | h_j) P(h_j)

Prior favors h1 and h3 over h2: P(h1) = .75, P(h2) = .05, P(h3) = .2
Likelihood favors h1 and h2 over h3: P(D|h1) = 1, P(D|h2) = 1, P(D|h3) = .2
Posterior favors h1 over h2 and h3:
P(h1|D) = .75(1) / (.75(1) + .05(1) + .2(.2)) = .89
P(h2|D) = .06
P(h3|D) = .05
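The same arithmetic in a few lines of Python (values taken from the slide):

```python
priors = {"cold": 0.75, "emphysema": 0.05, "stomach flu": 0.20}
likelihoods = {"cold": 1.0, "emphysema": 1.0, "stomach flu": 0.20}  # P(coughing | h)

unnorm = {h: likelihoods[h] * priors[h] for h in priors}
z = sum(unnorm.values())                       # P(D), the normalizing constant
posterior = {h: p / z for h, p in unnorm.items()}
print(posterior)  # cold ~ 0.89, emphysema ~ 0.06, stomach flu ~ 0.05
```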
Base rate neglect and ALCOVE
(Experiment from Gluck and Bower, 1988)

There is a common disease and a rare disease with certain base rates:
P(C) = 0.75, P(R) = 0.25

The diseases lead to symptoms with a certain pattern, with s1 arising often in cases of the rare disease:

Symptom   P(symptom | R)   P(symptom | C)
s1        0.6              0.2
s2        0.4              0.3
s3        0.3              0.4
s4        0.2              0.6

e.g., P(s1 | R) = 0.6 and P(s1 | C) = 0.2

Because the likelihoods and the base rates pull in opposite directions, the presence of s1 alone is equally consistent with either disease:
P(R | s1) = P(s1 | R) P(R) / (P(s1 | R) P(R) + P(s1 | C) P(C)) = 0.5, and likewise P(C | s1) = 0.5
Base rate neglect

Normative: P(R | s1) = P(s1 | R) P(R) / P(s1)

The context model behaves like the normative model, since it stores all examples with their correct labels.

Error-driven learning forces exemplars to compete for the right to activate output nodes, resulting in base-rate neglect when s1 is presented in isolation.
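The normative calculation for this design is a one-liner (values from the Gluck and Bower setup above):

```python
# Normative Bayesian posterior for the Gluck & Bower (1988) design
p_R, p_C = 0.25, 0.75        # base rates: rare vs. common disease
p_s1_R, p_s1_C = 0.6, 0.2    # P(s1 | disease)

p_R_given_s1 = (p_s1_R * p_R) / (p_s1_R * p_R + p_s1_C * p_C)
print(p_R_given_s1)  # 0.5: s1 alone is equally consistent with either disease
```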