Categories and Concepts: Computational models (part 1)
(transcript of lecture slides)
PSYCH-GA 2207
Brenden Lake
“Verbally expressed statements are sometimes flawed by internal inconsistencies, logical contradictions, theoretical weaknesses and gaps. A running computational model, on the other hand, can be considered as a sufficiency proof of the internal coherence and completeness of the ideas it is based upon.” (Fum, Del Missier, & Stocco, 2007)
Why build a computational model?
Some famous psychological theories…
• Attention is like a spotlight
• A child learning about the world is like a scientist theorizing about science
• Language influences thought
• Working memory has 7 +/- 2 slots to store information
• Categorization happens by comparing novel instances to past exemplars
• Categories influence perception
Each of these theories benefits from formalization with a computational model to…
• Make predictions explicit
• Implications often defy expectations
• Aid communication between scientists
• Support cumulative progress
My view echoes Richard Feynman’s “What I cannot create, I do not understand.”
Agenda for the next 2 classes
Two influential computational models
• ALCOVE — “ALCOVE: An exemplar-based connectionist model of category learning”, Kruschke (1992)
• Rational Model — “The adaptive nature of human categorization”, Anderson (1991)
And some comments on category learning in modern AI
Like the Context Model, ALCOVE is an exemplar model.
[Figure: a transfer item y (“Bird?”) compared against stored exemplars x ∈ C (“Birds You’ve Seen”); example stimuli from Category A and Category B]
Review of the “Context Model”: similarity between two items x and y
(Medin & Schaffer, 1978)

[Figure: a transfer item y and training exemplars from Category A and Category B, each defined over four binary dimensions D1–D4; y mismatches exemplar x on two of the four dimensions]

sim(y, x) = ∏_{D_i} m^{1[x_i ≠ y_i]} = m · 1 · m · 1 = 0.09

where m is a “mismatch” free parameter, here m = 0.3.
Review: Context model

Item similarity: sim(y, x) = ∏_{D_i} m^{1[x_i ≠ y_i]}, with “mismatch” free parameter m

Category similarity: sim(y, C) = ∑_{x ∈ C} sim(y, x)

Probability of classification: P(y ∈ C) = sim(y, C) / ∑_{C′} sim(y, C′)

“classification is based on similarity to all exemplars in a class” (Medin & Schaffer, 1978)

[Figure: a transfer item y (“Bird?”) compared against the stored exemplars x ∈ C (“Birds You’ve Seen”)]
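The Context model’s three equations can be sketched in a few lines of Python (a minimal illustration; the example stimuli and the value m = 0.3 follow the worked example above, while the function names are my own):

```python
import numpy as np

def item_sim(y, x, m=0.3):
    # Context model: multiply in one factor of m for every mismatching dimension
    return float(np.prod(np.where(np.array(y) != np.array(x), m, 1.0)))

def p_in_category(y, categories, m=0.3):
    # categories: dict mapping label -> list of stored exemplars
    cat_sim = {c: sum(item_sim(y, x, m) for x in xs)
               for c, xs in categories.items()}
    total = sum(cat_sim.values())
    return {c: s / total for c, s in cat_sim.items()}

# Hypothetical four-dimensional binary stimuli
categories = {"A": [(1, 1, 1, 1), (1, 1, 1, 0)],
              "B": [(0, 0, 0, 0), (0, 0, 1, 1)]}
y = (1, 1, 0, 0)
print(item_sim(y, (1, 1, 1, 1)))  # two mismatches: 0.3 * 0.3 = 0.09
print(p_in_category(y, categories))
```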
ALCOVE’s similarity between two items x and y

Context model’s item similarity (assuming we have a different mismatch parameter m_i for each dimension):
sim(y, x) = ∏_{D_i} m_i^{1[x_i ≠ y_i]}

Recall this identity: ∏_{i=1}^N z_i = e^{∑_{i=1}^N log z_i}

ALCOVE’s item similarity (also works for continuous features):
sim(y, x) = e^{∑_{D_i} log(m_i) 1[x_i ≠ y_i]} = e^{−∑_{D_i} α_i |x_i − y_i|}

with α_i = −log m_i (equivalent for binary dimensions, by the identity above; parameters c, q, and r not shown).
[Figure: transfer item y and exemplars from Category A and Category B over binary dimensions D1–D4; y mismatches exemplar x on two of the four dimensions]

ALCOVE’s similarity between two items x and y:
sim(y, x) = e^{∑_{D_i} log(m_i) |x_i − y_i|} = e^{−1.20 + 0 − 1.20 + 0} = 0.09

which matches the Context model’s sim(y, x) = ∏_{D_i} m^{1[x_i ≠ y_i]} = m · 1 · m · 1 = 0.09 with mismatch parameter m = 0.3 (note log 0.3 ≈ −1.20).
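A quick numeric check (assuming m = 0.3 and two mismatching dimensions, as in the example) confirms that the product form and the exponential form give the same similarity:

```python
import math

m = 0.3
alpha = -math.log(m)          # about 1.20
mismatches = [1, 0, 1, 0]     # |x_i - y_i| on four binary dimensions

context = m ** sum(mismatches)                          # product form
alcove = math.exp(-sum(alpha * d for d in mismatches))  # exponential form
print(context, alcove)  # both 0.09 (up to floating point)
```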
ALCOVE

Item similarity (discrete or continuous):
sim(y, x) = e^{−∑_{D_i} α_i |x_i − y_i|}
key parameter: attention weights α_i

Category similarity:
sim(y, C) = ∑_{x ∈ X} w_{cx} sim(y, x)
key parameter: association weights w_{cx} between exemplar x and category c; X is the set of all exemplars across all categories

Probability of classification:
P(y ∈ C) = e^{φ sim(y,C)} / ∑_{C′} e^{φ sim(y,C′)}
φ: response “temperature”

[Figure: a transfer item y (“Bird?”) compared against all stored exemplars x ∈ X (“Birds You’ve Seen”)]
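A minimal sketch of this forward pass in Python (hypothetical stimuli and weights; the specificity parameter c and exponents q, r of the full model are fixed at 1 here):

```python
import numpy as np

def alcove_probs(y, exemplars, alpha, w, phi=2.0):
    """One ALCOVE forward pass (sketch)."""
    sims = np.exp(-np.abs(exemplars - y) @ alpha)   # sim(y, x) for every exemplar
    cat_sim = {c: float(w[c] @ sims) for c in w}    # sim(y, C), weighted over ALL exemplars
    exps = {c: np.exp(phi * s) for c, s in cat_sim.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}      # softmax response rule

# Hypothetical 3-dimensional stimuli (color, texture, slash)
exemplars = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
alpha = np.array([1.0, 0.2, 0.2])                   # attention concentrated on dim 1
w = {"A": np.array([1.0, 1.0, -1.0, -1.0]),         # association weights w_cx
     "B": np.array([-1.0, -1.0, 1.0, 1.0])}
print(alcove_probs(np.array([1.0, 1.0, 1.0]), exemplars, alpha, w))
```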
Context model:
Item similarity: sim(y, x) = ∏_{D_i} m^{1[x_i ≠ y_i]} (“mismatch” free parameter m)
Category similarity: sim(y, C) = ∑_{x ∈ C} sim(y, x) (only considers exemplars x assigned to C)
Probability of classification: P(y ∈ C) = sim(y, C) / ∑_{C′} sim(y, C′)

ALCOVE:
Item similarity: sim(y, x) = e^{−∑_{D_i} α_i |x_i − y_i|} (attention weights α_i)
Category similarity: sim(y, C) = ∑_{x ∈ X} w_{cx} sim(y, x) (association weights w_{cx} between exemplar x and category c)
Response rule (category A vs. category B): P(y ∈ A) = e^{φ sim(y,A)} / (e^{φ sim(y,A)} + e^{φ sim(y,B)})

y denotes the current stimulus; x denotes one of the exemplars.
How does ALCOVE learn?
Learning is incremental fitting of the attention weights and association weights, by optimizing an objective function.

Let’s use log-likelihood as our objective function:
F(w, α) = ∑_{x ∈ A} log P_{w,α}(x ∈ A) + ∑_{x ∈ B} log P_{w,α}(x ∈ B)

Then the optimal parameters are given by:
argmax_{w,α} F(w, α)

(“Maximum likelihood” is also the objective function for linear regression, logistic regression, and many other stats/machine learning algorithms.)
(Note that this is a different objective function than the one used in the original ALCOVE paper.)
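As a sketch, the log-likelihood objective can be computed directly from the model’s classification probabilities (the helper name and example numbers are illustrative, not from the slides):

```python
import numpy as np

def log_likelihood(probs_A, labels):
    """F = sum of log P(A) over category-A items plus log P(B) over category-B items.

    probs_A: the model's P(x in A) for each training item
    labels:  True if the item belongs to category A, else it belongs to B
    """
    probs_A = np.asarray(probs_A)
    labels = np.asarray(labels)
    return float(np.sum(np.where(labels, np.log(probs_A), np.log(1 - probs_A))))

print(log_likelihood([0.9, 0.8, 0.3], [True, True, False]))
```

F is always ≤ 0, and approaches 0 as the model assigns probability near 1 to every correct label.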
ALCOVE applied to SHJ Type I category
(Shepard, Hovland, & Jenkins, 1961)

Stimuli vary on three binary dimensions (color, texture, slash) and belong to category A or category B.

Network BEFORE training:
[Figure: network diagram with current stimulus y at the input, an exemplar layer gated by attention weights (α_color, α_texture, α_slash) = (0.33, 0.33, 0.33), association weights w_Ax all initialized to 0.0, exemplar similarities sim(y, x) ranging from 1.0 down to about 0.01, and response P(y ∈ A) = 0.5]

sim(y, x) = e^{−∑_{D_i} α_i |x_i − y_i|}
sim(y, C) = ∑_{x ∈ X} w_{cx} sim(y, x)
Maximizing likelihood via gradient ascent
(the workhorse of modern machine learning, and often cognitive modeling too; equivalently, gradient descent on −F)

F(w, α) = ∑_{x ∈ A} log P_{w,α}(x ∈ A) + ∑_{x ∈ B} log P_{w,α}(x ∈ B)

Computing the gradient tells us which direction to go for steepest ascent:
w ← w + γ ∂F/∂w
γ: learning rate (small constant)

[Figure: the objective F plotted against a weight w, with the gradient ∂F/∂w at the initial weight pointing uphill toward the likelihood maximum]
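A minimal sketch of the update rule, using a finite-difference gradient on a toy one-parameter log-likelihood rather than ALCOVE itself (the model and data counts are made up for illustration):

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

# Toy log-likelihood: 7 successes and 3 failures under p = sigmoid(w)
def F(w):
    p = sigmoid(w)
    return 7 * math.log(p) + 3 * math.log(1 - p)

w, gamma, eps = 0.0, 0.1, 1e-6
for _ in range(200):
    grad = (F(w + eps) - F(w - eps)) / (2 * eps)  # finite-difference gradient
    w += gamma * grad                             # ascent step: w <- w + gamma * dF/dw
print(sigmoid(w))  # converges toward the maximum-likelihood value, p = 0.7
```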
ALCOVE after one gradient step: the response is a little better, since y is in A!
[Figure: same network. Attention weights still (0.33, 0.33, 0.33); a few association weights w_Ax nudged away from zero (e.g., 0.015, 0.002); response P(y ∈ A) = 0.503]

sim(y, x) = e^{−∑_{D_i} α_i |x_i − y_i|}
sim(y, C) = ∑_{x ∈ X} w_{cx} sim(y, x)
ALCOVE after many more gradient steps: the response is now nearly perfect.
[Figure: same network. Attention weights (α_color, α_texture, α_slash) = (0.544, 0.0, 0.0), so attention only looks at color; association weights w_Ax = ±1.14; exemplar similarities sim(y, x) of 1.0 or 0.03; response P(y ∈ A) = 0.99]

sim(y, x) = e^{−∑_{D_i} α_i |x_i − y_i|}
sim(y, C) = ∑_{x ∈ X} w_{cx} sim(y, x)
Solutions found for SHJ problems

Types I and II:
[Figure: learned network weights over dimensions color, texture, slash. Type I: attention weights (α_color, α_texture, α_slash) = (0.544, 0.0, 0.0); association weights w_Ax = ±1.14. Type II: attention weights (0.61, 0.0, 0.61); association weights ±1.51]

Types III and VI:
[Figure: learned network weights. Type III: attention weights ≈ (0.399, 0.385, …); association weights ≈ ±1.6 to ±1.7. Type VI: attention weights (0.62, 0.62, 0.62); association weights ±1.59]
[Scanned excerpts from Nosofsky, Gluck, Palmeri, McKinley, & Glauthier (1994). Method: the procedure followed Shepard et al. (1961); learning continued until a criterion of four consecutive sub-blocks of 8 trials with no errors, or for a maximum of 400 trials (25 blocks of 16 trials). Results: Table 1 and Figure 2 report the observed average error probabilities for problem Types I–VI in each block of 16 trials; Figure 6 shows the learning curves for the six problem types predicted by ALCOVE; Figure 7 shows the learning curves predicted by the configural cue model.]
Comparing learning curves for people and ALCOVE
(from Nosofsky et al.’s (1994) SHJ replication in Memory & Cognition)
Attentional learning is critical
[Figure: learning curves for ALCOVE with attention weights vs. ALCOVE without attention weights, which gets the Type II order wrong]
What about a standard neural network?
[Figure: a standard multi-layer network maps the inputs (color, texture, slash) through a hidden layer (not exemplars) to a Category A vs. B output, in contrast to ALCOVE’s exemplar layer]
ALCOVE with attention weights vs. a standard multi-layer neural network (which gets the Type II order wrong and the Type IV order very wrong)
What about a standard neural network?
[Figure: activation profile for an ALCOVE hidden unit vs. activation profile for a standard neural net hidden unit, over the dimensions color, texture, slash]
Standard nets prefer linearly separable categories, while ALCOVE prefers strong dimensional alignment.
Only Type I and Type IV are linearly separable, and thus they are learned first by the standard net.
As predicted by ALCOVE, people don’t necessarily favor linearly separable categories:
• linearly separable condition: people had 39.5 errors on average
• non-linearly separable condition: people had 38.0 errors on average
(criterion was two error-free passes through the stimuli)
Interlude: Bayes’ rule for updating beliefs in light of data

Data (D): John is coughing
Hypotheses:
h1 = John has a cold
h2 = John has emphysema
h3 = John has a stomach flu

Which hypotheses should we believe, and with what certainty? We want to calculate the posterior probabilities P(h1|D), P(h2|D), and P(h3|D).

Bayes’ rule:
P(h_i | D) = P(D | h_i) P(h_i) / ∑_j P(D | h_j) P(h_j)
(posterior = likelihood × prior, normalized)

This example is from Josh Tenenbaum.
Bayesian inference
Data (D): John is coughing
Hypotheses: h1 = John has a cold; h2 = John has emphysema; h3 = John has a stomach flu

P(h_i | D) = P(D | h_i) P(h_i) / ∑_j P(D | h_j) P(h_j)

Prior favors h1 and h3 over h2: P(h1) = .75, P(h2) = .05, P(h3) = .2
Likelihood favors h1 and h2 over h3: P(D|h1) = 1, P(D|h2) = 1, P(D|h3) = .2
Posterior favors h1 over h2 and h3:
P(h1|D) = .75(1) / (.75(1) + .05(1) + .2(.2)) = .89
P(h2|D) = .06
P(h3|D) = .05
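The same arithmetic in a few lines of Python (values taken from the slide):

```python
priors = {"cold": 0.75, "emphysema": 0.05, "stomach flu": 0.20}
likelihoods = {"cold": 1.0, "emphysema": 1.0, "stomach flu": 0.20}  # P(coughing | h)

unnorm = {h: likelihoods[h] * priors[h] for h in priors}
z = sum(unnorm.values())                       # P(D), the normalizing constant
posterior = {h: p / z for h, p in unnorm.items()}
print(posterior)  # cold ~ 0.89, emphysema ~ 0.06, stomach flu ~ 0.05
```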
Base rate neglect and ALCOVE
(Experiment from Gluck and Bower, 1988)

There is a common disease and a rare disease with certain base rates:
P(C) = 0.75, P(R) = 0.25

The diseases lead to symptoms with a certain pattern, with s1 arising often in cases of the rare disease:

Symptom   P(symptom | R)   P(symptom | C)
s1        0.6              0.2
s2        0.4              0.3
s3        0.3              0.4
s4        0.2              0.6

e.g., P(s1 | R) = 0.6 and P(s1 | C) = 0.2

Because the likelihoods and the base rates pull in opposite directions, the presence of s1 alone is equally consistent with either disease:
P(R | s1) = P(s1 | R) P(R) / (P(s1 | R) P(R) + P(s1 | C) P(C)) = 0.5, and likewise P(C | s1) = 0.5
Base rate neglect

Normative: P(R | s1) = P(s1 | R) P(R) / P(s1)

The context model behaves like the normative model, since it stores all examples with their correct labels.

Error-driven learning forces exemplars to compete for the right to activate output nodes, resulting in base-rate neglect when s1 is presented in isolation.
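The normative calculation for this design is a one-liner (values from the Gluck and Bower setup above):

```python
# Normative Bayesian posterior for the Gluck & Bower (1988) design
p_R, p_C = 0.25, 0.75        # base rates: rare vs. common disease
p_s1_R, p_s1_C = 0.6, 0.2    # P(s1 | disease)

p_R_given_s1 = (p_s1_R * p_R) / (p_s1_R * p_R + p_s1_C * p_C)
print(p_R_given_s1)  # 0.5: s1 alone is equally consistent with either disease
```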