
An exactly solvable maximum entropy model

Peter Latham
Gatsby Computational Neuroscience Unit, UCL

CNS, July 20, 2006

The neural coding problem

s  →  r1, r2, ..., rn

What we ultimately want is the posterior, P(s|r1, r2, ..., rn). What we can try to measure is P(r1, r2, ..., rn|s); Bayes connects the two:

   P(s|r1, r2, ..., rn) = P(r1, r2, ..., rn|s) P(s) / P(r1, r2, ..., rn)

[Figure: histogram of P(r|s) vs. response r, 0–20.]

response: one neuron, spike count, 300 ms bins.
decent histogram: ~20 responses, ~200 trials/stimulus.

[Figure: 2-D histogram of P(r1, r2|s) over (r1, r2); 200 trials is just not enough.]

response: two neurons, spike count, 300 ms bins.
decent histogram: ~20² = 400 responses, ~4,000 trials/stimulus.

more realistic case ...

P(r1, r2, ..., r10|s) = ???

10-D histogram, ~10¹³ responses.

time to collect: 20,000,000 years/stimulus.

Clearly, an approximate approach is needed.

There are several possibilities:

1. Assume independence: p(r|s) = p(r1|s) p(r2|s) …

2. Parametric models:
   2a. Point process models.
   2b. Gaussian approximation (for rates).
   2c. Maximum entropy models.

Questions:

1. Are maximum entropy models useful for neural data?  I'm not sure.

2. How do we assess model quality?  Not the way you might think.

3. How tractable are these models?  Not very.

The idea behind maximum entropy models:

1. Measure, from data, some aspect of a probability distribution.

2. Find the maximum entropy distribution consistent with that measurement.

An example (in 1-D, with no dependence on s).

1. Estimate the mean response from data:

   r̄ = (1/K) ∑k=1..K r(k)

2. Find the maximum entropy distribution consistent with this:

   ∂/∂p(r) [ −∑r' p(r') log p(r') − λ0 ∑r' p(r') − λ ∑r' r' p(r') ] = 0

   (the three terms are the entropy, the normalization constraint, and the mean constraint)

   −1 − log p(r) − λ0 − λr = 0   =>   p(r) = exp[−1 − λ0 − λr]

To find λ0 and λ:

   p(r) = exp(−λr) / Z

   ∑r p(r) = 1   =>   Z = ∑r exp(−λr)

   r̄ = ∑r r exp(−λr) / Z   determines λ.
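As a concrete illustration of this step, here is a minimal sketch (mine, not from the talk) that fits λ by one-dimensional root finding, assuming a hypothetical response range of 0–20 spikes per bin and made-up trial counts:

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical discrete response space: spike counts 0..20 in a 300 ms bin.
r_vals = np.arange(21)

def model_mean(lam):
    """Mean of p(r) = exp(-lam*r)/Z over r_vals."""
    w = np.exp(-lam * r_vals)
    return np.sum(r_vals * w) / np.sum(w)

def fit_lambda(empirical_mean):
    """Solve model_mean(lam) = empirical_mean for lam (1-D root find)."""
    return brentq(lambda lam: model_mean(lam) - empirical_mean, -5.0, 5.0)

# Fake "recorded" spike counts, just to have an empirical mean to match.
rng = np.random.default_rng(0)
trials = rng.poisson(4.0, size=200).clip(0, 20)
r_bar = trials.mean()

lam = fit_lambda(r_bar)
p_r = np.exp(-lam * r_vals) / np.sum(np.exp(-lam * r_vals))
print(f"empirical mean = {r_bar:.3f}, model mean = {model_mean(lam):.3f}, lambda = {lam:.3f}")
```

Here λ0 has been absorbed into Z, exactly as in the formula above.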

Adding more constraints just adds more Lagrange multipliers. Matching the second moment as well gives

   p(r) = exp(−λr − λ1 r²) / Z

and, for two neurons, matching means, second moments, and the correlation gives

   p(r1, r2) = exp(−λ1 r1 − λ2 r2 − λ11 r1² − λ12 r1 r2 − λ22 r2²) / Z

An aside:

Maximum entropy => Maximum likelihood.

It's just another parametric model; it's just one that lives in the exponential family.
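A short sketch of why this holds for the 1-D model above (my own illustration, reusing the hypothetical 0–20 response range): the derivative of the average log-likelihood with respect to λ is (model mean − empirical mean), so maximizing the likelihood lands on exactly the moment-matching, i.e. maximum entropy, solution.

```python
import numpy as np

r_vals = np.arange(21)            # hypothetical response space, as above

def model_mean(lam):
    w = np.exp(-lam * r_vals)
    return np.sum(r_vals * w) / np.sum(w)

def fit_by_max_likelihood(r_bar, lr=0.02, n_steps=5000):
    """Gradient ascent on the mean log-likelihood of p(r) = exp(-lam*r)/Z.

    d/d(lam) of the mean log-likelihood is (model mean - empirical mean),
    so the maximum-likelihood fixed point is exactly the moment-matching
    (maximum entropy) condition.
    """
    lam = 0.0
    for _ in range(n_steps):
        lam += lr * (model_mean(lam) - r_bar)
    return lam

print(fit_by_max_likelihood(r_bar=4.0))   # same lambda as the root-finding fit
```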

Assessing goodness of fit: KL distance.

   D(p(r)||p(r|λ)) = ∑r p(r) log [p(r)/p(r|λ)]
                   = ∑r p(r) log p(r) − ∑r p(r) log p(r|λ)
                   = −H(r) − ∑r p(r) log p(r|λ)

where H(r) is the (true) entropy.

Our original example: p(r|λ) = exp(−λr) / Z. Then

   −∑r p(r) log p(r|λ) = λ ∑r p(r) r + log Z
                       = λ ∑r p(r|λ) r + log Z      (the model matches the mean)
                       = −∑r p(r|λ) log p(r|λ)
                       = H(r|λ)

Assessing goodness of fit: KL distance.

   D(p(r)||p(r|λ)) = ∑r p(r) log [p(r)/p(r|λ)]
                   = −H(r) + H(r|λ)
                   = model entropy − true entropy

where H(r) is the true entropy and H(r|λ) is the entropy under the model.
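A quick numerical check of this identity (my own sketch; the "true" p(r) here is an arbitrary toy distribution): fit λ to the mean of p, then compare the directly computed KL divergence with model entropy minus true entropy.

```python
import numpy as np
from scipy.optimize import brentq

r_vals = np.arange(21)

# Arbitrary toy "true" distribution over spike counts (for illustration only).
p_true = np.exp(-0.5 * (r_vals - 3.0) ** 2 / 4.0)
p_true /= p_true.sum()

# Fit the maxent model p(r|lam) = exp(-lam*r)/Z to the true mean.
true_mean = np.sum(r_vals * p_true)
model_mean = lambda lam: np.sum(r_vals * np.exp(-lam * r_vals)) / np.sum(np.exp(-lam * r_vals))
lam = brentq(lambda l: model_mean(l) - true_mean, -5.0, 5.0)
p_model = np.exp(-lam * r_vals) / np.sum(np.exp(-lam * r_vals))

H_true  = -np.sum(p_true * np.log(p_true))
H_model = -np.sum(p_model * np.log(p_model))
D_direct = np.sum(p_true * np.log(p_true / p_model))

print(f"D(p||p_lam)       = {D_direct:.6f}")
print(f"H_model - H_true  = {H_model - H_true:.6f}")   # same number
```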

We have a problem:

   D(p(r)||p(r|λ)) = model entropy − true entropy

Although we might be able to compute the model entropy, there's no way in hell we can compute the true entropy: the number of possible responses blows up with the number of neurons.

"Solution": an exactly solvable model, for which, in the large N limit, we can compute both the true entropy and the model entropy.

N binary neurons.

[Figure: spike raster, neuron vs. time, discretized into binary words, e.g. 0110010111, 1111011110, 0001100100.]

ri = response of neuron i ∈ {−1, 1}:
   −1: 0 spikes in a bin
   +1: 1 or more spikes in a bin

Example: 0110010111  =>  r = (−1, 1, 1, −1, −1, 1, −1, 1, 1, 1)
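For concreteness, a tiny sketch (mine, with made-up spike counts) of this encoding: each neuron's count in a bin maps to −1 (no spikes) or +1 (one or more spikes).

```python
import numpy as np

# Hypothetical spike counts: rows = neurons, columns = time bins.
counts = np.array([[0, 2, 1, 0],
                   [1, 0, 0, 3],
                   [0, 0, 1, 1]])

# -1 if 0 spikes in the bin, +1 if 1 or more spikes.
r = np.where(counts > 0, 1, -1)
print(r)
# Each column of r is one binary "word": the population response in that bin.
```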

Exactly solvable model (suppressing the dependence on the stimulus):

   p(r) = ∑θ p(θ) ∏i p(ri|θ)

Computing the true entropy, H(r):

   H(r) − H(r|θ) = I(r;θ) ≤ H(θ)

where I(r;θ) is the mutual information and H(θ) is the entropy of p(θ). Therefore

   H(r) = H(r|θ) + I(r;θ) ≤ H(r|θ) + H(θ)
   H(r) = H(r|θ) + I(r;θ) ≥ H(r|θ)

   =>   H(r|θ) ≤ H(r) ≤ H(r|θ) + H(θ)

Why is this useful?

   H(r|θ) = ∑θ p(θ) H(∏i p(ri|θ))
          = ∑θ p(θ) ∑i H1(ri|θ)
          = N ∑θ p(θ) H1(r|θ)

where H1(r|θ) = −∑r p(r|θ) log p(r|θ); there are only two terms in this sum!!!

In the bound H(r|θ) ≤ H(r) ≤ H(r|θ) + H(θ), the left-hand side is order(N) and easy to compute, while H(θ) is order(1). So in the large N limit,

   H(r) ≈ H(r|θ).
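To see how the two pieces scale, here is a small sketch (mine, not from the talk) for a discrete θ: it computes the order-N term H(r|θ) = N ∑θ p(θ) H1(r|θ) and the order-1 term H(θ), which together bracket H(r). The parameter values are made up.

```python
import numpy as np

def binary_entropy(rho):
    """Entropy (nats) of a single +/-1 neuron with mean rho."""
    q = (1.0 + rho) / 2.0
    return -(q * np.log(q) + (1.0 - q) * np.log(1.0 - q))

def entropy_bounds(p_theta, rho_theta, N):
    """Return (H(r|theta), H(r|theta) + H(theta)), which bracket the true H(r)."""
    p_theta = np.asarray(p_theta)
    rho_theta = np.asarray(rho_theta)
    H_r_given_theta = N * np.sum(p_theta * binary_entropy(rho_theta))   # order N, easy
    H_theta = -np.sum(p_theta * np.log(p_theta))                        # order 1
    return H_r_given_theta, H_r_given_theta + H_theta

# Made-up two-state example: p(theta) = (0.7, 0.3), rho(theta) = (-0.9, -0.5).
for N in (10, 100, 1000):
    lo, hi = entropy_bounds([0.7, 0.3], [-0.9, -0.5], N)
    print(f"N = {N:5d}:  {lo:10.3f} <= H(r) <= {hi:10.3f}   (gap per neuron = {(hi - lo)/N:.2e})")
```

The gap between the bounds is the same for every N, so per neuron it vanishes as N grows, which is the content of H(r) ≈ H(r|θ).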

Two maximum entropy models:

1. p1(r|h'), which captures first moments:

   ∑r ri p(r) = ∑r ri p1(r|h')

2. p2(r|h, J), which captures first and second moments:

   ∑r ri p(r) = ∑r ri p2(r|h, J)
   ∑r ri rj p(r) = ∑r ri rj p2(r|h, J)

The models:

   p1(r|h') = exp(h' ∑i ri) / Z
   p2(r|h, J) = exp(h ∑i ri + (J/2N) ∑ij ri rj) / Z

The true distribution:

   p(r) = ∑θ p(θ) ∏i p(ri|θ)

p1(r|h') = exp(h' ∑i ri) / Z   =>   all neurons have the same mean:

   ⟨ri⟩ ≡ ρ, independent of i.

ρ completely specifies p1(r|h').

p(r) = ∑θ p(θ) ∏i p(ri|θ)   =>   conditioned on θ, all neurons have the same mean:

   ⟨ri⟩θ ≡ ∑ri ri p(ri|θ) ≡ ρ(θ).

ρ(θ) and p(θ) completely specify p(r).
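A quick illustrative sketch (made-up numbers) of this point: sample from p(r) = ∑θ p(θ) ∏i p(ri|θ) given only p(θ) and ρ(θ), and check that the first and second moments come out as the mixture predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
p_theta = np.array([0.7, 0.3])      # hypothetical p(theta)
rho     = np.array([-0.9, -0.5])    # hypothetical rho(theta) = mean of r_i given theta
N, T = 100, 50000                   # neurons, samples

theta = rng.choice(len(p_theta), size=T, p=p_theta)      # draw theta for each sample
q = (1.0 + rho[theta]) / 2.0                             # P(r_i = +1 | theta)
r = np.where(rng.random((T, N)) < q[:, None], 1, -1)     # conditionally independent neurons

print("mean <r_i>    :", r.mean(), "vs", p_theta @ rho)
print("cov <r_i r_j> :", (r[:, 0] * r[:, 1]).mean() - r[:, 0].mean() * r[:, 1].mean(),
      "vs", p_theta @ rho**2 - (p_theta @ rho) ** 2)
```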

[Figure: schematic distributions of the mean activity on (−1, 1). p1(r|h') puts all its weight at ρ; the true p(r) has a peak of weight p(θj) at each ρ(θj); p2(r|h, J) has two peaks, with weight p2− at −ρ2 and weight p2+ at +ρ2.]

Conclusion #1:

The "pairwise" maximum entropy distribution (p2) does not do a very good job of matching the true distribution.

Whether or not this is true for more complex distributions is not known.

A simple case: p(θ) consists of two terms.

[Figure: as above, with two θ values. The true p(r) has peaks of weight p(θ1) at ρ(θ1) and p(θ2) at ρ(θ2); p1(r|h') sits at ρ; p2(r|h, J) has peaks of weight p2− at −ρ2 and p2+ at +ρ2.]

Three parameters: ρ, p(θ2), ρ2.

In terms of more intuitive parameters (firing rate ν, bin size τ):

   ρ = (+1)×ντ + (−1)×(1 − ντ) = 2ντ − 1

   ρ2² − ρ² ≡ δ² = ⟨ri rj⟩ − ⟨ri⟩⟨rj⟩,   i ≠ j

Model parameters: ντ, δ, p(θ2).

Goodness of fit:

   D(p(r)||p1(r|h')) = H1 − H
   D(p(r)||p2(r|h, J)) = H2 − H

The picture: [a number line starting at 0, with the entropies ordered H ≤ H2 ≤ H1.]

What's a good cost function?

For the independent distribution, p1(r|h'), the relevant gap is H1 − H; for the pairwise distribution, p2(r|h, J), it is H2 − H. A natural cost function is the fraction of the independent model's gap that the pairwise model closes:

   (H1 − H2) / (H1 − H)

which is 1 when p2 matches the true distribution and 0 when p2 does no better than p1.
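To make this concrete, here is a numerical sketch (my own, not the talk's large-N calculation) that computes H1, H2, and H exactly for finite N, exploiting the fact that every distribution here depends on r only through ∑i ri. The convention used to pin down ρ(θ1) and ρ(θ2) from (ρ, δ, p(θ2)) is one reasonable choice and is not recoverable from the transcript, so the printed ratio is only qualitatively comparable to the curves in the figures below.

```python
import numpy as np
from scipy.special import gammaln, logsumexp
from scipy.optimize import fsolve

def binary_entropy(rho):
    """Entropy (nats) of one +/-1 neuron with mean rho."""
    q = (1.0 + rho) / 2.0
    return -(q * np.log(q) + (1.0 - q) * np.log(1.0 - q))

def mixture_entropy(p_theta, rho_theta, N):
    """Exact entropy of p(r) = sum_theta p(theta) prod_i p(r_i|theta).
    p(r) depends only on k = number of +1 responses, so the sum over 2^N words
    collapses to N+1 terms."""
    k = np.arange(N + 1)
    log_binom = gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)
    q = (1.0 + np.asarray(rho_theta)) / 2.0               # P(r_i = +1 | theta)
    log_pword = logsumexp(np.log(p_theta)[:, None]
                          + k * np.log(q)[:, None]
                          + (N - k) * np.log(1.0 - q)[:, None], axis=0)
    return -np.sum(np.exp(log_binom + log_pword) * log_pword)

def pairwise_model_entropy(rho, delta, N):
    """Fit p2(r) ~ exp(h*sum_i r_i + (J/2N)*(sum_i r_i)^2) to the first two
    moments of the true model, and return its entropy H2."""
    k = np.arange(N + 1)
    m = 2.0 * k - N                                       # possible values of sum_i r_i
    log_binom = gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)
    m1_target = N * rho                                   # <sum r_i> under the true model
    m2_target = N + N * (N - 1) * (rho**2 + delta**2)     # <(sum r_i)^2> under the true model

    def moments(params):
        h, J = params
        logw = log_binom + h * m + (J / (2.0 * N)) * m**2
        logZ = logsumexp(logw)
        p = np.exp(logw - logZ)                           # probability of each value of sum r_i
        return p @ m, p @ m**2, logZ

    def residual(params):
        m1, m2, _ = moments(params)
        return [(m1 - m1_target) / N, (m2 - m2_target) / N**2]

    h, J = fsolve(residual, x0=[np.arctanh(rho), 0.0])
    m1, m2, logZ = moments((h, J))
    return logZ - h * m1 - (J / (2.0 * N)) * m2           # H2 = -<log p2>

# Hypothetical example (made-up values).
nu, tau, delta, p2 = 2.0, 0.020, 0.05, 0.2     # rate (Hz), bin (s), sqrt(covariance), p(theta_2)
rho = 2.0 * nu * tau - 1.0                      # mean response
N = 100

p_theta = np.array([1.0 - p2, p2])
# One way to choose the two conditional means so the mixture has mean rho and covariance delta^2:
rho_theta = np.array([rho - delta * np.sqrt(p2 / (1.0 - p2)),
                      rho + delta * np.sqrt((1.0 - p2) / p2)])

H1 = N * binary_entropy(rho)                    # independent maxent model (matches the mean only)
H  = mixture_entropy(p_theta, rho_theta, N)     # true entropy
H2 = pairwise_model_entropy(rho, delta, N)      # pairwise maxent model (matches both moments)
print(f"H1 = {H1:.2f}, H2 = {H2:.2f}, H = {H:.2f}, (H1-H2)/(H1-H) = {(H1 - H2)/(H1 - H):.3f}")
```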

[Figure: entropy relative to H1 vs. p(θ2), for τ = 20 ms, ν = 25 Hz, δ = 0.1. The curves for H, H2, and H1 all lie within about 1% of each other: H2 ≈ H, and both are ≈ H1, so (H1 − H2)/(H1 − H) ≈ 1.]

[Figure: entropy relative to H1 vs. p(θ2), for τ = 20 ms, ν = 2 Hz, δ = 0.05. Here H falls well below H2 and H1, and (H1 − H2)/(H1 − H) ≈ 0.2.]

[Figure: entropy relative to H1 vs. p(θ2), for τ = 20 ms, ν = 2 Hz, δ = 0.1; curves for H, H2, and H1.]

[Figure: entropy relative to H1 vs. p(θ2), for τ = 20 ms, ν = 2 Hz, δ = 0.2; curves for H, H2, and H1.]

A better approach to determining goodness of fit.

What we're computing is p(r|s, λm) for model m, which gives, via Bayes,

   p(s|r, λm) = p(r|s, λm) p(s) / normalization.

What we should be comparing are the posteriors:

   p(s|r, λ1) and p(s|r, λ2).

It doesn't make sense to spend a huge amount of time and effort finding the ultimate model for p(r|s, λ) if that's not going to improve p(s|r, λ).
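A minimal sketch of what "compare posteriors" could look like in practice (entirely illustrative: the stimuli, the "true" truncated-Poisson likelihoods, and the single-neuron maxent model are all made up): compute p(s|r, λ) via Bayes under the true and model likelihoods, and measure the average KL divergence between the two posteriors.

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import brentq

r_vals = np.arange(21)
stims = [2.0, 6.0, 12.0]                      # made-up mean spike counts for 3 stimuli
p_s = np.full(len(stims), 1.0 / len(stims))   # uniform prior over stimuli

def true_likelihood(mean):                    # "true" p(r|s): truncated Poisson (toy choice)
    p = poisson.pmf(r_vals, mean)
    return p / p.sum()

def maxent_likelihood(mean):                  # model p(r|s,lam) = exp(-lam*r)/Z fit to the mean
    model_mean = lambda lam: np.sum(r_vals * np.exp(-lam * r_vals)) / np.sum(np.exp(-lam * r_vals))
    lam = brentq(lambda l: model_mean(l) - mean, -5.0, 5.0)
    p = np.exp(-lam * r_vals)
    return p / p.sum()

def posterior(likelihoods, r):                # p(s|r) = p(r|s) p(s) / normalization
    post = np.array([lik[r] for lik in likelihoods]) * p_s
    return post / post.sum()

true_lik  = [true_likelihood(m) for m in stims]
model_lik = [maxent_likelihood(m) for m in stims]

# Average KL between posteriors, weighted by how often each response actually occurs.
p_r = np.sum([ps * lik for ps, lik in zip(p_s, true_lik)], axis=0)
kl = 0.0
for r in r_vals:
    pt, pm = posterior(true_lik, r), posterior(model_lik, r)
    kl += p_r[r] * np.sum(pt * np.log(pt / pm))
print(f"average KL between true and model posteriors: {kl:.4f} nats")
```

A small average KL says the approximate likelihood is already good enough for decoding, regardless of how it scores on p(r|s) itself.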

Conclusions

1. Maximum entropy => Maximum likelihood.

2. For at least one maxent model, pairwise correlations don't match the true distribution very well.

3. One needs to be very careful about assessing models: compare posteriors!!!

4. For binary neurons, the pairwise maxent model is intractable.

5. Wherever possible, use point process models. After all, spike timing is irrelevant in the brain.