
NIPS Workshops 12/10/05

Does RL Occur Naturally?

C. R. Gallistel

Rutgers Center for Cognitive Science

Turing’s Vision (‘47-’48)

“It would be quite possible to have the machine try out behaviors and accept or reject them…”

“What we want is a machine that can learn from experience. The possibility of letting the machine alter its own instructions provides the mechanism for this…”

“It might be possible to carry through the organizing [of a learning machine] with only two interfering inputs, one for reward (R) or pleasure and the other for pain or punishment (P). It is intended that pain stimuli occur when the machine’s behavior is wrong, pleasure stimuli when it is particularly right.”

A Different Vision

• Policy (what to do given a state of the world) is pre-specified and immutable

• Learning consists in determining the state of the world; it’s all model estimation

• Appropriate sampling behavior is itself prespecified

The Deep Reasons

• Wolpert & Macready’s “No Free Lunch” theorems

• Chomsky’s “Poverty of the Stimulus” argument

• Bottom line: reinforcement learning takes too long

• Because there is not enough information in the R & P signals

• Because learning in the absence of a highly structured hypothesis space is a practical impossibility (we don’t live long enough)

Learning by Integrating

• Ant knows where it is

• This knowledge is acquired (learned)

• It is acquired by path integration (see the sketch below)

--Harkness & Maroudas, 1985
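Path integration is just running vector summation: on each step the ant adds its displacement (heading and distance) to an accumulated position, from which a direct home vector can be read off at any moment. A minimal sketch in Python (the function names and the step representation are illustrative assumptions, not anything from the talk):

```python
import math

def integrate_path(steps):
    # Accumulate (heading_deg, distance) steps into x, y coordinates
    # relative to the nest at (0, 0).
    x = y = 0.0
    for heading_deg, distance in steps:
        rad = math.radians(heading_deg)
        x += distance * math.cos(rad)
        y += distance * math.sin(rad)
    return x, y

def home_vector(x, y):
    # Course straight back to the nest: (heading_deg, distance).
    return math.degrees(math.atan2(-y, -x)) % 360, math.hypot(x, y)

# A meandering outbound foraging run...
position = integrate_path([(0, 5.0), (90, 3.0), (45, 2.0)])
# ...yields a direct homeward course, with no trial and error.
print(home_vector(*position))
```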

Building a Map

• Ant remembers where the food was (records its coordinates)

• Bees & ants make a map by the GPS principle (record location coordinates--& views)

• They do not discover by trial and error that this is a good thing to do

• As in the GPS, the computational machinery to determine a course from an arbitrary location to an arbitrary location is built in (see the sketch after this list)

• No RL here
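On the GPS principle, once locations are stored as coordinates, the course between any two of them falls out of simple vector arithmetic; nothing about it needs to be discovered by trial and error. A minimal sketch (the coordinate values and names are invented for illustration):

```python
import math

# Stored map: locations recorded as (x, y) coordinates, as a GPS would.
landmarks = {"nest": (0.0, 0.0), "food": (40.0, 30.0), "tree": (-10.0, 25.0)}

def course(frm, to):
    # Range and compass bearing from one stored location to another
    # (x = east, y = north, so bearing 0 deg = due north).
    x1, y1 = landmarks[frm]
    x2, y2 = landmarks[to]
    rng = math.hypot(x2 - x1, y2 - y1)
    bearing = math.degrees(math.atan2(x2 - x1, y2 - y1)) % 360
    return rng, bearing

# A course between two arbitrary stored locations, neither of them the nest:
print(course("tree", "food"))  # (~50.2, ~84.3)
```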

Ranging Behavior

• When leaving a new food source or a new nest (hive), bees & wasps fly backwards in an ever-increasing zigzag

• Determining visual feature distances by parallax

• Innately specified sampling (model-building) behavior

--Wehner, 1981

Also in the Locust

• Locust scanning (Sobel, 1990)

• Moved the target during scanning, so as to make the resulting image motion independent of D

• Reproduced the function relating take-off velocity to D (see the sketch below)
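Both the ranging flights and locust scanning exploit motion parallax: a known lateral self-displacement b shifts a feature’s image through an angle that shrinks with distance D, so D ≈ b / tan(Δθ). A minimal sketch of the computation (the numbers are invented for illustration):

```python
import math

def distance_from_parallax(baseline, angular_shift_deg):
    # Distance to a feature, from the image shift produced by a known
    # lateral self-displacement (the scanning/ranging baseline).
    return baseline / math.tan(math.radians(angular_shift_deg))

# A 2 cm scan that shifts the target image by 1.15 deg puts it ~1 m away;
# the same scan shifting it 11.3 deg puts it ~10 cm away.
print(distance_from_parallax(0.02, 1.15))   # ~0.996 m
print(distance_from_parallax(0.02, 11.3))   # ~0.100 m
```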

Learning by Parameter Estimation

• Animals (including insects) use the sun as a compass reference

• To do this, they must learn the solar ephemeris: the sun’s compass bearing as a function of the time of day--where it is when

• Solar ephemeris varies with latitude and season

Learning from the Dance

• Returning forager does a dance to tell other foragers the location (range & bearing) of source

• Compass bearing of the source, θ_source, specified by specifying its current solar bearing, δ (see the sketch below)

• Range specified by number of waggles

• Hopeless as an RL problem?

θ_sun = compass bearing of sun; θ_source = compass bearing of source; δ = θ_source - θ_sun = solar bearing of source
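With that notation, decoding a dance is one addition and one scaling. A toy sketch (the waggle-to-distance constant is an invented placeholder, not a real calibration):

```python
# Decode a waggle dance into a compass course (toy sketch).
METERS_PER_WAGGLE = 50.0  # invented placeholder calibration

def decode_dance(delta_deg, n_waggles, theta_sun_deg):
    # The dance gives delta (bearing of the source relative to the sun)
    # and range; the observing bee supplies the current solar bearing
    # theta_sun from its own ephemeris.
    theta_source = (theta_sun_deg + delta_deg) % 360
    return theta_source, n_waggles * METERS_PER_WAGGLE

# Dance says: 40 deg clockwise of the sun, 6 waggles; sun is now at 230 deg.
print(decode_dance(40.0, 6, 230.0))  # (270.0, 300.0)
```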

Ephemeris Framework

Deceived Dancing

Dyer, 1987

Poverty of Stimulus

• Dyer & Dickinson, 1994

• Incubator-raised bees allowed to forage to a station due west of the hive, but only in late afternoon, when the sun was declining in the west

• On a heavily overcast day, moved to a new field line with a different compass orientation and allowed to forage in the morning (with the feeder “west” of the hive location)

• Experimenter observes dance of returning foragers to estimate where they believe the sun to be

Bees Believe Earth is Round

Implications

• Form of solar ephemeris equation is built into the nervous system

• Only its parameters are estimated from observation (see the sketch after this list)

• Solves the poverty of the stimulus problem: the information about universal properties of the ephemeris is in the priors

• Neural net without this prior information could not generalize as bees do
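If the form of the ephemeris equation is innate and only its parameters are tuned, learning reduces to curve fitting within a fixed, problem-specific family. A minimal sketch, assuming (purely for illustration) a logistic form in which azimuth sweeps 180 degrees around solar noon; the functional form, parameter names, and data here are my assumptions, not the bees’:

```python
import numpy as np
from scipy.optimize import curve_fit

def ephemeris(t, noon, rate, dawn_azimuth):
    # Innately given form: azimuth sweeps 180 deg, fastest near solar noon.
    # t is time of day in hours; only the three parameters are free.
    return dawn_azimuth + 180.0 / (1.0 + np.exp(-rate * (t - noon)))

# Sparse observations of (time of day, solar azimuth), as a bee might get
# from a few foraging trips (invented numbers).
t_obs = np.array([7.0, 9.0, 16.0, 18.0])
az_obs = np.array([95.0, 110.0, 250.0, 263.0])

params, _ = curve_fit(ephemeris, t_obs, az_obs, p0=[12.0, 1.0, 90.0])
print(params)                    # fitted noon, rate, dawn azimuth
print(ephemeris(13.0, *params))  # generalizes to unsampled times of day
```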

Language Learning

• Same story?

• Innate universal grammar specifies structure common to all languages

• Distinctions between languages are due to differences in parameters (e.g., head-final versus head-first)

• Learning a language reduces to learning the (binary?) parameter values (see the toy sketch below)

• Mark Baker (2001), The Atoms of Language
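On this view the learner is not searching an open-ended hypothesis space; it is flipping a small number of innately specified switches. A toy sketch of setting one such binary parameter (head-first vs. head-final) from word-order evidence; the data and decision rule are invented for illustration:

```python
from collections import Counter

# Toy input: each utterance tagged with the observed verb/object order.
# (Invented data; a real learner works from raw strings.)
utterances = ["VO", "VO", "OV", "VO", "VO"]  # mostly verb-object order

def set_head_parameter(evidence):
    # Set the binary head-direction parameter from observed orders.
    counts = Counter(evidence)
    return "head-first" if counts["VO"] >= counts["OV"] else "head-final"

print(set_head_parameter(utterances))  # head-first (English-like)
```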

Natural Learning Curves

• Gallistel et al. (PNAS 2004)

• Analyzed individual(!) learning curves from standard paradigms in pigeons, rats, rabbits and mice: Pavlovian (autoshaping) in pigeon, rat & mouse; eyeblink in rabbit; plus maze in rat; water maze in mouse

• Regardless of paradigm, the typical curve cannot be distinguished from a step function

• Latency and size of the step vary between subjects

• Averaging across these steps produces a gradual learning curve: its gradualness is an averaging artifact (see the simulation sketch below)
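The averaging artifact is easy to demonstrate: average many step functions whose step times vary across subjects and you get a smooth, gradual curve that describes no individual. A minimal simulation sketch (parameters invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_trials = 50, 200

# Each subject's learning curve is a step: abrupt onset at a random trial.
step_trials = rng.integers(20, 150, size=n_subjects)
trials = np.arange(n_trials)
curves = (trials[None, :] >= step_trials[:, None]).astype(float)

group_mean = curves.mean(axis=0)  # smooth and gradual
print(curves[0, 95:105])   # one subject: all-or-none transition
print(group_mean[95:105])  # group average: gradual "learning curve"
```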

Matching

• Subjects foraging back and forth between locations where food becomes available unpredictably (on random rate schedules with unlimited holds)

• Subjects match the ratio of the time they invest in the locations (expected stay duration, T1/T2) to the ratio of the incomes they have derived from them (I1/I2)

• Matching equates returns: R_i = I_i/T_i, and I_1/T_1 = I_2/T_2 iff T_1/T_2 = I_1/I_2

RL Models

• Most assume hill-climbing discovery of the policy that equates returns

• Policy is one-dimensional (the ratio of expected stay durations)

• Try-out given policy (stay ratio)

• Determine direction of inequality

• Adjust investment ratio accordingly

I_1/T_1 ? I_2/T_2 (compare returns, then adjust the stay ratio T_1/T_2; sketched below)
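A minimal sketch of the kind of hill-climbing model these accounts assume: try a stay ratio, compare the resulting returns, nudge the ratio toward equality, repeat. The incomes and step size are invented for illustration:

```python
# Hill-climbing on the one-dimensional policy (stay ratio), as the RL
# models assume. Incomes and step size are invented for illustration.
I1, I2 = 30.0, 10.0   # incomes from the two locations (rewards/hour)
ratio = 1.0           # policy: expected stay ratio T1/T2
STEP = 1.05

for _ in range(200):
    r1, r2 = I1 / ratio, I2 / 1.0  # returns with stays T1 = ratio, T2 = 1
    if r1 > r2:
        ratio *= STEP   # location 1 pays better per unit time: stay longer
    elif r1 < r2:
        ratio /= STEP
print(ratio)  # creeps toward I1/I2 = 3.0: slow and incremental, unlike the data
```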

But (Gallistel et al 2001)

• Adjustment of investment ratio after a step change in the relative rates of reward is quick and step-like

Bayesian Ideal Detector Analysis

Second Example

Incomes, Not Returns

• Evidence of a change in behavior appears as soon as there is evidence of a change in incomes

• And (often) before there is evidence of a change in returns

Evidence of Absence of Evidence

• Upper panel: Odds that subject’s stay durations had changed as a function of session time

• Lower panel: Odds that subject’s returns had changed. There was no evidence--in the returns!
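The ideal-detector logic can be sketched as a Bayes factor: compare the marginal likelihood of “one rate throughout” against “rate change at trial k” for the behavioral record. A minimal sketch for exponentially distributed stay durations with conjugate Gamma priors; the model, priors, and data are my assumptions, not the analysis in the talk:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(durations, a=1.0, b=1.0):
    # Log marginal likelihood of exponential data under a Gamma(a, b)
    # prior on the rate (standard conjugate result).
    n, s = len(durations), np.sum(durations)
    return (a * np.log(b) - gammaln(a)) + gammaln(a + n) - (a + n) * np.log(b + s)

def log_odds_change(durations, k):
    # Log odds for "rate changed at index k" vs "one rate throughout".
    return (log_marginal(durations[:k]) + log_marginal(durations[k:])
            - log_marginal(durations))

rng = np.random.default_rng(1)
stays = np.concatenate([rng.exponential(10.0, 40),   # long stays...
                        rng.exponential(3.0, 40)])   # ...then abruptly short
print(log_odds_change(stays, 40))  # strongly positive: change detected
```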

Implications

• Matching is an innate policy

• Depends only on estimates of incomes

• Anti-aliasing sampling behavior to detect periodic structure in reward provision is built into the policy

• Estimates of incomes to be expected are based on small samples, taken only when a change in income is detected

• Here, too, learning is model updating, not policy value updating

• Subjects perversely ignore returns (policy values)

Conclusions

• Most (all?) natural learning looks like model estimation

• Efficient model estimation is made possible by:

• Informative priors (a highly structured, problem-specific hypothesis space)

• Innately specified efficient sampling routines