Page 1:

Reinforcement Learning in the Control of Attention

Roderic A Grupen

Laboratory for Analysis and Architecture of Systems (State University of Campinas - near future) www.laas.fr/~lmgarcia

Laboratory for Perceptual Robotics, State University of Massachusetts (USA) www-robotics.cs.umass.edu

Luiz M G Gonçalves

Page 2:

Objective

To develop a robotic system to perform tasks involving attention and pattern categorization, integrating multi-modal (haptic and visual) information in a behaviorally cooperative active system.

Page 3:

Motivation

Towards finding a useful robotic system able to: foveate (verge) the eyes onto a ROI; keep attention on the ROI if needed; choose another ROI (shift the focus of attention).

The result is a behaviorally cooperative active system, which provides on-line feedback to environmental stimuli in the form of actions.

Page 4:

Method

Use of (real time) visual information from a stereo head and a simulator

Selective Attention (bottom-up salience maps)

Multi-feature extraction (perceptual state)

Associative memory (pattern address identification)

Efficient topological mapping

Learn policies to program the system

Page 5:

Task Specification (Objectives)

Visual Monitoring or Environment Inspection

Construction of an attentional map

Keep this map consistent with the current perception (update)

Categorize all patterns

Page 6:

Markov Process

A stochastic process whose past does not influence the future if its present is completely specified.

Ex: checkers, chess

P\{X(t_{n+1}) = x_{n+1} \mid X(t_n), X(t_{n-1}), \ldots, X(t_1)\} = P\{X(t_{n+1}) = x_{n+1} \mid X(t_n)\}, \qquad t_1 < t_2 < \cdots < t_{n+1}

Page 7:

Dynamic Programming

The naive approach: traverse all possible states, testing all possibilities (executing all actions indefinitely).

A better solution (DP): reduce the complexity of a problem that would be solved in dimension D to two or more problems in lower dimensions.

Ex: Stereo disparity: one problem in 3D (x, y, d) is reduced to two problems in 2D, (x, d) and (y, d).
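As an illustration of the decomposition idea only (not the actual stereo-matching algorithm from the talk), here is a minimal Python sketch assuming a hypothetical cost that separates into an (x, d) term and a (y, d) term; the exhaustive 3D search and the two 2D searches find the same minimum, but the decomposed version is much cheaper.

```python
import numpy as np

# Hypothetical separable cost: c(x, y, d) = cx(x, d) + cy(y, d).
rng = np.random.default_rng(0)
N, D = 64, 16
cx = rng.random((N, D))   # cost depending on (x, d) only
cy = rng.random((N, D))   # cost depending on (y, d) only

# Exhaustive 3D search over all (x, y, d) triples: O(N^2 * D)
full = cx[:, None, :] + cy[None, :, :]          # shape (N, N, D)
best_3d = full.min()

# Two independent 2D problems: O(N * D) each
best_2d = (cx.min(axis=0) + cy.min(axis=0)).min()

assert np.isclose(best_3d, best_2d)
```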

Page 8:

Pavlov

If the animal does the right thing, it gets food; if it does the wrong thing, it gets punished.

In theory, it is proven that just one of the two (reward or punishment) is enough: if the animal does the wrong thing, it simply gets no food.

Thus: the robot does the right thing => reward.

Page 9:

Reinforcement Learning (Related Work)

Watkins: Learning from Delayed Rewards (1989).

Sutton/Barto: Reinforcement Learning: An Introduction (1998).

Araujo: Learning a Control Composition in a Complex Environment (1996).

Huber: A Feedback Control Structure for On-line Learning tasks (1997).

Coelho: A Control Basis for Learning Multifingered Grasps (1997).

Page 10:

Modelling a problem with delayed reinforcement as an MDP:

a set of states S, a set of actions (operators) A, a reward function R: S × A → ℝ, and a state transition function T: S × A → Π(S), which maps state transitions to probabilities.

Q-learning equation:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a) \right]

Page 11:

Q-learning equation

a = action executed
r = reward
s' = state resulting from applying a
A = set of all possible actions a' that can be executed in s'
α = learning rate (usually 0.1)
γ = discount factor (usually 0.5)

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a) \right]
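A minimal sketch of this update rule in Python, using the symbols defined above; the Q-table is stored as a NumPy array indexed by discrete state and action ids, and the environment itself is not part of the sketch.

```python
import numpy as np

ALPHA = 0.1   # learning rate, as on the slide
GAMMA = 0.5   # discount factor, as on the slide

def q_update(Q, s, a, r, s_next, alpha=ALPHA, gamma=GAMMA):
    """One Q-learning backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error
```

With Q = np.zeros((n_states, n_actions)), q_update is applied after every observed transition (s, a, r, s').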

Page 12:

Observations

A transition in the state space can be completely characterized by the tuple (s, a, r, s').

Assuming that Q(s, a) is updated infinitely often for every pair (s, a), Q(s, a) converges with probability 1 to the best possible reward for that pair.

Page 13:

Exploration and Exploitation

Exploration: choose an action at random.

Exploitation: after some time the system starts to converge, so the actions known to be contributing to convergence are chosen.

Balance exploration and exploitation.

Temperature (as in simulated annealing): choose randomly as a function of the temperature (high at first, then low); see the sketch below.

In practice, even at the end, about 10% of the choices are still random.
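A minimal sketch of a temperature-based (Boltzmann/softmax) action selector as described above; the residual 10% random choice matches the slide, but the temperature floor and the exact schedule are assumptions, not parameters from the system.

```python
import numpy as np

rng = np.random.default_rng()

def select_action(Q, s, temperature, epsilon=0.1):
    """Pick an action for state s: softmax over Q-values at the given
    temperature, plus a residual epsilon of purely random choices."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return rng.integers(n_actions)      # residual random exploration
    prefs = Q[s] / max(temperature, 1e-6)   # high temperature -> near-uniform
    prefs -= prefs.max()                    # numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return rng.choice(n_actions, p=probs)
```

The temperature starts high (almost uniform random choice) and is lowered as learning converges, shifting the balance from exploration to exploitation.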

Page 14:

Q-learning Algorithm

1) Define the current state s by decoding the sensory information available;
2) Use the stochastic action selector to determine an action a;
3) Perform action a, generating a new state s' and a reinforcement r;
4) Calculate the temporal difference error r':

r' = r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a)

5) Update the Q-value of the state/action pair (s, a):

Q(s,a) \leftarrow Q(s,a) + \alpha r'

6) Go to 1.
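Putting steps 1-6 together, a hedged sketch of the learning loop, reusing select_action from the earlier sketch; env.observe_state and env.execute are hypothetical placeholders for the attention system's sensory decoding and controller activation, and the cooling schedule is an assumption.

```python
import numpy as np

def q_learning_loop(env, n_states, n_actions, n_steps, alpha=0.1, gamma=0.5):
    Q = np.zeros((n_states, n_actions))
    temperature = 1.0
    for step in range(n_steps):
        s = env.observe_state()                        # 1) decode sensory information
        a = select_action(Q, s, temperature)           # 2) stochastic action selector
        r, s_next = env.execute(a)                     # 3) act, get reward and next state
        td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]   # 4) TD error r'
        Q[s, a] += alpha * td_error                    # 5) update Q(s, a)
        temperature = max(0.05, temperature * 0.999)   # cool the action selector
    return Q                                           # 6) loop until the budget is spent
```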

Page 15:

Eligibility trace

Update not just one state-action pair at a time, but a whole sequence of pairs (after a series of actions has been executed).

Gain in convergence speed.
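A minimal sketch of how an eligibility trace extends the one-step update: every visited pair keeps a decaying trace, and the TD error is applied to all of them at once. The lambda value and the use of simple accumulating traces (rather than Watkins's variant, which resets traces after non-greedy actions) are assumptions for illustration.

```python
import numpy as np

def q_lambda_update(Q, trace, s, a, r, s_next, alpha=0.1, gamma=0.5, lam=0.8):
    """One step of Q-learning with accumulating eligibility traces:
    the TD error updates every recently visited state-action pair,
    weighted by its trace, instead of only the current pair."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    trace[s, a] += 1.0                 # mark the current pair as eligible
    Q += alpha * td_error * trace      # credit the whole recent sequence
    trace *= gamma * lam               # decay all traces
    return td_error
```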

Page 16:

In practice

A table (the Q-table):
rows are the states (s);
columns are the actions (a);
each element Q(s, a) is a Q-value, given by the function that evaluates the utility of taking action a when the state is s.
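In code the table is simply a 2-D array; a tiny sketch (the state and action counts here are arbitrary, not those of the attention system):

```python
import numpy as np

n_states, n_actions = 12, 4          # arbitrary sizes for illustration
Q = np.zeros((n_states, n_actions))  # rows: states s, columns: actions a

# After learning, the greedy policy reads off the best action per row:
policy = Q.argmax(axis=1)            # policy[s] = argmax_a Q(s, a)
```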

Page 17:

Roger-the-Crab

Page 18:

Stereo Head Environment

Page 19:

Degrees of Freedom (Controllers)

Page 20:

System Control Architecture

Page 21:

Low-level Control: Defining a Target

Pre-attentional phase (stimuli + internal biases)

Shifting attention (saccade generation)

Fine saccade (using a target model)

Verging the eyes onto a target (correlation)

Movements are computed from the errors to the image centers (see the sketch below)
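A hedged sketch of the last point: a simple proportional controller that turns the pixel error between a target and the image center into pan/tilt corrections. The gain and the interface are hypothetical and stand in for the actual head controllers.

```python
def center_target(target_xy, image_size, gain=0.05):
    """Return (pan, tilt) corrections proportional to the error between
    the target pixel and the image center."""
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    err_x = target_xy[0] - cx
    err_y = target_xy[1] - cy
    return gain * err_x, gain * err_y   # small step toward centering the target
```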

Page 22:

Low-level Control: Identifying Objects

Selecting a region of interest

Extracting features

Associative memory match (see the sketch after this list)

Mapping objects and/or updating memory

Pre-attentional maps

Automatic supervised learning
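A minimal sketch of an associative-memory match by feature similarity; the feature vectors, the cosine similarity measure, and the acceptance threshold are assumptions for illustration, not the memory model used in the system.

```python
import numpy as np

def memory_match(feature, memory, threshold=0.9):
    """Compare a feature vector against stored patterns (rows of `memory`).
    Return the index of the best match, or -1 if nothing is similar enough
    (i.e. a new object that should be added to the memory)."""
    if len(memory) == 0:
        return -1
    feats = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    query = feature / np.linalg.norm(feature)
    sims = feats @ query                 # cosine similarities
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else -1
```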

Page 23:

Behavioral Program

Page 24:

A straightforward control algorithm (sketched in code below)

Step 0: Initialize the associative memory and start the concurrent controllers of the arms, neck, and eyes.
Step 1: Re-direct attention; if a representation is activated, update the attentional maps and re-do this step (1).
Step 2: Try a visual improvement; if a representation is activated, update the attentional maps and return to step 1.
Step 3: Try an arm improvement; if a representation is activated, update the attentional maps and return to step 1.
Step 4: Activate the "supervised learning" module, update the attentional maps and return to step 1.
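A hedged sketch of these steps as a loop; redirect_attention, try_visual_improvement, try_arm_improvement, supervised_learning, and update_maps are hypothetical placeholders for the system's modules, used only to show the control flow.

```python
def control_loop(system):
    system.init_associative_memory()
    system.start_controllers()                 # Step 0: arms, neck, and eyes
    while system.running():
        if system.redirect_attention():        # Step 1: representation activated?
            system.update_maps(); continue     #   yes -> update maps, re-do step 1
        if system.try_visual_improvement():    # Step 2
            system.update_maps(); continue     #   back to step 1
        if system.try_arm_improvement():       # Step 3
            system.update_maps(); continue     #   back to step 1
        system.supervised_learning()           # Step 4: fall back to supervised learning
        system.update_maps()
```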

Page 25:

Finite state machine

Page 26:

Results

Q-learning convergence

Page 27:

Partial Evaluation of strategies

Attentional shifts

Page 28:

Partial Evaluation of strategies

Visual/arm improvements

Page 29:

Partial Evaluation of strategies

Objects identified

Page 30:

Partial Evaluation of strategies

New objects

Page 31:

Global evaluation

Mapped objects

Page 32:

Task accomplishment

Mapped objects

Page 33:

Times for each phase or process

Phase                 Min (sec)   Max (sec)   Mean (sec)
Computing retina      0.145       0.189       0.166
Transfer to host      0.017       0.059       0.020
Total acquiring       0.162       0.255       0.186
Pre-attention         0.139       0.205       0.149
Salience map          0.067       0.134       0.075
Total attention       0.324       0.395       0.334
Total saccade         0.466       0.903       0.485
Features for match    0.135       0.158       0.150
Memory match          0.012       0.028       0.019
Total matching        0.323       0.353       0.333

Page 34:

Conclusions

The system can support other sensors.

Attention and categorization act together: tasks must be formulated accordingly.

The inspection task was successfully done.

The system currently supports a frame rate of 10-15 frames per second.

The reinforcement learning approach worked well in simulation.

Page 35:

Future Work

Consider focus for saccade generation and accommodation (vergence)

Test with partially occluded objects

Derive policies (with Q-learning) for control of top-down attention

Increase the state space and/or the set of actions

Define other hierarchical tasks (several policies, each appropriate for a given task)

Test the learning architecture in a real environment

Page 36:

Thanks

Thanks to CNPQ, CAPES, FAPERJ, NSF and UMASS (USA)

To all of you for your patience

To Mimmo and Dr. Arcangelo Distante for hosting me:-).