Building an Artificial Cerebellum using a System of Distributed Q-Learning Agents
Miguel Angel Soto Santibanez

  • Slide 1
  • Miguel Angel Soto Santibanez
  • Slide 2
  • Overview Why they are important Previous Work Advantages and Shortcomings
  • Slide 3
  • Overview Advantages and Shortcomings New Technique Illustration
  • Slide 4
  • Overview Illustration Centralized Learning Agent Issue New Techniques
  • Slide 5
  • Overview New Techniques Evaluate Set of Techniques Summary of Contributions
  • Slide 6
  • Overview Summary of Contributions Software Tool
  • Slide 7
  • Early Sensorimotor Neural Networks Sensory System (receptors) Motor System (motor executors)
  • Slide 8
  • Modern Sensorimotor Neural Networks Sensory System Motor System Cerebellum
  • Slide 9
  • Natural Cerebellum FAST PRECISELY SYNCHRONIZED
  • Slide 10
  • Artificial Cerebellum
  • Slide 11
  • Previous Work Cerebellatron CMAC FOX SpikeFORCE SENSOPAC
  • Slide 12
  • Previous work strengths Cerebellatron improved movement smoothness in robots.
  • Slide 13
  • Previous work strengths CMAC provided function approximation with fast convergence.
  • Slide 14
  • Previous work strengths FOX improved convergence by making use of eligibility vectors.
  • Slide 15
  • Previous work strengths SpikeFORCE and SENSOPAC have improved our understanding of the natural cerebellum.
  • Slide 16
  • Previous work strengths May allow us to: Treat nervous system ailments. Discover biological mechanisms.
  • Slide 17
  • Previous work strengths LWPR is a step in the right direction toward tackling the scalability issue.
  • Slide 18
  • Previous work issues Cerebellatron is difficult to use and requires very complicated control input
  • Slide 19
  • Previous work issues CMAC and FOX depend on fixed-size tiles and therefore do not scale well.
  • Slide 20
  • Previous work issues Methods proposed by SpikeFORCE and SENSOPAC require rare skills.
  • Slide 21
  • Previous work issues The LWPR method proposed by SENSOPAC only works well if the problem has just a few non-redundant, non-irrelevant dimensions.
  • Slide 22
  • Previous work issues Two categories: 1) Framework Usability Issues; 2) Building Blocks Incompatibility Issues.
  • Slide 23
  • Previous work issues Two categories: 1) Framework Usability Issues: very difficult to use; requires very specialized skills.
  • Slide 24
  • Previous work issues Two categories: 2) Building Blocks Incompatibility Issues: memory incompatibility; processing incompatibility.
  • Slide 25
  • Proposed technique. Two Categories: Framework Usability Issues: new development framework
  • Slide 26
  • Proposed technique. Two Categories: Building Blocks Incompatibility Issues new I/O mapping algorithm
  • Slide 27
  • Proposed Technique Provides a shorthand notation.
  • Slide 28
  • Proposed Technique Provides a recipe.
  • Slide 29
  • Proposed Technique Provides simplification rules.
  • Slide 30
  • Proposed Technique Provides a more compatible I/O mapping algorithm. Moving Prototypes
  • Slide 31
  • Proposed Technique The shorthand notation symbols: a sensor:
  • Slide 32
  • Proposed Technique The shorthand notation symbols: an actuator:
  • Slide 33
  • Proposed Technique The shorthand notation symbols: a master learning agent:
  • Slide 34
  • Proposed Technique The shorthand notation symbols: the simplest artificial cerebellum:
  • Slide 35
  • Proposed Technique The shorthand notation symbols: an encoder:
  • Slide 36
  • Proposed Technique The shorthand notation symbols: a decoder:
  • Slide 37
  • Proposed Technique The shorthand notation symbols: an agent with sensors and actuators:
  • Slide 38
  • Proposed Technique The shorthand notation symbols: a sanity point: S
  • Slide 39
  • Proposed Technique The shorthand notation symbols: a slave sensor learning agent:
  • Slide 40
  • Proposed Technique The shorthand notation symbols: a slave actuator learning agent:
  • Slide 41
  • Proposed Technique The Recipe: 1) Become familiar with the problem at hand. 2) Enumerate significant factors. 3) Categorize factors as either sensors, actuators, or both. 4) Specify sanity points.
  • Slide 42
  • Proposed Technique The Recipe: 5) Simplify overloaded agents. 6) Describe the system using the proposed shorthand notation. 7) Apply the simplification rules. 8) Specify a reward function for each agent.
  • Slide 43
  • Proposed Technique The simplification rules: 1) Two agents in series can be merged into a single agent:
  • Slide 44
  • Proposed Technique The simplification rules: 2) It is OK to apply simplification rules as long as no sanity point is destroyed.
  • Slide 45
  • Proposed Technique The simplification rules: 3) If a decoder and an encoder share the same output and input signals, respectively, they can be deleted.
  • Slide 46
  • Proposed Technique The simplification rules: 4) Decoders with a single output signal can be deleted.
  • Slide 47
  • Proposed Technique The simplification rules: 5) Encoders with a single output signal can be deleted.
  • Slide 48
  • Proposed Technique The simplification rules: 6) If several agents receive signals from a single decoder and send their signals to a single encoder:
  • Slide 49
  • Proposed Technique Q-learning: an off-policy control algorithm. A temporal-difference algorithm. Can be applied online.
  • Slide 50
  • Proposed Technique L. A. R Q-learning:
  • Slide 51
  • Proposed Technique Q-learning: How does it do it? By learning an estimate of the long-term expected reward for any state-action pair.
  • Slide 52
  • Proposed Technique Q-learning value function: maps a state and an action to the long-term expected reward.
  • Slide 53
  • Proposed Technique Q-learning: (state, action) → long-term expected reward.
  • Slide 54
  • Proposed Technique Q-learning: state, action, long-term expected reward. Example table:
        state | action
        64    | 140
        72    | 140
        80    | 140
        88    | 140
        96    | 140
        104   | 140
        112   | 140
  • Slide 55
  • Proposed Technique Q-learning: state, action, long-term expected reward. Example table:
        state | action
        0.2   | 0.70
        0.4   | 0.75
        0.6   | 0.90
        0.8   | 0.95
        1.0   | 0.80
        1.2   | 0.90
        1.4   | 0.55
  • Slide 56
  • Proposed Technique How is the Q-function updated? (a minimal Python sketch follows below)

        Initialize Q(s, a) arbitrarily
        Repeat (for each episode):
            Initialize s
            Repeat (for each step of the episode):
                Choose a for s using an exploratory policy
                Take action a, observe r, s'
                Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
                s ← s'
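To make the update rule concrete, here is a minimal tabular Q-learning sketch in Python. The learning rate, discount factor, exploration rate, and state/action encodings are illustrative assumptions, not values taken from the slides.

```python
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate (illustrative)
GAMMA = 0.9    # discount factor (illustrative)
EPSILON = 0.1  # exploration rate for the exploratory policy (illustrative)

# Q maps a (state, action) pair to its long-term expected reward estimate.
Q = defaultdict(float)

def choose_action(state, actions):
    """Epsilon-greedy exploratory policy: usually greedy, sometimes random."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, actions):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```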
  • Slide 57
  • Proposed Technique Q-learning: [figure: the Q-function drawn as a table over states 1..9 and actions 1..9]
  • Slide 58
  • Proposed Technique Proposed alternatives: Tile Coding, Kanerva Coding (a minimal tile-coding sketch follows below)
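For contrast with Moving Prototypes, here is a minimal illustration of the fixed-size tiling idea behind tile-coding approaches such as CMAC (Python; the tile width and the example values are illustrative assumptions, not values from the slides). Because the tile size is fixed up front, resolution and memory use are decided in advance for the whole state-action space rather than adapting to where detail is actually needed, which is the scalability issue raised earlier.

```python
def tile_index(state, action, tile_width=0.5):
    """Map a continuous (state, action) point to a fixed-size tile.
    The resolution is chosen in advance and is the same everywhere."""
    return (int(state // tile_width), int(action // tile_width))

# Example: state 3.7 and action 1.2 fall into tile (7, 2) when tile_width = 0.5.
print(tile_index(3.7, 1.2))
```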
  • Slide 59
  • Proposed Technique Distribution of memory resources:
  • Slide 60
  • Proposed Technique Moving Prototypes:
  • Slide 61
  • Proposed Technique Tile Based vs. Moving Prototypes:
  • Slide 62
  • Proposed Technique CMAC vs. Moving Prototypes:
  • Slide 63
  • Proposed Technique LWPR vs. Moving Prototypes:
  • Slide 64
  • Proposed Technique LWPR vs. Moving Prototypes: [figure: network with input, hidden, and output layers]
  • Slide 65
  • Proposed Technique LWPR vs. Moving Prototypes: O(n) lookups (LWPR) vs. O(log n) lookups (Moving Prototypes) [figure: input, hidden, and output layers]
  • Slide 66
  • Proposed Technique LWPR vs. Moving Prototypes: one trillion interactions (LWPR) vs. a few dozen interactions (Moving Prototypes) [figure: input, hidden, and output layers] (see the sketch below)
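The slides do not spell out the Moving Prototypes data structure, so the following only illustrates where an O(log n) lookup can come from: prototypes kept sorted along a single state key and searched with binary search (Python; PrototypeStore and its methods are hypothetical names, not code from the thesis).

```python
import bisect

class PrototypeStore:
    """Toy store of (state_key, value) prototypes kept sorted by key, so the
    closest prototype is found with binary search in O(log n) comparisons
    instead of scanning all n prototypes."""

    def __init__(self):
        self.keys = []     # sorted 1-D state keys
        self.values = []   # value estimate attached to each prototype

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.values.insert(i, value)

    def nearest_value(self, key):
        if not self.keys:
            return 0.0  # no prototypes learned yet
        i = bisect.bisect_left(self.keys, key)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.keys)]
        best = min(candidates, key=lambda j: abs(self.keys[j] - key))
        return self.values[best]
```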
  • Slide 67
  • Centralized L.A. issue [figure: a single learning agent (L.A.) receives obstacle positions and reward signals (R) and outputs the steering angle]
  • Slide 68
  • Centralized L.A. issue L. A. R R R
  • Slide 69
  • Centralized L.A. issue L. A. R R R
  • Slide 70
  • Centralized L.A. issue L. A. R R R
  • Slide 71
  • Centralized L.A. issue (possible solution): Centralized L.A. vs. Distributed L.A.
  • Slide 72
  • Distributed L.A. Additional Tasks: 1) Analyze Distributed L.A. Systems. 2) Identify most important bottlenecks. 3) Propose amelioration techniques. 4) Evaluate their performance.
  • Slide 73
  • Distributed L.A. DEVS Java:
  • Slide 74
  • Distributed L.A. Some DEVS Java details: 1) Provides continuous-time support. 2) Models are easy to define. 3) No need to worry about the simulator. 4) Supports concurrency.
  • Slide 75
  • Distributed L.A. First technique: Learning will occur faster if we split the value function into slices along the action axis (a minimal sketch follows below). [figure: the states × actions table split into slices along the action axis]
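A minimal sketch of the slicing idea, assuming each distributed agent owns one contiguous slice of the action axis and the global greedy choice is the best action over all slices (Python; SliceAgent and split_actions are illustrative names, not code from the thesis).

```python
from collections import defaultdict

class SliceAgent:
    """Learns Q-values only for its own slice of the action axis."""

    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.actions = list(actions)   # the slice of actions this agent owns
        self.alpha, self.gamma = alpha, gamma
        self.Q = defaultdict(float)

    def best(self, state):
        """Best action and value within this agent's slice."""
        a = max(self.actions, key=lambda act: self.Q[(state, act)])
        return a, self.Q[(state, a)]

    def update(self, s, a, r, s_next, global_max_next):
        # global_max_next is max Q(s', a') over *all* slices, so the agents
        # still learn the same value function they would learn jointly.
        self.Q[(s, a)] += self.alpha * (r + self.gamma * global_max_next - self.Q[(s, a)])

def split_actions(actions, n_agents):
    """Cut the action axis into n_agents contiguous slices."""
    k = max(1, len(actions) // n_agents)
    return [actions[i:i + k] for i in range(0, len(actions), k)]
```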
  • Slide 76
  • Distributed L.A. First Technique Results: 1) 3500 time units to learn to balance the pole (without slicing). 2) 1700 time units (with action-axis slicing). ~50% improvement.
  • Slide 77
  • Distributed L.A. Second technique: We suspect that we may be able to ameliorate this problem by having sluggish learning agents pass some of their load to speedy learning agents (a minimal sketch follows below).
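A minimal sketch of this load-balancing idea, reusing the SliceAgent from the previous sketch; the backlog measure and the one-action-at-a-time handoff are illustrative assumptions, not details from the slides.

```python
def rebalance(agents, backlog):
    """Hand one action from the most loaded (sluggish) agent's slice to the
    least loaded (speedy) agent. backlog maps each agent to its number of
    pending updates. A full implementation would also migrate the Q-values
    learned for the transferred action."""
    sluggish = max(agents, key=lambda ag: backlog[ag])
    speedy = min(agents, key=lambda ag: backlog[ag])
    if sluggish is speedy or len(sluggish.actions) <= 1:
        return
    speedy.actions.append(sluggish.actions.pop())
```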
  • Slide 78
  • Distributed L.A. Second Technique Results: 1) 52000 time units to learn to balance the pole (without load balancing). 2) 42000 time units (with load balancing). ~20% improvement.
  • Slide 79
  • Distributed L.A. Third Technique: Giving each distributed LA its own local simulator should improve learning performance significantly.
  • Slide 80
  • Distributed L.A. Third Technique Results: 1) 29000 time units to learn to balance the pole (without local simulators). 2) 14000 time units (with a local simulator per agent). ~50% improvement.
  • Slide 81
  • Distributed L.A. Fourth Technique: Using a push policy instead of a poll policy to propagate max-values among the learning agents should improve learning performance significantly (a minimal sketch follows below).
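A minimal sketch of push versus poll propagation of max Q(s, a) values between the distributed agents (Python; MaxValueBoard and on_max_value are hypothetical names). With polling, every agent keeps asking for values it may not need; with pushing, the agent that improves a value notifies the others exactly once.

```python
class MaxValueBoard:
    """Shared view of the best known max Q(s, a) per state."""

    def __init__(self):
        self.max_q = {}        # state -> best value reported so far
        self.subscribers = []  # agents interested in updates

    def poll(self, state):
        """Poll policy: an agent asks for the current value whenever it needs it."""
        return self.max_q.get(state, 0.0)

    def push(self, state, value):
        """Push policy: the agent that improves a value notifies everyone at once."""
        if value > self.max_q.get(state, float("-inf")):
            self.max_q[state] = value
            for agent in self.subscribers:
                agent.on_max_value(state, value)  # hypothetical callback on each agent
```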
  • Slide 82
  • Distributed L.A. Fourth Technique Results: 1) 14000 time units to learn to balance the pole (poll policy). 2) 6000 time units (push policy). ~60% improvement.
  • Slide 83
  • Distributed L.A. How does the Push policy scale?
  • Slide 84
  • Distributed L.A. Results: 1) 24000 time units to learn to balance pole. 2) 12000 time units to learn to balance pole. ~ 50% improvement.
  • Slide 85
  • Distributed L.A. Results suggest the use of a distributed system that: 1) divides the value function along the action axis; 2) auto-balances load from slow agents to fast ones; 3) allows each agent access to its own simulator; 4) uses a push policy to propagate max Q(s,a) information.
  • Slide 86
  • Contributions Explained the importance of artificial cerebella.
  • Slide 87
  • Contributions Provided an analysis of previous work: Cerebellatron, CMAC, FOX, SpikeFORCE, SENSOPAC.
  • Slide 88
  • Contributions Extracted most important shortcomings: 1) Framework usability. 2) Building blocks incompatibility.
  • Slide 89
  • Contributions Proposed to use a new framework
  • Slide 90
  • Contributions Proposed the use of Moving Prototypes (compared against CMAC and FOX).
  • Slide 91
  • Contributions Proposed the use of Moving Prototypes: O(log n) lookups instead of O(n).
  • Slide 92
  • Contributions Illustrated new technique using a real-life problem
  • Slide 93
  • Contributions Proposed using a distributed agent system.
  • Slide 94
  • Contributions Proposed a set of techniques to ameliorate bottlenecks.
  • Slide 95
  • Future Work Development Tool:
  • Slide 96
  • Future Work Automation: Extract Policy, Train
  • Slide 97
  • Future Work Intuitive: Extract Policy Train Components
  • Slide 98
  • Future Work Simple component configuration: Extract Policy Train Hardware Details Signal Table Details . Components
  • Slide 99
  • Future Work Integrated simulator: Extract Policy Train Components
  • Slide 100
  • Extract Policy Train Components Library Water Components Space Components Land Components Air Components Future Work Available library:
  • Slide 101
  • Questions or Comments
  • Slide 102
  • Slide 103
  • Slide 104
  • Asteroids Problem
  • Slide 105
  • Asteroids Problem (step 1)
  • Slide 106
  • spacecraft
  • Slide 107
  • Asteroids Problem (step 1) spacecraft
  • Slide 108
  • Asteroids Problem (step 2) 1) Distance to nearby asteroids (d). 2) Angle to nearby asteroids (a). 3) Angle to goal direction (g).
  • Slide 109
  • Asteroids Problem (step 3) 1)Distance to nearby Asteroids Sensor d
  • Slide 110
  • Asteroids Problem (step 3) 2)Angle to nearby Asteroids Sensor Actuator a
  • Slide 111
  • Asteroids Problem (step 3) 3)Angle to goal direction Sensor Actuator
  • Slide 112
  • Asteroids Problem (step 3) Sensor 1: find distance to nearby asteroids. Sensor 2: find angle to nearby asteroids. Sensor 3: find angle to goal direction. Actuator 1: steer spacecraft (to avoid collision). Actuator 2: steer spacecraft (to match goal).
  • Slide 113
  • Asteroids Problem (step 3) Sensor 1: find distance to nearby asteroids. Sensor 2: find angle to nearby asteroids. Sensor 3: find angle to goal direction. Actuator 1: steer spacecraft (to avoid collision and to match goal).
  • Slide 114
  • Asteroids Problem (step 3) The learning agent receives information about the environment from sensors S1, S2, and S3 and instructs actuator A1 what to do.
  • Slide 115
  • Asteroids Problem (step 4) One learning agent receives information about the environment from sensors S1, S2, and S3 and generates the desired delta heading angle signal. A second learning agent receives the desired delta heading angle signal and instructs actuator A1 what to do (sanity point S on the desired delta heading angle signal).
  • Slide 116
  • Asteroids Problem (step 5) One learning agent receives information about the environment from sensors S1, S2, and S3 and generates the desired delta heading angle signal. A second learning agent receives the desired delta heading angle signal and instructs actuator A1 what to do (sanity point S on the desired delta heading angle signal).
  • Slide 117
  • Asteroids Problem (step 6) S S1 S2 S3 A1 desired delta heading angle
  • Slide 118
  • Asteroids Problem (step 6) S radar angles to nearby asteroids distances to nearby asteroids desired delta heading angle angle to goal
  • Slide 119
  • Asteroids Problem (step 6)
  • Slide 120
  • Slide 121
  • Asteroids Problem distance signal [0,4] and angle signal (a minimal encoder sketch follows below)
        signal | distance (m)
        0 | [0, 19)
        1 | [19, 39)
        2 | [39, 59)
        3 | [59, 79)
        4 | [79, ∞)
        signal | angle (°)
        0 | [0, 22.5)
        ...
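A minimal encoder for the distance table above (Python); the angle signal would be produced the same way with 22.5-degree bins. The function name is illustrative.

```python
def distance_signal(distance_m):
    """Map a continuous distance in metres to the discrete signal [0, 4]
    defined by the table above."""
    upper_bounds = [19, 39, 59, 79]          # upper edges of signals 0..3
    for signal, upper in enumerate(upper_bounds):
        if distance_m < upper:
            return signal
    return 4                                 # [79, infinity)
```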
  • Slide 122
  • Asteroids Problem (step 6) S Radar angles to nearby asteroids [0,7] angle to goal distances to nearby asteroids [0,390624] desired delta heading angle
  • Slide 123
  • Asteroids Problem (step 6) spacecraft
  • Slide 124
  • Asteroids Problem (step 6) S Radar angles to nearby asteroids [0,7] angle to goal [0,7] distances to nearby asteroids [0,390624] desired delta heading angle
  • Slide 125
  • Asteroids Problem (step 6) S, radar, angles to nearby asteroids [0,7], angle to goal [0,7], distances to nearby asteroids [0,390624], [0,24999999], desired delta heading angle [0,20] (a minimal decoder sketch follows below)
        signal | angle (°)
        0  | -10
        1  | -9
        ...
        10 | 0
        ...
        19 | 9
        20 | 10
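A minimal decoder for the desired-delta-heading-angle signal in the table above (Python): the signal range [0, 20] maps linearly onto -10 to +10 degrees. The function name is illustrative.

```python
def delta_heading_degrees(signal):
    """Convert the agent's output signal in [0, 20] into a delta heading angle
    in degrees, following the table above (0 -> -10, 10 -> 0, 20 -> +10)."""
    if not 0 <= signal <= 20:
        raise ValueError("signal must be in [0, 20]")
    return signal - 10
```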
  • Slide 126
  • Asteroids Problem (step 6)
  • Slide 127
  • Slide 128
  • Slide 129
  • S Radar angles to nearby asteroids [0,7] angle to goal [0,7] distances to nearby asteroids [0,390624] [0,24999999][0,20] propulsion system propulsion signal [0,100] [0,20] desired delta heading angle
  • Slide 130
  • Asteroids Problem (step 7) S Radar angles to nearby asteroids [0,7] angle to goal [0,7] distances to nearby asteroids [0,390624] [0,24999999][0,20] propulsion system propulsion signal [0,100] [0,20] desired delta heading angle
  • Slide 131
  • Asteroids Problem (step 8) S Radar angles to nearby asteroids [0,7] angle to goal [0,7] distances to nearby asteroids [0,390624] [0,24999999][0,20] propulsion system propulsion signal [0,100] [0,20] desired delta heading angle
  • Slide 132
  • Asteroids Problem (step 8) FIRST AGENT: spacecraft environment [0,24999999] → desired delta heading angle [0,20]. SECOND AGENT: desired delta heading angle [0,20] → propulsion signal [0,100].
  • Slide 133
  • Asteroids Problem (step 8) FIRST AGENT (spacecraft environment [0,24999999] → desired delta heading angle [0,20]). REWARD FUNCTION: reward = -10000 if the distance to an asteroid is < 10 m; otherwise reward = 180 - |angleToGoal|.
  • Slide 134
  • Asteroids Problem (step 8) SECOND AGENT (desired delta heading angle [0,20] → propulsion signal [0,100]). REWARD FUNCTION: reward = -|achievedAngle - desiredAngle| (a minimal sketch of both reward functions follows below).
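A minimal sketch of the two reward functions above (Python); the 10 m threshold comes from the first agent's slide, while the function and variable names are illustrative.

```python
def master_agent_reward(min_asteroid_distance_m, angle_to_goal_deg):
    """First agent: heavy penalty when an asteroid gets closer than 10 m,
    otherwise a reward that grows as the heading approaches the goal direction."""
    if min_asteroid_distance_m < 10:
        return -10000
    return 180 - abs(angle_to_goal_deg)

def actuator_agent_reward(achieved_angle_deg, desired_angle_deg):
    """Second agent: penalize any gap between the achieved and the desired
    delta heading angles."""
    return -abs(achieved_angle_deg - desired_angle_deg)
```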
  • Slide 135
  • Asteroids Problem (training)
  • Slide 136
  • spacecraft
  • Slide 137
  • Asteroids Problem (issues) S Radar angles to nearby asteroids [0,7] angle to goal [0,7] distances to nearby asteroids [0,390624] [0,24999999] desired delta heading angle [0,20] propulsion system propulsion signal [0,100] [0,20]
  • Slide 138
  • Asteroids Problem (issues) 1) Master L.A. took much longer to converge.
  • Slide 139
  • Asteroids Problem (issues) 2) Trained system not as good as an experienced human driver.
  • Slide 140
  • Asteroids Problem (issues) 3) Slave L.A. should probably react differently to a given command depending on the current angular velocity.
  • Slide 141
  • Asteroids Problem (possible solutions) 1) Simplify the Master Agent.
  • Slide 142
  • Asteroids Problem (possible solutions) 2) Use more precise angles in order to characterize a nearby asteroid.
  • Slide 143
  • Asteroids Problem (possible solutions) 3) Allow the slave agent to know the current angular velocity before selecting the appropriate steering signal.
  • Slide 144
  • Asteroids Problem (step 1) Radars are not very precise at predicting the distance to a target. [figure: radar signal strength]
  • Slide 145
  • Asteroids Problem (step 1)
  • Slide 146
  • Asteroids Problem (step 2) New factor: 1) angular velocity of the spacecraft around the azimuth.
  • Slide 147
  • Asteroids Problem (step 3) Angular velocity around azimuth Sensor
  • Slide 148
  • Asteroids Problem (step 3) Angular velocity around azimuth Sensor
  • Slide 149
  • Asteroids Problem (step 3) One learning agent receives information about the environment from sensors S1, S2, and S3 and generates the desired delta heading angle signal. A second learning agent receives the desired delta heading angle signal and the current angular velocity from S4, and then instructs actuator A1 what to do (sanity point S on the desired delta heading angle signal).
  • Slide 150
  • Asteroids Problem (step 4) One learning agent receives information about the environment from sensor S1 and generates the distance-to-closest-asteroids signal. A second learning agent receives the position of the closest asteroids and the current heading from sensor S3 and generates the desired delta heading angle signal. A third learning agent receives the desired delta heading angle signal and the current angular velocity and then instructs actuator A1 what to do (sanity points S on the position-of-closest-asteroids and desired-delta-heading-angle signals).
  • Slide 151
  • Asteroids Problem (step 4) Laser Verification
  • Slide 152
  • Asteroids Problem (step 5) 1) Slave Sensor Learning Agent: radar signal → distance to closest asteroids. 2) Master Learning Agent: closest asteroids' positions and angle to goal → desired delta heading angle. 3) Slave Actuator Learning Agent: desired delta heading angle → steering signal.
  • Slide 153
  • Asteroids Problem (step 6) radar desired delta heading angle S position of closest asteroids angle to goal [0,7] [0,20] propulsion system propulsion signal [0,100] [0,20] S gyroscope solar wind detector radar
  • Slide 154
  • Asteroids Problem (step 6)
  • Slide 155
  • Signal Strength
  • Slide 156
  • Asteroids Problem (step 6) Signal Strength
  • Slide 157
  • Asteroids Problem (step 6) Signal Strength
  • Slide 158
  • Asteroids Problem (step 6) Signal Strength
  • Slide 159
  • Asteroids Problem (step 6) Signal Strength
  • Slide 160
  • Asteroids Problem (step 6) Signal Strength
  • Slide 161
  • Asteroids Problem (step 6) distance signal [0,4] and angle signal [0,4]
        signal | distance (m)
        0 | [0, 19)
        1 | [19, 39)
        2 | [39, 59)
        3 | [59, 79)
        4 | [79, ∞)
        signal | angle (°)
        0 | [0, 2)
        1 | [2, 4)
        2 | [4, 6)
        3 | [6, 8)
        4 | [8, 10)
  • Slide 162
  • Asteroids Problem (step 6) [figure: radar signal strength with distance bins labeled, e.g. [19, 39) → 1, [39, 59) → 2, [79, ∞) → 4]
  • Slide 163
  • Asteroids Problem (step 6) [figure: radar signal strength with angle bins labeled, e.g. [2, 4) → 1, [4, 6) → 2, [8, 10) → 4]
  • Slide 164
  • Asteroids Problem (step 6) radar desired delta heading angle S Position of closest asteroids angle to goal [0,7] [0,20] propulsion system propulsion signal [0,100] [0,20] S gyroscope Solar wind detector radar [0,15624] [0,124] [0,124999] distance to closest asteroids angle to closest asteroids
  • Slide 165
  • Asteroids Problem (step 6) Signal Strength
  • Slide 166
  • Asteroids Problem (step 6)
        signal | amplitude
        0 | [0, 0.01)
        1 | [0.01, 0.02)
        ...
        497 | [4.97, 4.98)
        498 | [4.98, 4.99)
        499 | [4.99, ∞)
  • Slide 167
  • Asteroids Problem (step 6) radar desired delta heading angle S Position of closest asteroids angle to goal [0,7] [0,20] propulsion system propulsion signal [0,100] [0,20] S gyroscope Solar wind detector radar [0,15624] [0,124] [0,499] [0,124999] distance to closest asteroids angle to closest asteroids
  • Slide 168
  • Asteroids Problem (step 6) angular speed signal [0,8]
        signal | angular speed (°/s)
        0 | (-∞, -7]
        1 | (-7, -5]
        2 | (-5, -3]
        3 | (-3, -1]
        4 | (-1, 1)
        5 | [1, 3)
        6 | [3, 5)
        7 | [5, 7)
        8 | [7, ∞)
  • Slide 169
  • Asteroids Problem (step 6) radar desired delta heading angle S position of closest asteroids angle to goal [0,7] [0,20] propulsion system propulsion signal [0,100] [0,20] S gyroscope solar wind detector radar [0,15624] [0,124] [0,124999] [0,8] [0,188]
  • Slide 170
  • Asteroids Problem (step 7) radar desired delta heading angle S position of closest asteroids angle to goal [0,7] [0,20] propulsion system propulsion signal [0,100] [0,20] S gyroscope solar wind detector radar [0,15624] [0,124] [0,124999] [0,8] [0,188]
  • Slide 171
  • Asteroids Problem (step 8) radar desired delta heading angle S position of closest asteroids angle to goal [0,7] [0,20] propulsion system propulsion signal [0,100] [0,20] S gyroscope solar wind detector radar [0,15624] [0,124] [0,124999] [0,8] [0,188]
  • Slide 172
  • Asteroids Problem (step 8) FIRST AGENT: radar signal [0,499] → distances to closest asteroids [0,124].
  • Slide 173
  • Asteroids Problem (step 8) SECOND AGENT: position of closest asteroids [0,124999] → desired delta heading angle [0,20].
  • Slide 174
  • Asteroids Problem (step 8) THIRD AGENT: desired delta heading angle and current angular velocity [0,188] → propulsion signal [0,100].
  • Slide 175
  • Asteroids Problem (step 8) FIRST AGENT (radar signal → distances to closest asteroids [0,124]). REWARD FUNCTION: for each distance, reward = -10 * deltaDistance, where deltaDistance = |distance - laserDistance|.
  • Slide 176
  • Asteroids Problem (step 8) SECOND AGENT (position of closest asteroids [0,124999] → desired delta heading angle [0,20]). REWARD FUNCTION: reward = -10000 if the spacecraft gets too close to an asteroid; otherwise reward = 180 - |angleToGoal|.
  • Slide 177
  • Asteroids Problem (step 8) THIRD AGENT (desired delta heading angle [0,209] → propulsion signal [0,100]). REWARD FUNCTION: reward = -|achievedAngle - desiredAngle| (a sketch of the new laser-verified sensor-agent reward follows below).
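The reward that is new in this redesign is the slave sensor agent's laser-verified distance reward from the first-agent slide above; a minimal sketch in Python, with illustrative function and variable names.

```python
def sensor_agent_reward(estimated_distances_m, laser_distances_m):
    """Slave sensor agent: for each reported distance, penalize the gap between
    the radar-based estimate and the laser-verified distance
    (reward = -10 * |distance - laserDistance| per asteroid)."""
    return sum(-10 * abs(est - laser)
               for est, laser in zip(estimated_distances_m, laser_distances_m))
```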
  • Slide 178
  • Asteroids Problem radar desired delta heading angle S position of closest asteroids angle to goal [0,7] [0,20] propulsion system propulsion signal [0,100] [0,20] S gyroscope solar wind detector radar [0,15624] [0,124] [0,124999] [0,8] [0,188]
  • Slide 179