Top level learning
Pass selection using TPOT-RL
DT receiver choice function
- The DT is trained off-line in an artificial situation
- The DT is used in a heuristic, hand-coded function that limits the potential receivers to those that are at least as close to the opponent's goal as the passer
- The passer always passes to the potential receiver with the highest confidence of success, i.e. the max over (passer, receiver) pairs
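A minimal sketch of this heuristic, assuming a hypothetical `dt_confidence` stand-in for the learned DT; the `Player` type, field coordinates, and the distance-based confidence proxy are illustrative assumptions, not the original implementation:

```python
from dataclasses import dataclass

@dataclass
class Player:
    x: float  # field x-coordinate, increasing toward the opponent goal
    y: float

def dt_confidence(passer: Player, receiver: Player) -> float:
    """Stand-in for the off-line trained DT: predicted probability that a
    pass from passer to receiver succeeds. A real implementation would
    query the learned tree; a distance-based proxy keeps the sketch runnable."""
    dist = ((passer.x - receiver.x) ** 2 + (passer.y - receiver.y) ** 2) ** 0.5
    return max(0.0, 1.0 - dist / 50.0)

def choose_receiver(passer, teammates, opp_goal_x=52.5):
    # Heuristic filter: keep only receivers at least as close to the
    # opponent's goal as the passer.
    candidates = [r for r in teammates
                  if abs(opp_goal_x - r.x) <= abs(opp_goal_x - passer.x)]
    # Always pass to the candidate with the highest predicted success.
    return max(candidates, key=lambda r: dt_confidence(passer, r), default=None)
```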
Requirements in "reality"
- The best pass may be to a receiver farther away from the goal than the passer
- The receiver that is most likely to successfully receive the pass may not be the one that will subsequently act most favorably for the team
backward pass situation
Pass Selection - a team behavior
- Learn how to act strategically as part of a team
- Requires understanding of the long-term effects of local decisions
- Given the behaviors and abilities of teammates and opponents
- Measured by the team's long-term success in a real game
- -> must be trained on-line against an opponent
ML algorithm characteristics for pass selection
- On-line
- Capable of dealing with a large state space despite limited training
- Capable of learning based on long-term, delayed reward
- Capable of dealing with shifting concepts
- Works in a team-partitioned scenario
- Capable of dealing with opaque transitions
TPOT-RL succeeds by:
- Partitioning the value function among multiple agents
- Training agents simultaneously with a gradually decreasing exploration rate
- Using action-dependent features to aggressively generalize the state space
- Gathering long-term, discounted reward directly from the environment
TPOT-RL: policy mapping (S -> A)
- State generalization
- Value function learning
- Action selection
State generalization I
- Mapping the state space to a feature vector: f : S -> V
- Using an action-dependent feature function: e : S x A -> U
- Partitioning the state space among agents: P : S -> M
State generalization II
- |M| >= m, where m is the number of agents in the team
- A = {a0, ..., an-1}
- f(s) = <e(s, a0), ..., e(s, an-1), P(s)>
- V = U^|A| x M
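A minimal sketch of this mapping, assuming `e` and `P` are supplied as callables; the names are illustrative:

```python
def f(s, actions, e, P):
    """Map a full state s to the coarse feature vector
    f(s) = <e(s, a0), ..., e(s, an-1), P(s)> in V = U^|A| x M."""
    return tuple(e(s, a) for a in actions) + (P(s),)
```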
Value function learning I
- Value function Q(f(s), ai)
- Q : V x A -> R (the reals)
- Q depends on e(s, ai) and is independent of e(s, aj) for j != i
- The Q-table has |U| * |M| * |A| entries
Value function learning II
- f(s) = v
- Q(v, a) = Q(v, a) + α * (r - Q(v, a))
- r is derived from observable environmental characteristics
- Reward function R : S^tlim -> R (the reals)
- The range of R is [-Qmax, Qmax]
- Keep track of the action taken, ai, and the feature vector v at that time
Action selection
- Exploration vs. exploitation
- Reduce the number of free variables with an action filter W ⊆ U: if e(s, a) ∉ W, then a should not be a potential action in s
- B(s) = {a ∈ A | e(s, a) ∈ W}
- What if B(s) = {} (possible when W ⊂ U)?
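A minimal sketch of the filter, assuming feature values are plain strings and W = {Success} as in the soccer instantiation below; how to handle an empty B(s) is left to the caller, as on the slide:

```python
def filtered_actions(s, actions, e, W=frozenset({"Success"})):
    """B(s) = {a in A | e(s, a) in W}. May be empty when W is a proper
    subset of U; the caller then needs a fallback (e.g. the full set A)."""
    return [a for a in actions if e(s, a) in W]
```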
TPOT-RL applied to simulated robotic soccer
- 8 possible actions in A (see action space)
- Extend the definition of … (Section 6)
- The input for L3 is the DT from L2, used to define e
action space
State generalization using a learned feature I
- M = the team's set of positions (|M| = 11)
- P(s) = the player's current position
- Define e using the DT (C = 0.734)
- W = {Success}
State generalization using a learned feature II
- |U| = 2
- V = U^8 x {PlayerPositions}
- |V| = |U|^|A| * |M| = 2^8 * 11
- Total number of Q-values: |U| * |M| * |A| = 2 * 11 * 8 = 176
- With action filtering (|W| = 1): each agent learns |W| * |A| = 8 Q-values
- About 10 training examples per 10-minute game
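The quoted sizes, worked out explicitly (plain arithmetic, no assumptions beyond the numbers above):

```python
U, A, M = 2, 8, 11
print(U ** A * M)  # |V| = |U|^|A| * |M| = 2^8 * 11 = 2816 distinct feature vectors
print(U * M * A)   # total Q-values: |U| * |M| * |A| = 176
W = 1              # W = {Success}, so |W| = 1
print(W * A)       # per-agent Q-values after action filtering: 8
```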
Value function learning via intermediate reinforcement I
- Rg: if a goal is scored at time t (t <= tlim), r = Qmax / t
- Ri: note the time t and ball position xt; 3 cases fix the reward:
  1. The ball goes out of bounds at t + t0 (t0 < tlim)
  2. The ball returns to the agent at t + tr (tr < tlim)
  3. The ball is still in bounds at t + tlim
Ri: Case 1
- The reward r is based on a value r0 (see the reward function below)
- tlim = 30 seconds (300 sim. cycles)
- Qmax = 100, … = 10
reward function (figure)
Ri: Cases 2 & 3
- r is based on the average x-position of the ball
- xog = x-coordinate of the opponent's goal
- xlg = x-coordinate of the learner's goal
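A schematic sketch of Ri, assuming RoboCup-style field coordinates (x in [-52.5, 52.5]) and a simple normalization for cases 2 and 3; the exact case-1 values r0 and the scalings in the paper come from the reward-function figure and are not reproduced here:

```python
Q_MAX = 100.0
T_LIM = 300  # 30 seconds = 300 simulator cycles

def reward_goal(t):
    # Rg: a goal is scored t cycles after the action (t <= T_LIM).
    return Q_MAX / t

def reward_intermediate(case, x_t, ball_xs, r0=None, xog=52.5, xlg=-52.5):
    """Ri after an action taken with the ball at x-position x_t.
    case 1: ball went out of bounds -> reward based on r0 (from the figure)
    cases 2/3: ball returned / still in bounds -> reward based on the
               average x-position of the ball since the action."""
    if case == 1:
        return r0
    x_avg = sum(ball_xs) / len(ball_xs)
    # Assumed normalization: positive if the ball moved toward the
    # opponent goal (xog) on average, negative toward the learner's
    # goal (xlg), bounded by Q_MAX.
    return Q_MAX * (x_avg - x_t) / (xog - xlg)
```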
Value function learning via intermediate reinforcement II
After taking ai and receiving r, update Q:
Q(e(s, ai), ai) = (1 - α) * Q(e(s, ai), ai) + α * r
α = 0.02
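The same update as straight-line code, with the Q-table kept as a dict keyed by (feature value, action); the keying scheme is an assumption of this sketch:

```python
ALPHA = 0.02  # learning rate from the slide

def update_q(q_table, u, a, r):
    """Q(e(s,ai), ai) <- (1 - alpha) * Q(e(s,ai), ai) + alpha * r,
    where u = e(s, ai) and r is the observed reward."""
    old = q_table.get((u, a), 0.0)
    q_table[(u, a)] = (1.0 - ALPHA) * old + ALPHA * r
```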
Action selection for multiagent training
- Multiple agents are concurrently learning -> the domain is non-stationary
- To deal with this:
  - Each agent stays in the same state partition throughout training
  - The exploration rate is very high at first, then gradually decreases
State partitioning
- Distribute training into |M| partitions, each with a lookup table of size |A| * |U|
- After training, each agent can be given the trained policy for all partitions
Exploration rate
- Early exploitation runs the risk of ignoring the best possible actions
- When in state s, choose:
  - the action with the highest Q-value, with probability p (ai such that for all j, Q(f(s), ai) >= Q(f(s), aj))
  - a random action, with probability (1 - p)
- p increases gradually from 0 to 0.99
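A sketch of this selection rule, assuming a dict-backed Q-table keyed by (feature vector, action):

```python
import random

def select_action(v, actions, q, p):
    """Exploit with probability p, otherwise act randomly."""
    if random.random() < p:
        # Exploit: ai such that Q(f(s), ai) >= Q(f(s), aj) for all j.
        return max(actions, key=lambda a: q.get((v, a), 0.0))
    return random.choice(actions)  # explore
```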
Results I
- Agents start out acting randomly with empty Q-tables: for all v ∈ V, a ∈ A, Q(v, a) = 0
- The probability of acting randomly decreases linearly over periods of 40 games: to 0.5 in game 40, to 0.1 in game 80, and to 0.01 in game 120 (see the schedule sketch below)
- Learning agents use Ri
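A sketch of this annealing schedule, assuming the probability starts at 1.0 in game 0 (the slides say agents start out acting randomly) and interpolates linearly between the quoted anchor points:

```python
def random_action_prob(game):
    """Probability of acting randomly: 1.0 at game 0, falling linearly
    within each 40-game period to 0.5 (game 40), 0.1 (game 80), and
    0.01 (game 120 onward)."""
    anchors = [(0, 1.0), (40, 0.5), (80, 0.1), (120, 0.01)]
    for (g0, p0), (g1, p1) in zip(anchors, anchors[1:]):
        if game <= g1:
            return p0 + (p1 - p0) * (game - g0) / (g1 - g0)
    return 0.01  # after game 120
```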
Results II
Results II Statistics
- 160 10-minute games, |U| = 1
- Each agent gets about 1490 action-reinforcement pairs -> reinforcement about 9.3 times per game
- Each action was tried about 186.3 times -> each action only about once per game
Results III
Results IV
Results IV Statistics
- Action predicted to succeed vs. selected: one of the 3 of 8 "attack" actions (37.5% of the action space) was selected 6437 / 9967 = 64.6% of the time
- Action filtering: 39.6% of action options were filtered out
- 10400 action opportunities with B(s) ≠ {}
Results V
Domain characteristics for TPOT-RL:
- There are multiple agents organized in a team.
- There are opaque state transitions.
- There are too many states and/or not enough training examples for traditional RL.
- The target concept is non-stationary.
- There is long-range reward available.
- There are action-dependent features available.
Examples of such domains:
- Simulated robotic soccer
- Network packet routing
- Information networks
- Distributed logistics