Assignment 1
} Highest original score: 92.25, so everyone got a 7.75 bonus point
} How well is everyone doing?
} 7: 20%
} 6: 10%
} 5: 20%
} 4: 23%
} A couple of submissions originally had no report mark (fixed)
} Bulk download from turnitin does not download these submissions (it seems that if the author's name does not start with a letter, the submission is excluded)
} A couple of submissions had issues with the demo mark (fixed)
} 1 group mark is still pending due to a group problem
Assignment 2
} Support code out yesterday
} Amendment for one of the inputs; it'll make computing the transition easier
} Group registration was due last Tue! If you still want to work in a group, please register on the group registration website ASAP. We'll leave it open until this Monday morning.
} Help sessions during swotvac
} Tue 11am-1pm & Thu 11am-1pm, usual tutorial room
} Depending on how students go, we may have 1-2 additional help sessions during week 1 of the exam period (if we do, we'll announce it on Piazza)
} Questions?
COMP3702/7702 Artificial Intelligence
Lecture 13: Introduction to Machine Learning and Reinforcement Learning
Hanna Kurniawati
Today
} What is machine learning?
} Where is it used?
} Types of machine learning algorithms:
} Supervised learning
} Unsupervised learning
} Reinforcement learning
Reinforcement Learning
} What is Reinforcement Learning?
} Methods for solving it
More general approaches for solving RL
} Data from interacting with the world: <s, a, r, s'>
} Model-based vs model-free: What's being learned?
} Passive vs Active: How is the data generated?

              Passive    Active
Model-based      ✔
Model-free
Model-based, Active
} We need a way to decide which data to use
} Classic: Interact with the world directly
} Decide the action we use to interact with the world, so as to balance gaining information and reaching the goal
} Nowadays: Can perform the trials in a high-fidelity simulator
} Decide the action we use to interact with the simulated world (of course, the hope is that the simulation is close to reality), so as to balance gaining information and reaching the goal
} Need to consider how well the learned behaviour transfers from the simulator to the real world
Bayesian Reinforcement Learning
} Bayesian view:
} The parameters (T & R) we want to estimate are represented as random variables
} Start with a prior over models
} Compute the posterior based on data
} Quite useful when the agent actively gathers data
} Can decide how to balance exploration & exploitation, or how to improve the model & solve the problem optimally
} Often represented as Partially Observable Markov Decision Processes (POMDPs)
Bayesian Reinforcement Learning
} The problem of solving an MDP with unknown T & R can be represented as a POMDP whose state includes the partially observed MDP model (a toy sketch follows)
} POMDP model:
} S: MDP states × T × R
} A: MDP actions
} T(s, a, s'): The transition, assuming the MDP model is as described by POMDP state s
} Ω: The resulting next state and reward of the MDP
} Z(s, a, o): Perceived next state & reward, assuming the MDP model is as described by POMDP state s
} R(s, a): The reward, assuming the MDP model is as described by POMDP state s
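As a toy illustration of this Bayesian view (an assumption for illustration, not code from the lecture), the sketch below keeps a Dirichlet prior over the next-state distribution for each (s, a) pair and updates the posterior from observed transitions; the 3-state, 2-action MDP is made up.

import numpy as np

N_STATES, N_ACTIONS = 3, 2

# Dirichlet prior over T(s, a, .): one pseudo-count per next state (uniform prior).
counts = np.ones((N_STATES, N_ACTIONS, N_STATES))

def update_posterior(s, a, s_next):
    # Dirichlet posterior update: the counts just accumulate observations.
    counts[s, a, s_next] += 1

def expected_T(s, a):
    # Posterior mean estimate of the transition distribution T(s, a, .).
    return counts[s, a] / counts[s, a].sum()

# After observing (s=0, a=1) -> s'=2 a few times, the posterior mean
# shifts towards state 2.
for _ in range(5):
    update_posterior(0, 1, 2)
print(expected_T(0, 1))  # [0.125 0.125 0.75 ]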
Bayesian Reinforcement Learning
} The optimal policy of the POMDP is optimal exploration vs exploitation
} It will try to balance building the most accurate model & working directly towards achieving the goal
} It will make the MDP agent receive the maximum reward given the initially unknown T & R
} Building the best model is just an intermediate step, not the end goal!
More general approaches for solving RL
} Data from interacting with the world: <s, a, r, s'>
} Model-based vs model-free: What's being learned?
} Passive vs Active: How is the data generated?

              Passive    Active
Model-based      ✔          ✔
Model-free
Model-free
} Two flavors:
} Learn the value functions / Q-value functions, and then compute the policy
} Learn the policy directly
} First, we'll see learning (estimating) the value / Q-value functions
} Monte Carlo
} Temporal Difference
Monte Carlo
} Goal: Given a policy, learn the value of the policy when T & R are unknown
} Assumption: Episodic MDP
} Each episode (i.e., each run) is guaranteed to terminate within a finite amount of time
} Loop over (see the sketch below):
} Generate an episode
} Compute the total discounted reward for the episode
} Update the value
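A minimal Python sketch of this loop, assuming a hypothetical episodic environment env (reset() returns a state; step(a) returns the next state, the reward, and a done flag) and a fixed policy pi mapping states to actions:

from collections import defaultdict

def mc_policy_evaluation(env, pi, n_episodes=1000, gamma=0.9):
    V = defaultdict(float)        # value estimates
    n_visits = defaultdict(int)   # visit counts for running averages
    for _ in range(n_episodes):
        # Generate an episode under pi (episodic MDP: guaranteed to terminate).
        s, episode, done = env.reset(), [], False
        while not done:
            s_next, r, done = env.step(pi[s])
            episode.append((s, r))
            s = s_next
        # Compute the total discounted reward backwards through the episode,
        # then update each state's value with a running average (every-visit).
        G = 0.0
        for s, r in reversed(episode):
            G = r + gamma * G
            n_visits[s] += 1
            V[s] += (G - V[s]) / n_visits[s]
    return V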
Monte Carlo update
} Suppose we have the following episode:

[Figure: an episode s_1 -π(s_1)-> s_2 -π(s_2)-> s_3 -> … -π(s_{n-1})-> s_n, receiving rewards r_1, r_2, …, r_n]

} For each $s_i$, the return is $R(s_i) = r_i + \gamma r_{i+1} + \dots + \gamma^{n-i} r_n$
} Value update: $V(s_i) = V(s_i) + \alpha \left( R(s_i) - V(s_i) \right)$
Monte Carlo
} First visit (see the sketch below):
} Only update the value of a state the first time it is visited in the sampled episode
} Every visit:
} Update the value of a state whenever it is visited (regardless of the time step)
} Converges to the true value by the law of large numbers
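For concreteness, a small sketch of the change needed for the first-visit variant (the helper and its input format are illustrative assumptions): relative to the every-visit sketch above, a state is updated only at its first occurrence in the episode.

def first_visit_filter(episode_returns):
    # episode_returns: list of (state, G) pairs in time order.
    seen = set()
    for s, G in episode_returns:
        if s not in seen:       # skip later visits to the same state
            seen.add(s)
            yield s, G          # update V(s) towards G only for this visit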
Law of Large Numbers
} Weak law of large numbers
} Strong law of large numbers
} What's the difference?
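For reference, the standard statements (general probability facts, not taken from the slides): the weak law asserts convergence of the sample mean to the true mean in probability, the strong law almost-sure convergence.

% Weak law: convergence in probability of the sample mean to the mean mu.
\lim_{n \to \infty} \Pr\!\left( \left| \tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i - \mu \right| > \varepsilon \right) = 0 \quad \text{for every } \varepsilon > 0

% Strong law: almost-sure convergence of the sample mean to the mean mu.
\Pr\!\left( \lim_{n \to \infty} \tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i = \mu \right) = 1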
Temporal Difference (TD) Learning
} One of the most famous Reinforcement Learning approaches
} Idea: Iteratively reduce the difference between the value or Q-value estimates (sketched below)

$V(s_t) = V(s_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right]$
$Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$

where $\alpha$ is a constant in [0, 1] representing the learning rate. In some implementations, it decreases as the amount of data increases, e.g., set to 1/(#visits + 1).
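A minimal sketch of TD(0) policy evaluation built on this update; env and pi are the same hypothetical stand-ins as in the Monte Carlo sketch above:

from collections import defaultdict

def td0_policy_evaluation(env, pi, n_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(pi[s])
            # Update after every step, bootstrapping from the current
            # estimate of the next state's value.
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V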
Temporal Difference (TD) Learning
} Given a policy π, suppose the reinforcement learning agent traverses the following episode:

[Figure: an episode s_1 -π(s_1)-> s_2 -π(s_2)-> … -π(s_{n-1})-> s_n, receiving rewards r_1, r_2, …, r_n]

} Value update after each step: $V(s_i) = V(s_i) + \alpha \left[ r_i + \gamma V(s_{i+1}) - V(s_i) \right]$
Monte Carlo vs Temporal Difference Learning
} The Monte Carlo approach updates the value after the episode is finished
} Temporal difference updates the value after each step
Monte Carlo vs Temporal Difference Learning
} Suppose you want to predict the time you need to get to your home
} Monte Carlo: record the time on day i; the next day, update the prediction based on the recorded time on day i

[Figure: predicted vs actual travel time under the Monte Carlo update, with α = 1]
Monte Carlo vs Temporal Difference Learning
} Suppose you want to predict the time you need to get to your home
} Temporal Difference: record the time on day i and update the prediction as the episode progresses

[Figure: predicted vs actual travel time under the Temporal Difference update, with α = 1]
TD Learning - Variants
} Q-learning
} SARSA (State-Action-Reward-State-Action)
} General: TD(λ)
Q-Learning: Off-Policy TD Control
} Off-policy: Update the Q-value based on the (estimated) best next action, even though it's not necessarily the action performed (sketched below)
} The policy being followed is not the same as the policy being evaluated
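A minimal sketch of the Q-learning step, with an ε-greedy behaviour policy as an assumed (illustrative) choice; Q is a dictionary mapping (state, action) pairs to values:

import random
from collections import defaultdict

Q = defaultdict(float)  # unseen (s, a) pairs start at value 0

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy target: max over next actions, regardless of which action
    # the behaviour policy actually takes at s_next.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps=0.1):
    # Behaviour policy: explore with probability eps, otherwise exploit.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])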
SARSA: On-Policy TD Control
} Consider the actual action that the agent will take at the next state
} Data is (s, a, r, s', a')
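The corresponding SARSA step (a sketch): the only difference from Q-learning above is that the target uses the action a' the policy actually selects at s', hence the (s, a, r, s', a') data tuple.

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy target: bootstrap from the Q-value of the action that
    # will actually be taken at the next state.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])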
Example: Q-learning vs SARSA
} Mouse moving next to a cliff
} Blue: Mouse, Green: Cheese, Red: Cliff

[Figure: two grid-world panels comparing the behaviour learned by Q-learning and by SARSA]
TD Learning - Variants
✔ Q-learning
✔ SARSA (State-Action-Reward-State-Action)
} General: TD(λ)
A more general TD Learning
} We can actually look ahead more steps
n-step TD method for Value Estimation
} Given a policy π, suppose the reinforcement learning agent traverses the following episode:

[Figure: an episode s_1 -π(s_1)-> s_2 -π(s_2)-> … -π(s_{n-1})-> s_n, receiving rewards r_1, r_2, …, r_n]

} For each $s_i$, the n-step return is $R^{(n)}(s_i) = r_i + \gamma r_{i+1} + \dots + \gamma^n r_{i+n}$
} Value update (sketched below):

$V(s_t) = V(s_t) + \alpha \left[ R^{(n)}(s_t) + \gamma^{n+1} V(s_{t+n+1}) - V(s_t) \right]$
$V(s_t) = V(s_t) + \alpha \left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^n r_{t+n} + \gamma^{n+1} V(s_{t+n+1}) - V(s_t) \right]$
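A small sketch of this n-step update for a single state, given one recorded episode as parallel lists of states and rewards (hypothetical inputs; assumes steps i..i+n exist in the episode):

def n_step_update(V, states, rewards, i, n, alpha=0.1, gamma=0.9):
    # n-step return: discounted rewards from step i to i+n ...
    G = sum(gamma**k * rewards[i + k] for k in range(n + 1))
    # ... plus a bootstrap from V if the episode continues past step i+n.
    if i + n + 1 < len(states):
        G += gamma**(n + 1) * V[states[i + n + 1]]
    V[states[i]] += alpha * (G - V[states[i]])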
TD(λ) – Update Rule
} Weighted sum of the total reward over sequences of different lengths, with weights $(1-\lambda)$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2$, …, $(1-\lambda)\lambda^{n-1}$, …
TD(λ) – Update Rule
} Let

$R^{\lambda}(s_t) = (1 - \lambda) \sum_{n=0}^{\infty} \lambda^{n} R^{(n)}(s_t)$

} The update rule of TD(λ) is

$V(s_t) = V(s_t) + \alpha \left[ R^{\lambda}(s_t) - V(s_t) \right]$
TD(λ) – Implementation
} Use the eligibility trace as the weight of a visited state
} The eligibility trace of a state at time t, denoted $e_t(s)$, represents the eligibility of the state for undergoing learning changes
} Defined as:

$e_t(s) = \begin{cases} \gamma \lambda \, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma \lambda \, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

} When a state has been visited recently, its eligibility is high; as time progresses, its eligibility decreases
TD(𝜆) – Algorithm
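A minimal sketch of tabular TD(λ) with accumulating eligibility traces, consistent with the trace definition above (not necessarily the exact pseudocode on the slide); env and pi are hypothetical stand-ins as before:

from collections import defaultdict

def td_lambda(env, pi, n_episodes=1000, alpha=0.1, gamma=0.9, lam=0.8):
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)   # eligibility traces, reset per episode
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(pi[s])
            delta = r + gamma * V[s_next] - V[s]   # one-step TD error
            e[s] += 1.0                            # bump the visited state's trace
            for state in list(e):
                # Every state is updated in proportion to its eligibility,
                # then the trace decays by gamma * lambda.
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam
            s = s_next
    return V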
Reinforcement Learning
✔ What is Reinforcement Learning
✔ Methods for Solving

              Passive                                Active
Model-based   Some supervised learning can be used   POMDP for Bayes RL
Model-free    Monte Carlo; Temporal Difference (TD) & its variants (Q-learning, SARSA); TD(λ)
COMP3702/7702: Artificial Intelligence
How to develop agents that can:
1. Make good decisions when information about the problem is accurate and abundant.
} Agent design problem, search in discrete space, search in continuous space with application to motion planning, logical representation, validity (model checking, theorem proving), satisfiability (DPLL, GSAT).
2. Make good decisions when information about the problem is inaccurate and limited.
} Worst case: And-Or tree, min-max tree, minimax algorithm, alpha-beta pruning
} Stochastic: Utility theory, MDP, value iteration, policy iteration, online solving, POMDP.
3. Learn and improve their decision-making capability over time.
} Reinforcement learning: Multi-armed bandit, Bayes-RL (POMDP), TD learning, Monte Carlo, Q-learning, SARSA
We didn't talk about ethics much, but …
} With knowledge comes responsibility
} AI (including machine learning) systems can be very biased!!!
} AI systems (at least at the moment) cannot and should not be used as a justification for discriminatory behavior
} At the very least, please make sure your users are aware of the possible bias
} An AI system, like any other system, is a tool. It can be used for good or for bad. Hopefully you'll use it for good :)
What's next?
} Machine Learning (COMP4702 / COMP7702)
} Research?
} Decision making under uncertainty, including its relation to machine learning
} But, this is my last semester at UQ
} I'll be moving to ANU in January, and so is my research

Thank you
Hope you learn a thing or two