MAKING COMPLEX DECISIONS

Outline

• MDPs (Markov Decision Processes)
  Sequential decision problems
  Value iteration & policy iteration

• POMDPs (partially observable MDPs)
  Decision-theoretic agents

• Decisions with multiple agents: game theory
  Mechanism design


Sequential decision problems

An example: the 4 x 3 grid world of Figure 17.1


Sequential decision problems

Game rules:

• 4 x 3 environment shown in Figure 17.1(a)

• Beginning in the start state

• The agent chooses an action at each time step

• The interaction ends in the goal states, marked +1 or -1

• Actions = {Up, Down, Left, Right}

• The environment is fully observable

• Terminal states have rewards +1 and -1, respectively

• All other states have a reward of -0.04


Sequential decision problems

• The particular model of stochastic motion that we adopt is illustrated in Figure 17.1(b).

• Each action achieves the intended effect with probability 0.8, but the rest of the time, the action moves the agent at right angles to the intended direction.

• Furthermore, if the agent bumps into a wall, it stays in the same square.
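A minimal sketch of this motion model in Python; the coordinate convention (column, row), the blocked square (2,2), and all names here are our own reading of Figure 17.1, not code from the source:

ACTIONS = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
PERPENDICULAR = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
                 'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}
WALL = {(2, 2)}                      # the blocked square of the 4 x 3 grid

def move(state, action):
    # Deterministic effect of an action; bumping into the wall or the
    # grid boundary leaves the agent in the same square.
    dx, dy = ACTIONS[action]
    x, y = state[0] + dx, state[1] + dy
    if (x, y) in WALL or not (1 <= x <= 4 and 1 <= y <= 3):
        return state
    return (x, y)

def transition_probs(state, action):
    # T(s, a, s'): the intended direction with probability 0.8, each of
    # the two right-angle directions with probability 0.1.
    probs = {}
    left, right = PERPENDICULAR[action]
    for a, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        s2 = move(state, a)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs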


Sequential decision problems

• Transition model: a specification of the outcome probabilities for each action in each possible state

• Environment history: a sequence of states

• Utility of an environment history: the sum of the rewards (positive or negative) received


Sequential decision problems

• Definition of MDP

Markov Decision Process: The specification of a sequential decision problem for a fully observable environment with a Markovian transition model and additive rewards

• An MDP is defined by:
  Initial state: S0
  Transition model: T(s, a, s')
  Reward function: R(s)
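For concreteness, the same definition as a small container; a sketch, with field names of our own choosing:

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]          # e.g. grid squares such as (1, 1)

@dataclass
class MDP:
    # The three defining elements named above, plus the pieces the
    # algorithms later in the chapter need (state list, actions, gamma).
    s0: State                                       # initial state S0
    states: List[State]                             # all states
    actions: Callable[[State], List[str]]           # actions in s ([] if terminal)
    T: Callable[[State, str], Dict[State, float]]   # transition model T(s, a, s')
    R: Callable[[State], float]                     # reward function R(s)
    gamma: float = 1.0                              # discount factor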


Sequential decision problems

• Policy (denoted by π): a solution that specifies what the agent should do in any state the agent might reach

• Optimal policy (denoted by π*): a policy that yields the highest expected utility
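In code, a policy is simply a lookup table from states to actions; a hypothetical fragment (the entries are illustrative, not the optimal policy of Figure 17.2):

# A policy maps every non-terminal state to an action.
policy = {(1, 1): 'Up', (1, 2): 'Up', (1, 3): 'Right', (2, 3): 'Right'}
print(policy[(1, 1)])   # the action to take in state (1,1)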


Sequential decision problems

• An optimal policy for the world of Figure 17.1


Sequential decision problems

• The balance of risk and reward changes depending on the value of R(s) for the nonterminal states

• Figure 17.2(b) shows optimal policies for four different ranges of R(s)


Sequential decision problems

• A finite horizon
(i) A finite horizon means that there is a fixed time N after which nothing matters; the game is over
(ii) The optimal policy for a finite horizon is nonstationary (the optimal action in a given state could change over time)
(iii) Complex


Sequential decision problems

• An infinite horizon
(i) An infinite horizon means that there is no fixed time N after which nothing matters
(ii) The optimal policy for an infinite horizon is stationary
(iii) Simpler


Sequential decision problems

• Calculating the utility of state sequences

(i) Additive rewards: the utility of a state sequence is
U_h([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...

(ii) Discounted rewards: the utility of a state sequence is
U_h([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ² R(s2) + ...

where the discount factor γ is a number between 0 and 1
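As a one-function sketch, the discounted utility of a (finite prefix of a) reward sequence:

def discounted_utility(rewards, gamma):
    # U_h = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... over a finite prefix.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three steps of living cost followed by the +1 goal, with gamma = 0.9:
# -0.04 - 0.04*0.9 - 0.04*0.81 + 1.0*0.729
u = discounted_utility([-0.04, -0.04, -0.04, 1.0], 0.9)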


Sequential decision problems

• Infinite horizons

• A policy that is guaranteed to reach a terminal state is called a proper policy; with proper policies, undiscounted rewards (γ = 1) can be used

• Another possibility is to compare infinite sequences in terms of the average reward obtained per time step


Sequential decision problems

• How to choose between policies?
The value of a policy is the expected sum of discounted rewards obtained, where the expectation is taken over all possible state sequences that could occur, given that the policy is executed.

• An optimal policy π* satisfies
π* = argmax_π E[ Σ_t γ^t R(s_t) | π ]


Value iteration

• The basic idea is to calculate the utility of each state and then use the state utilities to select an optimal action in each state.


Value iteration

• Utilities of states

The utility of a state s under policy π is the expected sum of discounted rewards obtained when π is executed starting in s:
U^π(s) = E[ Σ_t γ^t R(s_t) | π, s_0 = s ]
where s_t is the state the agent is in after executing π for t steps (note that s_t is a random variable)


Value iteration

• Figure 17.3 shows the utilities for the 4 x 3 world


Value iteration

• Choose the action that maximizes the expected utility of the subsequent state

• The utility of a state is given by the Bellman equation:
U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')


Value iteration

• Let us look at one of the Bellman equations for the 4 x 3 world. The equation for the state (1,1) is
U(1,1) = -0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),   (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }  (Right)


Value iteration

• The value iteration algorithm applies a Bellman update, which looks like this:
U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U_i(s')

• The VALUE-ITERATION algorithm is sketched below
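A sketch of value iteration under the MDP container assumed earlier, with γ < 1; the termination test uses the error bound ε(1 - γ)/γ given on a later slide:

def value_iteration(mdp, epsilon=0.001):
    # Returns a utility table within epsilon of the true utilities U.
    # Assumes mdp.gamma < 1 so the termination bound is well defined.
    U = {s: 0.0 for s in mdp.states}
    while True:
        U_next, delta = {}, 0.0
        for s in mdp.states:
            acts = mdp.actions(s)
            if acts:  # non-terminal state: apply the Bellman update
                best = max(sum(p * U[s2] for s2, p in mdp.T(s, a).items())
                           for a in acts)
                U_next[s] = mdp.R(s) + mdp.gamma * best
            else:     # terminal state: its utility is its reward
                U_next[s] = mdp.R(s)
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < epsilon * (1 - mdp.gamma) / mdp.gamma:
            return U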


Value iteration

• Starting with initial values of zero, the utilities evolve as shown in Figure 17.5(a)


Value iteration

• Two important properties of contractions:

(i) A contraction has only one fixed point; if there were two fixed points, they would not get closer together when the function was applied, so it would not be a contraction.

(ii)When the function is applied to any argument, the value must get closer to the fixed point (because the fixed point does not move), so repeated application of a contraction always reaches the fixed point in the limit.


Value iteration

• Let U_i denote the vector of utilities for all the states at the ith iteration. Then the Bellman update equation can be written as
U_{i+1} ← B U_i
where B denotes the Bellman update operator


Value iteration

• Use the max norm, which measures the length of a vector by the absolute value of its biggest component:
||U|| = max_s |U(s)|

• Let U_i and U_i' be any two utility vectors. Then we have
||B U_i - B U_i'|| ≤ γ ||U_i - U_i'||    (17.7)
that is, the Bellman update is a contraction by a factor of γ on the space of utility vectors


Value iteration

• In particular, we can replace U_i' in Equation (17.7) with the true utilities U, for which B U = U. Then we obtain the inequality
||B U_i - U|| ≤ γ ||U_i - U||

• Figure 17.5(b) shows how the number of iterations N varies with γ, for different values of the ratio ε/R_max


Value iteration

• From the contraction property (Equation (17.7)), it can be shown that if the update is small (i.e., no state's utility changes by much), then the error, compared with the true utility function, also is small. More precisely,
if ||U_{i+1} - U_i|| < ε(1 - γ)/γ then ||U_{i+1} - U|| < ε


Value iteration

• Policy loss
U^{π_i}(s) is the utility obtained if π_i is executed starting in s. The policy loss ||U^{π_i} - U|| is the most the agent can lose by executing π_i instead of the optimal policy π*


Value iteration

• The policy loss of π_i is connected to the error in U_i by the following inequality:
if ||U_i - U|| < ε then ||U^{π_i} - U|| < 2εγ/(1 - γ)


Policy iteration

• The policy iteration algorithm alternates the following two steps, beginning from some initial policy π0 (a sketch in code follows the two steps):

Policy evaluation: given a policy πi, calculate Ui = Uπi, the utility of each state if πi were to be executed.

Policy improvement: Calculate a new MEU policy πi+1, using one-step look-ahead based on Ui (as in Equation (17.4)).
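A sketch of this loop, assuming the MDP container from before and the exact policy_evaluation solver sketched on a later slide:

import random

def policy_iteration(mdp):
    # Alternate evaluation and improvement until the policy is stable.
    pi = {s: random.choice(mdp.actions(s))
          for s in mdp.states if mdp.actions(s)}
    while True:
        U = policy_evaluation(pi, mdp)   # sketched below
        unchanged = True
        for s in pi:
            # One-step look-ahead with the current utility estimates.
            best = max(mdp.actions(s),
                       key=lambda a: sum(p * U[s2]
                                         for s2, p in mdp.T(s, a).items()))
            if best != pi[s]:
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi, U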


Policy iteration

• Because the policy fixes the action in each state, the utilities satisfy the linear equations U_i(s) = R(s) + γ Σ_{s'} T(s, π_i(s), s') U_i(s'). For n states, we have n linear equations with n unknowns, which can be solved exactly in time O(n³) by standard linear algebra methods (see the sketch below). For large state spaces, O(n³) time might be prohibitive
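The exact solve as a sketch with numpy, treating the fixed-policy Bellman equations as the linear system (I - γ T_π) U = R:

import numpy as np

def policy_evaluation(pi, mdp):
    # Rows for terminal states (absent from pi) reduce to U(s) = R(s).
    states = list(mdp.states)
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    R = np.array([mdp.R(s) for s in states])
    for s in states:
        if s in pi:
            for s2, p in mdp.T(s, pi[s]).items():
                A[idx[s], idx[s2]] -= mdp.gamma * p
    U = np.linalg.solve(A, R)   # the O(n^3) standard linear algebra step
    return {s: U[idx[s]] for s in states}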

• Modified policy iteration

The simplified Bellman update for this process is
U_{i+1}(s) ← R(s) + γ Σ_{s'} T(s, π_i(s), s') U_i(s')
repeated some number of times to produce the next utility estimate


Policy iteration

• In fact, on each iteration, we can pick any subset of states and apply either kind of updating (policy improvement or simplified value iteration) to that subset. This very general algorithm is called asynchronous policy iteration.


Partially observable MDPs

• When the environment is only partially observable, MDPs turn into partially observable MDPs (POMDPs, pronounced "pom-dee-pees")


Partially observable MDPs

• An example of a POMDP


Partially observable MDPs

• Elements of a POMDP:
  the elements of an MDP (transition model, reward function)
  plus an observation model O(s, o), the probability of perceiving observation o in state s


Partially observable MDPs

• How do we calculate the belief state? Filtering gives
b'(s') = α O(s', o) Σ_s T(s, a, s') b(s)

• Decision cycle of a POMDP agent:
1. Given the current belief state b, execute the action a = π*(b).
2. Receive observation o.
3. Set the current belief state to FORWARD(b, a, o) and repeat (a sketch of FORWARD follows).
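A sketch of FORWARD under these definitions; the observation model is assumed to be available as a function O(s, o), and terminal states are assumed to stay put:

def forward(b, a, o, mdp, O):
    # b'(s') = alpha * O(s', o) * sum_s T(s, a, s') b(s)
    b_next = {s: 0.0 for s in mdp.states}
    for s, p_s in b.items():
        if p_s == 0.0:
            continue
        if mdp.actions(s):                 # non-terminal: apply the motion model
            for s2, p in mdp.T(s, a).items():
                b_next[s2] += p * p_s
        else:                              # terminal: probability mass stays
            b_next[s] += p_s
    for s in b_next:                       # weight by the observation likelihood
        b_next[s] *= O(s, o)
    alpha = 1.0 / sum(b_next.values())     # normalizing constant
    return {s: alpha * p for s, p in b_next.items()}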


Decision-theoretic Agents

• Basic elements of the approach to agent design:
  Dynamic Bayesian network
  Dynamic decision network (DDN)
  A filtering algorithm
  Making decisions

• A dynamic decision network is structured as follows:


Decision-theoretic Agents

• Part of the look-ahead solution of the DDN


Game Theory

• Where can game theory be used?
  Agent design
  Mechanism design

• Components of a game in game theory:
  Players
  Actions
  A payoff matrix


Game Theory

• Strategies of players:
  Pure strategy (deterministic policy)
  Mixed strategy (randomized policy)

• Strategy profile: an assignment of a strategy to each player

• Solution: a strategy profile in which each player adopts a rational strategy


Game Theory

Game theory describes rational behavior for agents in situations where multiple agents interact simultaneously. Solutions of games are Nash equilibria: strategy profiles in which no agent has an incentive to deviate from its specified strategy.
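To make the deviation test concrete, a sketch that enumerates the pure-strategy Nash equilibria of a two-player payoff matrix; the prisoner's-dilemma payoffs below are illustrative only:

# Payoffs indexed by (row_action, col_action) -> (row_payoff, col_payoff).
PAYOFF = {('testify', 'testify'): (-5, -5), ('testify', 'refuse'): (0, -10),
          ('refuse', 'testify'): (-10, 0), ('refuse', 'refuse'): (-1, -1)}
STRATEGIES = ['testify', 'refuse']

def pure_nash_equilibria(payoff, strategies):
    # A profile is a Nash equilibrium if neither player can gain by
    # unilaterally deviating from it.
    eq = []
    for r in strategies:
        for c in strategies:
            row_ok = all(payoff[(r2, c)][0] <= payoff[(r, c)][0]
                         for r2 in strategies)
            col_ok = all(payoff[(r, c2)][1] <= payoff[(r, c)][1]
                         for c2 in strategies)
            if row_ok and col_ok:
                eq.append((r, c))
    return eq

print(pure_nash_equilibria(PAYOFF, STRATEGIES))  # [('testify', 'testify')]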


Mechanism Design

Mechanism design can be used to set the rules by which agents will interact, in order to maximize some global utility through the operation of individually rational agents. Sometimes, mechanisms exist that achieve this goal without requiring each agent to consider the choices made by other agents.
