Transcript of Bernoulli 0704

  • Slide 1/31

    Q-Learning and Dynamic Treatment Regimes

    S.A. Murphy

    Univ. of Michigan

    IMS/Bernoulli: July, 2004

  • Slide 2/31

    Outline

    Dynamic Treatment Regimes

    Optimal Q-functions and Q-learning

    The Problem & Goal

    Finite Sample Bounds

    Outline of Proof

    Shortcomings and Open Problems

  • Slide 3/31

    Dynamic Treatment Regimes

    ---- Multi-stage decision problems: repeated decisions are made over time on each patient.

    ---- Used in the management of addictions, mental illnesses, HIV infection, and cancer.

  • Slide 4/31

    k Decisions

    Observations made prior to the t-th decision

    Action at the t-th decision

    Primary outcome
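    The slide's symbols did not survive extraction. A standard notation for this setup, used in the reconstructions below (bars denote histories), is

    \[
    (O_1, A_1, O_2, A_2, \ldots, O_k, A_k, O_{k+1}), \qquad \bar O_t = (O_1, \ldots, O_t), \quad \bar A_t = (A_1, \ldots, A_t),
    \]

    where O_t collects the observations made prior to the t-th decision, A_t is the action at the t-th decision, and the primary outcome Y is a known function of the full trajectory.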

  • Slide 5/31

    A dynamic treatment regime is a vector of decision rules, one per decision.

    If the regime is implemented, then each action is chosen by applying the corresponding decision rule to the history available at that decision.
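    In this notation (the slide's formula was not captured; this is an assumed reconstruction), a regime d = (d_1, ..., d_k) sets

    \[
    A_t = d_t(\bar O_t, \bar A_{t-1}), \qquad t = 1, \ldots, k.
    \]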

  • Slide 6/31

    Goal: Estimate the decision rules that maximize the mean primary outcome.

    Data: A data set of n finite horizon trajectories, each with randomized actions.

    The probabilities with which the actions are assigned are known randomization probabilities.
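    Written out in the same reconstructed notation, each trajectory is generated with actions randomized according to known probabilities p_t:

    \[
    P\big(A_t = a_t \mid \bar O_t, \bar A_{t-1}\big) = p_t\big(a_t \mid \bar O_t, \bar A_{t-1}\big), \qquad t = 1, \ldots, k.
    \]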

  • Slide 7/31

    Optimal Q-functions and Q-learning:

    Definition: E_d denotes expectation when the actions are chosen according to the regime d.
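    In particular, the mean outcome under a regime d can be written as its value; the symbol V below is introduced here for readability and was not captured from the slides:

    \[
    V(d) = E_d[\,Y\,].
    \]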

  • Slide 8/31

    Q-functions:

    The Q-functions for the optimal regime are given recursively, for t = k, k-1, ..., 1.
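    The recursion itself was lost in extraction; the standard form, consistent with the verbal definition above, is

    \[
    Q_k^{*}(\bar o_k, \bar a_k) = E\big[\,Y \mid \bar O_k = \bar o_k,\ \bar A_k = \bar a_k\,\big],
    \]
    \[
    Q_t^{*}(\bar o_t, \bar a_t) = E\big[\,\max_{a_{t+1}} Q_{t+1}^{*}(\bar O_{t+1}, \bar A_t, a_{t+1}) \mid \bar O_t = \bar o_t,\ \bar A_t = \bar a_t\,\big],
    \]

    so Q_t^* is the mean outcome obtained by taking action a_t now and acting optimally thereafter.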

  • Slide 9/31

    Q-functions:

    The optimal regime is given by pointwise maximization of the Q-functions.
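    In the same notation:

    \[
    d_t^{*}(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t^{*}(\bar o_t, \bar a_{t-1}, a_t), \qquad t = 1, \ldots, k.
    \]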

  • Slide 10/31

    Q-learning:

    Given a model for the Q-functions, minimize the empirical squared error over the model parameters, and set the fitted terminal Q-function to the minimizer.
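    With a working model Q_k(.; \theta_k) and E_n denoting the empirical average over the n trajectories (a reconstruction; the slide's objective was not captured), the terminal step is

    \[
    \hat\theta_k \in \arg\min_{\theta_k} E_n\Big[\big(Y - Q_k(\bar O_k, \bar A_k; \theta_k)\big)^2\Big], \qquad \hat Q_k = Q_k(\,\cdot\,; \hat\theta_k).
    \]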

  • Slide 11/31

    Q-learning:

    For each t = k-1, ..., 1, minimize the empirical squared error for stage t over the stage-t parameters, set the fitted stage-t Q-function to the minimizer, and so on.
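    In the same reconstructed notation, the stage-t step regresses the predicted best future value on the stage-t history and action:

    \[
    \hat\theta_t \in \arg\min_{\theta_t} E_n\Big[\big(\max_{a_{t+1}} \hat Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_t(\bar O_t, \bar A_t; \theta_t)\big)^2\Big], \qquad \hat Q_t = Q_t(\,\cdot\,; \hat\theta_t).
    \]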

  • Slide 12/31

    Q-Learning:

    The estimated regime is given by plugging the fitted Q-functions into the argmax rule.
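    In the same notation:

    \[
    \hat d_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \hat Q_t(\bar o_t, \bar a_{t-1}, a_t), \qquad t = 1, \ldots, k.
    \]

    To make the whole procedure concrete, here is a minimal sketch of batch Q-learning with linear working models, for k = 2 decisions and binary actions coded 0/1. All names (q_learning, phi1, phi2, and so on) are illustrative assumptions, not the paper's notation or software.

```python
import numpy as np

def lstsq_fit(Phi, y):
    """Minimize the empirical squared error E_n[(y - Phi @ theta)^2]."""
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

def q_learning(O1, A1, O2, A2, Y, phi1, phi2):
    """Two-stage batch Q-learning with linear working models.

    phi1(O1, a) and phi2(O1, A1, O2, a) build the feature matrices of
    the stage-1 and stage-2 Q-models; a is a vector of candidate actions.
    """
    # Stage 2 (terminal): regress the primary outcome on history + action.
    theta2 = lstsq_fit(phi2(O1, A1, O2, A2), Y)

    # Pseudo-outcome: fitted value of the best stage-2 action.
    zeros, ones = np.zeros_like(A2), np.ones_like(A2)
    V2 = np.maximum(phi2(O1, A1, O2, zeros) @ theta2,
                    phi2(O1, A1, O2, ones) @ theta2)

    # Stage 1: regress the pseudo-outcome on stage-1 history + action.
    theta1 = lstsq_fit(phi1(O1, A1), V2)
    return theta1, theta2

def stage1_rule(theta1, phi1, O1):
    """Estimated stage-1 decision rule: pick the argmax action."""
    n = len(O1)
    q1 = phi1(O1, np.ones(n)) @ theta1
    q0 = phi1(O1, np.zeros(n)) @ theta1
    return (q1 > q0).astype(int)
```

    For instance, with scalar observations, phi2 = lambda O1, A1, O2, a: np.column_stack([np.ones(len(a)), O1, A1, O2, a, a * O2]) gives an action-by-observation interaction model, and the whole procedure reduces to two ordinary least squares fits run backward in time.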

  • Slide 13/31

    The Problem & Goal:

    Most learning (e.g. estimation) methods utilize a model for all or parts of the multivariate distribution of the data. Such a model implicitly constrains the class of possible decision rules in the dynamic treatment regime: call this the constrained class of regimes.

    X is a vector with many components (high dimensional), thus the model is likely incorrect; view the model and the constrained class of regimes as approximation classes.

  • Slide 14/31

    Goal: Given a learning method and approximation classes, assess the ability of the learning method to produce the best decision rules in the class.

    Ideally, construct an upper bound for the shortfall of the estimated regime, where the estimator of the regime is written \hat d and E_d denotes expectation when the actions are chosen according to the rule d.
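    Plausibly (the displayed formula was lost in extraction) the quantity to bound is the shortfall of the estimated regime \hat d relative to the best regime in the class:

    \[
    \max_{d \in \mathcal{D}} E_d[\,Y\,] \;-\; E_{\hat d}[\,Y\,],
    \]

    where \mathcal{D} denotes the constrained class of regimes; the symbol \mathcal{D} is introduced here and is not from the slides.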

  • Slide 15/31

    Goal: Given a learning method, a model, and an approximation class, construct a finite sample upper bound for this shortfall.

    This upper bound should be composed of quantities that are minimized in the learning method.

    The learning method here is Q-learning.

  • Slide 16/31

    Finite Sample Bounds:

    Primary Assumptions:

    (1) [condition not captured in the transcript] for L > 1.

    (2) Number of possible actions is finite.

  • Slide 17/31

    Definition: [formula not captured in the transcript], where E, without a subscript, denotes expectation when the actions are randomized.

  • Slide 18/31

    Results:

    Approximation Error: [bound not captured in the transcript]

    The minimum is over [terms not captured in the transcript].
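    The displayed result was lost in extraction. For orientation only (an assumed reading, not the slide's statement): finite sample bounds of this kind decompose the shortfall into two pieces,

    \[
    \max_{d \in \mathcal{D}} E_d[\,Y\,] - E_{\hat d}[\,Y\,] \;\le\; (\text{approximation error}) + (\text{estimation error}),
    \]

    with the approximation error measuring how well the working models can represent the optimal Q-functions, and the estimation error shrinking with n.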

  • Slide 19/31

    Define the space [formula not captured in the transcript].

    The estimation error involves the complexity of this space.

  • Slide 20/31

    Estimation Error:

    For δ > 0, with probability at least 1 - δ, [bound not captured in the transcript] holds for n satisfying [condition not captured in the transcript].

  • Slide 21/31

    If the space is finite, then n needs only to satisfy [condition not captured in the transcript]; that is, [restated form not captured].

  • Slide 22/31

    Outline of Proof:

    The Q-functions for a fixed regime d are given by the following recursion.
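    The recursion itself was lost; the standard regime-specific form, which replaces the max in the optimal recursion by the action the regime actually selects, is

    \[
    Q_k^{d}(\bar o_k, \bar a_k) = E\big[\,Y \mid \bar O_k = \bar o_k,\ \bar A_k = \bar a_k\,\big],
    \]
    \[
    Q_t^{d}(\bar o_t, \bar a_t) = E\big[\,Q_{t+1}^{d}\big(\bar O_{t+1}, \bar A_t,\, d_{t+1}(\bar O_{t+1}, \bar A_t)\big) \mid \bar O_t = \bar o_t,\ \bar A_t = \bar a_t\,\big].
    \]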

  • Slide 23/31

    Proof Outline

    (1) [step not captured in the transcript]

  • Slide 24/31

    Proof Outline

    (2) [step not captured in the transcript]

    It turns out that also [formula not captured in the transcript].

  • Slide 25/31

    Proof Outline

    (3) [step not captured in the transcript]

  • Slide 26/31

    Shortcomings and Open Problems

  • Slide 27/31

    Recall Estimation Error:

    For δ > 0, with probability at least 1 - δ, [bound not captured in the transcript] holds for n satisfying [condition not captured in the transcript].

  • Slide 28/31

    Open Problems

    Is there a learning method that can learn the best decision rule in an approximation class, given a data set of n finite horizon trajectories?

    Sieve estimators or regularized estimators?

    Dealing with high dimensional X: feature extraction, feature selection.

  • Slide 29/31

    This seminar can be found at:

    http://www.stat.lsa.umich.edu/~samurphy/seminars/ims_bernoulli_0704.ppt

    The paper can be found at:

    http://www.stat.lsa.umich.edu/~samurphy/papers/Qlearning.pdf

    [email protected]

  • Slide 30/31

    Recall Proof Outline

    (2) [step not captured in the transcript]

    It turns out that also [formula not captured in the transcript].

  • Slide 31/31

    Recall Proof Outline

    (1) [step not captured in the transcript]