Introduction of "TrailBlazer" algorithm

BLAZING THE TRAILS BEFORE BEATING THE PATH:

SAMPLE-EFFICIENT MONTE-CARLO PLANNING

KATSUKI OHTO

@NIPS2016-YOMI

2017/1/19

INTRODUCED PAPER

• Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning (J.-B. Grill, M. Valko and R. Munos)
• NIPS 2016 accepted paper (poster session)
• Abstract starts with “You are a robot…”
• http://papers.nips.cc/paper/6253-blazing-the-trails-before-beating-the-path-sample-efficient-monte-carlo-planning

TRAILBLAZER

• Nested-fashion Monte-Carlo planning algorithm
• Problem settings: MDP (contains MAX nodes and AVG nodes)
    • Actions per state: finite
    • State transition candidates: finite or infinite
• Strong theoretical guarantee

(Figure: tree with alternating MAX and AVG nodes)

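AIM

• Input: an MDP (Markov Decision Process) with discount factor $\gamma$ and maximum number of valid actions $K$, together with $\varepsilon > 0$ and $0 < \delta < 1$
• Output: $\mu_{\varepsilon,\delta}$, the estimated value of the current state $s_0$
• Aim: get a good estimate of the real value $V[s_0]$ of the current state, such that
    $P(|\mu_{\varepsilon,\delta} - V[s_0]| > \varepsilon) \le \delta$
    ($P(A)$ means the probability of $A$), with the minimum number of calls to the generative model (state transition function)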
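To make the $(\varepsilon, \delta)$ criterion concrete, here is a minimal Python sketch for the simplest possible case: a single stochastic transition with no successor states, where plain Monte-Carlo averaging and Hoeffding's inequality already give the guarantee. The function name `hoeffding_samples` and the assumption that every reward lies in an interval of width `reward_range` are illustrative and not taken from the slides. TrailBlazer itself addresses the much harder case of a full tree.

```python
import math


def hoeffding_samples(eps, delta, reward_range=1.0):
    """Number of i.i.d. samples n for which the empirical mean satisfies
    P(|mean - expectation| > eps) <= delta, by Hoeffding's inequality.

    Assumption (not from the slides): every sample lies in an interval
    of width reward_range.
    """
    if eps <= 0 or not 0 < delta < 1:
        raise ValueError("need eps > 0 and 0 < delta < 1")
    # Hoeffding: P(|mean - E| >= eps) <= 2 * exp(-2 * n * eps^2 / range^2)
    return math.ceil(reward_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


# Each sample is one call to the generative model, so this is the
# sample complexity of the trivial one-node problem.
n = hoeffding_samples(eps=0.1, delta=0.05)
print(n)
```

Halving $\varepsilon$ roughly quadruples the number of calls, which is the $(1/\varepsilon)^2$ behaviour of plain Monte-Carlo estimation.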

1 PLAYER TREE MODEL IN STOCHASTIC ENVIRONMENT

• Each MAX node represents an opportunity to decide on an action
• Each AVG node represents a stochastic state transition

(Figure: tree with alternating MAX and AVG nodes)
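The tree model can be illustrated with a small exact evaluator, assuming the transition probabilities are known. This is only a sketch of what the MAX and AVG nodes compute, not the TrailBlazer algorithm, which has access to samples only. The class names `MaxNode` and `AvgNode`, the `outcomes` layout, and the convention that a MAX node without actions has value 0 are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MaxNode:
    """A state: the player picks the best available action (AVG child)."""
    children: List["AvgNode"] = field(default_factory=list)


@dataclass
class AvgNode:
    """A state-action pair with a stochastic outcome.

    Each outcome is (probability, immediate reward, next MAX node).
    """
    outcomes: List[Tuple[float, float, MaxNode]] = field(default_factory=list)


def value(node, gamma):
    """Exact value of a node by recursive max / expectation."""
    if isinstance(node, MaxNode):
        if not node.children:
            return 0.0  # assumed convention: no actions left, no more reward
        return max(value(child, gamma) for child in node.children)
    # AVG node: expected immediate reward plus discounted expected next value
    return sum(p * (r + gamma * value(nxt, gamma)) for p, r, nxt in node.outcomes)


leaf = MaxNode()
risky = AvgNode([(0.5, 1.0, leaf), (0.5, 0.0, leaf)])  # expected reward 0.5
safe = AvgNode([(1.0, 0.3, leaf)])                     # expected reward 0.3
root = MaxNode([risky, safe])
print(value(root, gamma=0.9))
```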

ALGORITHM OVERVIEW

• Global initialization: set $\eta$, $\lambda$ as global values; set $m$ as an argument of the root node
• Recursive algorithm

ALGORITHM OVERVIEW 2

• In both MAX nodes and AVG nodes, the arguments are $m$ (desired branching factor) and $\varepsilon$ (admissible estimation error)
• If $m$ is large, we can search many children, but it takes a lot of time (dilemma)
• If $\varepsilon$ is small, we can search deeply, but it takes a lot of time (dilemma)

ALGORITHM FOR AVG NODES

• Input: $m$ and $\varepsilon$
• Output: estimated value
• If the admissible error is large, ignore the successive reward
• Fill up the transition samples (and store the immediate rewards)
• Search all of the sampled next states
• Return averaged immediate reward + estimated successive reward
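The AVG-node steps can be sketched in Python as follows. This is a minimal sketch based on one reading of the paper's pseudocode, not a reference implementation. The class name `AvgNodeSketch` and the two callables `sample_transition` and `call_child` are hypothetical stand-ins for the generative model and for the recursive call to a child MAX node. The early-exit threshold $1/(1-\gamma)$ assumes rewards in $[0, 1]$, and rounding $m$ up to an integer is a simplification made here.

```python
import math
from collections import Counter


class AvgNodeSketch:
    """Sketch of an AVG node; samples are kept between calls and reused."""

    def __init__(self, sample_transition, call_child, gamma):
        self.sample_transition = sample_transition  # () -> (next_state, reward)
        self.call_child = call_child                # (next_state, k, eps) -> value
        self.gamma = gamma
        self.sampled = []       # next states sampled so far
        self.reward_sum = 0.0   # sum of the immediate rewards sampled so far

    def estimate(self, m, eps):
        # Admissible error is large: any value is within eps, so answer 0
        # without spending samples (assumes rewards in [0, 1]).
        if eps >= 1.0 / (1.0 - self.gamma):
            return 0.0
        m = max(1, math.ceil(m))  # simplification: integer number of samples
        # Fill up the transition samples and store the immediate rewards.
        while len(self.sampled) < m:
            next_state, reward = self.sample_transition()
            self.sampled.append(next_state)
            self.reward_sum += reward
        active = self.sampled[:m]
        # Search every distinct sampled next state, weighted by its count k.
        mu = 0.0
        for next_state, k in Counter(active).items():
            nu = self.call_child(next_state, k, eps / self.gamma)
            mu += nu * k / m
        # Averaged immediate reward + discounted estimated successive reward.
        return self.gamma * mu + self.reward_sum / len(self.sampled)
```

Because a child's value is discounted by $\gamma$, the child may be estimated with the looser error $\varepsilon/\gamma$, and a next state that was sampled $k$ times is asked for only $k$ samples of its own.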

ALGORITHM FOR MAX NODES

• Input: $m$ and $\varepsilon$
• Output: estimated value
• Fill the candidate action pool with all valid actions
• $U$ is a value like the standard error of the estimate
• Search the candidate actions repeatedly until “Only 1 action left” or “Error might be small”
• If “Error might be small”, return the estimated value of the best action; else search the best action 1 more time carefully
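The MAX-node loop can be sketched in Python as an action-elimination procedure. This is a hedged sketch under stated assumptions, not the paper's exact algorithm: `call_action` is a hypothetical stand-in for the recursive call to a child AVG node, and `width` is a hypothetical callable returning the confidence width $U$ for round `l` (the paper defines its own expression, which is not reproduced here). The pruning margin and the accuracies passed to the children follow one reading of the paper's pseudocode and should be treated as assumptions.

```python
def max_node_estimate(actions, call_action, m, eps, eta, width):
    """Sketch of a MAX node.

    actions     : iterable of valid actions
    call_action : (action, budget, eps) -> estimated value of that action
    m, eps      : desired branching factor and admissible error
    eta         : global parameter with 0 < eta < 1
    width       : l -> confidence width U, shrinking to 0 as l grows
    """
    candidates = list(actions)   # candidate action pool
    mu = {}
    l = 1
    while len(candidates) > 1:
        u = width(l)
        # Evaluate every remaining candidate with a small budget l.
        mu = {b: call_action(b, l, u * eta / (1.0 - eta)) for b in candidates}
        # Drop actions whose upper bound is below the best lower bound.
        margin = 2.0 * u / (1.0 - eta)
        best_lower = max(mu.values()) - margin
        candidates = [b for b in candidates if mu[b] + margin >= best_lower]
        if u <= (1.0 - eta) * eps:   # "Error might be small"
            break
        l += 1
    if len(candidates) > 1:
        # Several actions remain, but they are all accurate enough.
        return max(mu[b] for b in candidates)
    # "Only 1 action left": search it one more time with the full budget.
    return call_action(candidates[0], m, eta * eps)
```

Any `width` that shrinks to zero as `l` grows makes the loop terminate. Clearly inferior actions are discarded after a few cheap rounds, so the full budget $m$ is spent only on the single surviving action.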

SAMPLE COMPLEXITY OF TRAILBLAZER

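• Sample complexity (the number of calls to the generative model) is a measure of the performance of the algorithm
• If $N$ (the number of next states) is finite, the sample complexity is of order $(1/\varepsilon)^{\max(2,\ \log(N\kappa)/\log(1/\gamma) + o(1))}$, under a condition given in detail in the paper
• Else, it is of order $(1/\varepsilon)^{2+d}$, under a condition given in detail in the paper; $d$ is a measure of the difficulty of identifying near-optimal nodes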
