
Optimal Sensor Scheduling via Classification Reduction of Policy Search (CROPS)

ICAPS Workshop 2006

Doron Blatt and Alfred Hero

University of Michigan

Motivating Example: Landmine Detection

A vehicle carries three sensors for landmine detection, each with its own characteristics.

The goal is to optimally schedule the three sensors for mine detection.

This is a sequential choice of experiment problem (DeGroot 1970).

We do not know the model but can generate data through experiments and simulations.

[Figure: example sensor-scheduling decision trees. At a new location containing a plastic anti-personnel mine, a nail, a rock, or a plastic anti-tank mine, the scheduler chooses among the EMI, GPR, and Seismic sensors, collects the corresponding data, and either deploys another sensor or makes the final detection.]

Reinforcement Learning

General objective: to find optimal policies for controlling stochastic decision processes:
- without an explicit model.
- when the exact solution is intractable.

Applications:
- Sensor scheduling.
- Treatment design.
- Elevator dispatching.
- Robotics.
- Electric power system control.
- Job-shop scheduling.

The Optimal Policy

The optimal policy satisfies a backward recursion and can be found via dynamic programming, where the policy q_t corresponds to random action selection.
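For reference, the standard finite-horizon recursion can be sketched as follows (generic notation with observations O_t, actions a_t, rewards R_t, and horizon T; not necessarily the slide's exact formulation in terms of the random-action policy q_t):

$$\pi^{*} \in \arg\max_{\pi}\; E\!\left[\sum_{t=0}^{T} R_t \;\middle|\; a_t = \pi_t(O_t)\right]$$

$$Q_T(O_T, a_T) = E\!\left[R_T \mid O_T, a_T\right], \qquad
Q_t(O_t, a_t) = E\!\left[R_t + \max_{a} Q_{t+1}(O_{t+1}, a) \;\middle|\; O_t, a_t\right], \qquad
\pi_t^{*}(O_t) \in \arg\max_{a} Q_t(O_t, a).$$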

The Generative Model Assumption

Generative model assumption (Kearns et al. 2000):
- The explicit model is unknown.
- It is possible to generate trajectories by simulation or experiment.

[Figure: a trajectory tree for a three-stage problem with binary actions. From the root observation O_0, both actions a_0 = 0 and a_0 = 1 are simulated, producing observations O_1^0 and O_1^1; the branching is repeated for a_1 and a_2, so the tree contains the observations O_2 and O_3 for every one of the eight action histories.]

M. Kearns, Y. Mansour, and A. Ng, “Approximate planning in large POMDPs via reusable trajectories,” in Advances in Neural Information Processing Systems, vol. 12. MIT Press, 2000.

Learning from Generative Models

It is possible to evaluate the value of any policy from trajectory trees.

Let R_i(π) denote the sum of rewards on the path of the i-th tree that agrees with policy π. Then the value of π is estimated by averaging R_i(π) over the trees.
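A sketch in the notation of Kearns, Mansour, and Ng (2000): with n trajectory trees and R_i(π) as above, the value estimate is

$$\hat V_n(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} R_i(\pi),$$

which is an unbiased estimate of $V(\pi) = E\big[\sum_{t=0}^{T} R_t\big]$ under π, because every tree contains the outcome of every action sequence.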

[Figure: the trajectory tree from the previous slide.]

Three Sources of Error in RL

- Misallocation of approximation resources to the state space: without knowing the optimal policy, one cannot sample from the distribution that it induces on the stochastic system's state space.
- Coupling of optimal decisions at each stage: finding the optimal decision rule at a certain stage hinges on knowing the optimal decision rule for future stages.
- Inadequate control of generalization errors: without a model, ensemble averages must be approximated from training trajectories.

J. Bagnell, S. Kakade, A. Ng, and J. Schneider, “Policy search by dynamic programming,” in Advances in Neural Information Processing Systems, vol. 16. 2003.

A. Fern, S. Yoon, and R. Givan, “Approximate policy iteration with a policy language bias,” in Advances in Neural Information Processing Systems, vol. 16, 2003.

M. Lagoudakis and R. Parr, “Reinforcement learning as classification: Leveraging modern classifiers,” in Proceedings of the Twentieth International Conference on Machine Learning, 2003.

J. Langford and B. Zadrozny, “Reducing T-step reinforcement learning to classification,” http://hunch.net/~jl/projects/reductions/reductions.html, 2003.

M. Kearns, Y. Mansour, and A. Ng, “Approximate planning in large POMDPs via reusable trajectories,” in Advances in Neural Information Processing Systems, vol. 12. MIT Press, 2000.

S. A. Murphy, “A generalization error for Q-learning,” Journal of Machine Learning Research, vol. 6, pp. 1073–1097, 2005.

Learning from Generative Models

Drawback: the combinatorial optimization problem of maximizing the empirical value over the policy class can only be solved for small n and small policy classes.

Our remedies:
- Break the multi-stage search problem into a sequence of single-stage optimization problems.
- Use a convex surrogate to simplify each optimization problem.

We will obtain generalization bounds similar to those of Kearns et al. (2000), but which apply to the case in which the decision rules are estimated sequentially by reduction to classification.

Fitting the Hindsight Path

Zadrozny and Langford (2003): on each tree, find the reward-maximizing path.

Fit T+1 classifiers to these paths. Driving the classification error to zero is equivalent to finding the optimal policy.

Drawback: in stochastic problems, no classifier can predict the hindsight action choices.
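As an illustration only (not the authors' code), the following sketch extracts the hindsight training data from a set of trajectory trees; the tree layout, field names, and helper functions are assumptions made for this example.

```python
def best_path(node):
    """Return (total_reward, [(obs, action), ...]) for the reward-maximizing
    path below `node`.  A node is assumed to be a dict with keys
    'obs', 'reward', and 'children' (a dict mapping action -> child node)."""
    if not node.get("children"):            # leaf: no further decisions
        return node["reward"], []
    best = None
    for action, child in node["children"].items():
        sub_reward, sub_path = best_path(child)
        total = node["reward"] + sub_reward
        if best is None or total > best[0]:
            best = (total, [(node["obs"], action)] + sub_path)
    return best


def hindsight_datasets(trees, horizon):
    """Collect one (observation, action) training set per stage t = 0..horizon,
    taken from the reward-maximizing path of every trajectory tree."""
    datasets = [[] for _ in range(horizon + 1)]
    for tree in trees:
        _, path = best_path(tree)
        for t, (obs, action) in enumerate(path):
            datasets[t].append((obs, action))
    return datasets   # feed datasets[t] to the stage-t classifier
```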


Our Approximate Dynamic Programming Approach

Assume the policy class has the product form π = (π_0, …, π_T), with one decision rule per stage.

Estimate π_T via tree pruning, by maximizing the empirical value over the last-stage rules; this is the empirical equivalent of maximizing the corresponding expected value. Call the resulting rule the estimated π_T.

[Figure: the pruned tree for estimating π_T. Random actions are chosen at stages 0 through T-1 ("Choose random actions"), and both stage-T actions are kept so that the remaining single-stage RL problem at stage T can be solved ("Solve single-stage RL problem").]

Our Approximate Dynamic Programming Approach

Estimate π_{T-1}, given the estimated π_T, via tree pruning; this is the empirical equivalent of maximizing the expected value of the stage-(T-1) rule when stage T follows the estimated π_T.

[Figure: the pruned tree for estimating π_{T-1}. Random actions are chosen at the earlier stages ("Choose random actions"), both stage-(T-1) actions are kept ("Solve single-stage RL problem"), and rewards from stage T are propagated according to the estimated π_T ("Propagate rewards according to π_T").]

Our Approximate Dynamic Programming Approach

Estimate π_{T-2} = π_0 (here T = 2), given the estimated π_1 and π_2, via tree pruning; this is the empirical equivalent of maximizing the expected value of the stage-0 rule when the later stages follow the estimated π_1 and π_2.

[Figure: the pruned tree for estimating π_0. Both stage-0 actions are kept ("Solve single-stage RL problem"), and rewards are propagated along each branch according to the estimated π_1 and π_2 ("Propagate rewards according to π_1, π_2").]
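Putting the three slides together, a minimal sketch of the backward-induction loop follows. It is illustrative only: the tree representation, the weighted-classifier training routine, and all names are assumptions of this sketch, and actions are taken to be binary.

```python
def rollout(node, policies, t):
    """Return of following the fitted rules policies[t], policies[t+1], ...
    from `node` (which sits at stage t).  A node is a dict with keys
    'obs' and 'branches', where 'branches' maps each action to a
    (reward, child_or_None) pair."""
    if node is None or t >= len(policies):
        return 0.0
    a = policies[t](node["obs"])
    reward, child = node["branches"][a]
    return reward + rollout(child, policies, t + 1)


def nodes_at_stage(tree, t):
    """All nodes of a trajectory tree at depth t (every branch is kept,
    because earlier actions were chosen at random when the tree was built)."""
    level = [tree]
    for _ in range(t):
        level = [child for node in level
                 for (_, child) in node["branches"].values() if child]
    return level


def crops_backward_induction(trees, horizon, fit_weighted_classifier):
    """Estimate one decision rule per stage, last stage first.
    `fit_weighted_classifier(examples)` is assumed to return a function
    obs -> action from weighted examples (obs, preferred_action, weight)."""
    policies = [None] * (horizon + 1)
    for t in range(horizon, -1, -1):          # t = T, T-1, ..., 0
        examples = []
        for tree in trees:
            for node in nodes_at_stage(tree, t):
                returns = {}
                for a, (reward, child) in node["branches"].items():
                    # reward now + rewards propagated via already-fitted rules
                    returns[a] = reward + rollout(child, policies, t + 1)
                best = max(returns, key=returns.get)
                weight = max(returns.values()) - min(returns.values())
                examples.append((node["obs"], best, weight))
        policies[t] = fit_weighted_classifier(examples)
    return policies
```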

Reduction to Weighted Classification

Our approximate dynamic programming algorithm converts the multi-stage optimization problem into a sequence of single-stage optimization problems. Unfortunately, each problem in the sequence is still a combinatorial optimization problem. Our solution: reduce it to learning classifiers with a convex surrogate. This classification reduction is different from previous work.

Consider a single-stage RL problem with binary actions. Consider a class of real-valued functions F; each f in F induces a policy that takes the action given by the sign of f(O_0). We would like to maximize the value of this policy over f in F.

[Figure: a single-stage trajectory tree. From observation O_0, both actions a_0 = 1 and a_0 = -1 are simulated, yielding the corresponding outcomes O_1^1 and O_1^{-1}.]

Reduction to Weighted Classification

Note that the value of the induced policy can be rewritten, up to a constant, as a negative weighted 0-1 loss. Therefore, solving a single-stage RL problem is equivalent to a weighted classification problem, where the label is the action with the larger reward and the weight is the absolute difference between the two actions' rewards.
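Spelled out for the binary-action case (a sketch consistent with the figure's notation, writing R_i(1) and R_i(-1) for the two rewards observed on the i-th single-stage tree with root observation O_0^i, and π_f(o) = sign(f(o))):

$$\frac{1}{n}\sum_{i=1}^{n} R_i\big(\pi_f(O_0^i)\big)
\;=\; \frac{1}{n}\sum_{i=1}^{n} \max\{R_i(1), R_i(-1)\}
\;-\; \frac{1}{n}\sum_{i=1}^{n} w_i\,\mathbf{1}\{\pi_f(O_0^i)\neq y_i\},$$

with weights $w_i = |R_i(1)-R_i(-1)|$ and labels $y_i = \operatorname{sign}\big(R_i(1)-R_i(-1)\big)$. The first term does not depend on $f$, so maximizing the empirical value over $f \in F$ is exactly the weighted classification problem of minimizing $\frac{1}{n}\sum_i w_i\,\mathbf{1}\{\pi_f(O_0^i)\neq y_i\}$.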

Reduction to Weighted Classification

It is often much easier to solve the surrogate problem in which the weighted 0-1 loss is replaced by a convex function φ. For example:
- In neural network training, φ is the truncated quadratic loss.
- In boosting, φ is the exponential loss.
- In support vector machines, φ is the hinge loss.
- In logistic regression, φ is the scaled deviance.

The effect of introducing φ is well understood for the classification problem, and the results can be applied to the single-stage RL problem as well.
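For concreteness, here is a minimal sketch of one such surrogate minimization: a weighted logistic-type loss minimized by plain gradient descent over linear score functions. It is illustrative only and is not the network architecture used later in the talk.

```python
import numpy as np

def fit_weighted_surrogate(X, y, w, lr=0.1, epochs=500):
    """Minimize (1/n) * sum_i w_i * phi(y_i * f(x_i)), with
    phi(z) = log(1 + exp(-z)) as the convex surrogate of the 0-1 loss,
    over linear scores f(x) = x . theta + b.
    X: (n, d) observations, y: (n,) labels in {-1, +1}, w: (n,) weights."""
    n, d = X.shape
    theta, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ theta + b)
        # derivative of log(1 + exp(-z)) is -1 / (1 + exp(z))
        grad_z = -1.0 / (1.0 + np.exp(margins))
        coeff = w * y * grad_z / n
        theta -= lr * (X.T @ coeff)
        b -= lr * coeff.sum()
    # the induced policy: choose the action sign(f(x))
    return lambda x: int(np.sign(np.asarray(x) @ theta + b) or 1)
```

Feeding this routine the weighted examples (O_0^i, y_i, w_i) from the previous slide yields a stage's decision rule.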

Reduction to Weighted Classification: the Multi-Stage Problem

Let the estimated policy be the one produced by the approximate dynamic programming algorithm, where each single-stage RL problem is solved via surrogate (φ-risk) minimization.

Theorem 2: Assume P-dim(F_t) = d_t, t = 0, …, T. Then, with probability greater than 1-δ over the set of trajectory trees, a generalization bound on the value of the estimated policy holds for n satisfying the stated sample-size condition.

The proof uses recent results in: P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, March 2006.

The bound is tighter than the analogous Q-learning bound (Murphy, JMLR 2005).

Application to Landmine Sensor Scheduling

A sandbox experiment was conducted by Jay Marble to extract features of the three sensors for different types of landmines and clutter.

Based on the results, the sensors' outputs were simulated as a Gaussian mixture.

Feed-forward neural networks were trained to perform both the classification task and the weighted classification task.

Performance was evaluated on a separate data set.

[Figure: the sensor-scheduling scenario from the motivating example, and a performance plot as sensor deployment cost increases. The optimal sensor-scheduling policy is compared against randomized sensor allocation and against fixed strategies: always deploy all three sensors, always deploy the best pair of sensors (GPR + Seismic), and always deploy the best single sensor (EMI).]

Reinforcement Learning for Sensor Scheduling: Weighted Classification Reduction

Sensor characteristics by object type:

Sensor      | Feature      | 1 M-AT | 2 M-AP | 3 P-AT | 4 P-AP | 5 Cltr-1 | 6 Cltr-2 | 7 Cltr-3 | 8 Bkg
EMI (1)     | Conductivity | High   | High   | Medium | High   | High     | Low      | Low      | Low
EMI (1)     | Size         | High   | High   | High   | Medium | Medium   | Low      | Low      | Low
GPR (2)     | Depth        | High   | Medium | High   | Medium | Low      | Low      | Low      | Low
GPR (2)     | RCS          | High   | Medium | High   | Medium | High     | High     | High     | Low
Seismic (3) | Resonance    | High   | Medium | High   | Medium | Medium   | Medium   | Low      | Low

Optimal Policy for Mean States

[Figure: the policy for specific scenarios, showing the optimal sensor sequence for the mean state of each object type, e.g. 23D, 21D, 213D (1 = EMI, 2 = GPR, 3 = Seismic, D = final detection).]

Application to Waveform Selection: Landsat MSS Experiment

The data consist of 4435 training cases and 2000 test cases. Each case is a 3x3x4 image stack (36 dimensions) with one class attribute:
(1) Red soil, (2) Cotton, (3) Vegetation stubble, (4) Gray soil, (5) Damp gray soil, (6) Very damp gray soil.

• For each image location we adopt a two-stage policy to classify its label:
• Select one of 6 possible pairs of the 4 MSS bands for initial illumination.
• Based on the initial measurement, either:
  • make the final decision on the terrain class and stop, or
  • illuminate with the remaining two MSS bands and make the final decision.
• The reward is the average probability of a correct decision minus the stopping time (energy).
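As an illustration of this reward structure, here is a sketch of evaluating the two-stage policy at one location; the classifier interfaces, band indexing, and cost value are assumptions of the example, not the experiment's actual code.

```python
def two_stage_reward(pixel_bands, true_class, initial_pair,
                     classify_initial, should_stop, classify_all, c=0.06):
    """Evaluate the two-stage Landsat policy on one location.
    pixel_bands: dict band_index -> 9-dimensional measurement (3x3 window).
    initial_pair: tuple of the two bands illuminated first.
    classify_initial / classify_all: classifiers on the partial / full data.
    should_stop: stage-two rule deciding whether to stop after the first pair.
    Reward is I(correct) if we stop, and I(correct) - c if the remaining
    two bands are also illuminated (c is the energy cost)."""
    first = {b: pixel_bands[b] for b in initial_pair}
    if should_stop(first):
        decision = classify_initial(first)
        return float(decision == true_class)
    decision = classify_all(pixel_bands)          # all four bands used
    return float(decision == true_class) - c
```

Averaging this reward over locations gives the quantity the scheduler maximizes.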

Waveform Scheduling: CROPS

[Figure: the two-stage waveform-scheduling policy. At a new location, one of the band pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) is selected for initial illumination; the policy then either classifies immediately (reward = I(correct)) or illuminates the remaining two bands and classifies (reward = I(correct) - c).]

[Figure: probability of error Pe (about 0.095 to 0.135) versus expected number of dwells (1 to 2), traced out by varying the band cost C (values between 0 and 0.18), for neural network and k-nearest-neighbor policies.]

Reinforcement Learning for Sensor Scheduling: Weighted Classification Reduction

LANDSAT data: a total of 4 bands, each producing a 9-dimensional vector.

Best myopic initial pair: (1,2). Non-myopic initial pair: (2,3). The performance obtained with all four bands is marked for reference.

* C is the cost of using the additional two bands.

Sub-band Optimal Scheduling

Optimal initial sub-bands are 1+2.

Clutter type                     |  1   |  2   |  3   |  4   |  5   |  6   |  Pc
Performance of sub-bands 1+2     | 0.98 | 0.85 | 0.96 | 0    | 0.6  | 0.94 | 0.806
Performance of all sub-bands     | 0.97 | 0.95 | 0.92 | 0.54 | 0.84 | 0.82 | 0.862
Performance of optimal scheduler | 0.98 | 0.94 | 0.93 | 0.51 | 0.84 | 0.82 | 0.861

Policy statistics: the optimal scheduler uses the full spectrum only 60% of the time.


Conclusions

Elements of CROPS:
- A Gauss-Seidel-type DP approximation reduces the multi-stage problem to a sequence of single-stage RL problems.
- A classification reduction is used to solve each of these single-stage RL problems.

We obtained tight finite-sample generalization error bounds for RL based on classification theory.

The CROPS methodology was illustrated for energy-constrained landmine detection and waveform selection.

Publications

Blatt D., "Adaptive Sensing in Uncertain Environments," PhD Thesis, Dept. of EECS, University of Michigan, 2006.

Blatt D. and Hero A. O., "From weighted classification to policy search," Nineteenth Conference on Neural Information Processing Systems (NIPS), 2005.

Kreucher C., Blatt D., Hero A. O., and Kastella K., "Adaptive multi-modality sensor scheduling for detection and tracking of smart targets," Digital Signal Processing, 2005.

Blatt D., Murphy S. A., and Zhu J., "A-learning for Approximate Planning," Technical Report 04-63, The Methodology Center, Pennsylvania State University, 2004.

Simulation Details

Dimension reduction: PCA subspace explaining 99.9% of the variance (13-18 dimensions).

Sub-bands | Dim
1+2       | 13
1+3       | 17
1+4       | 17
2+3       | 15
2+4       | 15
3+4       | 15
1+2+3+4   | 18

State at time t: projection of the collected data onto the PCA subspace.

Policy search:
- Weighted classification building block: a weight-sensitive combination of [5,2] and [6,2] [tansig, logsig] neural networks.

Label classifier:
- Unweighted classification building block: a combination of [5,6] and [6,6] [tansig, logsig] feed-forward neural networks.

Training used 1500 trajectories for the label classifiers and 2935 trajectories for policy search.
- Adaptive-length gradient learning with a momentum term.
- Reseeding was applied to avoid local minima.

Performance evaluation used 2000 trajectories.

Sub-band Performance Matrix

Sub-bands |  1   |  2   |  3   |  4   |  5   |  6   |  Pc
1+2 *     | 0.98 | 0.85 | 0.96 | 0    | 0.6  | 0.94 | 0.806
1+3       | 0.90 | 0.84 | 0.91 | 0.55 | 0.56 | 0.8  | 0.796
1+4       | 0.96 | 0.93 | 0.92 | 0.48 | 0.56 | 0.76 | 0.803
2+3 **    | 0.91 | 0.94 | 0.84 | 0.56 | 0.65 | 0.82 | 0.812
2+4       | 0.90 | 0.92 | 0.9  | 0.18 | 0.76 | 0.87 | 0.805
3+4       | 0.86 | 0.92 | 0.76 | 0.5  | 0.42 | 0.79 | 0.739
All       | 0.97 | 0.95 | 0.92 | 0.54 | 0.84 | 0.82 | 0.862

* Best myopic choice.
** Best non-myopic choice when likely to take more than one observation.