Nonmyopic Active Learning of Gaussian Processes: An Exploration–Exploitation Approach
Andreas Krause


River monitoring

Want to monitor the ecological condition of a river. Need to decide where to make observations!

Mixing zone of San Joaquin and Merced rivers

[Figure: pH value vs. position along transect (m); NIMS (UCLA)]

Observation selection for spatial prediction

Gaussian processes: distribution over functions (e.g., how pH varies in space); allow estimating the uncertainty in predictions.

[Figure: GP prediction, showing observations, the unobserved process, the prediction, and confidence bands, plotted as pH value vs. horizontal position]

Mutual information [Caselton & Zidek 1984]

Finite set of possible locations V. For any subset A ⊆ V, can compute

MI(A) = H(V∖A) − H(V∖A | A)

i.e., the entropy of the uninstrumented locations before sensing, minus the entropy of the uninstrumented locations after sensing.

Want: A* = argmax MI(A) subject to |A| ≤ k. Finding A* is an NP-hard optimization problem.

Want to find: A* = argmax_{|A|=k} MI(A)
Greedy algorithm (for finding optimal a priori sets):
  Start with A = ∅
  For i = 1 to k:
    s* := argmax_s MI(A ∪ {s})
    A := A ∪ {s*}
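In code, the greedy rule can be sketched with NumPy. This is an illustrative sketch, not the authors' implementation: the squared-exponential kernel, the jitter term, and the 1-D example transect are assumptions made for the example.

```python
import numpy as np

def se_kernel(X, theta1=1.0, theta2=2.0):
    """Squared-exponential covariance over locations X (n x d), plus jitter."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return theta1 * np.exp(-d2 / theta2**2) + 1e-6 * np.eye(len(X))

def entropy(S):
    """Differential entropy of a Gaussian with covariance S."""
    n = len(S)
    return 0.5 * (n * np.log(2 * np.pi * np.e) + np.linalg.slogdet(S)[1])

def mi(K, A):
    """MI(A) = H(X_{V \\ A}) - H(X_{V \\ A} | X_A)."""
    B = [v for v in range(len(K)) if v not in A]
    if not A or not B:
        return 0.0
    KBA = K[np.ix_(B, A)]
    cond = K[np.ix_(B, B)] - KBA @ np.linalg.solve(K[np.ix_(A, A)], KBA.T)
    return entropy(K[np.ix_(B, B)]) - entropy(cond)

def greedy_mi(K, k):
    """Greedy a priori selection: repeatedly add s* = argmax_s MI(A + {s})."""
    A = []
    for _ in range(k):
        rest = [s for s in range(len(K)) if s not in A]
        A.append(max(rest, key=lambda s: mi(K, A + [s])))
    return A

locs = np.linspace(0, 10, 25).reshape(-1, 1)  # hypothetical 1-D transect
K = se_kernel(locs)
A = greedy_mi(K, 4)
```

Because MI is (approximately) submodular, this simple loop inherits the constant-factor guarantee stated in the theorem.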

Theorem [ICML 2005, with Carlos Guestrin, Ajit Singh]:

MI(A_greedy) ≥ (1 − 1/e) max_{A:|A|=k} MI(A) − ε_MI

The result of the greedy algorithm is within a constant factor (~63%) of the optimal solution.


Sequential design

Observed variables depend on previous measurements and on the observation policy π.
MI(π) = expected MI score over the outcomes of the observations.

[Figure: observation policy as a decision tree. First observe X5; if X5 < 20°C, observe X3 next, and if X5 ≥ 20°C, observe X7, and so on. Each branch of outcomes gets an MI score (e.g., MI(X5=17, X3=16, X7=19) = 3.4), and MI(π) = 3.1 is the expectation over branches.]

A priori vs. sequential
Sets are very simple policies. Hence:

max_A MI(A) ≤ max_π MI(π), subject to |A| = |π| = k

Key question addressed in this work:

How much better is sequential vs. a priori design?

Main motivation: can we get performance guarantees about sequential design? A priori design is logistically much simpler!

GPs slightly more formally
Set of locations V; joint distribution P(X_V); for any A ⊆ V, P(X_A) is Gaussian. A GP is defined by:

  Prior mean μ(s) [often constant, e.g., 0]
  Kernel K(s,t)

[Figure: pH value vs. position along transect (m); the set of locations V indexes the random variables X_V]

Example: squared-exponential kernel

K(s,t) = θ₁ exp(−‖s − t‖² / θ₂²)

θ₁: variance (amplitude); θ₂: bandwidth

[Figure: correlation vs. distance for the squared-exponential kernel]

Known parameters (bandwidth, variance, etc.)
No benefit in sequential design!

max_A MI(A) = max_π MI(π)

Mutual information does not depend on the observed values:

H(X_B | X_A = x_A) ∝ log |Σ(θ)_{B|A}|

Unknown (discretized) parameters: prior P(Θ = θ)
Mutual information does depend on the observed values!

P(x_B | x_A) = Σ_θ P(θ | x_A) N(x_B; μ(θ)_{B|A}, Σ(θ)_{B|A})

Sequential design can be better!

max_A MI(A) ≤ max_π MI(π), where MI(π) depends on the observations!
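The predictive mixture above can be computed directly for a discretized parameter set. A minimal sketch, assuming a zero-mean GP with a squared-exponential kernel; the function name, the grid of bandwidths, and the example data are illustrative assumptions:

```python
import numpy as np

def se_kernel(X, theta):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / theta**2) + 1e-6 * np.eye(len(X))

def predictive_mixture(X, x_A, A, B, thetas, prior):
    """P(x_B | x_A) = sum_theta P(theta | x_A) N(x_B; mu(theta)_{B|A}, Sigma(theta)_{B|A}).
    Returns the mixture weights and the per-theta conditional means/covariances."""
    log_w, means, covs = [], [], []
    for th, p in zip(thetas, prior):
        K = se_kernel(X, th)
        KAA, KBA = K[np.ix_(A, A)], K[np.ix_(B, A)]
        sol = np.linalg.solve(KAA, x_A)
        _, logdet = np.linalg.slogdet(KAA)
        log_w.append(np.log(p) - 0.5 * (logdet + x_A @ sol))  # ∝ P(theta | x_A)
        means.append(KBA @ sol)
        covs.append(K[np.ix_(B, B)] - KBA @ np.linalg.solve(KAA, KBA.T))
    w = np.exp(np.array(log_w) - max(log_w))
    return w / w.sum(), means, covs

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 12).reshape(-1, 1)
A, B = [0, 3, 6, 9], [1, 2]
x_A = rng.multivariate_normal(np.zeros(4), se_kernel(X, 2.0)[np.ix_(A, A)])
w, means, covs = predictive_mixture(X, x_A, A, B, [1.0, 2.0, 4.0], [1/3, 1/3, 1/3])
```

The weights w are the parameter posterior P(θ | x_A), which is exactly why the MI score now changes with the observed values.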

Key result: how big is the gap?

If θ is known: MI(A*) = MI(π*). If θ is "almost" known: MI(A*) ≈ MI(π*).
The gap depends on H(Θ).

Theorem:

MI(π*) ≤ MI(A*) + O(1)·H(Θ)

(MI of the best policy ≤ MI of the best set + gap size). More precisely:

MI(π*) ≤ Σ_θ P(θ) max_{|A|=k} MI(A | θ) + H(Θ)

As H(Θ) → 0, the MI of the best policy approaches the MI of the best parameter-specific set.

Near-optimal policy if parameters approximately known
Use the greedy algorithm to optimize

MI(A_greedy | Θ) = Σ_θ P(θ) MI(A_greedy | θ)

Note: |MI(A | Θ) − MI(A)| ≤ H(Θ). Can compute MI(A | θ) analytically, but not MI(A).

Corollary [using our result from ICML 05]:

MI(A_greedy) ≥ (1 − 1/e) MI(π*) − ε_MI − O(1)·H(Θ)

The result of the greedy algorithm is within ~63% of the optimal sequential plan; the gap is ≈ 0 for known parameters.

Exploration–exploitation for GPs: analogy with reinforcement learning

  Parameters: reinforcement learning has P(S_{t+1} | S_t, A_t) and Rew(S_t); active learning in GPs has the kernel parameters.
  Known parameters (exploitation): in RL, find a near-optimal policy by solving the MDP; in GPs, find a near-optimal policy by finding the best set.
  Unknown parameters (exploration): in RL, try to quickly learn the parameters (need to "waste" only polynomially many robots!); in GPs, try to quickly learn the parameters. How many samples do we need?

Parameter info-gain exploration (IGE)
The gap depends on H(Θ). Intuitive heuristic: greedily select

s* = argmax_s I(Θ; X_s) = argmax_s H(Θ) − H(Θ | X_s)

(parameter entropy before observing s, minus parameter entropy after observing s)

Does not directly try to improve spatial prediction; no sample-complexity bounds.
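A toy illustration of the IGE score I(Θ; X_s) for a single site, estimated by Monte Carlo. The setup, in which each hypothesis θ only changes the marginal variance at s, is an assumption made for the example, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    """Discrete entropy in nats."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def info_gain(site_var_by_theta, prior, n_samples=2000):
    """Monte Carlo estimate of I(Theta; X_s) = H(Theta) - E_x[H(Theta | X_s = x)],
    where X_s | theta ~ N(0, site_var_by_theta[theta])."""
    h_prior = entropy(prior)
    h_post = 0.0
    for _ in range(n_samples):
        th = rng.choice(len(prior), p=prior)                  # theta ~ P(Theta)
        x = rng.normal(0.0, np.sqrt(site_var_by_theta[th]))   # x ~ P(X_s | theta)
        lik = np.exp(-0.5 * x**2 / site_var_by_theta) / np.sqrt(site_var_by_theta)
        post = prior * lik
        post /= post.sum()
        h_post += entropy(post)
    return h_prior - h_post / n_samples

# Two hypotheses that differ in the marginal variance at site s.
prior = np.array([0.5, 0.5])
ig = info_gain(np.array([1.0, 9.0]), prior)
```

IGE would evaluate this score for every candidate site and observe the argmax.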

Implicit exploration (IE)
Intuition: any observation will help us reduce H(Θ). Sequential greedy algorithm: given previous observations X_A = x_A, greedily select

s* = argmax_s MI({X_s} | X_A = x_A, Θ)

Contrary to the a priori greedy algorithm, this algorithm takes the observations into account (updates the parameters).

Proposition: H(Θ | X_π) ≤ H(Θ). "Information never hurts" also holds for policies. No sample-complexity bounds.

Learning the bandwidth
Can narrow down the kernel bandwidth by sensing inside and outside the bandwidth distance! Sensors within the bandwidth are correlated; sensors outside the bandwidth are ≈ independent.

[Figure: sensors A, B, C placed on the squared-exponential correlation curve]

Squared-exponential kernel:

K(s,t) = exp(−‖s − t‖² / θ²)

Choose pairs of samples at a given distance to test correlation!

Hypothesis testing: distinguishing two bandwidths

[Figure: correlation vs. distance under BW = 1 and BW = 3; the test samples at the distance where the correlation gap is largest]

Hypothesis testing: sample complexity

Theorem: To distinguish bandwidths with minimum gap ρ in correlation and error < ε, we need

Ñ = O((1/ρ²) log²(1/ε))

independent samples.

In GPs, samples are dependent, but "almost" independent samples suffice! (details in paper)

Other tests can be used for variance, noise, etc. What if we want to distinguish more than two bandwidths?

Hypothesis testing: binary searching for the bandwidth

[Figure: posterior P(θ) over discretized bandwidths]

Find the "most informative split" at the posterior median.

The testing policy π_ITE needs only logarithmically many tests!

Theorem: If we have tests with error < ε_T, then

E_T[MI(π_ITE + A_greedy | Θ)] ≥ (1 − 1/e) MI(π*) − k·ε_MI − O(ε_T)

Exploration–exploitation algorithm
Exploration phase:
  Sample according to the exploration policy
  Compute a bound on the gap between the best set and the best policy
  If the bound < the specified threshold, go to the exploitation phase; otherwise continue exploring.
Exploitation phase:
  Use the a priori greedy algorithm to select the remaining samples.

For hypothesis testing, guaranteed to proceed to exploitation after logarithmically many samples!
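A schematic, runnable version of the two-phase loop, with loud caveats: the exploration rule here is a random stand-in (the paper uses IGE/IE/ITE), the exploitation step greedily reduces predictive variance under the MAP kernel as a simplified stand-in for the MI greedy rule, and the stopping rule thresholds H(Θ) directly, since the set-vs-policy gap bound shrinks with H(Θ):

```python
import numpy as np

rng = np.random.default_rng(3)

def se_kernel(X, bw):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / bw**2) + 1e-6 * np.eye(len(X))

def entropy_disc(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def cond_var(K, A, s):
    """Predictive variance of X_s given X_A under covariance K."""
    if not A:
        return K[s, s]
    KsA = K[np.ix_([s], A)]
    return K[s, s] - (KsA @ np.linalg.solve(K[np.ix_(A, A)], KsA.T))[0, 0]

def explore_exploit(X, x_true, bws, prior, k, h_threshold=0.1):
    Ks = [se_kernel(X, bw) for bw in bws]
    post = prior.copy()
    A = []
    # Exploration: observe sites and update the discretized posterior over theta,
    # stopping once H(Theta) is below the threshold.
    while entropy_disc(post) > h_threshold and len(A) < k:
        s = int(rng.integers(len(X)))       # stand-in exploration rule
        if s in A:
            continue
        A.append(s)
        x_A = x_true[A]
        logp = np.zeros(len(bws))
        for i, K in enumerate(Ks):          # Gaussian log-likelihood per theta
            KA = K[np.ix_(A, A)]
            _, logdet = np.linalg.slogdet(KA)
            logp[i] = np.log(prior[i]) - 0.5 * (logdet + x_A @ np.linalg.solve(KA, x_A))
        post = np.exp(logp - logp.max())
        post /= post.sum()
    # Exploitation: a priori greedy selection under the MAP kernel.
    K = Ks[int(np.argmax(post))]
    while len(A) < k:
        rest = [s for s in range(len(X)) if s not in A]
        A.append(max(rest, key=lambda s: cond_var(K, A, s)))
    return A, post

X = np.linspace(0, 10, 20).reshape(-1, 1)
bws = [1.0, 3.0]
x_true = rng.multivariate_normal(np.zeros(20), se_kernel(X, 3.0))
A, post = explore_exploit(X, x_true, bws, np.array([0.5, 0.5]), k=8)
```

With the ITE exploration rule in place of the random stand-in, the switch to exploitation is guaranteed after logarithmically many samples.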

Results

[Figure: RMS error vs. number of observations, and parameter uncertainty vs. number of observations, for IE, ITE, and IGE]

None of the strategies dominates the others; their usefulness depends on the application.

Temperature data

[Figure: lab floor plan with a deployment of 54 temperature sensors across server, lab, kitchen, copy, storage, conference, and office areas]

IGE: parameter info-gain; ITE: hypothesis testing; IE: implicit exploration

Nonstationarity by spatial partitioning

[Figure: stationary fit vs. nonstationary fit along the transect, coordinates (m)]

Isotropic GP for each region, weighted by region membership (a spatially varying linear combination).
Problem: the parameter space grows exponentially in the number of regions!
Solution: a variational approximation (BK-style) allows efficient approximate inference (details in paper).

Results on river data

[Figure: RMS error vs. number of observations for IE nonstationary, IE isotropic, and a priori nonstationary; larger bars = later sample]

The nonstationary model + active learning lead to lower RMS error.

[Figure: sample placement along the transect, coordinates (m), annotated with learned per-region parameters]

Results on temperature data

[Figure: RMS error vs. number of observations for IE isotropic, IE nonstationary, IGE nonstationary, and random nonstationary; parameter uncertainty vs. number of observations for IE nonstationary and IGE nonstationary]

IE reduces error most quickly; IGE reduces parameter entropy most quickly.

Conclusions
Nonmyopic approach towards active learning in GPs.
If the parameters are known, the greedy algorithm achieves near-optimal exploitation.
If the parameters are unknown, perform exploration:
  Implicit exploration
  Explicit exploration, using information gain
  Explicit exploration, using hypothesis tests, with logarithmic sample-complexity bounds!
Each exploration strategy has its own advantages.
Can use the bound to compute a stopping criterion.
Presented extensive evaluation on real-world data.