Transcript of Nonmyopic Active Learning of Gaussian Processes: An Exploration–Exploitation Approach, by Andreas Krause and Carlos Guestrin, Carnegie Mellon University.

Page 1: Nonmyopic Active Learning of Gaussian Processes: An Exploration–Exploitation Approach. Andreas Krause, Carlos Guestrin, Carnegie Mellon University.

River monitoring

Want to monitor the ecological condition of a river. Need to decide where to make observations!

Mixing zone of the San Joaquin and Merced rivers
[Figure: pH value vs. position along the transect (m), roughly in the range 7.4–8, measured with NIMS (UCLA).]

Page 2:

Observation Selection for Spatial prediction

Gaussian processes: a distribution over functions (e.g., how pH varies in space) that allows estimating the uncertainty in the prediction.

[Figure: GP regression of pH vs. horizontal position, showing observations, the unobserved process, the prediction, and confidence bands.]

Page 3:

Mutual Information [Caselton & Zidek 1984]

Finite set of possible locations V. For any subset A ⊆ V, we can compute MI(A) (defined below).
Want: A* = argmax_A MI(A) subject to |A| ≤ k.
Finding A* is an NP-hard optimization problem.

MI(A) = H(X_{V \ A}) − H(X_{V \ A} | X_A)
(entropy of the uninstrumented locations before sensing, minus their entropy after sensing)
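As a concrete illustration (not the authors' code), here is a minimal sketch of how MI(A) can be computed for a finite Gaussian model, assuming a known joint covariance matrix K over the locations V; the function name mutual_information is ours.

```python
import numpy as np

def mutual_information(K, A):
    """MI(A) = H(X_B) - H(X_B | X_A), where B = V minus A, for jointly Gaussian X_V
    with covariance K. A is a list of indices into V = {0, ..., n-1}."""
    V = np.arange(K.shape[0])
    B = np.setdiff1d(V, A)                        # uninstrumented locations
    if len(A) == 0 or len(B) == 0:
        return 0.0
    K_BB = K[np.ix_(B, B)]
    K_BA = K[np.ix_(B, A)]
    K_AA = K[np.ix_(A, A)]
    # Conditional covariance of X_B given X_A (Schur complement).
    K_B_given_A = K_BB - K_BA @ np.linalg.solve(K_AA, K_BA.T)
    # Gaussian entropies differ only in their log-determinants; the (2*pi*e)^d terms cancel.
    _, logdet_prior = np.linalg.slogdet(K_BB)
    _, logdet_post = np.linalg.slogdet(K_B_given_A)
    return 0.5 * (logdet_prior - logdet_post)
```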

Page 4:

Want to find: A* = argmax_{|A|=k} MI(A)
The greedy algorithm for finding near-optimal a priori sets:
  Start with A = ∅
  For i = 1 to k:
    s* := argmax_s MI(A ∪ {s})
    A := A ∪ {s*}

Theorem [ICML 2005, with Carlos Guestrin, Ajit Singh]:
MI(A_greedy) ≥ (1 − 1/e) max_{A: |A|=k} MI(A) − ε_MI
(the result of the greedy algorithm is within a constant factor, ~63%, of the optimal solution, up to ε_MI)

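A minimal sketch of this greedy rule, reusing the mutual_information() helper from the earlier sketch (both function names are ours, not from the paper):

```python
def greedy_set(K, k):
    """Greedily build A by repeatedly adding the location s that maximizes MI(A + [s])."""
    A = []
    for _ in range(k):
        candidates = [s for s in range(K.shape[0]) if s not in A]
        s_star = max(candidates, key=lambda s: mutual_information(K, A + [s]))
        A.append(s_star)
    return A
```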

Page 5:

Sequential design

Observed variables depend on previous measurements and observation policy

MI(π) = expected MI score over the outcomes of the observations

[Figure: an observation policy drawn as a decision tree. First observe X5; depending on the outcome (e.g., X5 = 17, i.e., < 20°C, vs. X5 = 21, i.e., ≥ 20°C), observe X3 or X2 next, then X7, X12, or X23, branching on further thresholds (e.g., > 15°C, ≥ 18°C). Example: the branch with X5 = 17, X3 = 16, X7 = 19 scores MI(X5=17, X3=16, X7=19) = 3.4; other branches score 2.1 and 2.4; the expected score of the whole policy is MI(π) = 3.1.]

Page 6:

A priori vs. sequential
Sets are very simple policies. Hence:

max_A MI(A) ≤ max_π MI(π), subject to |A| = |π| = k

Key question addressed in this work:

How much better is sequential vs. a priori design?

Main motivation: Performance guarantees about sequential design? A priori design is logistically much simpler!

Page 7:

GPs slightly more formally
Set of locations V; joint distribution P(X_V); for any A ⊆ V, P(X_A) is Gaussian. A GP is defined by:
  Prior mean μ(s) [often constant, e.g., 0]
  Kernel K(s,t)

[Figure: the pH transect example again (pH value vs. position along transect (m)); the locations V index the joint set of random variables X_V.]

Example: squared-exponential kernel
K(s,t) = θ1 exp(−‖s − t‖² / θ2²)
θ1: variance (amplitude); θ2: bandwidth
[Figure: correlation as a function of distance for the squared-exponential kernel.]
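A small sketch of the squared-exponential kernel in the parameterization shown on this slide (θ1 = variance/amplitude, θ2 = bandwidth); the tiny diagonal jitter is our own addition for numerical stability.

```python
import numpy as np

def sq_exp_kernel(X, theta1=1.0, theta2=1.0, jitter=1e-6):
    """X: (n, d) array of locations. Returns the n x n covariance matrix
    K[i, j] = theta1 * exp(-||x_i - x_j||^2 / theta2^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return theta1 * np.exp(-sq_dists / theta2 ** 2) + jitter * np.eye(len(X))
```

With this kernel, K = sq_exp_kernel(locations, theta1, theta2) can be plugged directly into the mutual_information() and greedy_set() sketches above.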

Page 8:

Known parameters (bandwidth, variance, etc.)
No benefit in sequential design!
max_A MI(A) = max_π MI(π)
Mutual information does not depend on the observed values: for a Gaussian, the conditional entropy depends only on the conditional covariance, not on x_A:
H(X_B | X_A = x_A) ∝ log |Σ^(θ)_{B|A}|

Page 9:

Unknown (discretized) parameters: prior P(Θ = θ)
Mutual information now does depend on the observed values!
P(x_B | x_A) = Σ_θ P(θ | x_A) · N(x_B; μ^(θ)_{B|A}, Σ^(θ)_{B|A})
Sequential design can be better!
max_A MI(A) ≤ max_π MI(π)   (and the right-hand side depends on the observations!)
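A sketch of the discretized-parameter posterior and the resulting predictive mixture from this slide, assuming a zero prior mean and reusing sq_exp_kernel() from above; the names thetas, prior, parameter_posterior, and predictive_mixture are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def parameter_posterior(thetas, prior, X_A, x_A):
    """P(theta | x_A) over a discretized parameter set.
    thetas: list of (theta1, theta2); prior: array of prior weights P(theta);
    X_A: (m, d) observed locations; x_A: (m,) observed values."""
    log_liks = np.array([
        multivariate_normal(mean=np.zeros(len(x_A)),
                            cov=sq_exp_kernel(X_A, t1, t2)).logpdf(x_A)
        for (t1, t2) in thetas
    ])
    post = np.asarray(prior, float) * np.exp(log_liks - log_liks.max())
    return post / post.sum()

def predictive_mixture(thetas, prior, X_A, x_A, X_B):
    """Components (weight, mean, covariance) of
    P(x_B | x_A) = sum_theta P(theta | x_A) N(x_B; mu_{B|A}, Sigma_{B|A})."""
    post = parameter_posterior(thetas, prior, X_A, x_A)
    components = []
    for w, (t1, t2) in zip(post, thetas):
        K = sq_exp_kernel(np.vstack([X_A, X_B]), t1, t2)
        m = len(x_A)
        K_AA, K_BA, K_BB = K[:m, :m], K[m:, :m], K[m:, m:]
        mu = K_BA @ np.linalg.solve(K_AA, x_A)                # mu_{B|A} (zero prior mean)
        Sigma = K_BB - K_BA @ np.linalg.solve(K_AA, K_BA.T)   # Sigma_{B|A}
        components.append((w, mu, Sigma))
    return components
```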

Page 10:

Key result: How big is the gap?
If θ is known: MI(A*) = MI(π*). If θ is "almost" known: MI(A*) ≈ MI(π*).
The gap depends on H(Θ).
Theorem:
MI(π*) ≤ MI(A*) + O(1) · H(Θ)
(MI of the best policy ≤ MI of the best set + gap size)
More precisely: MI(π*) ≤ Σ_θ P(θ) max_{|A|=k} MI(A | θ) + H(Θ)
As H(Θ) → 0, the MI of the best policy approaches the MI of the best parameter-specific set.

Page 11:

Near-optimal policy if parameters are approximately known
Use the greedy algorithm to optimize MI(A_greedy | Θ) = Σ_θ P(θ) MI(A_greedy | θ)
Note: |MI(A | Θ) − MI(A)| ≤ H(Θ). We can compute MI(A | Θ) analytically, but not MI(A).
Corollary [using our result from ICML 05]:
MI(A_greedy) ≥ (1 − 1/e) MI(π*) − ε − O(1) · H(Θ)
(the result of the greedy algorithm is within ~63% of the optimal sequential plan; the gap term is ≈ 0 when the parameters are known)
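A sketch of the corresponding greedy optimization of MI(A | Θ) = Σ_θ P(θ) MI(A | θ), reusing the earlier mutual_information() and sq_exp_kernel() sketches; again, the function names are ours.

```python
def expected_mi(K_list, p_theta, A):
    """MI(A | Theta) = sum_theta P(theta) * MI(A | theta); one covariance matrix per theta."""
    return sum(p * mutual_information(K, A) for p, K in zip(p_theta, K_list))

def greedy_set_expected(thetas, p_theta, X, k):
    """A priori greedy selection under a discretized parameter prior p_theta."""
    K_list = [sq_exp_kernel(X, t1, t2) for (t1, t2) in thetas]
    A = []
    for _ in range(k):
        candidates = [s for s in range(len(X)) if s not in A]
        s_star = max(candidates, key=lambda s: expected_mi(K_list, p_theta, A + [s]))
        A.append(s_star)
    return A
```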

Page 12:

Exploration–Exploitation for GPs: analogy with reinforcement learning
  Parameters: RL: transition model P(S_{t+1} | S_t, A_t) and reward Rew(S_t); GP active learning: kernel parameters θ.
  Known parameters (exploitation): RL: find a near-optimal policy by solving the MDP; GP active learning: find a near-optimal policy by finding the best set.
  Unknown parameters (exploration): RL: try to quickly learn the parameters, wasting only polynomially many robots! GP active learning: try to quickly learn the parameters. How many samples do we need?

Page 13:

Parameter info-gain exploration (IGE)
The gap depends on H(Θ). Intuitive heuristic: greedily select
s* = argmax_s I(Θ; X_s) = argmax_s H(Θ) − H(Θ | X_s)
(parameter entropy before observing s, minus the expected parameter entropy after observing s)
Does not directly try to improve spatial prediction; no sample complexity bounds.
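A Monte Carlo sketch of the IGE score I(Θ; X_s) = H(Θ) − E[H(Θ | X_s)], assuming a zero-mean GP and reusing sq_exp_kernel() and parameter_posterior() from the earlier sketches; the sampling scheme is our own illustration, not the paper's implementation, and prior is assumed normalized.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def ige_score(s, X, thetas, prior, n_samples=200, seed=0):
    """Estimate I(Theta; X_s) = H(Theta) - E_{x_s}[H(Theta | X_s = x_s)]."""
    rng = np.random.default_rng(seed)
    prior = np.asarray(prior, float)
    expected_post_entropy = 0.0
    for _ in range(n_samples):
        i = rng.choice(len(thetas), p=prior)             # theta ~ P(Theta)
        var = sq_exp_kernel(X[[s]], *thetas[i])[0, 0]    # Var(X_s | theta)
        x_s = rng.normal(0.0, np.sqrt(var))              # x_s ~ P(X_s | theta)
        post = parameter_posterior(thetas, prior, X[[s]], np.array([x_s]))
        expected_post_entropy += entropy(post) / n_samples
    return entropy(prior) - expected_post_entropy

# Usage (hypothetical): s_star = max(range(len(X)), key=lambda s: ige_score(s, X, thetas, prior))
```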

Page 14:

Implicit exploration (IE)
Intuition: any observation will help us reduce H(Θ).
Sequential greedy algorithm: given the previous observations X_A = x_A, greedily select
s* = argmax_s MI({X_s} | X_A = x_A, Θ)
Contrary to the a priori greedy algorithm, this algorithm takes the observed values into account (it updates the parameter posterior).
Proposition: H(Θ | X_π) ≤ H(Θ), i.e., "information never hurts" also holds for policies.
No sample complexity bounds.
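A sketch of implicit exploration, reusing expected_mi(), sq_exp_kernel(), and parameter_posterior() from the earlier sketches; observe() is a hypothetical callback standing in for taking a real measurement.

```python
import numpy as np

def implicit_exploration(X, thetas, prior, k, observe):
    """Sequential greedy: pick s maximizing expected MI under the current posterior over theta,
    observe it, and update the parameter posterior before the next pick."""
    K_list = [sq_exp_kernel(X, t1, t2) for (t1, t2) in thetas]
    A, x_A = [], []
    post = np.asarray(prior, float)
    for _ in range(k):
        candidates = [s for s in range(len(X)) if s not in A]
        s_star = max(candidates, key=lambda s: expected_mi(K_list, post, A + [s]))
        A.append(s_star)
        x_A.append(observe(s_star))                                      # hypothetical measurement
        post = parameter_posterior(thetas, prior, X[A], np.array(x_A))   # update P(theta | x_A)
    return A, post
```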

Page 15:

Learning the bandwidth
We can narrow down the kernel bandwidth by sensing inside and outside the bandwidth distance!
Sensors within the bandwidth are correlated; sensors outside the bandwidth are ≈ independent.
[Figure: kernel with its bandwidth marked, and sensor locations A, B, C inside and outside the bandwidth.]

Page 16:

Hypothesis testing: distinguishing two bandwidths
Squared-exponential kernel: K(s,t) = exp(−‖s − t‖² / θ²)
Choose pairs of samples at the distance where the correlation gap between the two candidate bandwidths is largest, and test the correlation!
[Figure: correlation vs. distance under BW = 1 and BW = 3; the correlation gap is largest at an intermediate distance. Sample functions drawn with BW = 1 and BW = 3.]
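A sketch of the "most informative distance" idea on this slide: under the unit-amplitude squared-exponential kernel above, pick the pair distance at which the correlation gap between two candidate bandwidths is largest; the grid and function name are ours.

```python
import numpy as np

def best_test_distance(bw_1, bw_2, d_max=10.0, n_grid=1000):
    """Distance maximizing |corr under bw_1 - corr under bw_2| for K(s,t) = exp(-d^2 / bw^2)."""
    d = np.linspace(1e-3, d_max, n_grid)
    gap = np.abs(np.exp(-d ** 2 / bw_1 ** 2) - np.exp(-d ** 2 / bw_2 ** 2))
    best = np.argmax(gap)
    return d[best], gap[best]

# Example: d_star, rho = best_test_distance(1.0, 3.0)   # rho is the correlation gap at d_star
```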

Page 17:

Hypothesis testing: sample complexity
Theorem: To distinguish two bandwidths with a minimum gap ρ in correlation and error < ε, we need
N̂ = O( (1/ρ²) log²(1/ε) )
independent samples.
In GPs, samples are dependent, but "almost" independent samples suffice! (details in paper)
Other tests can be used for variance, noise, etc. What if we want to distinguish more than two bandwidths?

Page 18:

Hypothesis testing: binary searching for the bandwidth
[Figure: posterior P(θ) over the discretized bandwidths.]
Find the "most informative split" at the posterior median.
The testing policy π_ITE needs only logarithmically many tests!
Theorem: If we have tests with error < ε_T, then
E_T[ MI(π_ITE + A_greedy | Θ) ] ≥ (1 − 1/e) MI(π*) − k·ε_MI − O(ε_T)
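A sketch of the binary-search idea: repeatedly split the current bandwidth posterior at its median and run a pairwise test between the two halves. Here test() is a hypothetical callback that returns True when the data favor the lower-bandwidth half, and the bookkeeping is our own illustration.

```python
import numpy as np

def binary_search_bandwidth(bandwidths, prior, test):
    """bandwidths: sorted candidate values; prior: P(theta) over them;
    test(b_low, b_high): True if the pairwise hypothesis test favors the lower half."""
    post = np.asarray(prior, float)
    active = np.arange(len(bandwidths))
    while len(active) > 1:
        cdf = np.cumsum(post[active]) / post[active].sum()
        # "Most informative split" at the posterior median (clipped so both halves are nonempty).
        split = min(int(np.searchsorted(cdf, 0.5)), len(active) - 2)
        lower, upper = active[:split + 1], active[split + 1:]
        if test(bandwidths[lower[-1]], bandwidths[upper[0]]):
            active = lower
        else:
            active = upper
    return bandwidths[active[0]]
```

With a roughly balanced posterior, each test discards about half of the remaining candidates, which is what gives the logarithmic test count claimed above.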

Page 19:

Exploration–Exploitation Algorithm
Exploration phase:
  Sample according to the exploration policy.
  Compute a bound on the gap between the best set and the best policy.
  If the bound is below a specified threshold, go to the exploitation phase; otherwise continue exploring.
Exploitation phase:
  Use the a priori greedy algorithm to select the remaining samples.
For hypothesis-testing exploration, we are guaranteed to proceed to exploitation after logarithmically many samples!
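A sketch of this two-phase scheme, stitching together the earlier sketches (entropy, expected_mi, parameter_posterior, sq_exp_kernel). Here we use the parameter entropy H(Θ), which bounds the set-vs-policy gap up to constants, as the stopping quantity; explore_step() and observe() are hypothetical callbacks (e.g., explore_step could return the next ITE test location).

```python
import numpy as np

def explore_then_exploit(X, thetas, prior, budget, threshold, explore_step, observe):
    """Explore until the gap bound (here H(Theta)) drops below threshold, then run a priori greedy."""
    A, x_A = [], []
    post = np.asarray(prior, float)
    K_list = [sq_exp_kernel(X, t1, t2) for (t1, t2) in thetas]
    # Exploration phase: sample according to the exploration policy, keep updating P(theta | x_A).
    while len(A) < budget and entropy(post) > threshold:
        s = explore_step(post, A)
        A.append(s)
        x_A.append(observe(s))
        post = parameter_posterior(thetas, prior, X[A], np.array(x_A))
    # Exploitation phase: a priori greedy under the (now sharp) parameter posterior.
    while len(A) < budget:
        candidates = [s for s in range(len(X)) if s not in A]
        s_star = max(candidates, key=lambda s: expected_mi(K_list, post, A + [s]))
        A.append(s_star)
        x_A.append(observe(s_star))
    return A, post
```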

Page 20:

Results (temperature data)
IGE: parameter info-gain exploration; ITE: hypothesis-testing exploration; IE: implicit exploration.
None of the strategies dominates the others; their usefulness depends on the application.
[Figures: RMS error vs. number of observations and parameter uncertainty vs. number of observations for IE, ITE, and IGE, on temperature data from an indoor sensor-network deployment (building floorplan with numbered sensor locations).]

Page 21:

Nonstationarity by spatial partitioning
Fit an isotropic GP for each region, weighted by region membership (a spatially varying linear combination).
[Figure: stationary fit vs. nonstationary fit along the transect (coordinates in m).]
Problem: the parameter space grows exponentially in the number of regions!
Solution: a variational approximation (BK-style) allows efficient approximate inference (details in paper).

Page 22:

Results on river data
The nonstationary model + active learning lead to lower RMS error.
[Figures: RMS error vs. number of observations for IE nonstationary, IE isotropic, and a priori nonstationary; sample placements along the transect (coordinates in m), with larger bars indicating later samples.]

Page 23:

Results on temperature data
IE reduces the error most quickly; IGE reduces the parameter entropy most quickly.
[Figures: RMS error vs. number of observations for IE isotropic, IE nonstationary, IGE nonstationary, and random nonstationary; parameter uncertainty vs. number of observations for IE nonstationary and IGE nonstationary.]

Page 24:

Conclusions
Nonmyopic approach towards active learning in GPs.
If the parameters are known, the greedy algorithm achieves near-optimal exploitation.
If the parameters are unknown, perform exploration:
  Implicit exploration (IE)
  Explicit exploration using information gain (IGE)
  Explicit exploration using hypothesis tests (ITE), with logarithmic sample complexity bounds!
Each exploration strategy has its own advantages.
The gap bound can be used to compute a stopping criterion.
Presented an extensive evaluation on real-world data.