Transcript of IDEAL @ UT: "Spectral learning algorithms for dynamical systems" (Geoff Gordon, UT Austin, February 2012)

Spectral learning algorithms for dynamical systems

Geoff Gordon, http://www.cs.cmu.edu/~ggordon/

Machine Learning Department, Carnegie Mellon University

joint work with Byron Boots, Sajid Siddiqi, Le Song, Alex Smola


What’s out there?

. . . ot-2 ot-1 ot ot+1 ot+2 . . .

video frames, as vectors of pixel intensities



Given past observations from a partially observable system

Predict future observations

A dynamical system

. . . ot-2 ot-1 ot ot+1 ot+2 . . .

← Past | Future →

State



This talk

• General class of models for dynamical systems

• Fast, statistically consistent learning algorithm

‣ no local optima

• Includes many well-known models & algorithms as special cases

• Also includes new models & algorithms that give state-of-the-art performance on interesting tasks


Includes models

• hidden Markov model

‣ n-grams, regexes, k-order HMMs

• PSR, OOM, multiplicity automaton, RR-HMM

• LTI system

‣ Kalman filter, AR, ARMA

• Kernel versions and manifold versions of above


Includes algorithms

• Subspace identification for LTI systems

‣ recent extensions to HMMs, RR-HMMs

• Tomasi-Kanade structure from motion

• Principal components analysis (e.g., eigenfaces); Laplacian eigenmaps

• New algorithms for learning PSRs, OOMs, etc.

• A new way to reduce noise in manifold learning

• A new range-only SLAM algorithm


Interesting applications

• Structure from motion

• Simultaneous localization and mapping

‣ Range-only SLAM

‣ “SLAM” from inertial sensors

‣ very simple vision-based SLAM (so far)

• Video textures

• Opponent modeling, option pricing, audio event classification, …


Bayes filter (HMM, Kalman, etc.)

• Goal: given ot, update Pt-1(st-1) to Pt(st)

• Extend: Pt-1(st-1, ot, st) = Pt-1(st-1) P(st | st-1) P(ot | st)

• Marginalize: Pt-1(ot, st) = ∫ Pt-1(st-1, ot, st) dst-1

• Condition: Pt(st) = Pt-1(ot, st) / Pt-1(ot)

. . . ot-2 ot-1 ot ot+1 ot+2 . . .
. . . st-2 st-1 st st+1 st+2 . . .

Belief Pt(st) = P(st | o1:t)
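For concreteness, here is a minimal sketch (not from the slides) of this extend/marginalize/condition update for a discrete HMM, assuming a transition matrix T[i, j] = P(st = i | st-1 = j) and an observation matrix O[k, i] = P(ot = k | st = i):

import numpy as np

def bayes_filter_step(belief, obs, T, O):
    """One extend/marginalize/condition update for a discrete HMM.

    belief: P(s_{t-1} | o_{1:t-1}), shape (n_states,)
    obs:    index of the observed symbol o_t
    T:      T[i, j] = P(s_t = i | s_{t-1} = j)
    O:      O[k, i] = P(o_t = k | s_t = i)
    """
    extended = T * belief             # extend: entry [i, j] = P(s_{t-1} = j) P(s_t = i | s_{t-1} = j)
    marginal = extended.sum(axis=1)   # marginalize out s_{t-1}
    joint = O[obs] * marginal         # multiply in P(o_t | s_t)
    return joint / joint.sum()        # condition: P(s_t | o_{1:t})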



A common form for Bayes filters

• Find covariances as (linear) fns of previous state

‣ Σo(st-1) = E(φ(ot) φ(ot)T | st-1)

‣ Σso(st-1) = E(st φ(ot)T | st-1)

‣ nb: uncentered covars; st & φ(ot) include constant

• Linear regression to get current state

‣ st = Σso Σo^-1 φ(ot)

• Exact if discrete (HMM, PSR), Gaussian (Kalman, AR), RKHS w/ characteristic kernel [Fukumizu et al.]

• Approximates many more
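A minimal sketch of this filtering step, assuming the learned linear maps from previous state to covariance are stored as hypothetical tensors C_o and C_so (my own names, not the slides'):

import numpy as np

def common_form_step(state, obs_feat, C_o, C_so):
    """st = Sigma_so(st-1) Sigma_o(st-1)^-1 phi(ot), with both covariances linear in st-1.

    state:    s_{t-1}, shape (d_s,)
    obs_feat: phi(o_t), shape (d_o,)
    C_o:      shape (d_o, d_o, d_s), so Sigma_o(s)  = C_o  . s
    C_so:     shape (d_s, d_o, d_s), so Sigma_so(s) = C_so . s
    """
    Sigma_o = np.tensordot(C_o, state, axes=([2], [0]))    # E[phi(o) phi(o)^T | s_{t-1}]
    Sigma_so = np.tensordot(C_so, state, axes=([2], [0]))  # E[s_t phi(o)^T | s_{t-1}]
    return Sigma_so @ np.linalg.solve(Sigma_o, obs_feat)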


Why this form?

• If Bayes filter takes (approximately) the above form, can design a simple spectral algorithm to identify the system

• Intuitions:

‣ predictive state

‣ rank bottleneck

‣ observable representation


Predictive state

• If Bayes filter takes above form, we can use a vector of predictions of observables as our state

‣ E(φ(ot+k) | st) = linear fn of st

‣ for big enough k, E([φ(ot+1) … φ(ot+k)] | st) = invertible linear fn of st

‣ so, take E([φ(ot+1) … φ(ot+k)] | st) as our state


Predictive state: example


Predictive state: minimal example

• For big enough k, E([φ(ot+1) … φ(ot+k)] | st) = W st, an invertible linear fn (if system observable)

P(st | st-1) = T =
[ 2/3   0   1/4 ]
[ 1/3  2/3   0  ]
[  0   1/3  3/4 ]

P(ot | st-1) = O =
[ 1/2   1    0  ]
[ 1/2   0    1  ]

O T =
[ 2/3  2/3  1/8 ]
[ 1/3  1/3  7/8 ]
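These numbers can be checked directly. One plausible reading of the slide's W (not spelled out in the extracted text) is the stack of the one- and two-step-ahead prediction maps O and OT; that stack has full column rank, so the expected future observations determine st:

import numpy as np

# T[i, j] = P(s_t = i | s_{t-1} = j),  O[a, j] = P(o_t = a | s_{t-1} = j)
T = np.array([[2/3, 0.0, 1/4],
              [1/3, 2/3, 0.0],
              [0.0, 1/3, 3/4]])
O = np.array([[1/2, 1.0, 0.0],
              [1/2, 0.0, 1.0]])

print(O @ T)                        # [[2/3, 2/3, 1/8], [1/3, 1/3, 7/8]], the OT matrix above

W = np.vstack([O, O @ T])           # predictions one and two steps ahead, stacked
print(np.linalg.matrix_rank(W))     # 3 = number of states, so W has a left inverse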


Predictive state: summary

• E([φ(ot+1) … φ(ot+k)] | st) is our state rep’n

‣ interpretable

‣ observable—a natural target for learning


• To find st, predict φ(ot+1) … φ(ot+k) from o1:t


Rank bottleneck

(diagram: data about past (many samples) → compress → state (bottleneck) → expand / predict → data about future (many samples))

Can find best rank-k bottleneck via matrix factorization ⇒ spectral method


Finding a compact predictive state

Yt ≈ W Xt: a linear function predicts Yt from Xt


rank(W) ≤ k
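A minimal sketch of the rank-k bottleneck as reduced-rank regression, under the assumed layout of one column per time step (X holding features of the past, Y features of the future; my own helper, not the authors' code):

import numpy as np

def rank_k_predictor(Y, X, k, reg=1e-6):
    """Best rank-k linear predictor of Y from X.

    X: (d_past, N) features of past windows
    Y: (d_future, N) features of future windows
    Returns U, a basis for the k-dimensional predictive state, and rank-k weights W.
    """
    # full ridge regression Y ~ W_full X
    W_full = Y @ X.T @ np.linalg.inv(X @ X.T + reg * np.eye(X.shape[0]))
    # project the fitted values onto their top-k left singular vectors
    U, _, _ = np.linalg.svd(W_full @ X, full_matrices=False)
    U = U[:, :k]
    W = U @ (U.T @ W_full)            # rank(W) <= k
    return U, W

# predictive state at time t: s_t = U.T @ (W @ x_t), a k-dimensional vector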


Observable representation

• Now that we have a compact predictive state, need to estimate fns Σo(st) and Σso(st)

• Insight: parameters are now observable: the problem is just to estimate some covariances from data

‣ to get Σo, regress (φ(o) ⊗ φ(o)) ← state

‣ to get Σso, regress (φ(o) ⊗ future+) ← state

. . . ot-2 ot-1 ot ot+1 ot+2 . . .
← Past | Future → | Future+ →
State


Algorithm

• Find compact predictive state: regress future ← past

‣ use any features of past/future, or a kernel, or even a learned kernel (manifold case)

‣ constrain rank of prediction weight matrix

• Extract model: regress

‣ (o ⊗ future+) ← [predictive state]

‣ (o ⊗ o) ← [predictive state]

‣ future+: use same rank-constrained basis as for predictive state
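Combining the two regressions gives a rough end-to-end sketch (discrete observations encoded as indicator vectors, ridge-regularized least squares; the variable names and the exact windowing are my own choices, not the authors' code):

import numpy as np

def learn_spectral_model(obs, n_obs, k, past=2, future=2, reg=1e-6):
    """Spectral learning sketch for a single discrete observation sequence.

    obs: 1-D array of observation indices; n_obs: alphabet size; k: latent dimension.
    """
    T_len = len(obs)
    ind = np.eye(n_obs)

    def window(t, length):            # stacked indicator features of obs[t : t+length]
        return np.concatenate([ind[obs[t + i]] for i in range(length)])

    ts = range(past, T_len - future - 1)
    X  = np.array([window(t - past, past) for t in ts]).T    # features of the past
    Y  = np.array([window(t, future) for t in ts]).T         # features of the future
    Yp = np.array([window(t + 1, future) for t in ts]).T     # shifted window ("future+")
    Ob = np.array([ind[obs[t]] for t in ts]).T               # current observation

    # 1. regress future <- past with a rank-k bottleneck to get the predictive state basis U
    W_full = Y @ X.T @ np.linalg.inv(X @ X.T + reg * np.eye(X.shape[0]))
    U = np.linalg.svd(W_full @ X, full_matrices=False)[0][:, :k]
    S = U.T @ W_full @ X                                     # predictive states, one column per t

    def regress(target):              # ridge regression: target <- predictive state
        return target @ S.T @ np.linalg.inv(S @ S.T + reg * np.eye(k))

    # 2. extract the model: (o x o) <- state and (o x future+) <- state,
    #    with future+ expressed in the same rank-constrained basis U
    C_oo = regress(np.einsum('it,jt->ijt', Ob, Ob).reshape(n_obs * n_obs, -1))
    C_of = regress(np.einsum('it,jt->ijt', Ob, U.T @ Yp).reshape(n_obs * k, -1))
    return C_oo, C_of, S[:, 0]                               # model params and a crude initial state

def filter_step(state, o, n_obs, k, C_oo, C_of):
    """One filtering update in the learned model: st = Sigma_so Sigma_o^-1 phi(ot)."""
    Sigma_o = (C_oo @ state).reshape(n_obs, n_obs)
    Sigma_so = (C_of @ state).reshape(n_obs, k).T
    return Sigma_so @ np.linalg.solve(Sigma_o, np.eye(n_obs)[o])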


Does it work?

• Simple HMM: 5 states, 7 observations

• N=300 thru N=900k

(figure: the HMM's Transitions and Observations matrices, and the data matrices XT, YT)


Does it work?

(figure: mean abs. error of P(o1, o2, o3), learned model vs. true, against amount of training data, log-log; 10^0 = predict uniform; legend: run 1, run 2, c/sqrt(n))


Discussion

• Impossibility—learning DFA in poly time = breaking crypto primitives [Kearns & Valiant 89]

‣ so, clearly, we can’t always be statistically efficient

‣ but, see McDonald [11], HKZ [09], us [09]: convergence depends on mixing rate

• Nonlinearity—Bayes filter update is highly nonlinear in state (matrix inverse), even though we use a linear regression to identify the model

‣ nonlinearity is essential for expressiveness


Range-only SLAM

• Robot measures distance from current position to fixed beacons (e.g., time of flight or signal strength)

‣ may also have odometry

• Goal: recover robot path, landmark locations


Typical solution: EKF

• Problem: linear/Gaussian approximation very bad

(figure: Bayes filter vs. EKF; prior, and posterior after 1 range measurement to a known landmark)


Spectral solution (simple version)

The matrix of squared ranges d²nt (landmark n, time step t; the slide highlights the entry for landmark 1, time step 2) factors as a rank-4 product:

(1/2) d²nt = [ 1   xn   yn   (xn² + yn²)/2 ] · [ (ut² + vt²)/2   −ut   −vt   1 ]ᵀ

so one half of the N × T matrix of squared ranges equals A B, where row n of A depends only on landmark n's position (xn, yn) and column t of B depends only on the robot's position (ut, vt) at time t.
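A minimal synthetic sketch of the factorization (my own toy setup, not the paper's experiments): build the squared-range matrix from random positions, verify the rank-4 factorization, and recover the factors up to a 4 × 4 linear transform with an SVD.

import numpy as np

rng = np.random.default_rng(0)
N, T = 6, 500                                              # landmarks, time steps
land = rng.uniform(-10, 10, size=(N, 2))                   # landmark positions (x, y)
path = np.cumsum(rng.normal(0, 0.5, size=(T, 2)), axis=0)  # robot positions (u, v)

# squared ranges: d2[n, t] = (x_n - u_t)^2 + (y_n - v_t)^2
d2 = ((land[:, None, :] - path[None, :, :]) ** 2).sum(-1)

# one half of d2 factors exactly as A @ B, rank 4
A = np.column_stack([np.ones(N), land, 0.5 * (land ** 2).sum(axis=1)])
B = np.vstack([0.5 * (path ** 2).sum(axis=1), -path[:, 0], -path[:, 1], np.ones(T)])
assert np.allclose(0.5 * d2, A @ B)

# an SVD recovers the two factors up to an unknown 4x4 linear transform;
# the "simple version" pins the transform down using known coordinates of a few landmarks
U, s, Vt = np.linalg.svd(0.5 * d2, full_matrices=False)
A_hat, B_hat = U[:, :4] * s[:4], Vt[:4]
assert np.allclose(A_hat @ B_hat, 0.5 * d2)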


Experiments (full version)


C.B. Plaza 2

Fig. 4. The autonomous lawn mower and spectral SLAM. A.) The robotic lawn mower platform. B.) In the first experiment, the robot traveled 1.9kmreceiving 3,529 range measurements. This path minimizes the effect of heading error by balancing the number of left turns with an equal number of rightturns (a commonly used pattern for lawn mowing). Spectral SLAM recovers the robot’s path and landmark positions accurately. C.) In the second experiment,the robot traveled 1.3km receiving 1,816 range measurements. This path highlights the effect of heading error on dead reckoning performance by turning inthe same direction repeatedly. Again, spectral SLAM is able to recover the path and landmarks accurately.


(panels: autonomous lawn mower; Plaza 1; Plaza 2. legend: true landmark, est. landmark; dead reckoning path, true path, est. path)


[Djugash & Singh, 2008]



V. Experimental Results (excerpt, condensed)

Synthetic experiments: the simulator randomly places 6 landmarks in a 2-D environment; a robot moves randomly for 500 time steps and receives a range reading to every landmark at each step, perturbed by Gaussian noise with variance equal to 1% of the range. Spectral SLAM accurately recovers both the landmark locations and the robot path, and over 1000 random environment/path pairs the estimated measurement model converges steadily to the true model as the number of range readings grows. The spectral learning methods are very fast, usually taking only a few seconds to run (Matlab, 2.66 GHz Intel Core i7, 8 GB RAM).

Autonomous lawn mower: the two freely available "Plaza" range-only SLAM datasets were collected from an autonomous lawn mowing robot (http://www.frc.ri.cmu.edu/projects/emergencyresponse/RangeData/index.html). Ultra-wide-band radio nodes provide time-of-flight ranging between the robot and stationary landmark nodes; odometry (dead reckoning) comes from an onboard fiberoptic gyro and wheel encoders, and ground truth is GPS-based (within about 2 cm). The range data are very sparse, with approximately 11 (and up to 500) time steps between readings for the worst-case landmark, so missing readings are first interpolated and then rank-7 spectral SLAM is applied. Qualitatively, the estimated path conforms to the true path.

Comparison of range-only SLAM algorithms, localization RMSE (best in column boldfaced in the original):

Method                          Plaza 1   Plaza 2
Dead Reckoning (full path)      15.92 m   27.28 m
Cartesian EKF (last 10%)         0.94 m    0.92 m
Batch Optimization (last 10%)    0.68 m    0.96 m
FastSLAM (last 10%)              0.73 m    1.14 m
ROP EKF (last 10%)               0.65 m    0.87 m
Spectral SLAM (worst 10%)        1.17 m    0.51 m
Spectral SLAM (best 10%)         0.28 m    0.22 m
Spectral SLAM (full path)        0.39 m    0.35 m

Previous results report RMSE only for the last 10% of the path, which is generally the best 10% (it gives the most time to recover from initialization problems); full-path error can be considerably worse. Spectral SLAM beats the best previous methods on the Plaza datasets, often by a significant margin.


Structure from motion

[Tomasi & Kanade, 1992]


Video textures

• Kalman filter works for some video textures

‣ steam grate example above

‣ fountain:


observation = raw pixels (vector of reals over time)


Video textures, redux

Kalman Filter PSR


both models: 10 latent dimensions

Original


Learning in the loop: option pricing

• Price a financial derivative: "psychic call"

‣ holder gets to say "I bought call 100 days ago"

‣ underlying stock follows Black-Scholes (unknown parameters)

(figure: stock price path, with pt and pt–100 marked)


Option pricing

• Solution [Van Roy et al.]: use policy iteration

‣ 16 hand-picked features (e.g., poly ⋅ history)

‣ initialize policy arbitrarily

‣ least-squares temporal differences (LSTD) to estimate value function

‣ policy := greedy; repeat


Option pricing

• Better solution: spectral SSID inside policy iteration

‣ 16 original features from Van Roy et al.

‣ 224 additional "low-originality" features

‣ e.g., linear fns of price history of underlying

‣ SSID picks best 16-d dynamics to explain feature evolution

‣ solve for value function in closed form


Policy iteration w/ spectral learning

(figure, panels A-D: value function estimates (Value vs. State) for LSTD, PSTD, LARS-TD, and expected payoff per $ of underlying vs. iterations of policy iteration)

Figure 1: Experimental Results. Error bars indicate standard error. (A.) Estimating the value function with a small number of informative features. All three approaches do well. (B.) Estimating the value function with a small set of informative features and a large set of random features. LARS-TD is designed for this scenario and dramatically outperforms PSTD and LSTD. (C.) Estimating the value function with a large set of semi-informative features. PSTD is able to determine a small set of compressed features that retain the maximal amount of information about the value function, outperforming LSTD and LARS-TD. (D.) Pricing a high-dimensional derivative via policy iteration. At each iteration the current policy was executed on 10,000 stock price trajectories sampled from the stationary distribution. The y-axis is expected reward for the current policy executed on the test set at each iteration. The optimal threshold strategy (sell if price is above a threshold [23]) is in black, LSTD (16 canonical features) is in blue, LSTD (on the full 240 features) is cyan, LARS-TD (feature selection from set of 240) is in green, and PSTD (16 dimensions, compressing 240 features (16 + 224)) is in red. PSTD outperforms the competing approaches.

We consider the financial derivative introduced by Tsitsiklis and Van Roy [23]. The derivative generates payoffs that are contingent on the prices of a single stock. At the end of a given day, the holder may opt to exercise. At the time of exercise the contract is terminated, and a payoff is received in an amount equal to the current price of the stock divided by the price 100 days beforehand. The stock price is modeled as a geometric Brownian motion with volatility σ = 0.02, and there is a continuously compounded short term interest rate ρ = 0.0004 (corresponding to an annual interest rate of ~10%). In more detail, if wt is a standard Brownian motion, then the stock price pt evolves as dpt = ρ pt dt + σ pt dwt, and we can summarize the relevant state at the end of each day as a discrete-time process {xt | t = 0, 1, ...} ∈ R^100 with xt = [pt−99/pt−100, pt−98/pt−100, …, pt/pt−100]^T. The ith dimension xt(i) represents the amount a $1 investment in the stock at time t−100 would grow to at time t−100+i. This process is Markov and ergodic [23, 24]: xt and xt+100 are independent and identically distributed. The immediate reward for exercising is G(x) = x(100), the immediate reward for continuing to hold is 0, and the discount factor γ = e^−ρ is determined by the interest rate. The value of the derivative security is given by sup_t E[γ^t G(xt)]. Our goal is to calculate an approximate value function, and then use this value function to generate a stopping time t* = min{t | G(xt) ≥ V*(xt)}. To do so, we sample a sequence of 1,000,000 states xt ∈ R^100 and calculate features φH of each state. We then perform policy iteration on this sample, alternately estimating the value function under a given policy and then using this value function to define a new greedy policy "stop if G(xt) ≥ w^T φH(xt)."

Within the above strategy, we have two main choices: which features to use, and how to estimate the value function in terms of these features. For value function estimation, we used LSTD, LARS-TD, or PSTD. In each case we re-used our 1,000,000-state sample trajectory for all iterations: we start at the beginning and follow the trajectory as long as the policy chooses the "continue" action, with reward 0 at each step. When the policy executes the "stop" action, the reward is G(x) and the next state's features are all 0; we then restart the policy 100 steps in the future, after the process has fully mixed. For feature selection, we are fortunate: previous researchers have hand-selected a "good" set of 16 features for this data set through repeated trial and error (see Appendix, Section D and [23, 24]). We greatly expand this set of features, then use PSTD to synthesize a small set of high-quality combined features and calculate the value function more accurately. Specifically, we add the entire 100-step sample trajectory, the squares of the sample trajectory, and several additional nonlinear features, increasing the total number of features from 16 to 240. We use histories of length 1, tests of length 3, and (for comparison's sake) we choose a linear dimension of 16.

(x-axis of panel D: iterations of PI)

0.82¢/$ better than best competitor; 1.33¢/$ better than best previously published

data: 1,000,000 time steps
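A small simulation sketch of this setup (my own code, covering only the environment and the baseline threshold strategy shown in the figure, not the PSTD learner):

import numpy as np

rng = np.random.default_rng(0)
sigma, rho = 0.02, 0.0004              # daily volatility and short-term interest rate
gamma = np.exp(-rho)                   # per-day discount factor
T = 100_000

# daily prices under geometric Brownian motion: p_{t+1} = p_t * exp(rho - sigma^2/2 + sigma * eps)
p = np.exp(np.cumsum(rho - 0.5 * sigma**2 + sigma * rng.normal(size=T)))

def exercise_value(t):
    """Payoff for exercising at day t: today's price divided by the price 100 days ago."""
    return p[t] / p[t - 100]

def threshold_strategy(threshold, t0=100):
    """Discounted payoff of 'exercise as soon as the 100-day ratio reaches the threshold'."""
    for t in range(t0, T):
        if exercise_value(t) >= threshold:
            return gamma ** (t - t0) * exercise_value(t)
    return 0.0

print(threshold_strategy(1.02))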


Hilbert Space Embeddings of Hidden Markov Models

(Figure 4B axes: prediction horizon vs. avg. prediction error (x 10^6); legend: Embedded, RR-HMM, LDS, HMM, Mean, Last)

Figure 4. Slot car inertial measurement data. (A) The slot car platform and the IMU (top) and the racetrack (bottom). (B) Squared error for prediction with different estimated models and baselines.

The IMU data were collected while the slot car circled the track controlled by a constant policy. The goal of this experiment was to learn a model of the noisy IMU data and, after filtering, to predict future IMU readings.

We trained a 20-dimensional embedded HMM using Algorithm 1 with sequences of 150 consecutive observations (Section 3.8). The bandwidth parameter of the Gaussian RBF kernels is set with the 'median trick'. The regularization parameter λ is set to 10^-4. For comparison, a 20-dimensional RR-HMM with Parzen windows is learned, also with sequences of 150 observations; a 20-dimensional LDS is learned using Subspace ID with Hankel matrices of 150 time steps; and finally, a 20-state discrete HMM (with 400 levels of discretization for observations) is learned using the EM algorithm run until convergence.

For each model, we performed filtering for different extents t1 = 100, 101, ..., 250, then predicted an image which was a further t2 steps in the future, for t2 = 1, 2, ..., 100. The squared error of this prediction in the IMU's measurement space was recorded and averaged over all the different filtering extents t1 to obtain the means plotted in Figure 4(B). Again the embedded HMM learned by the kernel spectral algorithm yields lower prediction error compared to each of the alternatives, consistently for the duration of the prediction horizon.

4.3. Audio Event Classification

Our final experiment concerns an audio classification task. The data, recently presented in (Ramos et al., 2010), consisted of sequences of 13-dimensional Mel-Frequency Cepstral Coefficients (MFCC) obtained from short clips of raw audio data recorded using a portable sensor device. Six classes of labeled audio clips were present in the data, one being Human Speech. For this experiment we grouped the other five classes into a single class of Non-human sounds to formulate a binary Human vs. Non-human classification task. Since the original data had a disproportionately large amount of Human Speech samples, this grouping resulted in a more balanced dataset with 40 minutes 11 seconds of Human and 28 minutes 43 seconds of Non-human audio data. To reduce noise and training time we averaged the data every 100 timesteps (corresponding to 1 second) and downsampled.

[Figure 5 plot: Accuracy (%) vs. Latent Space Dimensionality (10–50); curves: HMM, LDS, RR-HMM, Embedded.]

Figure 5. Accuracies and 95% confidence intervals for Human vs. Non-human audio event classification, comparing embedded HMMs to other common sequential models at different latent state space sizes.
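One plausible reading of the preprocessing step above ("averaged the data every 100 timesteps … and downsampled") is simple block averaging of the MFCC frames; a minimal sketch under that assumption:

```python
import numpy as np

def block_average(mfcc, block=100):
    """Average MFCC frames in non-overlapping blocks of `block` timesteps
    (about one second here), dropping any leftover frames at the end."""
    T = (mfcc.shape[0] // block) * block
    return mfcc[:T].reshape(-1, block, mfcc.shape[1]).mean(axis=1)
```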

For each of the two classes, we trained embedded HMMs with 10, 20, …, 50 latent dimensions using spectral learning and Gaussian RBF kernels with bandwidth set with the 'median trick'. The regularization parameter λ is set at 10⁻¹. For efficiency we used random features for approximating the kernel (Rahimi & Recht, 2008). For comparison, regular HMMs with axis-aligned Gaussian observation models, LDSs, and RR-HMMs were trained using multi-restart EM (to avoid local minima), stable Subspace ID, and the spectral algorithm of (Siddiqi et al., 2009) respectively, also with 10, …, 50 latent dimensions or states.
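The random-feature approximation of Rahimi & Recht replaces the Gaussian RBF kernel with an explicit, finite-dimensional feature map whose inner products approximate the kernel. A minimal sketch (the feature dimension `D` and other names are illustrative choices, not values from the paper):

```python
import numpy as np

def random_fourier_features(X, sigma, D=500, seed=0):
    """Random Fourier features approximating the Gaussian RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2))  (Rahimi & Recht, 2008).
    With Z = random_fourier_features(X, sigma), Z @ Z.T approximates
    the kernel matrix on the rows of X."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))  # frequencies ~ N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```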

For RR-HMMs, regular HMMs, and LDSs, the class-conditional data sequence likelihood is the scoring function for classification. For embedded HMMs, the scoring function for a test sequence x1:t is the log of the product of the compatibility scores for each observation, i.e.

$\sum_{\tau=1}^{t} \log \left\langle \varphi(x_\tau),\, \mu_{X_\tau \mid x_{1:\tau-1}} \right\rangle_{\mathcal{F}}.$
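In a finite-sample implementation the predictive embedding µ_{X_τ|x_{1:τ−1}} is represented as a weighted combination of feature maps of training observations, so each compatibility score reduces to a weighted sum of kernel evaluations. A hedged sketch of the scoring function, where `model.initial_weights`, `model.train_obs`, and `model.update` are hypothetical placeholders for the embedded HMM's filtering recursion rather than names from the paper:

```python
import numpy as np

def sequence_score(x_seq, model, kernel):
    """Classification score for a test sequence under an embedded HMM (sketch).

    The predictive state at each step is represented by weights `alpha` over
    training observations `model.train_obs`, so the RKHS inner product
    <phi(x), mu_{X_t | x_{1:t-1}}> becomes a weighted sum of kernel values."""
    alpha = model.initial_weights
    total = 0.0
    for x in x_seq:
        k_vec = kernel(model.train_obs, x)      # k(z_i, x) for each training point z_i
        compat = float(alpha @ k_vec)           # compatibility score at this step
        total += np.log(max(compat, 1e-300))    # guard against non-positive values
        alpha = model.update(alpha, x)          # condition on x for the next step
    return total
```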

For each model size, we performed 50 random 2:1 partitions of the data from each class and used the resulting datasets for training and testing respectively. The mean accuracy and 95% confidence intervals over these 50 randomizations are reported in Figure 5. The graph indicates that embedded HMMs have higher accuracy and lower variance than the other standard alternatives at every model size. Though other learning algorithms for HMMs and LDSs exist, our experiment shows this to be a non-trivial sequence classification problem where embedded HMMs significantly outperform commonly used sequential models trained using typical learning and model selection methods.

5. Conclusion

We proposed a Hilbert space embedding of HMMs that extends traditional HMMs to structured and non-Gaussian continuous observation distributions.


Slot-car IMU data

[Plot: slot-car IMU readings over roughly 150 time steps.]

[Ko & Fox, 09; Boots et al., 10]

Page 53: IDEAL @ UT - Spectral learning algorithms for …Geoff Gordon—UT Austin—Feb, 2012 Given past observations from a partially observable system Predict future observations A dynamical

Geoff Gordon—UT Austin—Feb, 2012

Kalman < HSE-HMM < manifold

34

Manuscript under review by AISTATS 2012

[Figure 3 plot: RMS Error vs. Prediction Horizon (0–100); curves: Mean Obs., Previous Obs., HMM (EM), Kalman Filter (Linear), HSE-HMM (Gaussian RBF), Manifold-HMM (Laplacian Eigenmap); panel (B) labeled 'Car Locations Predicted by Different State Spaces'; insets labeled Slot Car, IMU, Racetrack.]

Figure 3: Slot car with inertial measurement unit (IMU). (A) The slot car platform: the car and IMU (top) and racetrack (bottom). (B) A comparison of training data embedded into the state space of three different learned models. The red line indicates the true 2-D position of the car over time; blue lines indicate the prediction from state space. The top graph shows the Kalman filter state space (linear kernel), the middle graph shows the HSE-HMM state space (Gaussian RBF kernel), and the bottom graph shows the manifold HSE-HMM state space (LE kernel). The LE kernel finds the best representation of the true manifold. (C) Root mean squared error for prediction (averaged over 250 trials) with different estimated models. The HSE-HMM significantly outperforms the other learned models by taking advantage of the fact that the data we want to predict lies on a manifold.

…structing a manifold, none of these papers compares the predictive accuracy of its model to state-of-the-art dynamical system identification algorithms.

7 Conclusion

In this paper we propose a class of problems called two-manifold problems, where two sets of corresponding data points, generated by a single latent manifold, lie on or near two different higher-dimensional manifolds. We design algorithms by relating two-manifold problems to cross-covariance operators in RKHS, and show that these algorithms result in a significant improvement over standard manifold learning approaches in the presence of noise. This is an appealing result: manifold learning algorithms typically assume that observations are (close to) noiseless, an assumption that is rarely satisfied in practice.

Furthermore, we demonstrate the utility of two-manifold problems by extending a recent dynamical system identification algorithm to learn a system with a state space that lies on a manifold. The resulting algorithm learns a model that outperforms the current state of the art in terms of predictive accuracy. To our knowledge this is the first combination of system identification and manifold learning that accurately identifies a latent time series manifold and is competitive with the best system identification algorithms at learning accurate models.


RMS prediction error (XY position) vs. horizon; comparison of kernels

Page 54: IDEAL @ UT - Spectral learning algorithms for …Geoff Gordon—UT Austin—Feb, 2012 Given past observations from a partially observable system Predict future observations A dynamical

Geoff Gordon—MURI review—Nov, 2011

Algorithm intuition

• Use separate one-manifold learners to estimate structure in past, future

‣ result: noisy Gram matrices GX, GY

‣ “signal” is correlated between GX, GY

‣ “noise” is independent

• Look at eigensystem of GXGY

‣ suppresses noise, leaves signal
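As a rough sketch of this intuition (not the paper's exact algorithm): center the two noisy Gram matrices, take the top singular directions of their product, and read off a low-dimensional embedding in which the view-independent noise has largely cancelled.

```python
import numpy as np

def two_manifold_embedding(G_x, G_y, k=2):
    """Denoised low-dimensional embedding from two noisy Gram matrices (sketch).

    G_x, G_y: (n, n) Gram matrices built by separate one-manifold learners
    on past and future windows of the same n time steps. Noise is assumed
    independent across the two views, so it largely cancels in the product,
    while shared 'signal' directions are preserved."""
    n = G_x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    Gx, Gy = H @ G_x @ H, H @ G_y @ H
    U, s, _ = np.linalg.svd(Gx @ Gy)            # eigensystem of the product
    return U[:, :k] * np.sqrt(s[:k])            # k-dim coordinates for each time step
```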

35

Page 55: IDEAL @ UT - Spectral learning algorithms for …Geoff Gordon—UT Austin—Feb, 2012 Given past observations from a partially observable system Predict future observations A dynamical

Geoff Gordon—UT Austin—Feb, 2012

Summary

• General class of models for dynamical systems

• Fast, statistically consistent learning method

• Includes many well-known models & algorithms as special cases

‣ HMM, Kalman filter, n-gram, PSR, kernel versions

‣ SfM, subspace ID, Kalman video textures

• Also includes new models & algorithms that give state-of-the-art performance on interesting tasks

‣ range-only SLAM, PSR video textures, HSE-HMM

36

Page 56: IDEAL @ UT - Spectral learning algorithms for …Geoff Gordon—UT Austin—Feb, 2012 Given past observations from a partially observable system Predict future observations A dynamical

Geoff Gordon—UT Austin—Feb, 2012

Papers

• B. Boots and G. An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems. AAAI, 2011.

• B. Boots and G. Predictive state temporal difference learning. NIPS, 2010.

• B. Boots, S. M. Siddiqi, and G. Closing the learning-planning loop with predictive state representations. RSS, 2010.

• L. Song, B. Boots, S. M. Siddiqi, G., and A. J. Smola. Hilbert space embeddings of hidden Markov models. ICML, 2010. (Best paper)

37