© 2004 by Kevin Michael Squire. All rights reserved.
HMM-BASED SEMANTIC LEARNING FOR A MOBILE ROBOT
BY
KEVIN MICHAEL SQUIRE
B.S., Case Western Reserve University, 1995
M.S., University of Illinois at Urbana-Champaign, 1998
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2004
Urbana, Illinois
ABSTRACT
We are developing an intelligent robot and attempting to teach it language. While there are many
aspects of this research, for the purposes of this dissertation the most important are the following
ideas. Language is primarily based on semantics, not syntax, yet syntax is the current focus of
speech recognition research. To truly learn meaning, a language engine cannot simply be a computer
program running on a desktop computer analyzing speech. It must be part of a more general,
embodied intelligent system, one capable of using associative learning to form concepts from the
perception of experiences in the world, and further capable of manipulating those concepts symboli-
cally. This dissertation explores the use of hidden Markov models (HMMs) in this capacity. HMMs
are capable of automatically learning and extracting the underlying structure of continuous-valued
inputs and representing that structure in the states of the model. These states can then be treated
as symbolic representations of the inputs. We show how a model consisting of a cascade of HMMs
can be embedded in a small mobile robot and used to learn correlations among sensory inputs to
create symbolic concepts, which can eventually be manipulated linguistically and used for decision
making.
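The abstract's central idea, that an HMM can map continuous-valued input onto discrete states which then serve as symbols, can be illustrated with a toy sketch. This is not the dissertation's cascade model: the two-state Gaussian HMM below, its parameters, and the use of Viterbi decoding are all made up for illustration only.

```python
import numpy as np

# A toy 2-state HMM with one-dimensional Gaussian observation densities.
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])      # transition probabilities a_ij
pi = np.array([0.5, 0.5])       # initial state probabilities
mu = np.array([0.0, 5.0])       # per-state observation means
sigma = np.array([1.0, 1.0])    # per-state standard deviations

def log_b(y):
    """Log observation densities b_j(y) for each state j."""
    return -0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def viterbi(ys):
    """Most likely state sequence for the observation sequence ys."""
    delta = np.log(pi) + log_b(ys[0])
    back = []
    for y in ys[1:]:
        scores = delta[:, None] + np.log(A)   # scores[i, j]: arrive in j from i
        back.append(scores.argmax(axis=0))    # best predecessor of each state j
        delta = scores.max(axis=0) + log_b(y)
    states = [int(delta.argmax())]
    for bp in reversed(back):                 # trace best path backward
        states.append(int(bp[states[-1]]))
    return states[::-1]

# Continuous observations become a discrete symbol sequence (the state labels).
obs = np.array([0.1, -0.3, 0.2, 5.2, 4.8, 5.1, 0.0])
symbols = viterbi(obs)
print(symbols)  # -> [0, 0, 0, 1, 1, 1, 0]
```

The decoded state sequence is the "symbolic representation" of the continuous input; the cascade model in the dissertation feeds such state estimates from lower-level HMMs upward as observations of a higher-level HMM.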
To my parents.
ACKNOWLEDGMENTS
First and foremost, I would like to thank my adviser, Dr. Stephen Levinson, for providing an extremely
ambitious and stimulating project. Steve has an amazingly broad perspective on our research, and
my own views and understanding have noticeably broadened under his tutelage. He has also had
seemingly unfailing belief in me and my work, even during times of difficulty, which I greatly
appreciate. I have gained a profound respect for him and his ideas and opinions, and I am deeply
grateful for having had the opportunity to work under him.
I would like to thank my committee members, Dr. Seth Hutchinson, Dr. Thomas Huang,
Dr. Mark Hasegawa-Johnson, and Dr. Patrick Xavier, for their questions, suggestions, and sup-
port during my research. In particular, Seth expressed interest early on in participating in my
research process, and has asked some of the deepest and most interesting questions regarding the
research; Tom has pushed me to search for ways to more broadly apply my research and the re-
search of the project; Mark has been very interested in and supportive of some of the more technical
aspects of my work; and Patrick has, from a distance, offered frequent advice and taken the time to
fly in for my defense. For all of these interactions, I am very appreciative.
For the month before my defense, Ruei-Sung Lin and Matthew McClain were amazingly sup-
portive of the technical aspects of this project, pulling very long nights with me and writing and
changing code to fit my specifications. Without their help, a final demonstration of my work would
not have been possible, and I thank them deeply.
I would like to thank Matthew Kleffner, Dr. Danfeng Li, Dr. Weiyu Zhu, and Dr. Qiong Liu,
whose technical contributions have helped form the foundation of our project, upon which my work
is built. I would additionally like to thank Matt for our many stimulating discussions.
Throughout my graduate studies, Dr. Rajiv Maheswaran and Dr. Sarunya “Noke” Hemjinda
have both listened intently when I have needed to talk, whether about technical aspects of my
research or about real or mundane issues of life. Thanks to both for being really amazing friends.
I would like to thank the other members of the Beckman Institute Robotics Laboratory, for their
warm welcome and aid to our group when we joined their lab earlier this year. I would especially
like to thank James Davidson and Dr. Fred Rothganger for some very stimulating conversations
and for enthusiastic support of our project.
My participation in the artificial neural networks and computational brain theory (ANNCBT)
seminar has been one of the most interesting and intellectually stimulating experiences of my PhD,
and has been a strong guide for my research. I thank the members of that group for some very
interesting discussions, especially Samarth Swarup and Dr. Thomas Anastasio.
I would like to thank Dr. Donna Brown for her strong support and help while I was working
on my master’s degree. Without her support and encouragement, I would not have gone on for my
PhD.
While at the Beckman Institute, Mike Smith has been amazingly helpful to me and our group,
especially in helping to organize our lab and offices, and with setting up open house demonstrations.
I thank him for all he has done over the years.
I am deeply indebted to Dominic Frigon, Hala Jawlakh, Dr. Saptarshi Bandyopadhyay, Kwanrawee
“Joy” Sirikanchana, Dr. Consuelo Waight, Ankur Garg, and Sarah Miller, for their enthusiastic
support, for interesting and helpful discussions about my work and about life, and for their close
friendship.
For helping keep me healthy and nourished, I would like to thank the members of the Friday
dinner gang—Anand Selvaraj, Chetan Pahlajani, Zaki Mohammed, Shivi Bansal, Carrie Owen,
Natasha Kipp, Hala and Dom, Siddhartha Raja, Deepti Samant, and Apurva Chitnis.
For keeping me sane, I would like to thank the past and present salseros and salseras of Urbana-
Champaign for giving me the chance to work off some frustration and energy, especially Rajiv,
Consuelo, Sarah, Joy, Ruben Aveledo, Julie Baterna, Lyre Murao, and the Regent Ballroom.
Last but not least, I would like to thank my father, Craig Squire, for his love and support, and
especially for slogging through early drafts of this dissertation, some of which probably seemed
quite foreign and unintelligible, and I would like to thank my mother for her constant prayers and
love, without which this process would have been much, much harder.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF SYMBOLS

CHAPTER 1 INTRODUCTION
1.1 Background and Motivation
1.2 Developmental Robotics
1.3 A Robotic System for Language Acquisition
1.3.1 Somatic system
1.3.2 Noetic system
1.4 Semantic Learning
1.4.1 Introduction
1.4.2 General associative memory model for semantic learning
1.5 Contributions and Layout of Dissertation

CHAPTER 2 HIDDEN MARKOV MODELS AND THE RMLE ALGORITHM
2.1 Introduction
2.2 Model Description and Notation
2.3 Recursive Maximum-Likelihood Estimation of HMM Parameters
2.3.1 RMLE derivation
2.3.2 Convergence
2.3.3 Model averaging and tracking
2.3.4 Numerical simulations
2.3.5 Estimating a model with unknown model order
2.4 HMMs as Bayesian Classifiers
2.5 Discussion

CHAPTER 3 CASCADE OF HMMS: THEORY AND SIMULATION
3.1 Introduction
3.2 HMMs for Learning Structure
3.2.1 Unimodal structure
3.2.2 Multimodal structure
3.3 Cascade of HMMs
3.3.1 Model description
3.3.2 Recursive maximum-likelihood estimation for the cascade model
3.3.3 Numerical simulations
3.4 Discussion

CHAPTER 4 CASCADE OF HMMS AS AN ASSOCIATIVE MEMORY
4.1 Introduction
4.2 Associative Learning of Language Using Robots
4.3 Concept Learning Scenario
4.3.1 Model description
4.3.2 Model scenario
4.3.3 Simulation results
4.4 Robotic Experiments
4.4.1 Finite state machine controller
4.4.2 Sensory inputs
4.4.3 HMM cascade model setup
4.4.4 Issues
4.4.5 Results
4.4.6 Discussion
4.5 Conclusion

CHAPTER 5 CONCLUSION
5.1 Summary
5.2 Insights and Future Directions
5.2.1 Derivation of recursive maximum-likelihood estimation algorithms
5.2.2 Generative modeling
5.2.3 A language-learning robot
5.3 Final Words

APPENDIX A HARDWARE AND SYSTEM-LEVEL SOFTWARE SPECIFICATIONS
A.1 Introduction
A.2 Robots
A.2.1 Specifications
A.2.2 Configuration
A.3 Computers

APPENDIX B MOBILE ROBOT SOFTWARE
B.1 Introduction
B.2 Distributed Computing and Communication System
B.2.1 System design
B.2.2 Implementation
B.2.3 Discussion and future work
B.3 Speech Feature Extraction
B.3.1 Introduction
B.3.2 Background
B.3.3 Design and implementation
B.4 Visual Object Segmentation and Feature Extraction
B.4.1 Problem description
B.4.2 Pairwise Markov random fields
B.4.3 Local message passing algorithm
B.4.4 Image segmentation

APPENDIX C HIDDEN MARKOV MODEL ALGORITHMS
C.1 Introduction
C.2 Baum-Welch Algorithm
C.3 Viterbi-Based Algorithms

APPENDIX D HIDDEN SEMI-MARKOV MODELS AND THE RMLE ALGORITHM
D.1 Introduction
D.2 HSMM Model Description and Notation
D.3 RMLE for the HSMM

APPENDIX E RMLE DERIVATIONS
E.1 Proof that pn(y1, . . . , yn;ϕ) = ∏_{k=1}^{n} b(yk;ϕ)′uk(ϕ)
E.2 Proof that pn′(y1, . . . , yn′, τ1, . . . , τn′;ϕ) = ∏_{k=1}^{n′} d(τk)′B(yk|τk)uk
E.3 Specialized RMLE Formulas
E.3.1 Transition probabilities
E.3.2 Discrete observation probabilities
E.3.3 Gaussian observation likelihoods

APPENDIX F MATRIX CALCULUS
F.1 Introduction
F.2 Preliminaries
F.3 Derivation of (∂/∂X) aᵀX⁻¹a
F.4 Derivation of (∂/∂X) aᵀX⁻¹X⁻ᵀa

REFERENCES

VITA
LIST OF TABLES
Table
2.1 Simulation results for various combinations of learning rate ε and averaging history k.

3.1 Average classification accuracy for learned HMM ϕu over 50 simulation runs.

4.1 Average classification accuracy for learned HMM ϕc over 50 simulation runs.
4.2 List of words used in our robot demonstration.
4.3 Harvard phonetically balanced sentences.
4.4 Initial observation probabilities used by the concept HMM for visible objects.
4.5 Initial observation probabilities used by the concept HMM for words.
4.6 Trained transition probabilities for the concept HMM.
4.7 Trained observation probabilities used by the concept HMM for visible objects.
4.8 Trained observation probabilities used by the concept HMM for words.

A.1 Computing hardware mounted on robots.
A.2 Computer workstations.
LIST OF FIGURES
Figure
1.1 Cognitive Cycle.
1.2 Our robot Illy.
1.3 Expanded view of the cognitive cycle.
1.4 The concept of apple.
1.5 Visual/auditory concept hierarchy.
1.6 Associative learning of the word “apple.”

2.1 The effect of learning rate ε on parameter convergence during RMLE training, for constant ε.
2.2 The effect of ε0 and γ on parameter convergence during RMLE training, with an exponentially decreasing εn.
2.3 The effect of history size k on parameter averaging during RMLE training.
2.4 Examples of learning in HMMs with finite-alphabet observation densities.
2.5 Initialization of an HMM with two-dimensional Gaussian observation densities.
2.6 Learning an HMM using a model with a large number of states.

3.1 Semantic memory implemented using HMMs.
3.2 An HMM cascade model.
3.3 A dynamic Bayesian network (DBN) model showing the dependence among output and state variables assumed by our cascade HMM.
3.4 A switching HMM.
3.5 A cascaded switching HMM.
3.6 Monte Carlo simulation for learning a cascaded switching HMM ϕ using a cascade HMM ϕ̂.
3.7 Parameter learning for model ϕu.
3.8 Parameter learning for model ϕl1.
3.9 Training run output for model ϕl2.
3.10 State sequence comparison between generative HMM ϕu and learned HMM ϕ̂u.

4.1 Concept learning scenario using a cascade of HMMs.
4.2 Model topology for robot concept learning.
4.3 Parameter learning for model ϕc.
4.4 Training run output for model ϕv.
4.5 Parameter learning for model ϕa.
4.6 Objects used in our robot demonstration.
4.7 The robot’s finite state machine controller.
4.8 Auditory model used for speech recognition in our robot.
4.9 Parameter estimation for phonetic HMM ϕaud.
4.10 Equalized quantization.
4.11 Parameter learning for word model ϕword.
4.12 Parameter learning for model ϕvis.
4.13 Recognition of visual representations and concepts.
4.14 Recognition of auditory representations and concepts.
4.15 Recognition and learning using both auditory and visual information.
4.16 Illy learning about various objects.
4.17 Parameter learning for model ϕcon.

B.1 Audio ring buffer.
B.2 Audio ring buffer on multiple machines.
B.3 Block diagram describing audio feature generation.
LIST OF ABBREVIATIONS
AI artificial intelligence
CDF cumulative distribution function
CELL cross-channel early lexical learning
CHMM coupled hidden Markov model
DBN dynamic Bayesian network
EM expectation maximization
FSM finite state machine
GOFAI good old-fashioned artificial intelligence
HHMM hierarchical hidden Markov model
HMM hidden Markov model
HSMM hidden semi-Markov model
iid independent and identically distributed
JPDF joint probability density function
LAR log-area ratio
LP linear prediction
LPC linear prediction coefficient
MFCC mel-frequency cepstral coefficient
MLE maximum-likelihood estimation
ODE ordinary differential equation
RC reflection coefficient
RCLSE recursive conditioned least-squares estimation
RMLE recursive maximum-likelihood estimation
pdf probability density function
VCS voicing confidence score
VDHMM variable duration hidden Markov model
WLP warped linear prediction
WLPC warped linear prediction coefficient
wrt with respect to
LIST OF SYMBOLS
CHAPTER 2
Xn, Yn discrete-time stochastic process defining a hidden Markov model (HMM)
(Ω,F , P ) probability space
Xn discrete-time first-order Markov chain
Yn observable stochastic process corresponding to Xn
xn the particular state value of Xn
yn the particular observation of Yn
r number of states in a Markov chain
R state space for Markov chain Xn; R = {1, . . . , r}
πi probability of an HMM starting in state i, i ∈ R
π length-r vector of initial probabilities; π = {πi}i∈R
Π set of all length-r stochastic vectors
aij P (Xn = j|Xn−1 = i); probability of transitioning from state i to state j in an HMM
A r × r transition probability matrix for an HMM; A = {aij}i,j∈R
A set of all r × r stochastic matrices
E space upon which each Yn takes values
b(·; θj), bj(·) observation density for state j of an HMM
θj parameters of a density function describing the observations of state j
Θ set of valid parameters for a family of observation densities
µj, σj mean and standard deviation parameters for state j of a single-dimensional Gaussian distribution
s number of observations per state of an HMM with observations in a finite alphabet
xvi
V set of symbols in an HMM with observations from a finite alphabet
vk observation symbol k of an HMM with observations from a finite alphabet
bjk probability of observing symbol vk in state j of an HMM with observations from a finite alphabet
g(·, θ) real-valued function on R indexed by θ; output is produced according to a probability distribution with θ as a parameter
en sequence of independent and identically distributed (iid) random variables
Φ HMM parameter space; Φ = Π ×A× Θ or Φ = A× Θ
ϕ vector of model parameters for an HMM; ϕ ∈ Φ (e.g., ϕ = {a11, a12, . . . , arr, θ1, . . . , θr})
ϕ̂ estimate of model parameters for an HMM
ϕ∗ true model parameters for an HMM
p length of vector ϕ
ϕl the lth parameter of parameter vector ϕ; 1 ≤ l ≤ p
π(ϕ) initial probability vector for HMM ϕ
A(ϕ) transition probability matrix for HMM ϕ
aij(ϕ) i,jth element of A(ϕ)
θj(ϕ) observation density parameter(s) for state j of HMM ϕ
bj(·;ϕ) observation density of state j for HMM ϕ; equivalent to b(·; θj(ϕ))
bjk(ϕ) probability of observing symbol vk in state j of finite-alphabet HMM ϕ
µj(ϕ) observation mean of a single-dimensional Gaussian distribution for state j of HMM ϕ
σj(ϕ) observation standard deviation of a single-dimensional Gaussian distribution for state j of HMM ϕ
b(yn;ϕ) length-r column vector of observation density values for HMM ϕ; b(yn;ϕ) =[b1(yn;ϕ), . . . , br(yn;ϕ)]′
B(yn;ϕ) r×r diagonal matrix of observation pdf values for HMM ϕ; B(yn;ϕ) = diag[b1(yn;ϕ), . . . , br(yn;ϕ)]
〈y1, . . . , yn〉 a length-n sequence of observations
pn(y1, . . . , yn;ϕ) n-dimensional likelihood of observation sequence 〈y1, . . . , yn〉 for HMM ϕ
1ℓ length-ℓ column vector of all ones
uni(ϕ) probability of state i at time n given all previous observations; uni(ϕ) = P (Xn =i|y1, . . . , yn−1)
xvii
un(ϕ) length-r column vector of prior state probabilities for HMM ϕ at time n; un(ϕ) =[un1(ϕ), . . . , unr(ϕ)]′
w(l)n(ϕ) length-r column vector of the derivative of un(ϕ) with respect to parameter l of ϕ; 1 ≤ l ≤ p
wn(ϕ) r × p matrix of derivatives of un(ϕ) with respect to all model parameters
R1(yn;ϕ) part of the calculation of w(l)n+1(ϕ)
R(l)2(yn;ϕ) part of the calculation of w(l)n+1(ϕ)
ℓn(ϕ) log-likelihood of 〈y1, . . . , yn〉 for HMM ϕ; ℓn(ϕ) = (1/(n + 1)) log pn(y1, . . . , yn;ϕ)
Yn a collection of parameters; Yn = (Yn,un(ϕ),wn(ϕ)).
S(l)(Y ;ϕ) the derivative of the last update to the likelihood function with respect to ϕl
S(Yn;ϕ) length-p “incremental score vector”; the collected derivatives of the likelihoodfunction with respect to each parameter; S(Yn;ϕ) = [S(1)(Yn;ϕ), . . . , S(p)(Yn;ϕ)]′
(∂/∂ϕl)h partial derivative of function h(·) with respect to ϕl
εn learning rate parameter; εn → 0; ∑n εn = ∞
ΠG Projection onto set G
G compact and convex set; subset of parameter space Φ; G ⊆ Φ
µj(ϕ) observation mean of a multi-dimensional Gaussian distribution for state j of HMM ϕ
Σj(ϕ) observation covariance matrix of a multi-dimensional Gaussian distribution for state j of HMM ϕ
Rj(ϕ) the upper triangular matrix of the Cholesky decomposition of Σj(ϕ); Σj(ϕ) =Rj(ϕ)′Rj(ϕ)
Pϕ∗ probability measure for ϕ∗
K(ϕ) Kullback-Leibler information of ϕ; K(ϕ) = −[ℓ(ϕ) − ℓ(ϕ∗)]
LML set of global minima of K(ϕ)
Mn projection term needed to get ϕn + εnSn(Yn;ϕn) back to constraint set G
ϕ̇ first derivative of ϕ, when described as an ordinary differential equation (ODE)
H(ϕ) (∂/∂ϕ)K(ϕ)
m force term needed to keep ODE ϕ(·) ∈ G
LG the set of limit points of finite difference equation 2.24
Nη(A) an η neighborhood of A
S̄n averaged version of update Sn
ϕ̄n averaged version of parameter set ϕn
k the maximum history size used for averaging S̄n and ϕ̄n
fn(ϕ) length-r vector of posterior probabilities of states at time n for HMM ϕ
fni(ϕ) probability that the state is i after n observations; fni(ϕ) = P (Xn = i|y1, . . . , yn)
CHAPTER 3
ϕl1, ϕl2 the two lower-level HMMs in an HMM cascade model
ϕu the upper-level HMM in an HMM cascade model
ϕ an HMM cascade model; ϕ = {ϕl1, ϕl2, ϕu}
yu,1n, yu,2n the observations of ϕu, corresponding to states in ϕl1 and ϕl2, respectively
xlγ generic term referring to xl1 or xl2, the states of ϕl1 and ϕl2
λ a switching HMM
s number of transition probability matrices in a switching HMM
Am(λ) the set of transition probability matrices in switching HMM λ; m = 1, . . . , s
qn an external signal which chooses the transition probability matrix to use at timen
ϕ a cascaded switching HMM; ϕ = {ϕu, λl1, λl2}
CHAPTER 4
ϕ̂c the robot’s concept HMM
ϕ̂a the robot’s auditory HMM
ϕ̂v the robot’s visual HMM
ϕ̂robot the cascade HMM being learned by the robot (simulation); ϕ̂robot = {ϕ̂c, ϕ̂a, ϕ̂v}
ϕc the boy’s concept HMM
λa the boy’s auditory switching HMM
ϕv the boy’s visual HMM
ϕboy the cascaded switching HMM used by the “boy” (simulation); ϕboy = {ϕc, λa, ϕv}
ϕvis the visual HMM producing real-world outputs (simulation)
yvisn the output of the visual model
xvn estimate of the state of the boy’s visual HMM (ϕv)
xcn estimate of the state of the boy’s concept HMM (ϕc)
ycan generated output of the boy’s concept HMM (ϕc) corresponding to auditory information
xan the boy’s auditory model state
yan, yaudn the boy’s auditory model output
xa, xv estimated state sequences for the robot’s auditory and visual models
xc estimated state sequence for the robot
APPENDIX B
s(n) speech signal
ak linear predictive coefficient
e(n) prediction error
E(n) squared prediction error
ki reflection coefficients (RCs) for a one-dimensional vocal tract tube model
Ai area of one segment of a one-dimensional vocal tract tube model
gi log-area ratios; gi = Ai+1/Ai
c(1)t, c(2)t voicing confidence scores for the first and second half of a speech segment, respectively
ctotal initial voicing confidence score estimate
cf final voicing confidence score estimate
Vthresh threshold for determining strongly unvoiced speech segments
en log-energy of a speech segment
yij an image pixel at location (i, j)
xij lattice point at location (i, j); corresponds to yij
Y random variable representing an entire image; Y = {yij}
X random variable representing an entire lattice, corresponding to a segmentation of image Y; X = {xij}
P (X,Y ) joint probability of X and Y
Z scale factor
ψ(xij , xkl) within-lattice potential function
φ(xij , yij) lattice-image potential function
X∗ optimal segmentation of image Y
x∗ij optimal segmentation label of lattice point xij
n iteration number
mn(ij,kl)(xkl) message passed from xij to xkl at time n
α scaling constant
Γ(i, j) set of neighbors of (i, j)
γ scale factor
APPENDIX C
Xn, Yn discrete-time stochastic process defining a hidden Markov model (HMM)
Xn discrete-time first-order Markov chain
Yn observable stochastic process corresponding to Xn
xn the particular state value of Xn
yn the particular observation of Yn
〈y1, . . . , yn〉 a length-n sequence of observations
〈x1, . . . , xn〉 a length-n sequence of states
pn(y1, . . . , yn;ϕ) n-dimensional likelihood of observation sequence 〈y1, . . . , yn〉 for HMM ϕ
αni(ϕ) forward probability; the joint likelihood of 〈y1, . . . , yn〉 and Xn = i; αni(ϕ) =p(y1, . . . , yn, Xn = i;ϕ)
αn(ϕ) length-r column vector of forward probabilities for HMM ϕ at time n; αn(ϕ) =[αn1(ϕ), . . . , αnr(ϕ)]′
βni(ϕ) backward probability; given Xn = i, the conditional likelihood of 〈yn+1, . . . , yN 〉;βni(ϕ) = p(yn+1, . . . , yN |Xn = i;ϕ)
βn(ϕ) length-r column vector of backward probabilities for HMM ϕ at time n; βn(ϕ) =[βn1(ϕ), . . . , βnr(ϕ)]′
P for Baum-Welch parameter estimation, the n-dimensional likelihood of observa-tion sequence 〈y1, . . . , yn〉; P = pn(y1, . . . , yn;ϕ); for the Viterbi algorithm, the n-dimensional joint likelihood of 〈y1, . . . , yn〉 and 〈x1, . . . , xn〉;P = pn(y1, . . . , yn, x1, . . . , xn;ϕ)
γij the expected number of transitions from state i to state j for a given model andobservation sequence
γi the expected number of transitions out of state i for a given model and observationsequence
āij(ϕ) new estimate of transition probability aij(ϕ) in Baum-Welch or Viterbi reestimation
b̄jk(ϕ) new estimate of observation probability bjk(ϕ) in Baum-Welch or Viterbi reestimation
π̄i(ϕ) new estimate of initial probability πi(ϕ) in Baum-Welch or Viterbi reestimation
µ̄j(ϕ) new estimate of observation mean µj(ϕ) in Baum-Welch or Viterbi reestimation
σ̄j(ϕ) new estimate of observation standard deviation σj(ϕ) in Baum-Welch or Viterbi reestimation
φni the maximum joint likelihood of 〈y1, . . . , yn〉, 〈x1, . . . , xn−1〉, and Xn = i, calculated recursively in the Viterbi algorithm
ψnj the most likely state at time n− 1 leading to state j at time n
s number of observations per state of an HMM with observations in a finite alphabet
V set of symbols in an HMM with observations from a finite alphabet
vk observation symbol k of an HMM with observations from a finite alphabet
bjk probability of observing symbol vk in state j of an HMM with observations from a finite alphabet
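As a concrete illustration of the forward probabilities αni(ϕ) and likelihood P defined above, the following is a minimal Python sketch of the standard forward recursion for a discrete-observation HMM. The model values are invented for illustration and are not taken from the dissertation.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward recursion using the symbols of this appendix.

    Returns alpha, an N x r matrix with alpha[n, i] equal to the joint
    likelihood p(y_1, ..., y_{n+1}, X_{n+1} = i), and the sequence
    likelihood P = sum_i alpha[N-1, i].
    """
    r = len(pi)
    alpha = np.zeros((len(obs), r))
    alpha[0] = pi * B[:, obs[0]]                  # alpha_1i = pi_i b_i(y_1)
    for n in range(1, len(obs)):
        # alpha_ni = [sum_j alpha_{n-1,j} a_ji] b_i(y_n)
        alpha[n] = (alpha[n - 1] @ A) * B[:, obs[n]]
    return alpha, alpha[-1].sum()

# Toy 2-state, 2-symbol model (illustrative values only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # a_ij
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # b_jk
alpha, P = forward(pi, A, B, [0, 1, 0])
```

In practice the recursion is scaled at each step (hence the scale factor γ above) to avoid numerical underflow on long sequences.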
APPENDIX D
Xn′ , Yn′ , Tn′ discrete-time stochastic process defining a hidden semi-Markov model (HSMM)
Xn′ discrete-time first-order Markov chain
Yn′ observable stochastic process corresponding to Xn′
Tn′ sequence of discrete state durations corresponding to Xn′
n′ model time counter
(Ω,F , P ) probability space
xn′ the particular state value of Xn′
yn′ the particular τn′ -length observation of Y n′
τn′ the duration of observation sequence yn′
r number of states in a Markov chain
R state space for Markov chain Xn′ ; R = {1, . . . , r}
πi probability of an HSMM starting in state i, i ∈ R
π length-r vector of initial probabilities; π = {πi}i∈R
Π set of all length-r stochastic vectors
aij P (Xn′ = j|Xn′−1 = i); probability of transitioning from state i to state j in an HSMM
A r × r transition probability matrix for an HSMM; A = {aij}i,j∈R
A set of all r × r stochastic matrices
d(·;λj), dj(·) parametric duration density for state j of an HSMM
λj parameters of a density function describing the durations of state j of an HSMM
Λ set of valid parameters for a family of duration densities
νj, ηj parameters of a gamma density describing the durations of state j of an HSMM
(dj1, . . . , djT ) discrete probability distribution describing the durations of state j of an HSMM
b(·|τ ; θj), bj(·|τ ) conditional observation density for state j of an HSMM
θj parameters of a density function describing the observations of state j
Θ set of valid parameters for a family of observation densities
Φ HSMM parameter space; Φ = Π × A × Λ × Θ or Φ = A × Λ × Θ
ϕ vector of model parameters for an HSMM; ϕ ∈ Φ (e.g., ϕ = [a11, a12, . . . , arr, λ1, . . . , λr, θ1, . . . , θr])
ϕ̂ estimate of model parameters for an HSMM
ϕ∗ true model parameters for an HSMM
p length of vector ϕ
ϕl the lth parameter of parameter vector ϕ; 1 ≤ l ≤ p
π(ϕ) initial probability vector for HSMM ϕ
A(ϕ) transition probability matrix for HSMM ϕ
aij(ϕ) i,jth element of A(ϕ)
λj(ϕ) duration density parameter(s) for state j of HSMM ϕ
θj(ϕ) observation density parameter(s) for state j of HSMM ϕ
b(yn′ |τn′ ;ϕ) length-r column vector of observation density values for HSMM ϕ; b(yn′ |τn′ ;ϕ) = [b1(yn′ |τn′ ;ϕ), . . . , br(yn′ |τn′ ;ϕ)]′
B(yn′ |τn′ ;ϕ) r × r diagonal matrix of observation pdf values for HSMM ϕ; B(yn′ |τn′ ;ϕ) = diag[b1(yn′ |τn′ ;ϕ), . . . , br(yn′ |τn′ ;ϕ)]
d(τn′ ;ϕ) length-r column vector of duration density values for HSMM ϕ; d(τn′ ;ϕ) = [d1(τn′ ;ϕ), . . . , dr(τn′ ;ϕ)]′
D(τn′ ;ϕ) r × r diagonal matrix of duration density values for HSMM ϕ; D(τn′ ;ϕ) = diag[d1(τn′ ;ϕ), . . . , dr(τn′ ;ϕ)]
g(yn′ , τn′ ;ϕ) length-r column vector, product of observation and duration densities for HSMM ϕ; g(yn′ , τn′ ;ϕ) = B(yn′ |τn′ ;ϕ)D(τn′ ;ϕ)1r
G(yn′ , τn′ ;ϕ) r × r diagonal matrix, product of observation and duration densities for HSMM ϕ; G(yn′ , τn′ ;ϕ) = B(yn′ |τn′ ;ϕ)D(τn′ ;ϕ)
n normal time counter
t0τn′(k′) function defining the normal-time beginning of the k′th state for duration sequence τn′
t1τn′(k′) function defining the normal-time end of the k′th state for duration sequence τn′
ξτn′(n) function defining the model time corresponding to normal time n
Xn normal-time state process of an HSMM; Xn = Xξτn′(n)
Yn normal-time observable process of an HSMM; Yn′ = 〈Yt0(n′), . . . , Yt1(n′)〉
〈y1, . . . , yn〉 a length-n sequence of normal-time observations
〈y1, . . . , yn′〉 a length-n′ sequence of model-time observations; y1 = 〈y1, . . . , yt1(1)〉, y2 = 〈yt0(2), . . . , yt1(2)〉, . . . , yn′ = 〈yt0(n′), . . . , yn〉
pn′(y1, . . . , yn′ , τ1, . . . , τn′ ;ϕ) n′-dimensional joint likelihood of observation sequence 〈y1, . . . , yn′〉 and duration sequence 〈τ1, . . . , τn′〉 for HSMM ϕ
1ℓ length-ℓ column vector of all ones
un′j(ϕ) probability of state j at model time n′ given all observations through yn′−1; un′j(ϕ) = P (Xn′ = j|y1, . . . , yn′−1, τ1, . . . , τn′−1, n′)
un′(ϕ) length-r column vector of prior state probabilities for HSMM ϕ at model time n′; un′(ϕ) = [un′1(ϕ), . . . , un′r(ϕ)]′
unj(ϕ) probability of state j at normal time n given all previous observations, and given that we just changed states; unj(ϕ) = P (Xn = j|y1, . . . , yn−1, τ1, . . . , τn′−1, n′, ξ(n− 1) = n′ − 1, ξ(n) = n′)
un(ϕ) length-r column vector of prior state probabilities for HSMM ϕ at normal time n; un(ϕ) = [un1(ϕ), . . . , unr(ϕ)]′
ℓn′(τn′ ;ϕ) normalized log-likelihood of model-time observations 〈y1, . . . , yn′〉 for HSMM ϕ
ℓn(ϕ) log-likelihood of normal-time observations 〈y1, . . . , yn〉 for HSMM ϕ
n′∗n the number of segments which maximizes ℓn(ϕ)
Tn process describing the most likely sequence of durations
τ∗n the length of the last segment of yn′ which maximizes ℓn(ϕ)
w(l)n(ϕ) length-r column vector of derivatives of un(ϕ) with respect to parameter l of HSMM ϕ; 1 ≤ l ≤ p
wn(ϕ) r × p matrix of derivatives of un(ϕ) with respect to all model parameters
R1(yn′ , τ ;ϕ) part of the calculation of w(l)n+1(ϕ)
R(l)2(yn′ , τ ;ϕ) part of the calculation of w(l)n+1(ϕ)
Yn a collection of parameters; Yn = (Yn, Tn, un(ϕ), wn(ϕ)).
S(l)(Y ;ϕ) the derivative of the last update to the likelihood function with respect to ϕl
S(Yn;ϕ) length-p “incremental score vector”; the collected derivatives of the likelihood function with respect to each parameter; S(Yn;ϕ) = [S(1)(Yn;ϕ), . . . , S(p)(Yn;ϕ)]′
(∂/∂ϕl)h partial derivative of function h(·) with respect to ϕl
εn learning rate parameter; εn → 0; ∑n εn = ∞
ΠG projection onto the set G
G compact and convex set; subset of parameter space Φ; G ⊆ Φ
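To make the filter quantities above concrete, the following Python sketch performs one model-time update of the prior state vector un′(ϕ) using the diagonal product matrix G(yn′ , τn′ ;ϕ) = B D. The normalized form u ∝ A′Gu is an assumption based on the standard HMM filter, not necessarily the exact recursion derived in the dissertation, and all numeric values are illustrative.

```python
import numpy as np

def filter_step(u, A, b_vals, d_vals):
    """One prior-state update u_{n'+1} ∝ A' G(y_{n'}, τ_{n'}; ϕ) u_{n'}.

    u       : prior state probabilities u_{n'}(ϕ)
    A       : r x r transition matrix
    b_vals  : b_vals[j] = b_j(y_{n'} | τ_{n'}), observation density values
    d_vals  : d_vals[j] = d_j(τ_{n'}), duration density values
    """
    G = np.diag(b_vals * d_vals)      # G = B D (diagonal)
    unnorm = A.T @ G @ u              # A' G u
    return unnorm / unnorm.sum()      # renormalize to a probability vector

# Illustrative 2-state example.
A = np.array([[0.8, 0.2], [0.3, 0.7]])
u = np.array([0.5, 0.5])
u_next = filter_step(u, A, np.array([0.4, 0.1]), np.array([0.3, 0.6]))
```

The normalization constant discarded here is exactly the per-segment likelihood increment that the log-likelihood ℓn′ accumulates.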
CHAPTER 1
INTRODUCTION
1.1 Background and Motivation
Cognitive development has been studied in various environments—on the playground by the psychologist, under the microscope by the neuroscientist, and in the armchair by the philosopher. Our
study occurs in a robotics lab, where we attempt to embody cognitive models in steel and silicon.
How did we choose this particular habitat? First and foremost, we are scientists and engineers,
which immediately suggests forming theories and building things to test them. The particular
question we are examining is one of the most fascinating questions that has been asked in the last
century: Can machines think?
Alan Turing raised this very question back in 1950. He introduced the idea of a machine
engaging in “pure thought” and communicating with the world via teletypewriter. As an answer to
The Question, he suggested that when the machine’s discourse (via teletype) was indistinguishable
from a human’s, we could say that the machine was thinking. At the end of the paper, he suggests that machines could perhaps first learn to compete with men at some purely intellectual task, such as chess, but he then presents an alternative approach for creating machine intelligence:
It can also be maintained that it is best to provide the machine with the best sense
organs that money can buy, and then teach it to understand and speak English. This
process could follow the normal teaching of a child. Things would be pointed out and
named, etc. [1, p. 76]
Most artificial intelligence research has followed the former proposal. We believe the latter method
holds more promise.
As scientists, we start with a hypothesis. Our hypothesis forms a constructive theory of mind,
and can be summarized as follows. We believe that human intelligence, and hence language, is
primarily semantic. We believe that the mind forms semantic concepts through the association of
events close together in time, or events or cues close together in space, or both. We further believe
that an integrated sensory-motor system is necessary to ground these concepts and allow the mind
to form a semantic representation of reality—there is no such thing as a disembodied mind.
To test our hypothesis, we are developing a robotic platform, complete with basic sensory-
motor and computing capabilities. The sensory-motor components are functionally equivalent to
their human or animal counterparts, and include binaural hearing, stereo vision, tactile sense,
and basic proprioceptive control. On top of these components, our group is implementing various
processing and learning models, with the intention of creating and aiding semantic understanding.
Our goal is to produce a robot that will learn to understand and carry out simple tasks in response
to natural language requests.
At this point in time, we have already developed a robust base system and conducted a number
of experiments on the way to our goal of a language-learning robot. In particular, we have developed
the basic hardware and software framework necessary for our work and have run numerous experiments to study ideas in learning, memory, and behavior. My primary contribution, and the main focus of
this dissertation, is an associative semantic memory based on hidden Markov models (HMMs) and
built as part of the robot’s cognitive system.
In the following introductory sections, we will discuss previous work in the field of developmental
robotics (which is the subfield of robotics to which our work belongs), give an overview of our
project, and then describe the semantic learning ideas used as a basis for the research described in
the bulk of this dissertation. Note that previous related work in stochastic modeling is described
in Section 3.2, and previous work in language grounding and associative language learning appears
in Section 4.2.
1.2 Developmental Robotics
From Turing’s 1950 paper until the mid-1980s, the field of artificial intelligence (AI) was dominated
by research on what Turing referred to as “purely intellectual” tasks. Despite the agreement that the
long-term goal of AI and robotics was to design physical systems exhibiting intelligent behavior, AI
research had until that time focused mostly on isolated topics: representation, search algorithms,
planning, etc. [2]. Turing’s suggestion to provide machines “with the best sense organs that money
can buy” was largely forgotten.
About 20 years ago, some AI researchers were beginning to feel that there were some fundamental problems with this good old-fashioned AI (GOFAI). Rodney Brooks was one of the first people to
articulate this point. In 1986, he argued that the most important aspects of intelligence were being
ignored by the AI community. Specifically, he suggested that much more focus needed to be given to interaction with the environment rather than to representation alone, and that mobility, vision, and survival behavior “provide a necessary basis for the development of intelligence” [3, p. 2].
Since this time, a number of researchers have built on and expanded these basic ideas, and the subdiscipline of developmental, or epigenetic, robotics has emerged. Developmental robotics
focuses on the use of robots to study cognitive development, and draws people from a wide variety
of backgrounds, including developmental psychology, neuroscience, biology, and robotics.
There are many common themes among research in this area. The following ideas were selectively compiled from [3–9]:
1. Biological and cognitive systems consist of a large number of simple, integrated modules—
these systems are not monolithic. Complex behavior can emerge from this integrated system.
2. Biological and cognitive systems develop incrementally, both through evolution and through
learning.
3. Any form of cognition requires embodiment. Any representations that exist in the brain are
fundamentally based in the world and have no meaning outside of this context. An integrated
sensory-motor system is thus necessary for cognition.
4. Higher cognitive development depends on social interaction, including mimicry/imitation and
shared attention.
Most research in developmental robotics incorporates multiple ideas from this list. We highlight a
few projects below.
After some time focusing on insect robots and refining his initial ideas, Brooks’s group used what
they learned to change direction toward studying humanlike intelligence. In the early 1990s, they
built Cog [4, 10], an upper-torso humanoid robot, with the goal of studying issues in embodiment,
integration of multiple sensory and motor systems, and social interaction. A second robot, Kismet
[11, 12], was also built to study social interactions between robots and humans. In general, their
research has been based on ideas from psychology, with an initial focus on creating robust modules
for low-level cognitive functions, then progressing to study the relationship between low-level and
high-level cognitive functions and social interaction. Other projects focusing on social interaction
include [13, 14].
Juyang Weng’s group at Michigan State has been working on two mobile humanoid robot
projects, SAIL and DAV [7, 8, 15, 16]. The focus of these projects is a developmental learning
model based on human cognitive and behavioral development, with a focus on sensory integration and high-level cognitive function. Their research has mainly drawn from ideas and research in
developmental psychology.
At the other end of the spectrum, various researchers [6, 17–19] are concerned with studying
brain activity through the development of machines built on neurobiological principles. Specifically,
their approach is to develop low-level neurological models of the brain, and put them in simple
animal-like robots with the ability to sense and interact with the world. With these experiments,
they study the emergent behavior the models allow the robot to produce, as well as how closely the models’ responses match responses from neurological research.
Various other researchers [2,20–24] study aspects of cognition using robotics; see [9] for a recent
survey. One key aspect of our project is our focus on language learning and interaction as a basis
for higher level learning. We describe our project in the next section.
Figure 1.1: Cognitive Cycle. This figure shows the flow of cognition among the senses, the noetic system, the motor system, and the environment. The noetic system refers physically to the brain and nervous system, which are assumed to be responsible for mental processes.
1.3 A Robotic System for Language Acquisition
As with other researchers in developmental robotics, our group is using robots to study cognition.
The description of our research begins with the cognitive cycle depicted in Figure 1.1. This simple
diagram shows the flow of cognition among three systems (a sensory system, a noetic system,
a motor system) and the environment. The fact that this diagram equally emphasizes these four
components is significant, as we feel that grounding and interaction with the world are requirements
for cognition. We describe the components of the cycle in more detail below, with discussion on
how they relate to human cognition and implementation of functional equivalents.
1.3.1 Somatic system
The somatic system is the “body” component of the mind-body system. It is composed of the
physical components necessary for cognition: the senses, muscular (motor) system, nervous system,
and the brain.
1.3.1.1 The senses
The necessary start of cognition is the gathering of information from the environment through
the sensory inputs. In humans, these inputs include the five senses—tactile (touch), gustatory
(taste), olfactory (smell), auditory (hearing), and visual (sight). We also perceive information
about ourselves, through proprioception (sense of body position and movement) and interoception
(internal sensory perception of such things as hunger and body temperature). From these we draw
all of our experience, and while we can learn and adapt without one or more of them, sensory
perception is a prerequisite to our cognitive abilities.
1.3.1.2 Muscular/motor system
Our senses provide us with information from the environment, but the ability to perceive the
environment is only half of the connection with the world necessary for cognition. Humans and
other animals also have the ability to move around in, interact with, and affect the environment. We
can identify two classes of human movement that we wish to emulate:
1. Full body movement in the environment
2. Actuated and articulated movement of body parts (e.g., movement of arms and head, speech)
To do the most humanlike cognitive studies, we would like to work with a robot which is as
anthropomorphic as possible.
1.3.1.3 Brain and nervous system
The last fundamental components of the somatic system necessary for modeling cognition are the
brain and nervous system. For the study of cognition, we obviously need to emulate the functions
of these as well. We need a way to connect the sensory-motor periphery to the brain, and we, of
course, need to model certain functional aspects of the brain. The functionality of the brain which
we wish to model is described in more detail in Section 1.3.2.
Figure 1.2: Our robot Illy. Illy is one of three Arrick Robotics Trilobots we use for our cognition and language acquisition research. The base unit for the robots was heavily augmented to include stereo cameras and microphones, an on-board computer, and wireless ethernet.
1.3.1.4 Implementation
Sensory perception and motor expression are the essential connections of the mind to the outside
world, and require a body. The body we chose to work with is Arrick Robotics’ Trilobot [25] (see
Figure 1.2). The robot’s anthropomorphic capabilities are rich enough to suit our purposes. In
particular, the robot can move freely on wheels, move its head, and use its arm to manipulate
common objects, allowing relatively complex behaviors. A speaker is available on-board for the
production of sounds and, with additional processing, speech.
For embodied cognition, we desire our robot to have as many of the previously mentioned senses
as possible. For our robot’s eyes and ears, we have added cameras and microphones to the robot
to give it stereo vision and hearing capabilities. The sounds and images that humans receive are
of course processed by our brain, but even before that, the ear and eye do significant processing
on their inputs. It is well known, for example, that the human ear acts as a spectral filter (see
e.g., [26]), and that a large amount of feature extraction occurs in the retina before the signal
even leaves the eye (see e.g., [27]). Since the cameras and microphones mounted on our robots
do not handle this processing, we have implemented, in software, some basic audio and visual
processing and feature extractors to mimic aspects of these systems. For visual inputs, we use
mostly standard image processing and computer vision techniques. See [28–30] for details. For
auditory inputs, in addition to standard spectral filtering, D. Li has developed some important
processing techniques useful for anthropomorphic behavior. These include binaural sound source
localization and sound characterization. A robust sound source localization algorithm based on his
work is currently implemented on the robot, and is a key component of our work. Details can be
found in [31] and [32].
Equivalents for other senses are slightly more difficult to incorporate. Touch sensors, while not
nearly as versatile as skin, do allow for limited input of tactile sensations, and the Trilobot has a
number of touch and other sensors available. Some aspects of proprioception are implemented in
software and by using feedback sensors on some of the actuators located on the robot. Olfactory and
gustatory sensors are more difficult to include, and we chose to ignore these senses for now. However,
research has progressed in the development of artificial skin [33], noses [34], and tongues [35].
Sometime in the not-too-distant future, researchers will be able to use these organs to allow a robot
to perceive an even richer set of sensory inputs. For now, we have chosen to focus on the senses of
sight, sound, and touch, with minimal simulation of the others (e.g., proprioception) as needed.
Analogous to the brain and nervous system in humans and higher animals, our robot needs a
computational brain and a way to deliver information from its various sensors to this brain. On the
hardware level, we have incorporated a computer on board our robot which collects input from the
cameras, microphones, and sensors, and sends control commands to the robot. The computer can
also handle limited processing of the data, but a wireless transmitter is available to transmit the
data to other workstations, where most processing occurs. This distributed system of computers
houses the “brain” of our robot. To facilitate the communications necessary for this system, we
did extensive design and coding of a distributed communications and processing framework early
in this research. Details of this work appear in Appendix B.2. Hardware and system-level software
specifications can be found in Appendix A.
1.3.2 Noetic system
The noetic system in Figure 1.1 represents the “mind” aspect of the mind-body paradigm, which
we expand in further detail in Figure 1.3. Here we are not as interested in emulating the physiology
and low-level connectivity of the brain, except at the grossest levels; e.g., we would like our robots
to exhibit aspects of self-organization and emergent behavior. Our goal, though, is to implement
functional equivalents for high-level cognitive functions. In this section we review some of the
fundamentals of memory, learning, and behavior, and describe how these fundamentals are reflected
in our research. We would like to note that, even though we divide the various aspects of cognition
into these three areas, they are all interdependent; none of these cognitive components could exist
without the other two.
1.3.2.1 Memory
Memory is the most important function of the brain; without it life would be a blank.
Our knowledge is all based on memory. Every thought, every action, our very conception
of personal identity, is based on memory.... Without memory all experience would be
useless. (Edridge-Green, 1900) [36, p. 188]
Browsing through any recent psychology textbook, one can discover a plethora of views and theories
concerning the organization of the human memory system [37]. Some of these are complementary,
others are overlapping, but most simply look at memory from a different perspective. While all
of these views are constructive, here we briefly describe one of the most fundamental
classifications of memory.
William James is generally regarded as the first person to suggest that memory is divided into
primary and secondary systems [38]. This idea later evolved into the concepts of short-term memory
(or working memory), and long-term memory, which we refer to as associative memory. These two
systems are presented as primary components of the noetic system in Figure 1.3.
Short-term memory refers to the immediate thoughts going through our head, whether obtained
from our senses or by manipulation of thoughts or knowledge retrieved from our long-term memory.
The term working memory came about after more research, and refines the idea of short-term
Figure 1.3: Expanded view of the cognitive cycle. This expanded view shows the breakdown and relationship among various components of the noetic system.
memory as a system consisting of a central executive and a number of subsidiary systems, including
at least visual and phonological subsystems [37, 38].
Conceptualization of long-term memory has also been considerably refined since James’ time.
One of the most common models, attributed to Endel Tulving [39], divides long-term memory into
procedural, semantic, and episodic memory, as shown at the bottom of Figure 1.3. Procedural
memory is concerned with our knowledge of how to do things, e.g., how to walk or drive a car.
Semantic memory concerns meaning and our general knowledge about the world. This includes,
for example, meanings of words and knowledge of where we live. Episodic memories are memories
of specific events that have occurred in the past, or alternatively, events that we anticipate in the
future.
1.3.2.2 Learning
Learning can be described as a transition from one mental state to another where information is
gained [40]. In this section, we will highlight what we feel are some essential aspects of learning
that we need to incorporate into our research.
Associative learning. If, as proposed earlier, an associative memory is the central component
of memory, then the corollary is that associative learning is the primary mechanism of learning.
According to Shanks [40], in associative learning, “the environment provides a relationship among
contingent events, allowing [a] person to predict one [event] in the presence of others” (p. 2).
Possible events include both environmental cues and the subject’s own behavior. The relationship
between or among events can be causal or structural. In causal relationships, one event occurs,
followed by another, perhaps after a brief time interval. For example, there is a consistent causal
relationship between touching a hot burner and feeling pain. Structural relationships relate features
or properties of an object or event with other features which frequently co-occur. For example, after
both seeing and smelling a fire, the presence of one of these events generally indicates the presence
of the other. A less obvious example of a structural relationship is the association of a word with
a particular object or event, a key focus of our research.
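As a toy illustration of structural association (a deliberately simplified stand-in for the HMM-based associative memory developed later in this dissertation), a learner can count which cues co-occur in time and then predict one cue from the presence of another. The class and cue names below are invented for illustration.

```python
from collections import Counter

class Associator:
    """Counts co-occurring cues and predicts the most strongly associated one."""

    def __init__(self):
        self.pair_counts = Counter()

    def observe(self, cues):
        # Associate every ordered pair of cues occurring together in time.
        for a in cues:
            for b in cues:
                if a != b:
                    self.pair_counts[(a, b)] += 1

    def predict(self, cue):
        # Return the cue most often observed together with `cue`.
        candidates = {b: n for (a, b), n in self.pair_counts.items() if a == cue}
        return max(candidates, key=candidates.get) if candidates else None

learner = Associator()
for _ in range(5):
    learner.observe(["see-fire", "smell-smoke"])   # frequent co-occurrence
learner.observe(["see-fire", "hear-dog"])          # incidental co-occurrence
prediction = learner.predict("see-fire")
```

The same counting idea extends to word-object association: a spoken label presented together with a visual percept becomes a structural cue pair.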
Reinforcement learning. Reinforcement learning is one aspect of associative learning. It can
refer to a couple of distinct but related concepts, depending on the type of relationship being
learned:
1. knowledge gained through repeated stimulation of co-occurring cues from the environment;
or
2. behavior learned through the repeated association of an action and a reward or punishment
[41] (i.e., behaviorism).
Here, we will briefly describe the first version of reinforcement learning. Formally, a subject is
connected to its environment through perception and action. Through its senses, it perceives some
indication i of the state of the environment. The subject then produces some action a which has
an effect on the environment. This effect is evaluated through a reinforcement signal r (the reward
or punishment). The reinforcement signal may be internally or externally generated, but in either
case is a function of input i. In general, the subject’s goal is to choose actions which in some way
maximize the long-run sum of r.
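The perception-action-reinforcement loop just described can be sketched as follows; the environment, policy, and reward values are invented placeholders, not part of our system.

```python
# The subject perceives an indication i of the environment's state,
# produces an action a, and receives a reinforcement signal r, with the
# goal of maximizing the long-run sum of r.

def run_episode(env_step, policy, i0, steps=10):
    total_r, i = 0.0, i0
    for _ in range(steps):
        a = policy(i)             # choose an action from the perception i
        i, r = env_step(i, a)     # act on the environment, observe reward
        total_r += r              # accumulate long-run reinforcement
    return total_r

# Placeholder environment: the state is an integer, actions move it by
# +1 or -1, and reward 1.0 is given whenever state 3 is reached.
def env_step(i, a):
    i2 = i + a
    return i2, (1.0 if i2 == 3 else 0.0)

greedy = lambda i: 1 if i < 3 else -1    # hand-coded policy toward state 3
total = run_episode(env_step, greedy, i0=0)
```

A learning subject would replace the hand-coded policy with one that is itself adjusted to increase the accumulated reinforcement.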
A simple example of reinforcement learning occurs in the training of animals. An interesting
example comes from a recently published New York Times article, which describes how Gambian
giant pouched rats are being trained to find land mines [42]. Finding a mine earns each rat a snap of
a clicker and a snack of peanuts or banana. At times, the rats try to game the system by randomly
scratching the earth in the hopes of getting free treats, but they are rewarded with food only for
actual finds. Of course, reinforcement learning examples do not have to be so esoteric, nor are they
necessarily limited to other animals. Almost any activity humans attempt can involve evaluation
which causes a modification of future behavior.
1.3.2.3 Behavior
If memory contains our knowledge about the world, and learning modifies that knowledge, behavior
puts that knowledge into use. Behavior is, of course, intimately linked to the reinforcement learning
mechanism described above. Some human or animal behaviors would be difficult to emulate (e.g.,
procreation), but there are specific behaviors and aspects of behaviors which we would like to model.
A few are listed below.
Curiosity and exploration. Humans are curious creatures. Gopnik et al. [43] suggest that
infants and children are wired to explore, experiment, and learn about the world. Garvey [44]
also states that the cognitive abilities learned in the first two years “are developed by acting on
and interacting with ... things and people.... [T]hese developments also reflect the beginnings
of symbolic representation, a prerequisite to the development of language and abstract thinking”
(p. 41). We feel that exploration is necessary to obtain as much information as possible about our
environment, and is instrumental in our cognitive development.
Language understanding and acquisition. Since the focus of our research is language acquisi-
tion, some of the behaviors we hope to emulate are directly related to language and communication.
We have already mentioned Garvey’s comments above concerning exploration and language devel-
opment. More direct examples of linguistic behaviors are available. For example, dogs can be
taught to retrieve named objects [45], and children begin to understand and say object names at a
young age. Both of these behaviors are essential targets for our research.
Imitation. Children learn extensively through imitation of both speech and action [43]. One
benefit of imitation is that it gives an example of a specific behavior and desired outcome, which can
be used for evaluation in a reinforcement learning paradigm. Learning through imitation has also
been proposed as an efficient and perhaps necessary mechanism for learning in robots [13, 46–48].
1.3.2.4 Implementation
The noetic system in our robot should be able to express the aspects of memory, learning, and
behavior outlined above. Among other things, the robot needs to:
1. look around, navigate, and perform actions (procedural memory, using reinforcement learning
and imitation);
2. learn about and understand its environment (semantic memory, with associative learning);
and
3. make decisions using what it knows and currently senses (working memory and a central
decision maker, interacting with long-term memory).
The ability to remember specific past events or sequences of events, and the ability to predict or
even desire future events (episodic memory), are also essential for our study of language learning,
the principal long-term goal of our work.
In our group’s research, we have studied various incarnations of these ideas. Our work includes
research in many of the topics just discussed, including (1) navigation and interaction via reinforcement training, (2) autonomous exploration, (3) speech imitation, and (4) concept learning via
association. We highlight this work below.
Environment navigation and interaction via reinforcement learning. Just as a child
must learn to move and interact with the world, our robot needs to learn to move around and
interact with its environment. To this end, members of our group have developed and implemented
reinforcement learning algorithms which allow the robot to learn navigation. In Lin’s work [30],
the robot learns to visually navigate a maze using Q-learning, a reinforcement learning algorithm.
Zhu and Levinson [49] developed an improved Q-learning algorithm called propagated Q-learning,
or PQ-learning, and used this method on the robot to learn general navigation toward a goal,
including obstacle avoidance.
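Generic one-step tabular Q-learning, of the kind these projects build on, can be sketched as follows; the corridor world and parameter values are illustrative and are not taken from [30] or [49].

```python
import random

# Tabular Q-learning on a tiny 1-D corridor: states 0..4, actions -1/+1,
# reward 1.0 on reaching the goal state 4.  The update rule is
# Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

random.seed(0)
GOAL, ALPHA, GAMMA, EPS = 4, 0.5, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}

def step(s, a):
    s2 = min(4, max(0, s + a))
    return s2, (1.0 if s2 == GOAL else 0.0)

for _ in range(500):                      # training episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.choice((-1, 1))
        else:
            a = max((-1, 1), key=lambda act: Q[(s, act)])
        s2, r = step(s, a)
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, -1)], Q[(s2, 1)])
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# Follow the learned greedy policy from the start state (bounded for safety).
greedy_path = [0]
while greedy_path[-1] != GOAL and len(greedy_path) < 10:
    s = greedy_path[-1]
    a = max((-1, 1), key=lambda act: Q[(s, act)])
    greedy_path.append(step(s, a)[0])
```

PQ-learning [49] modifies how value estimates propagate between states, but the tabular update above is the common starting point.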
Autonomous exploration. As mentioned above, children have a natural curiosity about the
world, and set out and explore as soon as they are able. McClain [50] identifies three general
instincts necessary for exploration:
1. The motivation and ability to search for and identify new objects
2. The motivation and ability to interact with objects
3. A survival instinct
Starting with these built-in behaviors, the robot explores its environment looking for objects. It is
particularly interested in objects that it has not seen before. Each time it discovers a new object,
it will approach the object and play with it, first attempting to pick it up and then attempting to
knock it over. The robot will also turn toward loud sounds, under the assumption that it will find
an object of interest in that direction. This work also demonstrates the ability of the robot to run
autonomously for long periods of time in a robust manner.
Speech imitation. From a young age, children learn to speak by mimicking those around them.
We plan to use speech imitation as a vehicle for the robot to learn to speak. Kleffner [48] has
developed a robust method for speech imitation, involving extracting phonetic and phonemic fea-
tures from the sound stream which give an internal representation correlating to the vocal tract
shape, while taking into account the resolution of the human ear. The features that are extracted
can be reused for speech synthesis or combined with features from other modalities for recognition
and learning. Experiments with the robot, described in Chapter 4, use these features for speech
recognition.
Semantic concept learning via association. As noted earlier, one aspect of learning funda-
mental to our work is the idea that learning and recall occur mostly as the association of sensory
input data. The main focus of the rest of this dissertation is the development and use of a cascade
of HMMs for associative learning of semantics. The topic, as it pertains to this dissertation, is
introduced and discussed in more detail in the next section.
In addition to the research described herein, two others in our group have addressed this research
question. Liu [51] developed a system whereby a benevolent teacher would push on a touch sensor
on the robot while speaking a movement command. For example, the teacher might push on a
sensor on the back of the robot and say “forward.” A touch on the rear sensor would “push” the
robot forward (its wheel would straighten and its motor would start running). After a training
period, the robot could, on a speaker-dependent basis, be controlled by voice. In this work the
robot shows the beginnings of a conceptual understanding of commands and directions through
voice and tactile sensors.
Zhu and Levinson [52] also conducted some experiments on scene concept learning. In their
work, they proposed a joint probability density function (JPDF) representation for learning such
visual concepts as color, shape, and object name. Zhu and Levinson’s model was able to successfully
learn labels for 6 color concepts, 3 shape concepts, and 13 object concepts drawn from 15 natural
objects.
In the next section, we give more details and expand on the basic ideas of semantic learning.
1.4 Semantic Learning
1.4.1 Introduction
Let us restate our basic assumptions: first, that language is primarily semantic (that is, it is
concerned mostly with our knowledge of the world); second, that this understanding is gained by
recognizing and learning relationships between or among events and cues in the environment; and
third, that this learning requires the learner to be embodied and situated in the environment. In this
section, we will develop a basic model for learning semantic associations from environmental cues.
We note that our focus is on semantic knowledge gained primarily through repeated stimulation
from the environment, and so, for now, we are ignoring one-shot or fast-map learning [45, 53–55].
1.4.2 General associative memory model for semantic learning
Semantics is meaning. It is our knowledge of the world and how it works. Through evolution and in
our early development, we first learn to understand the world by associating sensory-motor events
and cues. Some examples pointed out in Section 1.3.2.2 include learning what happens when one
touches a hot burner, learning to associate the sight and smell of fire, learning to associate a word
with an event or some other co-occurring cue, or some combination of these.
Regarding learning simply as association agrees closely with behaviorist theories, particularly
with respect to learning the relationship between cues or events and one's own actions. For
animal learning, behaviorism is often the best explanation, and it can describe much of human
behavior as well. How, then, do human and animal behaviors differ?
One important difference is that humans can communicate meaning linguistically, using symbols
[Diagram: a central "concept of apple" node linked to sensory cues (e.g., the spoken word "Apple," a crunch sound) and to other knowledge: facts, stories, experiences, etc.]
Figure 1.4: The concept of apple. The apple concept is associated with the different ways we sense apples, as well as with other related knowledge.
representing concepts.1 The question becomes, can we mimic this behavior? That is, can we build
a system that can learn meaning in a behaviorist manner (i.e., via association) and, in addition,
that can create symbols that can be manipulated and communicated? We think so.
According to Laurence and Margolis [57], “concepts are the most fundamental constructs in
theories of mind” (p. 1). While there is some debate about the definition of concepts, or even
whether they exist [57], a concept is generally defined in terms of the features that are associated
with it, as well as the rules that relate these features [58, p. 409]. Figure 1.4 shows an example,
where the concept of “apple” is associated with the smell, taste, sight, sounds, and feel of an apple,
as well as other related knowledge.
One feature to note about Figure 1.4 is the fact that the concept is represented as a discrete unit.
It does not simply exist as a set of weights connecting two sensory modalities. This formulation
differs from that of many of the models often used to associate different information streams,
where associative relationships are related directly (e.g., Hopfield networks and related work [59,
60], some instantiations of Bayesian Networks [61], and some HMM formulations tying together
multiple sensory modalities, such as fused [62,63] or coupled HMMs [64,65]). Why is this difference
important? Because it allows the concept to be manipulated as a symbol.
Figure 1.5 gives a more abstract illustration of concept connections. Taking the models one
at a time, the visual model independently learns visual concepts of the different objects or other
1As an aside, chimpanzees, dogs, bees, and some other animals may be able to communicate or understand symbols to a limited extent. See, e.g., [45, 56].
[Diagram: sensory inputs feed a visual model and an auditory model, whose outputs combine in a concept model (together forming semantic memory), with output to working memory.]
Figure 1.5: Visual/auditory concept hierarchy. This figure shows how representations from a general auditory and visual model of the world are combined to create a conceptual model of the world.
distinguishable sights in its environment. These concepts could include such things as colors, shapes,
textures, or types of motion, although each of these may be put into a separate model. The audio
model learns concepts from audio cues, including speech. At the lowest level, this might include
environmental sounds and phonemes. The concept model learns frequently co-occurring states or
classifications of the lower models. Learning in all models is unsupervised, although depending
on the model and learning method chosen, models may be initialized with a bias to learn better
or faster or both. Although we do not yet do this, it should also be possible to incorporate
feedback from other models, as well as positive or negative feedback from the environment for
reinforcement type learning. The model can, of course, scale up to include more types of sensory
models.
One necessary condition for effective communication is that the two people communicating
(or in our case, a person and a robot communicating) share a similar set of concepts. Thus, the
learning of concepts can be described as an attempt to learn a model of another person’s knowledge.
Figure 1.6 illustrates this idea graphically, showing an interaction between two subjects, a
person and a robot, each with their own cognitive model of the world. The immediate goal of the
robot is to learn the cognitive model the person is using to understand the immediate environment.
Just learning concepts may be interesting and useful by itself, but as hinted by Figure 1.6, we
do envision this model as simply one part of a more complex model, designed around the cognitive
cycle described by Figure 1.1. The model as presented is very general, so any number of models
could be plugged into the clouds in the figures. For reasons highlighted in Chapter 2, we have
[Diagram: two concept hierarchies, one for the boy and one for the robot, each with visual and auditory models feeding a concept model, sharing the spoken word "Apple" and a common visual stimulus.]
Figure 1.6: Associative learning of the word “apple.” By hearing the boy’s word in response to a shared visual stimulus, the robot can attempt to learn a model of the world compatible with the boy’s model.
chosen to use HMMs for the individual components of the hierarchy. This realization of the model
is presented in Chapter 3.
1.5 Contributions and Layout of Dissertation
Our group is attempting a complex and ambitious project, that of creating the body and mind of
an intelligent robot. It is necessary to stress the collaborative aspect of this project, which has
been quite rewarding, and without which progress would be extremely slow and limited. Within
this collaboration, I have made a significant contribution in three main areas. First, I was heavily
involved with the initial design and development of two of the robotic platforms used by the group.
Second, I was lead designer and developer of a robust system for transparently connecting the
various computing modules. My third area of contribution is the development of an HMM cascade
architecture for concept learning, described in detail in the following chapters. With regard to my
work involving HMMs, my contribution includes
1. noting an extension of the analysis of the recursive maximum-likelihood estimation (RMLE)
algorithm presented by Krishnamurthy and Yin [66] to finite-alphabet HMMs (their analysis
applies specifically to observations with continuous densities) (Section 2.3);
2. giving experimental results for various modifications of the RMLE algorithm (Section 2.3.4);
3. proposing and analyzing an HMM cascade architecture for learning associations among mul-
tiple observation streams, including arguments extending RMLE convergence analysis to our
proposed cascade architecture and experimental evaluation (Chapter 3);
4. implementing and using the above-mentioned cascade model for learning semantic concepts
on our robot (Chapter 4); and
5. deriving a version of RMLE for hidden semi-Markov models (HSMMs) (Appendix D).
In the previous sections of this introduction, we described the somatic and cognitive framework we
use; we now note how the work of this dissertation fits into that framework.
Within the somatic system, we are using the existing hardware and software framework. In
particular, our work runs on the base platform described in Section 1.3.1.4, using the system of cameras,
microphones, and touch sensors described therein. For visual processing we use feature extraction
developed by R. S. Lin, described in Appendix B. We also use the sound source localization scheme
developed by D. Li, and audio feature extraction developed by M. Kleffner, both mentioned above.
Kleffner’s work is directly relevant to our work, and is therefore described in Appendix B. All of
these components are connected by the distributed communications framework developed mostly
by myself, described in the same Appendix in Section B.2.
For cognitive modeling, our work focuses on semantic concept learning using stochastic models,
similar to, but improving upon, the work by Q. Liu and W. Zhu described in Section 1.3.2.4. Our
work is built on top of autonomous exploration work by M. McClain, described in the same section.
The rest of this dissertation is organized as follows. Chapter 2 describes HMMs, and introduces
the RMLE algorithm for learning model parameters. HMMs and the RMLE algorithm are key
components of our composite HMM-based associative memory. Chapter 3 describes the theory and
gives simulation results for this associative memory, and Chapter 4 describes the experiments we
have run on our robot using this model. In Chapter 5, we summarize and discuss the significance
of our work. The appendices contain a wealth of additional information, including details of the
aforementioned robotic hardware (Appendix A) and software (Appendix B), discussion of standard
algorithms used with HMMs (Appendix C), definition and derivation of the RMLE algorithm for
HSMMs (Appendix D), some additional RMLE derivations (Appendix E), and some matrix calculus
used in some of our derivations (Appendix F).
CHAPTER 2
HIDDEN MARKOV MODELS AND
THE RMLE ALGORITHM
2.1 Introduction
In Section 1.4, we described a hierarchical structure for modeling concepts. The structure is generic
enough that a variety of models could be used throughout the structure, even in a heterogeneous
manner. Our work focuses on the use of HMMs in this hierarchy.
An HMM is a discrete-time stochastic process with two components, {Xn, Yn}, where (i) {Xn}
is a finite-state Markov chain, and (ii) given {Xn}, {Yn} is a sequence of conditionally independent
random variables. The conditional distribution of Yk depends on {Xn} only through Xk. The name
hidden Markov model arises from the assumption that {Xn} is not observable, and so its statistics
can only be ascertained from {Yn}.
HMMs have many interesting features that we believe can be easily exploited for concept learn-
ing. As noted previously, concepts are formed from the correlation in time among events. HMMs
by construction have a notion of sequence, and have proven quite effective at learning time series
and spatial models in such areas as speech processing [67] and computational biology [68–70]. This
characteristic of HMMs provides a useful starting point for learning time correlation.
Another property of HMMs useful for learning concepts is their ability to discover structure in
input data. Cave and Neuwirth [71] demonstrated this capability by training a low-order ergodic
HMM on text. They found that the states of the model represented broad categories of letters,
discovering some of the underlying structure of the text. Poritz [72] developed a similar model for
speech data, and Ljolje and Levinson [73] created a speech recognizer based on this type of model.
Our hierarchical model exploits this natural capability of HMMs to discover structure in order to
learn higher level concepts.
Finally, in addition to their familiar role as recognizers, HMMs can be used in a generative
capacity. In particular, when placed in a hierarchy, we can drive the various HMMs to produce
sequences of states and corresponding output, roughly simulating thoughts and actions.
Some characteristics of HMMs are not as useful for our work, however. Two of the most common
methods used for HMM parameter estimation, the Baum-Welch method and methods based on the
Viterbi algorithm, both require off-line processing of large amounts of data. (See Appendix C for
details on these algorithms.) For our goal of learning concepts in real time using a robot, these
methods are not very useful. We would much prefer an iterative or on-line training procedure.
There are generally two approaches researchers have used to implement on-line training for
HMMs. The first minimizes the prediction error of the model via recursive methods. This approach
was first suggested by Arapostathis and Marcus [74], who proposed a recursive Gauss-Newton algo-
rithm and a general recursive stochastic gradient algorithm, although they only treat the learning
of transition probabilities in finite-alphabet HMMs. Collings et al. [75] present a similar technique
for when the observations for each state have a Gaussian distribution. They treat both transition
probability and observation mean estimation, though they do not estimate variances. LeGland and
Mevel [76] prove convergence of the recursive conditioned least squares estimator (RCLSE), which
is a generalization of the approach in [74] to the case of observations in Rd.
The other approach used to implement on-line training in HMMs is to maximize the Kullback-
Leibler information between the estimated model and true model, or equivalently, to maximize
the likelihood of the estimated model for an observation sequence. Holst and Lindgren [77] were
the first to propose an RMLE algorithm for HMMs. Krishnamurthy and Moore [78] derive an
on-line algorithm based on sequential expectation maximization (EM) schemes which minimize the
Kullback-Leibler information. In both of these papers, convergence was shown only in simulation.
Ryden [79] provides convergence analysis for a general class of batch-iterative recursive maximum-
likelihood estimators. Independently, LeGland and Mevel [76,80] suggest and prove the convergence
of RMLE, and compare it to the RCLSE (mentioned above). Krishnamurthy and Yin [66] extend
the RMLE results of [76] to autoregressive models with Markov regime, and add a number of
results on convergence, rate of convergence, model averaging, and parameter tracking. Because
they offer the most complete results, our RMLE implementation for HMMs (and the explanation
of the algorithm in this chapter) is based mostly on [66].
For the remainder of this section, we will formulate our model, derive the RMLE algorithm
for HMMs, sketch the proof of convergence given by Krishnamurthy and Yin [66], and discuss a
number of HMM training results using the algorithm. While the main purpose of this section is
to establish use of these algorithms in our cascade model in the next chapter, we will also provide
some analysis and commentary, including a discussion at the end of this chapter on why HMMs are
better Bayesian classifiers.
2.2 Model Description and Notation
An HMM is a discrete-time stochastic process with two components, {Xn, Yn}, defined on probability space (Ω, F, P). Let {Xn}, n ≥ 1, be a discrete-time first-order Markov chain with state space R = {1, . . . , r}, with r a fixed, known constant. The model starts in a particular state i = 1, . . . , r with probability πi = P(X1 = i). Define π ∈ Π by π = {πi}, where Π is the set of length-r stochastic vectors. For i, j = 1, . . . , r, the transition probabilities of the Markov chain are given by

aij = P(Xn = j | Xn−1 = i). (2.1)

Let A = {aij}. Then A ∈ A, where A is the set of all r × r stochastic matrices.
In an HMM, {Xn} is not visible, and its statistics can only be ascertained from a corresponding observable stochastic process, {Yn}. The process {Yn} is a probabilistic function of {Xn}; i.e., given Xn, Yn takes values from some space E according to a conditional probability distribution. The corresponding conditional density of Yn is generally assumed to belong to a parametric family of densities {b(·; θ) : θ ∈ Θ}, where the density parameter θ is a function of Xn, and Θ is the set of valid parameters for the particular conditional density assumed by the model. The conditional density of Yn given Xn = j can be written b(·; θj), or simply bj(·) when the explicit dependence on θj is understood.
Example 2.1. (Gaussian observation density): Suppose the observation density for each state in an HMM is described by a univariate Gaussian distribution. Then the parameter set is Θ = {(µ, σ) ∈ R × (0, ∞)}, θj ∈ Θ, and {Yn = yn} is a sequence of continuously valued, conditionally independent outputs on R, each with probability density

b(yn; θj) = b(yn; µj, σj) = [1/(√(2π) σj)] exp[−(yn − µj)² / (2σj²)] (2.2)

for Xn = j.
Example 2.2. (Finite-alphabet observation density): Suppose observations Yn are drawn from a finite set of symbols V = {vk}, k = 1, . . . , s. Then Θ = {(b1, . . . , bs) : ∑k bk = 1, bk ≥ 0} is the set of length-s stochastic vectors, θj ∈ Θ, and {Yn = yn} is a sequence of symbols drawn from a finite alphabet, each yn having probability

b(yn; θj) = bjk, where yn = vk, (2.3)

for Xn = j.
For simplicity, the last two examples and the following discussion assume Yn to be scalar valued,
although the formulation easily generalizes to vector-valued observations.
Conceptually, it is useful to think of {Yn} as being generated by a hidden Markov process. When Xn = j, the observation Yn is generated using

Yn = g(en; θj)|Xn=j, (2.4)

where g(·; θ) is a real-valued function on R indexed by θ ∈ Θ, and {en} is a sequence of independent and identically distributed (iid) random variables. This formulation is equivalent to a Monte Carlo simulation, where g(·; θ) could be, for example, the inverse of the cumulative distribution function (CDF) corresponding to observation density b(·; θ), and {en} a sequence of uniform random variables distributed on [0, 1]. Other formulations, of course, are possible.
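To make Equation (2.4) concrete, the following sketch (ours, not part of the dissertation's software; the function name and the NumPy dependency are our own choices) generates a state and observation sequence from a Gaussian-output HMM:

```python
import numpy as np

def sample_hmm(pi, A, mu, sigma, n, seed=None):
    """Generate n observations from a Gaussian-output HMM.

    Realizes Y_k = g(e_k; theta_j)|X_k=j of Eq. (2.4): here g for state j
    is the inverse Gaussian CDF, applied implicitly by drawing from
    N(mu_j, sigma_j^2) directly.
    """
    rng = np.random.default_rng(seed)
    r = len(pi)
    states = np.empty(n, dtype=int)
    obs = np.empty(n)
    x = rng.choice(r, p=pi)                   # X_1 ~ pi
    for k in range(n):
        states[k] = x
        obs[k] = rng.normal(mu[x], sigma[x])  # Y_k | X_k = x
        x = rng.choice(r, p=A[x])             # X_{k+1} | X_k = x
    return states, obs
```

Any other observation density fits the same pattern: only the per-state draw changes.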
For later analysis, it will be convenient to collect model parameters together in a single parameter vector. Define the HMM parameter space as Φ = Π × A × Θ. The model ϕ ∈ Φ is then defined as

ϕ = (π1, . . . , πr, a11, a12, . . . , arr, θ1, . . . , θr). (2.5)
The model parameters for a particular model are accessed via coordinate projections, e.g., aij(ϕ) =
aij . In some cases (such as when considering the RMLE algorithm below), we will not be concerned
with estimating π. In that case, Φ = A× Θ, and ϕ changes accordingly. Note that the literature
occasionally describes other model parameterizations (see, e.g., [75, 77]).
Example 2.3. For Example 2.1 above,
ϕ = (π1, ..., πr, a11, a12, ..., arr, µ1, σ1, ..., µr, σr).
Let p be the length of ϕ. When estimating model parameters, let ϕ∗ ∈ Φ be the fixed set of
“true” parameters of the model we are trying to estimate.
For a vector or matrix v, v′ represents its transpose. Define the r-dimensional column vector
b(yn;ϕ) and r × r matrix B(yn;ϕ) by
b(yn;ϕ) = [b1(yn; θ1(ϕ)), ..., br(yn; θr(ϕ))]′ (2.6)
and
B(yn;ϕ) = diag[b1(yn; θ1(ϕ)), ..., br(yn; θr(ϕ))]. (2.7)
Vector b(yn;ϕ) and matrix B(yn;ϕ) give the observation density evaluated at yn for each state (in
model ϕ), as a vector and diagonal matrix, respectively.
Using the definitions above, it can be shown (see, e.g., [81]) that the likelihood of the sequence of observations 〈y1, . . . , yn〉 for model ϕ is given by

pn(y1, . . . , yn; ϕ) = π(ϕ)′B(y1; ϕ) ∏k=2..n [A(ϕ)B(yk; ϕ)] 1r, (2.8)

where 1r refers to the r-length vector of ones.
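Equation (2.8) can be evaluated directly as a chain of small matrix-vector products. The following sketch (our own illustration, not the dissertation's code; the function names are hypothetical) does so for the univariate Gaussian case of Example 2.1:

```python
import numpy as np

def gaussian_b(y, mu, sigma):
    """Vector b(y; phi): per-state Gaussian densities of Eq. (2.2)."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def likelihood(ys, pi, A, mu, sigma):
    """Eq. (2.8): p_n = pi' B(y_1) prod_{k=2}^n [A B(y_k)] 1_r.

    The running row vector v holds pi' B(y_1) A B(y_2) ... A B(y_k).
    """
    v = pi * gaussian_b(ys[0], mu, sigma)       # pi' B(y_1)
    for y in ys[1:]:
        v = (v @ A) * gaussian_b(y, mu, sigma)  # right-multiply by A B(y_k)
    return v.sum()                              # final product with 1_r
```

For long sequences this direct product underflows, which is one practical reason to prefer the normalized prediction-filter recursion developed in Section 2.3.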
2.3 Recursive Maximum-Likelihood Estimation of HMM
Parameters
Maximum-likelihood estimation (MLE) is formally defined as follows. For observation sequence 〈y1, . . . , yn〉, find

ϕ̂ = arg maxϕ∈Φ pn(y1, . . . , yn; ϕ), (2.9)

where ϕ̂ is the most likely estimate of the true underlying parameters ϕ∗. The recursive maximum-likelihood estimation (RMLE) algorithm defined here is an iterative, stochastic gradient solution to this problem.
2.3.1 RMLE derivation
The derivation of the RMLE algorithm for HMMs proceeds as follows. We first show how to
calculate the likelihood pn(y1, . . . , yn;ϕ) for a given HMM model recursively, using prediction (or
forward) filters. We note that maximizing log pn(y1, . . . , yn;ϕ) is equivalent to and generally easier
than maximizing pn(y1, . . . , yn;ϕ) [82], and that log pn(y1, . . . , yn;ϕ) can also be calculated recur-
sively. We can then search for the maximum of log pn(y1, . . . , yn;ϕ) using the derivative of the
update of this recursion.
For the results of this section to hold, it is necessary to assume various conditions on periodicity,
continuity, and ergodicity for the model. For simplicity, we will assume that all necessary conditions
hold and will introduce them in the next section.
Define the prediction filter as

un(ϕ) = [un1(ϕ), . . . , unr(ϕ)]′, (2.10)

where

uni(ϕ) = P(Xn = i | y1, . . . , yn−1) (2.11)

is the probability of being in state i at time n given all previous observations. Using this filter, the likelihood pn(y1, . . . , yn; ϕ) can be written as

pn(y1, . . . , yn; ϕ) = ∏k=1..n b(yk; ϕ)′uk(ϕ). (2.12)

(For this derivation, see Appendix E, Section E.1.)

The value of un(ϕ) can be calculated recursively as

un+1(ϕ) = A(ϕ)′B(yn; ϕ)un(ϕ) / [b(yn; ϕ)′un(ϕ)], (2.13)

when initialized by u1(ϕ) = π(ϕ).
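In code, one step of Equation (2.13) amounts to a matrix-vector product followed by a normalization. A minimal sketch (ours; the function name and NumPy dependency are assumptions):

```python
import numpy as np

def filter_step(u, by, A):
    """One step of Eq. (2.13): u_{n+1} = A' B(y_n) u_n / (b(y_n)' u_n).

    u  -- current prediction filter u_n (length-r probability vector)
    by -- b(y_n; phi), per-state observation densities evaluated at y_n
    A  -- r x r row-stochastic transition matrix
    """
    num = A.T @ (by * u)   # A' B(y_n) u_n, with B(y_n) = diag(by)
    return num / (by @ u)  # normalize by b(y_n)' u_n
```

Because each row of A sums to one, the entries of un+1 sum to one whenever un does, so the recursion never underflows the way the raw likelihood product of Equation (2.8) can.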
Let w(l)n(ϕ) = (∂/∂ϕl)un(ϕ) be the partial derivative of un(ϕ) with respect to (wrt) the lth component of ϕ. Each w(l)n(ϕ) is an r-length column vector, and

wn(ϕ) = (w(1)n(ϕ), w(2)n(ϕ), . . . , w(p)n(ϕ)) (2.14)

is an r × p matrix. Taking the derivative of un+1(ϕ) from Equation (2.13),

w(l)n+1(ϕ) = (∂/∂ϕl)un+1(ϕ) = R1(yn, un(ϕ), ϕ)w(l)n(ϕ) + R(l)2(yn, un(ϕ), ϕ), (2.15)

where

R1(yn, un(ϕ), ϕ) = A(ϕ)′[I − B(yn; ϕ)un(ϕ)1′r / (b(yn; ϕ)′un(ϕ))] B(yn; ϕ) / (b(yn; ϕ)′un(ϕ)), (2.16)

R(l)2(yn, un(ϕ), ϕ) = A(ϕ)′[I − B(yn; ϕ)un(ϕ)1′r / (b(yn; ϕ)′un(ϕ))] [∂B(yn; ϕ)/∂ϕl]un(ϕ) / (b(yn; ϕ)′un(ϕ))
    + [∂A(ϕ)′/∂ϕl]B(yn; ϕ)un(ϕ) / (b(yn; ϕ)′un(ϕ)). (2.17)

Using these equations, we can recursively calculate wn(ϕ) at every iteration.
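The updates in Equations (2.15) through (2.17) can be sketched as a single function (our own illustration; the argument names are our choices, and the derivative matrices ∂B/∂ϕl and ∂A/∂ϕl must be supplied by the caller for the parameter ϕl of interest):

```python
import numpy as np

def deriv_filter_step(w_l, u, by, dby_l, A, dA_l):
    """One step of Eq. (2.15): w^(l)_{n+1} = R1 w^(l)_n + R2^(l).

    w_l   -- current derivative filter w^(l)_n (length r)
    u     -- prediction filter u_n
    by    -- b(y_n; phi), per-state densities at y_n
    dby_l -- per-state derivative (d/d phi_l) b(y_n; phi)
    A     -- transition matrix A(phi)
    dA_l  -- elementwise derivative (d/d phi_l) A(phi)
    """
    denom = by @ u                       # b(y_n)' u_n
    P = np.eye(len(u)) - np.outer(by * u, np.ones(len(u))) / denom
    # Eq. (2.16): R1 = A' P B / (b'u), applied to w^(l)_n
    r1_term = A.T @ (P @ (by * w_l)) / denom
    # Eq. (2.17): R2 = A' P [dB] u / (b'u) + [dA'] B u / (b'u)
    r2_term = A.T @ (P @ (dby_l * u)) / denom + dA_l.T @ (by * u) / denom
    return r1_term + r2_term
```

For instance, taking ϕl = a11, dA_l is the indicator matrix with a one in position (1, 1) and dby_l is zero.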
For a set of observations 〈y1, . . . , yn〉, we would like to find the maximum of pn(y1, . . . , yn; ϕ). Equivalently, we can maximize log pn(y1, . . . , yn; ϕ). Define the log-likelihood of observations 〈y1, . . . , yn〉 as

ℓn(ϕ) = [1/(n + 1)] log pn(y1, . . . , yn; ϕ). (2.18)

Using Equation (2.12), we can rewrite this as

ℓn(ϕ) = [1/(n + 1)] ∑k=1..n log[b(yk; ϕ)′uk(ϕ)]. (2.19)
To estimate the set of optimal parameters ϕ∗, we want to find the maximum of ℓn(ϕ), which we will attempt via recursive stochastic approximation. For each parameter l in ϕ, at each time n, we take (∂/∂ϕl) of the most recent term inside the summation in Equation (2.19), to form an “incremental score vector”

S(Yn; ϕ) = (S(1)(Yn; ϕ), . . . , S(p)(Yn; ϕ))′ (2.20)

with

S(l)(Yn; ϕ) = (∂/∂ϕl) log[b(yn; ϕ)′un(ϕ)]
    = {b(yn; ϕ)′[(∂/∂ϕl)un(ϕ)] + [(∂/∂ϕl)b(yn; ϕ)]′un(ϕ)} / [b(yn; ϕ)′un(ϕ)]
    = {b(yn; ϕ)′w(l)n(ϕ) + [(∂/∂ϕl)b(yn; ϕ)]′un(ϕ)} / [b(yn; ϕ)′un(ϕ)], (2.21)

where

Yn ≜ (Yn, un(ϕ), wn(ϕ)). (2.22)
The RMLE algorithm takes the form

ϕn+1 = ΠG(ϕn + εnS(Yn; ϕn)), (2.23)

where {εn} is a sequence of step sizes satisfying εn ≥ 0, εn → 0, and ∑n εn = ∞; G is a compact and convex set (here, G ⊆ Φ, the set of all valid parameter sets ϕ); and ΠG is a projection onto set G. The purpose of the projection is generally to ensure valid probability distributions and maintain all necessary conditions. Note that Equation (2.23) is a gradient update rule, with constraints.
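Restricted to the transition probabilities, one update of Equation (2.23) can be sketched as follows (our own code; we approximate the projection ΠG by flooring and renormalizing each row, rather than computing an exact Euclidean projection):

```python
import numpy as np

def project_rows(A, floor=1e-6):
    """Approximate projection onto row-stochastic matrices with entries
    >= floor (the floor keeps probabilities strictly positive)."""
    A = np.clip(A, floor, None)
    return A / A.sum(axis=1, keepdims=True)

def rmle_step_A(A, score_A, eps):
    """One step of Eq. (2.23) for the transition matrix:
    A_{n+1} = Pi_G(A_n + eps_n * S), with Pi_G restoring stochasticity."""
    return project_rows(A + eps * score_A)
```

In a full implementation, the same step is applied jointly to all components of ϕ, with the projection for each block chosen to preserve its constraints (stochastic rows, positive variances, and so on).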
Equations (2.17) and (2.21) can both be simplified for each type of parameter in ϕ. Appendix E
contains derivations of both equations for
1. transition probabilities aij(ϕ),
2. observation probabilities bjk(ϕ) when assuming observations from a finite alphabet,
3. mean vector µj(ϕ) and covariance matrix Σj(ϕ) when assuming continuous observations
taken from a multidimensional Gaussian distribution, and
4. upper triangular matrix Rj(ϕ), where Rj(ϕ)′Rj(ϕ) = Σj(ϕ) above. This derivation is
included for mathematical convenience and is the one we use in our implementation, as it
greatly simplifies the calculation of Σj(ϕ)−1 and |Σj(ϕ)|.
2.3.2 Convergence
For both the derivation above and the proof of convergence below, we assume the following condi-
tions (from [66]) hold.
Condition 2.1. The transition probability matrix A(ϕ∗) is aperiodic and irreducible (see [83]).
Condition 2.2. The mapping ϕ → A(ϕ) is twice differentiable with bounded first and second
derivatives and Lipschitz continuous second derivative. For any yk, the mapping ϕ → b(yk;ϕ) is
three times differentiable, and the function b(yk; θ) is continuous on R for every θ ∈ Θ. Alternatively,
for yk drawn from a finite alphabet, the mapping ϕ → b(yk;ϕ) is twice differentiable with bounded
first and second derivatives and Lipschitz continuous second derivative.
Condition 2.3. Under Pϕ∗, the extended Markov chain {Xn, Yn, un(ϕ), wn(ϕ)} is geometrically ergodic1 (see [66, 83] for the proof when b(yn; θ) is continuous, and [74, 80] for the proof when the observations yn are drawn from a finite alphabet).
Because of this geometric ergodicity, the initial values of u0(ϕ) and w0(ϕ) are forgotten expo-
nentially fast, and are therefore asymptotically unimportant in the analysis of the algorithm.
Note 2.1. For the case of observations from a finite alphabet, our conditions and assumptions
above did not appear in [66]. However, geometric ergodicity was shown for this case in both [74]
(for a special case of models with observations from a finite alphabet) and [80] for a more general
case. The proof in [80] assumes only that the transition probabilities are being updated, but can
be generalized to include maximum-likelihood estimation of all model parameters. By introducing
these assumptions, the following proof by Krishnamurthy and Yin can then be extended to apply to
HMMs with finite observation alphabets.
Krishnamurthy and Yin [66] analyze the convergence and rate of convergence of the RMLE
algorithm described above. Their proofs use an ordinary differential equation (ODE) approach,
which relates the discrete-time iterations of the RMLE algorithm to an ODE, and then proves
convergence of the ODE. The general theory of this method is given in [85]. Here we will sketch
their convergence proof. For full details, see [66].
1For Markov chains, ergodicity means that the ensemble statistics of the states approach the stationary distribution of the chain as n → ∞. Geometric ergodicity means that the ensemble statistics approach the stationary distribution geometrically fast. See [84].
The general idea of the proof is to treat the sequence of parameter estimates {ϕn} as finite-difference estimates to a projected ODE, that is, an ODE whose dynamics are projected onto a constraint set G. In our case, G is the set of constraints necessary to maintain stochasticity of the transition matrix A(ϕ), and of the observation probability matrix {bjk} in the case of observations from a finite alphabet. They then show that the set of limit points of this ODE is {ϕ∗}.
First, note that if log[b(yk; ϕ)′uk(ϕ)] is locally Lipschitz and assuming Conditions 2.1 through 2.3, there exists a finite ℓ(ϕ) such that

ℓn(ϕ) → ℓ(ϕ), Pϕ∗-w.p. 1 as n → ∞.

That is, ℓn(ϕ) converges to a limit ℓ(ϕ), and the update algorithm we derived in the previous section is attempting to find parameters ϕ which maximize ℓ(ϕ). Moreover, this maximum is also a minimum of the Kullback-Leibler information, which is defined as

K(ϕ) = −[ℓ(ϕ) − ℓ(ϕ∗)] ≥ 0.

Thus, maximizing ℓn(ϕ) is equivalent to minimizing K(ϕ). Let LML be the set of global minima of K(ϕ) (see [86]), given by

LML = arg minϕ∈Φ K(ϕ).

Clearly, ϕ∗ ∈ LML.
Rewrite Equation (2.23) as

ϕn+1 = ϕn + εnS(Yn; ϕn) + εnMn, (2.24)

where Mn is a projection or correction term; i.e., it is the vector of shortest length necessary to bring ϕn + εnS(Yn; ϕn) back to the constraint set G. Consider a piecewise-constant interpolation of {ϕn}. According to the Arzelà-Ascoli theorem (see [85], p. 101), we can extract a convergent subsequence whose limit satisfies an ODE projected onto G.

Consider the projected ODE

ϕ̇ = H(ϕ) + m, ϕ(0) = ϕo, (2.25)

where H(ϕ) = (∂/∂ϕ)K(ϕ) and m is the force or constraint term needed to keep ϕ(·) ∈ G. Let LG = {ϕ : ϕ is a limit point of (2.25), ϕ ∈ G}. A set A ⊂ G is locally asymptotically stable (in the
sense of Lyapunov) for Equation (2.25), if for each δ > 0 there is a δ1 > 0 such that all trajectories
starting in Nδ1(A) never leave Nδ(A) and ultimately stay in Nδ1(A), where Nη(A) denotes an η
neighborhood of A.
Assume the following conditions.
Condition 2.4. For each ϕ ∈ G, S(Yj ;ϕ) is uniformly integrable, E[S(Yj ;ϕ)] = H(ϕ) =
(∂/∂ϕ)K(ϕ), H(ϕ) is continuous, and S(Y ; ·) is Lipschitz continuous for each Y .
Condition 2.5. Let L1G ⊂ LG, and suppose that LML is locally asymptotically stable. For any initial condition ϕo ∉ L1G, the trajectory of Equation (2.25) goes to LML.

Theorem 2.4. Assuming Conditions 2.1 through 2.4, there is a convergent subsequence of {ϕn} that satisfies the projected ODE in Equation (2.25), and {ϕn} converges to an invariant set of the ODE in G. Further assume Condition 2.5. Then the limit points of the projected ODE are in L1G ∪ AG w.p. 1. In particular, if L1G ∪ AG = {ϕ∗}, and {ϕn} visits a compact set in the domain of attraction of L1G ∪ AG infinitely often, then ϕn → ϕ∗ w.p. 1.

Proof omitted. See [66].
Krishnamurthy and Yin [66] also provide a rate-of-convergence analysis, examining the dependence
of the estimation error (ϕn − ϕ*) on the step size εn, as well as the behavior of the fixed-step-size
algorithm, where ε is held constant. Because of this dependence between estimation error and step
size, choosing a good step size is an important consideration when implementing the algorithm.
One way to diminish the dependence on the step size is to use averaging to give more accurate
estimations. The next section discusses this idea briefly.
2.3.3 Model averaging and tracking
As can be seen from the first column of Figure 2.1 (p. 36), oscillation can be a problem when trying
to learn a model using a fixed ε in the update procedure, depending on the size of ε. If εn is chosen
to be, say, 1/n, convergence is guaranteed, but will be quite slow, and undesirable oscillations may
still be a problem. Ideally, we would want to choose a step size which would allow learning at
an optimal rate, although this is not an easy task. In this context, Kushner and Yin [85] suggest
that averaging reduces the need to choose an optimal form for εn. (See [85], Chapter 11, for more
discussion on this topic.)
Krishnamurthy and Yin [66] suggest averaging both the iterates (i.e., ϕn) and the observations
(as measured by the score S(Y; ϕ)). This averaging takes the form

ϕn+1 = ΠG(ϕn + εn S̄n), (2.26)

ϕ̄n+1 = ϕ̄n − (1/(n+1)) ϕ̄n + (1/(n+1)) ϕn+1, (2.27)

S̄n+1 = S̄n − (1/(n+1)) S̄n + (1/(n+1)) Sn+1, (2.28)

with εn = 1/n^γ, 0.5 ≤ γ ≤ 1, where ϕ̄n and S̄n denote the averaged iterates and scores. In [66],
Krishnamurthy and Yin provide convergence, asymptotic optimality, and asymptotic normality
proofs for the modified algorithm.
These formulas can also be modified to work with a “fixed history” by replacing n in Equa-
tions (2.26)-(2.28) with a fixed constant k or, alternatively, min(n, k). Numerical simulations of the
original, averaging, and fixed history algorithms appear in the next section.
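A minimal sketch of the averaged recursions (2.26)-(2.28), with the fixed-history variant as an option. The score function and projection are placeholders supplied by the caller, and the choice of which quantities carry the averaging follows our reading of [66], so treat the details as illustrative.

```python
import numpy as np

class AveragedRMLE:
    """Sketch of the iterate/observation averaging of Equations (2.26)-(2.28).

    `project` stands in for Pi_G; `k` is the optional fixed history:
    replacing n with min(n, k) turns the running averages into
    fixed-history averages (k = 1 recovers no averaging).
    """
    def __init__(self, phi0, score_fn, project, gamma=0.5, k=None):
        self.phi = phi0.copy()       # raw iterate phi_n
        self.phi_bar = phi0.copy()   # averaged iterate
        self.s_bar = None            # averaged score
        self.score_fn = score_fn     # (y, phi) -> S(y; phi)
        self.project = project
        self.gamma = gamma
        self.k = k
        self.n = 0

    def step(self, y):
        self.n += 1
        eps = 1.0 / self.n ** self.gamma
        s = self.score_fn(y, self.phi)
        if self.s_bar is None:
            self.s_bar = s.copy()
        # (2.26): update the raw iterate using the averaged score
        self.phi = self.project(self.phi + eps * self.s_bar)
        # effective history length: n, or min(n, k) for fixed history
        m = self.n if self.k is None else min(self.n, self.k)
        # (2.27)/(2.28): running averages of iterates and scores
        self.phi_bar += (self.phi - self.phi_bar) / (m + 1)
        self.s_bar += (s - self.s_bar) / (m + 1)
        return self.phi_bar

# One deterministic step on a toy scalar problem: score = (target - phi).
rmle = AveragedRMLE(np.array([0.0]),
                    lambda y, phi: np.array([y - phi[0]]),
                    project=lambda p: p)
phi_bar = rmle.step(2.0)
```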
Various sources [66, 78] suggest using a fixed ε for tracking. An analysis of the RMLE algorithm
for tracking slowly varying HMM parameters also appears in [66], and we give some examples of
training with fixed ε in the next section.
2.3.4 Numerical simulations
In this section, we present a number of Monte-Carlo simulations to demonstrate the RMLE algo-
rithm under various model configurations. In the first simulation, the observations in each model
come from a one-dimensional Gaussian distribution. In the second simulation, observations are
drawn from a distribution over a finite alphabet. The third simulation uses two-dimensional Gaussian
observation densities to show how the model converges when we use a model with a large
number of states to learn from data produced by a smaller model. This setup may be useful if,
for example, we know the general extents of our data, but do not know the underlying number of
states or have a good way of initializing the model.
Implementation notes. The update formula for the parameters derived in Section 2.3.1 required
a projection term to keep the updated parameters within their constraints (i.e., at a minimum, to
maintain the stochasticity of the probability transition matrix A(ϕ) and, for observations from
a finite alphabet, the observation probabilities bjk(yn;ϕ)). In the literature, the general suggestion
has been to parameterize each length-r row of the transition matrix with r − 1 variables. For
example, in A(ϕ), each off-diagonal entry aij, i ≠ j, is kept as a free parameter, and each diagonal
entry aii is parameterized as aii = 1 − Σ_{j≠i} aij. For transition probabilities, this might be
reasonable, as self-transitions often have different meaning than other transitions (see Appendix D
for a discussion on this topic). However, in general, this type of parameterization leads to some
undesirable behavior during training, because every change in an off-diagonal entry causes an
equal and opposite change in the diagonal entry, while the rest of that row of the matrix is
unaffected. This problem is especially evident for finite-alphabet observation
probabilities, where the parameterized variable generally has no special meaning.
Our solution was to avoid this parameterization altogether and instead use Lagrange
multipliers to maintain the stochasticity constraints in the mapping ΠG. The perhaps unintuitive
result is that the mapping which brings the modified parameters to the closest point within the
constraint space simply subtracts the same amount from each parameter in aj·, taking care, of
course, that no parameter becomes less than some ε > 0. We note that, before adopting Lagrange
multipliers, we initially tried simply rescaling the parameters; this is incorrect and prevented the
model from converging.
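The projection just described can be sketched as follows: subtract a common amount from the free entries of a row, pinning any entry that would drop below the floor and redistributing the remainder. The iterative redistribution is our implementation choice; the dissertation only states the uniform-subtraction result.

```python
import numpy as np

def project_row(a, floor=1e-6):
    """Project a parameter row onto {a : sum(a) = 1, a_i >= floor}.

    The Euclidean projection subtracts the same amount from every free
    entry; entries that would fall below `floor` are pinned there and
    the remaining excess is redistributed over the still-free entries.
    """
    a = np.asarray(a, dtype=float).copy()
    free = np.ones(a.shape, dtype=bool)
    while True:
        # amount to subtract uniformly from each free entry
        excess = (a[free].sum() + floor * (~free).sum() - 1.0) / free.sum()
        a[free] -= excess
        pinned = free & (a < floor)
        if not pinned.any():
            break
        a[pinned] = floor
        free &= ~pinned
    return a
```

For a row that is merely unnormalized, this reduces to subtracting the same constant from every entry, which is exactly the mapping described in the text.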
2.3.4.1 Gaussian observations
For the first test, we generated data from a simple two-state model, with transition matrix

A = [ 0.9  0.1 ; 0.1  0.9 ]

and observations generated from Gaussians with parameters

μ = [ −1.0, 1.0 ]′,  σ = [ 0.6, 0.9 ]′.
For training, we tested both fixed step sizes (ε = 0.006, 0.003, 0.001) and decreasing
step sizes (εn = ε0/n^γ, with ε0 = 0.1, 0.3, 0.5 and 0.5 ≤ γ ≤ 1). We also varied the aver-
aging history, replacing n in Equations (2.26)-(2.28) with a fixed history of min(n, k) for k =
1, 10, 1000, 10 000, ∞, where k = 1 implies no averaging, and k = ∞ implies averaging from time
0 (i.e., min(n, k) = n).

Table 2.1: Simulation results for various combinations of learning rate ε and averaging history k.
All values were measured at 50 000 iterations, over 50 runs. Each entry gives the mean and, in
parentheses, the standard deviation of the measured value. Original model values are given in
Section 2.3.4.1.

Algorithm                  a11              a22              μ1                μ2               σ1               σ2
ε = 0.001, k = 1           0.8280 (0.7423)  0.8405 (0.7848)  −1.0002 (0.4673)  0.9791 (0.5371)  0.6059 (0.3329)  0.9044 (0.3734)
ε = 0.003, k = 1           0.8766 (0.2210)  0.8600 (0.6537)  −0.9917 (0.3211)  1.0097 (0.2758)  0.6085 (0.2683)  0.8971 (0.1702)
ε = 0.006, k = 1           0.8960 (0.0970)  0.8909 (0.1139)  −1.0029 (0.1652)  1.0046 (0.1656)  0.6039 (0.1378)  0.8986 (0.0897)
ε = 0.5/n^0.5, k = 1       0.8693 (0.5820)  0.8758 (0.6631)  −0.8636 (3.5216)  0.8650 (3.5274)  0.6360 (0.7999)  0.8882 (0.6351)
ε = 0.5/n^0.6, k = 1       0.8983 (0.0959)  0.8955 (0.0917)  −0.9613 (1.9682)  0.9587 (1.9820)  0.6074 (0.3351)  0.8913 (0.3208)
ε = 0.1/n^0.5, k = 1       0.8996 (0.0632)  0.8988 (0.0703)  −1.0008 (0.0833)  0.9966 (0.0983)  0.5977 (0.0905)  0.8996 (0.0534)
ε = 0.3/n^0.5, k = ∞       0.8754 (0.6096)  0.8181 (1.3387)  −0.7771 (3.5573)  0.7368 (3.3679)  0.7617 (1.7868)  1.0405 (1.8286)
ε = 0.3/n^0.5, k = 10 000  0.9020 (0.3473)  0.8037 (1.7091)  −0.8623 (2.3053)  0.9317 (1.6149)  0.6942 (1.6433)  1.0418 (2.3224)
ε = 0.3/n^0.5, k = 1000    0.9164 (0.2951)  0.7503 (2.1159)  −0.7701 (3.2633)  0.8142 (3.2751)  0.7417 (1.8322)  1.0791 (2.5741)

For all tests here, parameters in the learned model were started at
A = [ 0.5  0.5 ; 0.5  0.5 ],  μ = [ −0.75, −0.50 ]′,  σ = [ 1.0, 1.0 ]′.
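For concreteness, observation sequences from a two-state Gaussian-emission model like the source model above can be generated with a sketch such as the following; the sampling routine (`sample_hmm`) is our own illustration, not code from the dissertation.

```python
import numpy as np

def sample_hmm(A, mu, sigma, n, rng):
    """Sample a state path and observations from a Gaussian-emission HMM.

    Row i of A gives the transition distribution out of state i; mu and
    sigma give the per-state emission mean and standard deviation.
    """
    r = A.shape[0]
    states = np.empty(n, dtype=int)
    states[0] = rng.integers(r)
    for t in range(1, n):
        states[t] = rng.choice(r, p=A[states[t - 1]])
    obs = rng.normal(mu[states], sigma[states])
    return states, obs

# The two-state source model of Section 2.3.4.1.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
mu = np.array([-1.0, 1.0])
sigma = np.array([0.6, 0.9])
states, obs = sample_hmm(A, mu, sigma, 5000, np.random.default_rng(0))
```

Since the stationary distribution of this A is uniform, the sample mean of the observations should sit near zero over long runs.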
For each parameter combination, we ran the simulation for 50 000 iterations. We then chose
a representative subset of parameter combinations and reran each of these 50 times. Results are
summarized in Table 2.1. We discovered the following trends:
1. Convergence of all parameters occurred in less than 50 000 iterations for all fixed values of
ε, with larger values converging faster, but producing larger amplitudes of oscillation around
converged values. This behavior can be seen in Figure 2.1.
[Figure 2.1 shows nine panels: (a)-(c) transition probability estimates for ε = 0.006, 0.003, and
0.001; (d)-(f) Gaussian mean estimates for the same three values of ε; (g)-(i) Gaussian standard
deviation estimates for the same three values of ε.]
Figure 2.1: The effect of learning rate ε on parameter convergence during RMLE training, for constant ε. Notice how larger values of ε converge faster, but have more oscillations. Original model parameters are indicated by the symbol at the right edge of the graphs, and are specified in Section 2.3.4.1.
2. For decreasing εn, the models converged within 50 000 iterations only for limited combinations
of learning parameters. Larger values of ε0 generally converged faster (for those runs which
converged), but also caused larger oscillations, as was the case with large fixed ε. Smaller
values of γ (from εn = ε0/n^γ) caused faster convergence than larger values, since for larger
γ, εn decreases too quickly for the model to reach convergence. However, smaller γ also
provided less attenuation of the oscillations. Compare the three columns of Figure 2.2.
3. Longer averaging histories provided much smoother learning trajectories than shorter histo-
ries, and greatly reduced the frequency of oscillations in the learned parameters, although
the oscillation magnitude did not change much. With constant ε, this large oscillation is not
desirable. In most models with long histories, 50 000 iterations was not long enough for µ
and σ to converge. See Figure 2.3.
4. The algorithm becomes quite sensitive when one or more entries of the learned transition
probability matrix A approach zero (see Figure 2.4, page 41, for an example). Averaging
reduces this problem. Holst and Lindgren [77] also suggest parameterizing A using log
likelihoods instead of probabilities to alleviate this problem; we did not try this solution.
Discussion. The starting point of the learned model was chosen to make learning challenging,
which may explain the limited combinations of learning parameters that actually converged for
models with exponentially decreasing εn. Some of these models may simply have needed more time
to converge. Since we were testing many different parameter combinations, we initially only ran
each combination once, which may not produce results indicative of that parameter combination
(i.e., we could have been “unlucky” early in those simulations which did not converge). However,
our initial goal was to find combinations of parameters which are stable and rapidly converging
even in difficult situations, and the collected data provides this information.
While averaging did reduce the frequency of oscillations in the learned parameters, as pointed
out above, it did not reduce the magnitude of oscillation when learning the means and standard
deviations of the observation densities. This fact can possibly be explained by the following:
[Figure 2.2 shows nine panels: (a)-(c) transition probability estimates for εn = 0.5/n^0.5,
0.5/n^0.6, and 0.1/n^0.5; (d)-(f) Gaussian mean estimates for the same step-size schedules;
(g)-(i) Gaussian standard deviation estimates for the same step-size schedules.]
Figure 2.2: The effect of ε0 and γ on parameter convergence during RMLE training, with a decreasing εn. Notice that larger values of ε0 converge faster, but have larger oscillations. Larger values of γ smooth out oscillations faster, though they slow down convergence. Original model parameters are indicated by the symbol at the right edge of the graphs.
[Figure 2.3 shows nine panels: (a)-(c) transition probability estimates with averaging, for
(ε = 0.006, k = ∞), (ε = 0.3/n^0.5, k = ∞), and (ε = 0.3/n^0.5, k = 1000); (d)-(f) Gaussian
mean estimates for the same settings; (g)-(i) Gaussian standard deviation estimates for the
same settings.]
Figure 2.3: The effect of history size k on parameter averaging during RMLE training. The graphs show parameter learning for different history sizes (k = ∞, 1000) with both fixed and decreasing ε. Notice the large oscillation amplitudes for fixed ε in the first column: long histories and averaging do not work well with the constant-step-size version of the algorithm. Compare these graphs with those in Figures 2.1 and 2.2, which did not use averaging. Original model parameters are indicated by the symbol at the right edge of the graphs.
1. The flatness of the parameter space around μ and σ; i.e., small changes in these parameters
may not have much effect on the likelihood function, compared with changes to the parameters
of the transition probability matrix A.

2. The fact that we are averaging the score vector. As n becomes large, the amount that a new
score Sn+1 contributes to the averaged score S̄n+1 in Equation (2.28) decreases dramatically,
maintaining the momentum of the score vector.
These ideas suggest a few alternative approaches:
1. We could change the update for S̄n to give more weight to the current score, for example, by
rewriting the equation as

S̄n+1 = (1 − α)S̄n + αSn+1,

for 0 < α < 1. We have not tried this idea.
2. We could leave the observation scores Sn unaveraged while continuing to average the param-
eter iterates ϕn. This approach, unfortunately, did not produce converging results.
3. Since averaging is beneficial for transition probabilities, as pointed out above, and seem-
ingly disadvantageous for observation means and variances, a combined approach of obser-
vation/iterate averaging for A and iterate-only or no averaging for µ and σ could be tried.
Although results are not presented here, this approach produced some useful results.
4. We could use a different update algorithm. In particular, Schraudolph has proposed local
gain adaptation [87] and stochastic conjugate gradient [88] for stochastic training of neural
networks. Both ideas could be tried here; we have not yet attempted to implement either
algorithm.
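The fixed-weight update proposed in item 1 (which, as noted, we did not try) differs from the uniform running average of Equation (2.28) mainly in how quickly it forgets old scores. A small sketch of the contrast, on a score stream that jumps from 0 to 1 halfway through:

```python
import numpy as np

def running_average(scores):
    """Equation (2.28): uniform running average; the weight of each new
    score decays like 1/(n+1), so the average builds up momentum."""
    s_bar = scores[0].astype(float)
    for n, s in enumerate(scores[1:], start=1):
        s_bar += (s - s_bar) / (n + 1)
    return s_bar

def exponential_average(scores, alpha):
    """Proposed alternative: S_bar <- (1 - alpha)*S_bar + alpha*S,
    a fixed-weight (exponentially forgetting) average that keeps the
    current score's influence constant."""
    s_bar = scores[0].astype(float)
    for s in scores[1:]:
        s_bar = (1.0 - alpha) * s_bar + alpha * s
    return s_bar

# After the jump, the uniform average lags at the overall mean (0.5),
# while the fixed-weight average tracks the new level.
scores = np.r_[np.zeros(50), np.ones(50)]
uniform = running_average(scores)
tracked = exponential_average(scores, alpha=0.1)
```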
We also ran a set of tests on a model with two-dimensional Gaussian observations, with similar
results.
2.3.4.2 Observations from a finite alphabet
While the derivation and proof of the RMLE algorithm described in this chapter assume continuous
observation densities, the algorithm is capable of learning models with finite-alphabet observation
[Figure 2.4 shows six panels: (a)-(c) transition probability estimates with averaging (ε = 0.001,
k = ∞), with averaging (ε = 0.001, k = 1000), and without averaging (ε = 0.001); (d)-(f)
discrete observation density estimates for the same three settings.]
Figure 2.4: Examples of learning in HMMs with finite-alphabet observation densities. In these plots, we used various averaging histories (k = ∞, 1000, 1) and fixed ε = 0.001. Original model parameters are indicated by the symbol at the right edge of the graphs, and are specified in Section 2.3.4.2.
densities. Figure 2.4 shows some examples of learning in such models. The model used for training
was
A = [ 0.99  0.01 ; 0.01  0.99 ],  b = [ 0.900  0.005  0.095 ; 0.005  0.095  0.900 ].
The model being learned was initialized with
A = [ 0.5  0.5 ; 0.5  0.5 ],  b = [ 0.600  0.200  0.200 ; 0.200  0.600  0.200 ].
As with the transition probabilities, the algorithm is quite sensitive when finite-alphabet observation
probabilities approach zero, as can be seen in the graphs in the third column of Figure 2.4. As the
first two columns show, averaging helps alleviate this problem.
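A sketch of sampling symbol sequences from the finite-alphabet source model above; the sampler itself (`sample_discrete_hmm`) is illustrative, not part of the training code. With this near-deterministic A and b, the rare symbol 1 appears only occasionally, which is exactly the regime where small observation probabilities make the update sensitive.

```python
import numpy as np

# The finite-alphabet source model of Section 2.3.4.2: row j of b gives
# P(symbol | state j) over the three-symbol alphabet {0, 1, 2}.
A = np.array([[0.99, 0.01],
              [0.01, 0.99]])
b = np.array([[0.900, 0.005, 0.095],
              [0.005, 0.095, 0.900]])

def sample_discrete_hmm(A, b, n, rng):
    """Sample a symbol sequence from a discrete-observation HMM."""
    state = rng.integers(A.shape[0])
    symbols = np.empty(n, dtype=int)
    for t in range(n):
        symbols[t] = rng.choice(b.shape[1], p=b[state])
        state = rng.choice(A.shape[0], p=A[state])
    return symbols

symbols = sample_discrete_hmm(A, b, 2000, np.random.default_rng(1))
```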
2.3.5 Estimating a model with unknown model order
In most of the literature dealing with HMMs, it is generally assumed that the number of states
needed to represent an underlying process is known. When working with a real system, however,
the correct or optimal number of states may be difficult or impossible to know. A recent tutorial
paper on hidden Markov processes by Ephraim and Merhav [89] summarizes the state of the art
of order estimation in HMMs. A more recent proposal can be found in [90]. Order estimation of
continuous observation HMMs has been treated in [91, 92].
Most of the proposed approaches use a penalized-likelihood method, in which the likelihoods of
models of various orders are compared. Since the likelihood will invariably increase as the order
of the model increases, a penalty function is added to penalize larger models. The actual penalty
function used varies; see [89-92] for more details. To our knowledge, none of these techniques has
been applied to online HMM order estimation.
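The penalized-likelihood recipe can be sketched as follows. The BIC-style penalty, the likelihood values, and the parameter-count formula are illustrative placeholders, since the methods cited above differ in the penalty they actually use.

```python
import math

def select_order(loglik_by_order, num_params, n_obs):
    """Penalized-likelihood order selection: pick the order r maximizing
    log-likelihood minus a penalty that grows with model size. Here a
    BIC-style penalty (p/2) * log(n) is used as a stand-in."""
    best_r, best_score = None, -math.inf
    for r, ll in loglik_by_order.items():
        score = ll - 0.5 * num_params(r) * math.log(n_obs)
        if score > best_score:
            best_r, best_score = r, score
    return best_r

# Hypothetical likelihoods: fit improves with order but saturates.
loglik = {2: -5200.0, 4: -4900.0, 8: -4890.0}
# For an r-state HMM with 1-D Gaussian emissions: r(r-1) free transition
# parameters plus 2r emission parameters.
params = lambda r: r * (r - 1) + 2 * r
```

With these (made-up) numbers, the marginal likelihood gain from 4 to 8 states is too small to pay for the extra parameters, so the 4-state model is selected.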
For our model, numerous ad hoc methods of treating order estimation suggest themselves,
including growing the model to handle data that is not well modeled, and attempting to cover the
subspace inhabited by the incoming data. While we have not studied existing techniques in depth,
we present the results of a space-covering experiment below.
2.3.5.1 Space covering
In this section we suggest an ad hoc approach for learning the underlying state order of a set
of observations, as follows. First, we initialize a model with a large number of states, with the
observation densities initially covering the region of space occupied by the observations. In our
example, we assume that our observations will be contained in the region {(x, y) : x, y ∈ (−10, 10)},
and we choose to start with 16 states with Gaussian densities equally spaced throughout this region.
Figure 2.5 shows this setup, where the densities for each state are drawn in blue. The densities of
the states in the model to be learned are drawn in red.
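The initialization just described might be sketched as follows. The covariance scale (half the grid spacing) is our assumption, chosen only so that neighboring densities overlap slightly; the dissertation does not specify it.

```python
import numpy as np

def grid_init(n_per_axis=4, lo=-10.0, hi=10.0):
    """Initialize a space-covering HMM: Gaussian means on an even grid
    over (lo, hi) x (lo, hi), identical isotropic covariances, and a
    uniform transition matrix."""
    centers = np.linspace(lo, hi, n_per_axis + 2)[1:-1]  # interior points
    xx, yy = np.meshgrid(centers, centers)
    means = np.column_stack([xx.ravel(), yy.ravel()])    # (16, 2)
    r = means.shape[0]
    spacing = centers[1] - centers[0]
    covs = np.tile((spacing / 2) ** 2 * np.eye(2), (r, 1, 1))
    A = np.full((r, r), 1.0 / r)                         # uniform transitions
    return A, means, covs

A, means, covs = grid_init()
```

Uniform transitions give a uniform stationary distribution, matching the uniform shading of Figure 2.5.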
Figure 2.5: Initialization of an HMM with two-dimensional Gaussian observation densities. Each density is indicated on the graph by its mean and a contour line containing 80% of the density. Each density is also shaded according to the stationary probability of its state, with more likely states shaded darker. Densities of the model to be learned are drawn in red.
Note that there is no indication of transition probabilities on this graph. However, in this
figure and in the graphs in Figure 2.6, the density associated with each state is colored according
to that state’s stationary probability, derived from the stationary distribution of the transition
probability matrix A. Initially, all transition probabilities are equal, so the stationary distribution
(and therefore the distribution coloring) is uniform. Darker coloring of mean and contour lines
indicates higher stationary probability for a particular state.
The parameters of the source model in this experiment are
A = [ 0.7  0.1  0.1  0.1 ; 0.1  0.7  0.1  0.1 ; 0.1  0.1  0.7  0.1 ; 0.1  0.1  0.1  0.7 ],

μ1 = (−4.5, 4.5),  μ2 = (−1, 1),  μ3 = (2.4, −1.3),  μ4 = (5, −5),

Σ1 = [ 1.0  0.75 ; 0.75  1.5 ],  Σ2 = [ 2.0  0.5 ; 0.5  1.0 ],
Σ3 = [ 2.0  −1.5 ; −1.5  2.0 ],  Σ4 = [ 2.5  −0.1 ; −0.1  2.5 ].
Figure 2.6 documents the progression of the training.²

²The iteration numbers chosen for the graphs follow the curve y = 1000[x³], with x evenly spaced on the interval [1, ∛500] and [·] denoting rounding to the nearest integer.

Some points to note:

1. By 500 000 iterations, the model had converged to the four states of the source model, but note
that this situation is not necessarily stable. Notice, for example, that by 208 000 iterations,
the model had nearly converged to the original four-state model, but at 330 000 iterations,
[Figure 2.6 shows nine snapshots of the density layout, at (a) 1000, (b) 5000, (c) 14 000,
(d) 33 000, (e) 67 000, (f) 123 000, (g) 208 000, (h) 330 000, and (i) 500 000 iterations.]
Figure 2.6: Learning an HMM using a model with a large number of states. Here, we are learning a 16-state model with data generated from a 4-state model. The model was run with history k = 1000 and constant learning rate ε = 0.001.
some additional states have distributions which cover the same data. This phenomenon
seems to be caused mainly by observation outliers “activating” a state with a lower stationary
probability, temporarily destabilizing the model. It could be eliminated by combining states
whose distributions overlap considerably, and by removing unused or rarely used states.
2. Densities drawn in light grey indicate states that are not close to the observation data and are
seemingly unused by the model. Generally, their stationary probabilities go to zero if the
transition probabilities can go to zero, or otherwise become very small. These states could be
removed from the model, though we did not do that here.
3. During training, two or more states may have distributions covering the same data, as can
be seen at 123 000 and 330 000 iterations. As mentioned previously, it may be desirable to
combine these states. Since we have control over the spatial layout of the distributions, the
search for states to combine could be limited to states in the same topological neighborhood.
One method for deciding whether to combine two states would be to measure the Kullback-
Leibler distance between their observation distributions, and combine them if the value is
below some threshold.
4. The model took many more iterations to converge to its final configuration than in earlier
experiments. This is, again, partially due to states whose distributions compete for the
same observations. Convergence is, of course, also affected by the parameterization.
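The merge test suggested in item 3 could be sketched as follows, using the closed-form KL divergence between Gaussians; the symmetrization and the threshold value are hypothetical choices, not part of the experiments above.

```python
import numpy as np

def gauss_kl(mu0, cov0, mu1, cov1):
    """KL divergence D(N0 || N1) between two multivariate Gaussians."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def should_merge(mu0, cov0, mu1, cov1, threshold=0.1):
    """Combine two states when their observation densities nearly
    coincide, as measured by the symmetrized KL divergence."""
    sym = gauss_kl(mu0, cov0, mu1, cov1) + gauss_kl(mu1, cov1, mu0, cov0)
    return sym < threshold
```

For identical densities the divergence is zero, so the states would be merged; for well-separated means it grows quadratically with the separation and the states are kept distinct.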
Obviously, there are many caveats to this method. The user needs to pick the initial number,
size and shape (variance), and spacing of the observation distributions. As the dimension of the
data increases, the number of states required to cover a particular region of space grows
exponentially, making the method impractical for all but small spaces. Whether, when, and how to
remove or combine states was not considered here. Nevertheless, these experiments do show that it
is possible for an HMM to learn the underlying structure of a set of inputs with an on-line training
algorithm, and in doing so they validate the use of similar training in smaller models whose
observation densities are initially primed with estimates of the densities of the observation process.
2.4 HMMs as Bayesian Classifiers
The HMM presented in this chapter is an ideal model. When attempting to use it to model real
world data, such as speech, the basic assumption of the model—that the underlying sequence of
states of the real data is a Markov chain—is almost certainly untrue. What the model does provide,
however, is an improvement over the assumption that observations in a sequence are independent
of each other. That is, an assumption is made that the sequence of observations is important, and
that we can model some of the characteristics of that sequence with a first-order Markov chain.
We can see this improvement explicitly by analyzing an HMM as a stochastic classifier. A Bayes
classifier attempts to classify an input by estimating the posterior probabilities of an observation
using Bayes’ rule, i.e., for observation y and class xi,
P (xi|y) =p(y|xi)P (xi)∑
i p(y|xi)P (xi). (2.29)
If we know the prior probabilities of each xi and the prior distributions of y for each xi, this
classifier is optimal; i.e., it is the classifier with the lowest probability of error [93]. In fact, we
cannot in general know these distributions exactly, but the better our estimate of them, the better
our classifier. An HMM provides, at each time, an improved estimate of P (xi) by assuming that
the underlying sequence of states can be modeled as a Markov chain. The following analysis shows
why.
For comparison to our model definition earlier in this chapter, it will be convenient to write
Bayes’ rule in matrix form. Let ui = P (xi) be the prior probability of class i, and let u =
[u1, . . . , ur]′. Let b(y; θi) = p(y|xi) be the prior likelihood of y for class i, and let
b(y) = [b(y; θ1), . . . , b(y; θr)]′, (2.30)
where θi represents the parameters of the probability density function associated with class i. As
with our HMM analysis, let B(y) = diag[b(y)]. Finally, let fi = P (xi|y) be the posterior probability
of state i for observation y, and let f = [f1, . . . , fr]′. We can then rewrite Equation (2.29) as

f = B(y)u / (b(y)′u). (2.31)
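In code, Equation (2.31) is a one-liner, since B(y) = diag[b(y)] makes the numerator an elementwise product; the example likelihood and prior values below are arbitrary.

```python
import numpy as np

def bayes_posterior(b_y, u):
    """Equation (2.31): posterior class probabilities f = B(y)u / (b(y)'u),
    where b_y[i] = p(y | x_i) and u[i] = P(x_i)."""
    f = b_y * u        # B(y)u: elementwise, since B(y) = diag[b(y)]
    return f / f.sum() # normalize by b(y)'u

# Two classes with equal priors and likelihoods 0.3 and 0.1 for the
# observed y: the posterior splits 3:1.
f = bayes_posterior(np.array([0.3, 0.1]), np.array([0.5, 0.5]))
```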
For an HMM, let fn(ϕ) be the probability distribution of Xn, i.e.,
fn(ϕ) = [fn1(ϕ), . . . , fnr(ϕ)]′ (2.32)
where
fni(ϕ) = P (Xn = i|y1, . . . , yn). (2.33)
That is, fni(ϕ) is the probability that the state (class) of the model is i at time n, given all
observations through time n. Using the variable definitions from Section 2.2, fn(ϕ) can be calculated
as
fn(ϕ) = B(yn;ϕ)un(ϕ) / (b(yn;ϕ)′un(ϕ)), (2.34)

where un(ϕ), b(yn;ϕ), and B(yn;ϕ) are defined as before.
Comparing Equations (2.31) and (2.34), clearly an HMM is a Bayesian classifier. As suggested
above, for sequential data, un(ϕ) provides a better estimate of the prior class probabilities than a
static prior u. We can see this by rewriting Equation (2.13) as
un+1(ϕ) = A(ϕ)′fn(ϕ), (2.35)
where A(ϕ) is the Markov transition probability matrix defined in Section 2.2. The prior estimate
at each time is thus a weighted average of the transition probability vectors, with the weights
at time n + 1 determined by the probability distribution of Xn. Hence, our prior changes with every
observation, and the improved estimate of the class priors will give us an improved classifier for
sequential data, to the extent that our Markov chain accurately models the first-order statistics of
the underlying state sequence.
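The recursion of Equations (2.34) and (2.35) can be sketched as a filter loop; the example below reuses the two-state Gaussian source model of Section 2.3.4.1, and the function name and interface are ours.

```python
import numpy as np

def hmm_filter(A, b_fn, ys, u0):
    """Run an HMM as a sequential Bayes classifier: at each step the
    posterior f_n is computed from the current prior u_n (Equation
    (2.34)), and the next-step prior is u_{n+1} = A' f_n (Equation
    (2.35))."""
    u = u0.copy()
    posteriors = []
    for y in ys:
        f = b_fn(y) * u
        f /= f.sum()       # Equation (2.34): posterior over states
        posteriors.append(f)
        u = A.T @ f        # Equation (2.35): prior for the next step
    return np.array(posteriors)

# Two-state Gaussian model from Section 2.3.4.1.
mu, sigma = np.array([-1.0, 1.0]), np.array([0.6, 0.9])
b_fn = lambda y: (np.exp(-0.5 * ((y - mu) / sigma) ** 2)
                  / (sigma * np.sqrt(2 * np.pi)))
A = np.array([[0.9, 0.1], [0.1, 0.9]])
post = hmm_filter(A, b_fn, ys=[-1.1, -0.9, 1.2], u0=np.array([0.5, 0.5]))
```

With a static prior u, the same loop without the final line reduces to the plain Bayes classifier of Equation (2.31).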
2.5 Discussion
This chapter explored the use of the RMLE algorithm for on-line training of HMMs. We have
successfully trained models using the algorithm, exploring how various combinations of training
parameters affect learning. We have also successfully demonstrated that a large model with states
whose distributions cover a section of space can correctly learn the structure of that space.
We have available an on-line learning algorithm for a model which can discern the underlying
structure of a set of inputs, which can turn continuous inputs into discrete states, and which learns
a notion of sequence among those states. We believe this type of model is ideal for our proposed
cascade structure for semantic learning, described in Section 1.4. The next chapter will describe
our implementation of that cascade structure using HMMs with the RMLE algorithm.
CHAPTER 3
CASCADE OF HMMS: THEORY AND SIMULATION
3.1 Introduction
In Section 1.4, we introduced a general theory of semantic learning, suggesting that we form se-
mantic concepts through associations among inputs from multiple sensory modalities. Motivated
by these ideas, we have developed a new model based on a cascade of HMMs. Our ideas follow a
long line of research on attempts to identify additional structural information present in real-world
data, and incorporate that structural information into hidden Markov and related models. Below,
we review some of the previous research in this area. We then present an abstract description of
our cascade model, which attempts to model additional structure present in multiple data streams.
We discuss convergence of the model using the RMLE algorithm presented in the previous chapter,
and show some simulation results. In the next chapter, we will describe the use of this model as an
associative memory.
3.2 HMMs for Learning Structure
3.2.1 Unimodal structure
Considering the ideas in Section 2.4, one of the benefits of using HMMs for unsupervised learning is
that they have a notion of sequence, and can therefore learn some of the sequential structure of the
data during training. Later, knowledge of this additional structure can be used to better classify
data. However, HMMs do not account for all of the structure in a sequence, and one question that
comes to mind is whether we can learn more of this structure.
One of the first attempts at modeling additional structure beyond HMMs was the hidden semi-
Markov model (HSMM), first introduced as the variable-duration HMM (VDHMM) [94–96]. HSMMs
still assume that the underlying sequence of states is a Markov chain, but that each state emits a
variable length sequence of observations, which is a more accurate model of, e.g., speech. Generally,
then, the model includes a specific probability density function (pdf) for the duration spent in each
state, as well as a pdf for the (variable length) sequence of observations for each state.1 During our
work on this dissertation, we explored the use of HSMMs, and derived a version of the RMLE for
the model (which we ultimately did not use). A formal description of this model and our RMLE
derivation appear in Appendix D.
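As a concrete sketch of the HSMM generative story just described, the following toy example draws an explicit duration for each state visit and then emits that many observations. This is not code from the dissertation; the two-state model, its duration pmfs, and its Gaussian emission parameters are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state HSMM. Durations are explicit, so the state chain
# itself has no self-loops.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
dur_pmf = [np.array([0.2, 0.5, 0.3]),   # P(duration = 1, 2, 3) in state 0
           np.array([0.6, 0.3, 0.1])]   # P(duration = 1, 2, 3) in state 1
means = np.array([0.0, 5.0])            # Gaussian emission mean per state
sigma = 1.0

def sample_hsmm(n_steps, state=0):
    """Generate (states, observations) until n_steps emissions are produced."""
    states, obs = [], []
    while len(obs) < n_steps:
        d = rng.choice(len(dur_pmf[state]), p=dur_pmf[state]) + 1  # draw duration
        for _ in range(d):                                         # emit d observations
            states.append(state)
            obs.append(rng.normal(means[state], sigma))
        state = rng.choice(len(A), p=A[state])                     # Markov transition
    return np.array(states[:n_steps]), np.array(obs[:n_steps])

x, y = sample_hsmm(1000)
```

A standard HMM would instead model the duration implicitly through self-loop probabilities, which forces a geometric duration distribution; the explicit pmf here is the point of the HSMM.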
Although not originally described as such, the hierarchical hidden Markov model (HHMM)
[97, 98] is a particular implementation of an HSMM which, in the terminology we use above,
assumes that the pdf of the variable length observation sequence is itself modeled by an HMM
(or another HHMM). This model has been shown [97] to be able to extract, in an unsupervised
manner, some of the higher order structure in real world data (specifically, text) that we alluded
to above.
Factorial HMMs [99] provide another approach to learning more complex structure in a stream
of data. These models assume that data output is produced from an interaction of multiple, loosely
coupled processes, and therefore can be better modeled with a distributed state representation.
This model was shown to discover complex structure in some of the melody lines of J. S. Bach's
chorales that cannot be captured by traditional HMMs.
Following this progression, researchers have recently proposed more complicated models that
better capture the underlying structure of a given signal. One example is the switching
state-space model, also known as a hybrid model [100, 101]. These models assume an
underlying Markov chain modeling discrete states, with the observations in each state assumed to
be produced by a Kalman filter. The Markov chain thus switches among various Kalman filters
¹Often the observations are treated as independent random variables and their individual likelihoods for each class are multiplied together; i.e., if b_j(y_n, y_{n+1}, ..., y_{n+m−1}) represents the m-dimensional pdf of a sequence of observations for state j of a model, the likelihood will often be calculated as b_j(y_n, y_{n+1}, ..., y_{n+m−1}) = b_j(y_n) b_j(y_{n+1}) ··· b_j(y_{n+m−1}).
producing output, and the Kalman filters are assumed to better model the short-term dynamics of
the observations produced by a particular state.
All of the above models assume a discrete sequence of states, transitioning according to a Markov
chain, with each state producing a possibly variable length observation sequence. In some cases,
such as the HHMM, it is possible to look at state sequences at multiple resolutions, although the
underlying sequence at each resolution is still assumed to be Markovian. With the exception of the
switching state-space models, all of the HMM-based approaches described above are mathematically
equivalent to a standard HMM, although, of course, the corresponding HMM would often be rather
large and complicated.
A class of discrete models known as stochastic grammars are the next level of complexity with
regard to modeling structured sequences [102, 103]. These grammars are stochastic versions of
those in the language grammar hierarchy proposed by Chomsky [104], and in fact, some HMMs
are equivalent to right-linear stochastic grammars. Unfortunately, algorithms for working with
stochastic grammars are computationally expensive and rather unwieldy to work with, and so to
our knowledge, little practical work has been done with them.
The various models described above have generally been used for supervised learning, although
in theory, they could all be run in an unsupervised fashion to discover structure in unimodal data.
3.2.2 Multimodal structure
The previous section discussed a class of models that modeled structure within a particular stream
of data. In our work we are interested in discovering structure among inputs from multiple streams.
Within the family of HMMs and related stochastic models, a few models have been proposed
that try to deal with multiple streams of data, typically streams representing visual and auditory
information. Two variations are coupled HMMs (CHMMs) [64, 65] and fused hidden Markov models
(fused HMMs) [62, 63]. CHMMs tie together two individual hidden Markov models by introducing
a conditional probability between the state variables of the two models. Fused-HMMs work in a
similar way, but model the joint distribution of the observation and state sequences of both models.
Compared with our work, the most striking difference is in how we represent the relationship
among multiple input models. Our proposal is to model the relationship between the two input
Figure 3.1: Semantic memory implemented using HMMs. Each model in the left diagram is modeled
by a single HMM in the right diagram.
models not as a conditional or joint pdf, but with a third hidden Markov model. An important
aspect of this approach is that our model is compositional. That is, the states of the input models
are considered to be functions of the state of the third model, and as in regular HMMs, the outputs
are considered to be functions of these states. To our knowledge, this type of compositional cascade
model is not found in the literature. As an added benefit of our approach, we can use well known
algorithms for learning and inference in all three HMMs. We describe our model in the following
section.
3.3 Cascade of HMMs
As shown in Figure 3.1, we are using HMMs for each of the individual models in the semantic
memory model we proposed in the introduction to this dissertation. This figure shows the topology
of our model, as a cascade of HMMs with two lower "input" HMMs, ϕ^{l1} and ϕ^{l2}, and one upper
"concept" HMM ϕ^u, each defined as in Section 2.2. As stated previously, we propose to use the
lower HMMs to individually learn a set of classes of sensory input data in an unsupervised manner,
and the upper HMM to learn states representing frequent co-occurrences in the classifications of
the lower models. The remaining description in this chapter will be from the point of view of this
abstract model.
Figure 3.2: An HMM cascade model. Abstractly, we assume information is arriving from two distinct
but related input streams, y^{l1}_n and y^{l2}_n. This information is recognized/learned by HMMs
ϕ^{l1} and ϕ^{l2}, respectively, which produce estimated state sequences x̂^{l1}_n and x̂^{l2}_n.
These state sequences are then recognized/learned by HMM ϕ^u. All learning is unsupervised,
using RMLE.
3.3.1 Model description
Formally, let the topology of our cascade model be as shown in Figure 3.2; i.e., let our cascade model
ϕ = {ϕ^{l1}, ϕ^{l2}, ϕ^u}, where the component models ϕ^{l1}, ϕ^{l2}, and ϕ^u are HMMs defined
according to the description in Section 2.2. Let {X^{l1}_n, Y^{l1}_n}, {X^{l2}_n, Y^{l2}_n}, and
{X^u_n, Y^u_n} be the state and observation sequences corresponding respectively to ϕ^{l1}, ϕ^{l2},
and ϕ^u. In this model, the observations Y^{lj}_k of the lower models ϕ^{lj} are generally assumed
to be continuous. The observations Y^u_k of the upper model ϕ^u are the concatenated states of the
lower-level models; i.e., Y^u_k = (X^{l1}_k, X^{l2}_k), and ϕ^u models the joint distribution of
X^{l1}_k and X^{l2}_k for each state j = 1, ..., r^u, where r^u is the number of states in ϕ^u. To
simplify calculations and for future considerations, the joint observation density of ϕ^u is modeled
assuming independence between the components of its input.
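The data flow just described — lower HMMs filter continuous observations into state estimates, whose pair becomes the upper HMM's observation — can be sketched in a few lines. This is a minimal illustration with hypothetical parameters (two toy lower models and a two-state upper model), using the factored upper observation density mentioned above; it is not the dissertation's implementation.

```python
import numpy as np

def gauss(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def forward_step(p_prev, A, b_y):
    """One HMM forward-filter update: p_n(j) ∝ sum_i p_{n-1}(i) A[i,j] b_j(y_n)."""
    p = (p_prev @ A) * b_y
    return p / p.sum()

# Hypothetical lower models: 3 and 2 Gaussian states respectively.
A1 = np.full((3, 3), 1/3); mu1 = np.array([0., 5., 9.]); var1 = np.ones(3)
A2 = np.full((2, 2), 1/2); mu2 = np.array([-1., 4.]);    var2 = np.ones(2)
# Upper model observes the pair of lower-state labels; with the independence
# assumption its observation probability factors: b^u_j(x1, x2) = b1u_j[x1] * b2u_j[x2].
Au = np.full((2, 2), 1/2)
b1u = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])  # P(x^{l1} | upper state)
b2u = np.array([[0.9, 0.1], [0.1, 0.9]])            # P(x^{l2} | upper state)

p1, p2, pu = np.ones(3)/3, np.ones(2)/2, np.ones(2)/2
for y1, y2 in [(0.1, -0.9), (5.2, 4.1), (8.8, 4.0)]:     # toy input streams
    p1 = forward_step(p1, A1, gauss(y1, mu1, var1))
    p2 = forward_step(p2, A2, gauss(y2, mu2, var2))
    x1, x2 = int(p1.argmax()), int(p2.argmax())          # lower-state estimates
    pu = forward_step(pu, Au, b1u[:, x1] * b2u[:, x2])   # fed upward as y^u_n
```

Each lower filter is updated from its own continuous observation, and only the discrete argmax classifications are passed upward, exactly as in Figure 3.2.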
3.3.2 Recursive maximum-likelihood estimation for the cascade model
Even though the individual component models are generative, as discussed in Section 2.2, it is im-
possible to generate data with this model with sufficient statistics to identify all model parameters.
To see this, suppose we use the upper model ϕ^u to generate state-pair sequences {y^{u,1}_n, y^{u,2}_n},
and use these as the states of the lower models; i.e., let x^{l1}_n = y^{u,1}_n and
x^{l2}_n = y^{u,2}_n. In this situation, there
Figure 3.3: A dynamic Bayesian network (DBN) showing the dependence among output and state
variables assumed by our cascade HMM. The cascade HMM cannot generate these dependencies,
but this DBN can be fully implemented using a switching HMM.
is no direct Markov dependence between x^{lγ}_n and x^{lγ}_{n−1}, because we do not use the state
transition matrix A(ϕ^{lγ}). On the other hand, if we use the state transition matrix to generate
the next state, then there is no dependence on the upper model.
To alleviate this problem, we will make a slight modification of our original model for generative
purposes, and then use our proposed cascade model for learning and inference. The modification
we need is to make the states x^{lγ}_n of the lower models dependent both on x^{lγ}_{n−1} and on
y^{u,γ}_n. A dynamic Bayesian network (DBN) showing this relationship graphically is given in
Figure 3.3. To generate this dependence, we define a modification of an HMM called a switching
HMM, whose name refers to its structural similarities to the switching state-space models
mentioned earlier.
A switching HMM is a discrete-time stochastic process with two components, {X_n, Y_n}, defined
on a probability space (Ω, F, P). Let {X_n}_{n=1}^∞ be a discrete-time process with state space
R = {1, ..., r}. Unlike an HMM, in a switching HMM the dynamics of X_n are determined by a set
of Markov chains {A_m(λ)}, m = 1, ..., s, with each chain having order r and the transition
probabilities for each chain defined as usual. An external discrete signal q_n ∈ {1, ..., s}
determines which Markov chain to use for the transition from X_{n−1} to X_n. As in an HMM, the
process Y_n is a probabilistic function of X_n, as we have defined previously. Let λ be the vector
of parameters for this model. The topology of this model is shown in Figure 3.4.

Figure 3.4: A switching HMM. Each Markov chain in the model has the same number of states, with
the same state in each chain corresponding to the same observation probability density function.
Input q_n chooses which Markov chain to use for the transition from x_{n−1}.
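Sampling from a switching HMM as just defined takes only a few lines. The parameters below are hypothetical, and the switch sequence is drawn at random here rather than produced by an upper HMM, purely to exercise the definition; this is a sketch, not the dissertation's code.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical switching HMM: s = 2 Markov chains over r = 3 states; the
# external switch q_n selects which transition matrix moves x_{n-1} to x_n.
A = np.array([[[0.90, 0.05, 0.05],
               [0.80, 0.10, 0.10],
               [0.80, 0.10, 0.10]],
              [[0.10, 0.10, 0.80],
               [0.10, 0.10, 0.80],
               [0.05, 0.05, 0.90]]])
means = np.array([0.0, 5.0, 9.0])    # one Gaussian emission per state, shared by chains

def sample_switching_hmm(switches, x0=0, sigma=0.5):
    xs, ys = [], []
    x = x0
    for q in switches:               # q_n chooses the chain for this transition
        x = rng.choice(3, p=A[q, x])
        xs.append(x)
        ys.append(rng.normal(means[x], sigma))
    return np.array(xs), np.array(ys)

q = rng.choice(2, size=500)          # random switches here; in the cascade they
xs, ys = sample_switching_hmm(q)     # would come from the upper HMM's output
```

Note that every chain shares the same observation densities, as the Figure 3.4 caption states: the switch changes only the transition dynamics, not the emission model.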
Proceeding, let the topology of our generating model be as shown in Figure 3.5, and call this
model a cascaded switching HMM. Define ϕ = {ϕ^u, λ^{l1}, λ^{l2}}, where ϕ^u is a finite-alphabet-
observation HMM as defined in Section 2.2, and λ^{l1} and λ^{l2} are switching HMMs as defined
above. We could, if we wished, attempt to estimate the model parameters of the original cascaded
switching HMM using a model with identical structure (and in fact, we have begun to look at this
model, but have not completed sufficient analysis to include results here). Instead, we will
approximate ϕ with a cascade model ϕ̂ = {ϕ̂^u, ϕ̂^{l1}, ϕ̂^{l2}}, as shown in Figure 3.6.
Figure 3.5: A cascaded switching HMM. As a generator, an HMM ϕ^u outputs a discrete pair
{y^{u,1}_n, y^{u,2}_n}, the components of which become the switches for a pair of switching HMMs
λ^{l1} and λ^{l2}, selecting which Markov chain is used to determine the next transition.
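The generator of Figure 3.5 can be sketched by letting an upper HMM emit the switch pair at each step. All parameters below are hypothetical stand-ins (two states everywhere, for brevity); the point is only the wiring: the upper emission becomes the lower models' switches.

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal sketch of a cascaded switching HMM generator (hypothetical parameters).
Au  = np.array([[0.98, 0.02], [0.02, 0.98]])   # upper transition matrix
bu1 = np.array([[0.9, 0.1], [0.1, 0.9]])       # P(y^{u,1} | upper state)
bu2 = np.array([[0.8, 0.2], [0.2, 0.8]])       # P(y^{u,2} | upper state)
# Each lower switching HMM: 2 chains over 2 states (rows sum to 1).
Al = np.array([[[0.9, 0.1], [0.8, 0.2]],
               [[0.2, 0.8], [0.1, 0.9]]])

def generate(n):
    xu, x1, x2 = 0, 0, 0
    out = []
    for _ in range(n):
        xu = rng.choice(2, p=Au[xu])         # upper Markov step
        q1 = rng.choice(2, p=bu1[xu])        # switch for lower model 1
        q2 = rng.choice(2, p=bu2[xu])        # switch for lower model 2
        x1 = rng.choice(2, p=Al[q1, x1])     # lower transitions depend on both
        x2 = rng.choice(2, p=Al[q2, x2])     # x_{n-1} and the switch
        out.append((int(xu), int(x1), int(x2)))
    return out

seq = generate(200)
```

This is exactly the dependence structure of the DBN in Figure 3.3: each lower state depends on its own predecessor and on the upper model's current emission.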
Figure 3.6: Monte Carlo simulation for learning a cascaded switching HMM ϕ using a cascade
HMM ϕ̂. The model on the left is a generative model, generating data for the model on the right
to learn.
Comparing the two models, we note that (1) in the cascaded switching HMM, state transitions
are determined by a set of transition probability matrices {A_m(λ^{l1})}, which we attempt to model
by a single transition probability matrix A(ϕ̂^{l1}) in the cascade HMM, and (2) in the cascaded
switching HMM, the joint observation densities in ϕ^u represent the selection of Markov chains in
the switching HMMs, whereas in the cascade HMM, the joint distribution in ϕ̂^u corresponds to the
actual states in the lower models. We suggest that generally this joint distribution over states will
be sufficient to identify states in the original HMM ϕ^u. However, we note that not all cascaded
switching HMMs ϕ will be identifiable. An example of a switching HMM that is not identifiable
by our cascade model can be constructed (1) by selecting a particularly simple form for ϕ^u, such
that each state deterministically selects a single Markov chain in each of the switching HMMs, and
then (2) by considering transition probability matrices {A_m(λ^{lγ})}, γ = 1, 2, whose stationary
distributions are identical but whose actual transitions differ.
Proceeding, for the following analysis, assume that the number of states and the form of the
density function in each HMM and switching HMM in the original model ϕ are known, and that we
are attempting with ϕ̂ to learn a set of first-order transition probabilities and observation density
parameters representing the original data. Consider the state–observation sequence pair
{x^{l1}_n, y^{l1}_n}. Even though this sequence was not generated by an HMM, there exists an HMM
that represents the first-order statistics of this sequence, i.e., one that exactly matches the
first-order transition probabilities and observation densities of this sequence. As shown in the
previous chapter, the model ϕ̂^{l1} in our cascade structure will converge to this model when
updated using the recursive maximum-likelihood algorithm presented in Chapter 2. The same applies
to ϕ̂^{l2}.

Next, consider the estimated composite state sequence {x̂^{l1}_n, x̂^{l2}_n} recognized by the
models ϕ̂^{l1} and ϕ̂^{l2}. We will assume that, as models ϕ̂^{l1} and ϕ̂^{l2} converge, this
sequence will be representative of the true state sequences in the switching HMMs λ^{l1} and λ^{l2}
which generated the data. As above, we note that there then exists an HMM that can represent the
first-order statistics of this sequence, which, again, we can learn through recursive
maximum-likelihood estimation. Each state in model ϕ̂^u will correspond to a unique state in ϕ^u
if the joint distribution of states in λ^{l1} and λ^{l2} is unique for each state in ϕ^u.
3.3.3 Numerical simulations
We will be using the setup in Figure 3.6 for numerical simulations. Since the structure of the
generating model and the learning model are different, we cannot directly compare the learned
parameter values. What we can show in simulation is
1. that the likelihood for each model increases during training, and
2. that the learned models can classify the original data and produce state sequences represen-
tative of the original model sequences.
For the simulation, we used the following parameters for the generative cascaded switching HMM.
HMM ϕ^u was a three-state finite-alphabet HMM, with parameters

    A(ϕ^u) = | 0.98  0.01  0.01 |
             | 0.01  0.98  0.01 |
             | 0.01  0.01  0.98 |,

    b^{l1}(ϕ^u) = | 0.8  0.1  0.1 |
                  | 0.1  0.8  0.1 |
                  | 0.1  0.1  0.8 |,

and

    b^{l2}(ϕ^u) = | 0.48  0.48  0.02  0.02 |
                  | 0.02  0.02  0.94  0.02 |
                  | 0.02  0.02  0.02  0.94 |,

where, to simplify calculations, the discrete observations are assumed to be independent and mod-
eled with two discrete, finite probability mass functions b^{l1}(ϕ^u) and b^{l2}(ϕ^u).
Switching HMM λ^{l1} was modeled using three probability transition matrices

    A_1(λ^{l1}) = | 0.90  0.05  0.05 |
                  | 0.80  0.10  0.10 |
                  | 0.80  0.10  0.10 |,

    A_2(λ^{l1}) = | 0.10  0.80  0.10 |
                  | 0.05  0.90  0.05 |
                  | 0.10  0.80  0.10 |,

and

    A_3(λ^{l1}) = | 0.10  0.10  0.80 |
                  | 0.10  0.10  0.80 |
                  | 0.05  0.05  0.90 |,

and single-dimensional Gaussian observation pdfs, with parameters

    µ = [ 0  7  9 ]′,   σ² = [ 2.0  0.8  0.6 ]′.

Similarly, switching HMM λ^{l2} was modeled using four probability transition matrices

    A_1(λ^{l2}) = | 0.90  0.08  0.01  0.01 |
                  | 0.50  0.50  0.00  0.00 |
                  | 0.50  0.50  0.00  0.00 |
                  | 0.50  0.50  0.00  0.00 |,

    A_2(λ^{l2}) = | 0.00  0.50  0.50  0.00 |
                  | 0.01  0.90  0.08  0.01 |
                  | 0.00  0.50  0.50  0.00 |
                  | 0.00  0.50  0.50  0.00 |,

    A_3(λ^{l2}) = | 0.00  0.50  0.50  0.00 |
                  | 0.00  0.50  0.50  0.00 |
                  | 0.01  0.08  0.90  0.01 |
                  | 0.00  0.50  0.50  0.00 |,

    A_4(λ^{l2}) = | 0.00  0.00  0.50  0.50 |
                  | 0.00  0.00  0.50  0.50 |
                  | 0.00  0.00  0.50  0.50 |
                  | 0.01  0.01  0.08  0.90 |,

and Gaussian observation pdfs with parameters

    µ = [ −1  3  6  9 ]′,   σ² = [ 0.6  0.5  0.8  0.7 ]′.
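For readers who want to check these numbers, the λ^{l1} parameters above can be transcribed and verified to be row-stochastic, and the stationary distribution of each chain (the quantity relevant to the identifiability remark earlier in this section) computed directly. This is a small numpy sketch, not code from the dissertation.

```python
import numpy as np

# Transition matrices A_1..A_3 of switching HMM λ^{l1}, transcribed from the text.
A_l1 = np.array([
    [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.80, 0.10, 0.10]],
    [[0.10, 0.80, 0.10], [0.05, 0.90, 0.05], [0.10, 0.80, 0.10]],
    [[0.10, 0.10, 0.80], [0.10, 0.10, 0.80], [0.05, 0.05, 0.90]],
])
mu_l1 = np.array([0.0, 7.0, 9.0])          # Gaussian means
var_l1 = np.array([2.0, 0.8, 0.6])         # Gaussian variances

assert np.allclose(A_l1.sum(axis=2), 1.0)  # every chain is row-stochastic

def stationary(P):
    """Stationary distribution: left eigenvector of P with eigenvalue 1."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

pi1 = stationary(A_l1[0])   # chain 1 concentrates mass on state 1 (index 0)
```

Each chain A_m strongly favors its own state m, which is what lets the upper model's switch choice leave a visible signature in the lower state sequence.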
For the learning model ϕ̂ = {ϕ̂^u, ϕ̂^{l1}, ϕ̂^{l2}}, transition probabilities for all HMMs were
initialized uniformly. The finite-alphabet observation densities in ϕ̂^u were initialized randomly,
and the Gaussians in ϕ̂^{l1} and ϕ̂^{l2} were initialized by running the generative model for 1000
iterations and using k-means clustering to determine a set of starting means and variances.² To
make the problem slightly more interesting, Gaussian noise with zero mean and standard deviation
one was then added to the initial means, and noise with zero mean and standard deviation 0.5 was
added to the variances. For recursive maximum-likelihood training, we let the learning rate
ε_n = 0.006/n^{0.2}, where n is the iteration number, and we used a smoothing history of k = 1000
(see Section 2.3.3).
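The initialization procedure can be sketched roughly as follows: a plain Lloyd's-algorithm k-means on one-dimensional samples, followed by the noise perturbation described above. The sample data here is a hypothetical stand-in for 1000 draws from the generator, and `kmeans_init` is an illustrative helper, not the author's code.

```python
import numpy as np

rng = np.random.default_rng(3)

def kmeans_init(y, k, iters=50):
    """Lloyd's algorithm on 1-D data; returns sorted cluster means and variances."""
    centers = rng.choice(y, size=k, replace=False)
    labels = np.zeros(len(y), dtype=int)
    for _ in range(iters):
        labels = np.abs(y[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep old center if a cluster empties
                centers[j] = y[labels == j].mean()
    variances = np.array([y[labels == j].var() if np.any(labels == j) else 1.0
                          for j in range(k)])
    order = np.argsort(centers)
    return centers[order], variances[order]

# Stand-in for 1000 samples from the generative model's Gaussian mixture.
y = np.concatenate([rng.normal(0, 1.4, 400), rng.normal(7, 0.9, 300),
                    rng.normal(9, 0.8, 300)])
mu0, var0 = kmeans_init(y, 3)
mu0 += rng.normal(0, 1.0, 3)                 # perturb means, as in the text
var0 = np.abs(var0 + rng.normal(0, 0.5, 3))  # perturb variances, kept nonnegative
```

The perturbation step deliberately moves the starting point away from the k-means optimum so that the simulation actually exercises the RMLE updates.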
Figure 3.7 shows the output of a training run for model ϕ̂^u. Figure 3.7(a) gives a running average
of the log-likelihood of the observations versus time. The remaining subfigures show the convergence

²In fact, this initialization could be done randomly as well, but k-means clustering and similar techniques are commonly used to give a set of initial parameters which will converge in a reasonable amount of time [93].
Figure 3.7: Parameter learning for model ϕ̂^u: (a) running average of the log-likelihood
(1/n) log p_{ϕ̂^u}(y_1, ..., y_n); (b) training of transition probability matrix A(ϕ̂^u);
(c) training of observation probability matrix b^{l1}(ϕ̂^u); (d) training of observation
probability matrix b^{l2}(ϕ̂^u). Although the learned model cannot be directly compared to the
original model, these graphs show that the model parameters converge.
of the parameters of ϕ̂^u during training. Since we cannot compare this trained model directly to
the original model ϕ^u, it is difficult to draw conclusions from these graphs with regard to the
"goodness" of the model. What we can say is that the parameters did converge, and that as they
converged, the log-likelihood of the observations generally increased. Note that since we are doing
stochastic optimization, we are not guaranteed to always increase the likelihood; hence there may
be an occasional dip in the likelihood graphs, especially near the beginning. Similar graphs for
models ϕ̂^{l1} and ϕ̂^{l2} are shown in Figures 3.8 and 3.9. Note that since we used k-means
clustering to initialize ϕ̂^{l1} and ϕ̂^{l2}, their density estimates started quite close to their
optimal values.
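The running-average log-likelihood plotted in panel (a) of Figures 3.7–3.9 can be computed from the per-step normalizers of the scaled forward filter, since each normalizer equals p(y_n | y_1, ..., y_{n−1}). The sketch below uses a hypothetical two-state Gaussian HMM and is not the dissertation's code.

```python
import numpy as np

def running_avg_loglik(ys, A, means, var):
    """Return (1/n) log p(y_1..y_n) for every n, via the scaled forward filter."""
    r = len(means)
    p = np.ones(r) / r
    total, out = 0.0, []
    for n, y in enumerate(ys, start=1):
        b = np.exp(-0.5 * (y - means) ** 2 / var) / np.sqrt(2 * np.pi * var)
        p = (p @ A) * b
        c = p.sum()              # c_n = p(y_n | y_1, ..., y_{n-1})
        p /= c                   # renormalize the filter
        total += np.log(c)       # accumulate log-likelihood
        out.append(total / n)
    return np.array(out)

A = np.array([[0.9, 0.1], [0.1, 0.9]])   # hypothetical two-state model
avg = running_avg_loglik(np.array([0.1, 0.2, 4.9, 5.1]), A,
                         np.array([0.0, 5.0]), np.array([1.0, 1.0]))
```

Accumulating log c_n rather than multiplying raw probabilities avoids the numerical underflow that would otherwise occur over tens of thousands of observations.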
One way to compare the models in our simulation is by finding the correspondence between
the states of ϕ^u and ϕ̂^u after training (or after the model parameters seem to have converged)
Figure 3.8: Parameter learning for model ϕ̂^{l1}: (a) running average of the log-likelihood
(1/n) log p_{ϕ̂^{l1}}(y^{l1}_1, ..., y^{l1}_n); (b) training of transition probability matrix
A(ϕ̂^{l1}); (c) training of Gaussian observation density means µ(ϕ̂^{l1}); (d) training of
Gaussian observation density variances σ²(ϕ̂^{l1}); (e) data histogram and observation
distribution learned by the model. Although the learned model cannot be directly compared to the
original model, these graphs show that the model parameters converge.
Figure 3.9: Training run output for model ϕ̂^{l2}: (a) running average of the log-likelihood
(1/n) log p_{ϕ̂^{l2}}(y^{l2}_1, ..., y^{l2}_n); (b) training of transition probability matrix
A(ϕ̂^{l2}); (c) training of Gaussian observation density means µ(ϕ̂^{l2}); (d) training of
Gaussian observation density variances σ²(ϕ̂^{l2}); (e) data histogram and observation
distribution learned by the model. Although the learned model cannot be directly compared to the
original model, these graphs show that the model parameters converge.
Figure 3.10: State sequence comparison between the generative HMM ϕ^u (top panel) and the
learned HMM ϕ̂^u (bottom panel). As can be seen from the figure, state 1 in model ϕ̂^u corresponds
to state 3 in model ϕ^u, state 2 in ϕ̂^u corresponds to state 1 in ϕ^u, and state 3 in ϕ̂^u
corresponds to state 2 in ϕ^u.
Table 3.1: Average classification accuracy for the learned cascade model ϕ̂ over 50 simulation
runs. The number in parentheses is the standard deviation.

    Model      Maximum-Likelihood Classification    Viterbi Classification
    ϕ̂^{l1}            92.6% (3.5%)                      91.9% (4.5%)
    ϕ̂^{l2}            94.2% (3.8%)                      94.5% (4.2%)
    ϕ̂^u               91.1% (3.3%)                      95.3% (3.7%)
and then measuring the classification accuracy of ϕ̂^u against the original sequences generated by
ϕ^u. Figure 3.10 shows a plot with a portion of the state sequence generated by ϕ^u, along with
the same portion recognized by ϕ̂^u. As can be surmised from the figure, each state in the original
model was shifted "up" one state in the learned model. A similar analysis can be done for the
lower-level HMMs/switching HMMs in each model. Using this correspondence, we ran the generative
model for 10 000 iterations and measured the accuracy of the trained model. We repeated this
experiment for 50 training episodes of 50 000 iterations each. Means and standard deviations for
model accuracy are summarized in Table 3.1. Maximum-likelihood classification was based on the
forward filters described in the previous chapter. For Viterbi classification, we calculated the
state sequence backwards at appropriate times according to the algorithm presented in Section C.3
of Appendix C. Specifically, if the backpointers ψ_n all pointed to the same state at time n − 1, we
set a break and calculated the most likely sequence back to the previous break, or back to the
beginning of the sequence if there was no previous break. This sequence was then compared with
the generated state sequence to produce the second column in Table 3.1.
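The break-and-trace-back decoding just described can be sketched as follows: run the Viterbi recursion forward and, whenever all backpointers agree on the previous state, the path up to that point is fixed and can be decoded immediately. The toy discrete-observation model and the `online_viterbi` helper below are hypothetical, not the dissertation's implementation.

```python
import numpy as np

def online_viterbi(ys, A, B):
    """Viterbi in log space; emit a decoded segment at each backpointer break."""
    logA = np.log(A)
    delta = np.log(np.ones(A.shape[0]) / A.shape[0]) + np.log(B[:, ys[0]])
    psis, decoded, last_break = [], [], 0
    for n in range(1, len(ys)):
        scores = delta[:, None] + logA          # scores[i, j]: i -> j
        psi = scores.argmax(axis=0)             # best predecessor of each state j
        delta = scores.max(axis=0) + np.log(B[:, ys[n]])
        psis.append(psi)
        if np.all(psi == psi[0]):               # all backpointers agree: set a break
            state = psi[0]                      # the agreed state at time n - 1
            seg = [state]
            for p in reversed(psis[last_break:len(psis) - 1]):
                state = p[state]                # trace back to the previous break
                seg.append(state)
            decoded.extend(reversed(seg))
            last_break = len(psis)
    return decoded

A = np.array([[0.95, 0.05], [0.05, 0.95]])      # hypothetical two-state model
B = np.array([[0.9, 0.1], [0.1, 0.9]])          # P(symbol | state)
path = online_viterbi([0, 0, 0, 1, 1, 1], A, B)
```

The appeal of this scheme for an on-line system is that segments of the most likely path are committed as soon as they are unambiguous, without waiting for the end of the observation stream.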
Note that for many of these runs, the HMMs in the cascade model did not all completely
converge after 50 000 iterations. In particular, because of our model setup, the observation
densities for ϕ̂^{l2} often took longer to converge for certain initializations. Despite this, the
model still proved to be a reasonable classifier at all levels, as indicated by Table 3.1.
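One simple way to compute the state correspondence used for Table 3.1 is a greedy match on the confusion matrix between true and estimated labels. The sketch below uses hypothetical sequences that mimic the "shifted up one" pattern of Figure 3.10; `match_states` is an illustrative helper, not the method the dissertation specifies.

```python
import numpy as np

def match_states(true_seq, est_seq, r):
    """Greedily pair each learned state with the true state it co-occurs with most."""
    C = np.zeros((r, r), dtype=int)
    for t, e in zip(true_seq, est_seq):
        C[t, e] += 1                         # confusion counts
    mapping = {}
    for _ in range(r):
        t, e = np.unravel_index(C.argmax(), C.shape)  # largest remaining cell
        mapping[int(e)] = int(t)
        C[t, :] = -1                         # retire matched row and column
        C[:, e] = -1
    return mapping

true_seq = [0, 0, 1, 1, 2, 2, 0, 1, 2]
est_seq  = [2, 2, 0, 0, 1, 1, 2, 0, 1]      # learned labels shifted "up" by one
m = match_states(true_seq, est_seq, 3)
acc = np.mean([m[e] == t for t, e in zip(true_seq, est_seq)])
```

After relabeling through the mapping, accuracy is just the fraction of time steps where the relabeled estimate matches the true state, which is how the percentages in Table 3.1 can be interpreted.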
3.4 Discussion
In this chapter we have presented a cascade hidden Markov model architecture, offered some in-
formal analysis concerning convergence of the model, and presented a Monte Carlo simulation of
the model. Since the cascade model cannot be used in a fully generative fashion, we proposed a
cascaded switching HMM in order to incorporate proper dependencies into our data, that is, to
give it structure at multiple time scales. The cascaded switching HMM is an interesting model
itself, and we hope in some future life to be able to study it in more detail. Initial study indicates
that a version of the RMLE could be derived for this model in a manner similar to the RMLE
derivation for the hidden semi-Markov model presented in Appendix D. That said, the cascade
model presented in this chapter offers a distinct advantage in simplicity.
We believe the discussion and results presented in this chapter justify the use of the cascade
model even in situations where the model does not match the underlying model of the system. The
simulation results above showed that the model can learn information about the structure of data
at multiple scales, and can use that information to make classification decisions. The model also
seems to be rather robust: in many of the simulation runs, one state of lower model ϕ̂^{l1} did not
converge to the original model within 50 000 iterations, and yet the upper model ϕ̂^u still
recognized the original state sequence from ϕ^u with over 90% accuracy.
The next chapter describes the application of this model to real world data, as a means for a
mobile robot to learn concepts.
CHAPTER 4
CASCADE OF HMMS AS AN
ASSOCIATIVE MEMORY
4.1 Introduction
Our original motivation for proposing the model described in Chapter 3 was to create a model able
to learn simple concepts, which, as we suggested in Chapter 1, are formed by associations within
and among information from a sensory-motor system. In this chapter, we will describe the use of
the cascade model from Chapter 3 for this purpose. Specifically, we will demonstrate the model’s
ability to learn concepts among features from visual and auditory streams as sensed by a mobile
robot.
4.2 Associative Learning of Language Using Robots
A number of researchers are using robotics to study language grounding and/or associative language
learning. Most of the work in this area has focused on the association of auditory and sensory infor-
mation, where the auditory information generally represents speech, and the sensory information
is generally visual information. We highlight some of this work below.
For association of speech and visual information, Roy [105] has proposed a model of grounded
language learning called cross-channel early lexical learning (CELL), in which speech provides noisy
and ambiguous labels for video, and vice versa. In this work, words are discovered by searching
for segments of speech which reliably predict co-occurring visual cues. Since these pairings are
extremely noisy, the technique used to find potential speech segments searches for matching sub-
segments of speech in matching visual contexts. Initial training used recorded speech from mothers
playing with their infants for auditory input, and static images of related objects for visual input.
Later, the system was incorporated into a real-time speech and vision system embodied in a robotic
arm. Notably, the system incrementally learns words, then a rudimentary grammar. It can also
generate spoken outputs from stored word prototypes.
Steels [106] and Steels and Kaplan [107] focus not on specific learning models, but on the in-
teraction between the robot and researcher. Steels presents the idea of language games, whereby a
person interacts with a robot for the purpose of teaching the robot words. For experiments, Steels
and Kaplan use an off-the-shelf speech recognizer to associate words and contextual information us-
ing simple instance-based learning algorithms. Our own experiments are similar to those described
in [106, 107].
For the association of words and general sensory information, Oates et al. [108] and Oates [109]
present a stochastic method for clustering words according to syntactic information, then separately
estimate the conditional probability that the word would be uttered given a set of generic sensor
readings from a mobile robot. They use their system to first associate written descriptions, and later
spoken descriptions, of the activities of a robot with the sensor readings. Later, Burns et al. [110]
proposed an information theoretic approach for learning similar associations with the same robot.
As mentioned in Section 1.3.2.4, two other members of our project have conducted research in
similar areas. Liu [51] developed a system in which the robot learned associations between words
and "pushes" (tactile inputs), by which it learned to understand spoken navigational commands.
Zhu and Levinson [52] proposed a method to learn a joint probability density function (JPDF)
representation of the relationship of visual information and a text label, for learning such concepts
as color, shape, and object name.
Although it is usually not their main focus, many other developmental robot projects include
aspects of language study in their work [11, 12, 16, 20, 46, 110, 111].
Our proposal to use an HMM cascade structure to model associative learning is novel in the
same way as we described in the last chapter—we specifically model the relationship between repre-
sentations in multiple modalities with an HMM, as opposed to simply learning a joint distribution
Figure 4.1: Concept learning scenario using a cascade of HMMs. This model corresponds to
Figure 1.6 on page 19, with the generic models replaced by HMMs.
between the two modalities, as is generally done in the work cited above. Our use of this model
appears next.
4.3 Concept Learning Scenario
Analysis of our model in Chapter 3 assumed that the data being analyzed came from the same
underlying source. In fact, unless our auditory and visual input streams are being produced by
an intelligent projector or R2-D2, it is unlikely that the data was produced in this manner. A
more likely scenario comes from Figure 4.1, which is derived from Figure 1.6 in Chapter 1. In this
scenario, both the robot and the person have a model of the world, which here is represented by
a cascade of HMMs. We assume that each model structurally allows the recognition of visual and
auditory information present in the world (the lower level models), and further, that concepts can
be inferred and understood from the sequence of discrete classifications of this auditory and visual
information (using the upper level model).
It is assumed that the boy’s model of the world is better or more complete than the robot’s
model and, therefore, that the goal of the robot is to learn the boy’s model of the world. To reach
this goal, the robot must try to garner information about each of the boy’s submodels. To learn
the boy’s visual submodel, the robot will use visual data obtained from the world and assume that
the boy’s model was learned from similar information. For learning the boy’s auditory submodel,
the robot will use the boy’s own “speech”, and to learn the boy’s concept model, the robot will
attempt to find a relationship between what the boy says and what the world presents visually.
4.3.1 Model description
Formally, the structure of our model is equivalent to the structure developed in Chapter 3, although
the flow of information through this structure may be different. For our scenario, assume that
our robot’s model of the world is a cascade model ˆϕrobot = ϕc, ϕa, ϕv, where ϕa and ϕv are
auditory and visual HMMs, respectively (corresponding to ϕl1 and ϕl2), and ϕc is a concept HMM
(corresponding to ϕu). Assume that the boy’s model of the world is a hybrid cascade model
ϕboy = ϕc, λa,ϕv, where λa is a switching HMM modeling audio information, as described in the
previous chapter, and the other submodels are visual and concept HMMs as before. Finally, assume
that the visual information presented by the world (e.g., the apple) is represented by a traditional
HMM ϕvis. In this scenario, ϕvis and ϕboy are fixed, and ϕv is the boy’s representation of ϕvis.
4.3.2 Model scenario
Figure 4.2 shows the model topology of the scenario we envision.

Figure 4.2: Model topology for robot concept learning. The topology of this diagram corresponds to the scenario presented in Figure 4.1. The lower model ϕvis is a model of the world producing visual outputs. The upper left model ϕboy recognizes this visual input and produces auditory output. The upper right model ϕ̂robot uses both the visual input produced by ϕvis and the auditory input produced by ϕboy, and trains its various submodels.

This scenario proceeds as follows:

1. The model ϕvis produces a stream of states x^vis_n and corresponding visual features y^vis_n. The visual features y^vis_n are accessible by both the boy and the robot. The stream of states may include such states as x^vis_n = APPLE and x^vis_n = NOTHING.

2. The boy uses ϕv to recognize this visual stream, producing estimated state sequence x^v_n.

3. Using only the visual portion of the joint audio-visual state pdfs in concept model ϕc, the boy "thinks" of the concept related to the visual input (i.e., chooses the most likely concept state x^c_n in ϕc corresponding to x^v_n).

4. The boy may choose, at random times, to "speak his mind." At these times, he uses the auditory observation pdf from state x^c_n to produce y^ca_n. This output becomes the switch for switching HMM λa, which produces output stream y^aud_n = y^a_n. It is assumed that the
switch is "on" long enough to produce meaningful output from λa. At other times, the model λa produces "silence" (i.e., x^a_n = SILENCE, and y^a_n represents this state appropriately).
5. The robot simultaneously recognizes and learns (clusters) class information from visual input stream y^vis_n with HMM ϕv, and auditory input stream y^aud_n with HMM ϕa. These models produce estimated state sequences x^v_n and x^a_n, respectively.

6. When both x^v_n and x^a_n have meaningful information (i.e., x^a_n ≠ SILENCE and x^v_n ≠ NOTHING), model ϕc both:

(a) updates (learns) using these inputs (i.e., it clusters common co-occurrences), and

(b) estimates x^c_n, its "thoughts" about the pair of inputs.

7. At other times, when only one of x^v_n and x^a_n has meaningful information, ϕc uses only the partial pdf associated with that input to estimate x^c_n, and the model is not updated.
When actually run on the robot, estimated state information from all of the robot models may be
used by other programs (e.g., the controller) to make decisions.
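The gating in steps 6 and 7 can be sketched as follows. This is a minimal illustration of the dispatch logic only, with a placeholder model class standing in for the concept HMM; it is not the dissertation's actual RMLE code.

```python
# Sketch of the concept-model gating in steps 6-7: learn only when both
# input streams carry meaningful information; otherwise estimate from the
# partial pdf of whichever stream is meaningful.
SILENCE, NOTHING = "SILENCE", "NOTHING"

class StubConceptModel:
    """Placeholder standing in for the concept HMM's update/estimate."""
    def __init__(self):
        self.updates = 0
    def update(self, x_v, x_a):
        self.updates += 1  # a real model would run one RMLE step here
    def estimate(self, visual=None, audio=None):
        return ("concept", visual, audio)

def concept_step(x_v, x_a, model):
    """One time step of the concept HMM, per steps 6 and 7 above."""
    if x_a != SILENCE and x_v != NOTHING:
        model.update(x_v, x_a)                        # step 6(a): learn
        return model.estimate(visual=x_v, audio=x_a)  # step 6(b): estimate
    if x_v != NOTHING:
        return model.estimate(visual=x_v)  # step 7: visual partial pdf only
    if x_a != SILENCE:
        return model.estimate(audio=x_a)   # step 7: audio partial pdf only
    return None                            # nothing meaningful this step
```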
4.3.3 Simulation results
Using the scenario outlined above, we ran a Monte Carlo simulation of the composite model. This
section outlines those results. The following parameters were used for the fixed models ϕvis and
ϕboy = {ϕc, λa, ϕv}. Let ϕvis be an HMM with Gaussian observations. Define its transition
probability matrix as

A(ϕvis) =
    [ 0.90 0.05 0.05 ]
    [ 0.04 0.95 0.01 ]
    [ 0.04 0.01 0.95 ],

and its Gaussian observation density parameters as

µ(ϕvis) = [ 0 7 9 ]′, σ²(ϕvis) = [ 1.0 0.7 0.6 ]′.
Let the boy’s visual model ϕv be a learned version of ϕvis, i.e., ϕv ≈ ϕvis.
For the boy's auditory switched HMM λa, let the set of transition probability matrices Am(λa),
1 ≤ m ≤ 3, be defined as

A1(λa) =
    [ 0.94 0.02 0.02 0.02 ]
    [ 0.94 0.02 0.02 0.02 ]
    [ 0.94 0.02 0.02 0.02 ]
    [ 0.94 0.02 0.02 0.02 ],

A2(λa) =
    [ 0.05 0.45 0.45 0.05 ]
    [ 0.01 0.90 0.08 0.01 ]
    [ 0.01 0.70 0.28 0.01 ]
    [ 0.05 0.45 0.45 0.05 ],

and

A3(λa) =
    [ 0.05 0.05 0.45 0.45 ]
    [ 0.05 0.05 0.45 0.45 ]
    [ 0.05 0.05 0.10 0.80 ]
    [ 0.01 0.01 0.08 0.90 ],

and let the Gaussian parameters µ(λa) and σ²(λa) be

µ(λa) = [ 0.0 3.0 5.0 7.0 ]′, σ²(λa) = [ 1.0 0.4 0.5 0.6 ]′.
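To make the switching behavior concrete, the following sketch (our own, assuming the transition matrices above and simple ancestral sampling) advances λa one step: the switch value selects which chain Am governs the transition, and the new state emits a Gaussian observation.

```python
import random

# Transition matrices A_m(λa) as given above; the switch value m selects
# which chain governs the state transition at each step.
A = {
    1: [[0.94, 0.02, 0.02, 0.02]] * 4,
    2: [[0.05, 0.45, 0.45, 0.05],
        [0.01, 0.90, 0.08, 0.01],
        [0.01, 0.70, 0.28, 0.01],
        [0.05, 0.45, 0.45, 0.05]],
    3: [[0.05, 0.05, 0.45, 0.45],
        [0.05, 0.05, 0.45, 0.45],
        [0.05, 0.05, 0.10, 0.80],
        [0.01, 0.01, 0.08, 0.90]],
}
MU = [0.0, 3.0, 5.0, 7.0]    # Gaussian means µ(λa)
VAR = [1.0, 0.4, 0.5, 0.6]   # Gaussian variances σ²(λa)

def step(state, switch, rng):
    """Advance λa one step under chain A[switch]; return (state, observation)."""
    row = A[switch][state]
    state = rng.choices(range(4), weights=row)[0]
    y = rng.gauss(MU[state], VAR[state] ** 0.5)  # gauss takes a std deviation
    return state, y
```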
Finally, let the boy's concept HMM ϕc be defined by

A(ϕc) =
    [ 0.90 0.05 0.05 ]
    [ 0.08 0.90 0.02 ]
    [ 0.08 0.02 0.90 ],

bv(ϕc) =
    [ 0.98 0.01 0.01 ]
    [ 0.02 0.90 0.08 ]
    [ 0.02 0.08 0.90 ],

and

ba(ϕc) =
    [ 0.96 0.02 0.02 ]
    [ 0.10 0.90 0.00 ]
    [ 0.10 0.00 0.90 ].

Note that bv(ϕc) represents a distribution over the states of ϕv, whereas ba(ϕc) represents a distribution over the selection of Markov chains Am(λa). These two distributions are not and cannot be used simultaneously.
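The role of a partial pdf can be illustrated with bv(ϕc) above: when only a visual classification is available, each concept state can be scored by its visual observation probability alone. This is our own minimal sketch, ignoring the transition dynamics that the full model also uses.

```python
# Visual partial pdf bv(ϕc): rows are concept states of ϕc, columns are
# visual classes recognized by ϕv (values taken from the text above).
B_V = [
    [0.98, 0.01, 0.01],
    [0.02, 0.90, 0.08],
    [0.02, 0.08, 0.90],
]

def likely_concept(visual_class):
    """Most likely concept state given only a visual classification,
    i.e., the argmax over rows of the visual partial pdf."""
    scores = [row[visual_class] for row in B_V]
    return scores.index(max(scores))
```

With the bias built into bv(ϕc), each visual class maps onto its own concept state, which is exactly the correlation the cascade is meant to learn.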
As in the cascade model simulation in Chapter 3, assume we know the number of states and
type of distribution for each of the models, so that ϕ̂robot = {ϕc, ϕa, ϕv} has (approximately) the
correct topology to learn the given models. (See Section 2.3.5 for a brief discussion on model order
approximation when the number of states is not known or easily discernible.) As in the previous
chapter, means and variances for the observation densities of ϕa and ϕv were initialized using
k-means initialization on the first 1000 outputs of the generative model. Gaussian noise with zero
mean and standard deviation one was then added to the initial means, and noise with zero mean
Figure 4.3: Parameter learning for model ϕc. (a) Running average of the log-likelihood of p_ϕu(y1, . . . , yn). (b) Training of transition probability matrix A(ϕu). (c) Training of observation probability matrix b^l1(ϕu). (d) Training of observation probability matrix b^l2(ϕu). Although the learned model cannot be directly compared to the original model, these graphs show that the model parameters converge.
and standard deviation 0.5 was added to the variances. For recursive maximum-likelihood training,
we again let the learning rate be εn = 0.006/n^0.2, where n is the iteration number, and we used a
smoothing history of k = 1000 (see Section 2.3.3).
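The training hyperparameters just described can be sketched as follows. The positive floor applied to the perturbed variances is our own guard against an invalid (nonpositive) variance, not something stated in the text.

```python
import random

def epsilon(n):
    """RMLE step size εn = 0.006 / n^0.2, for iteration number n ≥ 1."""
    return 0.006 / n ** 0.2

def noisy_init(means, variances, rng):
    """Perturb k-means initial estimates as described above: N(0, 1) noise
    on the means, N(0, 0.5²) noise on the variances (floored at a small
    positive value so each variance stays valid)."""
    m = [mu + rng.gauss(0.0, 1.0) for mu in means]
    v = [max(var + rng.gauss(0.0, 0.5), 1e-3) for var in variances]
    return m, v
```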
Figure 4.3 shows the progression of a training run for model ϕc. As before, Figure 4.3(a) gives
a running average of the log-likelihood, and the remaining subfigures show the progression of the
parameter values through time. As can be seen from the graphs, most of the parameters converge
quite rapidly. Parameters which converge more slowly, such as those in Figure 4.3(d), are those
that are tracking changes in lower models which have not yet converged. The convergence of ϕv
and ϕa can be seen in Figures 4.4 and 4.5. As before, since we used k-means initialization for
initializing the observation densities of ϕa and ϕv, these values started somewhat close to their
Figure 4.4: Training run output for model ϕv. (a) Running average of the log-likelihood of p_ϕl1(y^l1_1, . . . , y^l1_n). (b) Training of transition probability matrix A(ϕl1). (c) Training of Gaussian observation density mean µ(ϕl1). (d) Training of Gaussian observation density variance σ²(ϕl1). (e) Data histogram and observation distribution learned by the model. Since, in simulation, this model learns directly from data generated by ϕvis, we can compare the model parameters of these models directly. Original model parameters are indicated by the symbol at the right edge of the graphs.
Figure 4.5: Parameter learning for model ϕa. (a) Running average of the log-likelihood of p_ϕl2(y^l2_1, . . . , y^l2_n). (b) Training of transition probability matrix A(ϕl2). (c) Training of Gaussian observation density mean µ(ϕl2). (d) Training of Gaussian observation density variance σ²(ϕl2). (e) Data histogram and observation distribution learned by the model. Although the learned model cannot be directly compared to the original model, these graphs show that the model parameters converge.
Table 4.1: Average classification accuracy for the learned HMMs over 50 simulation runs. The number in parentheses is the standard deviation.

         Maximum-Likelihood Classification    Viterbi Classification
    ϕa   90.1% (3.7%)                         89.9% (4.2%)
    ϕv   97.9% (1.6%)                         99.1% (2.4%)
    ϕc   98.4% (1.1%)                         98.8% (1.1%)
optimum, although as noted, noise was added to this initialization. Because of this added noise
and the nature of the source model distributions, in some cases the variance parameters for the
Gaussian observation distributions did not converge to the correct values by 40 000 iterations.
We repeated this experiment for 50 training episodes of 50 000 iterations each, and used the
same method outlined in the previous chapter to measure model accuracy. These episodes are
summarized in Table 4.1. As before, the cascade model showed a high degree of robustness.
4.4 Robotic Experiments
The basic scenario for robotic experiments of the model is similar to the simulated scenario presented
in the last section: the robot and a person are looking at the same object, the person names the
object or briefly describes some aspect of it, and the robot, over time, learns the association between
that word or phrase and the visual features of the object. This scenario, while describing a necessary
aspect of learning in our robot, does not take into account the overall goals and work of the project
described in Chapter 1. Here we describe an experiment which better demonstrates these goals.
In our real scenario, our robot is wandering around a benign environment, and is instinctually
motivated to look for “interesting” things. We expect the following behaviors:
1. It will be attracted to objects, especially ones that it has not seen before, or not seen recently;
it will “play” with these objects, attempting to first pick them up, then knock them over.
2. It will be attracted by loud noises, turning toward them and assuming, e.g., that someone
wants to get its attention.
3. Using our proposed cascade model, it will
(a) learn to recognize the visual objects in its environment,
(b) learn to recognize distinct words spoken to it, and
(c) learn the concepts associated with the various words and objects.
4. Also using our HMM cascade, it will demonstrate that it recognizes these concepts by
(a) recognizing a word, choosing a corresponding concept, and finding an object which also
matches that concept, and
(b) recognizing an object and saying the name of a concept corresponding to that object.
The behaviors listed in numbers one and two above were first demonstrated by McClain [50]. The
demonstration described here builds on his work and on the work of others, including
• sound source localization research by Li and Levinson [31, 32],
• speech feature extraction and synthesis research by M. Kleffner [48], and
• visual feature extraction by R. S. Lin (unpublished).
The specific objects we are using in this demonstration are shown in Figure 4.6, and the words
and phrases we say are listed in Table 4.2. These words were chosen to test the learning of
concepts for specifically named objects (such as cat) as well as concepts for general categories (such
as animal). Although not necessary, the concepts we initially learn correspond directly to the words
and phrases listed in Table 4.2.
Because they pertain directly to our work, autonomous exploration and speech and visual feature
extraction are discussed below, with the details of both feature extraction algorithms appearing
in Appendix B. This discussion is followed by a description of the implementation of our HMM
cascade model for our robots in Section 4.4.3.
4.4.1 Finite state machine controller
The central component of the above experiment is a finite state machine (FSM) controller developed
by McClain [50] as a part of an autonomous exploration mode for our robot. This controller
continuously evaluates the state of the robot and its environment, and uses this information to
Figure 4.6: Objects used in our robot demonstration.
Table 4.2: List of words used in our robot demonstration.

    animal        ball
    cat           dog
    green ball    red ball
make decisions and produce specific types of behavior. For our experiment, we modified the state
machine and its related programs to use information from our associative memory when making
decisions, as well as to facilitate learning in our model. The FSM we are using is shown in Figure 4.7.
A description of each state is as follows:
1. Explore: look around for something interesting.
(a) If we see an interesting object (such as one we have not seen before), go to state 2.
(b) If we hear an interesting (i.e., loud) sound, go to state 6.
2. Found object: an object is visible.
(a) If it is far away, approach it, study what it looks like, and stay in state 2.
(b) If it is near, go to state 3.
3. Learn name: learn the name or feature of an object.
(a) If we hear something, repeat it and try to associate it with this object; stay in state 3.
(b) After a period of silence, go to state 4.
4. Play 1: play with the object.
(a) Approach and attempt to pick up the object; go to state 5.
5. Play 2: play with the object.
Figure 4.7: The robot's finite state machine controller. Values on the arcs indicate the inputs and corresponding behaviors when transitioning between states.
(a) Try to knock the object over; go back to state 1.
6. Interact: listen for known sounds.
(a) If we hear the name of an object we know, look for it; go to state 7.
(b) If we hear something we do not know, beep and stay in state 6.
(c) If we do not hear anything for a short period, go back to state 1.
7. Search: look for a particular object.
(a) If we have not found the object, keep looking, and stay in state 7.
(b) If we have not found the object after a long time, give up and go to state 1.
(c) If we find the desired object, say the name (if we know it), and go to state 2.
The role our HMM cascade associative memory plays changes depending on the state of the controller.
We describe these roles below in Section 4.4.3.
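The controller's transition logic can be transcribed as a simple lookup table. The state and event names below are paraphrased from the description above; they are our own labels, not identifiers from McClain's implementation.

```python
# Sketch of the FSM controller described above (states 1-7).
EXPLORE, FOUND, LEARN, PLAY1, PLAY2, INTERACT, SEARCH = range(1, 8)

TRANSITIONS = {
    (EXPLORE, "object_visible"): FOUND,
    (EXPLORE, "loud_sound"): INTERACT,
    (FOUND, "object_far"): FOUND,            # approach and keep studying
    (FOUND, "object_near"): LEARN,
    (LEARN, "speech"): LEARN,                # repeat and associate
    (LEARN, "silence_timeout"): PLAY1,
    (PLAY1, "picked_up"): PLAY2,
    (PLAY2, "knocked_over"): EXPLORE,
    (INTERACT, "known_word"): SEARCH,
    (INTERACT, "unknown_speech"): INTERACT,  # beep
    (INTERACT, "silence_timeout"): EXPLORE,
    (SEARCH, "still_looking"): SEARCH,
    (SEARCH, "timeout"): EXPLORE,            # give up
    (SEARCH, "found_object"): FOUND,         # say its name if known
}

def next_state(state, event):
    """Look up the next controller state; unmatched events leave it unchanged."""
    return TRANSITIONS.get((state, event), state)
```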
4.4.2 Sensory inputs
We are running this experiment on a real robot with real sensory inputs, so in addition to the FSM
controller, our associative memory needs features extracted from live speech and visual inputs.
For speech data analysis, we are extracting energy, voicing confidence, and a set of log-area ratios
(LARs) from a 16-kHz audio stream. This processing is based on work developed by Kleffner [48] for
speech imitation for the robot, and is described in Appendix B. Typically, around 8-12 LARs plus
pitch and voicing information can be used to synthesize a very accurate reproduction of the speech
signal. For our work, we are currently extracting three LARs, log energy, and voicing confidence
on consecutive 20-ms segments of audio, giving us a stream of length-five feature vectors at 50 Hz.
Despite the short length of this feature vector, these features are very representative of the speech
signal; using only the three extracted LARs and voicing information, we can still synthesize speech
that is intelligible.
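The arithmetic behind the 50-Hz feature rate can be written out as a small sketch; the constants are taken directly from the text above.

```python
SAMPLE_RATE = 16_000   # Hz audio stream
FRAME_MS = 20          # consecutive, non-overlapping 20-ms segments
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per segment
FEATURES_PER_FRAME = 5 # 3 log-area ratios + log energy + voicing confidence

def frame_count(num_samples):
    """Number of complete length-five feature vectors produced from a buffer."""
    return num_samples // FRAME_LEN
```

One second of audio thus yields 50 feature vectors, matching the stated 50-Hz feature rate.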
For visual data analysis, the current experiment is using a robust segmentation and feature
extraction algorithm developed by R. S. Lin (unpublished). The segmentation algorithm is based
on loopy belief propagation [112]. After image segmentation, the feature extractor presents a
length-10 visual feature vector for each object in an image, consisting of
1. a normalized length-eight color histogram,
2. the first moment of the object shape, and
3. the height/width ratio of the object.
These features appear at a rate of about 2 sets per second. Descriptions of the segmentation and
feature extraction algorithms are presented in Appendix B.
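A sketch of how the length-10 visual feature vector might be assembled from the three components listed above. This is illustrative only; the actual extractor is Lin's segmentation-based algorithm, whose details are in Appendix B.

```python
def visual_feature_vector(color_counts, first_moment, height, width):
    """Assemble the length-10 vector described above: a normalized
    8-bin color histogram, the first moment of the object shape, and
    the height/width ratio."""
    assert len(color_counts) == 8
    total = float(sum(color_counts)) or 1.0  # avoid division by zero
    hist = [c / total for c in color_counts]
    return hist + [first_moment, height / width]
```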
4.4.3 HMM cascade model setup
Our HMM cascade model is set up structurally similar to model ϕ̂robot in Figure 4.2, with some
modifications. The biggest change is in our audio model ϕa, which is actually a two-level
model with some additional processing, as shown in Figure 4.8. Conceptually, the lower model is
a phonetic model, and the upper model is a word model. As mentioned above, for auditory input
features we are using three log-area ratios, log energy, and voicing information. These features
are presented to our phonetic model, a 3-state HMM. The observation densities in each state of
this model were initialized from silence, voiced, and unvoiced auditory data features, respectively.
Transition probabilities were initialized uniformly, and the model was then trained with the RMLE
Figure 4.8: Auditory model used for speech recognition in our robot. Lower HMM ϕ̂aud is a phonetic model. Sequences of states from this model corresponding to words are converted to a histogram and normalized. This histogram and the word length comprise the word feature vector. This vector is quantized, and then presented to word HMM ϕ̂word, and used to estimate x̂^word_n.
n .
algorithm using features extracted from recorded speech of 20 sentences from the Harvard list of
phonetically balanced sentences [113], shown in Table 4.3. The training of some of the
parameters in this model is shown in Figure 4.9.
For the word recognizer, we needed some way of representing the features of variable length
words or phrases. We first made the assumption that only words or short phrases would be spoken,
i.e., that we would not need to parse full sentences. A voice activity detector (a component of
the speech imitation code) was used to determine the boundaries of these words or phrases. Using
these boundaries, we extracted the word/phrase length and calculated a normalized histogram of
the state sequence recognized by the phonetic HMM, giving us a length-4 word feature vector (i.e.,
the word length plus one histogram value for each of the three states of ϕaud). We then quantized
Figure 4.9: Parameter estimation for phonetic HMM ϕ̂aud. (a) Training of transition probability matrix A(ϕaud). (b), (c) Training of the Gaussian means µ(ϕaud) for states 1 and 2. (d), (e) Training of the Gaussian covariance matrices Σ(ϕaud) for states 1 and 2. Training of means and covariances of the Gaussian observation densities are shown for two of the three states in the model.
Table 4.3: Harvard phonetically balanced sentences. Features extracted from one wave file of each sentence were used to train the 3-state phonetic HMM.

List 1:
1. The birch canoe slid on the smooth planks.
2. Glue the sheet to the dark blue background.
3. It's easy to tell the depth of a well.
4. These days a chicken leg is a rare dish.
5. Rice is often served in round bowls.
6. The juice of lemons makes fine punch.
7. The box was thrown beside the parked truck.
8. The hogs were fed chopped corn and garbage.
9. Four hours of steady work faced us.
10. Large size in stockings is hard to sell.

List 2:
1. The boy was there when the sun rose.
2. A rod is used to catch pink salmon.
3. The source of the huge river is the clear spring.
4. Kick the ball straight and follow through.
5. Help the woman get back to her feet.
6. A pot of tea helps to pass the evening.
7. Smoky fires lack flame and heat.
8. The soft cushion broke the man's fall.
9. The salt breeze came across from the sea.
10. The girl at the booth sold fifty bonds.
each component of this feature vector into five bins, and the resulting quantized feature vector was
presented to the word HMM. The feature vector quantization bins were non-uniform; the cutoffs
were determined by dividing the sorted list of each feature value for our training set into five roughly
equal bins, as shown in Figure 4.10. The discrete observation densities for each state in the HMM
were initialized using 10 repetitions of each word or phrase in Table 4.2, giving one state per word.
Audio features were extracted from each training waveform, passed through and recognized by the
phonetic HMM, and then quantized. Transition probabilities were initialized uniformly, and the
whole word model was then trained for 80 epochs on the same 10 repetitions of each word using the
RMLE algorithm. The training of some of the parameters in this model is shown in Figure 4.11.
For the visual HMM ϕvis, we used a four-state HMM to recognize features from the objects
shown in Figure 4.6. As with the word model, we initialized the observation densities for the object
models from a collected data set. For each object, we obtained a feature vector for 200 images of
that object taken from multiple perspectives. These vectors were quantized as above before being
used to initialize the densities. Again, transition probabilities were initialized uniformly, and the
model was then trained for 10 epochs on the same data using RMLE. We used only 10 epochs
because the initial density estimates were already close to their optimal values, as can be seen
from the parameter training examples in Figure 4.12.
The third model (the concept model), ϕc, is a discrete HMM with observations covering the joint
state spaces of the audio and visual models. We initialized a model with six states, corresponding
Figure 4.10: Equalized quantization. A finite number of samples are drawn from an unknown probability distribution (represented here by a Gaussian distribution), sorted, and divided into equal groups. The resulting divisions between groups are used as the cutoffs for future quantization.
Figure 4.11: Parameter learning for word model ϕ̂word. (a) Training of transition probability matrix A(ϕword). (b)-(e) Training of dimension 1 of the discrete observation densities bjk(ϕword) for the first four words of the model. There were a total of six words, each with four observation dimensions (word length plus three state histograms), quantized into five discrete levels. Plotted are the probabilities of each quantization level.
Figure 4.12: Parameter learning for model ϕ̂vis. (a) Training of transition probability matrix A(ϕvis). (b)-(e) Training of dimension 1 (moment) of the discrete observation densities bjk(ϕvis) for the four objects learned by the model. Each object was represented by a 10-dimensional vector (moment, height/width ratio, and a length-8 color histogram), each quantized into five discrete levels. Plotted are the probabilities of each quantization level.
Figure 4.13: Recognition of visual representations and concepts. In this state, the robot is not listening for speech input, so model ϕ̂aud is disabled.
to the six words/phrases in our word list in Table 4.2. The transition probabilities were initialized
uniformly, and the observation probabilities were initialized by hand to bias them slightly toward
the desired concepts. For example, for the state we chose to correspond to "ball," the observation
probabilities corresponding to the visually recognized red and green balls were given slightly higher
probabilities than the observation probabilities corresponding to the cat and the dog, and the
observation probability corresponding to the word “ball” was given a slightly higher value than
those probabilities corresponding to other words.¹
Depending on the mode of the finite state machine, certain parts of the model are inactive.
Specifically, referring to the FSM in Figure 4.7, when in states 1, 2, 4, 5,
and 7, auditory input is ignored: the object HMM recognizes visual inputs, and the concept model
uses the marginal density corresponding to the states of the object HMM to determine its state.
This idea is presented in Figure 4.13. In state 6, where the robot is listening for speech input,
the opposite happens: visual input is ignored, the auditory model attempts to recognize spoken
words, and the state of the word model alone determines the state of the concept model, as seen in
¹This biasing is not strictly necessary, but helps our model converge in a reasonable amount of time. As with many modeling scenarios, using prior knowledge to initialize the model is common [114]. We note that Poritz used a similar type of bias when he conducted experiments on unsupervised learning of speech using HMMs [72].
Figure 4.14: Recognition of auditory representations and concepts. In this state, the robot is only listening for speech input, so model ϕ̂vis is disabled.
Figure 4.14. Finally, in state 3, both audio and visual inputs are present. All models are active, and
recognition and learning are done in the concept model with both inputs, as shown in Figure 4.15.
Note that learning is possible in both the auditory and visual models in any state where the model
of interest is active. For the experiment here, we chose not to enable learning in these models.
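The per-state model activation just described (visual only in controller states 1, 2, 4, 5, and 7; audio only in state 6; both inputs, with learning, in state 3) can be summarized in a short sketch:

```python
# Which submodels feed the concept HMM in each controller state
# (cf. Figures 4.13-4.15).
VISUAL_ONLY = {1, 2, 4, 5, 7}   # auditory input ignored
AUDIO_ONLY = {6}                # visual input ignored
BOTH = {3}                      # learn in the concept model

def concept_inputs(fsm_state, x_vis, x_aud):
    """Return (visual, audio, learn) inputs for the concept HMM."""
    if fsm_state in BOTH:
        return x_vis, x_aud, True
    if fsm_state in VISUAL_ONLY:
        return x_vis, None, False
    if fsm_state in AUDIO_ONLY:
        return None, x_aud, False
    raise ValueError(f"unknown FSM state {fsm_state}")
```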
4.4.4 Issues
There are a few miscellaneous issues we must deal with in our experiment, depending on the current
state of the FSM. The first issue is that multiple objects may be present within a scene. When this
happens, each visual object is presented in sequence to the visual HMM. In this way, the transition
probabilities in the visual HMM would come to represent information about the spatial relationship
between various objects, in that objects that are close to one another will frequently be presented
to the HMM sequentially.
Depending on the state of the FSM, one of these objects is identified as a target object. For
example, in state 2, the target object would be the object first identified as “interesting” in state
one. In subsequent iterations, the robot will remember and attempt to track this target object,
e.g., so that it can be played with later.
Figure 4.15: Recognition and learning using both auditory and visual information.
Because we have stereo cameras, we additionally have a correspondence problem to deal with.
Currently, at every iteration the model recognizes objects in each image separately, and then
correspondence is determined using the recognition labels (i.e., the recognized states of ϕv) for each
image. Objects which appear in only one of the images are currently ignored. We do not currently
handle the situation where there are multiple objects of the same visual class present.
A final potential issue is object occlusion, where only a portion of an object appears in an
image. As of right now, this has not been a serious issue for us. In the case where an object is
misclassified because it is occluded in one image, but fully visible in the other, correspondence is
not drawn between the two objects. If the robot is looking for this object, it will eventually find it
when it moves or turns its head. As the robot approaches an object, the bottom of the object may
also be cut off; in this case, we lower its head. Even for a partially occluded object, the model
has generally proven robust enough to do proper recognition. This issue may become
more important as we increase the number of objects.
4.4.5 Results
Our goal in this experiment is to show that the concept model ϕcon can be learned from a set of real
inputs. As described above, we initialized and trained the auditory and visual models off-line using
Table 4.4: Initial observation probabilities used by the concept HMM for visible objects. The horizontal axis refers to the concept class, and the vertical axis refers to the classified visual object.

                  animal  ball  cat   dog   green ball  red ball
    cat           0.30    0.20  0.40  0.30  0.15        0.15
    dog           0.30    0.20  0.30  0.40  0.15        0.15
    green ball    0.20    0.30  0.15  0.15  0.40        0.30
    red ball      0.20    0.30  0.15  0.15  0.30        0.40
Table 4.5: Initial observation probabilities used by the concept HMM for words. The horizontal axis refers to the concept class, and the vertical axis refers to the classified spoken word.

                  animal  ball  cat   dog   green ball  red ball
    "animal"      0.50    0.10  0.10  0.10  0.10        0.10
    "ball"        0.10    0.50  0.10  0.10  0.10        0.10
    "cat"         0.10    0.10  0.50  0.10  0.10        0.10
    "dog"         0.10    0.10  0.10  0.50  0.10        0.10
    "green ball"  0.10    0.10  0.10  0.10  0.50        0.10
    "red ball"    0.10    0.10  0.10  0.10  0.10        0.50
recorded auditory and visual features, respectively. Note that, even though the training occurred
off-line, we used recursive maximum-likelihood estimation to learn the model parameters, so this
training could be done online.
For the concept model, we initialized the model as described in Section 4.4.3 above, i.e., we
initially set all of the transition probabilities equal, and initialized the discrete observation
probabilities by hand so that they had a slight bias toward particular concepts. The actual initialization
we used is shown in Tables 4.4 and 4.5. The model was then trained using RMLE during the
simulation run. Specifically, when the FSM entered state 3, the robot would sit in front of a target
object. The visual model ϕv would continuously recognize this object, and the auditory model ϕa
would recognize words that were spoken into a close-talk microphone. When a word was spoken and
Figure 4.16: Illy learning about various objects. In this scenario, as Illy approaches various objects,she stops and waits for a verbal description consisting of short words or phrases. Over time, sheassociates these spoken words with the object.
recognized, the state xa of model ϕa corresponding to that word and the state xv of model ϕv were
presented to the concept model, and the model was updated according to the RMLE algorithm.
To speed up training, each input pair was presented 10 times whenever a word was recognized.
This process was repeated multiple times for each object as the robot wandered around and played
with its toys.2 Figure 4.16 shows a picture taken during this training.
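The initialization and the repeated presentation of recognized state pairs can be sketched as follows. This is a minimal illustration, not the actual robot code; `update_fn` stands in for the RMLE parameter update, and the matrix construction mirrors Table 4.5:

```python
import numpy as np

concepts = ["animal", "ball", "cat", "dog", "green ball", "red ball"]
n = len(concepts)

# Uniform initial transition probabilities, as described in Section 4.4.3.
A = np.full((n, n), 1.0 / n)

# Initial word observation probabilities of Table 4.5: 0.50 for the matching
# concept, 0.10 elsewhere (rows: classified words, columns: concept classes).
B_word = np.full((n, n), 0.10) + 0.40 * np.eye(n)

def present(update_fn, word_state, visual_state, repeats=10):
    """Present a recognized (word, visual) state pair to the concept model
    `repeats` times, mirroring the 10-fold presentation described above."""
    for _ in range(repeats):
        update_fn(word_state, visual_state)
```

Each row of `B_word` sums to one (0.50 + 5 × 0.10), so the matrix is a valid discrete observation density from the start.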
Figure 4.17 shows the change of some of the parameters of the concept model as the model is
trained. For the training run shown here, we ran the robot for about 30 min. The final trained
transition and observation probabilities are shown in Tables 4.6, 4.7, and 4.8.
2Because we were slightly impatient to get results, the robot was not actually allowed to play with its toys duringthe training run; it could only look at them and hear their names.
[Figure 4.17: five plots, each showing probability (vertical axis, 0 to 1) versus training iterations
(horizontal axis, 100 to 800): (a) training of transition probability matrix A(ϕc); (b) training of
the visual input of the discrete observation density (ϕ̄c) for the first concept (“animal”); (c) the
same for the second concept (“ball”); (d) training of the auditory input of the discrete observation
density (ϕ̄c) for the first concept (“animal”); (e) the same for the second concept (“ball”).]

Figure 4.17: Parameter learning for model ϕcon. Discrete observation density plots are shown for
the first dimension (moment) of the observation densities for the four objects learned by the model.
Each object was represented by a 10-dimensional vector (moment, height/width ratio, and a length-8
color histogram), with each dimension quantized into five discrete levels. Plotted above are the
probabilities of each quantization level.
Table 4.6: Trained transition probabilities for the concept HMM. These values were initialized
uniformly (i.e., all values started at 1/6).

             animal    ball     cat      dog    green ball  red ball
animal       0.4211   0.0856   0.1670   0.1840    0.0699     0.0723
ball         0.0723   0.4760   0.0597   0.0721    0.1659     0.1540
cat          0.1931   0.1017   0.3717   0.1479    0.0925     0.0931
dog          0.2023   0.0911   0.1307   0.4115    0.0776     0.0867
green ball   0.1082   0.2142   0.1002   0.1051    0.3321     0.1401
red ball     0.1105   0.1951   0.1026   0.1186    0.1434     0.3298
Table 4.7: Trained observation probabilities used by the concept HMM for visible objects. The
horizontal axis refers to the concept class, and the vertical axis refers to the classified visual object.

             animal    ball     cat      dog    green ball  red ball
cat          0.4086   0.0761   0.5412   0.3193    0.0996     0.1041
dog          0.3987   0.0744   0.2665   0.5077    0.0974     0.1027
green ball   0.0957   0.4215   0.0963   0.0603    0.5232     0.2777
red ball     0.0970   0.4280   0.0960   0.1127    0.2799     0.5155
Table 4.8: Trained observation probabilities used by the concept HMM for words. The horizontal
axis refers to the concept class, and the vertical axis refers to the classified spoken word.

               animal    ball     cat      dog    green ball  red ball
“animal”       0.6728   0.0530   0.1977   0.2134    0.0576     0.0605
“ball”         0.0660   0.7088   0.0572   0.0509    0.2166     0.1896
“cat”          0.0769   0.0268   0.5738   0.0575    0.0385     0.0398
“dog”          0.1112   0.0735   0.1022   0.6194    0.0647     0.0882
“green ball”   0.0396   0.0699   0.0369   0.0306    0.5550     0.0722
“red ball”     0.0334   0.0681   0.0322   0.0282    0.0676     0.5496
4.4.6 Discussion
The results shown here indicate the long-term capabilities of the model. Specifically,
the model had not entirely converged after 30 min, but the parameter values were moving in a
direction that indicated convergence to a useful state.
Trained transition probabilities in Table 4.6 indicate (1) a general affinity for “thinking” of the
same object at consecutive time steps (as indicated by high diagonal values), and (2) a slightly
smaller but discernible relationship between related classifications (e.g., the animal state was more
likely to be followed by a cat or dog state than any of the other states). These transition probabilities
strongly reflect the order of words presented to the model, which is reasonable considering (1) we
only trained the model when a word was present, (2) we often repeated the same word consecutively,
and (3) when we did not repeat a word consecutively, we often spoke another word related to the
same object.
For both auditory and visual inputs, the observation probabilities for each state in the concept
HMM were biased slightly at the beginning of training toward a particular outcome. For the
observation probabilities learned through the end of training, those probabilities referring to visible
objects (in Table 4.7) are the more interesting of the two. For example, the concept ball initially
corresponded to a visual representation of the red or green ball with probabilities 0.3 and 0.3, and
to a visual representation of the cat or dog with probabilities 0.2 and 0.2. By the end of training,
each of these initial biases had strengthened. Taking the ball example again, the
final observation probabilities for the red ball and green ball were 0.43 and 0.42, respectively, and
the observation probabilities for the visual representations of cat and dog went down accordingly.
The same was true for other observation probabilities.
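This strengthening can be checked directly from the tables; here is a small sketch, with the values for the ball concept copied from Tables 4.4 and 4.7 (rows ordered cat, dog, green ball, red ball):

```python
# Probability of each visual object under the ball concept:
# initial values (Table 4.4) vs. trained values (Table 4.7).
initial = [0.20, 0.20, 0.30, 0.30]
trained = [0.0761, 0.0744, 0.4215, 0.4280]

# The two initially favored entries (the balls) grew; the others shrank.
grew = [t > i for i, t in zip(initial, trained)]
print(grew)  # [False, False, True, True]
```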
The results here indicate that our HMM cascade model can learn a set of concepts using fea-
tures extracted from live auditory and visual inputs measured by a mobile robot exploring its
environment.3 This learned information can then be used by the robot’s controller module to make
important behavioral decisions.
3The auditory inputs in this experiment were presented using a boomset microphone. However, in theory, nothingprevents us from using microphones on the robot, although we would have a noisier signal.
4.5 Conclusion
This chapter has discussed the use of our cascade of HMMs as an associative memory. First,
simulation results representing a real-world scenario indicated that this model is viable for learning
associations among concurrent stationary regions of multiple input streams, where each of these
stationary regions is modeled by a state in a hidden Markov model. Next, a live version of this
scenario was run on the robot, whereby features were extracted from auditory and visual streams,
classified by an HMM, and these classifications were then used as input to a concept HMM for training
and additional classification.
The robotic implementation of our HMM cascade model presented here is a proof of concept for
an important idea. Specifically, we are able to take noisy, real-world analog inputs, convert them
to symbols (by classifying them), and present them to a controller for use in making important
decisions (for example, whether to approach and play with a particular toy, or look for another).
In other words, our robot is making symbolic decisions based on discrete representations of the real
world around it. In addition, when classifying and learning about real-world inputs, the model
learns to associate related auditory and visual information with the same (symbolic) concept. The
model is learned online using a robust maximum-likelihood estimator.
The work described in this chapter explored the case where each concept in our concept HMM
corresponded to exactly one word, though potentially multiple visual objects. An interesting future
experiment would be to learn concepts which could refer to both multiple words and multiple
objects. Another obvious though challenging extension to this work would be to attempt to grow
the cascade model as new visual objects or words are presented to the robot.
CHAPTER 5
CONCLUSION
5.1 Summary
The main thrust of this dissertation has been to propose and analyze the use of HMMs in a cascade
architecture, as a means of extracting meaning from information available in multiple input streams.
Our motivation for creating this model was based on our understanding of how people, especially
children, learn meaning. Fundamentally, we believe all of our understanding of the world is based on
information from our senses. Some of this understanding is symbolic, or conceptual, as expressed
especially by language. These symbolic concepts have particular representations in the various
senses, and are related through an underlying spatio-temporal structure. Working backwards, we
believe that concepts are learned by associating, from multiple senses, representations of information
that seem to be related spatially or temporally. The model presented here is our first attempt at
this goal.
For our implementation, we chose to work with stochastic models. This choice was based
on a number of motivating factors, including the well known theory of these models and their
close relationship with optimal Bayes classifiers. One of the most interesting motivating factors,
however, was highlighted in a very recent neuroscience article on early language acquisition [115].
The article states that, according to recent research, infants “use computational strategies to detect
the statistical and prosodic patterns in language input” [115, p. 831]. Thus, our use of stochastic
models for a similar purpose seems particularly suitable.
Our particular choice to use and extend HMMs for our stochastic model stems from their
inclusion of a notion of time and sequence in the model definition. While more expressive stochas-
tic models exist, HMMs are currently the most feasible for our project, both conceptually and
computationally. We found that we could run a cascade of small models (3-10 states, 5-10 dimen-
sional observations) in real time, including model updates, at around 50 iterations per second on
a 2.2-GHz desktop machine, although processing audio and video features on the same machine
significantly reduced this rate. Further computational optimizations and more advanced hardware
will make much more complex models computationally feasible. The relatively recent development
of robust, iterative learning algorithms for these models also contributed to their suitability for our
application.
Our HMM cascade model itself is robust in both theory and simulation. Since each submodel of
the cascade model is itself an HMM, and since we are updating the model parameters using recursive
maximum-likelihood estimation, each submodel will (under appropriate conditions) converge with
probability 1 to the best stochastic representation of the data, as the number of iterations goes to
infinity, even if that data was not produced by an HMM (and, to our knowledge, no real source of
data is produced by such a model). Of course, convergence in the limit does not necessarily lead to
convergence with finite amounts of data, so we provided simulations to show that in practice, the
cascade does converge to something useful in a reasonable amount of time.
The original motivation for this research was to implement, for our robot, an associative memory
for learning the symbolic concepts mentioned above. The model is currently implemented and
running in our robot, and has worked very well. We have been able to run the model as part of
a demonstration, learn concepts from auditory and visual cues in the environment, and use these
concepts to make decisions. An important perspective on this simple statement is that our model
converts analog inputs to discrete symbols, allowing the robot’s controller to make decisions
symbolically using discrete representations of the environment. Moreover, these symbols form the
basis needed for more complex symbolic manipulation, such as language.
5.2 Insights and Future Directions
There are a few additional insights we have gained while working on this project. Some of these
have been outlined in the course of this dissertation, and have become part of our premises. Others
are purely technical, but just as interesting. Often they suggest avenues for further research. We
highlight a couple of these insights below.
5.2.1 Derivation of recursive maximum-likelihood estimation algorithms
One technical insight we have gained concerns the development of RMLE for HMMs. The basic
derivation starts by writing the likelihood function for a sequence of observations 〈y1, . . . , yn〉 for a
model ϕ as
p_n(y_1, . . . , y_n) = ∏_{i=1}^{n} b(y_i; ϕ)′ u_i(ϕ),        (5.1)

where b(y_i; ϕ) is a vector of likelihoods for each class in the model, and u_i(ϕ) is a vector of prior
probabilities for each class in the model (see Section 2.3.1 and Appendix E). The derivation then
proceeds by taking the log of this function, which turns it into a sum of logs, and then uses the
partial derivatives of the last term in the sum, log[b(yn;ϕ)′un(ϕ)], to update the model parameters.
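To make the procedure concrete, here is a minimal sketch of one step of this recursion for a discrete-observation HMM. The filter update is standard; the parameter update shown is deliberately simplified, ascending only the last term of the log-likelihood with respect to the observation probabilities while treating the filter as constant (the full RMLE algorithm also propagates the derivatives of u_n and projects onto the constraint set):

```python
import numpy as np

def filter_step(u, A, B, y):
    """One step of the HMM prediction filter.
    u: prior state probabilities u_n; A: transition matrix (states x states);
    B: observation probabilities (states x symbols); y: observed symbol index.
    Returns u_{n+1} and the incremental log-likelihood log[b(y_n)' u_n]."""
    b = B[:, y]                    # b(y_n; phi): likelihood of y under each state
    like = b @ u                   # b(y_n; phi)' u_n(phi)
    u_next = A.T @ (b * u) / like  # normalized one-step prediction
    return u_next, np.log(like)

def obs_update(u, B, y, eps=0.05):
    """Simplified gradient step on the observation probabilities:
    d/dB[i, y] log(b(y)' u) = u[i] / (b(y)' u), followed by a small floor
    and renormalization of each row back onto the probability simplex."""
    grad = np.zeros_like(B)
    grad[:, y] = u / (B[:, y] @ u)
    B = np.clip(B + eps * grad, 1e-3, None)
    return B / B.sum(axis=1, keepdims=True)
```

Running `filter_step` and `obs_update` alternately over an observation stream gives the flavor of the recursion: each observation both advances the filter and nudges the parameters uphill on the incremental log-likelihood.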
This procedure is quite general, which implies that a version of RMLE could be derived for
almost any model for which the likelihood of a sequence of observations can be written in the
form of Equation (5.1). As noted before, we completed such a derivation for hidden semi-Markov
models, which appears in Appendix D. In addition, although the derivation does not appear here,
the likelihood of a set of observations for the switching HMM described in Section 3.3.2 can be
written in the form of Equation (5.1), implying that a similar derivation is possible. We believe
this technique may also be applicable to switching state-space models. Of course, there may be
other restrictions on the form of the model for convergence to hold, but we feel this is a direction
worth exploring.
5.2.2 Generative modeling
As highlighted in Chapter 3, our HMM cascade cannot generate outputs that depend on all model
parameters. In particular, this implies that, even though we can use our model to learn sequences
of concepts involving auditory and visual information, we cannot, for example, directly turn the
model outward and produce equivalent auditory output from a concept or sequence of concepts.
(In this discussion, we will temporarily ignore the fact that the speech produced would probably
not be intelligible.)
At first glance, it would seem that humans do not do this either. For example, our ears do basic
spectral processing of auditory signals, which is then processed by our brain. When we produce
speech output, however, the signal is not simply reversed and sent back out through our ears!
Instead, a set of signals controlling the muscles in the mouth, vocal cords, and diaphragm create
the desired output of the system. A similar discussion can ensue concerning visual inputs. That
said, we can, at the very least, imagine sounds and pictures inside of our head, so we can consider
the models we use for recognizing speech and images as generative models which can reproduce
speech and images inside our head.
That our HMM cascade is not generative in the strictest sense does not diminish its recognition
capabilities, but it may imply that other models may be more appropriate. In particular, the
cascaded switching HMM of Chapter 3 is a fully generative model. However, learning and inference
in this model would be considerably more complicated, both conceptually and computationally, and
therefore it may not yet be practical to implement on a robot. It would, however, be interesting to
study this and similar models more carefully.
5.2.3 A language-learning robot
The ultimate goal of our project is the creation of a language learning and understanding robot.
The work in this dissertation is a significant step toward that goal. In particular, we have offered
a mechanism for creating an internal symbolic representation of the outside world, which can be
used as the basis for decision making and more complex symbolic processing.
As a practical matter, some good engineering on the robotic implementation presented in this
dissertation would improve the quality of any future experiments. In particular, the speech pro-
cessing is currently somewhat error prone.
As of right now, this work is used as the basis for decision making in a finite state machine
controller. While this controller suffices, the types of behavioral decisions it makes are currently
hard-coded. A very useful area of research would be to study and implement a controller which can
learn behaviors based on previous experience using reinforcement learning [41]. Some work related
to this has already been conducted by Zhu and Levinson [49] for our own project, although it is not
incorporated into our current work. Another related avenue of behavior learning research would be
to study the use of partially observable Markov decision processes (POMDPs) [116] to replace the
controller.
With regard to language learning, a medium- to long-term goal would be to learn some more
complex spatial or temporal relationships among the concepts currently being learned semantically.
Especially as computing power increases, more complex stochastic grammars could augment or
replace the HMMs in the cascade model presented here. In the short term, S. Levinson has proposed
learning simple two-word grammars, which could be studied and built on top of the models presented
herein. Even this seemingly simple task presents significant challenges for the next generation of
intelligent robotics researchers.
5.3 Final Words
The work presented in this dissertation is part of an ambitious project which draws ideas from
a large number of areas. At times the sheer number of fields touched and vastness of knowledge
required to simply engineer the project has been very daunting and frustrating. At the same time,
the broad understanding and insight gained through this process has been extremely rewarding.
We are proud of our contributions, and hope that they help advance this project and its related
fields.
APPENDIX A
HARDWARE AND SYSTEM-LEVEL
SOFTWARE SPECIFICATIONS
A.1 Introduction
Over the years, our research has required the use of three different robots, as well as various comput-
ers and other hardware. In this chapter we list the specifications of recent hardware (the hardware
used in Illy and Norbert), as well as information about configuration and installed software.
A.2 Robots
The base unit for both Illy and Norbert is an Arrick Robotics second generation Trilobot.
A.2.1 Specifications
These specifications were taken from the Arrick Robotics website (http://www.robotics.com/
trilobot/).
Features:
• 12”×12”×12” body dimensions, 11 pounds.
• Dual differential drive with DC gear motors and encoders.
• Maximum speed: 10” per second.
• Surfaces: tile, concrete, low pile carpet, moderate bumps and inclines.
• 2 pound payload capacity for radio data link, embedded PC, etc.
• Thumb screws make removing panels easy.
• Removable battery pack uses 8 standard D-cells.
• Pan/tilt head positions sensors quickly.
• Stationary mast contains additional sensors including a digital compass.
• Gripper can grasp and lift cans and balls.
• Programmable control from user’s desktop PC or on-board embedded PC.
• Infrared communications from TV remote control and other Trilobots.
• RC receiver port allows control from an RC transmitter.
• PC-style joystick control port.
• 2 line x 16 character liquid crystal display.
• 16-key keypad.
• Sound effects and rudimentary speech (optional speech synthesizer).
• Sound recording and playback.
• Expansion port allows unlimited possibilities.
• Safe, low voltage system.
Sensors:
• 8 whiskers surround the base.
• 2 degree electronic compass.
• Sonar range finder can detect objects and their distance.
• Passive Infrared Motion Detector (PIR) detects movement of people.
• 4 light level sensors detect direction and intensity of light.
• Digital temperature sensor.
• Tilt sensors detect inclines in all directions.
• Water sensor detects puddles.
• Sound can be detected and stored.
• Motor speed and distance using optical encoders.
• Battery voltage can be monitored.
• Infrared detector can receive communications from remote control.
• Infrared emitters can communicate with other Trilobots.
At present, we do not use (or plan to use) all of these sensors, only those which fit our need for
anthropomorphism.
A.2.2 Configuration
Below we describe modifications to the base hardware for our two main robots, Illy and Norbert.
A.2.2.1 Illy
For Illy, in order to mount the small form-factor PC (described below), we added a support structure
around and above her head. This structure interferes with some of the sensors on the head mast
(e.g., compass), although we have no current plans to use most of these sensors. The structure also
required that we remove the handle used to pick up and carry Illy.
The original hardware had support for a single video camera on the head. We chose instead to
mount a pair of miniature cameras where the head is normally located, and attach the head to the
antenna on top of the aforementioned structure.
Normally, when the robot is turned on, it comes up in terminal mode, whereby it is controlled
by the control panel. We control the robot via a serial interface, so we set the startup mode to
“Command Mode.”
A.2.2.2 Norbert
In Norbert, we purchased a smaller computer that fits internally in the robot’s storage bay, but
which required us to move the control panel out a couple of inches. As with Illy, we again added
stereo cameras, but mounted them above the original head. We also bring the robot up in “Command
Mode.”
A.3 Computers
As mentioned above, each of the robots contains a small form factor computer on board, mainly
for the purpose of collecting images and sounds. These computers and related hardware are listed
in Table A.1. In addition to the robots, we run our demonstrations on two Linux workstations,
described in Table A.2.
Table A.1: Computing hardware mounted on robots.

Illy:
• Ampro Little Board P5x (discontinued)
• 266 MHz Pentium processor (low power)
• PC/104+ (PCI/ISA) expansion slot
• 256 MB RAM
• 11 Mbs wireless Ethernet (RadioLAN proprietary, external; connected to on-board Ethernet)
• HD: 64 MB compact flash
• OS: Debian GNU/Linux 3.0 (root file system mounted over NFS)
• Expansion cards (PC/104+): sound card: MicroSpace MSMM104; framegrabber card: Sensory 311 (×2)

Norbert:
• Digital-Logic MSM-P3 SEN
• 700 MHz Pentium III processor
• PC/104+ (PCI/ISA) expansion slot
• 128 MB RAM
• 11 Mbs wireless (802.11b)
• HD: 1 GB IBM Microdrive
• OS: Debian GNU/Linux 3.0
• Expansion cards (PC/104+): sound card: MicroSpace MSMM104; framegrabber card: Sensory 311 (×2)
Table A.2: Computer workstations.

Hal:
• Dell Opteron
• 866 MHz Pentium III processor
• 512 MB RAM
• 10 Mbs Ethernet
• HD: 20 GB
• OS: Debian GNU/Linux 3.0

Sal:
• Champaign Computer
• 2.2 GHz Pentium IV processor
• 512 MB RAM
• 10 Mbs Ethernet
• HD: 40 GB
• OS: Debian GNU/Linux 3.0
APPENDIX B
MOBILE ROBOT SOFTWARE
B.1 Introduction
The work described earlier in this dissertation depends greatly on a suite of software developed by
various members of our group over the years. This appendix describes some of the software relevant
to this dissertation. Specifically, Section B.2 describes a distributed computing and communications
system fundamental to all research currently done on our robots. I did almost all of the design and
most of the implementation of this system.
Following the description of this system, Section B.3 describes a speech feature extraction
algorithm developed by M. Kleffner, and Section B.4 describes an object segmentation and feature
extraction algorithm developed by R. S. Lin. Both of these systems are the basis of features used
by the robot implementation of our cascade of HMMs, described in Chapter 4.
B.2 Distributed Computing and Communication System
We want to provide our robot with functional equivalents for much of the sensory-motor and
decision-making periphery in humans. In addition to the need for hardware equivalents for such
organs as eyes and ears, we need computational equivalents for components of the cognitive
framework—sensory processing, learning and memory, decision making, and outputs. Most com-
puting modules should run independently, and because of hardware constraints, may not even be
run on the same system. This section describes the software framework we have developed for
communication among the various modules.
B.2.1 System design
B.2.1.1 System modules
Our group has developed various processing and learning modules for our robotic system. Currently,
the modules used in our main demonstrations include (with attribution to the developer):
1. Audio/video servers (K. Squire) handle acquisition of stereo audio and video data on the
robot.
2. Sound source localization (D. Li) determines the direction from which sounds are coming.
3. Visual processing and object recognition (R. S. Lin) processes visual information in order to
find “interesting” objects.
4. Central memory (M. McClain) stores state information about the world.
5. Decision making and navigation (M. McClain) provides the finite state machine controller and
autonomous navigation code.
6. Speech output (M. Kleffner/M. McClain) speaks short phrases.
7. Control server (K. Squire) handles direct control of the robot hardware.
8. HMM-based associative memory (K. Squire) provides basic learning and recognition of au-
dio/video semantic information.
These modules are all connected together to construct the cognitive cycle framework shown in
Figure 1.1 (p. 5). Our main concern here is passing information among these components.
B.2.1.2 Implementation issues
There are a number of implementation issues we need to consider. In particular, our biggest
limitation is hardware. Power on board the robot is limited, and the particular robot we have
chosen to work with has little space for holding additional equipment. Thus, on-board processing
is limited, and we must shift much of the processing to other computers. We are also restricted to
a relatively low-bandwidth wireless link, which limits the amount of data we can transmit. Despite
these restrictions, we still desired to meet a goal of iterating through the complete cognitive cycle
three to five times per second.
The actual implementation of most of the modules in our robot is described elsewhere [31, 32,
48, 50]. In the next section we will describe the implementation of the communications framework
and related modules (audio, video, and control servers).
B.2.2 Implementation
Below we enumerate a number of requirements for the communications framework:
1. The framework should allow multiple modules to access the same data simultaneously
(e.g., speech audio processing and sound source localization).
2. Modules should have near real-time access to acquired or processed data.
3. Data access should be transparent, even if the module and data source are on different systems.
4. The interface used to access the data should be simple and consistent.
At the time we began this project, we found no system that met these needs perfectly. A
description of the system we developed follows.
B.2.2.1 IServer (audio/video/data/control server)
The IServer program is a general purpose server for facilitating and coordinating transparent
access to raw and processed data, as well as allowing robotic control. Here we will describe how
it is used for data acquisition and distribution, its usage, and some general comments about the
system.
Data acquisition. The data acquisition component deals with acquiring data (for example, raw
or processed audio or video) and distributing it to other modules that might need it. Here is how
it works.
For a particular data source, the program sets up a ring buffer in shared memory. There are
two types of processes which access this ring buffer:
[Figure B.1 diagram: a Sound Card source fills one segment of the audio ring buffer (“filling”)
while Speech Recognition and Sound Source Location sinks read “full” segments.]
Figure B.1: Audio ring buffer. This diagram shows how a source may be writing to one segment of the buffer, while multiple sinks may be reading from another segment.
1. A source process fetches (or creates) the raw data and writes it to the ring buffer.
2. A sink process reads and processes the data.
One example of a source process is one which reads audio data from the sound card and writes it to
the ring buffer. An example of a sink process would be a sound source localization
program, which accesses the audio data and uses it to determine the direction a sound is coming
from. This setup can be seen in Figure B.1.
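A thread-based sketch of this setup follows. The actual system uses shared memory and semaphores so that sources and sinks can be separate processes; the class and method names here are illustrative only:

```python
import threading

class RingBuffer:
    """Segmented ring buffer: a source writes segments in sequence while
    multiple sinks read them. Each segment is guarded by its own lock, so
    a segment is never read while it is being filled (and vice versa)."""

    def __init__(self, nseg=8):
        self.segments = [None] * nseg
        self.locks = [threading.Lock() for _ in range(nseg)]
        self.seq = 0                     # sequence number of the next write

    def write(self, data):
        """Source side: fill the next segment in sequence."""
        i = self.seq % len(self.segments)
        with self.locks[i]:
            self.segments[i] = (self.seq, data)
        self.seq += 1

    def read(self, seq):
        """Sink side: read the segment slot holding sequence number `seq`
        (or whatever newer data has since wrapped into that slot)."""
        i = seq % len(self.segments)
        with self.locks[i]:
            return self.segments[i]
```

Because each read returns a (sequence number, data) pair, a sink can detect when it has fallen behind the source and a segment it wanted has been overwritten.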
There are a number of benefits to this setup:
1. More than one sink process can use the same data on the same machine.
2. A program processing the data (a sink) does not need to worry about the details of obtaining
the data from the hardware. Access to data is consistent and easy.
3. A sink process may transparently reside on a different machine than the original source (as
described next).
Because of the demanding requirements of the input processing and the limited computing power
available on the robot, much of the processing takes place on other computers. The IServer
program includes a special sink process with the sole purpose of taking the data in the ring buffer
and sending it to another machine, where a corresponding source process receives this data and
[Figure B.2 diagram: on Illy (the robot), a Sound Card audio source fills an audio ring buffer read
by a Sound Source Location sink and an Audio Server sink; the Audio Server sends the data over
the network to a remote Audio Source on Hal (a workstation), which fills Hal’s own audio ring
buffer for a Speech Recognition sink and a further Audio Server sink.]
Figure B.2: Audio ring buffer on multiple machines. In this figure, a sink on one machine sends audio data to a source on another machine, which is then used to fill the ring buffer on the second machine. Other sink processes still access data in the same manner.
writes it to a ring buffer on its machine. A sink process on the second machine accesses this data
in exactly the same manner as if it were on the original machine.
Figure B.2 demonstrates this, again using audio processing as an example. In this setup, the
sound source location program is running on Illy (our robot), and accesses audio from the ring
buffer as before. A speech recognition program is running on another workstation (Hal), and needs
access to the same audio data. To get it, an audio server running on Illy takes data from the ring
buffer and sends it to the audio source process on Hal, which writes it to its ring buffer just like any
other source of audio data. The speech recognition program reads the data in the same manner as
before. The ring buffer on Hal may also have an audio server which sends the audio data to other
machines.
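The forwarding path can be sketched as a pair of functions, one for the forwarding sink and one for the remote source process. The framing here is simplified (the real segment header also carries endianness, a time stamp, and padding), and the function names are illustrative:

```python
import socket
import struct

# Simplified segment frame: little-endian byte count and sequence number.
HEADER = struct.Struct("<IQ")

def forward_segment(sock, seq, data):
    """Forwarding sink: frame one ring-buffer segment and send it."""
    sock.sendall(HEADER.pack(len(data), seq) + data)

def receive_segment(sock):
    """Remote source: read one framed segment; returns (seq, data)."""
    length, seq = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return seq, _recv_exact(sock, length)

def _recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf
```

On the receiving machine, `receive_segment` would hand each (sequence number, data) pair to a source process that writes it into the local ring buffer, exactly as a sound-card source would.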
For each source of data (audio or video), there is a ring buffer set up on the source machine and
on each machine which needs the data. Since the robot is on a wireless link with limited bandwidth,
it is possible to set up the system to only transmit data once over the wireless link, and retransmit
as necessary from the receiving machine to other machines on the network. For example, suppose
two machines, Hal and Chadwick, need access to audio data from Illy. The best setup would be
to have Hal receive the data from Illy, and then resend it to Chadwick. This does cause a greater
time lag for programs needing the data, but is arguably better than saturating the wireless link.
Other features of the data acquisition system:
• The ring buffer is divided into segments, the total number and size of which depends on the
data type. Each segment is protected by a locking semaphore, so that a source process will
not write to any block that’s being read from, and a sink process will not read from a block
that’s being written to.
• Each segment of data includes a generic header specifying the endianness, byte count, a time
stamp, and a sequence number. There is also some padding added to the data structure to
take care of alignment issues on some architectures (e.g., SGI Octane).
• Data is sent in little-endian format (i.e., the format used on Intel PCs). Network byte order, by convention, is big-endian, so we are going against convention; but since a majority of
our processing currently takes place on little-endian machines, this convention saves a lot of
conversions. Note that we have in the past used a big-endian SGI Octane for video processing,
and hence, on this machine we need to convert the byte order for data obtained from other
machines. Conversion is taken care of by the source process on the local machine.
• Access to successive segments can be specified as sequential or most recent. Specifying se-
quential indicates that subsequent requests for data from the buffer should retrieve the next
segment in sequence—i.e., segments are read in order. This is the default for audio sink pro-
cesses. Specifying most recent returns the most recent buffer. This is the default for video
sink processes.
• In addition to allowing multiple sinks, there can be more than one audio or video source.
For example, during an open house demonstration with a lot of background noise, a boomset
microphone could be attached to a workstation, and a separate ring buffer set up to receive
this audio and send it to the speech recognition program. The sound source localization could
still use the original audio from the robot.
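As a concrete illustration of the segment locking and access modes described above, the following is a minimal Python sketch. It is a toy model, not the actual IServer implementation; the class and method names are invented, and real segments would hold fixed-size audio or video blocks rather than arbitrary Python objects.

```python
import threading

class RingBuffer:
    """Toy model of a segment-locked ring buffer with 'sequential' and
    'most_recent' read policies. Names are illustrative, not the IServer API."""

    def __init__(self, num_segments):
        self.segments = [None] * num_segments
        self.locks = [threading.Lock() for _ in range(num_segments)]
        self.write_seq = 0  # next sequence number to write

    def write(self, data):
        idx = self.write_seq % len(self.segments)
        with self.locks[idx]:  # source never overwrites a segment being read
            self.segments[idx] = (self.write_seq, data)
        self.write_seq += 1

    def read(self, reader_seq, mode="sequential"):
        """Return (segment_seq, data, next_reader_seq), or None if empty."""
        if self.write_seq == 0:
            return None
        if mode == "most_recent":
            seq = self.write_seq - 1  # video default: newest segment
        else:
            # audio default: next in order, bumped up to the oldest survivor
            seq = max(reader_seq, self.write_seq - len(self.segments))
        idx = seq % len(self.segments)
        with self.locks[idx]:  # sink never reads a segment being written
            stored_seq, data = self.segments[idx]
        return stored_seq, data, seq + 1
```

A sequential reader that falls behind is bumped forward to the oldest segment still in the buffer, while a most-recent reader always jumps to the newest one.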
Function. The IServer program must be running on all machines (e.g., the robot Illy, and
workstations Hal and Sal). A sink program or process accesses ring-buffer data through a C++
library interface. To access data, the sink process must do the following:
1. Get the key for the particular data it wants (via the dbGet function). Here, the process can also specify some of the desired parameters of the data stream (e.g., number of channels, sampling rate, etc.).
2. Create an instance of the appropriate sink class (IVideoSink, IAudioSink).
3. Get metadata information via an appropriate function call (myGetVideoData, myGetAudio-
Data).
4. Get a pointer to a data segment using myGetSeg, use the data, and then release the segment
using myReleaseSeg. Lather, rinse, repeat (but just this step).
In response to the sink request, a source process will check the key request, and if the source is
not running, attempt to start it. If the request is for a data source on a different machine, the
source process will send a request to that machine to start a corresponding server process. This
server process is another sink process, and will again make a request for the same data on the local
machine, and then begin sending the data to the remote machine.
As mentioned above, the IServer program is written in C++ and is designed to be modular.
This modularity also allows a process to act as a data filter. We currently have two such filters
in our system: one uses code by M. Kleffner to extract speech features from audio, and another
uses code by R. S. Lin to segment and extract object features from visual inputs. In both cases, the
filter code includes a sink (an IAudioSink or IVideoSink, respectively). The data coming into the
sink is processed, relevant features are extracted, and the features are then treated as the output
of an IDataSource, which can then be used by any other program.
B.2.2.2 Additional system information
In addition to data acquisition and distribution, there are a couple of other essential components
of the communications framework.
The central memory is a short-term memory containing information about the state of the
world. The actual data stored here changes depending on the demonstration being run, but it may
include such things as the direction of interesting sounds or the current goals of the robot.
The control server is the main connection to the robot hardware and handles all robot control.
It is currently implemented as a finite state machine and is detailed in Section 4.4.1.
B.2.3 Discussion and future work
As mentioned, the current framework is complete, although there are always some areas that could
use some additional work.
The main issue right now is with performance. While the system works well enough to suit
our needs, there can be significant delays between the acquisition of data and when it is processed.
Some of this delay is inevitable, but some tweaking should allow the overall delays to be reduced.
The system, as it works now, is quite robust, but there are rare circumstances in which it
breaks. While rudimentary monitoring programs exist, we hope to add a central monitor to the
framework, to better understand when the system is not functioning properly.
B.3 Speech Feature Extraction
The discussion below is summarized from M. Kleffner’s master’s thesis [48].
B.3.1 Introduction
M. Kleffner developed code for speech feature extraction and synthesis, in order to establish a
means of vocal expression for the robots. Our main use is for the feature extractor, which we use
for recognition. Below we briefly discuss the background and design of the system as used on our
robot. For full details, please see [48].
B.3.2 Background
The system described here was developed for the purpose of extracting speech features from audio
suitable for speech synthesis. Conceptually, the easiest way to synthesize speech is to simulate the
human vocal tract, and the simplest model for doing so is a linear source-filter model. This model
assumes a spectrally uniform source (representing the vocal folds) processed by a filter representing
the vocal tract. The filter contains resonances at vocal tract frequencies, as well as the spectral
slope of the waveform. Thus, for a parameterized model of speech synthesis, we require a vocal
tract filter, the fundamental pitch, a voiced, unvoiced, or mixed source, and a voicing confidence
score. For the purposes of recognition, we will only use the filter coefficients and voicing confidence
from this parameterization, and in addition will calculate the log-energy of the original source.
B.3.2.1 Linear prediction
One of the simplest and most efficient ways to estimate the spectral shape of the vocal tract is
to use linear prediction (LP). For a complete introduction to linear prediction, see Chapter 8 of
Rabiner and Schafer [117]. The linear prediction problem requires us to find p coefficients such that
the current sample can be accurately predicted from the previous p samples using
s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + e(n),    (B.1)
where s(n) is the speech signal, ak are the coefficients being estimated, and e(n) is the prediction
error in the estimation. To find the best set of coefficients, the squared prediction error
E_n = \sum_{m} \left( s_n(m) - \sum_{k=1}^{p} a_k\, s_n(m-k) \right)^{2}    (B.2)
is minimized by taking the derivative of En with respect to ak and finding the least-squares solution.
See [48, 117] for details.
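To make the least-squares solution concrete, here is a small Python sketch that estimates the coefficients a_k by solving the autocorrelation normal equations with the Levinson-Durbin recursion. It is an illustration of the cited method, not the robot's code; the AR(2) test signal and its parameters are invented for the demonstration.

```python
import random

def autocorrelation(s, p):
    """Autocorrelations r[k] = sum_n s[n] s[n-k], for lags k = 0..p."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s))) for k in range(p + 1)]

def levinson_durbin(r):
    """Solve the LP normal equations for a_1..a_p in
    s(n) ~= sum_k a_k s(n-k), given autocorrelations r[0..p]."""
    a, err = [], r[0]
    for i in range(1, len(r)):
        # reflection coefficient for order i
        k = (r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))) / err
        # order-update of the predictor coefficients
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        err *= 1.0 - k * k
    return a

# Sanity check on a synthetic AR(2) signal (parameters invented for the demo):
# s(n) = 0.5 s(n-1) - 0.3 s(n-2) + white noise.
random.seed(1)
s = [0.0, 0.0]
for _ in range(5000):
    s.append(0.5 * s[-1] - 0.3 * s[-2] + random.gauss(0.0, 1.0))
a_hat = levinson_durbin(autocorrelation(s, 2))
```

On this synthetic signal the estimated coefficients land close to the true AR parameters, which is exactly the sense in which (B.2) is minimized.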
B.3.2.2 Warped linear prediction
Although LP is optimal in the least-squares sense, it calculates error based on a uniform spectrum.
However, humans have better frequency resolution at lower frequencies. Warped linear prediction
(WLP) warps the input signal spectrum in a way that is more faithful to the frequency resolution of the human ear. Because of this property, around half as many coefficients are required for perceptual performance equivalent to standard LP. Because of these nice features, the
implementation in our robot uses WLP.
Also note that, normally, LPCs are not used for recognition because they tend to have poorer
qualities than warped scale representations, such as mel-frequency cepstral coefficients (MFCCs).
However, because warped LPCs are calculated using the Bark-scale model of the human ear, they may be more suitable than LPCs for speech recognition.
See [118–121] for more information about warped LPCs, and [48] for details on our implemen-
tation.
B.3.2.3 Log area ratios
Although linear predictive coefficients (LPCs) and warped linear predictive coefficients (warped LPCs) are an optimal representation of a one-dimensional vocal tract, linear combinations or quantized versions of LPCs (such as those learned by a classifier) generally correspond to unstable or meaningless filters.
To alleviate this problem, we can convert the LPCs into a form that can be linearly combined and
still represent a meaningful filter. One way to do this is to convert LPCs to the corresponding
reflection coefficients (RCs) of a one-dimensional vocal tract tube model. The reflection coefficients
ki are generally obtained during the calculation of the LPCs when that calculation is done using the
Levinson-Durbin algorithm [117], but can also be calculated by iterating i in the following recursion
from p down to 1:
k_i = a_i^{(i)},    (B.3)

a_j^{(i-1)} = \frac{a_j^{(i)} + a_i^{(i)}\, a_{i-j}^{(i)}}{1 - k_i^2}, \qquad 1 \le j \le i-1,    (B.4)

with the initial condition

a_j^{(p)} = a_j, \qquad 1 \le j \le p.    (B.5)
Reflection coefficients can guarantee a stable filter, but the spectrum is sensitive to RCs with large magnitudes. However, it has been shown [122] that log-area ratios (LARs) have near-uniform spectral sensitivity, allowing them to be easily combined and quantized. LARs are defined by

g_i = \log\left(\frac{A_{i+1}}{A_i}\right) = \log\left(\frac{1 - k_i}{1 + k_i}\right), \qquad 1 \le i \le p,    (B.6)
where Ai is the area of a segment of the one-dimensional tube model of the vocal tract.
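The step-down recursion (B.3)-(B.5) and the LAR map (B.6) are easy to sketch in Python. This is an illustrative implementation, not Kleffner's code; rc_to_lpc is a helper (the forward Levinson order-update) included only so the round trip can be checked.

```python
import math

def rc_to_lpc(ks):
    """Build order-p predictor coefficients from reflection coefficients
    k_1..k_p via the forward Levinson order-update (test helper)."""
    a = []
    for i, k in enumerate(ks, start=1):
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
    return a

def lpc_to_rc(a):
    """Step-down recursion (B.3)-(B.5): recover k_1..k_p from a_1..a_p.
    Assumes a stable filter, i.e., |k_i| < 1 at every step."""
    a = list(a)
    ks = []
    for i in range(len(a), 0, -1):
        k = a[-1]                       # (B.3): k_i = a_i^(i)
        ks.append(k)
        # (B.4): a_j^(i-1) = (a_j^(i) + k_i a_{i-j}^(i)) / (1 - k_i^2)
        a = [(a[j] + k * a[i - 2 - j]) / (1.0 - k * k) for j in range(i - 1)]
    return ks[::-1]

def rc_to_lar(ks):
    """(B.6): g_i = log((1 - k_i) / (1 + k_i))."""
    return [math.log((1.0 - k) / (1.0 + k)) for k in ks]
```

Converting LPCs down to reflection coefficients and then to LARs, and recovering the original reflection coefficients, confirms the recursion is the exact inverse of the Levinson order-update.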
B.3.2.4 Voicing confidence
When reproducing speech for synthesis, it is necessary to know whether the speech is voiced or
unvoiced. Because this is a necessary feature when producing speech, it is also a good discriminating
feature for the speech and is part of the feature vector we use for recognition. Rather than make
a hard decision about voicing for any particular segment, a voicing confidence score (VCS) is
produced, which indicates the mix of pulse train and white noise necessary to reproduce the speech.
In our robot, the voicing confidence is calculated in the first half of the current segment using
c_t^{(1)} = \frac{\sum_{n=0}^{0.5N-1} s_n s_{n+t}}{\sqrt{\sum_{n=0}^{0.5N-1} s_n s_n \sum_{n=0}^{0.5N-1} s_{n+t} s_{n+t}}},    (B.7)
with t an integer in the range [pitchperiod− searchmin, pitchperiod+ searchmax]. For the second
half of the segment, the VCS is calculated similarly, using
c_t^{(2)} = \frac{\sum_{n=0.5N}^{N-1} s_n s_{n-t}}{\sqrt{\sum_{n=0.5N}^{N-1} s_n s_n \sum_{n=0.5N}^{N-1} s_{n-t} s_{n-t}}}.    (B.8)
The total VCS is given by
c_{\mathrm{total}} = \mathrm{clip}\!\left( \max\!\left[ \max_t \left( c_t^{(1)} \right),\ \max_t \left( c_t^{(2)} \right) \right] \right).    (B.9)
This complicated scheme is used so that we can calculate the voicing confidence without depending on sample values outside of the current segment (i.e., neither $c_t^{(1)}$ nor $c_t^{(2)}$ depends on values of $s_n$ outside the current segment). A final VCS score is calculated to produce zeros for strongly
unvoiced segments:
c_f = \frac{\mathrm{clip}\left( c_{\mathrm{total}} - V_{\mathrm{thresh}} \right)}{1 - V_{\mathrm{thresh}}},    (B.10)
where Vthresh is set at 0.25 for our experiments.
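A direct Python transcription of (B.7)-(B.10) follows. It is a sketch rather than the robot's implementation: the thesis does not state the range of clip(), so clipping to [0, 1] is an assumption here, as are the argument names.

```python
import math

def voicing_confidence(s, pitch_period, search_min, search_max, v_thresh=0.25):
    """Voicing confidence per (B.7)-(B.10); clip() to [0, 1] is assumed."""
    N = len(s)
    h = N // 2
    clip = lambda x: max(0.0, min(1.0, x))

    def ncc(start, stop, sign, t):
        # Normalized cross-correlation of s[n] against s[n + sign*t],
        # summed over n in [start, stop): (B.7) with sign = +1,
        # (B.8) with sign = -1.
        num = sum(s[n] * s[n + sign * t] for n in range(start, stop))
        d1 = sum(s[n] * s[n] for n in range(start, stop))
        d2 = sum(s[n + sign * t] * s[n + sign * t] for n in range(start, stop))
        return num / math.sqrt(d1 * d2) if d1 > 0.0 and d2 > 0.0 else 0.0

    lags = range(pitch_period - search_min, pitch_period + search_max + 1)
    c1 = max(ncc(0, h, +1, t) for t in lags)   # first half, (B.7)
    c2 = max(ncc(h, N, -1, t) for t in lags)   # second half, (B.8)
    c_total = clip(max(c1, c2))                # (B.9)
    return clip(c_total - v_thresh) / (1.0 - v_thresh)   # (B.10)
```

As expected, a strongly periodic segment whose period falls inside the search range scores near 1.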
B.3.2.5 Log energy
Log-energy for each segment is calculated using
e_n = \log \sqrt{ \sum_{n=0}^{N-1} s_n s_n }.    (B.11)
B.3.3 Design and implementation
As mentioned above, the audio feature extraction was originally meant to extract features useful
for speech synthesis. Typically in speech coders, standard frame length and spacing for audio
[Figure B.3 block diagram: the input s(t) passes through a warped linear predictor, producing WLPCs and energy; the WLPCs are converted to LARs, and a voicing confidence estimator produces the VCS; the LARs, energy, and VCS are combined into the feature vector.]
Figure B.3: Block diagram describing audio feature generation.
waveform analysis are 30 ms and 10 ms, respectively, and these are the segment sizes in our original
implementation. These choices give a rate of 100 features per second. For our purposes, however,
we did not require this resolution, so we slowed down the feature rate to 50 features per second, by
using a frame length of 60 ms and spacing of 20 ms. Since the microphones attached to our robot
are extremely noisy, we chose to use audio from a close-talk microphone attached to a remote PC.
This audio was sampled at 16 kHz, corresponding to a frame length of 960 samples and a frame
spacing of 320 samples.
The block diagram for the feature extractor is given in Figure B.3. As can be seen in the
diagram, we extract warped linear predictive coefficients from the input signal, which are then
converted to log-area ratios. Kleffner suggests using 8-12 coefficients for 16-kHz audio. However,
for recognition purposes, a much coarser representation will suffice, and we only calculate three
LARs. Interestingly, we can still synthesize intelligible speech using this coarse representation. To
round out the feature vector, we also calculate log energy and voicing, described above.
B.4 Visual Object Segmentation and Feature Extraction
This section describes the image segmentation and feature extraction algorithm developed and
implemented on our robot by Lin [123].
B.4.1 Problem description
Given images collected from the camera mounted on the robot, we want to distinguish objects
in these images from the background. In our robot experiments, we control the environment
by bounding it with white painted walls; the floor also has a white marble texture. However,
even in this restricted setting, there is still uncontrollable noise present in the environment and
in the robot’s sensors, making the segmentation process non-trivial. In our work, we address the
image segmentation problem using Markov random fields and apply a coarse-to-fine loopy belief
propagation to obtain an approximate solution. Our experiments demonstrate good segmentation
results.
B.4.2 Pairwise Markov random fields
Markov random fields have been widely applied to early vision problems, including optical flow,
stereo vision, and image restoration [124–126]. Here we model the formation of an image using a
square lattice pairwise Markov random field. In this setting, each pixel in the image is connected
to a node in the lattice. In addition, for adjacent pixels in the images, their corresponding nodes
are also connected in the lattice. The values of the nodes are discrete and finite. In our work, a
node can have two possible values: foreground or background, denoting whether the image pixel connected to it is a foreground or a background pixel.
Let $y_{ij}$ be an image pixel and $x_{ij}$ the node it connects to in the lattice, and let $Y = \{y_{ij}\}$ be the whole image and $X = \{x_{ij}\}$ the whole lattice. The joint probability $P(X,Y)$ can be described by
P(X,Y) = \frac{1}{Z} \prod_{(ij,kl)} \psi(x_{ij}, x_{kl}) \prod_{ij} \phi(x_{ij}, y_{ij}),    (B.12)
where ψ(xij , xkl) and φ(xij , yij) are predetermined potential functions, and Z is a scale factor.
Under this model, image segmentation becomes an inference problem. Given image Y , the optimal
segmentation X∗ is defined as
X^* = \arg\max_{X} P(X,Y).    (B.13)
There exists a potential problem in this formulation. Since the number of possible values of $X$ grows exponentially with the size of $Y$, computation of $X^*$ becomes intractable when the size of $Y$ is large. Therefore, an approximation method has to be adopted. In our work, instead
of computing P (X,Y ), we measure the marginal probability P (xij|Y ) and determine the best label
of xij according to
x_{ij}^* = \arg\max_{x_{ij}} P(x_{ij} \mid Y).    (B.14)
B.4.3 Local message passing algorithm
$P(x_{ij} \mid Y)$ can be approximated by an iterative, local message-passing algorithm called belief propagation [127]. At iteration $n$, $m^n_{(ij,kl)}(x_{kl})$ is the message passed from $x_{ij}$ to $x_{kl}$, defined as

m^n_{(ij,kl)}(x_{kl}) = \alpha \sum_{x_{ij}} \psi(x_{ij}, x_{kl})\, \phi(x_{ij}, y_{ij}) \prod_{(g,h) \in \Gamma(i,j) \setminus (k,l)} m^n_{(gh,ij)}(x_{ij}),    (B.15)
where $\alpha$ is a scaling constant, and the set $\Gamma(i,j)$ contains all neighbors of $x_{ij}$. With the messages known, the marginal distribution $P(x_{ij} \mid Y)$ at iteration $n$ is defined as

P^n(x_{ij} \mid Y) = \gamma\, \phi(x_{ij}, y_{ij}) \prod_{(g,h) \in \Gamma(i,j)} m^n_{(gh,ij)}(x_{ij}),    (B.16)
where γ is a scale factor.
If the Markov random field has a tree structure, $P^n(x_{ij} \mid Y)$ will converge to $P(x_{ij} \mid Y)$ after a message from each node has propagated to all other nodes. If the Markov random field is not a tree, the belief propagation algorithm is not guaranteed to converge. However, even under this circumstance, empirical results show that the belief propagation algorithm can still achieve excellent performance in many applications.
In our implementation, we use the max-product algorithm [128] to approximate equation (B.15)
by
m^n_{(ij,kl)}(x_{kl}) = \beta \max_{x_{ij}} \psi(x_{ij}, x_{kl})\, \phi(x_{ij}, y_{ij}) \prod_{(g,h) \in \Gamma(i,j) \setminus (k,l)} m^n_{(gh,ij)}(x_{ij}),    (B.17)
where $\beta$ is a scale factor. This approximation not only reduces the computation needed, but also enables us to compute messages and marginal distributions in log space.
B.4.4 Image segmentation
There are two types of image features used in our segmentation algorithm: color and pixel intensity
gradient. The two features are used by the two potential functions, φ(xij , yij) and ψ(xij , xkl),
respectively. The definition of our two potential functions will be explained below. With the
potential functions set, we can run belief propagation through iterations to compute the marginal
distribution $P^n(x_{ij} \mid Y)$. In our implementation, we compute the messages in a coarse-to-fine manner. This approach enables us to reduce the heavy computation without seriously deteriorating our approximation of $P^n(x_{ij} \mid Y)$.
B.4.4.1 Potential functions
Knowing that the background is mostly white, we built a white pixel model based on a set of
images containing only the background of the environment. In order to remove intensity information from our color feature, we use the following color invariants, proposed in [129]:
f_{rgb} = \left( \frac{R}{\max(G,B)},\ \frac{G}{\max(R,B)},\ \frac{B}{\max(R,G)} \right).    (B.18)
We model the distribution of white color features as a Gaussian function. Since there is noise in
both the environment and the cameras, we use robust regression [130] to remove outlier pixels in
the training images. Let µ and C be the mean and covariance matrix of our white pixel model. We
then define φ(xij , yij) as
\phi(x_{ij}, y_{ij}) = \begin{cases} \mathcal{N}(f_{rgb}(y_{ij}); \mu, C) & \text{if } x_{ij} = \textit{background} \\ \kappa & \text{if } x_{ij} = \textit{foreground} \end{cases},    (B.19)
where κ is a constant. That is, if the whiteness likelihood of pixel yij exceeds κ, it is more likely to
be a background pixel. Otherwise, it is likely to be a foreground pixel.
The other potential function ψ(xij , xkl) describes the relationship between latent variables xij
and xkl. In our segmentation experiment, if xij = xkl, we expect the image intensities of the two
pixels yij and ykl to be similar. Otherwise, we expect a sharp intensity difference between yij and
ykl. In addition, we include a bias that favors the same label on adjacent variables. By combining
all of these constraints, we define ψ(xij , xkl) as
\psi(x_{ij}, x_{kl}) = \begin{cases} \exp\left( \mathrm{dif}(y_{ij}, y_{kl}) - K \right) & \text{if } x_{ij} \neq x_{kl} \\ \exp\left( -\mathrm{dif}(y_{ij}, y_{kl}) \right) & \text{if } x_{ij} = x_{kl} \end{cases},    (B.20)
where dif(yij, ykl) is the intensity difference between yij and ykl in absolute value.
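To make the pieces concrete, the following Python sketch runs log-domain max-product belief propagation, per (B.15)-(B.17), with a unary potential in the spirit of (B.19) and the pairwise potential (B.20). It is a toy: a one-dimensional grayscale "whiteness" model stands in for the color model, the constants kappa, K, and sigma are invented, and there is no coarse-to-fine schedule.

```python
import math

BG, FG = 0, 1

def segment(img, iters=10, kappa=0.4, K=0.5, sigma=0.15):
    H, W = len(img), len(img[0])

    def log_phi(y, x):
        # Unary potential in the spirit of (B.19): Gaussian whiteness
        # likelihood (mean 1.0) for background, constant kappa for foreground.
        if x == BG:
            return (-0.5 * ((y - 1.0) / sigma) ** 2
                    - math.log(sigma * math.sqrt(2.0 * math.pi)))
        return math.log(kappa)

    def log_psi(yi, yk, xi, xk):
        # Pairwise potential (B.20), in log space.
        d = abs(yi - yk)
        return (d - K) if xi != xk else -d

    def neighbors(i, j):
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if 0 <= i + di < H and 0 <= j + dj < W:
                yield (i + di, j + dj)

    # One message per directed edge, per label, initialized to 0 (log 1).
    m = {((i, j), n): [0.0, 0.0]
         for i in range(H) for j in range(W) for n in neighbors(i, j)}

    for _ in range(iters):  # synchronous max-product updates, per (B.17)
        new = {}
        for (src, dst) in m:
            i, j = src
            out = []
            for xk in (BG, FG):
                out.append(max(
                    log_phi(img[i][j], xi)
                    + log_psi(img[i][j], img[dst[0]][dst[1]], xi, xk)
                    + sum(m[(n, src)][xi] for n in neighbors(i, j) if n != dst)
                    for xi in (BG, FG)))
            z = max(out)  # normalize to keep messages bounded
            new[(src, dst)] = [v - z for v in out]
        m = new

    # Beliefs, per (B.16): unary plus all incoming messages; take the argmax.
    labels = [[0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            belief = [log_phi(img[i][j], x)
                      + sum(m[(n, (i, j))][x] for n in neighbors(i, j))
                      for x in (BG, FG)]
            labels[i][j] = belief.index(max(belief))
    return labels
```

On a small white image with a dark blob, the beliefs correctly label the blob as foreground and everything else as background.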
B.4.4.2 Coarse-to-fine iteration
The resolution of our images is 320 × 240, so it takes a certain number of iterations to propagate messages from one end of the image to the other. In order to speed up message propagation, we execute belief propagation in a coarse-to-fine manner, as suggested by Felzenszwalb and Huttenlocher [125]. Starting at a coarse level, we divide the image into a number of blocks and run
belief propagation based on these blocks. After a small number of iterations, we decompose each
block into a number of smaller blocks and copy the messages of the original block to these blocks.
We then run belief propagation based on this new set of blocks. The process continues until each
block contains exactly one image pixel. A detailed description of this coarse-to-fine algorithm is
explained in [125].
B.4.4.3 Feature extraction
Using the segmented image described above, we extract some features useful for object recognition.
In particular, we extract a normalized color histogram using the same color invariant pixels defined
in Equation (B.18). We also calculate the first moment and the height to width ratio. All of this
information is passed to the associative memory and made available to any other module which
needs it.
APPENDIX C
HIDDEN MARKOV MODEL
ALGORITHMS
C.1 Introduction
As described in Chapter 2, an HMM is a discrete-time stochastic process with two components, $\{X_n, Y_n\}$, where (i) $\{X_n\}$ is a finite-state Markov chain, and (ii) given $\{X_n\}$, $\{Y_n\}$ is a sequence of conditionally independent random variables, the conditional distribution of $Y_k$ depending on $\{X_n\}$ only through $X_k$. The name HMM arises from the assumption that $\{X_n\}$ is not observable, and so its statistics can only be ascertained from $\{Y_n\}$.
Generally, there are three problems of interest when talking about these models:
1. Given an observation sequence 〈y1, . . . , yn〉, find the likelihood pn(y1, . . . , yn;ϕ) of this se-
quence, given the model.
2. Given an observation sequence 〈y1, . . . , yn〉, find a “good” corresponding state sequence 〈x1, . . . , xn〉.
3. Adjust the model parameters ϕ to maximize the likelihood pn(y1, . . . , yn;ϕ).
Chapter 2 described a recursive solution to these problems. In this appendix, we summarize more
traditional batch techniques based on Baum-Welch reestimation and Viterbi decoding. The reader
is referred to [67] and [81] for more details on these algorithms.
The algorithms below use the model description notation described in Section 2.2.
C.2 Baum-Welch Algorithm
Baum-Welch reestimation, also known as the forward-backward algorithm, is an expectation-
maximization (EM) method used for reestimating HMM parameters. In addition to adjusting
model parameters, the procedure can also be used to determine the likelihood of a given observa-
tion sequence, as well as give a maximally likely state sequence corresponding to that observation
sequence. This is the most common method for learning HMM parameters.
Consider an observation sequence 〈y1, . . . , yn〉. The most direct way to calculate the likelihood of this sequence for a given HMM ϕ is to sum, over all possible state sequences, the probability of that sequence times the likelihood of the observations given that sequence; that is,
p_n(y_1, \ldots, y_n; \varphi) = \sum_{\langle x_1, \ldots, x_n \rangle \in R^n} p(y_1, \ldots, y_n, x_1, \ldots, x_n; \varphi)
= \sum_{\langle x_1, \ldots, x_n \rangle \in R^n} p(y_1, \ldots, y_n \mid x_1, \ldots, x_n; \varphi)\, P(x_1, \ldots, x_n; \varphi).    (C.1)
This calculation is computationally intractable, but a procedure known as the forward-backward
algorithm can calculate the probability efficiently. Define
αn(ϕ) = [αn1(ϕ), . . . , αnr(ϕ)]′, (C.2)
where
αni(ϕ) = p(y1, . . . , yn, Xn = i;ϕ) (C.3)
as the likelihood of the partial observation sequence 〈y1, . . . , yn〉 and state i at time n, given the
model. The vector αn(ϕ) defines a set of so-called forward probabilities. Setting α1(ϕ) = B(y1;ϕ)π,
we can solve for αn inductively as
αn+1(ϕ) = B(yn+1;ϕ)A(ϕ)′αn(ϕ). (C.4)
We can similarly define a set of backward probabilities βn(ϕ) as
βn(ϕ) = [βn1(ϕ), . . . , βnr(ϕ)]′, (C.5)
where
βni(ϕ) = P (yn+1, yn+2, . . . , yN |xn = i,ϕ). (C.6)
Setting βN (ϕ) = 1r, we can calculate βn(ϕ) using the backward recursion
βn(ϕ) = A(ϕ)B(yn+1;ϕ)βn+1(ϕ). (C.7)
Together, these functions can compute the likelihood P of the sequence at any time 1 ≤ ℓ ≤ n − 1 according to

P = p_n(y_1, \ldots, y_n; \varphi) = \alpha_\ell(\varphi)' A(\varphi) B(y_{\ell+1}; \varphi)\, \beta_{\ell+1}(\varphi).    (C.8)
If we set ℓ = n − 1, this equation becomes

P = \alpha_n(\varphi)' \mathbf{1}_r.    (C.9)
A similar formula exists using the backward probabilities.
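The forward recursion can be sketched directly in Python for discrete observations, and checked against the (in general intractable) direct sum (C.1) on a short sequence. The model parameters below are invented for the check.

```python
def forward_likelihood(pi, A, B, obs):
    """Forward recursion (C.4): alpha_{n+1} = B(y_{n+1}) A' alpha_n,
    with alpha_1 = B(y_1) pi; the likelihood is sum_i alpha_{ni} (C.9).
    B[j][y] is a discrete observation distribution b_j(y)."""
    r = len(pi)
    alpha = [B[i][obs[0]] * pi[i] for i in range(r)]
    for y in obs[1:]:
        alpha = [B[j][y] * sum(A[i][j] * alpha[i] for i in range(r))
                 for j in range(r)]
    return sum(alpha)

def brute_force_likelihood(pi, A, B, obs):
    """Direct sum over all r^n state sequences, per (C.1). Two states only,
    so paths are enumerated as bit patterns."""
    n = len(obs)
    total = 0.0
    for code in range(2 ** n):
        seq = [(code >> t) & 1 for t in range(n)]
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for t in range(1, n):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p
    return total
```

The recursion costs O(n r^2) instead of O(n r^n), yet the two agree to machine precision.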
The state sequence can be determined by looking at the most likely state at each time step.
This has the disadvantage that we may find a state sequence that is invalid for a given model; i.e.,
one in which xn = i and xn+1 = j, but for which the model has aij = 0. The Viterbi algorithm,
described in the next section, avoids this pitfall.
We can use the above calculations to reestimate model parameters. Let γij be the expected
number of transitions from state i to state j, conditioned on the observation sequence. This value
can be calculated with
\gamma_{ij} = \frac{1}{P} \sum_{\ell=1}^{n-1} \alpha_{\ell i}(\varphi)\, a_{ij}(\varphi)\, b_j(y_{\ell+1}; \varphi)\, \beta_{\ell+1,j}(\varphi).
Then the total expected number of transitions out of state i is given by
\gamma_i = \sum_{j=1}^{r} \gamma_{ij} = \frac{1}{P} \sum_{\ell=1}^{n-1} \alpha_{\ell i}(\varphi)\, \beta_{\ell i}(\varphi).    (C.10)
The ratio of these can be used to calculate an updated value for aij(ϕ), using
a_{ij}(\varphi) = \frac{\gamma_{ij}}{\gamma_i} = \frac{\sum_{\ell=1}^{n-1} \alpha_{\ell i}(\varphi)\, a_{ij}(\varphi)\, b_j(y_{\ell+1}; \varphi)\, \beta_{\ell+1,j}(\varphi)}{\sum_{\ell=1}^{n-1} \alpha_{\ell i}(\varphi)\, \beta_{\ell i}(\varphi)}.    (C.11)
Similar methods can be used to find update equations for bj(·;ϕ) = bjk(ϕ) for the case of observations from a finite alphabet,

b_{jk}(\varphi) = \frac{\sum_{\ell \mid y_\ell = k} \alpha_{\ell j}(\varphi)\, \beta_{\ell j}(\varphi)}{\sum_{\ell=1}^{n} \alpha_{\ell j}(\varphi)\, \beta_{\ell j}(\varphi)},    (C.12)

and for πi,

\pi_i = \frac{1}{P}\, \alpha_{1i}(\varphi)\, \beta_{1i}(\varphi).    (C.13)
While the above formulas were determined intuitively, it is possible to derive the same formulas rigorously, using either Lagrange methods or other optimization techniques. Similar formulas
are also available to estimate the parameters of a continuous observation density. See [67] or [81]
for more details.
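For discrete observations, one full reestimation step per (C.10)-(C.13) can be sketched as follows. This is an unscaled textbook implementation, suitable only for short sequences (real implementations scale the forward and backward variables), and the parameters in the check are invented; the EM property that the likelihood cannot decrease provides the test.

```python
def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch update per (C.11)-(C.13), discrete observations.
    Returns the updated parameters and the likelihood of obs under the
    *old* parameters."""
    r, n = len(pi), len(obs)
    # forward pass, per (C.4)
    alpha = [[0.0] * r for _ in range(n)]
    for i in range(r):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, n):
        for j in range(r):
            alpha[t][j] = B[j][obs[t]] * sum(A[i][j] * alpha[t - 1][i]
                                             for i in range(r))
    # backward pass, per (C.7)
    beta = [[0.0] * r for _ in range(n)]
    beta[n - 1] = [1.0] * r
    for t in range(n - 2, -1, -1):
        for i in range(r):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(r))
    P = sum(alpha[n - 1])
    # transition update (C.11)
    new_A = []
    for i in range(r):
        denom = sum(alpha[t][i] * beta[t][i] for t in range(n - 1))
        new_A.append([sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                          * beta[t + 1][j] for t in range(n - 1)) / denom
                      for j in range(r)])
    # observation update (C.12)
    new_B = []
    for j in range(r):
        denom = sum(alpha[t][j] * beta[t][j] for t in range(n))
        new_B.append([sum(alpha[t][j] * beta[t][j]
                          for t in range(n) if obs[t] == k) / denom
                      for k in range(len(B[0]))])
    # initial-state update (C.13)
    new_pi = [alpha[0][i] * beta[0][i] / P for i in range(r)]
    return new_pi, new_A, new_B, P
```

Iterating the step yields a monotonically non-decreasing likelihood, as the EM derivation guarantees.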
C.3 Viterbi-Based Algorithms
When determining the probability of an observation sequence, the forward-backward algorithm above took into account all possible state sequences and calculated P = pn(y1, . . . , yn;ϕ). We can also define P as the maximum joint probability of the observation sequence and the most likely state sequence for a given model, that is, P = maxX∈Rn p(y1, . . . , yn, x1, . . . , xn;ϕ). It is possible to calculate both
this probability and the most likely state sequence simultaneously through a dynamic programming
technique called the Viterbi algorithm.
The algorithm is defined as follows: let φ1i = πibi(y1;ϕ) for i = 1, . . . , r. Then as in the forward
and backward procedures, we can compute φ recursively by
\phi_{nj} = \max_{1 \le i \le r} \left[ \phi_{n-1,i}\, a_{ij} \right] b_j(y_n; \varphi),    (C.14)

and keep track of the best previous state (to state j) via

\psi_{nj} = \arg\max_{1 \le i \le r} \left[ \phi_{n-1,i}\, a_{ij} \right].    (C.15)
At the end of our input, we can determine the probability of the most likely sequence from
P = \max_{1 \le i \le r} \phi_{ni}.    (C.16)
To determine the best sequence, we let the final state xn = arg max1≤i≤r φni, and trace back the most likely sequence that ended in that state, using

x_{k-1} = \psi_k(x_k), \qquad k = n, n-1, \ldots, 2.
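The recursion (C.14)-(C.16) with backtrace can be sketched in Python as follows; the model parameters in the check are invented, and the result is compared against an exhaustive search over all state sequences.

```python
def viterbi(pi, A, B, obs):
    """Viterbi recursion (C.14)-(C.15) with backtrace; returns the most
    likely state sequence and its joint probability (C.16).
    B[j][y] is a discrete observation distribution b_j(y)."""
    r, n = len(pi), len(obs)
    phi = [[pi[i] * B[i][obs[0]] for i in range(r)]]
    psi = [[0] * r]
    for t in range(1, n):
        row, back = [], []
        for j in range(r):
            best_i = max(range(r), key=lambda i: phi[t - 1][i] * A[i][j])
            back.append(best_i)
            row.append(phi[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
        phi.append(row)
        psi.append(back)
    last = max(range(r), key=lambda i: phi[n - 1][i])
    path = [last]
    for t in range(n - 1, 0, -1):  # backtrace: x_{k-1} = psi_k(x_k)
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, phi[n - 1][last]
```

Because the returned path is built from the stored back-pointers, it can never contain a transition with a_ij = 0, avoiding the pitfall noted for per-step maximization.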
To determine reestimation formulas based on this model, we simply determine the best sequence
as above, and then use counting to reestimate the model parameters.
For each state i, if we count the number of transitions from state i to state j and divide that by the total number of transitions from state i, we should get a better estimate of aij(ϕ).
Following this idea, the reestimation formula for each aij(ϕ) is

a_{ij}(\varphi) = \frac{\text{number of transitions from state } i \text{ to state } j}{\text{number of transitions from state } i}.
A similar procedure can be used to estimate new parameters for each bj(yn;ϕ). Below we will
give an example for the simple case when the observation distribution in each state is defined by a
one-dimensional Gaussian pdf. Let bj(yn;ϕ) be defined by
b_j(y_n; \varphi) = \mathcal{N}(y_n; \mu_j(\varphi), \sigma_j(\varphi)),

where $\mathcal{N}$ is a Gaussian density with mean $\mu_j(\varphi)$ and variance $\sigma_j^2(\varphi)$. Using the same observation
sequence 〈y1, . . . , yn〉, and the same estimated state sequence 〈x1, . . . , xn〉 as above, the parameters
µj(ϕ) and σj(ϕ) for the updated model can be estimated by
µj(ϕ) = average of all yi observed while in state xi = j
σj(ϕ) = standard deviation of all yi observed while in state xi = j
After reestimation, the parameters of the model are replaced with the new values above, and the
calculation is repeated again for the entire observation sequence using the updated model. At
each iteration, p(y1, . . . , yn;ϕ) is guaranteed not to decrease, and ϕ slowly converges to a model that describes the observation sequence 〈y1, . . . , yn〉.
Note that counting should be done over long and/or many sequences before actual parameter
estimation, as the potential exists, for example, to reestimate an unlikely but possible transition
probability as zero, if it does not appear in the sequence(s) used for reestimation. A small prior
can be added to each probability to prevent this occurrence.
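The count-based update with an additive prior can be sketched as follows; the prior value 1e-3 is an arbitrary illustrative choice, not one taken from the thesis.

```python
def reestimate_transitions(state_seq, r, prior=1e-3):
    """Count-based reestimate of a_ij from a decoded state sequence, with
    a small additive prior so transitions that never appear in the
    sequence keep nonzero probability."""
    counts = [[prior] * r for _ in range(r)]
    for i, j in zip(state_seq, state_seq[1:]):
        counts[i][j] += 1.0
    result = []
    for row in counts:  # normalize each row into a stochastic vector
        s = sum(row)
        result.append([c / s for c in row])
    return result
```

Even a transition absent from the decoded sequence (below, 2 → 1) retains a small nonzero probability.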
The actual procedure described here is similar to the procedure IBM uses to update parameters in its ViaVoice speech recognition system.
APPENDIX D
HIDDEN SEMI-MARKOV MODELS
AND THE RMLE ALGORITHM
D.1 Introduction
When applying HMMs to speech and other continuous data, a general assumption is that each state
in the model represents a stationary interval over a data segment. By default, with a standard
HMM, the probability of duration of a state is modeled as a geometric distribution, which does
not accurately model the temporal structure of speech. To address this problem, Ferguson [94] proposed the idea of a variable duration hidden Markov model, which explicitly models the duration
of a given state by a probability mass function, converting the underlying Markov chain to a semi-
Markov chain. Russell and Moore [95] and Levinson [96] extended this work by modeling the state
duration with Poisson and gamma distributions, respectively. Later in the literature, these models
became known as hidden semi-Markov models (HSMMs).
In our modeling, we have come across circumstances where the explicit duration modeling in
the HSMM would seem to have some benefit. However, as with traditional HMMs, the standard
training methods are off-line, batch methods ill-suited for running on our robot. Based on our
experience with online learning described in Chapter 2, we have derived a version of recursive
maximum-likelihood estimation (RMLE) for the HSMM. While we have not yet implemented this
algorithm, this derivation may prove useful to future researchers.
In the following two sections, we will describe the mathematical model for the HSMM, and then
give a derivation of the RMLE for this model. The setup and derivation for this model is very
similar to the setup and derivation of the RMLE for the hidden Markov model in Chapter 2.
D.2 HSMM Model Description and Notation
An HSMM is a discrete-time stochastic process with three components, {Xn′, Yn′, Tn′}, defined on probability space (Ω, F, P). Let {Xn′}∞n′=1 be a discrete-time first-order semi-Markov chain with state space R = {1, . . . , r}, r a fixed known constant. As in an
HMM, the transition probabilities of the Markov chain in an HSMM are given by
aij = P (Xn′ = j|Xn′−1 = i) (D.1)
for i, j = 1, . . . , r, with the additional constraint that aii = P(Xn′ = i | Xn′−1 = i) = 0. Let A = {aij}. Then A belongs to the set of all r × r stochastic matrices (i.e., aij ≥ 0, Σj aij = 1) whose diagonal entries are zero.
Let {Tn′}∞n′=1 be a sequence of discrete durations corresponding to {Xn′}. The process {Tn′} is a probabilistic function of {Xn′}, and the corresponding conditional density of Tn′ can be described by a parametric family of densities {d(·;λ) : λ ∈ Λ}, where the density parameter λ is a function of Xn′, and Λ is the set of valid parameters for the conditional density assumed by the model. The conditional density of Tn′ given Xn′ = j can be written d(·;λj), or more simply dj(·).
Example D.1. (Gamma duration density): Suppose the durations for each state in an HMM are approximately1 distributed according to a gamma distribution. Then the parameter set is Λ = {(ν, η) ∈ R+ × R+}, λj ∈ Λ, and {Tn′} = {τn′} is a sequence of discrete-valued conditionally independent state durations on R+, with probability distribution

d(\tau_{n'}; \lambda_j) = d(\tau_{n'}; \nu_j, \eta_j) = \frac{\eta_j^{\nu_j}}{\Gamma(\nu_j)}\, \tau_{n'}^{\nu_j - 1}\, e^{-\eta_j \tau_{n'}}    (D.2)

for Xn′ = j. Here, the mean value of τn′ is νj/ηj, and the variance is νj/η²j.
1Since the durations are discrete, the correspondence will not be exact.
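The gamma duration density (D.2) is straightforward to evaluate; the following sketch uses math.lgamma for Γ(ν). Treating it as a continuous density and integrating numerically recovers unit mass and the stated mean ν/η, consistent with the example.

```python
import math

def gamma_duration(tau, nu, eta):
    """Gamma duration density (D.2):
    d(tau; nu, eta) = eta**nu / Gamma(nu) * tau**(nu - 1) * exp(-eta * tau)."""
    return (eta ** nu / math.exp(math.lgamma(nu))
            * tau ** (nu - 1) * math.exp(-eta * tau))
```

As the footnote warns, sampling this density only at integer durations does not sum exactly to one; a discrete application would renormalize.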
Example D.2. (Discrete duration density): Suppose durations Tn′ are drawn from a discrete set of times T = {1, . . . , T}. Then Λ = {(d1, . . . , dT) | ΣTτ=1 dτ = 1, dτ ≥ 0} is the set of length-T stochastic vectors, λj ∈ Λ, and {Tn′} = {τn′} is a sequence of discrete-valued conditionally independent state durations on T, each τn′ having probability

d(\tau_{n'}; \lambda_j) = d_{j\tau_{n'}}, \qquad 1 \le \tau_{n'} \le T,    (D.3)

for Xn′ = j.
As in a standard HMM, $\{X_{n'}\}$ is not visible in an HSMM, and the corresponding duration process $\{\bar{T}_{n'}\}$ is therefore unknown as well. The statistics of both are ascertained from a corresponding observable stochastic process. In an HSMM, state $X_{n'}$ produces an observation vector $\bar{Y}_{n'}$ of length $\bar{T}_{n'}$. The process $\{\bar{Y}_{n'}\}$ is therefore a probabilistic function of $\{X_{n'}\}$ and $\{\bar{T}_{n'}\}$, and the corresponding conditional density of $\bar{Y}_{n'}$ is assumed to belong to a parametric family of densities $\{b(\cdot \mid \tau; \theta) : \theta \in \Theta\}$, where $\tau$ is a sample from the duration process $\{\bar{T}_{n'}\}$, the density parameter $\theta$ is a function of $X_{n'}$, and $\Theta$ is the set of valid parameters for the particular conditional density assumed by the model. The conditional density of $\bar{Y}_{n'}$ given $X_{n'} = j$ and $\bar{T}_{n'} = \bar{\tau}_{n'}$ can be written $b(\cdot \mid \bar{\tau}_{n'}; \theta_j)$, or more simply $b_j(\cdot \mid \bar{\tau}_{n'})$. Outside of certain conditions enumerated later, the particular form of $b(\cdot \mid \bar{\tau}_{n'}; \theta_j)$ is irrelevant to our discussion.

Define the HSMM parameter space as $\Phi = \Pi \times \mathcal{A} \times \Lambda \times \Theta$. The model $\varphi \in \Phi$ is defined as

    \varphi = \{\pi_1, \ldots, \pi_r, a_{11}, a_{12}, \ldots, a_{rr}, \lambda_1, \ldots, \lambda_r, \theta_1, \ldots, \theta_r\}.    (D.4)
Example D.3. (Gamma duration densities with Gaussian observation densities): For the case of gamma duration densities with one-dimensional Gaussian observation distributions,

    \varphi = (\pi_1, \ldots, \pi_r, a_{11}, a_{12}, \ldots, a_{rr}, \nu_1, \eta_1, \ldots, \nu_r, \eta_r, \mu_1, \sigma_1, \ldots, \mu_r, \sigma_r).

As in our HMM in Chapter 3, let $p$ be the length of $\varphi$. Let $\varphi^* \in \Phi$ be the fixed set of "true" parameters of the model we are trying to estimate.
For a vector or matrix $v$, $v'$ represents its transpose. Define the $r$-dimensional column vector $b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi)$ and the $r \times r$ matrix $B(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi)$ by

    b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi) = [b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \theta_1(\varphi)), \ldots, b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \theta_r(\varphi))]'    (D.5)

and

    B(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi) = \mathrm{diag}[b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \theta_1(\varphi)), \ldots, b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \theta_r(\varphi))].    (D.6)

Similarly, for the duration densities, define the $r$-dimensional column vector $d(\bar{\tau}_{n'}; \varphi)$ and the $r \times r$ matrix $D(\bar{\tau}_{n'}; \varphi)$ by

    d(\bar{\tau}_{n'}; \varphi) = [d(\bar{\tau}_{n'}; \lambda_1(\varphi)), \ldots, d(\bar{\tau}_{n'}; \lambda_r(\varphi))]'    (D.7)

and

    D(\bar{\tau}_{n'}; \varphi) = \mathrm{diag}[d(\bar{\tau}_{n'}; \lambda_1(\varphi)), \ldots, d(\bar{\tau}_{n'}; \lambda_r(\varphi))].    (D.8)

For convenience of notation, we will define a third vector $g(\bar{y}_{n'}, \bar{\tau}_{n'}; \varphi)$ and matrix $G(\bar{y}_{n'}, \bar{\tau}_{n'}; \varphi)$ as

    g(\bar{y}_{n'}, \bar{\tau}_{n'}; \varphi) = B(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi) D(\bar{\tau}_{n'}; \varphi) 1_r
                                             = B(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi) d(\bar{\tau}_{n'}; \varphi)    (D.9)

and

    G(\bar{y}_{n'}, \bar{\tau}_{n'}; \varphi) = B(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi) D(\bar{\tau}_{n'}; \varphi).    (D.10)
Until now, we have described the model entirely using what we will call model time, where one time unit corresponds to the duration the model stays in a particular state. On the model time scale, time variables are marked with a prime ($'$), and sequence variables are marked with an overbar, as in $\bar{\tau}_{n'}$. We would like to relate this description to normal time, where each time unit represents one real unit of time.

For a given sequence of durations $\{\bar{\tau}_{n'}\}$, define the functions $t_0 : \mathbb{Z}^+ \to \mathbb{Z}^+$ and $t_1 : \mathbb{Z}^+ \to \mathbb{Z}^+$ by

    t_{0,\{\bar{\tau}_{n'}\}}(k') = \sum_{i=1}^{k'-1} \bar{\tau}_i + 1    (D.11)

    t_{1,\{\bar{\tau}_{n'}\}}(k') = \sum_{i=1}^{k'} \bar{\tau}_i.    (D.12)

These functions mark, respectively, the real-time beginning and end of the $k'$th state for duration sequence $\{\bar{\tau}_{n'}\}$.

Similarly, define a function $\xi : \mathbb{Z}^+ \to \mathbb{Z}^+$ as

    \xi_{\{\bar{\tau}_{n'}\}}(n) = k'  \text{ if }  t_{0,\{\bar{\tau}_{n'}\}}(k') \leq n \leq t_{1,\{\bar{\tau}_{n'}\}}(k').    (D.13)

This function returns the model time corresponding to normal time $n$. Together, these functions allow us to convert between the two time scales. We will often drop the explicit dependence on $\{\bar{\tau}_{n'}\}$ and simply write $t_0(n')$, $t_1(n')$, and $\xi(n)$.
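The time-scale bookkeeping of Equations (D.11)–(D.13) is easy to mistranscribe, so a small sketch may help. The function names below are our own, and both segment indices and time indices are 1-based, as in the text:

```python
def t0(durations, k):
    """Real-time index where the k'th state segment begins (Eq. D.11);
    durations is the model-time sequence (tau_1, tau_2, ...), k is 1-based."""
    return sum(durations[:k - 1]) + 1

def t1(durations, k):
    """Real-time index where the k'th state segment ends (Eq. D.12)."""
    return sum(durations[:k])

def xi(durations, n):
    """Model-time index k' of the segment covering real time n (Eq. D.13)."""
    for k in range(1, len(durations) + 1):
        if t0(durations, k) <= n <= t1(durations, k):
            return k
    raise ValueError("n lies beyond the given durations")

# Three states held for 2, 3, and 1 time steps, respectively:
# segment 1 covers real times 1-2, segment 2 covers 3-5, segment 3 covers 6.
durs = [2, 3, 1]
```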
Using these functions, we can define real-time analogs $\{X_n\}$ and $\{Y_n\}$ to $\{X_{n'}\}$ and $\{\bar{Y}_{n'}\}$, respectively. The process $\{X_n\}$ is related to $\{X_{n'}\}$ by $X_n = X_{\xi(n)}$. Random sample $\bar{Y}_{n'}$ can be written as $\bar{Y}_{n'} = \langle Y_{t_0(n')}, \ldots, Y_{t_1(n')} \rangle$.

For model $\varphi$, we would like to calculate the likelihood of a sequence of $n$ normal-time observations $\langle y_1, \ldots, y_n \rangle$. Since our model is defined in terms of $\{\bar{y}_{n'}\}$, we partition the sequence $\{y_n\}$ into $n' \leq n$ subsequences such that each subsequence corresponds to the output of a single state of the model, i.e.,

    \underbrace{y_1, \ldots, y_{t_1(1)}}_{\bar{y}_1},\ \underbrace{y_{t_0(2)}, \ldots, y_{t_1(2)}}_{\bar{y}_2},\ \ldots,\ \underbrace{y_{t_0(n')}, \ldots, y_n}_{\bar{y}_{n'}}.

For a given partition, the joint likelihood of the observation sequence and state durations is given by

    p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)
      = \pi(\varphi)' D(\bar{\tau}_1; \varphi) B(\bar{y}_1 \mid \bar{\tau}_1; \varphi) \prod_{k'=2}^{n'} A(\varphi) D(\bar{\tau}_{k'}; \varphi) B(\bar{y}_{k'} \mid \bar{\tau}_{k'}; \varphi)\, 1_r    (D.14)
      = \pi(\varphi)' G(\bar{y}_1, \bar{\tau}_1; \varphi) \prod_{k'=2}^{n'} A(\varphi) G(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)\, 1_r.    (D.15)

Averaging over all possible partitions, we can calculate $p_n(y_1, \ldots, y_n; \varphi)$ as

    p_n(y_1, \ldots, y_n; \varphi) = \sum_{n'=1}^{n} \sum_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} P(n')\, p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)    (D.16)

      = \sum_{n'=1}^{n} \sum_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} P(n')\, \pi(\varphi)' G(\bar{y}_1, \bar{\tau}_1; \varphi) \prod_{k'=2}^{n'} A(\varphi) G(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)\, 1_r,    (D.17)

where $P(n')$ is the probability that there are $n'$ subsequences in the partition.
D.3 RMLE for the HSMM
This derivation follows from the derivation of the RMLE for the standard hidden Markov model
presented in Section 2.3.1.
For the HSMM, define the prediction filter $u_{n'}(\varphi)$ as

    u_{n'}(\varphi) = [u_{n'1}(\varphi), \ldots, u_{n'r}(\varphi)]'    (D.18)

where

    u_{n'j}(\varphi) = P(X_{n'} = j \mid \bar{y}_1, \ldots, \bar{y}_{n'-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n')    (D.19)

is the probability of transitioning to state $j$ at (model) time $n'$ given all previous observations and a partition of those observations. For our derivation below, it will be useful to have a normal-time analog to $u_{n'}(\varphi)$. Let $u_n(\varphi)$ be

    u_n(\varphi) = [u_{n1}(\varphi), \ldots, u_{nr}(\varphi)]'    (D.20)

where

    u_{nj}(\varphi) = P(X_n = j \mid y_1, \ldots, y_{n-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n', \xi(n-1) = n'-1, \xi(n) = n').    (D.21)

For given $n'$ and $\{\bar{\tau}_{n'}\}$, $u_{n'}(\varphi) = u_{t_0(n')}(\varphi)$.
Using this filter, the likelihood $p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)$ can be written as

    p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)
      = \prod_{k'=1}^{n'} d(\bar{\tau}_{k'}; \varphi)' B(\bar{y}_{k'} \mid \bar{\tau}_{k'}; \varphi)\, u_{k'}(\varphi)
      = \prod_{k'=1}^{n'} g(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)'\, u_{k'}(\varphi).    (D.22)

(For this derivation, see Appendix E, Section E.2.) As above, the likelihood at normal time $n$ can be calculated by averaging over all partitions of $n$, as

    p_n(y_1, \ldots, y_n; \varphi) = \sum_{n'=1}^{n} \sum_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} P(n') \prod_{k'=1}^{n'} g(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)'\, u_{k'}(\varphi).    (D.23)
Our goal is to maximize this likelihood with respect to the parameter set $\varphi$, and in particular to find a recursive update. Unfortunately, there are a number of pragmatic problems with recursively maximizing Equation (D.23). In particular, we would like to calculate this likelihood recursively, so the summation over all partitions of $\langle y_1, \ldots, y_n \rangle$ is undesirable. To alleviate this problem, we will consider only the most likely partition of $\langle y_1, \ldots, y_n \rangle$. Rewrite Equation (D.23) as

    p_n(y_1, \ldots, y_n; \varphi) = \max_{n'=1,\ldots,n}\ \max_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} \prod_{k'=1}^{n'} g(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)'\, u_{k'}(\varphi).    (D.24)
Maximizing $p_n(y_1, \ldots, y_n; \varphi)$ is equivalent to maximizing its log-likelihood. For a given partition size $n'$ and sequence of durations $\{\bar{\tau}_{n'}\}$, define the normalized log-likelihood of the (model-time) observations $\langle \bar{y}_1, \ldots, \bar{y}_{n'} \rangle$ as

    \bar{\ell}_{n'}(\{\bar{\tau}_{n'}\}, \varphi) = \frac{1}{n'+1} \log p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)
      = \frac{1}{n'+1} \sum_{k'=1}^{n'} \log g(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)'\, u_{k'}(\varphi).    (D.25)

As in Equation (D.24), we can then write the log-likelihood of the real-time observations $\langle y_1, \ldots, y_n \rangle$ as

    \ell_n(\varphi) = \max_{n'=1,\ldots,n}\ \max_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} \bar{\ell}_{n'}(\{\bar{\tau}_{n'}\}, \varphi)
      = \max_{n'=1,\ldots,n}\ \max_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} \frac{1}{n'+1} \sum_{k'=1}^{n'} \log[g(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)'\, u_{k'}(\varphi)].    (D.26)
At any time $n$, the values of $n'$ and $\{\bar{\tau}_{n'}\}$ which maximize $\ell_n(\varphi)$ can be determined recursively, and can also be used in the recursive update of $u_n(\varphi)$. Let $n'^*_n$ be the number of segments which maximizes $\ell_n(\varphi)$, and let $\tau^*_n$ be the length of the last segment of $\{\bar{y}_{n'}\}$ which maximizes $\ell_n(\varphi)$. Given the sequence of log-likelihoods up to $\ell_{n-1}(\varphi)$, as well as the optimal state sequence lengths $n'^*_1$ through $n'^*_{n-1}$, we can maximize $\ell_n(\varphi)$ recursively with

    \tau^*_n = \arg\max_{\tau} \frac{1}{n'^*_{n-\tau} + 2} \Big( (n'^*_{n-\tau} + 1)\, \ell_{n-\tau}(\varphi) + \log\big[ g(\langle y_{n-\tau+1}, \ldots, y_n \rangle, \tau; \varphi)'\, u_{n-\tau+1}(\varphi) \big] \Big),    (D.27)

    n'^*_n = n'^*_{n-\tau^*_n} + 1,    (D.28)

and

    \ell_n(\varphi) = \max_{\tau} \frac{1}{n'^*_{n-\tau} + 2} \Big( (n'^*_{n-\tau} + 1)\, \ell_{n-\tau}(\varphi) + \log\big[ g(\langle y_{n-\tau+1}, \ldots, y_n \rangle, \tau; \varphi)'\, u_{n-\tau+1}(\varphi) \big] \Big),    (D.29)

with initialization $\tau^*_1 = 1$, $n'^*_1 = 1$, and $\ell_1(\varphi) = \frac{1}{2} \log[g(y_1, \tau^*_1; \varphi)'\, u_1(\varphi)]$.
As suggested above, we then use $\tau^*_n$ to recursively calculate $u_n(\varphi)$, using

    u_{n+1}(\varphi) = \frac{A(\varphi)'\, G(\langle y_{n-\tau^*_n+1}, \ldots, y_n \rangle, \tau^*_n; \varphi)\, u_{n-\tau^*_n+1}(\varphi)}{g(\langle y_{n-\tau^*_n+1}, \ldots, y_n \rangle, \tau^*_n; \varphi)'\, u_{n-\tau^*_n+1}(\varphi)}
      = \frac{A(\varphi)'\, G(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)\, u_{n-\tau^*_n+1}(\varphi)}{g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, u_{n-\tau^*_n+1}(\varphi)}.    (D.30)

As with the RMLE in the standard HMM, $u_n(\varphi)$ is initialized with $u_1(\varphi) = \pi(\varphi)$.
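One step of the filter update (D.30) can be sketched numerically. The text deliberately leaves the form of $b(\cdot \mid \tau; \theta)$ open; purely for illustration, the sketch below assumes a segment's frames are conditionally independent scalar Gaussians, and uses a discrete duration density as in Example D.2. All parameter values are arbitrary:

```python
import numpy as np

def segment_obs_likelihoods(seg, mu, sigma):
    """b_j(ybar | tau): per-state likelihood of an observation segment.
    Assumption for this sketch only: frames within a segment are
    conditionally independent scalar Gaussians."""
    seg = np.asarray(seg)
    per_frame = np.exp(-0.5 * ((seg[:, None] - mu) / sigma) ** 2) / (
        np.sqrt(2 * np.pi) * sigma)
    return per_frame.prod(axis=0)            # length-r vector

def hsmm_filter_update(u, A, b_vec, d_vec):
    """One application of Equation (D.30).  Since B and D are diagonal,
    G u reduces to the elementwise product g * u with g = b * d."""
    g = b_vec * d_vec                        # g(ybar, tau) = B D 1_r
    return (A.T @ (g * u)) / (g @ u)         # A' G u / (g' u)

r = 2
A = np.array([[0.0, 1.0], [1.0, 0.0]])       # a_ii = 0 in an HSMM
mu = np.array([0.0, 3.0]); sigma = np.array([1.0, 1.0])
D = np.array([[0.5, 0.3, 0.2],               # d_j(tau) for tau = 1..3
              [0.2, 0.3, 0.5]])

u = np.array([0.5, 0.5])                     # u_1 = pi
seg = [2.9, 3.1, 3.0]                        # length-3 segment near state 2's mean
tau = len(seg)
b_vec = segment_obs_likelihoods(seg, mu, sigma)
u_next = hsmm_filter_update(u, A, b_vec, D[:, tau - 1])
```

After the segment, the filter places almost all mass on state 1: with $a_{ii} = 0$, the chain must leave the state that just emitted the segment.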
Let $w^{(l)}_n(\varphi) = (\partial/\partial\varphi_l)\, u_n(\varphi)$ be the partial derivative of $u_n(\varphi)$ with respect to the $l$th component of $\varphi$. Each $w^{(l)}_n(\varphi)$ is an $r$-length column vector, and

    w_n(\varphi) = (w^{(1)}_n(\varphi), w^{(2)}_n(\varphi), \ldots, w^{(p)}_n(\varphi))    (D.31)

is an $r \times p$ matrix. Taking the derivative of $u_n(\varphi)$ from Equation (D.30), we get

    w^{(l)}_{n+1}(\varphi) = \frac{\partial u_{n+1}(\varphi)}{\partial \varphi_l} = R_1(\bar{y}_{\xi(n)}, \tau^*_n, \varphi)\, w^{(l)}_{n-\tau^*_n+1}(\varphi) + R^{(l)}_2(\bar{y}_{\xi(n)}, \tau^*_n, \varphi)    (D.32)

with

    R_1(\bar{y}_{n'}, \tau, \varphi) = A(\varphi)' \left[ I - \frac{G(\bar{y}_{n'}, \tau; \varphi)\, u_n(\varphi)\, 1_r'}{g(\bar{y}_{n'}, \tau; \varphi)'\, u_n(\varphi)} \right] \frac{G(\bar{y}_{n'}, \tau; \varphi)}{g(\bar{y}_{n'}, \tau; \varphi)'\, u_n(\varphi)}    (D.33)

    R^{(l)}_2(\bar{y}_{n'}, \tau, \varphi) = A(\varphi)' \left[ I - \frac{G(\bar{y}_{n'}, \tau; \varphi)\, u_n(\varphi)\, 1_r'}{g(\bar{y}_{n'}, \tau; \varphi)'\, u_n(\varphi)} \right] \frac{[\partial G(\bar{y}_{n'}, \tau; \varphi)/\partial\varphi_l]\, u_n(\varphi)}{g(\bar{y}_{n'}, \tau; \varphi)'\, u_n(\varphi)} + \frac{[\partial A(\varphi)'/\partial\varphi_l]\, G(\bar{y}_{n'}, \tau; \varphi)\, u_n(\varphi)}{g(\bar{y}_{n'}, \tau; \varphi)'\, u_n(\varphi)}    (D.34)

where

    \frac{\partial}{\partial\varphi_l} G(\bar{y}_{n'}, \tau; \varphi) = \frac{\partial}{\partial\varphi_l} [B(\bar{y}_{n'} \mid \tau; \varphi)\, D(\tau; \varphi)]
      = \frac{\partial B(\bar{y}_{n'} \mid \tau; \varphi)}{\partial\varphi_l}\, D(\tau; \varphi) + B(\bar{y}_{n'} \mid \tau; \varphi)\, \frac{\partial D(\tau; \varphi)}{\partial\varphi_l}.    (D.35)

Using these equations, we can recursively calculate $w_n(\varphi)$ at every iteration.
To estimate the set of optimal parameters $\varphi^*$, we want to find the maximum of $\ell_n(\varphi)$ with respect to $\varphi$, which we will attempt via recursive stochastic approximation. For each parameter $l$ in $\varphi$, at each time $n$, we take $(\partial/\partial\varphi_l)$ of the most recent term inside the summation in Equation (D.26) to form an "incremental score vector"

    S(\mathcal{Y}_n; \varphi) = \big( S^{(1)}(\mathcal{Y}_n; \varphi), \ldots, S^{(p)}(\mathcal{Y}_n; \varphi) \big)'    (D.36)

with

    S^{(l)}(\mathcal{Y}_n; \varphi) = \frac{\partial}{\partial\varphi_l} \log[g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, u_n(\varphi)]
      = \frac{g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, [(\partial/\partial\varphi_l)\, u_n(\varphi)] + [(\partial/\partial\varphi_l)\, g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)]'\, u_n(\varphi)}{g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, u_n(\varphi)}
      = \frac{g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, w^{(l)}_n(\varphi) + [(\partial/\partial\varphi_l)\, g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)]'\, u_n(\varphi)}{g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, u_n(\varphi)}    (D.37)

where

    \frac{\partial}{\partial\varphi_l} g(\bar{y}_{n'}, \tau; \varphi) = \frac{\partial}{\partial\varphi_l} [B(\bar{y}_{n'} \mid \tau; \varphi)\, d(\tau; \varphi)]
      = \frac{\partial B(\bar{y}_{n'} \mid \tau; \varphi)}{\partial\varphi_l}\, d(\tau; \varphi) + B(\bar{y}_{n'} \mid \tau; \varphi)\, \frac{\partial d(\tau; \varphi)}{\partial\varphi_l}    (D.38)

and

    \mathcal{Y}_n \triangleq (Y_n, T_n, u_n(\varphi), w_n(\varphi)),    (D.39)

where $T_n = \tau^*_n$.
As before, the RMLE algorithm takes the form

    \varphi_{n+1} = \Pi_G \big( \varphi_n + \varepsilon_n S(\mathcal{Y}_n; \varphi_n) \big)    (D.40)

where $\{\varepsilon_n\}$ is a sequence of step sizes satisfying $\varepsilon_n \geq 0$, $\varepsilon_n \to 0$, and $\sum_n \varepsilon_n = \infty$, $G$ is a compact and convex set, and $\Pi_G$ is a projection onto the set $G$.

Equations (D.34) and (D.37) can both be simplified for each type of parameter in $\varphi$. For the HMM, we have completed this simplification for the model parameters of different assumed observation densities. In the HSMM, this simplification must in particular be done for the parameters of the chosen duration density $d(\tau; \varphi)$ and observation density $b(y; \varphi)$.
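A single RMLE iteration (D.40) is a projected stochastic-gradient step. The sketch below uses a coordinate box for the compact, convex set $G$ (the text does not prescribe a particular $G$), with hypothetical parameter and score values:

```python
import numpy as np

def rmle_step(phi, score, eps, lower, upper):
    """One RMLE update (Equation D.40): an ascent step along the
    incremental score, projected back onto G.  Here G is a box
    [lower, upper]^p purely for illustration, so the projection
    Pi_G is simple elementwise clipping."""
    return np.clip(phi + eps * score, lower, upper)

# Hypothetical 3-parameter model and incremental score vector.
phi = np.array([0.4, 0.6, 1.5])
score = np.array([2.0, -1.0, 10.0])

# Step sizes such as eps_n = 1/n satisfy eps_n >= 0, eps_n -> 0,
# and sum eps_n = infinity; here we take a single step with eps = 0.1.
phi_1 = rmle_step(phi, score, 0.1, 0.0, 2.0)
```

The third parameter would move to 2.5 but is projected back to the boundary of $G$.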
APPENDIX E
RMLE DERIVATIONS
E.1 Proof that $p_n(y_1, \ldots, y_n; \varphi) = \prod_{k=1}^{n} b(y_k; \varphi)'\, u_k(\varphi)$
In Section 2.3.1, we state that $p_n(y_1, \ldots, y_n; \varphi)$ is equivalent to $\prod_{k=1}^{n} b(y_k; \varphi)'\, u_k(\varphi)$. We have

    p_n(y_1, \ldots, y_n) = p(y_1)\, p(y_2 \mid y_1)\, p(y_3 \mid y_2, y_1) \cdots p(y_n \mid y_1, \ldots, y_{n-1})

      = \sum_j p(y_1, x_1 = j) \sum_j p(y_2, x_2 = j \mid y_1) \sum_j p(y_3, x_3 = j \mid y_1, y_2) \cdots \sum_j p(y_n, x_n = j \mid y_1, \ldots, y_{n-1})

      = \sum_j p(y_1 \mid x_1 = j)\, P(x_1 = j) \sum_j p(y_2 \mid x_2 = j, y_1)\, P(x_2 = j \mid y_1) \cdots \sum_j p(y_n \mid x_n = j, y_1, \ldots, y_{n-1})\, P(x_n = j \mid y_1, \ldots, y_{n-1})

      = \sum_j p(y_1 \mid x_1 = j)\, P(x_1 = j) \sum_j p(y_2 \mid x_2 = j)\, P(x_2 = j \mid y_1) \cdots \sum_j p(y_n \mid x_n = j)\, P(x_n = j \mid y_1, \ldots, y_{n-1})

      = \sum_j b_j(y_1)\, u_{1j} \sum_j b_j(y_2)\, u_{2j} \sum_j b_j(y_3)\, u_{3j} \cdots \sum_j b_j(y_n)\, u_{nj}

      = (b(y_1)' u_1) \cdot (b(y_2)' u_2) \cdots (b(y_n)' u_n)

      = \prod_k b(y_k)' u_k,    (E.1)

where the fourth line follows from the fact that each observation depends only on the current state in an HMM.
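The identity (E.1) can be checked numerically on a small HMM by comparing a brute-force sum of the joint likelihood over all state sequences against the product of one-step predictive likelihoods $b(y_k)' u_k$, with $u$ updated recursively as in Equation (E.4). The model values below are arbitrary:

```python
import itertools
import numpy as np

# A tiny 2-state, 2-symbol HMM, chosen arbitrarily for the check.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # B[j, v] = b_j(v)
y = [0, 1, 1, 0]

# Brute force: sum the joint p(y, x) over all 2^4 state sequences.
p_brute = 0.0
for x in itertools.product(range(2), repeat=len(y)):
    p = pi[x[0]] * B[x[0], y[0]]
    for k in range(1, len(y)):
        p *= A[x[k - 1], x[k]] * B[x[k], y[k]]
    p_brute += p

# Filter factorization: p_n = prod_k b(y_k)' u_k, with u_1 = pi and
# u_{k+1} = A' diag(b(y_k)) u_k / (b(y_k)' u_k), cf. Equation (E.4).
u = pi.copy()
p_filter = 1.0
for obs in y:
    b = B[:, obs]
    p_filter *= b @ u
    u = A.T @ (b * u) / (b @ u)
```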
E.2 Proof that $p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi) = \prod_{k'=1}^{n'} d(\bar{\tau}_{k'})'\, B(\bar{y}_{k'} \mid \bar{\tau}_{k'})\, u_{k'}$
In Section D.3, we make a similar assertion regarding the HSMM, that $p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)$ is equivalent to $\prod_{k'=1}^{n'} d(\bar{\tau}_{k'}; \varphi)'\, B(\bar{y}_{k'} \mid \bar{\tau}_{k'}; \varphi)\, u_{k'}(\varphi)$. We have

    p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}) = p(\bar{y}_1, \bar{\tau}_1 \mid n')\, p(\bar{y}_2, \bar{\tau}_2 \mid \bar{y}_1, \bar{\tau}_1, n') \cdots p(\bar{y}_{n'}, \bar{\tau}_{n'} \mid \bar{y}_1, \ldots, \bar{y}_{n'-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n')

      = \sum_j p(\bar{y}_1, \bar{\tau}_1, x_1 = j \mid n') \sum_j p(\bar{y}_2, \bar{\tau}_2, x_2 = j \mid \bar{y}_1, \bar{\tau}_1, n') \cdots \sum_j p(\bar{y}_{n'}, \bar{\tau}_{n'}, x_{n'} = j \mid \bar{y}_1, \ldots, \bar{y}_{n'-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n')

      = \sum_j p(\bar{y}_1, \bar{\tau}_1 \mid x_1 = j)\, P(x_1 = j \mid n') \sum_j p(\bar{y}_2, \bar{\tau}_2 \mid x_2 = j)\, P(x_2 = j \mid \bar{y}_1, \bar{\tau}_1, n') \cdots \sum_j p(\bar{y}_{n'}, \bar{\tau}_{n'} \mid x_{n'} = j)\, P(x_{n'} = j \mid \bar{y}_1, \ldots, \bar{y}_{n'-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n')

      = \sum_j p(\bar{y}_1 \mid \bar{\tau}_1, x_1 = j)\, p(\bar{\tau}_1 \mid x_1 = j)\, P(x_1 = j \mid n') \sum_j p(\bar{y}_2 \mid \bar{\tau}_2, x_2 = j)\, p(\bar{\tau}_2 \mid x_2 = j)\, P(x_2 = j \mid \bar{y}_1, \bar{\tau}_1, n') \cdots \sum_j p(\bar{y}_{n'} \mid \bar{\tau}_{n'}, x_{n'} = j)\, p(\bar{\tau}_{n'} \mid x_{n'} = j)\, P(x_{n'} = j \mid \bar{y}_1, \ldots, \bar{y}_{n'-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n')

      = \sum_j b_j(\bar{y}_1 \mid \bar{\tau}_1)\, d_j(\bar{\tau}_1)\, u_{1j} \sum_j b_j(\bar{y}_2 \mid \bar{\tau}_2)\, d_j(\bar{\tau}_2)\, u_{2j} \cdots \sum_j b_j(\bar{y}_{n'} \mid \bar{\tau}_{n'})\, d_j(\bar{\tau}_{n'})\, u_{n'j}

      = \prod_{k'=1}^{n'} 1_r'\, B(\bar{y}_{k'} \mid \bar{\tau}_{k'})\, D(\bar{\tau}_{k'})\, u_{k'}

      = \prod_{k'=1}^{n'} b(\bar{y}_{k'} \mid \bar{\tau}_{k'})'\, D(\bar{\tau}_{k'})\, u_{k'}

      = \prod_{k'=1}^{n'} d(\bar{\tau}_{k'})'\, B(\bar{y}_{k'} \mid \bar{\tau}_{k'})\, u_{k'},    (E.2)

where the third line follows from the assumption that each observation and each duration depend only on the current state in an HSMM. The definitions of $b(\cdot)$, $B(\cdot)$, $d(\cdot)$, $D(\cdot)$, and $u_{n'}$ come from Sections D.2 and D.3.
E.3 Specialized RMLE Formulas
Section 2.3 of Chapter 2 gives the basic derivation of the RMLE algorithm. For completeness, we
restate the generalized parameter estimation formulas here, followed by their specialization for each
parameter type in particular HMMs.
Remember that the log-likelihood is defined as

    \ell_n(\varphi) = \frac{1}{n+1} \sum_{k=1}^{n} \log[b(y_k; \varphi)'\, u_k(\varphi)],    (E.3)

where $b(y_n; \varphi)$ is the observation likelihood vector for observation $y_n$, and $u_n(\varphi) = [u_{n1}(\varphi), \ldots, u_{nr}(\varphi)]'$ is the vector of prior state probabilities at time $n$, with $u_{ni}(\varphi) = P(x_n = i \mid y_1, \ldots, y_{n-1})$. This vector can be calculated recursively using

    u_{n+1}(\varphi) = \frac{A(\varphi)'\, B(y_n; \varphi)\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}.    (E.4)
Taking the derivative of the last term in the summation of Equation (E.3) with respect to each $\varphi_l$, we get

    S^{(l)}(\mathcal{Y}_n; \varphi) = \frac{b(y_n; \varphi)'\, w^{(l)}_n(\varphi) + [(\partial/\partial\varphi_l)\, b(y_n; \varphi)]'\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)},    (E.5)

where $\mathcal{Y}_n = (Y_n, u_n(\varphi), w_n(\varphi))$ and $w^{(l)}_n(\varphi) = (\partial/\partial\varphi_l)\, u_n(\varphi)$. The value of $w^{(l)}_n(\varphi)$ can be calculated recursively using

    w^{(l)}_{n+1}(\varphi) = \frac{\partial u_{n+1}(\varphi)}{\partial\varphi_l} = R_1(y_n, \varphi)\, w^{(l)}_n(\varphi) + R^{(l)}_2(y_n, \varphi),    (E.6)

where

    R_1(y_n, u_n(\varphi), \varphi) = A(\varphi)' \left[ I - \frac{B(y_n; \varphi)\, u_n(\varphi)\, 1_r'}{b(y_n; \varphi)'\, u_n(\varphi)} \right] \frac{B(y_n; \varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}    (E.7)

    R^{(l)}_2(y_n, u_n(\varphi), \varphi) = A(\varphi)' \left[ I - \frac{B(y_n; \varphi)\, u_n(\varphi)\, 1_r'}{b(y_n; \varphi)'\, u_n(\varphi)} \right] \frac{[\partial B(y_n; \varphi)/\partial\varphi_l]\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)} + \frac{[\partial A(\varphi)'/\partial\varphi_l]\, B(y_n; \varphi)\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}.    (E.8)

A version of both $S^{(l)}(\mathcal{Y}_n; \varphi)$ and $R^{(l)}_2(y_n, u_n(\varphi), \varphi)$ must be derived separately for each type of parameter in $\ell_n(\varphi)$.
E.3.1 Transition probabilities

For transition probabilities $\varphi_l = a_{ij}(\varphi)$, $\partial B(y_n; \varphi)/\partial\varphi_l$ is zero. Abusing notation slightly, let $l = a_{ij}$ refer to parameter $a_{ij}$ in HMM $\varphi$. Then

    S^{(a_{ij})}(\mathcal{Y}_n; \varphi) = \frac{b(y_n; \varphi)'\, w^{(a_{ij})}_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}    (E.9)

and

    R^{(a_{ij})}_2 = \frac{[\partial A(\varphi)'/\partial\varphi_{a_{ij}}]\, B(y_n; \varphi)\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)},    (E.10)

where $\partial A(\varphi)/\partial\varphi_{a_{ij}}$ is a matrix with a 1 at position $(i, j)$ and zeros elsewhere.
E.3.2 Discrete observation probabilities

For observations drawn from a finite discrete set $V = \{v_1, \ldots, v_s\}$, let $\varphi_l = b_{jk}(\varphi)$ and, as above, let $l = b_{jk}$. Then

    S^{(b_{jk})}(\mathcal{Y}_n; \varphi) =
      \begin{cases}
        \dfrac{b(y_n; \varphi)'\, w^{(b_{jk})}_n(\varphi) + [(\partial/\partial\varphi_{b_{jk}})\, b(y_n; \varphi)]'\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)} & \text{if } y_n = v_k \\[2ex]
        \dfrac{b(y_n; \varphi)'\, w^{(b_{jk})}_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)} & \text{if } y_n \neq v_k
      \end{cases}    (E.11)

and

    R^{(b_{jk})}_2(y_n, u_n(\varphi), \varphi) =
      \begin{cases}
        A(\varphi)' \left[ I - \dfrac{B(y_n; \varphi)\, u_n(\varphi)\, 1_r'}{b(y_n; \varphi)'\, u_n(\varphi)} \right] \dfrac{[\partial B(y_n; \varphi)/\partial\varphi_l]\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)} & \text{if } y_n = v_k \\[2ex]
        0 & \text{if } y_n \neq v_k.
      \end{cases}    (E.12)

Note that even when $y_n \neq v_k$, $R_1(y_n, u_n(\varphi), \varphi)$, and therefore $w^{(l)}_n(\varphi)$ and $S^{(l)}(\mathcal{Y}_n; \varphi)$, are nonzero.
E.3.3 Gaussian observation likelihoods

For the case of continuous observation likelihood pdfs,

    S^{(l)}(\mathcal{Y}_n; \varphi) = \frac{b(y_n; \varphi)'\, w^{(l)}_n(\varphi) + [(\partial/\partial\varphi_l)\, b(y_n; \varphi)]'\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}    (E.13)

and

    R^{(l)}_2(y_n, u_n(\varphi), \varphi) = A(\varphi)' \left[ I - \frac{B(y_n; \varphi)\, u_n(\varphi)\, 1_r'}{b(y_n; \varphi)'\, u_n(\varphi)} \right] \frac{[\partial B(y_n; \varphi)/\partial\varphi_l]\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}.    (E.14)

Here we assume that the observation likelihoods are given by a multidimensional Gaussian function with dimension $d$, mean vector $\mu$, and covariance matrix $\Sigma$. This likelihood is defined as

    b(y; \theta) = \mathcal{N}(y; \mu(\theta), \Sigma(\theta)) = \frac{1}{(2\pi)^{d/2} |\Sigma(\theta)|^{1/2}} \exp\left[ -\frac{1}{2} (y - \mu(\theta))'\, \Sigma(\theta)^{-1}\, (y - \mu(\theta)) \right],    (E.15)

where $|\Sigma|$ indicates the determinant of $\Sigma$, and $y'$ indicates the transpose of vector $y$. In the formulation here and the derivation below, $a$, $b$, $y$, and $\mu$ are all column vectors, $\Sigma$ is the covariance matrix, and $X$ is a square matrix. For convenience of notation, we will drop the explicit dependence on $\theta$. We assume real values for all calculations. We need to take the derivative of $\mathcal{N}(y; \mu, \Sigma)$ with respect to the elements of the mean vector $\mu$ and covariance matrix $\Sigma$.
E.3.3.1 Mean vector µ

For $\mu$, we can compute all elements at once by taking the vector derivative (gradient), as

    \frac{\partial}{\partial\mu} \mathcal{N}(y; \mu, \Sigma) = \frac{\partial}{\partial\mu} \left( \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \right)

      = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \times \frac{\partial}{\partial\mu} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right]

      = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \times \frac{\partial}{\partial\mu} \left( -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right)

      = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \times \left( \frac{1}{2} \right) \left( \Sigma^{-1} + \Sigma^{-T} \right) (y-\mu)

      = \mathcal{N}(y; \mu, \Sigma) \left( \Sigma^{-1} (y-\mu) \right),    (E.16)

where in step 4 we used the identity $\frac{\partial}{\partial a}(a' X a) = (X + X')\, a$ [131], applied with $a = y - \mu$ and $X = \Sigma^{-1}$ (the chain rule through $y - \mu$ cancels the factor $-\frac{1}{2}$ and the inner minus sign), and in step 5 we used the fact that $\Sigma$ (and therefore $\Sigma^{-1}$) is symmetric.
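Equation (E.16) is easy to verify by finite differences. The sketch below compares the analytic gradient with central differences at an arbitrary test point:

```python
import numpy as np

def gaussian(y, mu, Sigma):
    """Multivariate Gaussian density, Equation (E.15)."""
    d = len(mu)
    diff = y - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / (
        (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

# Arbitrary 2-d test point and parameters.
y = np.array([0.5, -1.0])
mu = np.array([0.2, 0.3])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

# Analytic gradient from Equation (E.16): N(y; mu, Sigma) Sigma^{-1}(y - mu).
grad = gaussian(y, mu, Sigma) * np.linalg.solve(Sigma, y - mu)

# Central finite differences, component by component.
h = 1e-6
fd = np.zeros(2)
for i in range(2):
    e = np.zeros(2); e[i] = h
    fd[i] = (gaussian(y, mu + e, Sigma) - gaussian(y, mu - e, Sigma)) / (2 * h)
```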
E.3.3.2 Covariance matrix Σ

To take the derivative with respect to $\Sigma$, let $g_1(\Sigma) = \left( (2\pi)^{d/2} |\Sigma|^{1/2} \right)^{-1}$ and $g_2(\Sigma) = \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right]$. We then take

    \frac{\partial}{\partial\Sigma} \mathcal{N}(y; \mu, \Sigma) = \frac{\partial}{\partial\Sigma} \left( \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \right)
      = \frac{\partial}{\partial\Sigma} (g_1(\Sigma)\, g_2(\Sigma))
      = g_1(\Sigma)\, \frac{\partial}{\partial\Sigma} g_2(\Sigma) + g_2(\Sigma)\, \frac{\partial}{\partial\Sigma} g_1(\Sigma).    (E.17)

Taking the derivative of $g_1(\Sigma)$ with respect to $\Sigma$,

    \frac{\partial}{\partial\Sigma} g_1(\Sigma) = (2\pi)^{-d/2} \frac{\partial}{\partial\Sigma} |\Sigma|^{-1/2}
      = (2\pi)^{-d/2} \left( -\frac{1}{2} \right) |\Sigma|^{-3/2} \frac{\partial}{\partial\Sigma} |\Sigma|
      = (2\pi)^{-d/2} \left( -\frac{1}{2} \right) |\Sigma|^{-3/2} \times |\Sigma| \left( 2\Sigma^{-1} - \mathrm{diag}(\Sigma^{-1}) \right)
      = -\frac{1}{2} \left( (2\pi)^{d/2} |\Sigma|^{1/2} \right)^{-1} \left( 2\Sigma^{-1} - \mathrm{diag}(\Sigma^{-1}) \right),    (E.18)

where in line 3 we used the identity $\frac{\partial}{\partial X} |X| = |X| \left( 2X^{-1} - \mathrm{diag}(X^{-1}) \right)$ when $X$ is symmetric (see Section F.2), and $\mathrm{diag}(X)$ denotes a square matrix containing the main diagonal of $X$, with zeros off the diagonal.

For $g_2(\Sigma)$,

    \frac{\partial}{\partial\Sigma} g_2(\Sigma) = \frac{\partial}{\partial\Sigma} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right]
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \frac{\partial}{\partial\Sigma} \left( -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right)
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \left( \frac{1}{2} \right) \left( 2\Upsilon - \mathrm{diag}(\Upsilon) \right),    (E.19)

where

    \Upsilon = \Sigma^{-1} (y-\mu)(y-\mu)'\, \Sigma^{-1}.    (E.20)

In the last line, we used the identity $\frac{\partial}{\partial X}\, a' X^{-1} a = -2X^{-1}aa'X^{-1} + \mathrm{diag}(X^{-1}aa'X^{-1})$ when $X$ is symmetric [131, 132] (for this derivation, see also Section F.3).
Substituting Equations (E.18) and (E.19) into Equation (E.17), we get

    \frac{\partial}{\partial\Sigma} \mathcal{N}(y; \mu, \Sigma) = \frac{1}{2} \left( (2\pi)^{d/2} |\Sigma|^{1/2} \right)^{-1} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \times \left( (2\Upsilon - \mathrm{diag}(\Upsilon)) - (2\Sigma^{-1} - \mathrm{diag}(\Sigma^{-1})) \right)
      = \frac{1}{2}\, \mathcal{N}(y; \mu, \Sigma) \left( 2\bar{\Upsilon} - \mathrm{diag}(\bar{\Upsilon}) \right),    (E.21)

where $\bar{\Upsilon}$ is defined as

    \bar{\Upsilon} = \Upsilon - \Sigma^{-1} = \Sigma^{-1} (y-\mu)(y-\mu)'\, \Sigma^{-1} - \Sigma^{-1}.    (E.22)
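Equation (E.21) uses the symmetric-matrix derivative convention, under which $\sigma_{ij} = \sigma_{ji}$ is a single free variable; a finite-difference check must therefore perturb both off-diagonal entries together. Test values below are arbitrary:

```python
import numpy as np

def gaussian(y, mu, Sigma):
    """Multivariate Gaussian density, Equation (E.15)."""
    d = len(mu)
    diff = y - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / (
        (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

y = np.array([0.5, -1.0])
mu = np.array([0.2, 0.3])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

# Analytic gradient, Equations (E.21)-(E.22).
Si = np.linalg.inv(Sigma)
diff = (y - mu)[:, None]
Ups = Si @ diff @ diff.T @ Si              # Upsilon, Equation (E.20)
Ubar = Ups - Si                            # Upsilon-bar, Equation (E.22)
grad = 0.5 * gaussian(y, mu, Sigma) * (2 * Ubar - np.diag(np.diag(Ubar)))

# Central differences: off-diagonal perturbations touch (i,j) and (j,i)
# at once, since they are one variable of a symmetric matrix.
h = 1e-6
fd = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2))
        E[i, j] = h; E[j, i] = h           # sets the diagonal once when i == j
        fd[i, j] = (gaussian(y, mu, Sigma + E)
                    - gaussian(y, mu, Sigma - E)) / (2 * h)
```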
E.3.3.3 Upper triangular matrix R, for R′R = Σ

A particularly useful form in which to store the covariance information is the upper triangular matrix $R$ of the Cholesky decomposition of $\Sigma$, i.e., $R'R = \Sigma$. This form is convenient for two reasons. First, it is an intermediate form when taking the inverse of a symmetric matrix, i.e., we can write $\Sigma^{-1} = R^{-1}R^{-T}$, where $R^{-T}$ is the inverse of the transpose of $R$. Second, $|\Sigma|^{1/2}$, the square root of the determinant of $\Sigma$, is equal to the product of the diagonal elements of $R$.

Define a modified Gaussian function $\mathcal{N}(y; \mu(\theta), R(\theta))$ as

    \mathcal{N}(y; \mu(\theta), R(\theta)) = \frac{1}{(2\pi)^{d/2} |R(\theta)'R(\theta)|^{1/2}} \exp\left[ -\frac{1}{2} (y - \mu(\theta))'\, R(\theta)^{-1} R(\theta)^{-T}\, (y - \mu(\theta)) \right].    (E.23)

For convenience of notation, we will drop the explicit dependency on $\theta$. The derivative of $\mathcal{N}(y; \mu, R)$ with respect to $\mu$ is the same as before. The derivative with respect to the matrix $R$ is similar to the derivative of $\mathcal{N}(y; \mu, \Sigma)$ with respect to $\Sigma$. As before, let

    h_1(R) = \left( (2\pi)^{d/2} |R'R|^{1/2} \right)^{-1}    (E.24)

and

    h_2(R) = \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right].    (E.25)
We then take

    \frac{\partial}{\partial R} \mathcal{N}(y; \mu, R) = \frac{\partial}{\partial R} \left( \frac{1}{(2\pi)^{d/2} |R'R|^{1/2}} \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right] \right)
      = \frac{\partial}{\partial R} (h_1(R)\, h_2(R))
      = h_1(R)\, \frac{\partial}{\partial R} h_2(R) + h_2(R)\, \frac{\partial}{\partial R} h_1(R).    (E.26)

Taking the derivative of $h_1(R)$ with respect to $R$,

    \frac{\partial}{\partial R} h_1(R) = (2\pi)^{-d/2} \frac{\partial}{\partial R} |R'R|^{-1/2}
      = (2\pi)^{-d/2} \left( -\frac{1}{2} \right) |R'R|^{-3/2} \frac{\partial}{\partial R} |R'R|
      = (2\pi)^{-d/2} \left( -\frac{1}{2} \right) |R'R|^{-3/2} \times 2|R'R|\, R\, (R'R)^{-1}
      = -(2\pi)^{-d/2} |R'R|^{-1/2} \times R\, (R'R)^{-1}
      = -\left( (2\pi)^{d/2} |R'R|^{1/2} \right)^{-1} \times R\Sigma^{-1},    (E.27)

where in line 3 we used the identity $\frac{\partial}{\partial X} |X'X| = 2|X'X|\, X\, (X'X)^{-1}$ for real, nonsingular $X$ [131].
For $h_2(R)$,

    \frac{\partial}{\partial R} h_2(R) = \frac{\partial}{\partial R} \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right]
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right] \frac{\partial}{\partial R} \left( -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right)
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right] \left[ R\, R^{-1}R^{-T} (y-\mu)(y-\mu)'\, R^{-1}R^{-T} \right]
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \left[ R\, \Sigma^{-1} (y-\mu)(y-\mu)'\, \Sigma^{-1} \right]
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] R\Upsilon,    (E.28)

where, as before,

    \Upsilon = \Sigma^{-1} (y-\mu)(y-\mu)'\, \Sigma^{-1}.    (E.29)

In line 3, we use the identity $\frac{\partial}{\partial X} (a^T X^{-1}X^{-T} a) = -2X\, X^{-1}X^{-T} a a^T X^{-1}X^{-T}$, for which the derivation appears in Section F.4.
Substituting Equations (E.27) and (E.28) into Equation (E.26), we get

    \frac{\partial}{\partial R} \mathcal{N}(y; \mu, R) = \left( (2\pi)^{d/2} |R'R|^{1/2} \right)^{-1} \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right] \times R \left( \Upsilon - \Sigma^{-1} \right)
      = \mathcal{N}(y; \mu, R)\, R\bar{\Upsilon},    (E.30)

where, as before, $\bar{\Upsilon}$ is defined as

    \bar{\Upsilon} = \Upsilon - \Sigma^{-1} = \Sigma^{-1} (y-\mu)(y-\mu)'\, \Sigma^{-1} - \Sigma^{-1}.    (E.31)
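Equation (E.30) can likewise be checked by finite differences. Since the identities used in the derivation hold for a general (not merely triangular) matrix, the check below perturbs every entry of $R$; values are arbitrary:

```python
import numpy as np

def gaussian_chol(y, mu, R):
    """Modified Gaussian of Equation (E.23), parameterized by R with R'R = Sigma."""
    d = len(mu)
    z = np.linalg.solve(R.T, y - mu)       # R^{-T} (y - mu)
    return np.exp(-0.5 * z @ z) / ((2 * np.pi) ** (d / 2)
                                   * np.sqrt(np.linalg.det(R.T @ R)))

y = np.array([0.5, -1.0])
mu = np.array([0.2, 0.3])
R = np.array([[1.0, 0.3], [0.0, 1.2]])     # upper triangular, R'R = Sigma

# Analytic gradient, Equations (E.30)-(E.31).
Sigma = R.T @ R
Si = np.linalg.inv(Sigma)
diff = (y - mu)[:, None]
Ubar = Si @ diff @ diff.T @ Si - Si        # Upsilon-bar
grad = gaussian_chol(y, mu, R) * R @ Ubar  # N(y; mu, R) R Upsilon-bar

# Central finite differences over all entries of R.
h = 1e-6
fd = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2)); E[i, j] = h
        fd[i, j] = (gaussian_chol(y, mu, R + E)
                    - gaussian_chol(y, mu, R - E)) / (2 * h)
```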
APPENDIX F
MATRIX CALCULUS
F.1 Introduction
A few of the derivations in Appendix E depend on matrix derivatives. Some of these derivatives
were taken from other sources [131, 132], but at least one requires some additional derivation not
found elsewhere.
F.2 Preliminaries
For the derivations below, $X$ is assumed to be a square matrix, $(X)_{ij} = x_{ij}$ is the element at position $(i, j)$ of matrix $X$, $a$ is a vector, and $e_i$ is the $i$th column of the identity matrix $I$. Let $X_{ij}$ refer to cofactor $(i, j)$ of matrix $X$. For a vector or matrix $v$, $v^T$ indicates its transpose. Let $X^{-1}$ be the inverse of $X$, and let $X^{-T}$ be the inverse of the transpose of $X$. When taking the inverse, we assume that $X$ is nonsingular.
We will use the following identities below. For nonsymmetric $X$,

    \frac{\partial}{\partial x_{ij}} X = e_i e_j^T,    (F.1)

where $e_i e_j^T$ is a square matrix with a one at position $(i, j)$ and zeros elsewhere. If $X$ is symmetric,

    \frac{\partial}{\partial x_{ij}} X = \begin{cases} e_i e_j^T & \text{if } i = j \\ e_i e_j^T + e_j e_i^T & \text{if } i \neq j. \end{cases}    (F.2)

The derivative of $X^T X$ is given by

    \frac{\partial}{\partial x_{ij}} X^T X = \left( \frac{\partial}{\partial x_{ij}} X^T \right) X + X^T \left( \frac{\partial}{\partial x_{ij}} X \right) = e_j e_i^T X + X^T e_i e_j^T    (F.3)

for nonsymmetric $X$.
For matrix $X$,

    |X| = \sum_j x_{ij} X_{ij}    (F.4)

for any fixed $i$ [132]. Because each cofactor $X_{ij}$ is independent of $x_{ij}$, this implies, for nonsymmetric $X$, that

    \frac{\partial}{\partial x_{ij}} |X| = X_{ij}    (F.5)

and

    \frac{\partial}{\partial X} |X| = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1r} \\ X_{21} & X_{22} & \cdots & X_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ X_{r1} & X_{r2} & \cdots & X_{rr} \end{bmatrix} = |X| X^{-T}    (F.6)

(from [131, 132]). If $X$ is symmetric,

    \frac{\partial}{\partial x_{ij}} |X| = \begin{cases} X_{ij} & \text{if } i = j \\ 2X_{ij} & \text{if } i \neq j \end{cases}    (F.7)

and

    \frac{\partial}{\partial X} |X| = \begin{bmatrix} X_{11} & 2X_{12} & \cdots & 2X_{1r} \\ 2X_{21} & X_{22} & \cdots & 2X_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ 2X_{r1} & 2X_{r2} & \cdots & X_{rr} \end{bmatrix} = |X| \left( 2X^{-T} - \mathrm{diag}(X^{-T}) \right).    (F.8)
F.3 Derivation of $\frac{\partial}{\partial X} a^T X^{-1} a$

This derivation comes from [131]. We start with

    0 = \frac{\partial}{\partial x_{ij}} I = \frac{\partial}{\partial x_{ij}} \left( X X^{-1} \right) = \frac{\partial}{\partial x_{ij}} (X)\, X^{-1} + X\, \frac{\partial}{\partial x_{ij}} (X^{-1}),    (F.9)

which implies

    \frac{\partial}{\partial x_{ij}} X^{-1} = -X^{-1}\, \frac{\partial}{\partial x_{ij}} (X)\, X^{-1}.    (F.10)

Therefore, for nonsymmetric $X$,

    \frac{\partial}{\partial x_{ij}}\, a^T X^{-1} a = -a^T X^{-1}\, \frac{\partial}{\partial x_{ij}} (X)\, X^{-1} a
      = -a^T X^{-1} e_i e_j^T X^{-1} a
      = -a^T X^{-1} e_i \cdot e_j^T X^{-1} a
      = -e_i^T X^{-T} a \cdot a^T X^{-T} e_j
      = -(X^{-T} a a^T X^{-T})_{ij},    (F.11)

which implies that

    \frac{\partial}{\partial X}\, a^T X^{-1} a = -X^{-T} a a^T X^{-T}.    (F.12)
For symmetric $X$, $\frac{\partial}{\partial x_{ij}}\, a^T X^{-1} a$ is the same as Equation (F.11) if $i = j$. If $i \neq j$,

    \frac{\partial}{\partial x_{ij}}\, a^T X^{-1} a = -a^T X^{-1}\, \frac{\partial}{\partial x_{ij}} (X)\, X^{-1} a
      = -a^T X^{-1} (e_i e_j^T + e_j e_i^T) X^{-1} a
      = -a^T X^{-1} e_i \cdot e_j^T X^{-1} a - a^T X^{-1} e_j \cdot e_i^T X^{-1} a
      = -e_i^T X^{-1} a \cdot a^T X^{-1} e_j - e_j^T X^{-1} a \cdot a^T X^{-1} e_i
      = -e_i^T X^{-1} a a^T X^{-1} e_j - e_i^T X^{-1} a a^T X^{-1} e_j
      = -2 e_i^T X^{-1} a a^T X^{-1} e_j
      = -2 (X^{-T} a a^T X^{-T})_{ij},    (F.13)

where in lines four and five we take advantage of the fact that $X$ (and therefore $X^{-1}$) is symmetric. The full derivative for symmetric $X$ is then

    \frac{\partial}{\partial X}\, a^T X^{-1} a = -2X^{-T} a a^T X^{-T} + \mathrm{diag}(X^{-T} a a^T X^{-T}).    (F.14)
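Equation (F.12) for nonsymmetric $X$ is straightforward to confirm numerically; the matrix and vector below are arbitrary, with a diagonally dominant $X$ so that the inverse is well-conditioned:

```python
import numpy as np

a = np.array([1.0, -2.0, 0.5])
X = np.array([[3.0, 0.5, 0.2],
              [0.1, 2.5, 0.4],
              [0.3, 0.2, 3.5]])            # nonsymmetric, well-conditioned

def f(M):
    return a @ np.linalg.solve(M, a)       # a' M^{-1} a

# Analytic gradient, Equation (F.12): -X^{-T} a a^T X^{-T}.
Xi = np.linalg.inv(X)
grad = -Xi.T @ np.outer(a, a) @ Xi.T

# Central finite differences over every entry of X.
h = 1e-6
fd = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3)); E[i, j] = h
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * h)
```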
F.4 Derivation of $\frac{\partial}{\partial X} a^T X^{-1} X^{-T} a$

This derivation is similar to, but more complicated than, the derivation in the previous section. Starting with $\frac{\partial}{\partial X} X^{-1} X^{-T}$, note that

    0 = \frac{\partial}{\partial x_{ij}} I = \frac{\partial}{\partial x_{ij}} \left( X^T X\, X^{-1} X^{-T} \right) = \frac{\partial}{\partial x_{ij}} \left( X^T X \right) X^{-1} X^{-T} + X^T X\, \frac{\partial}{\partial x_{ij}} \left( X^{-1} X^{-T} \right),    (F.15)

which implies

    \frac{\partial}{\partial x_{ij}} \left( X^{-1} X^{-T} \right) = -X^{-1} X^{-T}\, \frac{\partial}{\partial x_{ij}} \left( X^T X \right) X^{-1} X^{-T}.    (F.16)

Therefore,

    \frac{\partial}{\partial x_{ij}} \left( a^T X^{-1} X^{-T} a \right) = a^T\, \frac{\partial}{\partial x_{ij}} \left( X^{-1} X^{-T} \right) a
      = -a^T X^{-1} X^{-T}\, \frac{\partial}{\partial x_{ij}} \left( X^T X \right) X^{-1} X^{-T} a
      = -a^T X^{-1} X^{-T} \left( e_j e_i^T X + X^T e_i e_j^T \right) X^{-1} X^{-T} a
      = -a^T X^{-1} X^{-T} e_j e_i^T X X^{-1} X^{-T} a - a^T X^{-1} X^{-T} X^T e_i e_j^T X^{-1} X^{-T} a
      = -a^T X^{-1} X^{-T} e_j \cdot e_i^T X X^{-1} X^{-T} a - a^T X^{-1} X^{-T} X^T e_i \cdot e_j^T X^{-1} X^{-T} a
      = -e_j^T X^{-1} X^{-T} a \cdot a^T X^{-1} X^{-T} X^T e_i - e_i^T X X^{-1} X^{-T} a \cdot a^T X^{-1} X^{-T} e_j
      = -e_i^T X X^{-1} X^{-T} a a^T X^{-1} X^{-T} e_j - e_i^T X X^{-1} X^{-T} a a^T X^{-1} X^{-T} e_j
      = -2 e_i^T X X^{-1} X^{-T} a a^T X^{-1} X^{-T} e_j
      = -2 \left( X X^{-1} X^{-T} a a^T X^{-1} X^{-T} \right)_{ij}.    (F.17)
The full derivative is then

    \frac{\partial}{\partial X} (a^T X^{-1} X^{-T} a) = -2X X^{-1} X^{-T} a a^T X^{-1} X^{-T}    (F.18)
      = -2X^{-T} a a^T X^{-1} X^{-T}.    (F.19)

In our work, we actually use Equation (F.18) rather than Equation (F.19).
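A quick numerical confirmation of Equations (F.18) and (F.19), which agree since $XX^{-1}X^{-T} = X^{-T}$, again with arbitrary, well-conditioned values:

```python
import numpy as np

a = np.array([1.0, -2.0, 0.5])
X = np.array([[3.0, 0.5, 0.2],
              [0.1, 2.5, 0.4],
              [0.3, 0.2, 3.5]])            # nonsymmetric, well-conditioned

def f(M):
    Mi = np.linalg.inv(M)
    return a @ Mi @ Mi.T @ a               # a' X^{-1} X^{-T} a

Xi = np.linalg.inv(X)
S = Xi @ Xi.T                              # X^{-1} X^{-T}
grad_F18 = -2 * X @ S @ np.outer(a, a) @ S   # Equation (F.18)
grad_F19 = -2 * Xi.T @ np.outer(a, a) @ S    # Equation (F.19)

# Central finite differences over every entry of X.
h = 1e-6
fd = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3)); E[i, j] = h
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * h)
```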
REFERENCES
[1] A. Turing, “Computing machinery and intelligence,” Mind, vol. 59, pp. 433–460, 1950.
[2] G. Stojanov, “Petitage: A case study in developmental robotics,” in Proc. 1st Int. Workshopon Epigenetic Robotics, Lund, Sweden, 2001.
[3] R. A. Brooks, “Achieving artificial intelligence through building robots,” Massachusettes In-stitute of Technology, Artificial Intelligence Laboratory, Tech. Rep. 899, 1986.
[4] R. A. Brooks and L. A. Stein, “Building brains for bodies,” Massachusettes Institute of Tech-nology, Artificial Intelligence Laboratory, Tech. Rep. 1439, 1993.
[5] S. E. Levinson, “The role of sensorimotor function, associative memory and reinforcementlearning in automatic acquisition of spoken language by an autonomous robot,” in Proc. NSFDarpa Workshop on Development and Learning, Michigan State University, Apr. 2000.
[6] J. Krichmar and G. Edelman, “Machine psychology: Autonomous behavior, perceptual cat-egorization and conditioning in a brain-based device,” Cerebral Cortex, vol. 12, pp. 818–830,2002.
[7] Y. Zhang and J. Weng, “Grounded auditory development by a developmental robot,” in Proc.INNS/IEEE Int. Joint Conf. Neural Networks, Washington DC, July 2001, pp. 1059–1064.
[8] J. D. Han, S. W. Zeng, K. Y. Tham, M. Badgero, and J. Weng, “Dav: A humanoid robotplatform for autonomous mental development,” in Proc. 2nd Int. Conf. on Development andLearning, Cambridge, MA, June 2002.
[9] M. Lungarella and G. Metta, “Beyond gazing, pointing, and reaching: A survey of develop-mental robotics,” in Proc. 3rd Int. Workshop on Epigenetic Robotics, 2003.
[10] R. Brooks, C. Breazeal, M. Marjanovic, B. Scassellati, and M. Williamson, “The Cog project:Building a humanoid robot,” in Computation for Metaphors, Analogy and Agents, C. Nehaniv,Ed. Berlin: Springer-Verlag, 1998, pp. 52–87.
[11] P. Varshavskaya, “Behavior-based early language development on a humanoid robot,” in Proc.2nd Int. Workshop on Epigenetic Robotics, 2002.
[12] C. Breazeal and B. Scassellati, “How to build robots that make friends and influence people,”in Proc. Int. Conf. on Intell. Robots and Systems, Kyongju, Korea, 1999.
[13] S. Schaal, “Is imitation learning the route to humanoid robots?” Trends in Cognitive Sciences,vol. 3, no. 6, pp. 233–242, 1999.
147
[14] H. Kozima and H. Yano, “A robot that learns to communicate with human caregivers,” inProc. 1st Int. Workshop on Epigenetic Robotics, Lund, Sweden, 2001.
[15] J. Weng, “A theory for mentally developing robots,” in Proc. 2nd Int. Conf. on Developmentand Learning, Cambridge, MA, June 2002, pp. 131–140.
[16] J. Weng, Y. Zhang, and Y. Chen, “Developing early senses about the world: ‘Object perma-nence’ and visuoauditory real-time learning,” in Proc. INNS/IEEE Int. Joint Conf. NeuralNetworks, vol. 4, July 2003, pp. 2710–2715.
[17] N. Almassy, G. M. Edelman, and O. Sporns, “Behavioral constraints in the development ofneuronal properties: a cortical model embedded in a real world device,” Cerebral Cortex,vol. 8, pp. 346–361, 1998.
[18] O. Sporns and W. H. Alexander, “Neuromodulation in a learning robot: Interactions betweenneural plasticity and behavior,” in Proc. INNS/IEEE Int. Joint Conf. Neural Networks, vol. 4,July 2003, pp. 2789–2794.
[19] A. K. Seth, J. L. McKinstry, G. M. Edelman, and J. L. Krichmar,“Visual binding, reentry andneuronal synchrony in a physically situated brain-based device,” in Proc. 3rd Int. Workshopon Epigenetic Robotics, 2003.
[20] K. Fischer and R. Moratz, “From communicative strategies to cognitive modelling,” in Proc.1st Int. Workshop on Epigenetic Robotics, Lund, Sweden, 2001.
[21] L. Hugues and A. Drogoul, “Shaping of robot behaviors by demonstration,” in Proc. 1st Int.Workshop on Epigenetic Robotics, Lund, Sweden, 2001.
[22] P. R. Cohen, C. Sutton, and B. Burns, “Learning effects of robot actions using temporalassociations,” in Proc. 2nd Int. Conf. on Development and Learning, Cambridge, MA, June2002, pp. 96–101.
[23] I. Fasel, G. O. Deak, J. Triesch, and J. Movellan, “Combining embodied models and empiricalresearch for understanding the development of shared attention,” in Proc. 2nd Int. Conf. onDevelopment and Learning, Cambridge, MA, June 2002, pp. 21–27.
[24] R. A. Grupen, “A developmental organization for robot behavior,” in Proc. 3rd Int. Workshopon Epigenetic Robotics, 2003.
[25] Arrick Robotics, http://www.robotics.com/.
[26] B. Gold and N. Morgan, Speech and Audio Signal Processing. New York: Wiley, 2000.
[27] M. J. Tovee, An Introduction to the Visual System. Cambridge: Cambridge University Press,1996.
[28] W. Zhu and S. E. Levinson, “Edge orientation-based multiview object recognition,” in Proc.IEEE Int’l Conf. on Pattern Recognition, vol. 1, Barcelona, Spain, 2000, pp. 936–939.
[29] W. Zhu, S. Wang, R. S. Lin, and S. E. Levinson, “Tracking of object with SVM regression,”in Proc. IEEE Int. Conf. on Comput. Vision & Pattern Recognition, vol. 2, Hawaii, 2001, pp.240–245.
148
[30] R. S. Lin, “Learning vision-based robot navigation,” M.S. thesis, University of Illinois atUrbana-Champaign, 2004.
[31] D. Li and S. E. Levinson, “A robust linear phase unwrapping method for dual-channel soundsource localization,” in Int. Conf. on Robot. Automat., Washington D.C., May 2002.
[32] D. Li and S. E. Levinson, “A Bayes-rule based hierarchical system for binaural sound sourcelocalization,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, Hong Kong, Apr.2003.
[33] M. Hakozaki, H. Oasa, and H. Shinoda, “Telemetric robot skin,” in Proc. IEEE Int. Conf. onRobot. Automat., Detroit, Michigan, May 1999.
[34] A. Loutfi, S. Coradeschi, T. Duckett, and P. Wide, “Odor source identification by groundinglinguistic descriptions in an artificial nose,” in Proc. SPIE Conf. on Sensor Fusion: Architec-tures, Algorithms and Applications V, vol. 4385, Orlando, Florida, 2001, pp. 273–282.
[35] S. Savoy et al., “Solution-based analysis of multiple analytes by a sensor array: Towardthe development of an electronic tongue,” in SPIE Conf. on Chemical Microsensors andApplications, vol. 3539, Boston, MA, Nov. 1998.
[36] F. W. Edridge-Green, Memory and Its Cultivation. New York: D. Appleton and Co., 1900.
[37] M. H. Ashcraft, Human Memory and Cognition. New York: Harper Collins, 1989.
[38] A. Baddeley, “Memory,” in MIT Encyclopedia of Cognitive Science, R. A. Wilson and F. Keil,Eds. Cambridge, MA: The MIT Press, 1999.
[39] D. L. Schacter and E. Tulving, Eds., Memory Systems 1994. Cambridge, MA: The MITPress, 1994.
[40] D. R. Shanks, The Psychology of Associative Learning. Cambridge: Cambridge UniversityPress, 1995.
[41] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” J.Artif. Intell. Research, vol. 4, pp. 237–285, 1996.
[42] M. Wines, “For sniffing out land mines, a platoon of twitching noses,” The New York Times,p. A1, May 18, 2004.
[43] A. Gopnik, A. N. Meltzoff, and P. K. Kuhl, The Scientist in the Crib. New York: HarperCollins, 1999.
[44] C. Garvey, Play. Cambridge, MA: Harvard University Press, 1990.
[45] J. Kaminski, J. Call, and J. Fischer, “Word learning in a domestic dog: Evidence for ‘fastmapping’,” Science, vol. 304, pp. 1682–1683, June 2004.
[46] B. Breidegard and C. Balkenius, “Speech development by imitation,” in Proc. 3rd Int. Work-shop on Epigenetic Robotics, 2003.
[47] M. Cabido-Lopes and J. Santos-Victor, “Visual transformations in gesture imitation: Whatyou see is what you do,” in Proc. Int. Conf. Robot. Automat., 2003, pp. 2375–2381.
149
[48] M. Kleffner, “A method of automatic speech imitation via warped linear prediction,” M.S. thesis, University of Illinois at Urbana-Champaign, 2003.
[49] W. Zhu and S. E. Levinson, “PQ-learning: An efficient robot learning method for intelligent behavior acquisition,” in Proc. 7th Int. Conf. on Intell. Autonomous Systems, vol. 1, Marina del Rey, CA, Mar. 2002, pp. 404–411.
[50] M. McClain, “The role of exploration in language acquisition for an autonomous robot,” M.S. thesis, University of Illinois at Urbana-Champaign, 2003.
[51] Q. Liu, “Interactive and incremental learning via a multisensory mobile robot,” Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2001.
[52] W. Zhu and S. E. Levinson, “JPDF-based visual concept learning by an autonomous agent,” in Proc. Int. Conf. Vision Interface, 2003.
[53] S. Carey and E. Bartlett, “Acquiring a single new word,” Papers and Reports on Child Language Development, vol. 15, pp. 17–29, 1978.
[54] L. Markson and P. Bloom, “Evidence against a dedicated system for word learning in children,” Nature, vol. 385, pp. 813–815, Feb. 1997.
[55] K. Yip and G. J. Sussman, “Sparse representations for fast, one-shot learning,” in Proc. Nat. Conf. Artif. Intell., 1997.
[56] J. C. Nieh, “Stingless-bee communication,” American Scientist, vol. 87, no. 5, pp. 428–435, Sept. 1999.
[57] S. Laurence and E. Margolis, “Concepts and cognitive science,” in Concepts: Core Readings, S. Laurence and E. Margolis, Eds. Cambridge, MA: The MIT Press, 1999, pp. 3–81.
[58] R. L. Solso, Cognitive Psychology, 4th ed. Boston: Allyn and Bacon, 1995.
[59] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks. Reading, MA: Addison-Wesley, 1989.
[60] J. M. Zurada, Introduction to Artificial Neural Systems. St. Paul, MN: West Publishing Company, 1992.
[61] F. V. Jensen, Bayesian Networks and Decision Graphs. New York: Springer-Verlag, 2001.
[62] H. Pan, Z.-P. Liang, and T. Huang, “Fusing audio and visual features of speech,” in Proc. 2000 Int. Conf. Image Processing, vol. 3, 2000, pp. 214–217.
[63] H. Pan, “A Bayesian fusion approach and its application to integrating audio and visual signals in HCI,” Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2001.
[64] M. Brand, “Coupled hidden Markov models for modeling interacting processes,” MIT Media Lab, Tech. Rep. 405, 1997.
[65] S. M. Chu, “Multimodal fusion with applications to audio-visual speech recognition,” Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2003.
[66] V. Krishnamurthy and G. G. Yin, “Recursive algorithms for estimation of hidden Markov models and autoregressive models with Markov regime,” IEEE Trans. Inform. Theory, vol. 48, no. 2, pp. 458–476, Feb. 2002.
[67] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall PTR, 1993.
[68] P. Boufounos, S. El-Difrawy, and D. Ehrlich, “Hidden Markov models for DNA sequencing,” in Workshop on Genomic Signal Processing and Statistics (GENSIPS 2002), Oct. 2002.
[69] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, “Hidden Markov models in computational biology: Applications to protein modeling,” J. of Molecular Biology, vol. 235, pp. 1501–1531, 1994.
[70] D. Kulp, D. Haussler, M. G. Reese, and F. H. Eeckman, “A generalized hidden Markov model for the recognition of human genes in DNA,” in Proc. 4th Int. Conf. Intell. Syst. Molecular Bio., 1996, pp. 134–142.
[71] R. L. Cave and L. P. Neuwirth, “Hidden Markov models for English,” in Proc. of the Symposium on the Applications of Hidden Markov Models to Text and Speech. Princeton, NJ: IDA-CRD, Oct. 1980, pp. 16–56.
[72] A. B. Poritz, “Linear predictive hidden Markov models and the speech signal,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, 1982, pp. 1291–1294.
[73] A. Ljolje and S. E. Levinson, “Development of an acoustic-phonetic hidden Markov model for continuous speech recognition,” IEEE Trans. Signal Processing, vol. 39, no. 1, pp. 29–39, 1991.
[74] A. Arapostathis and S. I. Marcus, “Analysis of an identification algorithm arising in the adaptive estimation of Markov chains,” Math Control Signals Systems, vol. 3, no. 1, pp. 1–29, 1990.
[75] I. B. Collings, V. Krishnamurthy, and J. B. Moore, “On-line identification of hidden Markov models via recursive prediction error techniques,” IEEE Trans. Signal Processing, vol. 42, no. 12, pp. 3535–3539, Dec. 1994.
[76] F. LeGland and L. Mevel, “Recursive estimation in hidden Markov models,” in Proc. 36th IEEE Conf. Decision Contr., San Diego, CA, Dec. 1997.
[77] U. Holst and G. Lindgren, “Recursive estimation in mixture models with Markov regime,” IEEE Trans. Inform. Theory, vol. 37, no. 6, pp. 1683–1690, Nov. 1991.
[78] V. Krishnamurthy and J. B. Moore, “On-line estimation of hidden Markov model parameters based on the Kullback-Leibler information measure,” IEEE Trans. Signal Processing, vol. 41, no. 8, pp. 2557–2573, Aug. 1993.
[79] T. Ryden, “On recursive estimation for hidden Markov models,” Stochastic Processes and their Applications, vol. 66, pp. 79–96, 1997.
[80] F. LeGland and L. Mevel, “Recursive identification of HMM’s with observations in a finite set,” in Proc. 34th IEEE Conf. Decision Contr., New Orleans, Dec. 1995, pp. 216–221.
[81] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, “An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition,” The Bell System Technical Journal, vol. 62, no. 4, pp. 1035–1074, Apr. 1983.
[82] H. V. Poor, An Introduction to Signal Detection and Estimation. New York: Springer-Verlag, 1994.
[83] F. LeGland and L. Mevel, “Geometric ergodicity in hidden Markov models,” INRIA, Tech. Rep. RR-2991, Sept. 1996.
[84] S. P. Meyn, Markov Chains and Stochastic Stability. London: Springer-Verlag, 1993.
[85] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. New York: Springer-Verlag, 2003.
[86] V. Krishnamurthy and T. Ryden, “Consistent estimation of linear and non-linear autoregressive models with Markov regime,” J. of Time Series Analysis, vol. 19, no. 3, pp. 291–307, 1998.
[87] N. N. Schraudolph, “Local gain adaptation in stochastic gradient descent,” in Proc. 9th Int. Conf. on Artif. Neural Networks, 1999.
[88] N. N. Schraudolph and T. Graepel, “Combining conjugate direction methods with stochastic approximation of gradients,” in Proc. 9th Int. Workshop Artif. Intell. and Statistics, 2003.
[89] Y. Ephraim and N. Merhav, “Hidden Markov processes,” IEEE Trans. Inform. Theory, vol. 48, no. 6, pp. 1518–1569, June 2002.
[90] E. Gassiat and S. Boucheron, “Optimal error exponents in hidden Markov models order estimation,” IEEE Trans. Inform. Theory, vol. 49, no. 4, pp. 964–980, Apr. 2003.
[91] T. Ryden, “Estimating the order of hidden Markov models,” Statistics, vol. 26, pp. 345–354, 1995.
[92] R. J. MacKay, “Estimating the order of a hidden Markov model,” The Canadian Journal of Statistics, vol. 30, no. 4, pp. 573–589, 2002.
[93] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York: Wiley, 2001.
[94] J. D. Ferguson, “Variable duration models for speech,” in Proc. of the Symposium on the Applications of Hidden Markov Models to Text and Speech, J. D. Ferguson, Ed. Princeton, NJ: IDA-CRD, Oct. 1980, pp. 143–179.
[95] M. J. Russell and R. K. Moore, “Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, vol. 10, Apr. 1985, pp. 5–8.
[96] S. E. Levinson, “Continuously variable duration hidden Markov models for automatic speech recognition,” Computer Speech and Language, vol. 1, pp. 29–45, 1986.
[97] S. Fine, Y. Singer, and N. Tishby, “The hierarchical hidden Markov model: Analysis and applications,” Machine Learning, vol. 32, no. 1, pp. 41–62, 1998.
[98] K. Murphy and M. Paskin, “Linear time inference in hierarchical HMMs,” in Advances in Neural Information Processing Systems, 2002.
[99] Z. Ghahramani and G. E. Hinton, “Factorial hidden Markov models,” Machine Learning, vol. 29, pp. 245–273, 1997.
[100] V. Pavlovic, J. M. Rehg, T.-J. Cham, and K. P. Murphy, “A dynamic Bayesian network approach to figure tracking using learned dynamic models,” in Proc. Int. Conf. on Comput. Vision, 1999, pp. 94–101.
[101] Z. Ghahramani and G. Hinton, “Variational learning for switching state-space models,” Neural Computation, vol. 12, pp. 831–864, 2000.
[102] K. S. Fu, Syntactic Methods in Pattern Recognition. New York: Academic Press, 1974.
[103] E. Charniak, Statistical Language Learning. Cambridge, MA: The MIT Press, 1996.
[104] N. Chomsky, “Three models for the description of language,” IEEE Trans. Inform. Theory, vol. 2, no. 3, pp. 113–124, Nov. 1956.
[105] D. Roy, “Grounded spoken language acquisition: Experiments in word learning,” IEEE Trans. Multimedia, vol. 5, no. 2, June 2003.
[106] L. Steels, “Language games for autonomous robots,” IEEE Intell. Syst., pp. 17–22, Sept./Oct. 2001.
[107] L. Steels and F. Kaplan, “Aibo’s first words: The social learning of language and meaning,” Evolution of Communication, vol. 4, no. 1, pp. 3–32, 2001.
[108] T. Oates, Z. Eyler-Walker, and P. R. Cohen, “Using syntax to learn semantics: An experiment in language acquisition with a mobile robot,” University of Massachusetts Computer Science Department, Tech. Rep. 99-35, 1999.
[109] T. Oates, “Grounding knowledge in sensors: Unsupervised learning for language and planning,” Ph.D. dissertation, University of Massachusetts, Amherst, 2001.
[110] B. Burns, C. Sutton, C. Morrison, and P. Cohen, “Information theory and representation in associative word learning,” in Proc. 3rd Int. Workshop on Epigenetic Robotics, 2003.
[111] C. Crangle and P. Suppes, Language and Learning for Robots. Stanford, CA: Center for the Study of Language and Information, 1994.
[112] K. P. Murphy, Y. Weiss, and M. I. Jordan, “Loopy belief-propagation for approximate inference: An empirical study,” in Proc. 15th Conf. Uncertainty in Artif. Intell., K. B. Laskey and H. Prade, Eds., San Mateo, CA, 1999.
[113] IEEE, “IEEE recommended practice for speech quality measurements,” IEEE Trans. Audio Electroacoust., vol. AU-17, no. 3, pp. 225–246, Sept. 1969.
[114] S. E. Levinson (personal communication), 2004.
[115] P. K. Kuhl, “Early language acquisition: Cracking the speech code,” Nature Reviews Neuroscience, vol. 5, Nov. 2004.
[116] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artif. Intell., vol. 101, no. 1-2, pp. 99–134, May 1998.
[117] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Upper Saddle River, NJ: Prentice Hall, 1978.
[118] H. W. Strube, “Linear prediction on a warped frequency scale,” J. of the Acoustical Society of America, vol. 68, pp. 1071–1076, 1980.
[119] U. K. Laine, M. Karjalainen, and T. Altosaar, “Warped linear prediction (WLP) in speech and audio processing,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, vol. III, 1994, pp. 349–352.
[120] J. O. Smith III and J. S. Abel, “Bark and ERB bilinear transforms,” IEEE Trans. Speech Audio Processing, vol. 7, pp. 697–708, 1999.
[121] A. Harma, “Evaluation of a warped linear predictive coding scheme,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, vol. II, 2000, pp. 897–900.
[122] R. Viswanathan and J. Makhoul, “Quantization properties of transmission parameters in linear predictive systems,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 23, pp. 309–321, 1975.
[123] R.-S. Lin (personal communication), 2004.
[124] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and Bayesian restoration of images,” IEEE Trans. Pattern Anal. Machine Intell., vol. 6, no. 6, pp. 721–741, Nov. 1984.
[125] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient belief propagation for early vision,” in Proc. IEEE Int. Conf. on Comput. Vision & Pattern Recognition, vol. 1, 2004, pp. 261–268.
[126] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” Int. J. of Comput. Vision, vol. 40, no. 1, pp. 25–47, 2000.
[127] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Understanding belief propagation and its generalizations,” Mitsubishi Electric Research Laboratories, Inc., Tech. Rep. TR-2001-22, Jan. 2002.
[128] Y. Weiss and W. T. Freeman, “On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs,” IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 726–744, Feb. 2001.
[129] T. Gevers and A. W. M. Smeulders, “Pictoseek: Combining color and shape invariant features for image retrieval,” IEEE Trans. Image Processing, vol. 9, no. 1, pp. 102–119, Jan. 2000.
[130] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. New York: Wiley, 1987.
[131] Matrix Reference Manual, http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html, June 2004.
[132] A. Graham, Kronecker Products and Matrix Calculus With Applications. Chichester, England: Ellis Horwood Limited, 1981.
VITA
Kevin Michael Squire received his BS in computer engineering from Case Western Reserve University in 1995, his MS in electrical engineering from the University of Illinois at Urbana-Champaign
(UIUC) in 1998, and with this dissertation has completed his PhD in electrical engineering at UIUC
in 2004. He has conducted research on artificial intelligence, stochastic modeling, learning, image
processing, and speech and language processing at UIUC and at the Tokyo Institute of Technology,
Tokyo, Japan.