© 2004 by Kevin Michael Squire. All rights reserved.
HMM-BASED SEMANTIC LEARNING FOR A MOBILE ROBOT
BY
KEVIN MICHAEL SQUIRE
B.S., Case Western Reserve University, 1995
M.S., University of Illinois at Urbana-Champaign, 1998
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2004
Urbana, Illinois
ABSTRACT
We are developing an intelligent robot and attempting to teach it language. While there are many
aspects of this research, for the purposes of this dissertation the most important are the following
ideas. Language is primarily based on semantics, not syntax, yet syntax is the current focus of
speech recognition research. To truly learn meaning, a language engine cannot simply be a computer
program running on a desktop computer analyzing speech. It must be part of a more general,
embodied intelligent system, one capable of using associative learning to form concepts from the
perception of experiences in the world, and further capable of manipulating those concepts symboli-
cally. This dissertation explores the use of hidden Markov models (HMMs) in this capacity. HMMs
are capable of automatically learning and extracting the underlying structure of continuous-valued
inputs and representing that structure in the states of the model. These states can then be treated
as symbolic representations of the inputs. We show how a model consisting of a cascade of HMMs
can be embedded in a small mobile robot and used to learn correlations among sensory inputs to
create symbolic concepts, which can eventually be manipulated linguistically and used for decision
making.
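The abstract's central idea, that an HMM can map continuous-valued input onto discrete states which then serve as symbols, can be illustrated with a toy sketch. This is not the dissertation's cascade model: the two-state Gaussian HMM below, its parameters, and the use of Viterbi decoding are all made up for illustration only.

```python
import numpy as np

# A toy 2-state HMM with one-dimensional Gaussian observation densities.
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])      # transition probabilities a_ij
pi = np.array([0.5, 0.5])       # initial state probabilities
mu = np.array([0.0, 5.0])       # per-state observation means
sigma = np.array([1.0, 1.0])    # per-state standard deviations

def log_b(y):
    """Log observation densities b_j(y) for each state j."""
    return -0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def viterbi(ys):
    """Most likely state sequence for the observation sequence ys."""
    delta = np.log(pi) + log_b(ys[0])
    back = []
    for y in ys[1:]:
        scores = delta[:, None] + np.log(A)   # scores[i, j]: arrive in j from i
        back.append(scores.argmax(axis=0))    # best predecessor of each state j
        delta = scores.max(axis=0) + log_b(y)
    states = [int(delta.argmax())]
    for bp in reversed(back):                 # trace best path backward
        states.append(int(bp[states[-1]]))
    return states[::-1]

# Continuous observations become a discrete symbol sequence (the state labels).
obs = np.array([0.1, -0.3, 0.2, 5.2, 4.8, 5.1, 0.0])
symbols = viterbi(obs)
print(symbols)  # -> [0, 0, 0, 1, 1, 1, 0]
```

The decoded state sequence is the "symbolic representation" of the continuous input; the cascade model in the dissertation feeds such state estimates from lower-level HMMs upward as observations of a higher-level HMM.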
To my parents.
ACKNOWLEDGMENTS
First and foremost, I would like to thank my adviser, Dr. Stephen Levinson, for providing an extremely
ambitious and stimulating project. Steve has an amazingly broad perspective on our research, and
my own views and understanding have noticeably broadened under his tutelage. He has also had
seemingly unfailing belief in me and my work, even during times of difficulty, which I greatly
appreciate. I have gained a profound respect for him and his ideas and opinions, and I am deeply
grateful for having had the opportunity to work under him.
I would like to thank my committee members, Dr. Seth Hutchinson, Dr. Thomas Huang,
Dr. Mark Hasegawa-Johnson, and Dr. Patrick Xavier, for their questions, suggestions, and sup-
port during my research. In particular, Seth expressed interest early on in participating in my
research process, and has asked some of the deepest and most interesting questions regarding the
research; Tom has pushed me to search for ways to more broadly apply my research and the re-
search of the project; Mark has been very interested in and supportive of some of the more technical
aspects of my work; and Patrick has, from a distance, offered frequent advice and taken the time to
fly in for my defense. For all of these interactions, I am very appreciative.
For the month before my defense, Ruei-Sung Lin and Matthew McClain were amazingly sup-
portive of the technical aspects of this project, pulling very long nights with me and writing and
changing code to fit my specifications. Without their help, a final demonstration of my work would
not have been possible, and I thank them deeply.
I would like to thank Matthew Kleffner, Dr. Danfeng Li, Dr. Weiyu Zhu, and Dr. Qiong Liu,
whose technical contributions have helped form the foundation of our project, upon which my work
is built. I would additionally like to thank Matt for our many stimulating discussions.
Throughout my graduate studies, Dr. Rajiv Maheswaran and Dr. Sarunya “Noke” Hemjinda
have both listened intently when I have needed to talk, whether about technical aspects of my
research or about real or mundane issues of life. Thanks to both for being really amazing friends.
I would like to thank the other members of the Beckman Institute Robotics Laboratory, for their
warm welcome and aid to our group when we joined their lab earlier this year. I would especially
like to thank James Davidson and Dr. Fred Rothganger for some very stimulating conversations
and for enthusiastic support of our project.
My participation in the artificial neural networks and computational brain theory (ANNCBT)
seminar has been one of the most interesting and intellectually stimulating experiences of my PhD,
and has been a strong guide for my research. I thank the members of that group for some very
interesting discussions, especially Samarth Swarup and Dr. Thomas Anastasio.
I would like to thank Dr. Donna Brown for her strong support and help while I was working
on my master’s degree. Without her support and encouragement, I would not have gone on for my
PhD.
While at the Beckman Institute, Mike Smith has been amazingly helpful to me and our group,
especially in helping to organize our lab and offices, and with setting up open house demonstrations.
I thank him for all he has done over the years.
I am deeply indebted to Dominic Frigon, Hala Jawlakh, Dr. Saptarshi Bandyopadhyay, Kwanrawee
“Joy” Sirikanchana, Dr. Consuelo Waight, Ankur Garg, and Sarah Miller, for their enthusiastic
support, for interesting and helpful discussions about my work and about life, and for their close
friendship.
For helping keep me healthy and nourished, I would like to thank the members of the Friday
dinner gang—Anand Selvaraj, Chetan Pahlajani, Zaki Mohammed, Shivi Bansal, Carrie Owen,
Natasha Kipp, Hala and Dom, Siddhartha Raja, Deepti Samant, and Apurva Chitnis.
For keeping me sane, I would like to thank the past and present salseros and salseras of Urbana-
Champaign for giving me the chance to work off some frustration and energy, especially Rajiv,
Consuelo, Sarah, Joy, Ruben Aveledo, Julie Baterna, Lyre Murao, and the Regent Ballroom.
Last but not least, I would like to thank my father, Craig Squire, for his love and support, and
especially for slogging through early drafts of this dissertation, some of which probably seemed
quite foreign and unintelligible, and I would like to thank my mother for her constant prayers and
love, without which this process would have been much, much harder.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF SYMBOLS

CHAPTER 1 INTRODUCTION
1.1 Background and Motivation
1.2 Developmental Robotics
1.3 A Robotic System for Language Acquisition
1.3.1 Somatic system
1.3.2 Noetic system
1.4 Semantic Learning
1.4.1 Introduction
1.4.2 General associative memory model for semantic learning
1.5 Contributions and Layout of Dissertation

CHAPTER 2 HIDDEN MARKOV MODELS AND THE RMLE ALGORITHM
2.1 Introduction
2.2 Model Description and Notation
2.3 Recursive Maximum-Likelihood Estimation of HMM Parameters
2.3.1 RMLE derivation
2.3.2 Convergence
2.3.3 Model averaging and tracking
2.3.4 Numerical simulations
2.3.5 Estimating a model with unknown model order
2.4 HMMs as Bayesian Classifiers
2.5 Discussion

CHAPTER 3 CASCADE OF HMMS: THEORY AND SIMULATION
3.1 Introduction
3.2 HMMs for Learning Structure
3.2.1 Unimodal structure
3.2.2 Multimodal structure
3.3 Cascade of HMMs
3.3.1 Model description
3.3.2 Recursive maximum-likelihood estimation for the cascade model
3.3.3 Numerical simulations
3.4 Discussion

CHAPTER 4 CASCADE OF HMMS AS AN ASSOCIATIVE MEMORY
4.1 Introduction
4.2 Associative Learning of Language Using Robots
4.3 Concept Learning Scenario
4.3.1 Model description
4.3.2 Model scenario
4.3.3 Simulation results
4.4 Robotic Experiments
4.4.1 Finite state machine controller
4.4.2 Sensory inputs
4.4.3 HMM cascade model setup
4.4.4 Issues
4.4.5 Results
4.4.6 Discussion
4.5 Conclusion

CHAPTER 5 CONCLUSION
5.1 Summary
5.2 Insights and Future Directions
5.2.1 Derivation of recursive maximum-likelihood estimation algorithms
5.2.2 Generative modeling
5.2.3 A language-learning robot
5.3 Final Words

APPENDIX A HARDWARE AND SYSTEM-LEVEL SOFTWARE SPECIFICATIONS
A.1 Introduction
A.2 Robots
A.2.1 Specifications
A.2.2 Configuration
A.3 Computers

APPENDIX B MOBILE ROBOT SOFTWARE
B.1 Introduction
B.2 Distributed Computing and Communication System
B.2.1 System design
B.2.2 Implementation
B.2.3 Discussion and future work
B.3 Speech Feature Extraction
B.3.1 Introduction
B.3.2 Background
B.3.3 Design and implementation
B.4 Visual Object Segmentation and Feature Extraction
B.4.1 Problem description
B.4.2 Pairwise Markov random fields
B.4.3 Local message passing algorithm
B.4.4 Image segmentation

APPENDIX C HIDDEN MARKOV MODEL ALGORITHMS
C.1 Introduction
C.2 Baum-Welch Algorithm
C.3 Viterbi-Based Algorithms

APPENDIX D HIDDEN SEMI-MARKOV MODELS AND THE RMLE ALGORITHM
D.1 Introduction
D.2 HSMM Model Description and Notation
D.3 RMLE for the HSMM

APPENDIX E RMLE DERIVATIONS
E.1 Proof that pn(y1, . . . , yn;ϕ) = ∏_{k=1}^{n} b(yk;ϕ)′uk(ϕ)
E.2 Proof that pn′(y1, . . . , yn′, τ1, . . . , τn′;ϕ) = ∏_{k=1}^{n′} d(τk)′B(yk|τk)uk
E.3 Specialized RMLE Formulas
E.3.1 Transition probabilities
E.3.2 Discrete observation probabilities
E.3.3 Gaussian observation likelihoods

APPENDIX F MATRIX CALCULUS
F.1 Introduction
F.2 Preliminaries
F.3 Derivation of (∂/∂X) aᵀX⁻¹a
F.4 Derivation of (∂/∂X) aᵀX⁻¹X⁻ᵀa

REFERENCES

VITA
LIST OF TABLES
Table
2.1 Simulation results for various combinations of learning rate ε and averaging history k.

3.1 Average classification accuracy for learned HMM ϕu over 50 simulation runs.

4.1 Average classification accuracy for learned HMM ϕc over 50 simulation runs.
4.2 List of words used in our robot demonstration.
4.3 Harvard phonetically balanced sentences.
4.4 Initial observation probabilities used by the concept HMM for visible objects.
4.5 Initial observation probabilities used by the concept HMM for words.
4.6 Trained transition probabilities for the concept HMM.
4.7 Trained observation probabilities used by the concept HMM for visible objects.
4.8 Trained observation probabilities used by the concept HMM for words.

A.1 Computing hardware mounted on robots.
A.2 Computer workstations.
LIST OF FIGURES
Figure
1.1 Cognitive Cycle.
1.2 Our robot Illy.
1.3 Expanded view of the cognitive cycle.
1.4 The concept of apple.
1.5 Visual/auditory concept hierarchy.
1.6 Associative learning of the word “apple.”

2.1 The effect of learning rate ε on parameter convergence during RMLE training, for constant ε.
2.2 The effect of ε0 and γ on parameter convergence during RMLE training, with an exponentially decreasing εn.
2.3 The effect of history size k on parameter averaging during RMLE training.
2.4 Examples of learning in HMMs with finite-alphabet observation densities.
2.5 Initialization of an HMM with two-dimensional Gaussian observation densities.
2.6 Learning an HMM using a model with a large number of states.

3.1 Semantic memory implemented using HMMs.
3.2 An HMM cascade model.
3.3 A dynamic Bayesian network (DBN) model showing the dependence among output and state variables assumed by our cascade HMM.
3.4 A switching HMM.
3.5 A cascaded switching HMM.
3.6 Monte Carlo simulation for learning a cascaded switching HMM ϕ using a cascade HMM ϕ̂.
3.7 Parameter learning for model ϕu.
3.8 Parameter learning for model ϕl1.
3.9 Training run output for model ϕl2.
3.10 State sequence comparison between generative HMM ϕu and learned HMM ϕ̂u.

4.1 Concept learning scenario using a cascade of HMMs.
4.2 Model topology for robot concept learning.
4.3 Parameter learning for model ϕc.
4.4 Training run output for model ϕv.
4.5 Parameter learning for model ϕa.
4.6 Objects used in our robot demonstration.
4.7 The robot’s finite state machine controller.
4.8 Auditory model used for speech recognition in our robot.
4.9 Parameter estimation for phonetic HMM ϕaud.
4.10 Equalized quantization.
4.11 Parameter learning for word model ϕword.
4.12 Parameter learning for model ϕvis.
4.13 Recognition of visual representations and concepts.
4.14 Recognition of auditory representations and concepts.
4.15 Recognition and learning using both auditory and visual information.
4.16 Illy learning about various objects.
4.17 Parameter learning for model ϕcon.

B.1 Audio ring buffer.
B.2 Audio ring buffer on multiple machines.
B.3 Block diagram describing audio feature generation.
LIST OF ABBREVIATIONS
AI artificial intelligence
CDF cumulative distribution function
CELL cross-channel early lexical learning
CHMM coupled hidden Markov model
DBN dynamic Bayesian network
EM expectation maximization
FSM finite state machine
GOFAI good old-fashioned artificial intelligence
HHMM hierarchical hidden Markov model
HMM hidden Markov model
HSMM hidden semi-Markov model
iid independent and identically distributed
JPDF joint probability density function
LAR log-area ratio
LP linear prediction
LPC linear prediction coefficient
MFCC mel-frequency cepstral coefficient
MLE maximum-likelihood estimation
ODE ordinary differential equation
RC reflection coefficient
RCLSE recursive conditioned least-squares estimation
RMLE recursive maximum-likelihood estimation
pdf probability density function
VCS voicing confidence score
VDHMM variable duration hidden Markov model
WLP warped linear prediction
WLPC warped linear prediction coefficient
wrt with respect to
LIST OF SYMBOLS
CHAPTER 2
Xn, Yn discrete-time stochastic process defining a hidden Markov model (HMM)
(Ω,F , P ) probability space
Xn discrete-time first-order Markov chain
Yn observable stochastic process corresponding to Xn
xn the particular state value of Xn
yn the particular observation of Yn
r number of states in a Markov chain
R state space for Markov chain Xn; R = {1, . . . , r}
πi probability of an HMM starting in state i, i ∈ R
π length-r vector of initial probabilities; π = {πi}i∈R
Π set of all length-r stochastic vectors
aij P (Xn = j|Xn−1 = i); probability of transitioning from state i to state j in an HMM
A r × r transition probability matrix for an HMM; A = {aij}i,j∈R
A set of all r × r stochastic matrices
E space upon which each Yn takes values
b(·; θj), bj(·) observation density for state j of an HMM
θj parameters of a density function describing the observations of state j
Θ set of valid parameters for a family of observation densities
µj, σj mean and standard deviation parameters for state j of a single-dimensional Gaussian distribution
s number of observations per state of an HMM with observations in a finite alphabet
xvi
V set of symbols in an HMM with observations from a finite alphabet
vk observation symbol k of an HMM with observations from a finite alphabet
bjk probability of observing symbol vk in state j of an HMM with observations from a finite alphabet
g(·, θ) real-valued function on R indexed by θ; output is produced according to a probability distribution with θ as a parameter
en sequence of independent and identically distributed (iid) random variables
Φ HMM parameter space; Φ = Π ×A× Θ or Φ = A× Θ
ϕ vector of model parameters for an HMM; ϕ ∈ Φ (e.g., ϕ = {a11, a12, . . . , arr, θ1, . . . , θr})
ϕ̂ estimate of model parameters for an HMM
ϕ∗ true model parameters for an HMM
p length of vector ϕ
ϕl the lth parameter of parameter vector ϕ; 1 ≤ l ≤ p
π(ϕ) initial probability vector for HMM ϕ
A(ϕ) transition probability matrix for HMM ϕ
aij(ϕ) i,jth element of A(ϕ)
θj(ϕ) observation density parameter(s) for state j of HMM ϕ
bj(·;ϕ) observation density of state j for HMM ϕ; equivalent to b(·; θj(ϕ))
bjk(ϕ) probability of observing symbol vk in state j of finite-alphabet HMM ϕ
µj(ϕ) observation mean of a single-dimensional Gaussian distribution for state j of HMM ϕ
σj(ϕ) observation standard deviation of a single-dimensional Gaussian distribution for state j of HMM ϕ
b(yn;ϕ) length-r column vector of observation density values for HMM ϕ; b(yn;ϕ) =[b1(yn;ϕ), . . . , br(yn;ϕ)]′
B(yn;ϕ) r×r diagonal matrix of observation pdf values for HMM ϕ; B(yn;ϕ) = diag[b1(yn;ϕ), . . . , br(yn;ϕ)]
〈y1, . . . , yn〉 a length-n sequence of observations
pn(y1, . . . , yn;ϕ) n-dimensional likelihood of observation sequence 〈y1, . . . , yn〉 for HMM ϕ
1ℓ length-ℓ column vector of all ones
uni(ϕ) probability of state i at time n given all previous observations; uni(ϕ) = P (Xn =i|y1, . . . , yn−1)
xvii
un(ϕ) length-r column vector of prior state probabilities for HMM ϕ at time n; un(ϕ) =[un1(ϕ), . . . , unr(ϕ)]′
w(l)n(ϕ) length-r column vector of the derivative of un(ϕ) with respect to parameter l of ϕ; 1 ≤ l ≤ p
wn(ϕ) r × p matrix of derivatives of un(ϕ) with respect to all model parameters
R1(yn;ϕ) part of the calculation of w(l)n+1(ϕ)
R(l)2(yn;ϕ) part of the calculation of w(l)n+1(ϕ)
ℓn(ϕ) log-likelihood of 〈y1, . . . , yn〉 for HMM ϕ; ℓn(ϕ) = (1/(n + 1)) log pn(y1, . . . , yn;ϕ)
Yn a collection of parameters; Yn = (Yn,un(ϕ),wn(ϕ)).
S(l)(Y ;ϕ) the derivative of the last update to the likelihood function with respect to ϕl
S(Yn;ϕ) length-p “incremental score vector”; the collected derivatives of the likelihoodfunction with respect to each parameter; S(Yn;ϕ) = [S(1)(Yn;ϕ), . . . , S(p)(Yn;ϕ)]′
(∂/∂ϕl)h partial derivative of function h(·) with respect to ϕl
εn learning rate parameter; εn → 0; ∑n εn = ∞
ΠG Projection onto set G
G compact and convex set; subset of parameter space Φ; G ⊆ Φ
µj(ϕ) observation mean of a multi-dimensional Gaussian distribution for state j of HMM ϕ
Σj(ϕ) observation covariance matrix of a multi-dimensional Gaussian distribution for state j of HMM ϕ
Rj(ϕ) the upper triangular matrix of the Cholesky decomposition of Σj(ϕ); Σj(ϕ) =Rj(ϕ)′Rj(ϕ)
Pϕ∗ probability measure for ϕ∗
K(ϕ) Kullback-Leibler information of ϕ; K(ϕ) = −[ℓ(ϕ) − ℓ(ϕ∗)]
LML set of global minima of K(ϕ)
Mn projection term needed to get ϕn + εnSn(Yn;ϕn) back to constraint set G
ϕ̇ first derivative of ϕ, when described as an ordinary differential equation (ODE)
H(ϕ) (∂/∂ϕ)K(ϕ)
m force term needed to keep ODE ϕ(·) ∈ G
LG the set of limit points of finite difference equation 2.24
Nη(A) an η neighborhood of A
S̄n averaged version of update Sn
ϕ̄n averaged version of parameter set ϕn
k the maximum history size used for averaging S̄n and ϕ̄n
fn(ϕ) length-r vector of posterior probabilities of states at time n for HMM ϕ
fni(ϕ) probability that the state is i after n observations; fni(ϕ) = P (Xn = i|y1, . . . , yn)
CHAPTER 3
ϕl1, ϕl2 the two lower-level HMMs in an HMM cascade model
ϕu the upper-level HMM in an HMM cascade model
ϕ an HMM cascade model; ϕ = {ϕl1, ϕl2, ϕu}
yu,1n, yu,2n the observations of ϕu, corresponding to states in ϕl1 and ϕl2, respectively
xlγ generic term referring to xl1 or xl2, the states of ϕl1 and ϕl2
λ a switching HMM
s number of transition probability matrices in a switching HMM
Am(λ) the set of transition probability matrices in switching HMM λ; m = 1, . . . , s
qn an external signal which chooses the transition probability matrix to use at timen
ϕ a cascaded switching HMM; ϕ = {ϕu, λl1, λl2}
CHAPTER 4
ϕ̂c the robot’s concept HMM
ϕ̂a the robot’s auditory HMM
ϕ̂v the robot’s visual HMM
ϕ̂robot the cascade HMM being learned by the robot (simulation); ϕ̂robot = {ϕ̂c, ϕ̂a, ϕ̂v}
ϕc the boy’s concept HMM
λa the boy’s auditory switching HMM
ϕv the boy’s visual HMM
ϕboy the cascaded switching HMM used by the “boy” (simulation); ϕboy = {ϕc, λa, ϕv}
ϕvis the visual HMM producing real-world outputs (simulation)
yvisn the output of the visual model
xvn estimate of the state of the boy’s visual HMM (ϕv)
xcn estimate of the state of the boy’s concept HMM (ϕc)
ycan generated output of the boy’s concept HMM (ϕc) corresponding to auditory information
xan the boy’s auditory model state
yan, yaudn the boy’s auditory model output
xa, xv estimated state sequences for the robot’s auditory and visual models
xc estimated state sequence for the robot
APPENDIX B
s(n) speech signal
ak linear predictive coefficient
e(n) prediction error
E(n) squared prediction error
ki reflection coefficients (RCs) for a one-dimensional vocal tract tube model
Ai area of one segment of a one-dimensional vocal tract tube model
gi log-area ratios; gi = Ai+1/Ai
c(1)t, c(2)t voicing confidence scores for the first and second half of a speech segment, respectively
ctotal initial voicing confidence score estimate
cf final voicing confidence score estimate
Vthresh threshold for determining strongly unvoiced speech segments
en log-energy of a speech segment
yij an image pixel at location (i, j)
xij lattice point at location (i, j); corresponds to yij
Y random variable representing an entire image; Y = {yij}
X random variable representing an entire lattice, corresponding to a segmentation of image Y; X = {xij}
P (X,Y ) joint probability of X and Y
Z scale factor
ψ(xij , xkl) within-lattice potential function
φ(xij , yij) lattice-image potential function
X∗ optimal segmentation of image Y
x∗ij optimal segmentation label of lattice point xij
n iteration number
mn(ij,kl)(xkl) message passed from xij to xkl at time n
α scaling constant
Γ(i, j) set of neighbors of (i, j)
γ scale factor
APPENDIX C
Xn, Yn discrete-time stochastic process defining a hidden Markov model (HMM)
Xn discrete-time first-order Markov chain
Yn observable stochastic process corresponding to Xn
xn the particular state value of Xn
yn the particular observation of Yn
〈y1, . . . , yn〉 a length-n sequence of observations
〈x1, . . . , xn〉 a length-n sequence of states
pn(y1, . . . , yn;ϕ) n-dimensional likelihood of observation sequence 〈y1, . . . , yn〉 for HMM ϕ
αni(ϕ) forward probability; the joint likelihood of 〈y1, . . . , yn〉 and Xn = i; αni(ϕ) =p(y1, . . . , yn, Xn = i;ϕ)
αn(ϕ) length-r column vector of forward probabilities for HMM ϕ at time n; αn(ϕ) =[αn1(ϕ), . . . , αnr(ϕ)]′
βni(ϕ) backward probability; given Xn = i, the conditional likelihood of 〈yn+1, . . . , yN 〉;βni(ϕ) = p(yn+1, . . . , yN |Xn = i;ϕ)
βn(ϕ) length-r column vector of backward probabilities for HMM ϕ at time n; βn(ϕ) =[βn1(ϕ), . . . , βnr(ϕ)]′
P for Baum-Welch parameter estimation, the n-dimensional likelihood of observa-tion sequence 〈y1, . . . , yn〉; P = pn(y1, . . . , yn;ϕ); for the Viterbi algorithm, the n-dimensional joint likelihood of 〈y1, . . . , yn〉 and 〈x1, . . . , xn〉;P = pn(y1, . . . , yn, x1, . . . , xn;ϕ)
γij the expected number of transitions from state i to state j for a given model andobservation sequence
γi the expected number of transitions out of state i for a given model and observationsequence
āij(ϕ) new estimate of transition probability aij(ϕ) in Baum-Welch or Viterbi reestimation
b̄jk(ϕ) new estimate of observation probability bjk(ϕ) in Baum-Welch or Viterbi reestimation
π̄i(ϕ) new estimate of initial probability πi(ϕ) in Baum-Welch or Viterbi reestimation
µ̄j(ϕ) new estimate of observation mean µj(ϕ) in Baum-Welch or Viterbi reestimation
σ̄j(ϕ) new estimate of observation standard deviation σj(ϕ) in Baum-Welch or Viterbi reestimation
φni the maximum joint likelihood of 〈y1, . . . , yn〉, 〈x1, . . . , xn−1〉, and Xn = i, calculated recursively in the Viterbi algorithm
ψnj the most likely state at time n− 1 leading to state j at time n
s number of observations per state of an HMM with observations in a finite alphabet
V set of symbols in an HMM with observations from a finite alphabet
vk observation symbol k of an HMM with observations from a finite alphabet
bjk probability of observing symbol vk in state j of an HMM with observations from a finite alphabet
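As a concrete illustration of the forward probabilities αni(ϕ) and likelihood P defined above, the following is a minimal Python sketch of the standard forward recursion for a discrete-observation HMM. The model values are invented for illustration and are not taken from the dissertation.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward recursion using the symbols of this appendix.

    Returns alpha, an N x r matrix with alpha[n, i] equal to the joint
    likelihood p(y_1, ..., y_{n+1}, X_{n+1} = i), and the sequence
    likelihood P = sum_i alpha[N-1, i].
    """
    r = len(pi)
    alpha = np.zeros((len(obs), r))
    alpha[0] = pi * B[:, obs[0]]                  # alpha_1i = pi_i b_i(y_1)
    for n in range(1, len(obs)):
        # alpha_ni = [sum_j alpha_{n-1,j} a_ji] b_i(y_n)
        alpha[n] = (alpha[n - 1] @ A) * B[:, obs[n]]
    return alpha, alpha[-1].sum()

# Toy 2-state, 2-symbol model (illustrative values only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # a_ij
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # b_jk
alpha, P = forward(pi, A, B, [0, 1, 0])
```

In practice the recursion is scaled at each step (hence the scale factor γ above) to avoid numerical underflow on long sequences.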
APPENDIX D
Xn′ , Yn′ , Tn′ discrete-time stochastic process defining a hidden semi-Markov model (HSMM)
Xn′ discrete-time first-order Markov chain
Yn′ observable stochastic process corresponding to Xn′
Tn′ sequence of discrete state durations corresponding to Xn′
n′ model time counter
(Ω,F , P ) probability space
xn′ the particular state value of Xn′
yn′ the particular τn′ -length observation of Y n′
τn′ the duration of observation sequence yn′
r number of states in a Markov chain
R state space for Markov chain Xn′ ; R = {1, . . . , r}
πi probability of an HSMM starting in state i, i ∈ R
π length-r vector of initial probabilities; π = {πi}i∈R
Π set of all length-r stochastic vectors
aij P (Xn′ = j|Xn′−1 = i); probability of transitioning from state i to state j in an HSMM
A r × r transition probability matrix for an HSMM; A = {aij}i,j∈R
A set of all r × r stochastic matrices
d(·;λj), dj(·) parametric duration density for state j of an HSMM
λj parameters of a density function describing the durations of state j of an HSMM
Λ set of valid parameters for a family of duration densities
νj, ηj parameters of a gamma density describing the durations of state j of an HSMM
(dj1, . . . , djT ) discrete probability distribution describing the durations of state j of an HSMM
b(·|τ ; θj), bj(·|τ ) conditional observation density for state j of an HSMM
θj parameters of a density function describing the observations of state j
Θ set of valid parameters for a family of observation densities
Φ HSMM parameter space; Φ = Π × A × Λ × Θ or Φ = A × Λ × Θ
ϕ vector of model parameters for an HSMM; ϕ ∈ Φ (e.g., ϕ = [a11, a12, . . . , arr, λ1, . . . , λr, θ1, . . . , θr])
ϕ̂ estimate of model parameters for an HSMM
ϕ∗ true model parameters for an HSMM
p length of vector ϕ
ϕl the lth parameter of parameter vector ϕ; 1 ≤ l ≤ p
π(ϕ) initial probability vector for HSMM ϕ
A(ϕ) transition probability matrix for HSMM ϕ
aij(ϕ) i,jth element of A(ϕ)
λj(ϕ) duration density parameter(s) for state j of HSMM ϕ
θj(ϕ) observation density parameter(s) for state j of HSMM ϕ
b(yn′ |τn′ ;ϕ) length-r column vector of observation density values for HSMM ϕ; b(yn′ |τn′ ;ϕ) = [b1(yn′ |τn′ ;ϕ), . . . , br(yn′ |τn′ ;ϕ)]′
B(yn′ |τn′ ;ϕ) r × r diagonal matrix of observation pdf values for HSMM ϕ; B(yn′ |τn′ ;ϕ) = diag[b1(yn′ |τn′ ;ϕ), . . . , br(yn′ |τn′ ;ϕ)]
d(τn′ ;ϕ) length-r column vector of duration density values for HSMM ϕ; d(τn′ ;ϕ) = [d1(τn′ ;ϕ), . . . , dr(τn′ ;ϕ)]′
D(τn′ ;ϕ) r × r diagonal matrix of duration density values for HSMM ϕ; D(τn′ ;ϕ) = diag[d1(τn′ ;ϕ), . . . , dr(τn′ ;ϕ)]
g(yn′ , τn′ ;ϕ) length-r column vector, product of observation and duration densities for HSMM ϕ; g(yn′ , τn′ ;ϕ) = B(yn′ |τn′ ;ϕ)D(τn′ ;ϕ)1r
G(yn′ , τn′ ;ϕ) r × r diagonal matrix, product of observation and duration densities for HSMM ϕ; G(yn′ , τn′ ;ϕ) = B(yn′ |τn′ ;ϕ)D(τn′ ;ϕ)
n normal time counter
t0τn′(k′) function defining the normal-time beginning of the k′th state for duration sequence τn′
t1τn′(k′) function defining the normal-time end of the k′th state for duration sequence τn′
ξτn′(n) function defining the model time corresponding to normal time n
Xn normal-time state process of an HSMM; Xn = Xξτn′(n)
Yn normal-time observable process of an HSMM; Yn′ = 〈Yt0(n′), . . . , Yt1(n′)〉
〈y1, . . . , yn〉 a length-n sequence of normal-time observations
〈y1, . . . , yn′〉 a length-n′ sequence of model-time observations; y1 = 〈y1, . . . , yt1(1)〉, y2 = 〈yt0(2), . . . , yt1(2)〉, . . . , yn′ = 〈yt0(n′), . . . , yn〉
pn′(y1, . . . , yn′ , τ1, . . . , τn′ ;ϕ) n′-dimensional joint likelihood of observation sequence 〈y1, . . . , yn′〉 and duration sequence 〈τ1, . . . , τn′〉 for HSMM ϕ
1ℓ length-ℓ column vector of all ones
un′j(ϕ) probability of state j at model time n′ given all observations through yn′−1; un′j(ϕ) = P (Xn′ = j|y1, . . . , yn′−1, τ1, . . . , τn′−1, n′)
un′(ϕ) length-r column vector of prior state probabilities for HSMM ϕ at model time n′; un′(ϕ) = [un′1(ϕ), . . . , un′r(ϕ)]′
unj(ϕ) probability of state j at normal time n given all previous observations, and given that we just changed states; unj(ϕ) = P (Xn = j|y1, . . . , yn−1, τ1, . . . , τn′−1, n′, ξ(n− 1) = n′ − 1, ξ(n) = n′)
un(ϕ) length-r column vector of prior state probabilities for HSMM ϕ at normal time n; un(ϕ) = [un1(ϕ), . . . , unr(ϕ)]′
ℓn′(τn′ ;ϕ) normalized log-likelihood of model-time observations 〈y1, . . . , yn′〉 for HSMM ϕ
ℓn(ϕ) log-likelihood of normal-time observations 〈y1, . . . , yn〉 for HSMM ϕ
n′∗n the number of segments which maximizes ℓn(ϕ)
Tn process describing the most likely sequence of durations
τ∗n the length of the last segment of yn′ which maximizes ℓn(ϕ)
w(l)n(ϕ) length-r column vector of derivatives of un(ϕ) with respect to parameter l of HSMM ϕ; 1 ≤ l ≤ p
wn(ϕ) r × p matrix of derivatives of un(ϕ) with respect to all model parameters
R1(yn′ , τ ;ϕ) part of the calculation of w(l)n+1(ϕ)
R(l)2(yn′ , τ ;ϕ) part of the calculation of w(l)n+1(ϕ)
Yn a collection of parameters; Yn = (Yn, Tn, un(ϕ), wn(ϕ)).
S(l)(Y ;ϕ) the derivative of the last update to the likelihood function with respect to ϕl
S(Yn;ϕ) length-p “incremental score vector”; the collected derivatives of the likelihood function with respect to each parameter; S(Yn;ϕ) = [S(1)(Yn;ϕ), . . . , S(p)(Yn;ϕ)]′
(∂/∂ϕl)h partial derivative of function h(·) with respect to ϕl
εn learning rate parameter; εn → 0; ∑n εn = ∞
ΠG projection onto the set G
G compact and convex set; subset of parameter space Φ; G ⊆ Φ
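To make the filter quantities above concrete, the following Python sketch performs one model-time update of the prior state vector un′(ϕ) using the diagonal product matrix G(yn′ , τn′ ;ϕ) = B D. The normalized form u ∝ A′Gu is an assumption based on the standard HMM filter, not necessarily the exact recursion derived in the dissertation, and all numeric values are illustrative.

```python
import numpy as np

def filter_step(u, A, b_vals, d_vals):
    """One prior-state update u_{n'+1} ∝ A' G(y_{n'}, τ_{n'}; ϕ) u_{n'}.

    u       : prior state probabilities u_{n'}(ϕ)
    A       : r x r transition matrix
    b_vals  : b_vals[j] = b_j(y_{n'} | τ_{n'}), observation density values
    d_vals  : d_vals[j] = d_j(τ_{n'}), duration density values
    """
    G = np.diag(b_vals * d_vals)      # G = B D (diagonal)
    unnorm = A.T @ G @ u              # A' G u
    return unnorm / unnorm.sum()      # renormalize to a probability vector

# Illustrative 2-state example.
A = np.array([[0.8, 0.2], [0.3, 0.7]])
u = np.array([0.5, 0.5])
u_next = filter_step(u, A, np.array([0.4, 0.1]), np.array([0.3, 0.6]))
```

The normalization constant discarded here is exactly the per-segment likelihood increment that the log-likelihood ℓn′ accumulates.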
CHAPTER 1
INTRODUCTION
1.1 Background and Motivation
Cognitive development has been studied in various environments—on the playground by the psychologist, under the microscope by the neuroscientist, and in the armchair by the philosopher. Our
study occurs in a robotics lab, where we attempt to embody cognitive models in steel and silicon.
How did we choose this particular habitat? First and foremost, we are scientists and engineers,
which immediately suggests forming theories and building things to test them. The particular
question we are examining is one of the most fascinating questions that has been asked in the last
century: Can machines think?
Alan Turing raised this very question back in 1950. He introduced the idea of a machine
engaging in “pure thought” and communicating with the world via teletypewriter. As an answer to
The Question, he suggested that when the machine’s discourse (via teletype) was indistinguishable
from a human’s, we could say that the machine was thinking. At the end of the paper, he suggests that machines could perhaps first learn to compete with men at some purely intellectual task, such as chess, but he then presents an alternative approach for creating machine intelligence:
It can also be maintained that it is best to provide the machine with the best sense
organs that money can buy, and then teach it to understand and speak English. This
process could follow the normal teaching of a child. Things would be pointed out and
named, etc. [1, p. 76]
Most artificial intelligence research has followed the former proposal. We believe the latter method
holds more promise.
As scientists, we start with a hypothesis. Our hypothesis forms a constructive theory of mind,
and can be summarized as follows. We believe that human intelligence, and hence language, is
primarily semantic. We believe that the mind forms semantic concepts through the association of
events close together in time, or events or cues close together in space, or both. We further believe
that an integrated sensory-motor system is necessary to ground these concepts and allow the mind
to form a semantic representation of reality—there is no such thing as a disembodied mind.
To test our hypothesis, we are developing a robotic platform, complete with basic sensory-
motor and computing capabilities. The sensory-motor components are functionally equivalent to
their human or animal counterparts, and include binaural hearing, stereo vision, tactile sense,
and basic proprioceptive control. On top of these components, our group is implementing various
processing and learning models, with the intention of creating and aiding semantic understanding.
Our goal is to produce a robot that will learn to understand and carry out simple tasks in response
to natural language requests.
At this point in time, we have already developed a robust base system and conducted a number
of experiments on the way to our goal of a language-learning robot. In particular, we have developed
the basic hardware and software framework necessary for our work and have run numerous experiments to study ideas in learning, memory, and behavior. My primary contribution, and the main focus of
this dissertation, is an associative semantic memory based on hidden Markov models (HMMs) and
built as part of the robot’s cognitive system.
In the following introductory sections, we will discuss previous work in the field of developmental
robotics (which is the subfield of robotics to which our work belongs), give an overview of our
project, and then describe the semantic learning ideas used as a basis for the research described in
the bulk of this dissertation. Note that previous related work in stochastic modeling is described
in Section 3.2, and previous work in language grounding and associative language learning appears
in Section 4.2.
1.2 Developmental Robotics
From Turing’s 1950 paper until the mid-1980s, the field of artificial intelligence (AI) was dominated
by research on what Turing referred to as “purely intellectual” tasks. Despite the agreement that the
long-term goal of AI and robotics was to design physical systems exhibiting intelligent behavior, AI
research had until that time focused mostly on isolated topics: representation, search algorithms,
planning, etc. [2]. Turing’s suggestion to provide machines “with the best sense organs that money
can buy” was largely forgotten.
About 20 years ago, some AI researchers were beginning to feel that there were some fundamental problems with this good old-fashioned AI (GOFAI). Rodney Brooks was one of the first people to
articulate this point. In 1986, he argued that the most important aspects of intelligence were being
ignored by the AI community. Specifically, he suggested that much more focus needed to be given to interaction with the environment rather than to representation alone, and that mobility, vision, and survival behavior “provide a necessary basis for the development of intelligence” [3, p. 2].
Since this time, a number of researchers have built on and expanded these basic ideas, and the subdiscipline of developmental, or epigenetic, robotics has emerged. Developmental robotics
focuses on the use of robots to study cognitive development, and draws people from a wide variety
of backgrounds, including developmental psychology, neuroscience, biology, and robotics.
There are many common themes among research in this area. The following ideas were selectively compiled from [3–9]:
1. Biological and cognitive systems consist of a large number of simple, integrated modules—
these systems are not monolithic. Complex behavior can emerge from this integrated system.
2. Biological and cognitive systems develop incrementally, both through evolution and through
learning.
3. Any form of cognition requires embodiment. Any representations that exist in the brain are
fundamentally based in the world and have no meaning outside of this context. An integrated
sensory-motor system is thus necessary for cognition.
4. Higher cognitive development depends on social interaction, including mimicry/imitation and
shared attention.
Most research in developmental robotics incorporates multiple ideas from this list. We highlight a
few projects below.
After some time focusing on insect robots and refining his initial ideas, Brooks’s group used what
they learned to change direction toward studying humanlike intelligence. In the early 1990s, they
built Cog [4, 10], an upper-torso humanoid robot, with the goal of studying issues in embodiment,
integration of multiple sensory and motor systems, and social interaction. A second robot, Kismet
[11, 12], was also built to study social interactions between robots and humans. In general, their
research has been based on ideas from psychology, with an initial focus on creating robust modules
for low-level cognitive functions, then progressing to study the relationship between low-level and
high-level cognitive functions and social interaction. Other projects focusing on social interaction
include [13, 14].
Juyang Weng’s group at Michigan State has been working on two mobile humanoid robot
projects, SAIL and DAV [7, 8, 15, 16]. The focus of these projects is a developmental learning
model based on human cognitive and behavioral development, with a focus on sensory integration and high-level cognitive function. Their research has mainly drawn from ideas and research in
developmental psychology.
At the other end of the spectrum, various researchers [6, 17–19] are concerned with studying
brain activity through the development of machines built on neurobiological principles. Specifically,
their approach is to develop low-level neurological models of the brain, and put them in simple
animal-like robots with the ability to sense and interact with the world. With these experiments,
they study the emergent behavior the models allow the robot to produce, as well as how closely the models’ responses match responses from neurological research.
Various other researchers [2,20–24] study aspects of cognition using robotics; see [9] for a recent
survey. One key aspect of our project is our focus on language learning and interaction as a basis
for higher level learning. We describe our project in the next section.
Figure 1.1: Cognitive Cycle. This figure shows the flow of cognition among the senses, the noetic system, the motor system, and the environment. The noetic system refers physically to the brain and nervous system, which are assumed to be responsible for mental processes.
1.3 A Robotic System for Language Acquisition
As with other researchers in developmental robotics, our group is using robots to study cognition.
The description of our research begins with the cognitive cycle depicted in Figure 1.1. This simple
diagram shows the flow of cognition among three systems (a sensory system, a noetic system,
a motor system) and the environment. The fact that this diagram equally emphasizes these four
components is significant, as we feel that grounding and interaction with the world are requirements
for cognition. We describe the components of the cycle in more detail below, with discussion on
how they relate to human cognition and implementation of functional equivalents.
1.3.1 Somatic system
The somatic system is the “body” component of the mind-body system. It is composed of the
physical components necessary for cognition: the senses, muscular (motor) system, nervous system,
and the brain.
1.3.1.1 The senses
The necessary start of cognition is the gathering of information from the environment through
the sensory inputs. In humans, these inputs include the five senses—tactile (touch), gustatory
(taste), olfactory (smell), auditory (hearing), and visual (sight). We also perceive information
about ourselves, through proprioception (sense of body position and movement) and interoception
(internal sensory perception of such things as hunger and body temperature). From these we draw
all of our experience, and while we can learn and adapt without one or more of them, sensory
perception is a prerequisite to our cognitive abilities.
1.3.1.2 Muscular/motor system
Our senses provide us with information from the environment, but the ability to perceive the
environment is only half of the connection with the world necessary for cognition. Humans and
other animals also have the ability to move around in, interact with, and affect the environment. We
can identify two classes of human movement that we wish to emulate:
1. Full body movement in the environment
2. Actuated and articulated movement of body parts (e.g., movement of arms and head, speech)
To do the most humanlike cognitive studies, we would like to work with a robot which is as
anthropomorphic as possible.
1.3.1.3 Brain and nervous system
The last fundamental components of the somatic system necessary for modeling cognition are the
brain and nervous system. For the study of cognition, we obviously need to emulate the functions
of these as well. We need a way to connect the sensory-motor periphery to the brain, and we, of
course, need to model certain functional aspects of the brain. The functionality of the brain which
we wish to model is described in more detail in Section 1.3.2.
Figure 1.2: Our robot Illy. Illy is one of three Arrick Robotics Trilobots we use for our cognition and language acquisition research. The base unit for the robots was heavily augmented to include stereo cameras and microphones, an on-board computer, and wireless ethernet.
1.3.1.4 Implementation
Sensory perception and motor expression are the essential connections of the mind to the outside
world, and require a body. The body we chose to work with is Arrick Robotics’ Trilobot [25] (see
Figure 1.2). The robot’s anthropomorphic capabilities are rich enough to suit our purposes. In
particular, the robot can move freely on wheels, move its head, and use its arm to manipulate
common objects, allowing relatively complex behaviors. A speaker is available on-board for the
production of sounds and, with additional processing, speech.
For embodied cognition, we desire our robot to have as many of the previously mentioned senses
as possible. For our robot’s eyes and ears, we have added cameras and microphones to the robot
to give it stereo vision and hearing capabilities. The sounds and images that humans receive are
of course processed by our brain, but even before that, the ear and eye do significant processing
on their inputs. It is well known, for example, that the human ear acts as a spectral filter (see
e.g., [26]), and that a large amount of feature extraction occurs in the retina before the signal
even leaves the eye (see e.g., [27]). Since the cameras and microphones mounted on our robots
do not handle this processing, we have implemented, in software, some basic audio and visual
processing and feature extractors to mimic aspects of these systems. For visual inputs, we use
mostly standard image processing and computer vision techniques. See [28–30] for details. For
auditory inputs, in addition to standard spectral filtering, D. Li has developed some important
processing techniques useful for anthropomorphic behavior. These include binaural sound source
localization and sound characterization. A robust sound source localization algorithm based on his
work is currently implemented on the robot, and is a key component of our work. Details can be
found in [31] and [32].
Equivalents for other senses are slightly more difficult to incorporate. Touch sensors, while not
nearly as versatile as skin, do allow for limited input of tactile sensations, and the Trilobot has a
number of touch and other sensors available. Some aspects of proprioception are implemented in
software and by using feedback sensors on some of the actuators located on the robot. Olfactory and
gustatory sensors are more difficult to include, and we chose to ignore these senses for now. However,
research has progressed in the development of artificial skin [33], noses [34], and tongues [35].
Sometime in the not-too-distant future, researchers will be able to use these organs to allow a robot
to perceive an even richer set of sensory inputs. For now, we have chosen to focus on the senses of
sight, sound, and touch, with minimal simulation of the others (e.g., proprioception) as needed.
Analogous to the brain and nervous system in humans and higher animals, our robot needs a
computational brain and a way to deliver information from its various sensors to this brain. On the
hardware level, we have incorporated a computer on board our robot which collects input from the
cameras, microphones, and sensors, and sends control commands to the robot. The computer can
also handle limited processing of the data, but a wireless transmitter is available to transmit the
data to other workstations, where most processing occurs. This distributed system of computers
houses the “brain” of our robot. To facilitate the communications necessary for this system, we
did extensive design and coding of a distributed communications and processing framework early
in this research. Details of this work appear in Appendix B.2. Hardware and system-level software
specifications can be found in Appendix A.
1.3.2 Noetic system
The noetic system in Figure 1.1 represents the “mind” aspect of the mind-body paradigm, which
we expand in further detail in Figure 1.3. Here we are not as interested in emulating the physiology
and low-level connectivity of the brain, except at the grossest levels; e.g., we would like our robots
to exhibit aspects of self-organization and emergent behavior. Our goal, though, is to implement
functional equivalents for high-level cognitive functions. In this section we review some of the
fundamentals of memory, learning, and behavior, and describe how these fundamentals are reflected
in our research. We would like to note that, even though we divide the various aspects of cognition
into these three areas, they are all interdependent; none of these cognitive components could exist
without the other two.
1.3.2.1 Memory
Memory is the most important function of the brain; without it life would be a blank.
Our knowledge is all based on memory. Every thought, every action, our very conception
of personal identity, is based on memory.... Without memory all experience would be
useless. (Edridge-Green, 1900) [36, p. 188]
Browsing through any recent psychology textbook, one can discover a plethora of views and theories
concerning the organization of the human memory system [37]. Some of these are complementary,
others are overlapping, but most simply look at memory from a different perspective. While all
of these views are constructive, here we briefly describe one of the most fundamental
classifications of memory.
William James is generally regarded as the first person to suggest that memory is divided into
primary and secondary systems [38]. This idea later evolved into the concepts of short-term memory
(or working memory), and long-term memory, which we refer to as associative memory. These two
systems are presented as primary components of the noetic system in Figure 1.3.
Short-term memory refers to the immediate thoughts going through our head, whether obtained
from our senses or by manipulation of thoughts or knowledge retrieved from our long-term memory.
The term working memory came about after more research, and refines the idea of short-term
Figure 1.3: Expanded view of the cognitive cycle. This expanded view shows the breakdown and relationship among various components of the noetic system.
memory as a system consisting of a central executive and a number of subsidiary systems, including
at least visual and phonological subsystems [37, 38].
Conceptualization of long-term memory has also been considerably refined since James’ time.
One of the most common models, attributed to Endel Tulving [39], divides long-term memory into
procedural, semantic, and episodic memory, as shown at the bottom of Figure 1.3. Procedural
memory is concerned with our knowledge of how to do things, e.g., how to walk or drive a car.
Semantic memory concerns meaning and our general knowledge about the world. This includes,
for example, meanings of words and knowledge of where we live. Episodic memories are memories
of specific events that have occurred in the past, or alternatively, events that we anticipate in the
future.
1.3.2.2 Learning
Learning can be described as a transition from one mental state to another where information is
gained [40]. In this section, we will highlight what we feel are some essential aspects of learning
that we need to incorporate into our research.
Associative learning. If, as proposed earlier, an associative memory is the central component
of memory, then the corollary is that associative learning is the primary mechanism of learning.
According to Shanks [40], in associative learning, “the environment provides a relationship among
contingent events, allowing [a] person to predict one [event] in the presence of others” (p. 2).
Possible events include both environmental cues and the subject’s own behavior. The relationship
between or among events can be causal or structural. In causal relationships, one event occurs,
followed by another, perhaps after a brief time interval. For example, there is a consistent causal
relationship between touching a hot burner and feeling pain. Structural relationships relate features
or properties of an object or event with other features which frequently co-occur. For example, after
both seeing and smelling a fire, the presence of one of these events generally indicates the presence
of the other. A less obvious example of a structural relationship is the association of a word with
a particular object or event, a key focus of our research.
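As a toy illustration of structural association (a deliberately simplified stand-in for the HMM-based associative memory developed later in this dissertation), a learner can count which cues co-occur in time and then predict one cue from the presence of another. The class and cue names below are invented for illustration.

```python
from collections import Counter

class Associator:
    """Counts co-occurring cues and predicts the most strongly associated one."""

    def __init__(self):
        self.pair_counts = Counter()

    def observe(self, cues):
        # Associate every ordered pair of cues occurring together in time.
        for a in cues:
            for b in cues:
                if a != b:
                    self.pair_counts[(a, b)] += 1

    def predict(self, cue):
        # Return the cue most often observed together with `cue`.
        candidates = {b: n for (a, b), n in self.pair_counts.items() if a == cue}
        return max(candidates, key=candidates.get) if candidates else None

learner = Associator()
for _ in range(5):
    learner.observe(["see-fire", "smell-smoke"])   # frequent co-occurrence
learner.observe(["see-fire", "hear-dog"])          # incidental co-occurrence
prediction = learner.predict("see-fire")
```

The same counting idea extends to word-object association: a spoken label presented together with a visual percept becomes a structural cue pair.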
Reinforcement learning. Reinforcement learning is one aspect of associative learning. It can
refer to a couple of distinct but related concepts, depending on the type of relationship being
learned:
1. knowledge gained through repeated stimulation of co-occurring cues from the environment;
or
2. behavior learned through the repeated association of an action and a reward or punishment
[41] (i.e., behaviorism).
Here, we will briefly describe the first version of reinforcement learning. Formally, a subject is
connected to its environment through perception and action. Through its senses, it perceives some
indication i of the state of the environment. The subject then produces some action a which has
an effect on the environment. This effect is evaluated through a reinforcement signal r (the reward
or punishment). The reinforcement signal may be internally or externally generated, but in either
case is a function of input i. In general, the subject’s goal is to choose actions which in some way
maximize the long-run sum of r.
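The perception-action-reinforcement loop just described can be sketched as follows; the environment, policy, and reward values are invented placeholders, not part of our system.

```python
# The subject perceives an indication i of the environment's state,
# produces an action a, and receives a reinforcement signal r, with the
# goal of maximizing the long-run sum of r.

def run_episode(env_step, policy, i0, steps=10):
    total_r, i = 0.0, i0
    for _ in range(steps):
        a = policy(i)             # choose an action from the perception i
        i, r = env_step(i, a)     # act on the environment, observe reward
        total_r += r              # accumulate long-run reinforcement
    return total_r

# Placeholder environment: the state is an integer, actions move it by
# +1 or -1, and reward 1.0 is given whenever state 3 is reached.
def env_step(i, a):
    i2 = i + a
    return i2, (1.0 if i2 == 3 else 0.0)

greedy = lambda i: 1 if i < 3 else -1    # hand-coded policy toward state 3
total = run_episode(env_step, greedy, i0=0)
```

A learning subject would replace the hand-coded policy with one that is itself adjusted to increase the accumulated reinforcement.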
A simple example of reinforcement learning occurs in the training of animals. An interesting
example comes from a recently published New York Times article, which describes how Gambian
giant pouched rats are being trained to find land mines [42]. Finding a mine earns each rat a snap of
a clicker and a snack of peanuts or banana. At times, the rats try to game the system by randomly
scratching the earth in the hopes of getting free treats, but they are rewarded with food only for
actual finds. Of course, reinforcement learning examples do not have to be so esoteric, nor are they
necessarily limited to other animals. Almost any activity humans attempt can involve evaluation
which causes a modification of future behavior.
1.3.2.3 Behavior
If memory contains our knowledge about the world, and learning modifies that knowledge, behavior
puts that knowledge into use. Behavior is, of course, intimately linked to the reinforcement learning
mechanism described above. Some human or animal behaviors would be difficult to emulate (e.g.,
procreation), but there are specific behaviors and aspects of behaviors which we would like to model.
A few are listed below.
Curiosity and exploration. Humans are curious creatures. Gopnik et al. [43] suggest that
infants and children are wired to explore, experiment, and learn about the world. Garvey [44]
also states that the cognitive abilities learned in the first two years “are developed by acting on
and interacting with ... things and people.... [T]hese developments also reflect the beginnings
of symbolic representation, a prerequisite to the development of language and abstract thinking”
(p. 41). We feel that exploration is necessary to obtain as much information as possible about our
environment, and is instrumental in our cognitive development.
Language understanding and acquisition. Since the focus of our research is language acquisi-
tion, some of the behaviors we hope to emulate are directly related to language and communication.
We have already mentioned Garvey’s comments above concerning exploration and language devel-
opment. More direct examples of linguistic behaviors are available. For example, dogs can be
taught to retrieve named objects [45], and children begin to understand and say object names at a
young age. Both of these behaviors are essential targets for our research.
Imitation. Children learn extensively through imitation of both speech and action [43]. One
benefit of imitation is that it gives an example of a specific behavior and desired outcome, which can
be used for evaluation in a reinforcement learning paradigm. Learning through imitation has also
been proposed as an efficient and perhaps necessary mechanism for learning in robots [13, 46–48].
1.3.2.4 Implementation
The noetic system in our robot should be able to express the aspects of memory, learning, and
behavior outlined above. Among other things, the robot needs to:
1. look around, navigate, and perform actions (procedural memory, using reinforcement learning
and imitation);
2. learn about and understand its environment (semantic memory, with associative learning);
and
3. make decisions using what it knows and currently senses (working memory and a central
decision maker, interacting with long-term memory).
The ability to remember specific past events or sequences of events, and the ability to predict or
even desire future events (episodic memory), are also essential for our study of language learning,
the principal long-term goal of our work.
In our group’s research, we have studied various incarnations of these ideas. Our work includes
research in many of the topics just discussed, including (1) navigation and interaction via reinforcement training, (2) autonomous exploration, (3) speech imitation, and (4) concept learning via
association. We highlight this work below.
Environment navigation and interaction via reinforcement learning. Just as a child
must learn to move and interact with the world, our robot needs to learn to move around and
interact with its environment. To this end, members of our group have developed and implemented
reinforcement learning algorithms which allow the robot to learn navigation. In Lin’s work [30],
the robot learns to visually navigate a maze using Q-learning, a reinforcement learning algorithm.
Zhu and Levinson [49] developed an improved Q-learning algorithm called propagated Q-learning,
or PQ-learning, and used this method on the robot to learn general navigation toward a goal,
including obstacle avoidance.
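Generic one-step tabular Q-learning, of the kind these projects build on, can be sketched as follows; the corridor world and parameter values are illustrative and are not taken from [30] or [49].

```python
import random

# Tabular Q-learning on a tiny 1-D corridor: states 0..4, actions -1/+1,
# reward 1.0 on reaching the goal state 4.  The update rule is
# Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

random.seed(0)
GOAL, ALPHA, GAMMA, EPS = 4, 0.5, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}

def step(s, a):
    s2 = min(4, max(0, s + a))
    return s2, (1.0 if s2 == GOAL else 0.0)

for _ in range(500):                      # training episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.choice((-1, 1))
        else:
            a = max((-1, 1), key=lambda act: Q[(s, act)])
        s2, r = step(s, a)
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, -1)], Q[(s2, 1)])
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# Follow the learned greedy policy from the start state (bounded for safety).
greedy_path = [0]
while greedy_path[-1] != GOAL and len(greedy_path) < 10:
    s = greedy_path[-1]
    a = max((-1, 1), key=lambda act: Q[(s, act)])
    greedy_path.append(step(s, a)[0])
```

PQ-learning [49] modifies how value estimates propagate between states, but the tabular update above is the common starting point.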
Autonomous exploration. As mentioned above, children have a natural curiosity about the
world, and set out and explore as soon as they are able. McClain [50] identifies three general
instincts necessary for exploration:
1. The motivation and ability to search for and identify new objects
2. The motivation and ability to interact with objects
3. A survival instinct
Starting with these built-in behaviors, the robot explores its environment looking for objects. It is
particularly interested in objects that it has not seen before. Each time it discovers a new object,
it will approach the object and play with it, first attempting to pick it up and then attempting to
knock it over. The robot will also turn toward loud sounds, under the assumption that it will find
an object of interest in that direction. This work also demonstrates the ability of the robot to run
autonomously for long periods of time in a robust manner.
Speech imitation. From a young age, children learn to speak by mimicking those around them.
We plan to use speech imitation as a vehicle for the robot to learn to speak. Kleffner [48] has
developed a robust method for speech imitation, involving extracting phonetic and phonemic fea-
tures from the sound stream which give an internal representation correlating to the vocal tract
shape, while taking into account the resolution of the human ear. The features that are extracted
can be reused for speech synthesis or combined with features from other modalities for recognition
and learning. Experiments with the robot, described in Chapter 4, use these features for speech
recognition.
Semantic concept learning via association. As noted earlier, one aspect of learning funda-
mental to our work is the idea that learning and recall occur mostly as the association of sensory
input data. The main focus of the rest of this dissertation is the development and use of a cascade
of HMMs for associative learning of semantics. The topic, as it pertains to this dissertation, is
introduced and discussed in more detail in the next section.
In addition to the research described herein, two others in our group have addressed this research
question. Liu [51] developed a system whereby a benevolent teacher would push on a touch sensor
on the robot while speaking a movement command. For example, the teacher might push on a
sensor on the back of the robot and say “forward.” A touch on the rear sensor would “push” the
robot forward (its wheel would straighten and its motor would start running). After a training
period, the robot could, on a speaker-dependent basis, be controlled by voice. In this work the
robot shows the beginnings of a conceptual understanding of commands and directions through
voice and tactile sensors.
Zhu and Levinson [52] also conducted some experiments on scene concept learning. In their
work, they proposed a joint probability density function (JPDF) representation for learning such
visual concepts as color, shape, and object name. Zhu and Levinson’s model was able to successfully
learn labels for 6 color concepts, 3 shape concepts, and 13 object concepts drawn from 15 natural
objects.
In the next section, we give more details and expand on the basic ideas of semantic learning.
1.4 Semantic Learning
1.4.1 Introduction
Let us restate our basic assumptions: first, that language is primarily semantic (that is, it is
concerned mostly with our knowledge of the world); second, that this understanding is gained by
recognizing and learning relationships between or among events and cues in the environment; and
third, that this learning requires the learner to be embodied and situated in the environment. In this
section, we will develop a basic model for learning semantic associations from environmental cues.
We note that our focus is on semantic knowledge gained primarily through repeated stimulation
from the environment, and so, for now, we are ignoring one-shot or fast-map learning [45, 53–55].
1.4.2 General associative memory model for semantic learning
Semantics is meaning. It is our knowledge of the world and how it works. Through evolution and in
our early development, we first learn to understand the world by associating sensory-motor events
and cues. Some examples pointed out in Section 1.3.2.2 include learning what happens when one
touches a hot burner, learning to associate the sight and smell of fire, learning to associate a word
with an event or some other co-occurring cue, or some combination of these.
Regarding learning simply as association agrees closely with behaviorist theories, particularly
with respect to learning the relationship between cues or events and one's own actions. For
animal learning, behaviorism is often the best explanation, and it can describe much of human
behavior as well. How, then, do human and animal behaviors differ?
One important difference is that humans can communicate meaning linguistically, using symbols
[Diagram: a central "concept of apple" node linked to sensory cues (e.g., the spoken word "Apple," a crunch sound) and to other knowledge: facts, stories, experiences, etc.]
Figure 1.4: The concept of apple. The apple concept is associated with the different ways we sense apples, as well as with other related knowledge.
representing concepts.1 The question becomes, can we mimic this behavior? That is, can we build
a system that can learn meaning in a behaviorist manner (i.e., via association) and, in addition,
that can create symbols that can be manipulated and communicated? We think so.
According to Laurence and Margolis [57], “concepts are the most fundamental constructs in
theories of mind” (p. 1). While there is some debate about the definition of concepts, or even
whether they exist [57], a concept is generally defined in terms of the features that are associated
with it, as well as the rules that relate these features [58, p. 409]. Figure 1.4 shows an example,
where the concept of “apple” is associated with the smell, taste, sight, sounds, and feel of an apple,
as well as other related knowledge.
One feature to note about Figure 1.4 is the fact that the concept is represented as a discrete unit.
It does not simply exist as a set of weights connecting two sensory modalities. This formulation
differs from that of many of the models often used to associate different information streams,
where associative relationships are related directly (e.g., Hopfield networks and related work [59,
60], some instantiations of Bayesian Networks [61], and some HMM formulations tying together
multiple sensory modalities, such as fused [62,63] or coupled HMMs [64,65]). Why is this difference
important? Because it allows the concept to be manipulated as a symbol.
Figure 1.5 gives a more abstract illustration of concept connections. Taking the models one
at a time, the visual model independently learns visual concepts of the different objects or other
1As an aside, chimpanzees, dogs, bees, and some other animals may be able to communicate or understand symbols to a limited extent. See, e.g., [45, 56].
[Diagram: sensory inputs feed a visual model and an auditory model, whose outputs combine in a concept model (together forming semantic memory), with output to working memory.]
Figure 1.5: Visual/auditory concept hierarchy. This figure shows how representations from a general auditory and visual model of the world are combined to create a conceptual model of the world.
distinguishable sights in its environment. These concepts could include such things as colors, shapes,
textures, or types of motion, although each of these may be put into a separate model. The audio
model learns concepts from audio cues, including speech. At the lowest level, this might include
environmental sounds and phonemes. The concept model learns frequently co-occurring states or
classifications of the lower models. Learning in all models is unsupervised, although depending
on the model and learning method chosen, models may be initialized with a bias to learn better
or faster or both. Although we do not yet do this, it should also be possible to incorporate
feedback from other models, as well as positive or negative feedback from the environment for
reinforcement type learning. The model can, of course, scale up to include more types of sensory
models.
One necessary condition for effective communication is that the two people communicating
(or in our case, a person and a robot communicating) share a similar set of concepts. Thus, the
learning of concepts can be described as an attempt to learn a model of another person’s knowledge.
Figure 1.6 illustrates this idea graphically, showing an interaction between two subjects, a
person and a robot, each with their own cognitive model of the world. The immediate goal of the
robot is to learn the cognitive model the person is using to understand the immediate environment.
Just learning concepts may be interesting and useful by itself, but as hinted by Figure 1.6, we
do envision this model as simply one part of a more complex model, designed around the cognitive
cycle described by Figure 1.1. The model as presented is very general, so any number of models
could be plugged into the clouds in the figures. For reasons highlighted in Chapter 2, we have
[Diagram: two concept hierarchies, one for the boy and one for the robot, each with visual and auditory models feeding a concept model, sharing the spoken word "Apple" and a common visual stimulus.]
Figure 1.6: Associative learning of the word “apple.” By hearing the boy’s word in response to a shared visual stimulus, the robot can attempt to learn a model of the world compatible with the boy’s model.
chosen to use HMMs for the individual components of the hierarchy. This realization of the model
is presented in Chapter 3.
1.5 Contributions and Layout of Dissertation
Our group is attempting a complex and ambitious project, that of creating the body and mind of
an intelligent robot. It is necessary to stress the collaborative aspect of this project, which has
been quite rewarding, and without which progress would be extremely slow and limited. Within
this collaboration, I have made a significant contribution in three main areas. First, I was heavily
involved with the initial design and development of two of the robotic platforms used by the group.
Second, I was lead designer and developer of a robust system for transparently connecting the
various computing modules. My third area of contribution is the development of an HMM cascade
architecture for concept learning, described in detail in the following chapters. With regard to my
work involving HMMs, my contribution includes
1. noting an extension of the analysis of the recursive maximum-likelihood estimation (RMLE)
algorithm presented by Krishnamurthy and Yin [66] to finite-alphabet HMMs (their analysis
applies specifically to observations with continuous densities) (Section 2.3);
2. giving experimental results for various modifications of the RMLE algorithm (Section 2.3.4);
3. proposing and analyzing an HMM cascade architecture for learning associations among mul-
tiple observation streams, including arguments extending RMLE convergence analysis to our
proposed cascade architecture and experimental evaluation (Chapter 3);
4. implementing and using the above-mentioned cascade model for learning semantic concepts
on our robot (Chapter 4); and
5. deriving a version of RMLE for hidden semi-Markov models (HSMMs) (Appendix D).
In the previous sections of this introduction, we described the somatic and cognitive framework we
use; we now note how the work of this dissertation fits into that framework.
Within the somatic system, we are using the existing hardware and software framework. In
particular, our work runs on the base platform described in Section 1.3.1.4, using the system of cameras,
microphones, and touch sensors described therein. For visual processing we use feature extraction
developed by R. S. Lin, described in Appendix B. We also use the sound source localization scheme
developed by D. Li, and audio feature extraction developed by M. Kleffner, both mentioned above.
Kleffner’s work is directly relevant to our work, and is therefore described in Appendix B. All of
these components are connected by the distributed communications framework developed mostly
by myself, described in the same Appendix in Section B.2.
For cognitive modeling, our work focuses on semantic concept learning using stochastic models,
similar to, but improving upon, the work by Q. Liu and W. Zhu described in Section 1.3.2.4. Our
work is built on top of autonomous exploration work by M. McClain, described in the same section.
The rest of this dissertation is organized as follows. Chapter 2 describes HMMs, and introduces
the RMLE algorithm for learning model parameters. HMMs and the RMLE algorithm are key
components of our composite HMM-based associative memory. Chapter 3 describes the theory and
gives simulation results for this associative memory, and Chapter 4 describes the experiments we
have run on our robot using this model. In Chapter 5, we summarize and discuss the significance
of our work. The appendices contain a wealth of additional information, including details of the
aforementioned robotic hardware (Appendix A) and software (Appendix B), discussion of standard
algorithms used with HMMs (Appendix C), definition and derivation of the RMLE algorithm for
HSMMs (Appendix D), some additional RMLE derivations (Appendix E), and some matrix calculus
used in some of our derivations (Appendix F).
CHAPTER 2
HIDDEN MARKOV MODELS AND
THE RMLE ALGORITHM
2.1 Introduction
In Section 1.4, we described a hierarchical structure for modeling concepts. The structure is generic
enough that a variety of models could be used throughout the structure, even in a heterogeneous
manner. Our work focuses on the use of HMMs in this hierarchy.
An HMM is a discrete-time stochastic process with two components, {Xn, Yn}, where (i) {Xn}
is a finite-state Markov chain, and (ii) given {Xn}, {Yn} is a sequence of conditionally independent
random variables. The conditional distribution of Yk depends on {Xn} only through Xk. The name
hidden Markov model arises from the assumption that {Xn} is not observable, and so its statistics
can only be ascertained from {Yn}.
HMMs have many interesting features that we believe can be easily exploited for concept learn-
ing. As noted previously, concepts are formed from the correlation in time among events. HMMs
by construction have a notion of sequence, and have proven quite effective at learning time series
and spatial models in such areas as speech processing [67] and computational biology [68–70]. This
characteristic of HMMs provides a useful starting point for learning time correlation.
Another property of HMMs useful for learning concepts is their ability to discover structure in
input data. Cave and Neuwirth [71] demonstrated this capability by training a low-order ergodic
HMM on text. They found that the states of the model represented broad categories of letters,
discovering some of the underlying structure of the text. Poritz [72] developed a similar model for
speech data, and Ljolje and Levinson [73] created a speech recognizer based on this type of model.
Our hierarchical model exploits this natural capability of HMMs to discover structure in order to
learn higher level concepts.
Finally, in addition to their familiar role as recognizers, HMMs can be used in a generative
capacity. In particular, when placed in a hierarchy, we can drive the various HMMs to produce
sequences of states and corresponding output, roughly simulating thoughts and actions.
Some characteristics of HMMs are not as useful for our work, however. Two of the most common
methods used for HMM parameter estimation, the Baum-Welch method and methods based on the
Viterbi algorithm, both require off-line processing of large amounts of data. (See Appendix C for
details on these algorithms.) For our goal of learning concepts in real time using a robot, these
methods are not very useful. We would much prefer an iterative or on-line training procedure.
There are generally two approaches researchers have used to implement on-line training for
HMMs. The first minimizes the prediction error of the model via recursive methods. This approach
was first suggested by Arapostathis and Marcus [74], who proposed a recursive Gauss-Newton algo-
rithm and a general recursive stochastic gradient algorithm, although they only treat the learning
of transition probabilities in finite-alphabet HMMs. Collings et al. [75] present a similar technique
for when the observations for each state have a Gaussian distribution. They treat both transition
probability and observation mean estimation, though they do not estimate variances. LeGland and
Mevel [76] prove convergence of the recursive conditioned least squares estimator (RCLSE), which
is a generalization of the approach in [74] to the case of observations in Rd.
The other approach used to implement on-line training in HMMs is to maximize the Kullback-
Leibler information between the estimated model and true model, or equivalently, to maximize
the likelihood of the estimated model for an observation sequence. Holst and Lindgren [77] were
the first to propose an RMLE algorithm for HMMs. Krishnamurthy and Moore [78] derive an
on-line algorithm based on sequential expectation maximization (EM) schemes which minimize the
Kullback-Leibler information. In both of these papers, convergence was shown only in simulation.
Ryden [79] provides convergence analysis for a general class of batch-iterative recursive maximum-
likelihood estimators. Independently, LeGland and Mevel [76,80] suggest and prove the convergence
of RMLE, and compare it to the RCLSE (mentioned above). Krishnamurthy and Yin [66] extend
the RMLE results of [76] to autoregressive models with Markov regime, and add a number of
results on convergence, rate of convergence, model averaging, and parameter tracking. Because
they offer the most complete results, our RMLE implementation for HMMs (and the explanation
of the algorithm in this chapter) is based mostly on [66].
For the remainder of this section, we will formulate our model, derive the RMLE algorithm
for HMMs, sketch the proof of convergence given by Krishnamurthy and Yin [66], and discuss a
number of HMM training results using the algorithm. While the main purpose of this section is
to establish use of these algorithms in our cascade model in the next chapter, we will also provide
some analysis and commentary, including a discussion at the end of this chapter on why HMMs are
better Bayesian classifiers.
2.2 Model Description and Notation
An HMM is a discrete-time stochastic process with two components, {Xn, Yn}, defined on probability space (Ω, F, P). Let {Xn}, n ≥ 1, be a discrete-time first-order Markov chain with state space R = {1, . . . , r}, with r a fixed, known constant. The model starts in a particular state i = 1, . . . , r with probability πi = P(X1 = i). Define π ∈ Π by π = {πi}, where Π is the set of length-r stochastic vectors. For i, j = 1, . . . , r, the transition probabilities of the Markov chain are given by

aij = P(Xn = j | Xn−1 = i). (2.1)

Let A = {aij}. Then A ∈ A, where A is the set of all r × r stochastic matrices.
In an HMM, {Xn} is not visible, and its statistics can only be ascertained from a corresponding observable stochastic process, {Yn}. The process {Yn} is a probabilistic function of {Xn}; i.e., given Xn, Yn takes values from some space E according to a conditional probability distribution. The corresponding conditional density of Yn is generally assumed to belong to a parametric family of densities {b(·; θ) : θ ∈ Θ}, where the density parameter θ is a function of Xn, and Θ is the set of valid parameters for the particular conditional density assumed by the model. The conditional density of Yn given Xn = j can be written b(·; θj), or simply bj(·) when the explicit dependence on θj is understood.
Example 2.1. (Gaussian observation density): Suppose the observation density for each state in an HMM is described by a univariate Gaussian distribution. Then the parameter set is Θ = {(µ, σ) ∈ R × (0, ∞)}, θj ∈ Θ, and {Yn = yn} is a sequence of continuously valued, conditionally independent outputs on R, each with probability density

b(yn; θj) = b(yn; µj, σj) = [1/(√(2π) σj)] exp[−(yn − µj)² / (2σj²)] (2.2)

for Xn = j.
Example 2.2. (Finite-alphabet observation density): Suppose observations Yn are drawn from a finite set of symbols V = {vk}, k = 1, . . . , s. Then Θ = {(b1, . . . , bs) : ∑k bk = 1, bk ≥ 0} is the set of length-s stochastic vectors, θj ∈ Θ, and {Yn = yn} is a sequence of symbols drawn from a finite alphabet, each yn having probability

b(yn; θj) = bjk, where yn = vk, (2.3)

for Xn = j.
For simplicity, the last two examples and the following discussion assume Yn to be scalar valued,
although the formulation easily generalizes to vector-valued observations.
Conceptually, it is useful to think of {Yn} as being generated by a hidden Markov process. When Xn = j, the observation Yn is generated using

Yn = g(en; θj)|Xn=j, (2.4)

where g(·; θ) is a real-valued function on R indexed by θ ∈ Θ, and {en} is a sequence of independent and identically distributed (iid) random variables. This formulation is equivalent to a Monte Carlo simulation, where g(·; θ) could be, for example, the inverse of the cumulative distribution function (CDF) corresponding to observation density b(·; θ), and {en} a sequence of uniform random variables distributed on [0, 1]. Other formulations, of course, are possible.
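To make Equation (2.4) concrete, the following sketch (ours, not part of the dissertation's software; the function name and the NumPy dependency are our own choices) generates a state and observation sequence from a Gaussian-output HMM:

```python
import numpy as np

def sample_hmm(pi, A, mu, sigma, n, seed=None):
    """Generate n observations from a Gaussian-output HMM.

    Realizes Y_k = g(e_k; theta_j)|X_k=j of Eq. (2.4): here g for state j
    is the inverse Gaussian CDF, applied implicitly by drawing from
    N(mu_j, sigma_j^2) directly.
    """
    rng = np.random.default_rng(seed)
    r = len(pi)
    states = np.empty(n, dtype=int)
    obs = np.empty(n)
    x = rng.choice(r, p=pi)                   # X_1 ~ pi
    for k in range(n):
        states[k] = x
        obs[k] = rng.normal(mu[x], sigma[x])  # Y_k | X_k = x
        x = rng.choice(r, p=A[x])             # X_{k+1} | X_k = x
    return states, obs
```

Any other observation density fits the same pattern: only the per-state draw changes.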
For later analysis, it will be convenient to collect model parameters together in a single parameter vector. Define the HMM parameter space as Φ = Π × A × Θ. The model ϕ ∈ Φ is then defined as

ϕ = (π1, . . . , πr, a11, a12, . . . , arr, θ1, . . . , θr). (2.5)
The model parameters for a particular model are accessed via coordinate projections, e.g., aij(ϕ) =
aij . In some cases (such as when considering the RMLE algorithm below), we will not be concerned
with estimating π. In that case, Φ = A× Θ, and ϕ changes accordingly. Note that the literature
occasionally describes other model parameterizations (see, e.g., [75, 77]).
Example 2.3. For Example 2.1 above,
ϕ = (π1, ..., πr, a11, a12, ..., arr, µ1, σ1, ..., µr, σr).
Let p be the length of ϕ. When estimating model parameters, let ϕ∗ ∈ Φ be the fixed set of
“true” parameters of the model we are trying to estimate.
For a vector or matrix v, v′ represents its transpose. Define the r-dimensional column vector
b(yn;ϕ) and r × r matrix B(yn;ϕ) by
b(yn;ϕ) = [b1(yn; θ1(ϕ)), ..., br(yn; θr(ϕ))]′ (2.6)
and
B(yn;ϕ) = diag[b1(yn; θ1(ϕ)), ..., br(yn; θr(ϕ))]. (2.7)
Vector b(yn;ϕ) and matrix B(yn;ϕ) give the observation density evaluated at yn for each state (in
model ϕ), as a vector and diagonal matrix, respectively.
Using the definitions above, it can be shown (see, e.g., [81]) that the likelihood of the sequence of observations 〈y1, . . . , yn〉 for model ϕ is given by

pn(y1, . . . , yn; ϕ) = π(ϕ)′B(y1; ϕ) ∏k=2..n [A(ϕ)B(yk; ϕ)] 1r, (2.8)

where 1r refers to the r-length vector of ones.
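Equation (2.8) can be evaluated directly as a chain of small matrix-vector products. The following sketch (our own illustration, not the dissertation's code; the function names are hypothetical) does so for the univariate Gaussian case of Example 2.1:

```python
import numpy as np

def gaussian_b(y, mu, sigma):
    """Vector b(y; phi): per-state Gaussian densities of Eq. (2.2)."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def likelihood(ys, pi, A, mu, sigma):
    """Eq. (2.8): p_n = pi' B(y_1) prod_{k=2}^n [A B(y_k)] 1_r.

    The running row vector v holds pi' B(y_1) A B(y_2) ... A B(y_k).
    """
    v = pi * gaussian_b(ys[0], mu, sigma)       # pi' B(y_1)
    for y in ys[1:]:
        v = (v @ A) * gaussian_b(y, mu, sigma)  # right-multiply by A B(y_k)
    return v.sum()                              # final product with 1_r
```

For long sequences this direct product underflows, which is one practical reason to prefer the normalized prediction-filter recursion developed in Section 2.3.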
2.3 Recursive Maximum-Likelihood Estimation of HMM
Parameters
Maximum-likelihood estimation (MLE) is formally defined as follows. For observation sequence 〈y1, . . . , yn〉, find

ϕ̂ = arg maxϕ∈Φ pn(y1, . . . , yn; ϕ), (2.9)

where ϕ̂ is the most likely estimate of the true underlying parameters ϕ∗. The recursive maximum-likelihood estimation (RMLE) algorithm defined here is an iterative, stochastic gradient solution to this problem.
2.3.1 RMLE derivation
The derivation of the RMLE algorithm for HMMs proceeds as follows. We first show how to
calculate the likelihood pn(y1, . . . , yn;ϕ) for a given HMM model recursively, using prediction (or
forward) filters. We note that maximizing log pn(y1, . . . , yn;ϕ) is equivalent to and generally easier
than maximizing pn(y1, . . . , yn;ϕ) [82], and that log pn(y1, . . . , yn;ϕ) can also be calculated recur-
sively. We can then search for the maximum of log pn(y1, . . . , yn;ϕ) using the derivative of the
update of this recursion.
For the results of this section to hold, it is necessary to assume various conditions on periodicity,
continuity, and ergodicity for the model. For simplicity, we will assume that all necessary conditions
hold and will introduce them in the next section.
Define the prediction filter as

un(ϕ) = [un1(ϕ), . . . , unr(ϕ)]′, (2.10)

where

uni(ϕ) = P(Xn = i | y1, . . . , yn−1) (2.11)

is the probability of being in state i at time n given all previous observations. Using this filter, the likelihood pn(y1, . . . , yn; ϕ) can be written as

pn(y1, . . . , yn; ϕ) = ∏k=1..n b(yk; ϕ)′uk(ϕ). (2.12)

(For this derivation, see Appendix E, Section E.1.)

The value of un(ϕ) can be calculated recursively as

un+1(ϕ) = A(ϕ)′B(yn; ϕ)un(ϕ) / [b(yn; ϕ)′un(ϕ)], (2.13)

when initialized by u1(ϕ) = π(ϕ).
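In code, one step of Equation (2.13) amounts to a matrix-vector product followed by a normalization. A minimal sketch (ours; the function name and NumPy dependency are assumptions):

```python
import numpy as np

def filter_step(u, by, A):
    """One step of Eq. (2.13): u_{n+1} = A' B(y_n) u_n / (b(y_n)' u_n).

    u  -- current prediction filter u_n (length-r probability vector)
    by -- b(y_n; phi), per-state observation densities evaluated at y_n
    A  -- r x r row-stochastic transition matrix
    """
    num = A.T @ (by * u)   # A' B(y_n) u_n, with B(y_n) = diag(by)
    return num / (by @ u)  # normalize by b(y_n)' u_n
```

Because each row of A sums to one, the entries of un+1 sum to one whenever un does, so the recursion never underflows the way the raw likelihood product of Equation (2.8) can.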
Let w(l)n(ϕ) = (∂/∂ϕl)un(ϕ) be the partial derivative of un(ϕ) with respect to (wrt) the lth component of ϕ. Each w(l)n(ϕ) is an r-length column vector, and

wn(ϕ) = (w(1)n(ϕ), w(2)n(ϕ), . . . , w(p)n(ϕ)) (2.14)

is an r × p matrix. Taking the derivative of un+1(ϕ) from Equation (2.13),

w(l)n+1(ϕ) = (∂/∂ϕl)un+1(ϕ) = R1(yn, un(ϕ), ϕ)w(l)n(ϕ) + R(l)2(yn, un(ϕ), ϕ), (2.15)

where

R1(yn, un(ϕ), ϕ) = A(ϕ)′[I − B(yn; ϕ)un(ϕ)1′r / (b(yn; ϕ)′un(ϕ))] B(yn; ϕ) / (b(yn; ϕ)′un(ϕ)), (2.16)

R(l)2(yn, un(ϕ), ϕ) = A(ϕ)′[I − B(yn; ϕ)un(ϕ)1′r / (b(yn; ϕ)′un(ϕ))] [∂B(yn; ϕ)/∂ϕl]un(ϕ) / (b(yn; ϕ)′un(ϕ))
    + [∂A(ϕ)′/∂ϕl]B(yn; ϕ)un(ϕ) / (b(yn; ϕ)′un(ϕ)). (2.17)

Using these equations, we can recursively calculate wn(ϕ) at every iteration.
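The updates in Equations (2.15) through (2.17) can be sketched as a single function (our own illustration; the argument names are our choices, and the derivative matrices ∂B/∂ϕl and ∂A/∂ϕl must be supplied by the caller for the parameter ϕl of interest):

```python
import numpy as np

def deriv_filter_step(w_l, u, by, dby_l, A, dA_l):
    """One step of Eq. (2.15): w^(l)_{n+1} = R1 w^(l)_n + R2^(l).

    w_l   -- current derivative filter w^(l)_n (length r)
    u     -- prediction filter u_n
    by    -- b(y_n; phi), per-state densities at y_n
    dby_l -- per-state derivative (d/d phi_l) b(y_n; phi)
    A     -- transition matrix A(phi)
    dA_l  -- elementwise derivative (d/d phi_l) A(phi)
    """
    denom = by @ u                       # b(y_n)' u_n
    P = np.eye(len(u)) - np.outer(by * u, np.ones(len(u))) / denom
    # Eq. (2.16): R1 = A' P B / (b'u), applied to w^(l)_n
    r1_term = A.T @ (P @ (by * w_l)) / denom
    # Eq. (2.17): R2 = A' P [dB] u / (b'u) + [dA'] B u / (b'u)
    r2_term = A.T @ (P @ (dby_l * u)) / denom + dA_l.T @ (by * u) / denom
    return r1_term + r2_term
```

For instance, taking ϕl = a11, dA_l is the indicator matrix with a one in position (1, 1) and dby_l is zero.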
For a set of observations 〈y1, . . . , yn〉, we would like to find the maximum of pn(y1, . . . , yn; ϕ). Equivalently, we can maximize log pn(y1, . . . , yn; ϕ). Define the log-likelihood of observations 〈y1, . . . , yn〉 as

ℓn(ϕ) = [1/(n + 1)] log pn(y1, . . . , yn; ϕ). (2.18)

Using Equation (2.12), we can rewrite this as

ℓn(ϕ) = [1/(n + 1)] ∑k=1..n log[b(yk; ϕ)′uk(ϕ)]. (2.19)
To estimate the set of optimal parameters ϕ∗, we want to find the maximum of ℓn(ϕ), which we will attempt via recursive stochastic approximation. For each parameter l in ϕ, at each time n, we take (∂/∂ϕl) of the most recent term inside the summation in Equation (2.19), to form an “incremental score vector”

S(Yn; ϕ) = (S(1)(Yn; ϕ), . . . , S(p)(Yn; ϕ))′ (2.20)

with

S(l)(Yn; ϕ) = (∂/∂ϕl) log[b(yn; ϕ)′un(ϕ)]
    = {b(yn; ϕ)′[(∂/∂ϕl)un(ϕ)] + [(∂/∂ϕl)b(yn; ϕ)]′un(ϕ)} / [b(yn; ϕ)′un(ϕ)]
    = {b(yn; ϕ)′w(l)n(ϕ) + [(∂/∂ϕl)b(yn; ϕ)]′un(ϕ)} / [b(yn; ϕ)′un(ϕ)], (2.21)

where

Yn ≜ (Yn, un(ϕ), wn(ϕ)). (2.22)
The RMLE algorithm takes the form

ϕn+1 = ΠG(ϕn + εnS(Yn; ϕn)), (2.23)

where {εn} is a sequence of step sizes satisfying εn ≥ 0, εn → 0, and ∑n εn = ∞; G is a compact and convex set (here, G ⊆ Φ, the set of all valid parameter sets ϕ); and ΠG is a projection onto set G. The purpose of the projection is generally to ensure valid probability distributions and maintain all necessary conditions. Note that Equation (2.23) is a gradient update rule, with constraints.
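Restricted to the transition probabilities, one update of Equation (2.23) can be sketched as follows (our own code; we approximate the projection ΠG by flooring and renormalizing each row, rather than computing an exact Euclidean projection):

```python
import numpy as np

def project_rows(A, floor=1e-6):
    """Approximate projection onto row-stochastic matrices with entries
    >= floor (the floor keeps probabilities strictly positive)."""
    A = np.clip(A, floor, None)
    return A / A.sum(axis=1, keepdims=True)

def rmle_step_A(A, score_A, eps):
    """One step of Eq. (2.23) for the transition matrix:
    A_{n+1} = Pi_G(A_n + eps_n * S), with Pi_G restoring stochasticity."""
    return project_rows(A + eps * score_A)
```

In a full implementation, the same step is applied jointly to all components of ϕ, with the projection for each block chosen to preserve its constraints (stochastic rows, positive variances, and so on).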
Equations (2.17) and (2.21) can both be simplified for each type of parameter in ϕ. Appendix E
contains derivations of both equations for
1. transition probabilities aij(ϕ),
2. observation probabilities bjk(ϕ) when assuming observations from a finite alphabet,
3. mean vector µj(ϕ) and covariance matrix Σj(ϕ) when assuming continuous observations
taken from a multidimensional Gaussian distribution, and
4. upper triangular matrix Rj(ϕ), where Rj(ϕ)′Rj(ϕ) = Σj(ϕ) above. This derivation is
included for mathematical convenience and is the one we use in our implementation, as it
greatly simplifies the calculation of Σj(ϕ)−1 and |Σj(ϕ)|.
2.3.2 Convergence
For both the derivation above and the proof of convergence below, we assume the following condi-
tions (from [66]) hold.
Condition 2.1. The transition probability matrix A(ϕ∗) is aperiodic and irreducible (see [83]).
Condition 2.2. The mapping ϕ → A(ϕ) is twice differentiable with bounded first and second
derivatives and Lipschitz continuous second derivative. For any yk, the mapping ϕ → b(yk;ϕ) is
three times differentiable, and the function b(yk; θ) is continuous on R for every θ ∈ Θ. Alternatively,
for yk drawn from a finite alphabet, the mapping ϕ → b(yk;ϕ) is twice differentiable with bounded
first and second derivatives and Lipschitz continuous second derivative.
Condition 2.3. Under Pϕ∗, the extended Markov chain {Xn, Yn, un(ϕ), wn(ϕ)} is geometrically ergodic1 (see [66, 83] for the proof when b(yn; θ) is continuous, and [74, 80] for the proof when the observations yn are drawn from a finite alphabet).
Because of this geometric ergodicity, the initial values of u0(ϕ) and w0(ϕ) are forgotten expo-
nentially fast, and are therefore asymptotically unimportant in the analysis of the algorithm.
Note 2.1. For the case of observations from a finite alphabet, our conditions and assumptions
above did not appear in [66]. However, geometric ergodicity was shown for this case in both [74]
(for a special case of models with observations from a finite alphabet) and [80] for a more general
case. The proof in [80] assumes only that the transition probabilities are being updated, but can
be generalized to include maximum-likelihood estimation of all model parameters. By introducing
these assumptions, the following proof by Krishnamurthy and Yin can then be extended to apply to
HMMs with finite observation alphabets.
Krishnamurthy and Yin [66] analyze the convergence and rate of convergence of the RMLE
algorithm described above. Their proofs use an ordinary differential equation (ODE) approach,
which relates the discrete-time iterations of the RMLE algorithm to an ODE, and then proves
convergence of the ODE. The general theory of this method is given in [85]. Here we will sketch
their convergence proof. For full details, see [66].
1For Markov chains, ergodicity means that the ensemble statistics of the states approach the stationary distribution of the chain as n → ∞. Geometric ergodicity means that the ensemble statistics approach the stationary distribution geometrically fast. See [84].
The general idea of the proof is to treat the sequence of parameter estimates {ϕn} as finite-difference estimates to a projected ODE, that is, an ODE whose dynamics are projected onto a constraint set G. In our case, G is the set of constraints necessary to maintain stochasticity of the transition matrix A(ϕ), and of the observation probability matrix {bjk} in the case of observations from a finite alphabet. They then show that the set of limit points of this ODE is {ϕ∗}.
First, note that if log[b(yk; ϕ)′uk(ϕ)] is locally Lipschitz and assuming Conditions 2.1 through 2.3, there exists a finite ℓ(ϕ) such that

ℓn(ϕ) → ℓ(ϕ), Pϕ∗-w.p. 1 as n → ∞.

That is, ℓn(ϕ) converges to a limit ℓ(ϕ), and the update algorithm we derived in the previous section is attempting to find parameters ϕ which maximize ℓ(ϕ). Moreover, this maximum is also a minimum of the Kullback-Leibler information, which is defined as

K(ϕ) = −[ℓ(ϕ) − ℓ(ϕ∗)] ≥ 0.

Thus, maximizing ℓn(ϕ) is equivalent to minimizing K(ϕ). Let LML be the set of global minima of K(ϕ) (see [86]), given by

LML = arg minϕ∈Φ K(ϕ).

Clearly, ϕ∗ ∈ LML.
Rewrite Equation (2.23) as

ϕn+1 = ϕn + εnS(Yn; ϕn) + εnMn, (2.24)

where Mn is a projection or correction term; i.e., it is the vector of shortest length necessary to bring ϕn + εnS(Yn; ϕn) back to the constraint set G. Consider a piecewise-constant interpolation of {ϕn}. According to the Arzelà-Ascoli theorem (see [85], p. 101), we can extract a convergent subsequence whose limit satisfies an ODE projected onto G.

Consider the projected ODE

ϕ̇ = H(ϕ) + m, ϕ(0) = ϕo, (2.25)

where H(ϕ) = (∂/∂ϕ)K(ϕ) and m is the force or constraint term needed to keep ϕ(·) ∈ G. Let LG = {ϕ : ϕ is a limit point of (2.25), ϕ ∈ G}. A set A ⊂ G is locally asymptotically stable (in the
sense of Lyapunov) for Equation (2.25), if for each δ > 0 there is a δ1 > 0 such that all trajectories
starting in Nδ1(A) never leave Nδ(A) and ultimately stay in Nδ1(A), where Nη(A) denotes an η
neighborhood of A.
Assume the following conditions.
Condition 2.4. For each ϕ ∈ G, S(Yj ;ϕ) is uniformly integrable, E[S(Yj ;ϕ)] = H(ϕ) =
(∂/∂ϕ)K(ϕ), H(ϕ) is continuous, and S(Y ; ·) is Lipschitz continuous for each Y .
Condition 2.5. Let L1G ⊂ LG, and suppose that LML is locally asymptotically stable. For any initial condition ϕo ∉ L1G, the trajectory of Equation (2.25) goes to LML.

Theorem 2.4. Assuming Conditions 2.1 through 2.4, there is a convergent subsequence of {ϕn} that satisfies the projected ODE in Equation (2.25), and {ϕn} converges to an invariant set of the ODE in G. Further assume Condition 2.5. Then the limit points of the projected ODE are in L1G ∪ AG w.p. 1. In particular, if L1G ∪ AG = {ϕ∗}, and {ϕn} visits a compact set in the domain of attraction of L1G ∪ AG infinitely often, then ϕn → ϕ∗ w.p. 1.

Proof omitted. See [66].
Krishnamurthy and Yin [66] also provide a rate-of-convergence analysis, examining the dependence
of the estimation error (ϕn − ϕ*) on the step size εn, as well as the behavior of the fixed-step-size
algorithm, where ε is held constant. Because of this dependence between estimation error and step
size, choosing a good step size is an important consideration when implementing the algorithm.
One way to diminish the dependence on the step size is to use averaging to give more accurate
estimations. The next section discusses this idea briefly.
2.3.3 Model averaging and tracking
As can be seen from the first column of Figure 2.1 (p. 36), oscillation can be a problem when trying
to learn a model using a fixed ε in the update procedure, depending on the size of ε. If εn is chosen
to be, say, 1/n, convergence is guaranteed, but will be quite slow, and undesirable oscillations may
still be a problem. Ideally, we would want to choose a step size which would allow learning at
an optimal rate, although this is not an easy task. In this context, Kushner and Yin [85] suggest
that averaging reduces the need to choose an optimal form for εn. (See [85], Chapter 11, for more
discussion on this topic.)
Krishnamurthy and Yin [66] suggest averaging both the iterates (i.e., ϕn) and the observations
(as measured by the score S(Y; ϕ)). This averaging takes the form

ϕn+1 = ΠG(ϕn + εn S̄n), (2.26)

ϕ̄n+1 = ϕ̄n − (1/(n+1)) ϕ̄n + (1/(n+1)) ϕn+1, (2.27)

S̄n+1 = S̄n − (1/(n+1)) S̄n + (1/(n+1)) Sn+1, (2.28)

with εn = 1/n^γ, 0.5 ≤ γ ≤ 1, where ϕ̄n and S̄n denote the averaged iterates and scores. In [66],
Krishnamurthy and Yin provide convergence, asymptotic optimality, and asymptotic normality
proofs for the modified algorithm.
These formulas can also be modified to work with a “fixed history” by replacing n in Equa-
tions (2.26)-(2.28) with a fixed constant k or, alternatively, min(n, k). Numerical simulations of the
original, averaging, and fixed history algorithms appear in the next section.
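A minimal sketch of the averaged recursions (2.26)-(2.28), with the fixed-history variant as an option. The score function and projection are placeholders supplied by the caller, and the choice of which quantities carry the averaging follows our reading of [66], so treat the details as illustrative.

```python
import numpy as np

class AveragedRMLE:
    """Sketch of the iterate/observation averaging of Equations (2.26)-(2.28).

    `project` stands in for Pi_G; `k` is the optional fixed history:
    replacing n with min(n, k) turns the running averages into
    fixed-history averages (k = 1 recovers no averaging).
    """
    def __init__(self, phi0, score_fn, project, gamma=0.5, k=None):
        self.phi = phi0.copy()       # raw iterate phi_n
        self.phi_bar = phi0.copy()   # averaged iterate
        self.s_bar = None            # averaged score
        self.score_fn = score_fn     # (y, phi) -> S(y; phi)
        self.project = project
        self.gamma = gamma
        self.k = k
        self.n = 0

    def step(self, y):
        self.n += 1
        eps = 1.0 / self.n ** self.gamma
        s = self.score_fn(y, self.phi)
        if self.s_bar is None:
            self.s_bar = s.copy()
        # (2.26): update the raw iterate using the averaged score
        self.phi = self.project(self.phi + eps * self.s_bar)
        # effective history length: n, or min(n, k) for fixed history
        m = self.n if self.k is None else min(self.n, self.k)
        # (2.27)/(2.28): running averages of iterates and scores
        self.phi_bar += (self.phi - self.phi_bar) / (m + 1)
        self.s_bar += (s - self.s_bar) / (m + 1)
        return self.phi_bar

# One deterministic step on a toy scalar problem: score = (target - phi).
rmle = AveragedRMLE(np.array([0.0]),
                    lambda y, phi: np.array([y - phi[0]]),
                    project=lambda p: p)
phi_bar = rmle.step(2.0)
```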
Various sources [66, 78] suggest using a fixed ε for tracking. An analysis of the RMLE algorithm
for tracking slowly varying HMM parameters also appears in [66], and we give some examples of
training with fixed ε in the next section.
2.3.4 Numerical simulations
In this section, we present a number of Monte-Carlo simulations to demonstrate the RMLE algo-
rithm under various model configurations. In the first simulation, the observations in each model
come from a one-dimensional Gaussian distribution. In the second simulation, observations are
drawn from a distribution over a finite alphabet. The third simulation uses two-dimensional Gaussian
observation densities to show how the model converges when we use a model with a large
number of states to learn from data produced by a smaller model. This setup may be useful if,
for example, we know the general extents of our data, but do not know the underlying number of
states or have a good way of initializing the model.
Implementation notes. The update formula for the parameters derived in Section 2.3.1 required
a projection term to keep the updated parameters within their constraints (i.e., at a minimum, to
maintain the stochasticity of the probability transition matrix A(ϕ) and, for observations from
a finite alphabet, the observation probabilities bjk(yn;ϕ)). In the literature, the general suggestion
has been to parameterize each length-r row of the transition matrix with r − 1 variables. For
example, in A(ϕ), each off-diagonal entry aij, i ≠ j, is kept as a free parameter, and each diagonal
entry aii is parameterized as aii = 1 − Σ_{j≠i} aij. For transition probabilities, this might be
reasonable, as self-transitions often have different meaning than other transitions (see Appendix D
for a discussion on this topic). However, in general, this type of parameterization leads to some
undesirable behavior during training, because every change in an off-diagonal entry causes an
equal and opposite change in the diagonal entry, while the rest of that row of the matrix is
unaffected. This problem is especially evident for finite-alphabet observation
probabilities, where the parameterized variable generally has no special meaning.
Our solution was to avoid this parameterization altogether and instead use Lagrange
multipliers to maintain the stochasticity constraints in the mapping ΠG. The perhaps unintuitive
result is that the mapping which brings the modified parameters to the closest point within the
constraint space simply subtracts the same amount from each parameter in aj·, taking care, of
course, that no parameter becomes less than some ε > 0. We note that, before adopting Lagrange
multipliers, we initially tried simply rescaling the parameters; this is incorrect and prevented the
model from converging.
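The projection just described can be sketched as follows: subtract a common amount from the free entries of a row, pinning any entry that would drop below the floor and redistributing the remainder. The iterative redistribution is our implementation choice; the dissertation only states the uniform-subtraction result.

```python
import numpy as np

def project_row(a, floor=1e-6):
    """Project a parameter row onto {a : sum(a) = 1, a_i >= floor}.

    The Euclidean projection subtracts the same amount from every free
    entry; entries that would fall below `floor` are pinned there and
    the remaining excess is redistributed over the still-free entries.
    """
    a = np.asarray(a, dtype=float).copy()
    free = np.ones(a.shape, dtype=bool)
    while True:
        # amount to subtract uniformly from each free entry
        excess = (a[free].sum() + floor * (~free).sum() - 1.0) / free.sum()
        a[free] -= excess
        pinned = free & (a < floor)
        if not pinned.any():
            break
        a[pinned] = floor
        free &= ~pinned
    return a
```

For a row that is merely unnormalized, this reduces to subtracting the same constant from every entry, which is exactly the mapping described in the text.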
2.3.4.1 Gaussian observations
For the first test, we generated data from a simple two-state model, with transition matrix

A = [ 0.9  0.1 ; 0.1  0.9 ]

and observations generated from Gaussians with parameters

μ = [ −1.0, 1.0 ]′,  σ = [ 0.6, 0.9 ]′.
For training, we tested both fixed step sizes (ε = 0.006, 0.003, 0.001) and decreasing
step sizes (εn = ε0/n^γ, with ε0 = 0.1, 0.3, 0.5 and 0.5 ≤ γ ≤ 1). We also varied the aver-
aging history, replacing n in Equations (2.26)-(2.28) with a fixed history of min(n, k) for k =
1, 10, 1000, 10 000, ∞, where k = 1 implies no averaging, and k = ∞ implies averaging from time
0 (i.e., min(n, k) = n).

Table 2.1: Simulation results for various combinations of learning rate ε and averaging history k.
All values were measured at 50 000 iterations, over 50 runs. Each entry gives the mean and, in
parentheses, the standard deviation of the measured value. Original model values are given in
Section 2.3.4.1.

Algorithm                  a11              a22              μ1                μ2               σ1               σ2
ε = 0.001, k = 1           0.8280 (0.7423)  0.8405 (0.7848)  −1.0002 (0.4673)  0.9791 (0.5371)  0.6059 (0.3329)  0.9044 (0.3734)
ε = 0.003, k = 1           0.8766 (0.2210)  0.8600 (0.6537)  −0.9917 (0.3211)  1.0097 (0.2758)  0.6085 (0.2683)  0.8971 (0.1702)
ε = 0.006, k = 1           0.8960 (0.0970)  0.8909 (0.1139)  −1.0029 (0.1652)  1.0046 (0.1656)  0.6039 (0.1378)  0.8986 (0.0897)
ε = 0.5/n^0.5, k = 1       0.8693 (0.5820)  0.8758 (0.6631)  −0.8636 (3.5216)  0.8650 (3.5274)  0.6360 (0.7999)  0.8882 (0.6351)
ε = 0.5/n^0.6, k = 1       0.8983 (0.0959)  0.8955 (0.0917)  −0.9613 (1.9682)  0.9587 (1.9820)  0.6074 (0.3351)  0.8913 (0.3208)
ε = 0.1/n^0.5, k = 1       0.8996 (0.0632)  0.8988 (0.0703)  −1.0008 (0.0833)  0.9966 (0.0983)  0.5977 (0.0905)  0.8996 (0.0534)
ε = 0.3/n^0.5, k = ∞       0.8754 (0.6096)  0.8181 (1.3387)  −0.7771 (3.5573)  0.7368 (3.3679)  0.7617 (1.7868)  1.0405 (1.8286)
ε = 0.3/n^0.5, k = 10 000  0.9020 (0.3473)  0.8037 (1.7091)  −0.8623 (2.3053)  0.9317 (1.6149)  0.6942 (1.6433)  1.0418 (2.3224)
ε = 0.3/n^0.5, k = 1000    0.9164 (0.2951)  0.7503 (2.1159)  −0.7701 (3.2633)  0.8142 (3.2751)  0.7417 (1.8322)  1.0791 (2.5741)

For all tests here, parameters in the learned model were started at
A = [ 0.5  0.5 ; 0.5  0.5 ],  μ = [ −0.75, −0.50 ]′,  σ = [ 1.0, 1.0 ]′.
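For concreteness, observation sequences from a two-state Gaussian-emission model like the source model above can be generated with a sketch such as the following; the sampling routine (`sample_hmm`) is our own illustration, not code from the dissertation.

```python
import numpy as np

def sample_hmm(A, mu, sigma, n, rng):
    """Sample a state path and observations from a Gaussian-emission HMM.

    Row i of A gives the transition distribution out of state i; mu and
    sigma give the per-state emission mean and standard deviation.
    """
    r = A.shape[0]
    states = np.empty(n, dtype=int)
    states[0] = rng.integers(r)
    for t in range(1, n):
        states[t] = rng.choice(r, p=A[states[t - 1]])
    obs = rng.normal(mu[states], sigma[states])
    return states, obs

# The two-state source model of Section 2.3.4.1.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
mu = np.array([-1.0, 1.0])
sigma = np.array([0.6, 0.9])
states, obs = sample_hmm(A, mu, sigma, 5000, np.random.default_rng(0))
```

Since the stationary distribution of this A is uniform, the sample mean of the observations should sit near zero over long runs.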
For each parameter combination, we ran the simulation for 50 000 iterations. We then chose
a representative subset of parameter combinations and reran each of these 50 times. Results are
summarized in Table 2.1. We discovered the following trends:
1. Convergence of all parameters occurred in less than 50 000 iterations for all fixed values of
ε, with larger values converging faster, but producing larger amplitudes of oscillation around
converged values. This behavior can be seen in Figure 2.1.
[Figure 2.1 shows nine panels: (a)-(c) transition probability estimates for ε = 0.006, 0.003, and
0.001; (d)-(f) Gaussian mean estimates for the same three values of ε; (g)-(i) Gaussian standard
deviation estimates for the same three values of ε.]
Figure 2.1: The effect of learning rate ε on parameter convergence during RMLE training, for constant ε. Notice how larger values of ε converge faster, but have more oscillations. Original model parameters are indicated by the symbol at the right edge of the graphs, and are specified in Section 2.3.4.1.
2. For decreasing εn, the models converged within 50 000 iterations only for limited combinations
of learning parameters. Larger values of ε0 generally converged faster (for those runs which
converged), but also caused larger oscillations, as was the case with large fixed ε. Smaller
values of γ (from εn = ε0/n^γ) caused faster convergence than larger values, since for larger
γ, εn decreases too quickly for the model to reach convergence. However, smaller γ also
provided less attenuation of the oscillations. Compare the three columns of Figure 2.2.
3. Longer averaging histories provided much smoother learning trajectories than shorter histo-
ries, and greatly reduced the frequency of oscillations in the learned parameters, although
the oscillation magnitude did not change much. With constant ε, this large oscillation is not
desirable. In most models with long histories, 50 000 iterations was not long enough for µ
and σ to converge. See Figure 2.3.
4. The algorithm becomes quite sensitive when one or more entries of the learned transition
probability matrix A approach zero (see Figure 2.4, page 41, for an example). Averaging
reduces this problem. Holst and Lindgren [77] also suggest parameterizing A using log
likelihoods instead of probabilities to alleviate this problem; we did not try this solution.
Discussion. The starting point of the learned model was chosen to make learning challenging,
which may explain the limited combinations of learning parameters that actually converged for
models with exponentially decreasing εn. Some of these models may simply have needed more time
to converge. Since we were testing many different parameter combinations, we initially only ran
each combination once, which may not produce results indicative of that parameter combination
(i.e., we could have been “unlucky” early in those simulations which did not converge). However,
our initial goal was to find combinations of parameters which are stable and rapidly converging
even in difficult situations, and the collected data provides this information.
While averaging did reduce the frequency of oscillations in the learned parameters, as pointed
out above, it did not reduce the magnitude of oscillation when learning the means and standard
deviations of the observation densities. This fact can possibly be explained by the following:
[Figure 2.2 shows nine panels: (a)-(c) transition probability estimates for εn = 0.5/n^0.5,
0.5/n^0.6, and 0.1/n^0.5; (d)-(f) Gaussian mean estimates for the same step-size schedules;
(g)-(i) Gaussian standard deviation estimates for the same step-size schedules.]
Figure 2.2: The effect of ε0 and γ on parameter convergence during RMLE training, with a decreasing εn. Notice that larger values of ε0 converge faster, but have larger oscillations. Larger values of γ smooth out oscillations faster, though they slow down convergence. Original model parameters are indicated by the symbol at the right edge of the graphs.
[Figure 2.3 shows nine panels: (a)-(c) transition probability estimates with averaging, for
(ε = 0.006, k = ∞), (ε = 0.3/n^0.5, k = ∞), and (ε = 0.3/n^0.5, k = 1000); (d)-(f) Gaussian
mean estimates for the same settings; (g)-(i) Gaussian standard deviation estimates for the
same settings.]
Figure 2.3: The effect of history size k on parameter averaging during RMLE training. The graphs show parameter learning for different history sizes (k = ∞, 1000) with both fixed and decreasing ε. Notice the large oscillation amplitudes for fixed ε in the first column: long histories and averaging do not work well with the constant-step-size version of the algorithm. Compare these graphs with those in Figures 2.1 and 2.2, which did not use averaging. Original model parameters are indicated by the symbol at the right edge of the graphs.
1. The flatness of the parameter space around μ and σ; i.e., small changes in these parameters
may not have much effect on the likelihood function, compared with changes to the parameters
of the transition probability matrix A.

2. The fact that we are averaging the score vector. As n becomes large, the amount that a new
score Sn+1 contributes to the averaged score S̄n+1 in Equation (2.28) decreases dramatically,
maintaining the momentum of the score vector.
These ideas suggest a few alternative approaches:
1. We could change the update for S̄n to give more weight to the current score, for example, by
rewriting the equation as

S̄n+1 = (1 − α)S̄n + αSn+1,

for 0 < α < 1. We have not tried this idea.
2. We could leave the observation scores Sn unaveraged while continuing to average the param-
eter iterates ϕn. This approach, unfortunately, did not produce converging results.
3. Since averaging is beneficial for transition probabilities, as pointed out above, and seem-
ingly disadvantageous for observation means and variances, a combined approach of obser-
vation/iterate averaging for A and iterate-only or no averaging for µ and σ could be tried.
Although results are not presented here, this approach produced some useful results.
4. We could use a different update algorithm. In particular, Schraudolph has proposed local
gain adaptation [87] and stochastic conjugate gradient [88] for stochastic training of neural
networks. Both ideas could be tried here; we have not yet attempted to implement either
algorithm.
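The fixed-weight update proposed in item 1 (which, as noted, we did not try) differs from the uniform running average of Equation (2.28) mainly in how quickly it forgets old scores. A small sketch of the contrast, on a score stream that jumps from 0 to 1 halfway through:

```python
import numpy as np

def running_average(scores):
    """Equation (2.28): uniform running average; the weight of each new
    score decays like 1/(n+1), so the average builds up momentum."""
    s_bar = scores[0].astype(float)
    for n, s in enumerate(scores[1:], start=1):
        s_bar += (s - s_bar) / (n + 1)
    return s_bar

def exponential_average(scores, alpha):
    """Proposed alternative: S_bar <- (1 - alpha)*S_bar + alpha*S,
    a fixed-weight (exponentially forgetting) average that keeps the
    current score's influence constant."""
    s_bar = scores[0].astype(float)
    for s in scores[1:]:
        s_bar = (1.0 - alpha) * s_bar + alpha * s
    return s_bar

# After the jump, the uniform average lags at the overall mean (0.5),
# while the fixed-weight average tracks the new level.
scores = np.r_[np.zeros(50), np.ones(50)]
uniform = running_average(scores)
tracked = exponential_average(scores, alpha=0.1)
```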
We also ran a set of tests on a model with two-dimensional Gaussian observations, with similar
results.
2.3.4.2 Observations from a finite alphabet
While the derivation and proof of the RMLE algorithm described in this chapter assume continuous
observation densities, the algorithm is capable of learning models with finite-alphabet observation
[Figure 2.4 shows six panels: (a)-(c) transition probability estimates with averaging (ε = 0.001,
k = ∞), with averaging (ε = 0.001, k = 1000), and without averaging (ε = 0.001); (d)-(f)
discrete observation density estimates for the same three settings.]
Figure 2.4: Examples of learning in HMMs with finite-alphabet observation densities. In these plots, we used various averaging histories (k = ∞, 1000, 1) and fixed ε = 0.001. Original model parameters are indicated by the symbol at the right edge of the graphs, and are specified in Section 2.3.4.2.
densities. Figure 2.4 shows some examples of learning in such models. The model used for training
was
A = [ 0.99  0.01 ; 0.01  0.99 ],  b = [ 0.900  0.005  0.095 ; 0.005  0.095  0.900 ].
The model being learned was initialized with
A = [ 0.5  0.5 ; 0.5  0.5 ],  b = [ 0.600  0.200  0.200 ; 0.200  0.600  0.200 ].
As with the transition probabilities, the algorithm is quite sensitive when finite-alphabet observation
probabilities approach zero, as can be seen in the graphs in the third column of Figure 2.4. As the
first two columns show, averaging helps alleviate this problem.
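A sketch of sampling symbol sequences from the finite-alphabet source model above; the sampler itself (`sample_discrete_hmm`) is illustrative, not part of the training code. With this near-deterministic A and b, the rare symbol 1 appears only occasionally, which is exactly the regime where small observation probabilities make the update sensitive.

```python
import numpy as np

# The finite-alphabet source model of Section 2.3.4.2: row j of b gives
# P(symbol | state j) over the three-symbol alphabet {0, 1, 2}.
A = np.array([[0.99, 0.01],
              [0.01, 0.99]])
b = np.array([[0.900, 0.005, 0.095],
              [0.005, 0.095, 0.900]])

def sample_discrete_hmm(A, b, n, rng):
    """Sample a symbol sequence from a discrete-observation HMM."""
    state = rng.integers(A.shape[0])
    symbols = np.empty(n, dtype=int)
    for t in range(n):
        symbols[t] = rng.choice(b.shape[1], p=b[state])
        state = rng.choice(A.shape[0], p=A[state])
    return symbols

symbols = sample_discrete_hmm(A, b, 2000, np.random.default_rng(1))
```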
2.3.5 Estimating a model with unknown model order
In most of the literature dealing with HMMs, it is generally assumed that the number of states
needed to represent an underlying process is known. When working with a real system, however,
the correct or optimal number of states may be difficult or impossible to know. A recent tutorial
paper on hidden Markov processes by Ephraim and Merhav [89] summarizes the state of the art
of order estimation in HMMs. A more recent proposal can be found in [90]. Order estimation of
continuous observation HMMs has been treated in [91, 92].
Most of the proposed approaches use a penalized-likelihood method, in which the likelihoods of
models of various orders are compared. Since the likelihood will invariably increase as the order
of the model increases, a penalty function is added to penalize larger models. The actual penalty
function used varies; see [89-92] for more details. To our knowledge, none of these techniques has
been applied to online HMM order estimation.
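The penalized-likelihood recipe can be sketched as follows. The BIC-style penalty, the likelihood values, and the parameter-count formula are illustrative placeholders, since the methods cited above differ in the penalty they actually use.

```python
import math

def select_order(loglik_by_order, num_params, n_obs):
    """Penalized-likelihood order selection: pick the order r maximizing
    log-likelihood minus a penalty that grows with model size. Here a
    BIC-style penalty (p/2) * log(n) is used as a stand-in."""
    best_r, best_score = None, -math.inf
    for r, ll in loglik_by_order.items():
        score = ll - 0.5 * num_params(r) * math.log(n_obs)
        if score > best_score:
            best_r, best_score = r, score
    return best_r

# Hypothetical likelihoods: fit improves with order but saturates.
loglik = {2: -5200.0, 4: -4900.0, 8: -4890.0}
# For an r-state HMM with 1-D Gaussian emissions: r(r-1) free transition
# parameters plus 2r emission parameters.
params = lambda r: r * (r - 1) + 2 * r
```

With these (made-up) numbers, the marginal likelihood gain from 4 to 8 states is too small to pay for the extra parameters, so the 4-state model is selected.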
For our model, numerous ad hoc methods of treating order estimation suggest themselves,
including growing the model to handle data that is not well modeled, and attempting to cover the
subspace inhabited by the incoming data. While we have not studied existing techniques in depth,
we present the results of a space-covering experiment below.
2.3.5.1 Space covering
In this section we suggest an ad hoc approach for learning the underlying state order of a set
of observations, as follows. First, we initialize a model with a large number of states, with the
observation densities initially covering the region of space occupied by the observations. In our
example, we assume that our observations will be contained in the region {(x, y) : x, y ∈ (−10, 10)},
and we choose to start with 16 states with Gaussian densities equally spaced throughout this region.
Figure 2.5 shows this setup, where the densities for each state are drawn in blue. The densities of
the states in the model to be learned are drawn in red.
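The initialization just described might be sketched as follows. The covariance scale (half the grid spacing) is our assumption, chosen only so that neighboring densities overlap slightly; the dissertation does not specify it.

```python
import numpy as np

def grid_init(n_per_axis=4, lo=-10.0, hi=10.0):
    """Initialize a space-covering HMM: Gaussian means on an even grid
    over (lo, hi) x (lo, hi), identical isotropic covariances, and a
    uniform transition matrix."""
    centers = np.linspace(lo, hi, n_per_axis + 2)[1:-1]  # interior points
    xx, yy = np.meshgrid(centers, centers)
    means = np.column_stack([xx.ravel(), yy.ravel()])    # (16, 2)
    r = means.shape[0]
    spacing = centers[1] - centers[0]
    covs = np.tile((spacing / 2) ** 2 * np.eye(2), (r, 1, 1))
    A = np.full((r, r), 1.0 / r)                         # uniform transitions
    return A, means, covs

A, means, covs = grid_init()
```

Uniform transitions give a uniform stationary distribution, matching the uniform shading of Figure 2.5.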
Figure 2.5: Initialization of an HMM with two-dimensional Gaussian observation densities. Each density is indicated on the graph by its mean and a contour line containing 80% of the density. Each density is also shaded according to the stationary probability of its state, with more likely states shaded darker. Densities of the model to be learned are drawn in red.
Note that there is no indication of transition probabilities on this graph. However, in this
figure and in the graphs in Figure 2.6, the density associated with each state is colored according
to that state’s stationary probability, derived from the stationary distribution of the transition
probability matrix A. Initially, all transition probabilities are equal, so the stationary distribution
(and therefore the distribution coloring) is uniform. Darker coloring of mean and contour lines
indicates higher stationary probability for a particular state.
The parameters of the source model in this experiment are
A = [ 0.7  0.1  0.1  0.1 ; 0.1  0.7  0.1  0.1 ; 0.1  0.1  0.7  0.1 ; 0.1  0.1  0.1  0.7 ],

μ1 = (−4.5, 4.5),  μ2 = (−1, 1),  μ3 = (2.4, −1.3),  μ4 = (5, −5),

Σ1 = [ 1.0  0.75 ; 0.75  1.5 ],  Σ2 = [ 2.0  0.5 ; 0.5  1.0 ],
Σ3 = [ 2.0  −1.5 ; −1.5  2.0 ],  Σ4 = [ 2.5  −0.1 ; −0.1  2.5 ].
Figure 2.6 documents the progression of the training.²

²The iteration numbers chosen for the graphs follow the curve y = 1000[x³], with x evenly spaced on the interval [1, ∛500] and [·] denoting rounding to the nearest integer.

Some points to note:

1. By 500 000 iterations, the model had converged to the four states of the source model, but note
that this situation is not necessarily stable. Notice, for example, that by 208 000 iterations,
the model had nearly converged to the original four-state model, but at 330 000 iterations,
[Figure 2.6 shows nine snapshots of the density layout, at (a) 1000, (b) 5000, (c) 14 000,
(d) 33 000, (e) 67 000, (f) 123 000, (g) 208 000, (h) 330 000, and (i) 500 000 iterations.]
Figure 2.6: Learning an HMM using a model with a large number of states. Here, we are learning a 16-state model with data generated from a 4-state model. The model was run with history k = 1000 and constant learning rate ε = 0.001.
some additional states have distributions which cover the same data. This phenomenon
seems to be caused mainly by observation outliers “activating” a state with a lower stationary
probability, temporarily destabilizing the model. It could be eliminated by combining states
whose distributions overlap considerably, and by removing unused or rarely used states.
2. Densities drawn in light grey indicate states that are not close to the observation data and are
seemingly unused by the model. Generally, their stationary probabilities go to zero if the
transition probabilities can go to zero, or otherwise become very small. These states could be
removed from the model, though we did not do that here.
3. During training, two or more states may have distributions covering the same data, as can
be seen at 123 000 and 330 000 iterations. As mentioned previously, it may be desirable to
combine these states. Since we have control over the spatial layout of the distributions, the
search for states to combine could be limited to states in the same topological neighborhood.
One method for deciding whether to combine two states would be to measure the Kullback-
Leibler distance between their observation distributions, and combine them if the value is
below some threshold.
4. The model took many more iterations to converge to its final configuration than in earlier
experiments. This is, again, partially due to states whose distributions compete for the
same observations. Convergence is, of course, also affected by the parameterization.
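The merge test suggested in item 3 could be sketched as follows, using the closed-form KL divergence between Gaussians; the symmetrization and the threshold value are hypothetical choices, not part of the experiments above.

```python
import numpy as np

def gauss_kl(mu0, cov0, mu1, cov1):
    """KL divergence D(N0 || N1) between two multivariate Gaussians."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def should_merge(mu0, cov0, mu1, cov1, threshold=0.1):
    """Combine two states when their observation densities nearly
    coincide, as measured by the symmetrized KL divergence."""
    sym = gauss_kl(mu0, cov0, mu1, cov1) + gauss_kl(mu1, cov1, mu0, cov0)
    return sym < threshold
```

For identical densities the divergence is zero, so the states would be merged; for well-separated means it grows quadratically with the separation and the states are kept distinct.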
Obviously, there are many caveats to this method. The user needs to pick the initial number,
size and shape (variance), and spacing of the observation distributions. As the dimension of the
data increases, the number of states required to cover a particular region of space grows
exponentially, making the method impractical for all but small spaces. Whether, when, and how to
remove or combine states was not considered here. Nevertheless, these experiments do show that it
is possible for an HMM to learn the underlying structure of a set of inputs with an on-line training
algorithm, and in doing so they validate the use of similar training in smaller models whose
observation densities are initially primed with estimates of the densities of the observation process.
2.4 HMMs as Bayesian Classifiers
The HMM presented in this chapter is an ideal model. When attempting to use it to model real
world data, such as speech, the basic assumption of the model—that the underlying sequence of
states of the real data is a Markov chain—is almost certainly untrue. What the model does provide,
however, is an improvement over the assumption that observations in a sequence are independent
of each other. That is, an assumption is made that the sequence of observations is important, and
that we can model some of the characteristics of that sequence with a first-order Markov chain.
We can see this improvement explicitly by analyzing an HMM as a stochastic classifier. A Bayes
classifier attempts to classify an input by estimating the posterior probabilities of an observation
using Bayes’ rule, i.e., for observation y and class xi,
P (xi|y) =p(y|xi)P (xi)∑
i p(y|xi)P (xi). (2.29)
If we know the prior probabilities of each xi and the prior distributions of y for each xi, this
classifier is optimal; i.e., it is the classifier with the lowest probability of error [93]. In fact, we
cannot in general know these distributions exactly, but the better our estimate of them, the better
our classifier. An HMM provides, at each time, an improved estimate of P (xi) by assuming that
the underlying sequence of states can be modeled as a Markov chain. The following analysis shows
why.
For comparison to our model definition earlier in this chapter, it will be convenient to write
Bayes’ rule in matrix form. Let ui = P (xi) be the prior probability of class i, and let u =
[u1, . . . , ur]′. Let b(y; θi) = p(y|xi) be the prior likelihood of y for class i, and let
b(y) = [b(y; θ1), . . . , b(y; θr)]′, (2.30)
where θi represents the parameters of the probability density function associated with class i. As
with our HMM analysis, let B(y) = diag[b(y)]. Finally, let fi = P (xi|y) be the posterior probability
of state i for observation y, and let f = [f1, . . . , fr]′. We can then rewrite Equation (2.29) as

f = B(y)u / (b(y)′u). (2.31)
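In code, Equation (2.31) is a one-liner, since B(y) = diag[b(y)] makes the numerator an elementwise product; the example likelihood and prior values below are arbitrary.

```python
import numpy as np

def bayes_posterior(b_y, u):
    """Equation (2.31): posterior class probabilities f = B(y)u / (b(y)'u),
    where b_y[i] = p(y | x_i) and u[i] = P(x_i)."""
    f = b_y * u        # B(y)u: elementwise, since B(y) = diag[b(y)]
    return f / f.sum() # normalize by b(y)'u

# Two classes with equal priors and likelihoods 0.3 and 0.1 for the
# observed y: the posterior splits 3:1.
f = bayes_posterior(np.array([0.3, 0.1]), np.array([0.5, 0.5]))
```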
For an HMM, let fn(ϕ) be the probability distribution of Xn, i.e.,
fn(ϕ) = [fn1(ϕ), . . . , fnr(ϕ)]′ (2.32)
where
fni(ϕ) = P (Xn = i|y1, . . . , yn). (2.33)
That is, fni(ϕ) is the probability that the state (class) of the model is i at time n, given all
observations through time n. Using the variable definitions from Section 2.2, fn(ϕ) can be calculated
as
fn(ϕ) = B(yn;ϕ)un(ϕ) / (b(yn;ϕ)′un(ϕ)), (2.34)

where un(ϕ), b(yn;ϕ), and B(yn;ϕ) are defined as before.
Comparing Equations (2.31) and (2.34), clearly an HMM is a Bayesian classifier. As suggested
above, for sequential data, un(ϕ) provides a better estimate of the prior class probabilities than a
static prior u. We can see this by rewriting Equation (2.13) as
un+1(ϕ) = A(ϕ)′fn(ϕ), (2.35)
where A(ϕ) is the Markov transition probability matrix defined in Section 2.2. The prior estimate
at each time is thus a weighted average of the transition probability vectors, with the weights
at time n + 1 determined by the probability distribution of Xn. Hence, our prior changes with every
observation, and the improved estimate of the class priors will give us an improved classifier for
sequential data, to the extent that our Markov chain accurately models the first-order statistics of
the underlying state sequence.
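The recursion of Equations (2.34) and (2.35) can be sketched as a filter loop; the example below reuses the two-state Gaussian source model of Section 2.3.4.1, and the function name and interface are ours.

```python
import numpy as np

def hmm_filter(A, b_fn, ys, u0):
    """Run an HMM as a sequential Bayes classifier: at each step the
    posterior f_n is computed from the current prior u_n (Equation
    (2.34)), and the next-step prior is u_{n+1} = A' f_n (Equation
    (2.35))."""
    u = u0.copy()
    posteriors = []
    for y in ys:
        f = b_fn(y) * u
        f /= f.sum()       # Equation (2.34): posterior over states
        posteriors.append(f)
        u = A.T @ f        # Equation (2.35): prior for the next step
    return np.array(posteriors)

# Two-state Gaussian model from Section 2.3.4.1.
mu, sigma = np.array([-1.0, 1.0]), np.array([0.6, 0.9])
b_fn = lambda y: (np.exp(-0.5 * ((y - mu) / sigma) ** 2)
                  / (sigma * np.sqrt(2 * np.pi)))
A = np.array([[0.9, 0.1], [0.1, 0.9]])
post = hmm_filter(A, b_fn, ys=[-1.1, -0.9, 1.2], u0=np.array([0.5, 0.5]))
```

With a static prior u, the same loop without the final line reduces to the plain Bayes classifier of Equation (2.31).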
2.5 Discussion
This chapter explored the use of the RMLE algorithm for on-line training of HMMs. We have
successfully trained models using the algorithm, exploring how various combinations of training
parameters affect learning. We have also successfully demonstrated that a large model with states
whose distributions cover a section of space can correctly learn the structure of that space.
We have available an on-line learning algorithm for a model which can discern the underlying
structure of a set of inputs, which can turn continuous inputs into discrete states, and which learns
a notion of sequence among those states. We believe this type of model is ideal for our proposed
cascade structure for semantic learning, described in Section 1.4. The next chapter will describe
our implementation of that cascade structure using HMMs with the RMLE algorithm.
CHAPTER 3
CASCADE OF HMMS: THEORY AND SIMULATION
3.1 Introduction
In Section 1.4, we introduced a general theory of semantic learning, suggesting that we form se-
mantic concepts through associations among inputs from multiple sensory modalities. Motivated
by these ideas, we have developed a new model based on a cascade of HMMs. Our ideas follow a
long line of research on attempts to identify additional structural information present in real-world
data, and incorporate that structural information into hidden Markov and related models. Below,
we review some of the previous research in this area. We then present an abstract description of
our cascade model, which attempts to model additional structure present in multiple data streams.
We discuss convergence of the model using the RMLE algorithm presented in the previous chapter,
and show some simulation results. In the next chapter, we will describe the use of this model as an
associative memory.
3.2 HMMs for Learning Structure
3.2.1 Unimodal structure
Considering the ideas in Section 2.4, one of the benefits of using HMMs for unsupervised learning is
that they have a notion of sequence, and can therefore learn some of the sequential structure of the
data during training. Later, knowledge of this additional structure can be used to better classify
data. However, HMMs do not account for all of the structure in a sequence, and one question that
comes to mind is whether we can learn more of this structure.
One of the first attempts at modeling additional structure beyond HMMs was the hidden semi-
Markov model (HSMM), first introduced as the variable-duration HMM (VDHMM) [94–96]. HSMMs
still assume that the underlying sequence of states is a Markov chain, but that each state emits a
variable length sequence of observations, which is a more accurate model of, e.g., speech. Generally,
then, the model includes a specific probability density function (pdf) for the duration spent in each
state, as well as a pdf for the (variable length) sequence of observations for each state.1 During our
work on this dissertation, we explored the use of HSMMs, and derived a version of the RMLE for
the model (which we ultimately did not use). A formal description of this model and our RMLE
derivation appear in Appendix D.
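As a concrete sketch of the HSMM generative story just described, the following toy example draws an explicit duration for each state visit and then emits that many observations. This is not code from the dissertation; the two-state model, its duration pmfs, and its Gaussian emission parameters are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state HSMM. Durations are explicit, so the state chain
# itself has no self-loops.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
dur_pmf = [np.array([0.2, 0.5, 0.3]),   # P(duration = 1, 2, 3) in state 0
           np.array([0.6, 0.3, 0.1])]   # P(duration = 1, 2, 3) in state 1
means = np.array([0.0, 5.0])            # Gaussian emission mean per state
sigma = 1.0

def sample_hsmm(n_steps, state=0):
    """Generate (states, observations) until n_steps emissions are produced."""
    states, obs = [], []
    while len(obs) < n_steps:
        d = rng.choice(len(dur_pmf[state]), p=dur_pmf[state]) + 1  # draw duration
        for _ in range(d):                                         # emit d observations
            states.append(state)
            obs.append(rng.normal(means[state], sigma))
        state = rng.choice(len(A), p=A[state])                     # Markov transition
    return np.array(states[:n_steps]), np.array(obs[:n_steps])

x, y = sample_hsmm(1000)
```

A standard HMM would instead model the duration implicitly through self-loop probabilities, which forces a geometric duration distribution; the explicit pmf here is the point of the HSMM.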
Although not originally described as such, the hierarchical hidden Markov model (HHMM)
[97, 98] is a particular implementation of an HSMM which, in the terminology we use above,
assumes that the pdf of the variable length observation sequence is itself modeled by an HMM
(or another HHMM). This model has been shown [97] to be able to extract, in an unsupervised
manner, some of the higher order structure in real world data (specifically, text) that we alluded
to above.
Factorial HMMs [99] provide another approach to learning more complex structure in a stream
of data. These models assume that data output is produced from an interaction of multiple, loosely
coupled processes, and therefore can be better modeled with a distributed state representation.
This model was shown to discover complex structure in some of the melody lines of J. S. Bach's
chorales that cannot be captured by traditional HMMs.
Following this progression, researchers have recently proposed more complicated models that
better capture the underlying structure of a given signal. One example is the switching
state-space model, also known as a hybrid model [100, 101]. These models assume an
underlying Markov chain modeling discrete states, with the observations in each state assumed to
be produced by a Kalman filter. The Markov chain thus switches among various Kalman filters
¹Often the observations are treated as independent random variables and their individual likelihoods for each class are multiplied together; i.e., if b_j(y_n, y_{n+1}, ..., y_{n+m−1}) represents the m-dimensional pdf of a sequence of observations for state j of a model, the likelihood will often be calculated as b_j(y_n, y_{n+1}, ..., y_{n+m−1}) = b_j(y_n) b_j(y_{n+1}) ··· b_j(y_{n+m−1}).
producing output, and the Kalman filters are assumed to better model the short-term dynamics of
the observations produced by a particular state.
All of the above models assume a discrete sequence of states, transitioning according to a Markov
chain, with each state producing a possibly variable length observation sequence. In some cases,
such as the HHMM, it is possible to look at state sequences at multiple resolutions, although the
underlying sequence at each resolution is still assumed to be Markovian. With the exception of the
switching state-space models, all of the HMM-based approaches described above are mathematically
equivalent to a standard HMM, although, of course, the corresponding HMM would often be rather
large and complicated.
A class of discrete models known as stochastic grammars are the next level of complexity with
regard to modeling structured sequences [102, 103]. These grammars are stochastic versions of
those in the language grammar hierarchy proposed by Chomsky [104], and in fact, some HMMs
are equivalent to right-linear stochastic grammars. Unfortunately, algorithms for working with
stochastic grammars are computationally expensive and rather unwieldy to work with, and so to
our knowledge, little practical work has been done with them.
The various models described above have generally been used for supervised learning, although
in theory, they could all be run in an unsupervised fashion to discover structure in unimodal data.
3.2.2 Multimodal structure
The previous section discussed a class of models that modeled structure within a particular stream
of data. In our work we are interested in discovering structure among inputs from multiple streams.
Within the family of HMMs and related stochastic models, a few models have been proposed
that try to deal with multiple streams of data, typically streams representing visual and auditory
information. Two variations are coupled HMMs (CHMMs) [64, 65] and fused hidden Markov models
(fused HMMs) [62, 63]. CHMMs tie together two individual hidden Markov models by introducing
a conditional probability between the state variables of the two models. Fused-HMMs work in a
similar way, but model the joint distribution of the observation and state sequences of both models.
Compared with our work, the most striking difference is in how we represent the relationship
among multiple input models. Our proposal is to model the relationship between the two input
Figure 3.1: Semantic memory implemented using HMMs. Each model in the left diagram is modeled
by a single HMM in the right diagram.
models not as a conditional or joint pdf, but with a third hidden Markov model. An important
aspect of this approach is that our model is compositional. That is, the states of the input models
are considered to be functions of the state of the third model, and as in regular HMMs, the outputs
are considered to be functions of these states. To our knowledge, this type of compositional cascade
model is not found in the literature. As an added benefit of our approach, we can use well known
algorithms for learning and inference in all three HMMs. We describe our model in the following
section.
3.3 Cascade of HMMs
As shown in Figure 3.1, we are using HMMs for each of the individual models in the semantic
memory model we proposed in the introduction to this dissertation. This figure shows the topology
of our model, as a cascade of HMMs with two lower "input" HMMs, ϕ^{l1} and ϕ^{l2}, and one upper
"concept" HMM ϕ^u, each defined as in Section 2.2. As stated previously, we propose to use the
lower HMMs to individually learn a set of classes of sensory input data in an unsupervised manner,
and the upper HMM to learn states representing frequent co-occurrences in the classifications of
the lower models. The remaining description in this chapter will be from the point of view of this
abstract model.
Figure 3.2: An HMM cascade model. Abstractly, we assume information is arriving from two distinct
but related input streams, y^{l1}_n and y^{l2}_n. This information is recognized/learned by HMMs
ϕ^{l1} and ϕ^{l2}, respectively, which produce estimated state sequences x̂^{l1}_n and x̂^{l2}_n.
These state sequences are then recognized/learned by HMM ϕ^u. All learning is unsupervised,
using RMLE.
3.3.1 Model description
Formally, let the topology of our cascade model be as shown in Figure 3.2; i.e., let our cascade model
ϕ = {ϕ^{l1}, ϕ^{l2}, ϕ^u}, where the component models ϕ^{l1}, ϕ^{l2}, and ϕ^u are HMMs defined
according to the description in Section 2.2. Let {X^{l1}_n, Y^{l1}_n}, {X^{l2}_n, Y^{l2}_n}, and
{X^u_n, Y^u_n} be the state and observation sequences corresponding respectively to ϕ^{l1}, ϕ^{l2},
and ϕ^u. In this model, the observations Y^{lj}_k of the lower models ϕ^{lj} are generally assumed
to be continuous. The observations Y^u_k of the upper model ϕ^u are the concatenated states of the
lower-level models; i.e., Y^u_k = (X^{l1}_k, X^{l2}_k), and ϕ^u models the joint distribution of
X^{l1}_k and X^{l2}_k for each state j = 1, ..., r^u, where r^u is the number of states in ϕ^u. To
simplify calculations and for future considerations, the joint observation density of ϕ^u is modeled
assuming independence between the components of its input.
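The data flow just described — lower HMMs filter continuous observations into state estimates, whose pair becomes the upper HMM's observation — can be sketched in a few lines. This is a minimal illustration with hypothetical parameters (two toy lower models and a two-state upper model), using the factored upper observation density mentioned above; it is not the dissertation's implementation.

```python
import numpy as np

def gauss(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def forward_step(p_prev, A, b_y):
    """One HMM forward-filter update: p_n(j) ∝ sum_i p_{n-1}(i) A[i,j] b_j(y_n)."""
    p = (p_prev @ A) * b_y
    return p / p.sum()

# Hypothetical lower models: 3 and 2 Gaussian states respectively.
A1 = np.full((3, 3), 1/3); mu1 = np.array([0., 5., 9.]); var1 = np.ones(3)
A2 = np.full((2, 2), 1/2); mu2 = np.array([-1., 4.]);    var2 = np.ones(2)
# Upper model observes the pair of lower-state labels; with the independence
# assumption its observation probability factors: b^u_j(x1, x2) = b1u_j[x1] * b2u_j[x2].
Au = np.full((2, 2), 1/2)
b1u = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])  # P(x^{l1} | upper state)
b2u = np.array([[0.9, 0.1], [0.1, 0.9]])            # P(x^{l2} | upper state)

p1, p2, pu = np.ones(3)/3, np.ones(2)/2, np.ones(2)/2
for y1, y2 in [(0.1, -0.9), (5.2, 4.1), (8.8, 4.0)]:     # toy input streams
    p1 = forward_step(p1, A1, gauss(y1, mu1, var1))
    p2 = forward_step(p2, A2, gauss(y2, mu2, var2))
    x1, x2 = int(p1.argmax()), int(p2.argmax())          # lower-state estimates
    pu = forward_step(pu, Au, b1u[:, x1] * b2u[:, x2])   # fed upward as y^u_n
```

Each lower filter is updated from its own continuous observation, and only the discrete argmax classifications are passed upward, exactly as in Figure 3.2.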
3.3.2 Recursive maximum-likelihood estimation for the cascade model
Even though the individual component models are generative, as discussed in Section 2.2, it is im-
possible to generate data with this model with sufficient statistics to identify all model parameters.
To see this, suppose we use the upper model ϕ^u to generate state-pair sequences {y^{u,1}_n, y^{u,2}_n},
and use these as the states of the lower models; i.e., let x^{l1}_n = y^{u,1}_n and
x^{l2}_n = y^{u,2}_n. In this situation, there
Figure 3.3: A dynamic Bayesian network (DBN) showing the dependence among output and state
variables assumed by our cascade HMM. The cascade HMM cannot generate these dependencies,
but this DBN can be fully implemented using a switching HMM.
is no direct Markov dependence between x^{lγ}_n and x^{lγ}_{n−1}, because we do not use the state
transition matrix A(ϕ^{lγ}). On the other hand, if we use the state transition matrix to generate
the next state, then there is no dependence on the upper model.
To alleviate this problem, we will make a slight modification of our original model for generative
purposes, and then use our proposed cascade model for learning and inference. The modification
we need is to make the states x^{lγ}_n of the lower models dependent both on x^{lγ}_{n−1} and on
y^{u,γ}_n. A dynamic Bayesian network (DBN) showing this relationship graphically is given in
Figure 3.3. To generate this dependence, we define a modification of an HMM called a switching
HMM, whose name refers to its structural similarities to the switching state-space models
mentioned earlier.
A switching HMM is a discrete-time stochastic process with two components, {X_n, Y_n}, defined
on a probability space (Ω, F, P). Let {X_n}_{n=1}^∞ be a discrete-time process with state space
R = {1, ..., r}. Unlike an HMM, in a switching HMM the dynamics of X_n are determined by a set
of Markov chains {A_m(λ)}, m = 1, ..., s, with each chain having order r and the transition
probabilities for each chain defined as usual. An external discrete signal q_n ∈ {1, ..., s}
determines which Markov chain to use for the transition from X_{n−1} to X_n. As in an HMM, the
process Y_n is a probabilistic function of X_n, as we have defined previously. Let λ be the vector
of parameters for this model. The topology of this model is shown in Figure 3.4.

Figure 3.4: A switching HMM. Each Markov chain in the model has the same number of states, with
the same state in each chain corresponding to the same observation probability density function.
Input q_n chooses which Markov chain to use for the transition from x_{n−1}.
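Sampling from a switching HMM as just defined takes only a few lines. The parameters below are hypothetical, and the switch sequence is drawn at random here rather than produced by an upper HMM, purely to exercise the definition; this is a sketch, not the dissertation's code.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical switching HMM: s = 2 Markov chains over r = 3 states; the
# external switch q_n selects which transition matrix moves x_{n-1} to x_n.
A = np.array([[[0.90, 0.05, 0.05],
               [0.80, 0.10, 0.10],
               [0.80, 0.10, 0.10]],
              [[0.10, 0.10, 0.80],
               [0.10, 0.10, 0.80],
               [0.05, 0.05, 0.90]]])
means = np.array([0.0, 5.0, 9.0])    # one Gaussian emission per state, shared by chains

def sample_switching_hmm(switches, x0=0, sigma=0.5):
    xs, ys = [], []
    x = x0
    for q in switches:               # q_n chooses the chain for this transition
        x = rng.choice(3, p=A[q, x])
        xs.append(x)
        ys.append(rng.normal(means[x], sigma))
    return np.array(xs), np.array(ys)

q = rng.choice(2, size=500)          # random switches here; in the cascade they
xs, ys = sample_switching_hmm(q)     # would come from the upper HMM's output
```

Note that every chain shares the same observation densities, as the Figure 3.4 caption states: the switch changes only the transition dynamics, not the emission model.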
Proceeding, let the topology of our generating model be as shown in Figure 3.5, and call this
model a cascaded switching HMM. Define ϕ = {ϕ^u, λ^{l1}, λ^{l2}}, where ϕ^u is a finite-alphabet-
observation HMM as defined in Section 2.2, and λ^{l1} and λ^{l2} are switching HMMs as defined
above. We could, if we wished, attempt to estimate the model parameters of the original cascaded
switching HMM using a model with identical structure (and in fact, we have begun to look at this
model, but have not completed sufficient analysis to include results here). Instead, we will
approximate ϕ with a cascade model ϕ̂ = {ϕ̂^u, ϕ̂^{l1}, ϕ̂^{l2}}, as shown in Figure 3.6.
Figure 3.5: A cascaded switching HMM. As a generator, an HMM ϕ^u outputs a discrete pair
{y^{u,1}_n, y^{u,2}_n}, the components of which become the switches for a pair of switching HMMs
λ^{l1} and λ^{l2}, selecting which Markov chain is used to determine the next transition.
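The generator of Figure 3.5 can be sketched by letting an upper HMM emit the switch pair at each step. All parameters below are hypothetical stand-ins (two states everywhere, for brevity); the point is only the wiring: the upper emission becomes the lower models' switches.

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal sketch of a cascaded switching HMM generator (hypothetical parameters).
Au  = np.array([[0.98, 0.02], [0.02, 0.98]])   # upper transition matrix
bu1 = np.array([[0.9, 0.1], [0.1, 0.9]])       # P(y^{u,1} | upper state)
bu2 = np.array([[0.8, 0.2], [0.2, 0.8]])       # P(y^{u,2} | upper state)
# Each lower switching HMM: 2 chains over 2 states (rows sum to 1).
Al = np.array([[[0.9, 0.1], [0.8, 0.2]],
               [[0.2, 0.8], [0.1, 0.9]]])

def generate(n):
    xu, x1, x2 = 0, 0, 0
    out = []
    for _ in range(n):
        xu = rng.choice(2, p=Au[xu])         # upper Markov step
        q1 = rng.choice(2, p=bu1[xu])        # switch for lower model 1
        q2 = rng.choice(2, p=bu2[xu])        # switch for lower model 2
        x1 = rng.choice(2, p=Al[q1, x1])     # lower transitions depend on both
        x2 = rng.choice(2, p=Al[q2, x2])     # x_{n-1} and the switch
        out.append((int(xu), int(x1), int(x2)))
    return out

seq = generate(200)
```

This is exactly the dependence structure of the DBN in Figure 3.3: each lower state depends on its own predecessor and on the upper model's current emission.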
Figure 3.6: Monte Carlo simulation for learning a cascaded switching HMM ϕ using a cascade
HMM ϕ̂. The model on the left is a generative model, generating data for the model on the right
to learn.
Comparing the two models, we note that (1) in the cascaded switching HMM, state transitions
are determined by a set of transition probability matrices {A_m(λ^{l1})}, which we attempt to model
by a single transition probability matrix A(ϕ̂^{l1}) in the cascade HMM, and (2) in the cascaded
switching HMM, the joint observation densities in ϕ^u represent the selection of Markov chains in
the switching HMMs, whereas in the cascade HMM, the joint distribution in ϕ̂^u corresponds to the
actual states in the lower models. We suggest that generally this joint distribution over states will
be sufficient to identify states in the original HMM ϕ^u. However, we note that not all cascaded
switching HMMs ϕ will be identifiable. An example of a switching HMM that is not identifiable
by our cascade model can be constructed (1) by selecting a particularly simple form for ϕ^u, such
that each state deterministically selects a single Markov chain in each of the switching HMMs, and
then (2) by considering transition probability matrices {A_m(λ^{lγ})}, γ = 1, 2, whose stationary
distributions are identical but whose actual transitions differ.
Proceeding, for the following analysis, assume that the number of states and the form of the
density function in each HMM and switching HMM in the original model ϕ are known, and that we
are attempting with ϕ̂ to learn a set of first-order transition probabilities and observation density
parameters representing the original data. Consider the state–observation sequence pair
{x^{l1}_n, y^{l1}_n}. Even though this sequence was not generated by an HMM, there exists an HMM
that represents the first-order statistics of this sequence, i.e., one that exactly matches the
first-order transition probabilities and observation densities of this sequence. As shown in the
previous chapter, the model ϕ̂^{l1} in our cascade structure will converge to this model when
updated using the recursive maximum-likelihood algorithm presented in Chapter 2. The same applies
to ϕ̂^{l2}.

Next, consider the estimated composite state sequence {x̂^{l1}_n, x̂^{l2}_n} recognized by the
models ϕ̂^{l1} and ϕ̂^{l2}. We will assume that, as models ϕ̂^{l1} and ϕ̂^{l2} converge, this
sequence will be representative of the true state sequences in the switching HMMs λ^{l1} and λ^{l2}
which generated the data. As above, we note that there then exists an HMM that can represent the
first-order statistics of this sequence, which, again, we can learn through recursive
maximum-likelihood estimation. Each state in model ϕ̂^u will correspond to a unique state in ϕ^u
if the joint distribution of states in λ^{l1} and λ^{l2} is unique for each state in ϕ^u.
3.3.3 Numerical simulations
We will be using the setup in Figure 3.6 for numerical simulations. Since the structure of the
generating model and the learning model are different, we cannot directly compare the learned
parameter values. What we can show in simulation is
1. that the likelihood for each model increases during training, and
2. that the learned models can classify the original data and produce state sequences represen-
tative of the original model sequences.
For the simulation, we used the following parameters for the generative cascaded switching HMM.
HMM ϕ^u was a three-state finite-alphabet HMM, with parameters

    A(ϕ^u) = | 0.98  0.01  0.01 |
             | 0.01  0.98  0.01 |
             | 0.01  0.01  0.98 |,

    b^{l1}(ϕ^u) = | 0.8  0.1  0.1 |
                  | 0.1  0.8  0.1 |
                  | 0.1  0.1  0.8 |,

and

    b^{l2}(ϕ^u) = | 0.48  0.48  0.02  0.02 |
                  | 0.02  0.02  0.94  0.02 |
                  | 0.02  0.02  0.02  0.94 |,

where, to simplify calculations, the discrete observations are assumed to be independent and mod-
eled with two discrete, finite probability mass functions b^{l1}(ϕ^u) and b^{l2}(ϕ^u).
Switching HMM λ^{l1} was modeled using three probability transition matrices

    A_1(λ^{l1}) = | 0.90  0.05  0.05 |
                  | 0.80  0.10  0.10 |
                  | 0.80  0.10  0.10 |,

    A_2(λ^{l1}) = | 0.10  0.80  0.10 |
                  | 0.05  0.90  0.05 |
                  | 0.10  0.80  0.10 |,

and

    A_3(λ^{l1}) = | 0.10  0.10  0.80 |
                  | 0.10  0.10  0.80 |
                  | 0.05  0.05  0.90 |,

and single-dimensional Gaussian observation pdfs, with parameters

    µ = [ 0  7  9 ]′,   σ² = [ 2.0  0.8  0.6 ]′.

Similarly, switching HMM λ^{l2} was modeled using four probability transition matrices

    A_1(λ^{l2}) = | 0.90  0.08  0.01  0.01 |
                  | 0.50  0.50  0.00  0.00 |
                  | 0.50  0.50  0.00  0.00 |
                  | 0.50  0.50  0.00  0.00 |,

    A_2(λ^{l2}) = | 0.00  0.50  0.50  0.00 |
                  | 0.01  0.90  0.08  0.01 |
                  | 0.00  0.50  0.50  0.00 |
                  | 0.00  0.50  0.50  0.00 |,

    A_3(λ^{l2}) = | 0.00  0.50  0.50  0.00 |
                  | 0.00  0.50  0.50  0.00 |
                  | 0.01  0.08  0.90  0.01 |
                  | 0.00  0.50  0.50  0.00 |,

    A_4(λ^{l2}) = | 0.00  0.00  0.50  0.50 |
                  | 0.00  0.00  0.50  0.50 |
                  | 0.00  0.00  0.50  0.50 |
                  | 0.01  0.01  0.08  0.90 |,

and Gaussian observation pdfs with parameters

    µ = [ −1  3  6  9 ]′,   σ² = [ 0.6  0.5  0.8  0.7 ]′.
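For readers who want to check these numbers, the λ^{l1} parameters above can be transcribed and verified to be row-stochastic, and the stationary distribution of each chain (the quantity relevant to the identifiability remark earlier in this section) computed directly. This is a small numpy sketch, not code from the dissertation.

```python
import numpy as np

# Transition matrices A_1..A_3 of switching HMM λ^{l1}, transcribed from the text.
A_l1 = np.array([
    [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.80, 0.10, 0.10]],
    [[0.10, 0.80, 0.10], [0.05, 0.90, 0.05], [0.10, 0.80, 0.10]],
    [[0.10, 0.10, 0.80], [0.10, 0.10, 0.80], [0.05, 0.05, 0.90]],
])
mu_l1 = np.array([0.0, 7.0, 9.0])          # Gaussian means
var_l1 = np.array([2.0, 0.8, 0.6])         # Gaussian variances

assert np.allclose(A_l1.sum(axis=2), 1.0)  # every chain is row-stochastic

def stationary(P):
    """Stationary distribution: left eigenvector of P with eigenvalue 1."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

pi1 = stationary(A_l1[0])   # chain 1 concentrates mass on state 1 (index 0)
```

Each chain A_m strongly favors its own state m, which is what lets the upper model's switch choice leave a visible signature in the lower state sequence.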
For the learning model ϕ̂ = {ϕ̂^u, ϕ̂^{l1}, ϕ̂^{l2}}, transition probabilities for all HMMs were
initialized uniformly. The finite-alphabet observation densities in ϕ̂^u were initialized randomly,
and the Gaussians in ϕ̂^{l1} and ϕ̂^{l2} were initialized by running the generative model for 1000
iterations and using k-means clustering to determine a set of starting means and variances.² To
make the problem slightly more interesting, Gaussian noise with zero mean and standard deviation
one was then added to the initial means, and noise with zero mean and standard deviation 0.5 was
added to the variances. For recursive maximum-likelihood training, we let the learning rate
ε_n = 0.006/n^{0.2}, where n is the iteration number, and we used a smoothing history of k = 1000
(see Section 2.3.3).
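The initialization procedure can be sketched roughly as follows: a plain Lloyd's-algorithm k-means on one-dimensional samples, followed by the noise perturbation described above. The sample data here is a hypothetical stand-in for 1000 draws from the generator, and `kmeans_init` is an illustrative helper, not the author's code.

```python
import numpy as np

rng = np.random.default_rng(3)

def kmeans_init(y, k, iters=50):
    """Lloyd's algorithm on 1-D data; returns sorted cluster means and variances."""
    centers = rng.choice(y, size=k, replace=False)
    labels = np.zeros(len(y), dtype=int)
    for _ in range(iters):
        labels = np.abs(y[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep old center if a cluster empties
                centers[j] = y[labels == j].mean()
    variances = np.array([y[labels == j].var() if np.any(labels == j) else 1.0
                          for j in range(k)])
    order = np.argsort(centers)
    return centers[order], variances[order]

# Stand-in for 1000 samples from the generative model's Gaussian mixture.
y = np.concatenate([rng.normal(0, 1.4, 400), rng.normal(7, 0.9, 300),
                    rng.normal(9, 0.8, 300)])
mu0, var0 = kmeans_init(y, 3)
mu0 += rng.normal(0, 1.0, 3)                 # perturb means, as in the text
var0 = np.abs(var0 + rng.normal(0, 0.5, 3))  # perturb variances, kept nonnegative
```

The perturbation step deliberately moves the starting point away from the k-means optimum so that the simulation actually exercises the RMLE updates.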
Figure 3.7 shows the output of a training run for model ϕ̂^u. Figure 3.7(a) gives a running average
of the log-likelihood of the observations versus time. The remaining subfigures show the convergence

²In fact, this initialization could be done randomly as well, but k-means clustering and similar techniques are commonly used to give a set of initial parameters which will converge in a reasonable amount of time [93].
Figure 3.7: Parameter learning for model ϕ̂^u: (a) running average of the log-likelihood
(1/n) log p_{ϕ̂^u}(y_1, ..., y_n); (b) training of transition probability matrix A(ϕ̂^u);
(c) training of observation probability matrix b^{l1}(ϕ̂^u); (d) training of observation
probability matrix b^{l2}(ϕ̂^u). Although the learned model cannot be directly compared to the
original model, these graphs show that the model parameters converge.
of the parameters of ϕ̂^u during training. Since we cannot compare this trained model directly to
the original model ϕ^u, it is difficult to draw conclusions from these graphs with regard to the
"goodness" of the model. What we can say is that the parameters did converge, and that as they
converged, the log-likelihood of the observations generally increased. Note that since we are doing
stochastic optimization, we are not guaranteed to always increase the likelihood; hence there may
be an occasional dip in the likelihood graphs, especially near the beginning. Similar graphs for
models ϕ̂^{l1} and ϕ̂^{l2} are shown in Figures 3.8 and 3.9. Note that since we used k-means
clustering to initialize ϕ̂^{l1} and ϕ̂^{l2}, their density estimates started quite close to their
optimal values.
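The running-average log-likelihood plotted in panel (a) of Figures 3.7–3.9 can be computed from the per-step normalizers of the scaled forward filter, since each normalizer equals p(y_n | y_1, ..., y_{n−1}). The sketch below uses a hypothetical two-state Gaussian HMM and is not the dissertation's code.

```python
import numpy as np

def running_avg_loglik(ys, A, means, var):
    """Return (1/n) log p(y_1..y_n) for every n, via the scaled forward filter."""
    r = len(means)
    p = np.ones(r) / r
    total, out = 0.0, []
    for n, y in enumerate(ys, start=1):
        b = np.exp(-0.5 * (y - means) ** 2 / var) / np.sqrt(2 * np.pi * var)
        p = (p @ A) * b
        c = p.sum()              # c_n = p(y_n | y_1, ..., y_{n-1})
        p /= c                   # renormalize the filter
        total += np.log(c)       # accumulate log-likelihood
        out.append(total / n)
    return np.array(out)

A = np.array([[0.9, 0.1], [0.1, 0.9]])   # hypothetical two-state model
avg = running_avg_loglik(np.array([0.1, 0.2, 4.9, 5.1]), A,
                         np.array([0.0, 5.0]), np.array([1.0, 1.0]))
```

Accumulating log c_n rather than multiplying raw probabilities avoids the numerical underflow that would otherwise occur over tens of thousands of observations.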
One way to compare the models in our simulation is by finding the correspondence between
the states of ϕ^u and ϕ̂^u after training (or after the model parameters seem to have converged)
Figure 3.8: Parameter learning for model ϕ̂^{l1}: (a) running average of the log-likelihood
(1/n) log p_{ϕ̂^{l1}}(y^{l1}_1, ..., y^{l1}_n); (b) training of transition probability matrix
A(ϕ̂^{l1}); (c) training of Gaussian observation density means µ(ϕ̂^{l1}); (d) training of
Gaussian observation density variances σ²(ϕ̂^{l1}); (e) data histogram and observation
distribution learned by the model. Although the learned model cannot be directly compared to the
original model, these graphs show that the model parameters converge.
Figure 3.9: Training run output for model ϕ̂^{l2}: (a) running average of the log-likelihood
(1/n) log p_{ϕ̂^{l2}}(y^{l2}_1, ..., y^{l2}_n); (b) training of transition probability matrix
A(ϕ̂^{l2}); (c) training of Gaussian observation density means µ(ϕ̂^{l2}); (d) training of
Gaussian observation density variances σ²(ϕ̂^{l2}); (e) data histogram and observation
distribution learned by the model. Although the learned model cannot be directly compared to the
original model, these graphs show that the model parameters converge.
Figure 3.10: State sequence comparison between the generative HMM ϕ^u (top panel) and the
learned HMM ϕ̂^u (bottom panel). As can be seen from the figure, state 1 in model ϕ̂^u corresponds
to state 3 in model ϕ^u, state 2 in ϕ̂^u corresponds to state 1 in ϕ^u, and state 3 in ϕ̂^u
corresponds to state 2 in ϕ^u.
Table 3.1: Average classification accuracy for the learned cascade model ϕ̂ over 50 simulation
runs. The number in parentheses is the standard deviation.

    Model      Maximum-Likelihood Classification    Viterbi Classification
    ϕ̂^{l1}            92.6% (3.5%)                      91.9% (4.5%)
    ϕ̂^{l2}            94.2% (3.8%)                      94.5% (4.2%)
    ϕ̂^u               91.1% (3.3%)                      95.3% (3.7%)
and then measuring the classification accuracy of ϕ̂^u against the original sequences generated by
ϕ^u. Figure 3.10 shows a plot with a portion of the state sequence generated by ϕ^u, along with
the same portion recognized by ϕ̂^u. As can be surmised from the figure, each state in the original
model was shifted "up" one state in the learned model. A similar analysis can be done for the
lower-level HMMs/switching HMMs in each model. Using this correspondence, we ran the generative
model for 10 000 iterations and measured the accuracy of the trained model. We repeated this
experiment for 50 training episodes of 50 000 iterations each. Means and standard deviations for
model accuracy are summarized in Table 3.1. Maximum-likelihood classification was based on the
forward filters described in the previous chapter. For Viterbi classification, we calculated the
state sequence backwards at appropriate times according to the algorithm presented in Section C.3
of Appendix C. Specifically, if the backpointers ψ_n all pointed to the same state at time n − 1, we
set a break and calculated the most likely sequence back to the previous break, or back to the
beginning of the sequence if there was no previous break. This sequence was then compared with
the generated state sequence to produce the second column in Table 3.1.
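The break-and-trace-back decoding just described can be sketched as follows: run the Viterbi recursion forward and, whenever all backpointers agree on the previous state, the path up to that point is fixed and can be decoded immediately. The toy discrete-observation model and the `online_viterbi` helper below are hypothetical, not the dissertation's implementation.

```python
import numpy as np

def online_viterbi(ys, A, B):
    """Viterbi in log space; emit a decoded segment at each backpointer break."""
    logA = np.log(A)
    delta = np.log(np.ones(A.shape[0]) / A.shape[0]) + np.log(B[:, ys[0]])
    psis, decoded, last_break = [], [], 0
    for n in range(1, len(ys)):
        scores = delta[:, None] + logA          # scores[i, j]: i -> j
        psi = scores.argmax(axis=0)             # best predecessor of each state j
        delta = scores.max(axis=0) + np.log(B[:, ys[n]])
        psis.append(psi)
        if np.all(psi == psi[0]):               # all backpointers agree: set a break
            state = psi[0]                      # the agreed state at time n - 1
            seg = [state]
            for p in reversed(psis[last_break:len(psis) - 1]):
                state = p[state]                # trace back to the previous break
                seg.append(state)
            decoded.extend(reversed(seg))
            last_break = len(psis)
    return decoded

A = np.array([[0.95, 0.05], [0.05, 0.95]])      # hypothetical two-state model
B = np.array([[0.9, 0.1], [0.1, 0.9]])          # P(symbol | state)
path = online_viterbi([0, 0, 0, 1, 1, 1], A, B)
```

The appeal of this scheme for an on-line system is that segments of the most likely path are committed as soon as they are unambiguous, without waiting for the end of the observation stream.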
Note that for many of these runs, the HMMs in the cascade model did not all completely
converge after 50 000 iterations. In particular, because of our model setup, the observation
densities for ϕ̂^{l2} often took longer to converge for certain initializations. Despite this, the
model still proved to be a reasonable classifier at all levels, as indicated by Table 3.1.
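One simple way to compute the state correspondence used for Table 3.1 is a greedy match on the confusion matrix between true and estimated labels. The sketch below uses hypothetical sequences that mimic the "shifted up one" pattern of Figure 3.10; `match_states` is an illustrative helper, not the method the dissertation specifies.

```python
import numpy as np

def match_states(true_seq, est_seq, r):
    """Greedily pair each learned state with the true state it co-occurs with most."""
    C = np.zeros((r, r), dtype=int)
    for t, e in zip(true_seq, est_seq):
        C[t, e] += 1                         # confusion counts
    mapping = {}
    for _ in range(r):
        t, e = np.unravel_index(C.argmax(), C.shape)  # largest remaining cell
        mapping[int(e)] = int(t)
        C[t, :] = -1                         # retire matched row and column
        C[:, e] = -1
    return mapping

true_seq = [0, 0, 1, 1, 2, 2, 0, 1, 2]
est_seq  = [2, 2, 0, 0, 1, 1, 2, 0, 1]      # learned labels shifted "up" by one
m = match_states(true_seq, est_seq, 3)
acc = np.mean([m[e] == t for t, e in zip(true_seq, est_seq)])
```

After relabeling through the mapping, accuracy is just the fraction of time steps where the relabeled estimate matches the true state, which is how the percentages in Table 3.1 can be interpreted.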
3.4 Discussion
In this chapter we have presented a cascade hidden Markov model architecture, offered some in-
formal analysis concerning convergence of the model, and presented a Monte Carlo simulation of
the model. Since the cascade model cannot be used in a fully generative fashion, we proposed a
cascaded switching HMM in order to incorporate proper dependencies into our data, that is, to
give it structure at multiple time scales. The cascaded switching HMM is an interesting model
itself, and we hope in some future life to be able to study it in more detail. Initial study indicates
that a version of the RMLE could be derived for this model in a manner similar to the RMLE
derivation for the hidden semi-Markov model presented in Appendix D. That said, the cascade
model presented in this chapter offers a distinct advantage in simplicity.
We believe the discussion and results presented in this chapter justify the use of the cascade
model even in situations where the model does not match the underlying model of the system. The
simulation results above showed that the model can learn information about the structure of data
at multiple scales, and can use that information to make classification decisions. The model also
seems to be rather robust: in many of the simulation runs, one state of lower model ϕ̂^{l1} did not
converge to the original model within 50 000 iterations, and yet the upper model ϕ̂^u still
recognized the original state sequence from ϕ^u with over 90% accuracy.
The next chapter describes the application of this model to real world data, as a means for a
mobile robot to learn concepts.
CHAPTER 4
CASCADE OF HMMS AS AN
ASSOCIATIVE MEMORY
4.1 Introduction
Our original motivation for proposing the model described in Chapter 3 was to create a model able
to learn simple concepts, which, as we suggested in Chapter 1, are formed by associations within
and among information from a sensory-motor system. In this chapter, we will describe the use of
the cascade model from Chapter 3 for this purpose. Specifically, we will demonstrate the model’s
ability to learn concepts among features from visual and auditory streams as sensed by a mobile
robot.
4.2 Associative Learning of Language Using Robots
A number of researchers are using robotics to study language grounding and/or associative language
learning. Most of the work in this area has focused on the association of auditory and sensory infor-
mation, where the auditory information generally represents speech, and the sensory information
is generally visual information. We highlight some of this work below.
For association of speech and visual information, Roy [105] has proposed a model of grounded
language learning called cross-channel early lexical learning (CELL), in which speech provides noisy
and ambiguous labels for video, and vice versa. In this work, words are discovered by searching
for segments of speech which reliably predict co-occurring visual cues. Since these pairings are
extremely noisy, the technique used to find potential speech segments searches for matching sub-
segments of speech in matching visual contexts. Initial training used recorded speech from mothers
playing with their infants for auditory input, and static images of related objects for visual input.
Later, the system was incorporated into a real-time speech and vision system embodied in a robotic
arm. Notably, the system incrementally learns words, then a rudimentary grammar. It can also
generate spoken outputs from stored word prototypes.
Steels [106] and Steels and Kaplan [107] focus not on specific learning models, but on the in-
teraction between the robot and researcher. Steels presents the idea of language games, whereby a
person interacts with a robot for the purpose of teaching the robot words. For experiments, Steels
and Kaplan use an off-the-shelf speech recognizer to associate words and contextual information us-
ing simple instance-based learning algorithms. Our own experiments are similar to those described
in [106, 107].
For the association of words and general sensory information, Oates et al. [108] and Oates [109]
present a stochastic method for clustering words according to syntactic information, then separately
estimate the conditional probability that the word would be uttered given a set of generic sensor
readings from a mobile robot. They use their system to first associate written descriptions, and later
spoken descriptions, of the activities of a robot with the sensor readings. Later, Burns et al. [110]
proposed an information theoretic approach for learning similar associations with the same robot.
As mentioned in Section 1.3.2.4, two other members of our project have conducted research in
similar areas. Liu [51] developed a system in which the robot learned associations between words
and "pushes" (tactile inputs), by which it learned to understand spoken navigational commands.
Zhu and Levinson [52] proposed a method to learn a joint probability density function (JPDF)
representation of the relationship of visual information and a text label, for learning such concepts
as color, shape, and object name.
Although it is usually not their main focus, many other developmental robot projects include
aspects of language study in their work [11, 12, 16, 20, 46, 110, 111].
Our proposal to use an HMM cascade structure to model associative learning is novel in the
same way as we described in the last chapter—we specifically model the relationship between repre-
sentations in multiple modalities with an HMM, as opposed to simply learning a joint distribution
Figure 4.1: Concept learning scenario using a cascade of HMMs. This model corresponds to
Figure 1.6 on page 19, with the generic models replaced by HMMs.
between the two modalities, as is generally done in the work cited above. Our use of this model
appears next.
4.3 Concept Learning Scenario
Analysis of our model in Chapter 3 assumed that the data being analyzed came from the same
underlying source. In fact, unless our auditory and visual input streams are being produced by
an intelligent projector or R2-D2, it is unlikely that the data was produced in this manner. A
more likely scenario comes from Figure 4.1, which is derived from Figure 1.6 in Chapter 1. In this
scenario, both the robot and the person have a model of the world, which here is represented by
a cascade of HMMs. We assume that each model structurally allows the recognition of visual and
auditory information present in the world (the lower level models), and further, that concepts can
be inferred and understood from the sequence of discrete classifications of this auditory and visual
information (using the upper level model).
It is assumed that the boy’s model of the world is better or more complete than the robot’s
model and, therefore, that the goal of the robot is to learn the boy’s model of the world. To reach
this goal, the robot must try to garner information about each of the boy’s submodels. To learn
the boy’s visual submodel, the robot will use visual data obtained from the world and assume that
the boy’s model was learned from similar information. For learning the boy’s auditory submodel,
the robot will use the boy’s own “speech”, and to learn the boy’s concept model, the robot will
attempt to find a relationship between what the boy says and what the world presents visually.
4.3.1 Model description
Formally, the structure of our model is equivalent to the structure developed in Chapter 3, although
the flow of information through this structure may be different. For our scenario, assume that
our robot’s model of the world is a cascade model ˆϕrobot = ϕc, ϕa, ϕv, where ϕa and ϕv are
auditory and visual HMMs, respectively (corresponding to ϕl1 and ϕl2), and ϕc is a concept HMM
(corresponding to ϕu). Assume that the boy’s model of the world is a hybrid cascade model
ϕboy = ϕc, λa,ϕv, where λa is a switching HMM modeling audio information, as described in the
previous chapter, and the other submodels are visual and concept HMMs as before. Finally, assume
that the visual information presented by the world (e.g., the apple) is represented by a traditional
HMM ϕvis. In this scenario, ϕvis and ϕboy are fixed, and ϕv is the boy’s representation of ϕvis.
4.3.2 Model scenario
Figure 4.2 shows the model topology of the scenario we envision.

Figure 4.2: Model topology for robot concept learning. The topology of this diagram corresponds to the scenario presented in Figure 4.1. The lower model ϕvis is a model of the world producing visual outputs. The upper left model ϕboy recognizes this visual input and produces auditory output. The upper right model ϕ̂robot uses both the visual input produced by ϕvis and the auditory input produced by ϕboy, and trains its various submodels.

This scenario proceeds as follows:

1. The model ϕvis produces a stream of states x^vis_n and corresponding visual features y^vis_n. The visual features y^vis_n are accessible by both the boy and the robot. The stream of states may include such states as x^vis_n = APPLE and x^vis_n = NOTHING.

2. The boy uses ϕv to recognize this visual stream, producing estimated state sequence x^v_n.

3. Using only the visual portion of the joint audio-visual state pdfs in concept model ϕc, the boy "thinks" of the concept related to the visual input (i.e., chooses the most likely concept state x^c_n in ϕc corresponding to x^v_n).

4. The boy may choose, at random times, to "speak his mind." At these times, he uses the auditory observation pdf from state x^c_n to produce y^ca_n. This output becomes the switch for switching HMM λa, which produces output stream y^aud_n = y^a_n. It is assumed that the
switch is "on" long enough to produce meaningful output from λa. At other times, the model λa produces "silence" (i.e., x^a_n = SILENCE, and y^a_n represents this state appropriately).
5. The robot simultaneously recognizes and learns (clusters) class information from visual input stream y^vis_n with HMM ϕv, and auditory input stream y^aud_n with HMM ϕa. These models produce estimated state sequences x^v_n and x^a_n, respectively.

6. When both x^v_n and x^a_n have meaningful information (i.e., x^a_n ≠ SILENCE and x^v_n ≠ NOTHING), model ϕc both:

(a) updates (learns) using these inputs (i.e., it clusters common co-occurrences), and

(b) estimates x^c_n, its "thoughts" about the pair of inputs.

7. At other times, when only one of x^v_n and x^a_n has meaningful information, ϕc uses only the partial pdf associated with that input to estimate x^c_n, and the model is not updated.
When actually run on the robot, estimated state information from all of the robot models may be
used by other programs (e.g., the controller) to make decisions.
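The gating in steps 6 and 7 can be sketched as follows. This is a minimal illustration of the dispatch logic only, with a placeholder model class standing in for the concept HMM; it is not the dissertation's actual RMLE code.

```python
# Sketch of the concept-model gating in steps 6-7: learn only when both
# input streams carry meaningful information; otherwise estimate from the
# partial pdf of whichever stream is meaningful.
SILENCE, NOTHING = "SILENCE", "NOTHING"

class StubConceptModel:
    """Placeholder standing in for the concept HMM's update/estimate."""
    def __init__(self):
        self.updates = 0
    def update(self, x_v, x_a):
        self.updates += 1  # a real model would run one RMLE step here
    def estimate(self, visual=None, audio=None):
        return ("concept", visual, audio)

def concept_step(x_v, x_a, model):
    """One time step of the concept HMM, per steps 6 and 7 above."""
    if x_a != SILENCE and x_v != NOTHING:
        model.update(x_v, x_a)                        # step 6(a): learn
        return model.estimate(visual=x_v, audio=x_a)  # step 6(b): estimate
    if x_v != NOTHING:
        return model.estimate(visual=x_v)  # step 7: visual partial pdf only
    if x_a != SILENCE:
        return model.estimate(audio=x_a)   # step 7: audio partial pdf only
    return None                            # nothing meaningful this step
```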
4.3.3 Simulation results
Using the scenario outlined above, we ran a Monte Carlo simulation of the composite model. This
section outlines those results. The following parameters were used for the fixed models ϕvis and
ϕboy = {ϕc, λa, ϕv}. Let ϕvis be an HMM with Gaussian observations. Define its transition
probability matrix as

A(ϕvis) =
    [ 0.90 0.05 0.05 ]
    [ 0.04 0.95 0.01 ]
    [ 0.04 0.01 0.95 ],

and its Gaussian observation density parameters as

µ(ϕvis) = [ 0 7 9 ]′, σ²(ϕvis) = [ 1.0 0.7 0.6 ]′.
Let the boy’s visual model ϕv be a learned version of ϕvis, i.e., ϕv ≈ ϕvis.
For the boy's auditory switched HMM λa, let the set of transition probability matrices Am(λa),
1 ≤ m ≤ 3, be defined as

A1(λa) =
    [ 0.94 0.02 0.02 0.02 ]
    [ 0.94 0.02 0.02 0.02 ]
    [ 0.94 0.02 0.02 0.02 ]
    [ 0.94 0.02 0.02 0.02 ],

A2(λa) =
    [ 0.05 0.45 0.45 0.05 ]
    [ 0.01 0.90 0.08 0.01 ]
    [ 0.01 0.70 0.28 0.01 ]
    [ 0.05 0.45 0.45 0.05 ],

and

A3(λa) =
    [ 0.05 0.05 0.45 0.45 ]
    [ 0.05 0.05 0.45 0.45 ]
    [ 0.05 0.05 0.10 0.80 ]
    [ 0.01 0.01 0.08 0.90 ],

and let the Gaussian parameters µ(λa) and σ²(λa) be

µ(λa) = [ 0.0 3.0 5.0 7.0 ]′, σ²(λa) = [ 1.0 0.4 0.5 0.6 ]′.
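To make the switching behavior concrete, the following sketch (our own, assuming the transition matrices above and simple ancestral sampling) advances λa one step: the switch value selects which chain Am governs the transition, and the new state emits a Gaussian observation.

```python
import random

# Transition matrices A_m(λa) as given above; the switch value m selects
# which chain governs the state transition at each step.
A = {
    1: [[0.94, 0.02, 0.02, 0.02]] * 4,
    2: [[0.05, 0.45, 0.45, 0.05],
        [0.01, 0.90, 0.08, 0.01],
        [0.01, 0.70, 0.28, 0.01],
        [0.05, 0.45, 0.45, 0.05]],
    3: [[0.05, 0.05, 0.45, 0.45],
        [0.05, 0.05, 0.45, 0.45],
        [0.05, 0.05, 0.10, 0.80],
        [0.01, 0.01, 0.08, 0.90]],
}
MU = [0.0, 3.0, 5.0, 7.0]    # Gaussian means µ(λa)
VAR = [1.0, 0.4, 0.5, 0.6]   # Gaussian variances σ²(λa)

def step(state, switch, rng):
    """Advance λa one step under chain A[switch]; return (state, observation)."""
    row = A[switch][state]
    state = rng.choices(range(4), weights=row)[0]
    y = rng.gauss(MU[state], VAR[state] ** 0.5)  # gauss takes a std deviation
    return state, y
```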
Finally, let the boy's concept HMM ϕc be defined by

A(ϕc) =
    [ 0.90 0.05 0.05 ]
    [ 0.08 0.90 0.02 ]
    [ 0.08 0.02 0.90 ],

bv(ϕc) =
    [ 0.98 0.01 0.01 ]
    [ 0.02 0.90 0.08 ]
    [ 0.02 0.08 0.90 ],

and

ba(ϕc) =
    [ 0.96 0.02 0.02 ]
    [ 0.10 0.90 0.00 ]
    [ 0.10 0.00 0.90 ].

Note that bv(ϕc) represents a distribution over the states of ϕv, whereas ba(ϕc) represents a distribution over the selection of Markov chains Am(λa). These two distributions are not and cannot be used simultaneously.
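The role of a partial pdf can be illustrated with bv(ϕc) above: when only a visual classification is available, each concept state can be scored by its visual observation probability alone. This is our own minimal sketch, ignoring the transition dynamics that the full model also uses.

```python
# Visual partial pdf bv(ϕc): rows are concept states of ϕc, columns are
# visual classes recognized by ϕv (values taken from the text above).
B_V = [
    [0.98, 0.01, 0.01],
    [0.02, 0.90, 0.08],
    [0.02, 0.08, 0.90],
]

def likely_concept(visual_class):
    """Most likely concept state given only a visual classification,
    i.e., the argmax over rows of the visual partial pdf."""
    scores = [row[visual_class] for row in B_V]
    return scores.index(max(scores))
```

With the bias built into bv(ϕc), each visual class maps onto its own concept state, which is exactly the correlation the cascade is meant to learn.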
As in the cascade model simulation in Chapter 3, assume we know the number of states and
type of distribution for each of the models, so that ϕ̂robot = {ϕc, ϕa, ϕv} has (approximately) the
correct topology to learn the given models. (See Section 2.3.5 for a brief discussion on model order
approximation when the number of states is not known or easily discernible.) As in the previous
chapter, means and variances for the observation densities of ϕa and ϕv were initialized using
k-means initialization on the first 1000 outputs of the generative model. Gaussian noise with zero
mean and standard deviation one was then added to the initial means, and noise with zero mean
Figure 4.3: Parameter learning for model ϕc. (a) Running average of the log-likelihood of p_ϕu(y1, . . . , yn). (b) Training of transition probability matrix A(ϕu). (c) Training of observation probability matrix b^l1(ϕu). (d) Training of observation probability matrix b^l2(ϕu). Although the learned model cannot be directly compared to the original model, these graphs show that the model parameters converge.
and standard deviation 0.5 was added to the variances. For recursive maximum-likelihood training,
we again let the learning rate be εn = 0.006/n^0.2, where n is the iteration number, and we used a
smoothing history of k = 1000 (see Section 2.3.3).
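The training hyperparameters just described can be sketched as follows. The positive floor applied to the perturbed variances is our own guard against an invalid (nonpositive) variance, not something stated in the text.

```python
import random

def epsilon(n):
    """RMLE step size εn = 0.006 / n^0.2, for iteration number n ≥ 1."""
    return 0.006 / n ** 0.2

def noisy_init(means, variances, rng):
    """Perturb k-means initial estimates as described above: N(0, 1) noise
    on the means, N(0, 0.5²) noise on the variances (floored at a small
    positive value so each variance stays valid)."""
    m = [mu + rng.gauss(0.0, 1.0) for mu in means]
    v = [max(var + rng.gauss(0.0, 0.5), 1e-3) for var in variances]
    return m, v
```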
Figure 4.3 shows the progression of a training run for model ϕc. As before, Figure 4.3(a) gives
a running average of the log-likelihood, and the remaining subfigures show the progression of the
parameter values through time. As can be seen from the graphs, most of the parameters converge
quite rapidly. Parameters which converge more slowly, such as those in Figure 4.3(d), are those
that are tracking changes in lower models which have not yet converged. The convergence of ϕv
and ϕa can be seen in Figures 4.4 and 4.5. As before, since we used k-means initialization for
initializing the observation densities of ϕa and ϕv, these values started somewhat close to their
Figure 4.4: Training run output for model ϕv. (a) Running average of the log-likelihood of p_ϕl1(y^l1_1, . . . , y^l1_n). (b) Training of transition probability matrix A(ϕl1). (c) Training of Gaussian observation density mean µ(ϕl1). (d) Training of Gaussian observation density variance σ²(ϕl1). (e) Data histogram and observation distribution learned by the model. Since, in simulation, this model learns directly from data generated by ϕvis, we can compare the model parameters of these models directly. Original model parameters are indicated by the symbol at the right edge of the graphs.
Figure 4.5: Parameter learning for model ϕa. (a) Running average of the log-likelihood of p_ϕl2(y^l2_1, . . . , y^l2_n). (b) Training of transition probability matrix A(ϕl2). (c) Training of Gaussian observation density mean µ(ϕl2). (d) Training of Gaussian observation density variance σ²(ϕl2). (e) Data histogram and observation distribution learned by the model. Although the learned model cannot be directly compared to the original model, these graphs show that the model parameters converge.
Table 4.1: Average classification accuracy for the learned HMMs over 50 simulation runs. The number in parentheses is the standard deviation.

         Maximum-Likelihood Classification    Viterbi Classification
    ϕa   90.1% (3.7%)                         89.9% (4.2%)
    ϕv   97.9% (1.6%)                         99.1% (2.4%)
    ϕc   98.4% (1.1%)                         98.8% (1.1%)
optimum, although as noted, noise was added to this initialization. Because of this added noise
and the nature of the source model distributions, in some cases the variance parameters for the
Gaussian observation distributions did not converge to the correct values by 40 000 iterations.
We repeated this experiment for 50 training episodes of 50 000 iterations each, and used the
same method outlined in the previous chapter to measure model accuracy. These episodes are
summarized in Table 4.1. As before, the cascade model showed a high degree of robustness.
4.4 Robotic Experiments
The basic scenario for robotic experiments of the model is similar to the simulated scenario presented
in the last section: the robot and a person are looking at the same object, the person names the
object or briefly describes some aspect of it, and the robot, over time, learns the association between
that word or phrase and the visual features of the object. This scenario, while describing a necessary
aspect of learning in our robot, does not take into account the overall goals and work of the project
described in Chapter 1. Here we describe an experiment which better demonstrates these goals.
In our real scenario, our robot is wandering around a benign environment, and is instinctually
motivated to look for “interesting” things. We expect the following behaviors:
1. It will be attracted to objects, especially ones that it has not seen before, or not seen recently;
it will “play” with these objects, attempting to first pick them up, then knock them over.
2. It will be attracted by loud noises, turning toward them and assuming, e.g., that someone
wants to get its attention.
3. Using our proposed cascade model, it will
(a) learn to recognize the visual objects in its environment,
(b) learn to recognize distinct words spoken to it, and
(c) learn the concepts associated with the various words and objects.
4. Also using our HMM cascade, it will demonstrate that it recognizes these concepts by
(a) recognizing a word, choosing a corresponding concept, and finding an object which also
matches that concept, and
(b) recognizing an object and saying the name of a concept corresponding to that object.
The behaviors listed in numbers one and two above were first demonstrated by McClain [50]. The
demonstration described here builds on his work and on the work of others, including
• sound source localization research by Li and Levinson [31, 32],
• speech feature extraction and synthesis research by M. Kleffner [48], and
• visual feature extraction by R. S. Lin (unpublished).
The specific objects we are using in this demonstration are shown in Figure 4.6, and the words
and phrases we say are listed in Table 4.2. These words were chosen to test the learning of
concepts for specifically named objects (such as cat) as well as concepts for general categories (such
as animal). Although not necessary, the concepts we initially learn correspond directly to the words
and phrases listed in Table 4.2.
Because they pertain directly to our work, autonomous exploration and speech and visual feature
extraction are discussed below, with the details of both feature extraction algorithms appearing
in Appendix B. This discussion is followed by a description of the implementation of our HMM
cascade model for our robots in Section 4.4.3.
4.4.1 Finite state machine controller
The central component of the above experiment is a finite state machine (FSM) controller developed
by McClain [50] as a part of an autonomous exploration mode for our robot. This controller
continuously evaluates the state of the robot and its environment, and uses this information to
Figure 4.6: Objects used in our robot demonstration.
Table 4.2: List of words used in our robot demonstration.

    animal        ball
    cat           dog
    green ball    red ball
make decisions and produce specific types of behavior. For our experiment, we modified the state
machine and its related programs to use information from our associative memory when making
decisions, as well as to facilitate learning in our model. The FSM we are using is shown in Figure 4.7.
A description of each state is as follows:
1. Explore: look around for something interesting.
(a) If we see an interesting object (such as one we have not seen before), go to state 2.
(b) If we hear an interesting (i.e., loud) sound, go to state 6.
2. Found object: an object is visible.
(a) If it is far away, approach it, study what it looks like, and stay in state 2.
(b) If it is near, go to state 3.
3. Learn name: learn the name or feature of an object.
(a) If we hear something, repeat it and try to associate it with this object; stay in state 3.
(b) After a period of silence, go to state 4.
4. Play 1: play with the object.
(a) Approach and attempt to pick up the object; go to state 5.
5. Play 2: play with the object.
Figure 4.7: The robot's finite state machine controller. Values on the arcs indicate the inputs and corresponding behaviors when transitioning between states.
(a) Try to knock the object over; go back to state 1.
6. Interact: listen for known sounds.
(a) If we hear the name of an object we know, look for it; go to state 7.
(b) If we hear something we do not know, beep and stay in state 6.
(c) If we do not hear anything for a short period, go back to state 1.
7. Search: look for a particular object.
(a) If we have not found the object, keep looking, and stay in state 7.
(b) If we have not found the object after a long time, give up and go to state 1.
(c) If we find the desired object, say the name (if we know it), and go to state 2.
The role our HMM cascade associative memory plays changes depending on the state of the controller.
We describe these roles below in Section 4.4.3.
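The controller's transition logic can be transcribed as a simple lookup table. The state and event names below are paraphrased from the description above; they are our own labels, not identifiers from McClain's implementation.

```python
# Sketch of the FSM controller described above (states 1-7).
EXPLORE, FOUND, LEARN, PLAY1, PLAY2, INTERACT, SEARCH = range(1, 8)

TRANSITIONS = {
    (EXPLORE, "object_visible"): FOUND,
    (EXPLORE, "loud_sound"): INTERACT,
    (FOUND, "object_far"): FOUND,            # approach and keep studying
    (FOUND, "object_near"): LEARN,
    (LEARN, "speech"): LEARN,                # repeat and associate
    (LEARN, "silence_timeout"): PLAY1,
    (PLAY1, "picked_up"): PLAY2,
    (PLAY2, "knocked_over"): EXPLORE,
    (INTERACT, "known_word"): SEARCH,
    (INTERACT, "unknown_speech"): INTERACT,  # beep
    (INTERACT, "silence_timeout"): EXPLORE,
    (SEARCH, "still_looking"): SEARCH,
    (SEARCH, "timeout"): EXPLORE,            # give up
    (SEARCH, "found_object"): FOUND,         # say its name if known
}

def next_state(state, event):
    """Look up the next controller state; unmatched events leave it unchanged."""
    return TRANSITIONS.get((state, event), state)
```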
4.4.2 Sensory inputs
We are running this experiment on a real robot with real sensory inputs, so in addition to the FSM
controller, our associative memory needs features extracted from live speech and visual inputs.
For speech data analysis, we are extracting energy, voicing confidence, and a set of log-area ratios
(LARs) from a 16-kHz audio stream. This processing is based on work developed by Kleffner [48] for
speech imitation for the robot, and is described in Appendix B. Typically, around 8-12 LARs plus
pitch and voicing information can be used to synthesize a very accurate reproduction of the speech
signal. For our work, we are currently extracting three LARs, log energy, and voicing confidence
on consecutive 20-ms segments of audio, giving us a stream of length-five feature vectors at 50 Hz.
Despite the short length of this feature vector, these features are very representative of the speech
signal; using only the three extracted LARs and voicing information, we can still synthesize speech
that is intelligible.
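The arithmetic behind the 50-Hz feature rate can be written out as a small sketch; the constants are taken directly from the text above.

```python
SAMPLE_RATE = 16_000   # Hz audio stream
FRAME_MS = 20          # consecutive, non-overlapping 20-ms segments
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per segment
FEATURES_PER_FRAME = 5 # 3 log-area ratios + log energy + voicing confidence

def frame_count(num_samples):
    """Number of complete length-five feature vectors produced from a buffer."""
    return num_samples // FRAME_LEN
```

One second of audio thus yields 50 feature vectors, matching the stated 50-Hz feature rate.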
For visual data analysis, the current experiment is using a robust segmentation and feature
extraction algorithm developed by R. S. Lin (unpublished). The segmentation algorithm is based
on loopy belief propagation [112]. After image segmentation, the feature extractor presents a
length-10 visual feature vector for each object in an image, consisting of
1. a normalized length-eight color histogram,
2. the first moment of the object shape, and
3. the height/width ratio of the object.
These features appear at a rate of about 2 sets per second. Descriptions of the segmentation and
feature extraction algorithms are presented in Appendix B.
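A sketch of how the length-10 visual feature vector might be assembled from the three components listed above. This is illustrative only; the actual extractor is Lin's segmentation-based algorithm, whose details are in Appendix B.

```python
def visual_feature_vector(color_counts, first_moment, height, width):
    """Assemble the length-10 vector described above: a normalized
    8-bin color histogram, the first moment of the object shape, and
    the height/width ratio."""
    assert len(color_counts) == 8
    total = float(sum(color_counts)) or 1.0  # avoid division by zero
    hist = [c / total for c in color_counts]
    return hist + [first_moment, height / width]
```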
4.4.3 HMM cascade model setup
Our HMM cascade model is set up structurally similar to model ϕ̂robot in Figure 4.2, with some
modifications. The biggest change is in our audio model ϕa, which is actually a two-level
model with some additional processing, as shown in Figure 4.8. Conceptually, the lower model is
a phonetic model, and the upper model is a word model. As mentioned above, for auditory input
features we are using three log-area ratios, log energy, and voicing information. These features
are presented to our phonetic model, a 3-state HMM. The observation densities in each state of
this model were initialized from silence, voiced, and unvoiced auditory data features, respectively.
Transition probabilities were initialized uniformly, and the model was then trained with the RMLE
Figure 4.8: Auditory model used for speech recognition in our robot. Lower HMM ϕ̂aud is a phonetic model. Sequences of states from this model corresponding to words are converted to a histogram and normalized. This histogram and the word length comprise the word feature vector. This vector is quantized, and then presented to word HMM ϕ̂word, and used to estimate x̂^word_n.
n .
algorithm using features extracted from recorded speech of 20 sentences from the Harvard list of
phonetically balanced sentences [113], shown in Table 4.3. The training of some of the
parameters in this model is shown in Figure 4.9.
For the word recognizer, we needed some way of representing the features of variable length
words or phrases. We first made the assumption that only words or short phrases would be spoken,
i.e., that we would not need to parse full sentences. A voice activity detector (a component of
the speech imitation code) was used to determine the boundaries of these words or phrases. Using
these boundaries, we extracted the word/phrase length and calculated a normalized histogram of
the state sequence recognized by the phonetic HMM, giving us a length-4 word feature vector (i.e.,
the word length plus one histogram value for each of the three states of ϕaud). We then quantized
Figure 4.9: Parameter estimation for phonetic HMM ϕ̂aud. (a) Training of transition probability matrix A(ϕaud). (b), (c) Training of the Gaussian means µ(ϕaud) for states 1 and 2. (d), (e) Training of the Gaussian covariance matrices Σ(ϕaud) for states 1 and 2. Training of means and covariances of the Gaussian observation densities are shown for two of the three states in the model.
Table 4.3: Harvard phonetically balanced sentences. Features extracted from one wave file of each sentence were used to train the 3-state phonetic HMM.

List 1:
1. The birch canoe slid on the smooth planks.
2. Glue the sheet to the dark blue background.
3. It's easy to tell the depth of a well.
4. These days a chicken leg is a rare dish.
5. Rice is often served in round bowls.
6. The juice of lemons makes fine punch.
7. The box was thrown beside the parked truck.
8. The hogs were fed chopped corn and garbage.
9. Four hours of steady work faced us.
10. Large size in stockings is hard to sell.

List 2:
1. The boy was there when the sun rose.
2. A rod is used to catch pink salmon.
3. The source of the huge river is the clear spring.
4. Kick the ball straight and follow through.
5. Help the woman get back to her feet.
6. A pot of tea helps to pass the evening.
7. Smoky fires lack flame and heat.
8. The soft cushion broke the man's fall.
9. The salt breeze came across from the sea.
10. The girl at the booth sold fifty bonds.
each component of this feature vector into five bins, and the resulting quantized feature vector was
presented to the word HMM. The feature vector quantization bins were non-uniform; the cutoffs
were determined by dividing the sorted list of each feature value for our training set into five roughly
equal bins, as shown in Figure 4.10. The discrete observation densities for each state in the HMM
were initialized using 10 repetitions of each word or phrase in Table 4.2, giving one state per word.
Audio features were extracted from each training waveform, passed through and recognized by the
phonetic HMM, and then quantized. Transition probabilities were initialized uniformly, and the
whole word model was then trained for 80 epochs on the same 10 repetitions of each word using the
RMLE algorithm. The training of some of the parameters in this model is shown in Figure 4.11.
For the visual HMM ϕvis, we used a four-state HMM to recognize features from the objects
shown in Figure 4.6. As with the word model, we initialized the observation densities for the object
models from a collected data set. For each object, we obtained a feature vector for 200 images of
that object taken from multiple perspectives. These vectors were quantized as above before being
used to initialize the densities. Again, transition probabilities were initialized uniformly, and the
model was then trained for 10 epochs on the same data using RMLE. We used only 10 epochs
because the initial density estimates were already close to their optimal values, as can be seen
from the parameter training examples in Figure 4.12.
The third model (the concept model), ϕc, is a discrete HMM with observations covering the joint
state spaces of the audio and visual models. We initialized a model with six states, corresponding
Figure 4.10: Equalized quantization. A finite number of samples are drawn from an unknown probability distribution (represented here by a Gaussian distribution), sorted, and divided into equal groups. The resulting divisions between groups are used as the cutoffs for future quantization.
Figure 4.11: Parameter learning for word model ϕ̂word. (a) Training of transition probability matrix A(ϕword). (b)-(e) Training of dimension 1 of the discrete observation densities bjk(ϕword) for the first four words of the model. There were a total of six words, each with four observation dimensions (word length plus three state histograms), quantized into five discrete levels. Plotted are the probabilities of each quantization level.
Figure 4.12: Parameter learning for model ϕ̂vis. (a) Training of transition probability matrix A(ϕvis). (b)-(e) Training of dimension 1 (moment) of the discrete observation densities bjk(ϕvis) for the four objects learned by the model. Each object was represented by a 10-dimensional vector (moment, height/width ratio, and a length-8 color histogram), each quantized into five discrete levels. Plotted are the probabilities of each quantization level.
Figure 4.13: Recognition of visual representations and concepts. In this state, the robot is not listening for speech input, so model ϕ̂aud is disabled.
to the six words/phrases in our word list in Table 4.2. The transition probabilities were initialized
uniformly, and the observation probabilities were initialized by hand to bias them slightly toward
the desired concepts. For example, for the state we chose to correspond to "ball," the observation
probabilities corresponding to the visually recognized red and green balls were given slightly higher
probabilities than the observation probabilities corresponding to the cat and the dog, and the
observation probability corresponding to the word “ball” was given a slightly higher value than
those probabilities corresponding to other words.¹
Depending on the mode of the finite state machine, certain parts of the model are inactive.
Specifically, referring to the FSM in Figure 4.7, when in states 1, 2, 4, 5,
and 7, auditory input is ignored: the object HMM recognizes visual inputs, and the concept model
uses the marginal density corresponding to the states of the object HMM to determine its state.
This idea is presented in Figure 4.13. In state 6, where the robot is listening for speech input,
the opposite happens: visual input is ignored, the auditory model attempts to recognize spoken
words, and the state of the word model alone determines the state of the concept model, as seen in
¹This biasing is not strictly necessary, but helps our model converge in a reasonable amount of time. As with many modeling scenarios, using prior knowledge to initialize the model is common [114]. We note that Poritz used a similar type of bias when he conducted experiments on unsupervised learning of speech using HMMs [72].
Figure 4.14: Recognition of auditory representations and concepts. In this state, the robot is only listening for speech input, so model ϕ̂vis is disabled.
Figure 4.14. Finally, in state 3, both audio and visual inputs are present. All models are active, and
recognition and learning are done in the concept model with both inputs, as shown in Figure 4.15.
Note that learning is possible in both the auditory and visual models in any state where the model
of interest is active. For the experiment here, we chose not to enable learning in these models.
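The per-state model activation just described (visual only in controller states 1, 2, 4, 5, and 7; audio only in state 6; both inputs, with learning, in state 3) can be summarized in a short sketch:

```python
# Which submodels feed the concept HMM in each controller state
# (cf. Figures 4.13-4.15).
VISUAL_ONLY = {1, 2, 4, 5, 7}   # auditory input ignored
AUDIO_ONLY = {6}                # visual input ignored
BOTH = {3}                      # learn in the concept model

def concept_inputs(fsm_state, x_vis, x_aud):
    """Return (visual, audio, learn) inputs for the concept HMM."""
    if fsm_state in BOTH:
        return x_vis, x_aud, True
    if fsm_state in VISUAL_ONLY:
        return x_vis, None, False
    if fsm_state in AUDIO_ONLY:
        return None, x_aud, False
    raise ValueError(f"unknown FSM state {fsm_state}")
```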
4.4.4 Issues
There are a few miscellaneous issues we must deal with in our experiment, depending on the current
state of the FSM. The first issue is that multiple objects may be present within a scene. When this
happens, each visual object is presented in sequence to the visual HMM. In this way, the transition
probabilities in the visual HMM would come to represent information about the spatial relationship
between various objects, in that objects that are close to one another will frequently be presented
to the HMM sequentially.
Depending on the state of the FSM, one of these objects is identified as a target object. For
example, in state 2, the target object would be the object first identified as “interesting” in state
one. In subsequent iterations, the robot will remember and attempt to track this target object,
e.g., so that it can be played with later.
Figure 4.15: Recognition and learning using both auditory and visual information.
Because we have stereo cameras, we additionally have a correspondence problem to deal with.
Currently, at every iteration the model recognizes objects in each image separately, and then
correspondence is determined using the recognition labels (i.e., the recognized states of ϕv) for each
image. Objects which appear in only one of the images are currently ignored. We do not currently
handle the situation where there are multiple objects of the same visual class present.
A final potential issue is object occlusion, where only a portion of an object appears in an
image. As of right now, this has not been a serious issue for us. In the case where an object is
misclassified because it is occluded in one image, but fully visible in the other, correspondence is
not drawn between the two objects. If the robot is looking for this object, it will eventually find it
when it moves or turns its head. As the robot approaches an object, the bottom of the object may
also be cut off; in this case, we lower its head. Even for a partially occluded object, the model
has generally proven robust enough to do proper recognition. This issue may become
more important as we increase the number of objects.
4.4.5 Results
Our goal in this experiment is to show that the concept model ϕcon can be learned from a set of real
inputs. As described above, we initialized and trained the auditory and visual models off-line using
Table 4.4: Initial observation probabilities used by the concept HMM for visible objects. The horizontal axis refers to the concept class, and the vertical axis refers to the classified visual object.

                  animal  ball  cat   dog   green ball  red ball
    cat           0.30    0.20  0.40  0.30  0.15        0.15
    dog           0.30    0.20  0.30  0.40  0.15        0.15
    green ball    0.20    0.30  0.15  0.15  0.40        0.30
    red ball      0.20    0.30  0.15  0.15  0.30        0.40
Table 4.5: Initial observation probabilities used by the concept HMM for words. The horizontal axis refers to the concept class, and the vertical axis refers to the classified spoken word.

                  animal  ball  cat   dog   green ball  red ball
    "animal"      0.50    0.10  0.10  0.10  0.10        0.10
    "ball"        0.10    0.50  0.10  0.10  0.10        0.10
    "cat"         0.10    0.10  0.50  0.10  0.10        0.10
    "dog"         0.10    0.10  0.10  0.50  0.10        0.10
    "green ball"  0.10    0.10  0.10  0.10  0.50        0.10
    "red ball"    0.10    0.10  0.10  0.10  0.10        0.50
recorded auditory and visual features, respectively. Note that, even though the training occurred
off-line, we used recursive maximum-likelihood estimation to learn the model parameters, so this
training could be done online.
For the concept model, we initialized the model as described in Section 4.4.3 above, i.e., we
initially set all of the transition probabilities equal, and initialized the discrete observation
probabilities by hand so that they had a slight bias toward particular concepts. The actual initialization
we used is shown in Tables 4.4 and 4.5. The model was then trained using RMLE during the
simulation run. Specifically, when the FSM entered state 3, the robot would sit in front of a target
object. The visual model ϕv would continuously recognize this object, and the auditory model ϕa
would recognize words that were spoken into a close-talk microphone. When a word was spoken and
Figure 4.16: Illy learning about various objects. In this scenario, as Illy approaches various objects,she stops and waits for a verbal description consisting of short words or phrases. Over time, sheassociates these spoken words with the object.
recognized, the state xa of model ϕa corresponding to that word and the state xv of model ϕv were
presented to the concept model, and the model was updated according to the RMLE algorithm.
To speed up training, each input pair was presented 10 times whenever a word was recognized.
This process was repeated multiple times for each object as the robot wandered around and played
with its toys.2 Figure 4.16 shows a picture taken during this training.
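The initialization and the repeated presentation of recognized state pairs can be sketched as follows. This is a minimal illustration, not the actual robot code; `update_fn` stands in for the RMLE parameter update, and the matrix construction mirrors Table 4.5:

```python
import numpy as np

concepts = ["animal", "ball", "cat", "dog", "green ball", "red ball"]
n = len(concepts)

# Uniform initial transition probabilities, as described in Section 4.4.3.
A = np.full((n, n), 1.0 / n)

# Initial word observation probabilities of Table 4.5: 0.50 for the matching
# concept, 0.10 elsewhere (rows: classified words, columns: concept classes).
B_word = np.full((n, n), 0.10) + 0.40 * np.eye(n)

def present(update_fn, word_state, visual_state, repeats=10):
    """Present a recognized (word, visual) state pair to the concept model
    `repeats` times, mirroring the 10-fold presentation described above."""
    for _ in range(repeats):
        update_fn(word_state, visual_state)
```

Each row of `B_word` sums to one (0.50 + 5 × 0.10), so the matrix is a valid discrete observation density from the start.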
Figure 4.17 shows the change of some of the parameters of the concept model as the model is
trained. For the training run shown here, we ran the robot for about 30 min. The final trained
transition and observation probabilities are shown in Tables 4.6, 4.7, and 4.8.
2Because we were slightly impatient to get results, the robot was not actually allowed to play with its toys duringthe training run; it could only look at them and hear their names.
[Figure 4.17: five plots, each showing probability (vertical axis, 0 to 1) versus training iterations
(horizontal axis, 100 to 800): (a) training of transition probability matrix A(ϕc); (b) training of
the visual input of the discrete observation density (ϕ̄c) for the first concept (“animal”); (c) the
same for the second concept (“ball”); (d) training of the auditory input of the discrete observation
density (ϕ̄c) for the first concept (“animal”); (e) the same for the second concept (“ball”).]

Figure 4.17: Parameter learning for model ϕcon. Discrete observation density plots are shown for
the first dimension (moment) of the observation densities for the four objects learned by the model.
Each object was represented by a 10-dimensional vector (moment, height/width ratio, and a length-8
color histogram), with each dimension quantized into five discrete levels. Plotted above are the
probabilities of each quantization level.
Table 4.6: Trained transition probabilities for the concept HMM. These values were initialized
uniformly (i.e., all values started at 1/6).

             animal    ball     cat      dog    green ball  red ball
animal       0.4211   0.0856   0.1670   0.1840    0.0699     0.0723
ball         0.0723   0.4760   0.0597   0.0721    0.1659     0.1540
cat          0.1931   0.1017   0.3717   0.1479    0.0925     0.0931
dog          0.2023   0.0911   0.1307   0.4115    0.0776     0.0867
green ball   0.1082   0.2142   0.1002   0.1051    0.3321     0.1401
red ball     0.1105   0.1951   0.1026   0.1186    0.1434     0.3298
Table 4.7: Trained observation probabilities used by the concept HMM for visible objects. The
horizontal axis refers to the concept class, and the vertical axis refers to the classified visual object.

             animal    ball     cat      dog    green ball  red ball
cat          0.4086   0.0761   0.5412   0.3193    0.0996     0.1041
dog          0.3987   0.0744   0.2665   0.5077    0.0974     0.1027
green ball   0.0957   0.4215   0.0963   0.0603    0.5232     0.2777
red ball     0.0970   0.4280   0.0960   0.1127    0.2799     0.5155
Table 4.8: Trained observation probabilities used by the concept HMM for words. The horizontal
axis refers to the concept class, and the vertical axis refers to the classified spoken word.

               animal    ball     cat      dog    green ball  red ball
“animal”       0.6728   0.0530   0.1977   0.2134    0.0576     0.0605
“ball”         0.0660   0.7088   0.0572   0.0509    0.2166     0.1896
“cat”          0.0769   0.0268   0.5738   0.0575    0.0385     0.0398
“dog”          0.1112   0.0735   0.1022   0.6194    0.0647     0.0882
“green ball”   0.0396   0.0699   0.0369   0.0306    0.5550     0.0722
“red ball”     0.0334   0.0681   0.0322   0.0282    0.0676     0.5496
4.4.6 Discussion
The results shown here indicate the long-term capabilities of the model. Specifically,
the model had not entirely converged after 30 min, but the parameter values were moving in a
direction that indicated convergence to a useful state.
Trained transition probabilities in Table 4.6 indicate (1) a general affinity for “thinking” of the
same object at consecutive time steps (as indicated by high diagonal values), and (2) a slightly
smaller but discernible relationship between related classifications (e.g., the animal state was more
likely to be followed by a cat or dog state than any of the other states). These transition probabilities
strongly reflect the order of words presented to the model, which is reasonable considering (1) we
only trained the model when a word was present, (2) we often repeated the same word consecutively,
and (3) when we did not repeat a word consecutively, we often spoke another word related to the
same object.
For both auditory and visual inputs, the observation probabilities for each state in the concept
HMM were biased slightly at the beginning of training toward a particular outcome. For the
observation probabilities learned through the end of training, those probabilities referring to visible
objects (in Table 4.7) are the more interesting of the two. For example, the concept ball initially
corresponded to a visual representation of the red or green ball with probabilities 0.3 and 0.3, and
to a visual representation of the cat or dog with probabilities 0.2 and 0.2. By the end of training,
each of these initial biases had strengthened. Taking the ball example again, the
final observation probabilities for the red ball and green ball were 0.43 and 0.42, respectively, and
the observation probabilities for the visual representations of cat and dog went down accordingly.
The same was true for other observation probabilities.
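This strengthening can be checked directly from the tables; here is a small sketch, with the values for the ball concept copied from Tables 4.4 and 4.7 (rows ordered cat, dog, green ball, red ball):

```python
# Probability of each visual object under the ball concept:
# initial values (Table 4.4) vs. trained values (Table 4.7).
initial = [0.20, 0.20, 0.30, 0.30]
trained = [0.0761, 0.0744, 0.4215, 0.4280]

# The two initially favored entries (the balls) grew; the others shrank.
grew = [t > i for i, t in zip(initial, trained)]
print(grew)  # [False, False, True, True]
```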
The results here indicate that our HMM cascade model can learn a set of concepts using fea-
tures extracted from live auditory and visual inputs measured by a mobile robot exploring its
environment.3 This learned information can then be used by the robot’s controller module to make
important behavioral decisions.
3The auditory inputs in this experiment were presented using a boomset microphone. However, in theory, nothingprevents us from using microphones on the robot, although we would have a noisier signal.
4.5 Conclusion
This chapter has discussed the use of our cascade of HMMs as an associative memory. First,
simulation results representing a real-world scenario indicated that this model is viable for learning
associations among concurrent stationary regions of multiple input streams, where each of these
stationary regions is modeled by a state in a hidden Markov model. Next, a live version of this
scenario was run on the robot, whereby features were extracted from auditory and visual streams,
classified by an HMM, and these classifications were then used as input to a concept HMM for training
and additional classification.
The robotic implementation of our HMM cascade model presented here is a proof of concept for
an important idea. Specifically, we are able to take noisy, real-world analog inputs, convert them
to symbols (by classifying them), and present them to a controller for use in making important
decisions (for example, whether to approach and play with a particular toy, or look for another).
In other words, our robot is making symbolic decisions based on discrete representations of the real
world around it. In addition, when classifying and learning about real-world inputs, the model
learns to associate related auditory and visual information with the same (symbolic) concept. The
model is learned online using a robust maximum-likelihood estimator.
The work described in this chapter explored the case where each concept in our concept HMM
corresponded to exactly one word, though potentially multiple visual objects. An interesting future
experiment would be to learn concepts which could refer to both multiple words and multiple
objects. Another obvious though challenging extension to this work would be to attempt to grow
the cascade model as new visual objects or words are presented to the robot.
CHAPTER 5
CONCLUSION
5.1 Summary
The main thrust of this dissertation has been to propose and analyze the use of HMMs in a cascade
architecture, as a means of extracting meaning from information available in multiple input streams.
Our motivation for creating this model was based on our understanding of how people, especially
children, learn meaning. Fundamentally, we believe all of our understanding of the world is based on
information from our senses. Some of this understanding is symbolic, or conceptual, as expressed
especially by language. These symbolic concepts have particular representations in the various
senses, and are related through an underlying spatio-temporal structure. Working backwards, we
believe that concepts are learned by associating, from multiple senses, representations of information
that seem to be related spatially or temporally. The model presented here is our first attempt at
this goal.
For our implementation, we chose to work with stochastic models. This choice was based
on a number of motivating factors, including the well known theory of these models and their
close relationship with optimal Bayes classifiers. One of the most interesting motivating factors,
however, was highlighted in a very recent neuroscience article on early language acquisition [115].
The article states that, according to recent research, infants “use computational strategies to detect
the statistical and prosodic patterns in language input” [115, p. 831]. Thus, our use of stochastic
models for a similar purpose seems particularly suitable.
Our particular choice to use and extend HMMs for our stochastic model stems from their
inclusion of a notion of time and sequence in the model definition. While more expressive stochas-
tic models exist, HMMs are currently the most feasible for our project, both conceptually and
computationally. We found that we could run a cascade of small models (3-10 states, 5-10 dimen-
sional observations) in real time, including model updates, at around 50 iterations per second on
a 2.2-GHz desktop machine, although processing audio and video features on the same machine
significantly reduced this rate. Further computational optimizations and more advanced hardware
will make much more complex models computationally feasible. The relatively recent development
of robust, iterative learning algorithms for these models also contributed to their suitability for our
application.
Our HMM cascade model itself is robust in both theory and simulation. Since each submodel of
the cascade model is itself an HMM, and since we are updating the model parameters using recursive
maximum-likelihood estimation, each submodel will (under appropriate conditions) converge with
probability 1 to the best stochastic representation of the data, as the number of iterations goes to
infinity, even if that data was not produced by an HMM (and, to our knowledge, no real source of
data is produced by such a model). Of course, convergence in the limit does not necessarily lead to
convergence with finite amounts of data, so we provided simulations to show that in practice, the
cascade does converge to something useful in a reasonable amount of time.
The original motivation for this research was to implement, for our robot, an associative memory
for learning the symbolic concepts mentioned above. The model is currently implemented and
running in our robot, and has worked very well. We have been able to run the model as part of
a demonstration, learn concepts from auditory and visual cues in the environment, and use these
concepts to make decisions. An important perspective on this simple statement is that our model
converts analog inputs to discrete symbols, allowing the robot’s controller to make decisions
symbolically using discrete representations of the environment. Moreover, these symbols form the
basis needed for more complex symbolic manipulation, such as language.
5.2 Insights and Future Directions
There are a few additional insights we have gained while working on this project. Some of these
have been outlined in the course of this dissertation, and have become part of our premises. Others
are purely technical, but just as interesting. Often they suggest avenues for further research. We
highlight a couple of these insights below.
5.2.1 Derivation of recursive maximum-likelihood estimation algorithms
One technical insight we have gained concerns the development of RMLE for HMMs. The basic
derivation starts by writing the likelihood function for a sequence of observations 〈y1, . . . , yn〉 for a
model ϕ as
p_n(y_1, . . . , y_n) = ∏_{i=1}^{n} b(y_i; ϕ)′ u_i(ϕ),        (5.1)

where b(y_i; ϕ) is a vector of likelihoods for each class in the model, and u_i(ϕ) is a vector of prior
probabilities for each class in the model (see Section 2.3.1 and Appendix E). The derivation then
proceeds by taking the log of this function, which turns it into a sum of logs, and then uses the
partial derivatives of the last term in the sum, log[b(yn;ϕ)′un(ϕ)], to update the model parameters.
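To make the procedure concrete, here is a minimal sketch of one step of this recursion for a discrete-observation HMM. The filter update is standard; the parameter update shown is deliberately simplified, ascending only the last term of the log-likelihood with respect to the observation probabilities while treating the filter as constant (the full RMLE algorithm also propagates the derivatives of u_n and projects onto the constraint set):

```python
import numpy as np

def filter_step(u, A, B, y):
    """One step of the HMM prediction filter.
    u: prior state probabilities u_n; A: transition matrix (states x states);
    B: observation probabilities (states x symbols); y: observed symbol index.
    Returns u_{n+1} and the incremental log-likelihood log[b(y_n)' u_n]."""
    b = B[:, y]                    # b(y_n; phi): likelihood of y under each state
    like = b @ u                   # b(y_n; phi)' u_n(phi)
    u_next = A.T @ (b * u) / like  # normalized one-step prediction
    return u_next, np.log(like)

def obs_update(u, B, y, eps=0.05):
    """Simplified gradient step on the observation probabilities:
    d/dB[i, y] log(b(y)' u) = u[i] / (b(y)' u), followed by a small floor
    and renormalization of each row back onto the probability simplex."""
    grad = np.zeros_like(B)
    grad[:, y] = u / (B[:, y] @ u)
    B = np.clip(B + eps * grad, 1e-3, None)
    return B / B.sum(axis=1, keepdims=True)
```

Running `filter_step` and `obs_update` alternately over an observation stream gives the flavor of the recursion: each observation both advances the filter and nudges the parameters uphill on the incremental log-likelihood.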
This procedure is quite general, which implies that a version of RMLE could be derived for
almost any model for which the likelihood of a sequence of observations can be written in the
form of Equation (5.1). As noted before, we completed such a derivation for hidden semi-Markov
models, which appears in Appendix D. In addition, although the derivation does not appear here,
the likelihood of a set of observations for the switching HMM described in Section 3.3.2 can be
written in the form of Equation (5.1), implying that a similar derivation is possible. We believe
this technique may also be applicable to switching state-space models. Of course, there may be
other restrictions on the form of the model for convergence to hold, but we feel this is a direction
worth exploring.
5.2.2 Generative modeling
As highlighted in Chapter 3, our HMM cascade cannot generate outputs that depend on all model
parameters. In particular, this implies that, even though we can use our model to learn sequences
of concepts involving auditory and visual information, we cannot, for example, directly turn the
model outward and produce equivalent auditory output from a concept or sequence of concepts.
(In this discussion, we will temporarily ignore the fact that the speech produced would probably
not be intelligible.)
At first glance, it would seem that humans do not do this either. For example, our ears do basic
spectral processing of auditory signals, which is then processed by our brain. When we produce
speech output, however, the signal is not simply reversed and sent back out through our ears!
Instead, a set of signals controlling the muscles in the mouth, vocal cords, and diaphragm create
the desired output of the system. A similar discussion can ensue concerning visual inputs. That
said, we can, at the very least, imagine sounds and pictures inside of our head, so we can consider
the models we use for recognizing speech and images as generative models which can reproduce
speech and images inside our head.
That our HMM cascade is not generative in the strictest sense does not diminish its recognition
capabilities, but it may imply that other models may be more appropriate. In particular, the
cascaded switching HMM of Chapter 3 is a fully generative model. However, learning and inference
in this model would be considerably more complicated, both conceptually and computationally, and
therefore it may not yet be practical to implement on a robot. It would, however, be interesting to
study this and similar models more carefully.
5.2.3 A language-learning robot
The ultimate goal of our project is the creation of a language learning and understanding robot.
The work in this dissertation is a significant step toward that goal. In particular, we have offered
a mechanism for creating an internal symbolic representation of the outside world, which can be
used as the basis for decision making and more complex symbolic processing.
As a practical matter, some good engineering on the robotic implementation presented in this
dissertation would improve the quality of any future experiments. In particular, the speech pro-
cessing is currently somewhat error prone.
As of right now, this work is used as the basis for decision making in a finite state machine
controller. While this controller suffices, the types of behavioral decisions it makes are currently
hard-coded. A very useful area of research would be to study and implement a controller which can
learn behaviors based on previous experience using reinforcement learning [41]. Some work related
to this has already been conducted by Zhu and Levinson [49] for our own project, although it is not
incorporated into our current work. Another related avenue of behavior learning research would be
to study the use of partially observable Markov decision processes (POMDPs) [116] to replace the
controller.
With regard to language learning, a medium- to long-term goal would be to learn some more
complex spatial or temporal relationships among the concepts currently being learned semantically.
Especially as computing power increases, more complex stochastic grammars could augment or
replace the HMMs in the cascade model presented here. In the short term, S. Levinson has proposed
learning simple two-word grammars, which could be studied and built on top of the models presented
herein. Even this seemingly simple task presents significant challenges for the next generation of
intelligent robotics researchers.
5.3 Final Words
The work presented in this dissertation is part of an ambitious project which draws ideas from
a large number of areas. At times the sheer number of fields touched and vastness of knowledge
required to simply engineer the project has been very daunting and frustrating. At the same time,
the broad understanding and insight gained through this process has been extremely rewarding.
We are proud of our contributions, and hope that they help advance this project and its related
fields.
APPENDIX A
HARDWARE AND SYSTEM-LEVEL
SOFTWARE SPECIFICATIONS
A.1 Introduction
Over the years, our research has required the use of three different robots, as well as various comput-
ers and other hardware. In this chapter we list the specifications of recent hardware (the hardware
used in Illy and Norbert), as well as information about configuration and installed software.
A.2 Robots
The base unit for both Illy and Norbert is an Arrick Robotics second generation Trilobot.
A.2.1 Specifications
These specifications were taken from the Arrick Robotics website (http://www.robotics.com/
trilobot/).
Features:
• 12”×12”×12” body dimensions, 11 pounds.
• Dual differential drive with DC gear motors and encoders.
• Maximum speed: 10” per second.
• Surfaces: tile, concrete, low pile carpet, moderate bumps and inclines.
• 2 pound payload capacity for radio data link, embedded PC, etc.
• Thumb screws make removing panels easy.
• Removable battery pack uses 8 standard D-cells.
• Pan/tilt head positions sensors quickly.
• Stationary mast contains additional sensors including a digital compass.
• Gripper can grasp and lift cans and balls.
• Programmable control from user’s desktop PC or on-board embedded PC.
• Infrared communications from TV remote control and other Trilobots.
• RC receiver port allows control from an RC transmitter.
• PC-style joystick control port.
• 2 line x 16 character liquid crystal display.
• 16-key keypad.
• Sound effects and rudimentary speech (optional speech synthesizer).
• Sound recording and playback.
• Expansion port allows unlimited possibilities.
• Safe, low voltage system.
Sensors:
• 8 whiskers surround the base.
• 2 degree electronic compass.
• Sonar range finder can detect objects and their distance.
• Passive Infrared Motion Detector (PIR) detects movement of people.
• 4 light level sensors detect direction and intensity of light.
• Digital temperature sensor.
• Tilt sensors detect inclines in all directions.
• Water sensor detects puddles.
• Sound can be detected and stored.
• Motor speed and distance using optical encoders.
• Battery voltage can be monitored.
• Infrared detector can receive communications from remote control.
• Infrared emitters can communicate with other Trilobots.
At present, we do not use (or plan to use) all of these sensors, only those which fit our need for
anthropomorphism.
A.2.2 Configuration
Below we describe modifications to the base hardware for our two main robots, Illy and Norbert.
A.2.2.1 Illy
For Illy, in order to mount the small form-factor PC (described below), we added a support structure
around and above her head. This structure interferes with some of the sensors on the head mast
(e.g., compass), although we have no current plans to use most of these sensors. The structure also
required that we remove the handle used to pick up and carry Illy.
The original hardware had support for a single video camera on the head. We chose instead to
mount a pair of miniature cameras where the head is normally located, and attach the head to the
antenna on top of the aforementioned structure.
Normally, when the robot is turned on, it comes up in terminal mode, whereby it is controlled
by the control panel. We control the robot via a serial interface, so we set the startup mode to
“Command Mode.”
A.2.2.2 Norbert
In Norbert, we purchased a smaller computer that fits internally in the robot’s storage bay, but
which required us to move the control panel out a couple of inches. As with Illy, we again added
stereo cameras, but mounted them above the original head. We also bring the robot up in “Command
Mode.”
A.3 Computers
As mentioned above, each of the robots contains a small form factor computer on board, mainly
for the purpose of collecting images and sounds. These computers and related hardware are listed
in Table A.1. In addition to the robots, we run our demonstrations on two Linux workstations,
described in Table A.2.
Table A.1: Computing hardware mounted on robots.

Illy:
• Ampro Little Board P5x (discontinued)
• 266 MHz Pentium processor (low power)
• PC/104+ (PCI/ISA) expansion slot
• 256 MB RAM
• 11 Mbs wireless Ethernet (RadioLAN proprietary, external; connected to on-board Ethernet)
• HD: 64 MB compact flash
• OS: Debian GNU/Linux 3.0 (root file system mounted over NFS)
• Expansion cards (PC/104+): sound card: MicroSpace MSMM104; framegrabber card: Sensory 311 (×2)

Norbert:
• Digital-Logic MSM-P3 SEN
• 700 MHz Pentium III processor
• PC/104+ (PCI/ISA) expansion slot
• 128 MB RAM
• 11 Mbs wireless (802.11b)
• HD: 1 GB IBM Microdrive
• OS: Debian GNU/Linux 3.0
• Expansion cards (PC/104+): sound card: MicroSpace MSMM104; framegrabber card: Sensory 311 (×2)
Table A.2: Computer workstations.

Hal:
• Dell Opteron
• 866 MHz Pentium III processor
• 512 MB RAM
• 10 Mbs Ethernet
• HD: 20 GB
• OS: Debian GNU/Linux 3.0

Sal:
• Champaign Computer
• 2.2 GHz Pentium IV processor
• 512 MB RAM
• 10 Mbs Ethernet
• HD: 40 GB
• OS: Debian GNU/Linux 3.0
APPENDIX B
MOBILE ROBOT SOFTWARE
B.1 Introduction
The work described earlier in this dissertation depends greatly on a suite of software developed by
various members of our group over the years. This appendix describes some of the software relevant
to this dissertation. Specifically, Section B.2 describes a distributed computing and communications
system fundamental to all research currently done on our robots. I did almost all of the design and
most of the implementation of this system.
Following the description of this system, Section B.3 describes a speech feature extraction
algorithm developed by M. Kleffner, and Section B.4 describes an object segmentation and feature
extraction algorithm developed by R. S. Lin. Both of these systems are the basis of features used
by the robot implementation of our cascade of HMMs, described in Chapter 4.
B.2 Distributed Computing and Communication System
We want to provide our robot with functional equivalents for much of the sensory-motor and
decision-making periphery in humans. In addition to the need for hardware equivalents for such
organs as eyes and ears, we need computational equivalents for components of the cognitive
framework—sensory processing, learning and memory, decision making, and outputs. Most com-
puting modules should run independently, and because of hardware constraints, may not even be
run on the same system. This section describes the software framework we have developed for
communication among the various modules.
B.2.1 System design
B.2.1.1 System modules
Our group has developed various processing and learning modules for our robotic system. Currently,
the modules used in our main demonstrations include (with attribution to the developer):
1. Audio/video servers (K. Squire) handle acquisition of stereo audio and video data on the
robot.
2. Sound source localization (D. Li) determines the direction from which sounds are coming.
3. Visual processing and object recognition (R. S. Lin) processes visual information in order to
find “interesting” objects.
4. Central memory (M. McClain) stores state information about the world.
5. Decision making and navigation (M. McClain) provides the finite state machine controller and
autonomous navigation code.
6. Speech output (M. Kleffner/M. McClain) speaks short phrases.
7. Control server (K. Squire) handles direct control of the robot hardware.
8. HMM-based associative memory (K. Squire) provides basic learning and recognition of au-
dio/video semantic information.
These modules are all connected together to construct the cognitive cycle framework shown in
Figure 1.1 (p. 5). Our main concern here is passing information among these components.
B.2.1.2 Implementation issues
There are a number of implementation issues we need to consider. In particular, our biggest
limitation is hardware. Power on board the robot is limited, and the particular robot we have
chosen to work with has little space for holding additional equipment. Thus, on-board processing
is limited, and we must shift much of the processing to other computers. We are also restricted to
a relatively low-bandwidth wireless link, which limits the amount of data we can transmit. Despite
these restrictions, we still desired to meet a goal of iterating through the complete cognitive cycle
three to five times per second.
The actual implementation of most of the modules in our robot is described elsewhere [31, 32,
48, 50]. In the next section we will describe the implementation of the communications framework
and related modules (audio, video, and control servers).
B.2.2 Implementation
Below we enumerate a number of requirements for the communications framework:
1. The framework should allow multiple modules to access the same data simultaneously
(e.g., speech audio processing and sound source localization).
2. Modules should have near real-time access to acquired or processed data.
3. Data access should be transparent, even if the module and data source are on different systems.
4. The interface used to access the data should be simple and consistent.
At the time we began this project, we found no system that met these needs perfectly. A
description of the system we developed follows.
B.2.2.1 IServer (audio/video/data/control server)
The IServer program is a general purpose server for facilitating and coordinating transparent
access to raw and processed data, as well as allowing robotic control. Here we will describe how
it is used for data acquisition and distribution, its usage, and some general comments about the
system.
Data acquisition. The data acquisition component deals with acquiring data (for example, raw
or processed audio or video) and distributing it to other modules that might need it. Here is how
it works.
For a particular data source, the program sets up a ring buffer in shared memory. There are
two types of processes which access this ring buffer:
[Figure B.1 diagram: a Sound Card source fills one segment of the audio ring buffer (“filling”)
while Speech Recognition and Sound Source Location sinks read “full” segments.]
Figure B.1: Audio ring buffer. This diagram shows how a source may be writing to one segment of the buffer, while multiple sinks may be reading from another segment.
1. A source process fetches (or creates) the raw data and writes it to the ring buffer.
2. A sink process reads and processes the data.
One example of a source process is one which reads audio data from the sound card and writes it to
the ring buffer. An example of a sink process would be a sound source localization
program, which accesses the audio data and uses it to determine the direction a sound is coming
from. This setup can be seen in Figure B.1.
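A thread-based sketch of this setup follows. The actual system uses shared memory and semaphores so that sources and sinks can be separate processes; the class and method names here are illustrative only:

```python
import threading

class RingBuffer:
    """Segmented ring buffer: a source writes segments in sequence while
    multiple sinks read them. Each segment is guarded by its own lock, so
    a segment is never read while it is being filled (and vice versa)."""

    def __init__(self, nseg=8):
        self.segments = [None] * nseg
        self.locks = [threading.Lock() for _ in range(nseg)]
        self.seq = 0                     # sequence number of the next write

    def write(self, data):
        """Source side: fill the next segment in sequence."""
        i = self.seq % len(self.segments)
        with self.locks[i]:
            self.segments[i] = (self.seq, data)
        self.seq += 1

    def read(self, seq):
        """Sink side: read the segment slot holding sequence number `seq`
        (or whatever newer data has since wrapped into that slot)."""
        i = seq % len(self.segments)
        with self.locks[i]:
            return self.segments[i]
```

Because each read returns a (sequence number, data) pair, a sink can detect when it has fallen behind the source and a segment it wanted has been overwritten.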
There are a number of benefits to this setup:
1. More than one sink process can use the same data on the same machine.
2. A program processing the data (a sink) does not need to worry about the details of obtaining
the data from the hardware. Access to data is consistent and easy.
3. A sink process may transparently reside on a different machine than the original source (as
described next).
Because of the demanding requirements of the input processing and the limited computing power
available on the robot, much of the processing takes place on other computers. The IServer
program includes a special sink process with the sole purpose of taking the data in the ring buffer
and sending it to another machine, where a corresponding source process receives this data and
[Figure B.2 diagram: on Illy (the robot), a Sound Card audio source fills an audio ring buffer read
by a Sound Source Location sink and an Audio Server sink; the Audio Server sends the data over
the network to a remote Audio Source on Hal (a workstation), which fills Hal’s own audio ring
buffer for a Speech Recognition sink and a further Audio Server sink.]
Figure B.2: Audio ring buffer on multiple machines. In this figure, a sink on one machine sends audio data to a source on another machine, which is then used to fill the ring buffer on the second machine. Other sink processes still access data in the same manner.
writes it to a ring buffer on its machine. A sink process on the second machine accesses this data
in exactly the same manner as if it were on the original machine.
Figure B.2 demonstrates this, again using audio processing as an example. In this setup, the
sound source location program is running on Illy (our robot), and accesses audio from the ring
buffer as before. A speech recognition program is running on another workstation (Hal), and needs
access to the same audio data. To get it, an audio server running on Illy takes data from the ring
buffer and sends it to the audio source process on Hal, which writes it to its ring buffer just like any
other source of audio data. The speech recognition program reads the data in the same manner as
before. The ring buffer on Hal may also have an audio server which sends the audio data to other
machines.
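The forwarding path can be sketched as a pair of functions, one for the forwarding sink and one for the remote source process. The framing here is simplified (the real segment header also carries endianness, a time stamp, and padding), and the function names are illustrative:

```python
import socket
import struct

# Simplified segment frame: little-endian byte count and sequence number.
HEADER = struct.Struct("<IQ")

def forward_segment(sock, seq, data):
    """Forwarding sink: frame one ring-buffer segment and send it."""
    sock.sendall(HEADER.pack(len(data), seq) + data)

def receive_segment(sock):
    """Remote source: read one framed segment; returns (seq, data)."""
    length, seq = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return seq, _recv_exact(sock, length)

def _recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf
```

On the receiving machine, `receive_segment` would hand each (sequence number, data) pair to a source process that writes it into the local ring buffer, exactly as a sound-card source would.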
For each source of data (audio or video), there is a ring buffer set up on the source machine and
on each machine which needs the data. Since the robot is on a wireless link with limited bandwidth,
it is possible to set up the system to only transmit data once over the wireless link, and retransmit
as necessary from the receiving machine to other machines on the network. For example, suppose
two machines, Hal and Chadwick, need access to audio data from Illy. The best setup would be
to have Hal receive the data from Illy, and then resend it to Chadwick. This does cause a greater
time lag for programs needing the data, but is arguably better than saturating the wireless link.
Other features of the data acquisition system:
• The ring buffer is divided into segments, the total number and size of which depends on the
data type. Each segment is protected by a locking semaphore, so that a source process will
not write to any block that’s being read from, and a sink process will not read from a block
that’s being written to.
• Each segment of data includes a generic header specifying the endianness, byte count, a time
stamp, and a sequence number. There is also some padding added to the data structure to
take care of alignment issues on some architectures (e.g., SGI Octane).
• Data is sent in little-endian format (i.e., the format used on Intel PCs). Network byte order, by convention, is big-endian, so we are going against convention; but since a majority of
our processing currently takes place on little-endian machines, this convention saves a lot of
conversions. Note that we have in the past used a big-endian SGI Octane for video processing,
and hence, on this machine we need to convert the byte order for data obtained from other
machines. Conversion is taken care of by the source process on the local machine.
• Access to successive segments can be specified as sequential or most recent. Specifying se-
quential indicates that subsequent requests for data from the buffer should retrieve the next
segment in sequence—i.e., segments are read in order. This is the default for audio sink pro-
cesses. Specifying most recent returns the most recent buffer. This is the default for video
sink processes.
• In addition to allowing multiple sinks, there can be more than one audio or video source.
For example, during an open house demonstration with a lot of background noise, a boomset
microphone could be attached to a workstation, and a separate ring buffer set up to receive
this audio and send it to the speech recognition program. The sound source localization could
still use the original audio from the robot.
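As a concrete illustration of the segment locking and access modes described above, the following is a minimal Python sketch. It is a toy model, not the actual IServer implementation; the class and method names are invented, and real segments would hold fixed-size audio or video blocks rather than arbitrary Python objects.

```python
import threading

class RingBuffer:
    """Toy model of a segment-locked ring buffer with 'sequential' and
    'most_recent' read policies. Names are illustrative, not the IServer API."""

    def __init__(self, num_segments):
        self.segments = [None] * num_segments
        self.locks = [threading.Lock() for _ in range(num_segments)]
        self.write_seq = 0  # next sequence number to write

    def write(self, data):
        idx = self.write_seq % len(self.segments)
        with self.locks[idx]:  # source never overwrites a segment being read
            self.segments[idx] = (self.write_seq, data)
        self.write_seq += 1

    def read(self, reader_seq, mode="sequential"):
        """Return (segment_seq, data, next_reader_seq), or None if empty."""
        if self.write_seq == 0:
            return None
        if mode == "most_recent":
            seq = self.write_seq - 1  # video default: newest segment
        else:
            # audio default: next in order, bumped up to the oldest survivor
            seq = max(reader_seq, self.write_seq - len(self.segments))
        idx = seq % len(self.segments)
        with self.locks[idx]:  # sink never reads a segment being written
            stored_seq, data = self.segments[idx]
        return stored_seq, data, seq + 1
```

A sequential reader that falls behind is bumped forward to the oldest segment still in the buffer, while a most-recent reader always jumps to the newest one.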
Function. The IServer program must be running on all machines (e.g., the robot Illy, and
workstations Hal and Sal). A sink program or process accesses ring-buffer data through a C++
library interface. To access data, the sink process must do the following:
1. Get the key for the particular data it wants (via the dbGet function). Here, the process can also specify some of the desired parameters of the data stream (e.g., number of channels, sampling rate, etc.).
2. Create an instance of the appropriate sink class (IVideoSink, IAudioSink).
3. Get metadata information via an appropriate function call (myGetVideoData, myGetAudio-
Data).
4. Get a pointer to a data segment using myGetSeg, use the data, and then release the segment
using myReleaseSeg. Lather, rinse, repeat (but just this step).
In response to the sink request, a source process will check the key request, and if the source is
not running, attempt to start it. If the request is for a data source on a different machine, the
source process will send a request to that machine to start a corresponding server process. This
server process is another sink process, and will again make a request for the same data on the local
machine, and then begin sending the data to the remote machine.
As mentioned above, the IServer program is written in C++ and is designed to be modular.
This modularity also allows a process to act as a data filter. We currently have two such filters
in our system: one uses code by M. Kleffner to extract speech features from audio, and another
uses code by R. S. Lin to segment and extract object features from visual inputs. In both cases, the
filter code includes a sink (an IAudioSink or IVideoSink, respectively). The data coming into the
sink is processed, relevant features are extracted, and the features are then treated as the output
of an IDataSource, which can then be used by any other program.
B.2.2.2 Additional system information
In addition to data acquisition and distribution, there are a couple of other essential components
of the communications framework.
The central memory is a short-term memory containing information about the state of the
world. The actual data stored here changes depending on the demonstration being run, but it may
include such things as the direction of interesting sounds or the current goals of the robot.
The control server is the main connection to the robot hardware and handles all robot control.
It is currently implemented as a finite state machine and is detailed in Section 4.4.1.
B.2.3 Discussion and future work
As mentioned, the current framework is complete, although there are always some areas that could
use some additional work.
The main issue right now is with performance. While the system works well enough to suit
our needs, there can be significant delays between the acquisition of data and when it is processed.
Some of this delay is inevitable, but some tweaking should allow the overall delays to be reduced.
The system, as it works now, is quite robust, but there are rare circumstances in which it
breaks. While rudimentary monitoring programs exist, we hope to add a central monitor to the
framework, to better understand when the system is not functioning properly.
B.3 Speech Feature Extraction
The discussion below is summarized from M. Kleffner’s master’s thesis [48].
B.3.1 Introduction
M. Kleffner developed code for speech feature extraction and synthesis, in order to establish a
means of vocal expression for the robots. Our main use is for the feature extractor, which we use
for recognition. Below we briefly discuss the background and design of the system as used on our
robot. For full details, please see [48].
B.3.2 Background
The system described here was developed for the purpose of extracting speech features from audio
suitable for speech synthesis. Conceptually, the easiest way to synthesize speech is to simulate the
human vocal tract, and the simplest model for doing so is a linear source-filter model. This model
assumes a spectrally uniform source (representing the vocal folds) processed by a filter representing
the vocal tract. The filter contains resonances at vocal tract frequencies, as well as the spectral
slope of the waveform. Thus, for a parameterized model of speech synthesis, we require a vocal
tract filter, the fundamental pitch, a voiced, unvoiced, or mixed source, and a voicing confidence
score. For the purposes of recognition, we will only use the filter coefficients and voicing confidence
from this parameterization, and in addition will calculate the log-energy of the original source.
B.3.2.1 Linear prediction
One of the simplest and most efficient ways to estimate the spectral shape of the vocal tract is
to use linear prediction (LP). For a complete introduction to linear prediction, see Chapter 8 of
Rabiner and Schafer [117]. The linear prediction problem requires us to find p coefficients such that
the current sample can be accurately predicted from the previous p samples using
s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + e(n),    (B.1)
where s(n) is the speech signal, ak are the coefficients being estimated, and e(n) is the prediction
error in the estimation. To find the best set of coefficients, the squared prediction error
E_n = \sum_{m} \left( s_n(m) - \sum_{k=1}^{p} a_k\, s_n(m-k) \right)^{2}    (B.2)
is minimized by taking the derivative of En with respect to ak and finding the least-squares solution.
See [48, 117] for details.
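To make the least-squares solution concrete, here is a small Python sketch that estimates the coefficients a_k by solving the autocorrelation normal equations with the Levinson-Durbin recursion. It is an illustration of the cited method, not the robot's code; the AR(2) test signal and its parameters are invented for the demonstration.

```python
import random

def autocorrelation(s, p):
    """Autocorrelations r[k] = sum_n s[n] s[n-k], for lags k = 0..p."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s))) for k in range(p + 1)]

def levinson_durbin(r):
    """Solve the LP normal equations for a_1..a_p in
    s(n) ~= sum_k a_k s(n-k), given autocorrelations r[0..p]."""
    a, err = [], r[0]
    for i in range(1, len(r)):
        # reflection coefficient for order i
        k = (r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))) / err
        # order-update of the predictor coefficients
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        err *= 1.0 - k * k
    return a

# Sanity check on a synthetic AR(2) signal (parameters invented for the demo):
# s(n) = 0.5 s(n-1) - 0.3 s(n-2) + white noise.
random.seed(1)
s = [0.0, 0.0]
for _ in range(5000):
    s.append(0.5 * s[-1] - 0.3 * s[-2] + random.gauss(0.0, 1.0))
a_hat = levinson_durbin(autocorrelation(s, 2))
```

On this synthetic signal the estimated coefficients land close to the true AR parameters, which is exactly the sense in which (B.2) is minimized.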
B.3.2.2 Warped linear prediction
Although LP is optimal in the least-squares sense, it calculates error based on a uniform spectrum.
However, humans have better frequency resolution at lower frequencies. Warped linear prediction
(WLP) warps the input signal spectrum in a way that is more faithful to the frequency resolution of the human ear. Because of this property, around half as many coefficients are required for perceptual performance equivalent to standard LP. Because of these nice features, the
implementation in our robot uses WLP.
Also note that, normally, LPCs are not used for recognition because they tend to have poorer
qualities than warped scale representations, such as mel-frequency cepstral coefficients (MFCCs).
However, because warped LPCs are calculated using the Bark-scale model of the human ear, they may be more suitable than LPCs for speech recognition.
See [118–121] for more information about warped LPCs, and [48] for details on our implemen-
tation.
B.3.2.3 Log area ratios
Although linear predictive coefficients (LPCs) and warped linear predictive coefficients (warped LPCs) are an optimal representation of a one-dimensional vocal tract, linear combinations or quantized versions of LPCs (such as those learned by a classifier) generally correspond to unstable or meaningless filters.
To alleviate this problem, we can convert the LPCs into a form that can be linearly combined and
still represent a meaningful filter. One way to do this is to convert LPCs to the corresponding
reflection coefficients (RCs) of a one-dimensional vocal tract tube model. The reflection coefficients
ki are generally obtained during the calculation of the LPCs when that calculation is done using the
Levinson-Durbin algorithm [117], but can also be calculated by iterating i in the following recursion
from p down to 1:
k_i = a_i^{(i)},    (B.3)

a_j^{(i-1)} = \frac{a_j^{(i)} + a_i^{(i)}\, a_{i-j}^{(i)}}{1 - k_i^2}, \qquad 1 \le j \le i-1,    (B.4)

with the initial condition

a_j^{(p)} = a_j, \qquad 1 \le j \le p.    (B.5)
Reflection coefficients can guarantee a stable filter, but the spectrum is sensitive to RCs with large magnitudes. However, it has been shown [122] that log-area ratios (LARs) have near-uniform spectral sensitivity, allowing them to be easily combined and quantized. LARs are defined by

g_i = \log\left(\frac{A_{i+1}}{A_i}\right) = \log\left(\frac{1 - k_i}{1 + k_i}\right), \qquad 1 \le i \le p,    (B.6)
where Ai is the area of a segment of the one-dimensional tube model of the vocal tract.
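The step-down recursion (B.3)-(B.5) and the LAR map (B.6) are easy to sketch in Python. This is an illustrative implementation, not Kleffner's code; rc_to_lpc is a helper (the forward Levinson order-update) included only so the round trip can be checked.

```python
import math

def rc_to_lpc(ks):
    """Build order-p predictor coefficients from reflection coefficients
    k_1..k_p via the forward Levinson order-update (test helper)."""
    a = []
    for i, k in enumerate(ks, start=1):
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
    return a

def lpc_to_rc(a):
    """Step-down recursion (B.3)-(B.5): recover k_1..k_p from a_1..a_p.
    Assumes a stable filter, i.e., |k_i| < 1 at every step."""
    a = list(a)
    ks = []
    for i in range(len(a), 0, -1):
        k = a[-1]                       # (B.3): k_i = a_i^(i)
        ks.append(k)
        # (B.4): a_j^(i-1) = (a_j^(i) + k_i a_{i-j}^(i)) / (1 - k_i^2)
        a = [(a[j] + k * a[i - 2 - j]) / (1.0 - k * k) for j in range(i - 1)]
    return ks[::-1]

def rc_to_lar(ks):
    """(B.6): g_i = log((1 - k_i) / (1 + k_i))."""
    return [math.log((1.0 - k) / (1.0 + k)) for k in ks]
```

Converting LPCs down to reflection coefficients and then to LARs, and recovering the original reflection coefficients, confirms the recursion is the exact inverse of the Levinson order-update.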
B.3.2.4 Voicing confidence
When reproducing speech for synthesis, it is necessary to know whether the speech is voiced or
unvoiced. Because this is a necessary feature when producing speech, it is also a good discriminating
feature for the speech and is part of the feature vector we use for recognition. Rather than make
a hard decision about voicing for any particular segment, a voicing confidence score (VCS) is
produced, which indicates the mix of pulse train and white noise necessary to reproduce the speech.
In our robot, the voicing confidence is calculated in the first half of the current segment using
c_t^{(1)} = \frac{\sum_{n=0}^{0.5N-1} s_n s_{n+t}}{\sqrt{\sum_{n=0}^{0.5N-1} s_n s_n \sum_{n=0}^{0.5N-1} s_{n+t} s_{n+t}}},    (B.7)
with t an integer in the range [pitchperiod− searchmin, pitchperiod+ searchmax]. For the second
half of the segment, the VCS is calculated similarly, using
c_t^{(2)} = \frac{\sum_{n=0.5N}^{N-1} s_n s_{n-t}}{\sqrt{\sum_{n=0.5N}^{N-1} s_n s_n \sum_{n=0.5N}^{N-1} s_{n-t} s_{n-t}}}.    (B.8)
The total VCS is given by
c_{\mathrm{total}} = \mathrm{clip}\!\left( \max\!\left[ \max_t \left( c_t^{(1)} \right),\ \max_t \left( c_t^{(2)} \right) \right] \right).    (B.9)
This complicated scheme is used so that we can calculate the voicing confidence without depending on sample values outside of the current segment (i.e., neither $c_t^{(1)}$ nor $c_t^{(2)}$ depends on values of $s_n$ outside the current segment). A final VCS score is calculated to produce zeros for strongly
unvoiced segments:
c_f = \frac{\mathrm{clip}\left( c_{\mathrm{total}} - V_{\mathrm{thresh}} \right)}{1 - V_{\mathrm{thresh}}},    (B.10)
where Vthresh is set at 0.25 for our experiments.
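A direct Python transcription of (B.7)-(B.10) follows. It is a sketch rather than the robot's implementation: the thesis does not state the range of clip(), so clipping to [0, 1] is an assumption here, as are the argument names.

```python
import math

def voicing_confidence(s, pitch_period, search_min, search_max, v_thresh=0.25):
    """Voicing confidence per (B.7)-(B.10); clip() to [0, 1] is assumed."""
    N = len(s)
    h = N // 2
    clip = lambda x: max(0.0, min(1.0, x))

    def ncc(start, stop, sign, t):
        # Normalized cross-correlation of s[n] against s[n + sign*t],
        # summed over n in [start, stop): (B.7) with sign = +1,
        # (B.8) with sign = -1.
        num = sum(s[n] * s[n + sign * t] for n in range(start, stop))
        d1 = sum(s[n] * s[n] for n in range(start, stop))
        d2 = sum(s[n + sign * t] * s[n + sign * t] for n in range(start, stop))
        return num / math.sqrt(d1 * d2) if d1 > 0.0 and d2 > 0.0 else 0.0

    lags = range(pitch_period - search_min, pitch_period + search_max + 1)
    c1 = max(ncc(0, h, +1, t) for t in lags)   # first half, (B.7)
    c2 = max(ncc(h, N, -1, t) for t in lags)   # second half, (B.8)
    c_total = clip(max(c1, c2))                # (B.9)
    return clip(c_total - v_thresh) / (1.0 - v_thresh)   # (B.10)
```

As expected, a strongly periodic segment whose period falls inside the search range scores near 1.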
B.3.2.5 Log energy
Log-energy for each segment is calculated using
e_n = \log \sqrt{ \sum_{n=0}^{N-1} s_n s_n }.    (B.11)
B.3.3 Design and implementation
As mentioned above, the audio feature extraction was originally meant to extract features useful
for speech synthesis. Typically in speech coders, standard frame length and spacing for audio
[Figure B.3 block diagram: the input s(t) passes through a warped linear predictor, producing WLPCs and energy; the WLPCs are converted to LARs, and a voicing confidence estimator produces the VCS; the LARs, energy, and VCS are combined into the feature vector.]
Figure B.3: Block diagram describing audio feature generation.
waveform analysis are 30 ms and 10 ms, respectively, and these are the segment sizes in our original
implementation. These choices give a rate of 100 features per second. For our purposes, however,
we did not require this resolution, so we slowed down the feature rate to 50 features per second, by
using a frame length of 60 ms and spacing of 20 ms. Since the microphones attached to our robot
are extremely noisy, we chose to use audio from a close-talk microphone attached to a remote PC.
This audio was sampled at 16 kHz, corresponding to a frame length of 960 samples and a frame
spacing of 320 samples.
The block diagram for the feature extractor is given in Figure B.3. As can be seen in the
diagram, we extract warped linear predictive coefficients from the input signal, which are then
converted to log-area ratios. Kleffner suggests using 8-12 coefficients for 16-kHz audio. However,
for recognition purposes, a much coarser representation will suffice, and we only calculate three
LARs. Interestingly, we can still synthesize intelligible speech using this coarse representation. To
round out the feature vector, we also calculate log energy and voicing, described above.
B.4 Visual Object Segmentation and Feature Extraction
This section describes the image segmentation and feature extraction algorithm developed and
implemented on our robot by Lin [123].
B.4.1 Problem description
Given images collected from the camera mounted on the robot, we want to distinguish objects
in these images from the background. In our robot experiments, we control the environment
by bounding it with white painted walls; the floor also has a white marble texture. However,
even in this restricted setting, there is still uncontrollable noise present in the environment and
in the robot’s sensors, making the segmentation process non-trivial. In our work, we address the
image segmentation problem using Markov random fields and apply a coarse-to-fine loopy belief
propagation to obtain an approximate solution. Our experiments demonstrate good segmentation
results.
B.4.2 Pairwise Markov random fields
Markov random fields have been widely applied to early vision problems, including optical flow,
stereo vision, and image restoration [124–126]. Here we model the formation of an image using a
square lattice pairwise Markov random field. In this setting, each pixel in the image is connected
to a node in the lattice. In addition, for adjacent pixels in the images, their corresponding nodes
are also connected in the lattice. The values of the nodes are discrete and finite. In our work, a
node can have two possible values: foreground or background, denoting whether the image pixel connected to it is a foreground or a background pixel.
Let $y_{ij}$ be an image pixel and $x_{ij}$ the node it connects to in the lattice, and let $Y = \{y_{ij}\}$ be the whole image and $X = \{x_{ij}\}$ the whole lattice. The joint probability $P(X,Y)$ can be described by
P(X,Y) = \frac{1}{Z} \prod_{(ij,kl)} \psi(x_{ij}, x_{kl}) \prod_{ij} \phi(x_{ij}, y_{ij}),    (B.12)
where ψ(xij , xkl) and φ(xij , yij) are predetermined potential functions, and Z is a scale factor.
Under this model, image segmentation becomes an inference problem. Given image Y , the optimal
segmentation X∗ is defined as
X^* = \arg\max_{X} P(X,Y).    (B.13)
There exists a potential problem in this formulation. Since the number of possible values of $X$ grows exponentially with the size of $Y$, computation of $X^*$ becomes intractable when the size of $Y$ is large. Therefore, an approximation method has to be adopted. In our work, instead
of computing P (X,Y ), we measure the marginal probability P (xij|Y ) and determine the best label
of xij according to
x_{ij}^* = \arg\max_{x_{ij}} P(x_{ij} \mid Y).    (B.14)
B.4.3 Local message passing algorithm
$P(x_{ij} \mid Y)$ can be approximated by an iterative, local message-passing algorithm called belief propagation [127]. At iteration $n$, $m^n_{(ij,kl)}(x_{kl})$ is the message passed from $x_{ij}$ to $x_{kl}$, defined as

m^n_{(ij,kl)}(x_{kl}) = \alpha \sum_{x_{ij}} \psi(x_{ij}, x_{kl})\, \phi(x_{ij}, y_{ij}) \prod_{(g,h) \in \Gamma(i,j) \setminus (k,l)} m^n_{(gh,ij)}(x_{ij}),    (B.15)
where $\alpha$ is a scaling constant, and the set $\Gamma(i,j)$ contains all neighbors of $x_{ij}$. With the messages known, the marginal distribution $P(x_{ij} \mid Y)$ at iteration $n$ is defined as

P^n(x_{ij} \mid Y) = \gamma\, \phi(x_{ij}, y_{ij}) \prod_{(g,h) \in \Gamma(i,j)} m^n_{(gh,ij)}(x_{ij}),    (B.16)
where γ is a scale factor.
If the Markov random field has a tree structure, $P^n(x_{ij} \mid Y)$ will converge to $P(x_{ij} \mid Y)$ after a message from each node has propagated to all other nodes. If the Markov random field is not a tree, the belief propagation algorithm is not guaranteed to converge. However, even under this circumstance, empirical results show that the belief propagation algorithm can still achieve excellent performance in many applications.
In our implementation, we use the max-product algorithm [128] to approximate equation (B.15)
by
m^n_{(ij,kl)}(x_{kl}) = \beta \max_{x_{ij}} \psi(x_{ij}, x_{kl})\, \phi(x_{ij}, y_{ij}) \prod_{(g,h) \in \Gamma(i,j) \setminus (k,l)} m^n_{(gh,ij)}(x_{ij}),    (B.17)
where $\beta$ is a scale factor. This approximation not only reduces the computation needed, but also enables us to compute messages and marginal distributions in log space.
B.4.4 Image segmentation
There are two types of image features used in our segmentation algorithm: color and pixel intensity
gradient. The two features are used by the two potential functions, φ(xij , yij) and ψ(xij , xkl),
respectively. The definition of our two potential functions will be explained below. With the
potential functions set, we can run belief propagation through iterations to compute the marginal
distribution $P^n(x_{ij} \mid Y)$. In our implementation, we compute the messages in a coarse-to-fine manner. This approach enables us to reduce the heavy computation without seriously deteriorating our approximation of $P^n(x_{ij} \mid Y)$.
B.4.4.1 Potential functions
Knowing that the background is mostly white, we built a white pixel model based on a set of
images containing only the background of the environment. In order to remove intensity information from our color feature, we use the following color invariants, proposed in [129]:
f_{rgb} = \left( \frac{R}{\max(G,B)},\ \frac{G}{\max(R,B)},\ \frac{B}{\max(R,G)} \right).    (B.18)
We model the distribution of white color features as a Gaussian function. Since there is noise in
both the environment and the cameras, we use robust regression [130] to remove outlier pixels in
the training images. Let µ and C be the mean and covariance matrix of our white pixel model. We
then define φ(xij , yij) as
\phi(x_{ij}, y_{ij}) = \begin{cases} \mathcal{N}(f_{rgb}(y_{ij}); \mu, C) & \text{if } x_{ij} = \textit{background} \\ \kappa & \text{if } x_{ij} = \textit{foreground} \end{cases},    (B.19)
where κ is a constant. That is, if the whiteness likelihood of pixel yij exceeds κ, it is more likely to
be a background pixel. Otherwise, it is likely to be a foreground pixel.
The other potential function ψ(xij , xkl) describes the relationship between latent variables xij
and xkl. In our segmentation experiment, if xij = xkl, we expect the image intensities of the two
pixels yij and ykl to be similar. Otherwise, we expect a sharp intensity difference between yij and
ykl. In addition, we include a bias that favors the same label on adjacent variables. By combining
all of these constraints, we define ψ(xij , xkl) as
\psi(x_{ij}, x_{kl}) = \begin{cases} \exp\left( \mathrm{dif}(y_{ij}, y_{kl}) - K \right) & \text{if } x_{ij} \neq x_{kl} \\ \exp\left( -\mathrm{dif}(y_{ij}, y_{kl}) \right) & \text{if } x_{ij} = x_{kl} \end{cases},    (B.20)
where dif(yij, ykl) is the intensity difference between yij and ykl in absolute value.
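To make the pieces concrete, the following Python sketch runs log-domain max-product belief propagation, per (B.15)-(B.17), with a unary potential in the spirit of (B.19) and the pairwise potential (B.20). It is a toy: a one-dimensional grayscale "whiteness" model stands in for the color model, the constants kappa, K, and sigma are invented, and there is no coarse-to-fine schedule.

```python
import math

BG, FG = 0, 1

def segment(img, iters=10, kappa=0.4, K=0.5, sigma=0.15):
    H, W = len(img), len(img[0])

    def log_phi(y, x):
        # Unary potential in the spirit of (B.19): Gaussian whiteness
        # likelihood (mean 1.0) for background, constant kappa for foreground.
        if x == BG:
            return (-0.5 * ((y - 1.0) / sigma) ** 2
                    - math.log(sigma * math.sqrt(2.0 * math.pi)))
        return math.log(kappa)

    def log_psi(yi, yk, xi, xk):
        # Pairwise potential (B.20), in log space.
        d = abs(yi - yk)
        return (d - K) if xi != xk else -d

    def neighbors(i, j):
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if 0 <= i + di < H and 0 <= j + dj < W:
                yield (i + di, j + dj)

    # One message per directed edge, per label, initialized to 0 (log 1).
    m = {((i, j), n): [0.0, 0.0]
         for i in range(H) for j in range(W) for n in neighbors(i, j)}

    for _ in range(iters):  # synchronous max-product updates, per (B.17)
        new = {}
        for (src, dst) in m:
            i, j = src
            out = []
            for xk in (BG, FG):
                out.append(max(
                    log_phi(img[i][j], xi)
                    + log_psi(img[i][j], img[dst[0]][dst[1]], xi, xk)
                    + sum(m[(n, src)][xi] for n in neighbors(i, j) if n != dst)
                    for xi in (BG, FG)))
            z = max(out)  # normalize to keep messages bounded
            new[(src, dst)] = [v - z for v in out]
        m = new

    # Beliefs, per (B.16): unary plus all incoming messages; take the argmax.
    labels = [[0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            belief = [log_phi(img[i][j], x)
                      + sum(m[(n, (i, j))][x] for n in neighbors(i, j))
                      for x in (BG, FG)]
            labels[i][j] = belief.index(max(belief))
    return labels
```

On a small white image with a dark blob, the beliefs correctly label the blob as foreground and everything else as background.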
B.4.4.2 Coarse-to-fine iteration
The resolution of our images is 320 × 240, so it takes a certain number of iterations to propagate messages from one end of the image to the other. In order to speed up message propagation, we execute belief propagation in a coarse-to-fine manner, as suggested by Felzenszwalb and Huttenlocher [125]. Starting at a coarse level, we divide the image into a number of blocks and run
belief propagation based on these blocks. After a small number of iterations, we decompose each
block into a number of smaller blocks and copy the messages of the original block to these blocks.
We then run belief propagation based on this new set of blocks. The process continues until each
block contains exactly one image pixel. A detailed description of this coarse-to-fine algorithm is
explained in [125].
B.4.4.3 Feature extraction
Using the segmented image described above, we extract some features useful for object recognition.
In particular, we extract a normalized color histogram using the same color invariant pixels defined
in Equation (B.18). We also calculate the first moment and the height to width ratio. All of this
information is passed to the associative memory and made available to any other module which
needs it.
APPENDIX C
HIDDEN MARKOV MODEL
ALGORITHMS
C.1 Introduction
As described in Chapter 2, an HMM is a discrete-time stochastic process with two components, $\{X_n, Y_n\}$, where (i) $\{X_n\}$ is a finite-state Markov chain, and (ii) given $\{X_n\}$, $\{Y_n\}$ is a sequence of conditionally independent random variables, the conditional distribution of $Y_k$ depending on $\{X_n\}$ only through $X_k$. The name HMM arises from the assumption that $\{X_n\}$ is not observable, and so its statistics can only be ascertained from $\{Y_n\}$.
Generally, there are three problems of interest when talking about these models:
1. Given an observation sequence 〈y1, . . . , yn〉, find the likelihood pn(y1, . . . , yn;ϕ) of this se-
quence, given the model.
2. Given an observation sequence 〈y1, . . . , yn〉, find a “good” corresponding state sequence 〈x1, . . . , xn〉.
3. Adjust the model parameters ϕ to maximize the likelihood pn(y1, . . . , yn;ϕ).
Chapter 2 described a recursive solution to these problems. In this appendix, we summarize more
traditional batch techniques based on Baum-Welch reestimation and Viterbi decoding. The reader
is referred to [67] and [81] for more details on these algorithms.
The algorithms below use the model description notation described in Section 2.2.
C.2 Baum-Welch Algorithm
Baum-Welch reestimation, also known as the forward-backward algorithm, is an expectation-
maximization (EM) method used for reestimating HMM parameters. In addition to adjusting
model parameters, the procedure can also be used to determine the likelihood of a given observa-
tion sequence, as well as give a maximally likely state sequence corresponding to that observation
sequence. This is the most common method for learning HMM parameters.
Consider an observation sequence 〈y1, . . . , yn〉. The most direct way to calculate the likelihood of this sequence for a given HMM ϕ is to sum, over all possible state sequences, the probability of that sequence times the likelihood of the observations given that sequence; that is,
p_n(y_1, \ldots, y_n; \varphi) = \sum_{\langle x_1, \ldots, x_n \rangle \in R^n} p(y_1, \ldots, y_n, x_1, \ldots, x_n; \varphi)
= \sum_{\langle x_1, \ldots, x_n \rangle \in R^n} p(y_1, \ldots, y_n \mid x_1, \ldots, x_n; \varphi)\, P(x_1, \ldots, x_n; \varphi).    (C.1)
This calculation is computationally intractable, but a procedure known as the forward-backward
algorithm can calculate the probability efficiently. Define
αn(ϕ) = [αn1(ϕ), . . . , αnr(ϕ)]′, (C.2)
where
αni(ϕ) = p(y1, . . . , yn, Xn = i;ϕ) (C.3)
as the likelihood of the partial observation sequence 〈y1, . . . , yn〉 and state i at time n, given the
model. The vector αn(ϕ) defines a set of so-called forward probabilities. Setting α1(ϕ) = B(y1;ϕ)π,
we can solve for αn inductively as
αn+1(ϕ) = B(yn+1;ϕ)A(ϕ)′αn(ϕ). (C.4)
We can similarly define a set of backward probabilities βn(ϕ) as
βn(ϕ) = [βn1(ϕ), . . . , βnr(ϕ)]′, (C.5)
where
βni(ϕ) = P (yn+1, yn+2, . . . , yN |xn = i,ϕ). (C.6)
Setting βN (ϕ) = 1r, we can calculate βn(ϕ) using the backward recursion
βn(ϕ) = A(ϕ)B(yn+1;ϕ)βn+1(ϕ). (C.7)
Together, these functions can compute the likelihood P of the sequence at any time 1 ≤ ℓ ≤ n − 1 according to

P = p_n(y_1, \ldots, y_n; \varphi) = \alpha_\ell(\varphi)' A(\varphi) B(y_{\ell+1}; \varphi)\, \beta_{\ell+1}(\varphi).    (C.8)
If we set ℓ = n − 1, this equation becomes

P = \alpha_n(\varphi)' \mathbf{1}_r.    (C.9)
A similar formula exists using the backward probabilities.
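The forward recursion can be sketched directly in Python for discrete observations, and checked against the (in general intractable) direct sum (C.1) on a short sequence. The model parameters below are invented for the check.

```python
def forward_likelihood(pi, A, B, obs):
    """Forward recursion (C.4): alpha_{n+1} = B(y_{n+1}) A' alpha_n,
    with alpha_1 = B(y_1) pi; the likelihood is sum_i alpha_{ni} (C.9).
    B[j][y] is a discrete observation distribution b_j(y)."""
    r = len(pi)
    alpha = [B[i][obs[0]] * pi[i] for i in range(r)]
    for y in obs[1:]:
        alpha = [B[j][y] * sum(A[i][j] * alpha[i] for i in range(r))
                 for j in range(r)]
    return sum(alpha)

def brute_force_likelihood(pi, A, B, obs):
    """Direct sum over all r^n state sequences, per (C.1). Two states only,
    so paths are enumerated as bit patterns."""
    n = len(obs)
    total = 0.0
    for code in range(2 ** n):
        seq = [(code >> t) & 1 for t in range(n)]
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for t in range(1, n):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p
    return total
```

The recursion costs O(n r^2) instead of O(n r^n), yet the two agree to machine precision.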
The state sequence can be determined by looking at the most likely state at each time step.
This has the disadvantage that we may find a state sequence that is invalid for a given model; i.e.,
one in which xn = i and xn+1 = j, but for which the model has aij = 0. The Viterbi algorithm,
described in the next section, avoids this pitfall.
We can use the above calculations to reestimate model parameters. Let γij be the expected
number of transitions from state i to state j, conditioned on the observation sequence. This value
can be calculated with
\gamma_{ij} = \frac{1}{P} \sum_{\ell=1}^{n-1} \alpha_{\ell i}(\varphi)\, a_{ij}(\varphi)\, b_j(y_{\ell+1}; \varphi)\, \beta_{\ell+1,j}(\varphi).
Then the total expected number of transitions out of state i is given by
\gamma_i = \sum_{j=1}^{r} \gamma_{ij} = \frac{1}{P} \sum_{\ell=1}^{n-1} \alpha_{\ell i}(\varphi)\, \beta_{\ell i}(\varphi).    (C.10)
The ratio of these can be used to calculate an updated value for aij(ϕ), using
a_{ij}(\varphi) = \frac{\gamma_{ij}}{\gamma_i} = \frac{\sum_{\ell=1}^{n-1} \alpha_{\ell i}(\varphi)\, a_{ij}(\varphi)\, b_j(y_{\ell+1}; \varphi)\, \beta_{\ell+1,j}(\varphi)}{\sum_{\ell=1}^{n-1} \alpha_{\ell i}(\varphi)\, \beta_{\ell i}(\varphi)}.    (C.11)
Similar methods can be used to find update equations for bj(·;ϕ) = bjk(ϕ) for the case of observations from a finite alphabet,

b_{jk}(\varphi) = \frac{\sum_{\ell \mid y_\ell = k} \alpha_{\ell j}(\varphi)\, \beta_{\ell j}(\varphi)}{\sum_{\ell=1}^{n} \alpha_{\ell j}(\varphi)\, \beta_{\ell j}(\varphi)},    (C.12)

and for πi,

\pi_i = \frac{1}{P}\, \alpha_{1i}(\varphi)\, \beta_{1i}(\varphi).    (C.13)
While the above formulas were determined intuitively, it is possible to derive the same formulas rigorously, using either Lagrange methods or other optimization techniques. Similar formulas
are also available to estimate the parameters of a continuous observation density. See [67] or [81]
for more details.
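For discrete observations, one full reestimation step per (C.10)-(C.13) can be sketched as follows. This is an unscaled textbook implementation, suitable only for short sequences (real implementations scale the forward and backward variables), and the parameters in the check are invented; the EM property that the likelihood cannot decrease provides the test.

```python
def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch update per (C.11)-(C.13), discrete observations.
    Returns the updated parameters and the likelihood of obs under the
    *old* parameters."""
    r, n = len(pi), len(obs)
    # forward pass, per (C.4)
    alpha = [[0.0] * r for _ in range(n)]
    for i in range(r):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, n):
        for j in range(r):
            alpha[t][j] = B[j][obs[t]] * sum(A[i][j] * alpha[t - 1][i]
                                             for i in range(r))
    # backward pass, per (C.7)
    beta = [[0.0] * r for _ in range(n)]
    beta[n - 1] = [1.0] * r
    for t in range(n - 2, -1, -1):
        for i in range(r):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(r))
    P = sum(alpha[n - 1])
    # transition update (C.11)
    new_A = []
    for i in range(r):
        denom = sum(alpha[t][i] * beta[t][i] for t in range(n - 1))
        new_A.append([sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                          * beta[t + 1][j] for t in range(n - 1)) / denom
                      for j in range(r)])
    # observation update (C.12)
    new_B = []
    for j in range(r):
        denom = sum(alpha[t][j] * beta[t][j] for t in range(n))
        new_B.append([sum(alpha[t][j] * beta[t][j]
                          for t in range(n) if obs[t] == k) / denom
                      for k in range(len(B[0]))])
    # initial-state update (C.13)
    new_pi = [alpha[0][i] * beta[0][i] / P for i in range(r)]
    return new_pi, new_A, new_B, P
```

Iterating the step yields a monotonically non-decreasing likelihood, as the EM derivation guarantees.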
C.3 Viterbi-Based Algorithms
When determining the probability of an observation sequence, the forward-backward algorithm above took into account all possible state sequences and calculated P = pn(y1, . . . , yn;ϕ). We can also define P as the maximum joint probability of the observation sequence and the most likely state sequence for a given model, that is, P = maxX∈Rn p(y1, . . . , yn, x1, . . . , xn;ϕ). It is possible to calculate both
this probability and the most likely state sequence simultaneously through a dynamic programming
technique called the Viterbi algorithm.
The algorithm is defined as follows: let φ1i = πibi(y1;ϕ) for i = 1, . . . , r. Then as in the forward
and backward procedures, we can compute φ recursively by
\phi_{nj} = \max_{1 \le i \le r} \left[ \phi_{n-1,i}\, a_{ij} \right] b_j(y_n; \varphi),    (C.14)

and keep track of the best previous state (to state j) via

\psi_{nj} = \arg\max_{1 \le i \le r} \left[ \phi_{n-1,i}\, a_{ij} \right].    (C.15)
At the end of our input, we can determine the probability of the most likely sequence from
P = \max_{1 \le i \le r} \phi_{ni}.    (C.16)
To determine the best sequence, we let the final state xn = arg max1≤i≤r φni, and trace back the most likely sequence that ended in that state, using

x_{k-1} = \psi_k(x_k), \qquad k = n, n-1, \ldots, 2.
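The recursion (C.14)-(C.16) with backtrace can be sketched in Python as follows; the model parameters in the check are invented, and the result is compared against an exhaustive search over all state sequences.

```python
def viterbi(pi, A, B, obs):
    """Viterbi recursion (C.14)-(C.15) with backtrace; returns the most
    likely state sequence and its joint probability (C.16).
    B[j][y] is a discrete observation distribution b_j(y)."""
    r, n = len(pi), len(obs)
    phi = [[pi[i] * B[i][obs[0]] for i in range(r)]]
    psi = [[0] * r]
    for t in range(1, n):
        row, back = [], []
        for j in range(r):
            best_i = max(range(r), key=lambda i: phi[t - 1][i] * A[i][j])
            back.append(best_i)
            row.append(phi[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
        phi.append(row)
        psi.append(back)
    last = max(range(r), key=lambda i: phi[n - 1][i])
    path = [last]
    for t in range(n - 1, 0, -1):  # backtrace: x_{k-1} = psi_k(x_k)
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, phi[n - 1][last]
```

Because the returned path is built from the stored back-pointers, it can never contain a transition with a_ij = 0, avoiding the pitfall noted for per-step maximization.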
To determine reestimation formulas based on this model, we simply determine the best sequence
as above, and then use counting to reestimate the model parameters.
For each state i, if we count the number of transitions from state i to state j and divide that by the total number of transitions from state i, we should get a better estimate of aij(ϕ).
Following this idea, the reestimation formula for each aij(ϕ) is

a_{ij}(\varphi) = \frac{\text{number of transitions from state } i \text{ to state } j}{\text{number of transitions from state } i}.
A similar procedure can be used to estimate new parameters for each bj(yn;ϕ). Below we will
give an example for the simple case when the observation distribution in each state is defined by a
one-dimensional Gaussian pdf. Let bj(yn;ϕ) be defined by
b_j(y_n; \varphi) = \mathcal{N}(y_n; \mu_j(\varphi), \sigma_j(\varphi)),

where $\mathcal{N}$ is a Gaussian density with mean $\mu_j(\varphi)$ and variance $\sigma_j^2(\varphi)$. Using the same observation
sequence 〈y1, . . . , yn〉, and the same estimated state sequence 〈x1, . . . , xn〉 as above, the parameters
µj(ϕ) and σj(ϕ) for the updated model can be estimated by
µj(ϕ) = average of all yi observed while in state xi = j
σj(ϕ) = standard deviation of all yi observed while in state xi = j
After reestimation, the parameters of the model are replaced with the new values above, and the
calculation is repeated again for the entire observation sequence using the updated model. At
each iteration, p(y1, . . . , yn;ϕ) is guaranteed not to decrease, and ϕ slowly converges to a model that describes the observation sequence 〈y1, . . . , yn〉.
Note that counting should be done over long and/or many sequences before actual parameter
estimation, as the potential exists, for example, to reestimate an unlikely but possible transition
probability as zero, if it does not appear in the sequence(s) used for reestimation. A small prior
can be added to each probability to prevent this occurrence.
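The count-based update with an additive prior can be sketched as follows; the prior value 1e-3 is an arbitrary illustrative choice, not one taken from the thesis.

```python
def reestimate_transitions(state_seq, r, prior=1e-3):
    """Count-based reestimate of a_ij from a decoded state sequence, with
    a small additive prior so transitions that never appear in the
    sequence keep nonzero probability."""
    counts = [[prior] * r for _ in range(r)]
    for i, j in zip(state_seq, state_seq[1:]):
        counts[i][j] += 1.0
    result = []
    for row in counts:  # normalize each row into a stochastic vector
        s = sum(row)
        result.append([c / s for c in row])
    return result
```

Even a transition absent from the decoded sequence (below, 2 → 1) retains a small nonzero probability.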
The actual procedure described here is similar to the procedure IBM uses to update parameters in its ViaVoice speech recognition system.
APPENDIX D
HIDDEN SEMI-MARKOV MODELS
AND THE RMLE ALGORITHM
D.1 Introduction
When applying HMMs to speech and other continuous data, a general assumption is that each state
in the model represents a stationary interval over a data segment. By default, with a standard
HMM, the probability of duration of a state is modeled as a geometric distribution, which does
not accurately model the temporal structure of speech. To address this problem, Ferguson [94] proposed the idea of a variable duration hidden Markov model, which explicitly models the duration
of a given state by a probability mass function, converting the underlying Markov chain to a semi-
Markov chain. Russell and Moore [95] and Levinson [96] extended this work by modeling the state
duration with Poisson and gamma distributions, respectively. Later in the literature, these models
became known as hidden semi-Markov models (HSMMs).
In our modeling, we have come across circumstances where the explicit duration modeling in
the HSMM would seem to have some benefit. However, as with traditional HMMs, the standard
training methods are off-line, batch methods ill-suited for running on our robot. Based on our
experience with online learning described in Chapter 2, we have derived a version of recursive
maximum-likelihood estimation (RMLE) for the HSMM. While we have not yet implemented this
algorithm, this derivation may prove useful to future researchers.
In the following two sections, we will describe the mathematical model for the HSMM, and then
give a derivation of the RMLE for this model. The setup and derivation for this model is very
similar to the setup and derivation of the RMLE for the hidden Markov model in Chapter 2.
D.2 HSMM Model Description and Notation
An HSMM is a discrete-time stochastic process with three components, {Xn′, Yn′, Tn′}, defined on probability space (Ω, F, P). Let {Xn′}∞n′=1 be a discrete-time first-order semi-Markov chain with state space R = {1, . . . , r}, r a fixed known constant. As in an
HMM, the transition probabilities of the Markov chain in an HSMM are given by
aij = P (Xn′ = j|Xn′−1 = i) (D.1)
for i, j = 1, . . . , r, with the additional constraint that aii = P(Xn′ = i | Xn′−1 = i) = 0. Let A = {aij}. Then A belongs to the set of all r × r stochastic matrices (i.e., aij ≥ 0, Σj aij = 1) whose diagonal entries are zero.
Let {Tn′}∞n′=1 be a sequence of discrete durations corresponding to {Xn′}. The process {Tn′} is a probabilistic function of {Xn′}, and the corresponding conditional density of Tn′ can be described by a parametric family of densities {d(·;λ) : λ ∈ Λ}, where the density parameter λ is a function of Xn′, and Λ is the set of valid parameters for the conditional density assumed by the model. The conditional density of Tn′ given Xn′ = j can be written d(·;λj), or more simply dj(·).
Example D.1. (Gamma duration density): Suppose the durations for each state in an HMM are approximately1 distributed according to a gamma distribution. Then the parameter set is Λ = {(ν, η) ∈ R+ × R+}, λj ∈ Λ, and {Tn′} = {τn′} is a sequence of discrete-valued conditionally independent state durations on R+, with probability distribution

d(\tau_{n'}; \lambda_j) = d(\tau_{n'}; \nu_j, \eta_j) = \frac{\eta_j^{\nu_j}}{\Gamma(\nu_j)}\, \tau_{n'}^{\nu_j - 1}\, e^{-\eta_j \tau_{n'}}    (D.2)

for Xn′ = j. Here, the mean value of τn′ is νj/ηj, and the variance is νj/η²j.
1Since the durations are discrete, the correspondence will not be exact.
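The gamma duration density (D.2) is straightforward to evaluate; the following sketch uses math.lgamma for Γ(ν). Treating it as a continuous density and integrating numerically recovers unit mass and the stated mean ν/η, consistent with the example.

```python
import math

def gamma_duration(tau, nu, eta):
    """Gamma duration density (D.2):
    d(tau; nu, eta) = eta**nu / Gamma(nu) * tau**(nu - 1) * exp(-eta * tau)."""
    return (eta ** nu / math.exp(math.lgamma(nu))
            * tau ** (nu - 1) * math.exp(-eta * tau))
```

As the footnote warns, sampling this density only at integer durations does not sum exactly to one; a discrete application would renormalize.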
Example D.2. (Discrete duration density): Suppose durations Tn′ are drawn from a discrete set of times T = {1, . . . , T}. Then Λ = {(d1, . . . , dT) | ΣTτ=1 dτ = 1, dτ ≥ 0} is the set of length-T stochastic vectors, λj ∈ Λ, and {Tn′} = {τn′} is a sequence of discrete-valued conditionally independent state durations on T, each τn′ having probability

d(\tau_{n'}; \lambda_j) = d_{j\tau_{n'}}, \qquad 1 \le \tau_{n'} \le T,    (D.3)

for Xn′ = j.
As in a standard HMM, $\{X_{n'}\}$ is not visible in an HSMM, and the corresponding duration process $\{\bar{T}_{n'}\}$ is therefore unknown as well. The statistics of both are ascertained from a corresponding observable stochastic process. In an HSMM, state $X_{n'}$ produces an observation vector $\bar{Y}_{n'}$ of length $\bar{T}_{n'}$. The process $\{\bar{Y}_{n'}\}$ is therefore a probabilistic function of $\{X_{n'}\}$ and $\{\bar{T}_{n'}\}$, and the corresponding conditional density of $\bar{Y}_{n'}$ is assumed to belong to a parametric family of densities $\{b(\cdot \mid \tau; \theta) : \theta \in \Theta\}$, where $\tau$ is a sample from the duration process $\{\bar{T}_{n'}\}$, the density parameter $\theta$ is a function of $X_{n'}$, and $\Theta$ is the set of valid parameters for the particular conditional density assumed by the model. The conditional density of $\bar{Y}_{n'}$ given $X_{n'} = j$ and $\bar{T}_{n'} = \bar{\tau}_{n'}$ can be written $b(\cdot \mid \bar{\tau}_{n'}; \theta_j)$, or more simply $b_j(\cdot \mid \bar{\tau}_{n'})$. Outside of certain conditions enumerated later, the particular form of $b(\cdot \mid \bar{\tau}_{n'}; \theta_j)$ is irrelevant to our discussion.

Define the HSMM parameter space as $\Phi = \Pi \times \mathcal{A} \times \Lambda \times \Theta$. The model $\varphi \in \Phi$ is defined as

    \varphi = \{\pi_1, \ldots, \pi_r, a_{11}, a_{12}, \ldots, a_{rr}, \lambda_1, \ldots, \lambda_r, \theta_1, \ldots, \theta_r\}.    (D.4)
Example D.3. (Gamma duration densities with Gaussian observation densities): For the case of gamma duration densities with one-dimensional Gaussian observation distributions,

    \varphi = (\pi_1, \ldots, \pi_r, a_{11}, a_{12}, \ldots, a_{rr}, \nu_1, \eta_1, \ldots, \nu_r, \eta_r, \mu_1, \sigma_1, \ldots, \mu_r, \sigma_r).

As in our HMM in Chapter 3, let $p$ be the length of $\varphi$. Let $\varphi^* \in \Phi$ be the fixed set of "true" parameters of the model we are trying to estimate.
For a vector or matrix $v$, $v'$ represents its transpose. Define the $r$-dimensional column vector $b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi)$ and the $r \times r$ matrix $B(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi)$ by

    b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi) = [b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \theta_1(\varphi)), \ldots, b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \theta_r(\varphi))]'    (D.5)

and

    B(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi) = \mathrm{diag}[b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \theta_1(\varphi)), \ldots, b(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \theta_r(\varphi))].    (D.6)

Similarly, for the duration densities, define the $r$-dimensional column vector $d(\bar{\tau}_{n'}; \varphi)$ and the $r \times r$ matrix $D(\bar{\tau}_{n'}; \varphi)$ by

    d(\bar{\tau}_{n'}; \varphi) = [d(\bar{\tau}_{n'}; \lambda_1(\varphi)), \ldots, d(\bar{\tau}_{n'}; \lambda_r(\varphi))]'    (D.7)

and

    D(\bar{\tau}_{n'}; \varphi) = \mathrm{diag}[d(\bar{\tau}_{n'}; \lambda_1(\varphi)), \ldots, d(\bar{\tau}_{n'}; \lambda_r(\varphi))].    (D.8)

For convenience of notation, we will define a third vector $g(\bar{y}_{n'}, \bar{\tau}_{n'}; \varphi)$ and matrix $G(\bar{y}_{n'}, \bar{\tau}_{n'}; \varphi)$ as

    g(\bar{y}_{n'}, \bar{\tau}_{n'}; \varphi) = B(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi) D(\bar{\tau}_{n'}; \varphi) 1_r
                                             = B(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi) d(\bar{\tau}_{n'}; \varphi)    (D.9)

and

    G(\bar{y}_{n'}, \bar{\tau}_{n'}; \varphi) = B(\bar{y}_{n'} \mid \bar{\tau}_{n'}; \varphi) D(\bar{\tau}_{n'}; \varphi).    (D.10)
Until now, we have described the model entirely using what we will call model time, where one time unit corresponds to the duration the model stays in a particular state. On the model time scale, time variables are marked with a prime ($'$), and sequence variables are marked with an overbar, as in $\bar{\tau}_{n'}$. We would like to relate this description to normal time, where each time unit represents one real unit of time.

For a given sequence of durations $\{\bar{\tau}_{n'}\}$, define the functions $t_0 : \mathbb{Z}^+ \to \mathbb{Z}^+$ and $t_1 : \mathbb{Z}^+ \to \mathbb{Z}^+$ by

    t_{0,\{\bar{\tau}_{n'}\}}(k') = \sum_{i=1}^{k'-1} \bar{\tau}_i + 1    (D.11)

    t_{1,\{\bar{\tau}_{n'}\}}(k') = \sum_{i=1}^{k'} \bar{\tau}_i.    (D.12)

These functions mark, respectively, the real-time beginning and end of the $k'$th state for duration sequence $\{\bar{\tau}_{n'}\}$.

Similarly, define a function $\xi : \mathbb{Z}^+ \to \mathbb{Z}^+$ as

    \xi_{\{\bar{\tau}_{n'}\}}(n) = k'  \text{ if }  t_{0,\{\bar{\tau}_{n'}\}}(k') \leq n \leq t_{1,\{\bar{\tau}_{n'}\}}(k').    (D.13)

This function returns the model time corresponding to normal time $n$. Together, these functions allow us to convert between the two time scales. We will often drop the explicit dependence on $\{\bar{\tau}_{n'}\}$ and simply write $t_0(n')$, $t_1(n')$, and $\xi(n)$.
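The time-scale bookkeeping of Equations (D.11)–(D.13) is easy to mistranscribe, so a small sketch may help. The function names below are our own, and both segment indices and time indices are 1-based, as in the text:

```python
def t0(durations, k):
    """Real-time index where the k'th state segment begins (Eq. D.11);
    durations is the model-time sequence (tau_1, tau_2, ...), k is 1-based."""
    return sum(durations[:k - 1]) + 1

def t1(durations, k):
    """Real-time index where the k'th state segment ends (Eq. D.12)."""
    return sum(durations[:k])

def xi(durations, n):
    """Model-time index k' of the segment covering real time n (Eq. D.13)."""
    for k in range(1, len(durations) + 1):
        if t0(durations, k) <= n <= t1(durations, k):
            return k
    raise ValueError("n lies beyond the given durations")

# Three states held for 2, 3, and 1 time steps, respectively:
# segment 1 covers real times 1-2, segment 2 covers 3-5, segment 3 covers 6.
durs = [2, 3, 1]
```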
Using these functions, we can define real-time analogs $\{X_n\}$ and $\{Y_n\}$ to $\{X_{n'}\}$ and $\{\bar{Y}_{n'}\}$, respectively. The process $\{X_n\}$ is related to $\{X_{n'}\}$ by $X_n = X_{\xi(n)}$. Random sample $\bar{Y}_{n'}$ can be written as $\bar{Y}_{n'} = \langle Y_{t_0(n')}, \ldots, Y_{t_1(n')} \rangle$.

For model $\varphi$, we would like to calculate the likelihood of a sequence of $n$ normal-time observations $\langle y_1, \ldots, y_n \rangle$. Since our model is defined in terms of $\{\bar{y}_{n'}\}$, we partition the sequence $\{y_n\}$ into $n' \leq n$ subsequences such that each subsequence corresponds to the output of a single state of the model, i.e.,

    \underbrace{y_1, \ldots, y_{t_1(1)}}_{\bar{y}_1},\ \underbrace{y_{t_0(2)}, \ldots, y_{t_1(2)}}_{\bar{y}_2},\ \ldots,\ \underbrace{y_{t_0(n')}, \ldots, y_n}_{\bar{y}_{n'}}.

For a given partition, the joint likelihood of the observation sequence and state durations is given by

    p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)
      = \pi(\varphi)' D(\bar{\tau}_1; \varphi) B(\bar{y}_1 \mid \bar{\tau}_1; \varphi) \prod_{k'=2}^{n'} A(\varphi) D(\bar{\tau}_{k'}; \varphi) B(\bar{y}_{k'} \mid \bar{\tau}_{k'}; \varphi)\, 1_r    (D.14)
      = \pi(\varphi)' G(\bar{y}_1, \bar{\tau}_1; \varphi) \prod_{k'=2}^{n'} A(\varphi) G(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)\, 1_r.    (D.15)

Averaging over all possible partitions, we can calculate $p_n(y_1, \ldots, y_n; \varphi)$ as

    p_n(y_1, \ldots, y_n; \varphi) = \sum_{n'=1}^{n} \sum_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} P(n')\, p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)    (D.16)

      = \sum_{n'=1}^{n} \sum_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} P(n')\, \pi(\varphi)' G(\bar{y}_1, \bar{\tau}_1; \varphi) \prod_{k'=2}^{n'} A(\varphi) G(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)\, 1_r,    (D.17)

where $P(n')$ is the probability that there are $n'$ subsequences in the partition.
D.3 RMLE for the HSMM
This derivation follows from the derivation of the RMLE for the standard hidden Markov model
presented in Section 2.3.1.
For the HSMM, define the prediction filter $u_{n'}(\varphi)$ as

    u_{n'}(\varphi) = [u_{n'1}(\varphi), \ldots, u_{n'r}(\varphi)]'    (D.18)

where

    u_{n'j}(\varphi) = P(X_{n'} = j \mid \bar{y}_1, \ldots, \bar{y}_{n'-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n')    (D.19)

is the probability of transitioning to state $j$ at (model) time $n'$ given all previous observations and a partition of those observations. For our derivation below, it will be useful to have a normal-time analog to $u_{n'}(\varphi)$. Let $u_n(\varphi)$ be

    u_n(\varphi) = [u_{n1}(\varphi), \ldots, u_{nr}(\varphi)]'    (D.20)

where

    u_{nj}(\varphi) = P(X_n = j \mid y_1, \ldots, y_{n-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n', \xi(n-1) = n'-1, \xi(n) = n').    (D.21)

For given $n'$ and $\{\bar{\tau}_{n'}\}$, $u_{n'}(\varphi) = u_{t_0(n')}(\varphi)$.
Using this filter, the likelihood $p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)$ can be written as

    p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)
      = \prod_{k'=1}^{n'} d(\bar{\tau}_{k'}; \varphi)' B(\bar{y}_{k'} \mid \bar{\tau}_{k'}; \varphi)\, u_{k'}(\varphi)
      = \prod_{k'=1}^{n'} g(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)'\, u_{k'}(\varphi).    (D.22)

(For this derivation, see Appendix E, Section E.2.) As above, the likelihood at normal time $n$ can be calculated by averaging over all partitions of $n$, as

    p_n(y_1, \ldots, y_n; \varphi) = \sum_{n'=1}^{n} \sum_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} P(n') \prod_{k'=1}^{n'} g(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)'\, u_{k'}(\varphi).    (D.23)
Our goal is to maximize this likelihood with respect to the parameter set $\varphi$, and in particular to find a recursive update. Unfortunately, there are a number of pragmatic problems with recursively maximizing Equation (D.23). In particular, we would like to calculate this likelihood recursively, so the summation over all partitions of $\langle y_1, \ldots, y_n \rangle$ is undesirable. To alleviate this problem, we will consider only the most likely partition of $\langle y_1, \ldots, y_n \rangle$. Rewrite Equation (D.23) as

    p_n(y_1, \ldots, y_n; \varphi) = \max_{n'=1,\ldots,n}\ \max_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} \prod_{k'=1}^{n'} g(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)'\, u_{k'}(\varphi).    (D.24)
Maximizing $p_n(y_1, \ldots, y_n; \varphi)$ is equivalent to maximizing its log-likelihood. For a given partition size $n'$ and sequence of durations $\{\bar{\tau}_{n'}\}$, define the normalized log-likelihood of the (model-time) observations $\langle \bar{y}_1, \ldots, \bar{y}_{n'} \rangle$ as

    \bar{\ell}_{n'}(\{\bar{\tau}_{n'}\}, \varphi) = \frac{1}{n'+1} \log p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)
      = \frac{1}{n'+1} \sum_{k'=1}^{n'} \log g(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)'\, u_{k'}(\varphi).    (D.25)

As in Equation (D.24), we can then write the log-likelihood of the real-time observations $\langle y_1, \ldots, y_n \rangle$ as

    \ell_n(\varphi) = \max_{n'=1,\ldots,n}\ \max_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} \bar{\ell}_{n'}(\{\bar{\tau}_{n'}\}, \varphi)
      = \max_{n'=1,\ldots,n}\ \max_{\substack{\bar{\tau}_1, \ldots, \bar{\tau}_{n'} \\ \sum_{i=1}^{n'} \bar{\tau}_i = n}} \frac{1}{n'+1} \sum_{k'=1}^{n'} \log[g(\bar{y}_{k'}, \bar{\tau}_{k'}; \varphi)'\, u_{k'}(\varphi)].    (D.26)
At any time $n$, the values of $n'$ and $\{\bar{\tau}_{n'}\}$ which maximize $\ell_n(\varphi)$ can be determined recursively, and can also be used in the recursive update of $u_n(\varphi)$. Let $n'^*_n$ be the number of segments which maximizes $\ell_n(\varphi)$, and let $\tau^*_n$ be the length of the last segment of $\{\bar{y}_{n'}\}$ which maximizes $\ell_n(\varphi)$. Given the sequence of log-likelihoods up to $\ell_{n-1}(\varphi)$, as well as the optimal state sequence lengths $n'^*_1$ through $n'^*_{n-1}$, we can maximize $\ell_n(\varphi)$ recursively with

    \tau^*_n = \arg\max_{\tau} \frac{1}{n'^*_{n-\tau} + 2} \Big( (n'^*_{n-\tau} + 1)\, \ell_{n-\tau}(\varphi) + \log\big[ g(\langle y_{n-\tau+1}, \ldots, y_n \rangle, \tau; \varphi)'\, u_{n-\tau+1}(\varphi) \big] \Big),    (D.27)

    n'^*_n = n'^*_{n-\tau^*_n} + 1,    (D.28)

and

    \ell_n(\varphi) = \max_{\tau} \frac{1}{n'^*_{n-\tau} + 2} \Big( (n'^*_{n-\tau} + 1)\, \ell_{n-\tau}(\varphi) + \log\big[ g(\langle y_{n-\tau+1}, \ldots, y_n \rangle, \tau; \varphi)'\, u_{n-\tau+1}(\varphi) \big] \Big),    (D.29)

with initialization $\tau^*_1 = 1$, $n'^*_1 = 1$, and $\ell_1(\varphi) = \frac{1}{2} \log[g(y_1, \tau^*_1; \varphi)'\, u_1(\varphi)]$.
As suggested above, we then use $\tau^*_n$ to recursively calculate $u_n(\varphi)$, using

    u_{n+1}(\varphi) = \frac{A(\varphi)'\, G(\langle y_{n-\tau^*_n+1}, \ldots, y_n \rangle, \tau^*_n; \varphi)\, u_{n-\tau^*_n+1}(\varphi)}{g(\langle y_{n-\tau^*_n+1}, \ldots, y_n \rangle, \tau^*_n; \varphi)'\, u_{n-\tau^*_n+1}(\varphi)}
      = \frac{A(\varphi)'\, G(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)\, u_{n-\tau^*_n+1}(\varphi)}{g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, u_{n-\tau^*_n+1}(\varphi)}.    (D.30)

As with the RMLE in the standard HMM, $u_n(\varphi)$ is initialized with $u_1(\varphi) = \pi(\varphi)$.
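One step of the filter update (D.30) can be sketched numerically. The text deliberately leaves the form of $b(\cdot \mid \tau; \theta)$ open; purely for illustration, the sketch below assumes a segment's frames are conditionally independent scalar Gaussians, and uses a discrete duration density as in Example D.2. All parameter values are arbitrary:

```python
import numpy as np

def segment_obs_likelihoods(seg, mu, sigma):
    """b_j(ybar | tau): per-state likelihood of an observation segment.
    Assumption for this sketch only: frames within a segment are
    conditionally independent scalar Gaussians."""
    seg = np.asarray(seg)
    per_frame = np.exp(-0.5 * ((seg[:, None] - mu) / sigma) ** 2) / (
        np.sqrt(2 * np.pi) * sigma)
    return per_frame.prod(axis=0)            # length-r vector

def hsmm_filter_update(u, A, b_vec, d_vec):
    """One application of Equation (D.30).  Since B and D are diagonal,
    G u reduces to the elementwise product g * u with g = b * d."""
    g = b_vec * d_vec                        # g(ybar, tau) = B D 1_r
    return (A.T @ (g * u)) / (g @ u)         # A' G u / (g' u)

r = 2
A = np.array([[0.0, 1.0], [1.0, 0.0]])       # a_ii = 0 in an HSMM
mu = np.array([0.0, 3.0]); sigma = np.array([1.0, 1.0])
D = np.array([[0.5, 0.3, 0.2],               # d_j(tau) for tau = 1..3
              [0.2, 0.3, 0.5]])

u = np.array([0.5, 0.5])                     # u_1 = pi
seg = [2.9, 3.1, 3.0]                        # length-3 segment near state 2's mean
tau = len(seg)
b_vec = segment_obs_likelihoods(seg, mu, sigma)
u_next = hsmm_filter_update(u, A, b_vec, D[:, tau - 1])
```

After the segment, the filter places almost all mass on state 1: with $a_{ii} = 0$, the chain must leave the state that just emitted the segment.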
Let $w^{(l)}_n(\varphi) = (\partial/\partial\varphi_l)\, u_n(\varphi)$ be the partial derivative of $u_n(\varphi)$ with respect to the $l$th component of $\varphi$. Each $w^{(l)}_n(\varphi)$ is an $r$-length column vector, and

    w_n(\varphi) = (w^{(1)}_n(\varphi), w^{(2)}_n(\varphi), \ldots, w^{(p)}_n(\varphi))    (D.31)

is an $r \times p$ matrix. Taking the derivative of $u_n(\varphi)$ from Equation (D.30), we get

    w^{(l)}_{n+1}(\varphi) = \frac{\partial u_{n+1}(\varphi)}{\partial \varphi_l} = R_1(\bar{y}_{\xi(n)}, \tau^*_n, \varphi)\, w^{(l)}_{n-\tau^*_n+1}(\varphi) + R^{(l)}_2(\bar{y}_{\xi(n)}, \tau^*_n, \varphi)    (D.32)

with

    R_1(\bar{y}_{n'}, \tau, \varphi) = A(\varphi)' \left[ I - \frac{G(\bar{y}_{n'}, \tau; \varphi)\, u_n(\varphi)\, 1_r'}{g(\bar{y}_{n'}, \tau; \varphi)'\, u_n(\varphi)} \right] \frac{G(\bar{y}_{n'}, \tau; \varphi)}{g(\bar{y}_{n'}, \tau; \varphi)'\, u_n(\varphi)}    (D.33)

    R^{(l)}_2(\bar{y}_{n'}, \tau, \varphi) = A(\varphi)' \left[ I - \frac{G(\bar{y}_{n'}, \tau; \varphi)\, u_n(\varphi)\, 1_r'}{g(\bar{y}_{n'}, \tau; \varphi)'\, u_n(\varphi)} \right] \frac{[\partial G(\bar{y}_{n'}, \tau; \varphi)/\partial\varphi_l]\, u_n(\varphi)}{g(\bar{y}_{n'}, \tau; \varphi)'\, u_n(\varphi)} + \frac{[\partial A(\varphi)'/\partial\varphi_l]\, G(\bar{y}_{n'}, \tau; \varphi)\, u_n(\varphi)}{g(\bar{y}_{n'}, \tau; \varphi)'\, u_n(\varphi)}    (D.34)

where

    \frac{\partial}{\partial\varphi_l} G(\bar{y}_{n'}, \tau; \varphi) = \frac{\partial}{\partial\varphi_l} [B(\bar{y}_{n'} \mid \tau; \varphi)\, D(\tau; \varphi)]
      = \frac{\partial B(\bar{y}_{n'} \mid \tau; \varphi)}{\partial\varphi_l}\, D(\tau; \varphi) + B(\bar{y}_{n'} \mid \tau; \varphi)\, \frac{\partial D(\tau; \varphi)}{\partial\varphi_l}.    (D.35)

Using these equations, we can recursively calculate $w_n(\varphi)$ at every iteration.
To estimate the set of optimal parameters $\varphi^*$, we want to find the maximum of $\ell_n(\varphi)$ with respect to $\varphi$, which we will attempt via recursive stochastic approximation. For each parameter $l$ in $\varphi$, at each time $n$, we take $(\partial/\partial\varphi_l)$ of the most recent term inside the summation in Equation (D.26) to form an "incremental score vector"

    S(\mathcal{Y}_n; \varphi) = \big( S^{(1)}(\mathcal{Y}_n; \varphi), \ldots, S^{(p)}(\mathcal{Y}_n; \varphi) \big)'    (D.36)

with

    S^{(l)}(\mathcal{Y}_n; \varphi) = \frac{\partial}{\partial\varphi_l} \log[g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, u_n(\varphi)]
      = \frac{g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, [(\partial/\partial\varphi_l)\, u_n(\varphi)] + [(\partial/\partial\varphi_l)\, g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)]'\, u_n(\varphi)}{g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, u_n(\varphi)}
      = \frac{g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, w^{(l)}_n(\varphi) + [(\partial/\partial\varphi_l)\, g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)]'\, u_n(\varphi)}{g(\bar{y}_{\xi(n)}, \tau^*_n; \varphi)'\, u_n(\varphi)}    (D.37)

where

    \frac{\partial}{\partial\varphi_l} g(\bar{y}_{n'}, \tau; \varphi) = \frac{\partial}{\partial\varphi_l} [B(\bar{y}_{n'} \mid \tau; \varphi)\, d(\tau; \varphi)]
      = \frac{\partial B(\bar{y}_{n'} \mid \tau; \varphi)}{\partial\varphi_l}\, d(\tau; \varphi) + B(\bar{y}_{n'} \mid \tau; \varphi)\, \frac{\partial d(\tau; \varphi)}{\partial\varphi_l}    (D.38)

and

    \mathcal{Y}_n \triangleq (Y_n, T_n, u_n(\varphi), w_n(\varphi)),    (D.39)

where $T_n = \tau^*_n$.
As before, the RMLE algorithm takes the form

    \varphi_{n+1} = \Pi_G \big( \varphi_n + \varepsilon_n S(\mathcal{Y}_n; \varphi_n) \big)    (D.40)

where $\{\varepsilon_n\}$ is a sequence of step sizes satisfying $\varepsilon_n \geq 0$, $\varepsilon_n \to 0$, and $\sum_n \varepsilon_n = \infty$, $G$ is a compact and convex set, and $\Pi_G$ is a projection onto the set $G$.

Equations (D.34) and (D.37) can both be simplified for each type of parameter in $\varphi$. For the HMM, we have completed this simplification for the model parameters of different assumed observation densities. In the HSMM, this simplification must in particular be done for the parameters of the chosen duration density $d(\tau; \varphi)$ and observation density $b(y; \varphi)$.
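A single RMLE iteration (D.40) is a projected stochastic-gradient step. The sketch below uses a coordinate box for the compact, convex set $G$ (the text does not prescribe a particular $G$), with hypothetical parameter and score values:

```python
import numpy as np

def rmle_step(phi, score, eps, lower, upper):
    """One RMLE update (Equation D.40): an ascent step along the
    incremental score, projected back onto G.  Here G is a box
    [lower, upper]^p purely for illustration, so the projection
    Pi_G is simple elementwise clipping."""
    return np.clip(phi + eps * score, lower, upper)

# Hypothetical 3-parameter model and incremental score vector.
phi = np.array([0.4, 0.6, 1.5])
score = np.array([2.0, -1.0, 10.0])

# Step sizes such as eps_n = 1/n satisfy eps_n >= 0, eps_n -> 0,
# and sum eps_n = infinity; here we take a single step with eps = 0.1.
phi_1 = rmle_step(phi, score, 0.1, 0.0, 2.0)
```

The third parameter would move to 2.5 but is projected back to the boundary of $G$.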
APPENDIX E
RMLE DERIVATIONS
E.1 Proof that $p_n(y_1, \ldots, y_n; \varphi) = \prod_{k=1}^{n} b(y_k; \varphi)'\, u_k(\varphi)$
In Section 2.3.1, we state that $p_n(y_1, \ldots, y_n; \varphi)$ is equivalent to $\prod_{k=1}^{n} b(y_k; \varphi)'\, u_k(\varphi)$. We have

    p_n(y_1, \ldots, y_n) = p(y_1)\, p(y_2 \mid y_1)\, p(y_3 \mid y_2, y_1) \cdots p(y_n \mid y_1, \ldots, y_{n-1})

      = \sum_j p(y_1, x_1 = j) \sum_j p(y_2, x_2 = j \mid y_1) \sum_j p(y_3, x_3 = j \mid y_1, y_2) \cdots \sum_j p(y_n, x_n = j \mid y_1, \ldots, y_{n-1})

      = \sum_j p(y_1 \mid x_1 = j)\, P(x_1 = j) \sum_j p(y_2 \mid x_2 = j, y_1)\, P(x_2 = j \mid y_1) \cdots \sum_j p(y_n \mid x_n = j, y_1, \ldots, y_{n-1})\, P(x_n = j \mid y_1, \ldots, y_{n-1})

      = \sum_j p(y_1 \mid x_1 = j)\, P(x_1 = j) \sum_j p(y_2 \mid x_2 = j)\, P(x_2 = j \mid y_1) \cdots \sum_j p(y_n \mid x_n = j)\, P(x_n = j \mid y_1, \ldots, y_{n-1})

      = \sum_j b_j(y_1)\, u_{1j} \sum_j b_j(y_2)\, u_{2j} \sum_j b_j(y_3)\, u_{3j} \cdots \sum_j b_j(y_n)\, u_{nj}

      = (b(y_1)' u_1) \cdot (b(y_2)' u_2) \cdots (b(y_n)' u_n)

      = \prod_k b(y_k)' u_k,    (E.1)

where the fourth line follows from the fact that each observation depends only on the current state in an HMM.
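The identity (E.1) can be checked numerically on a small HMM by comparing a brute-force sum of the joint likelihood over all state sequences against the product of one-step predictive likelihoods $b(y_k)' u_k$, with $u$ updated recursively as in Equation (E.4). The model values below are arbitrary:

```python
import itertools
import numpy as np

# A tiny 2-state, 2-symbol HMM, chosen arbitrarily for the check.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # B[j, v] = b_j(v)
y = [0, 1, 1, 0]

# Brute force: sum the joint p(y, x) over all 2^4 state sequences.
p_brute = 0.0
for x in itertools.product(range(2), repeat=len(y)):
    p = pi[x[0]] * B[x[0], y[0]]
    for k in range(1, len(y)):
        p *= A[x[k - 1], x[k]] * B[x[k], y[k]]
    p_brute += p

# Filter factorization: p_n = prod_k b(y_k)' u_k, with u_1 = pi and
# u_{k+1} = A' diag(b(y_k)) u_k / (b(y_k)' u_k), cf. Equation (E.4).
u = pi.copy()
p_filter = 1.0
for obs in y:
    b = B[:, obs]
    p_filter *= b @ u
    u = A.T @ (b * u) / (b @ u)
```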
E.2 Proof that $p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi) = \prod_{k'=1}^{n'} d(\bar{\tau}_{k'})'\, B(\bar{y}_{k'} \mid \bar{\tau}_{k'})\, u_{k'}$
In Section D.3, we make a similar assertion regarding the HSMM, that $p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}; \varphi)$ is equivalent to $\prod_{k'=1}^{n'} d(\bar{\tau}_{k'}; \varphi)'\, B(\bar{y}_{k'} \mid \bar{\tau}_{k'}; \varphi)\, u_{k'}(\varphi)$. We have

    p_{n'}(\bar{y}_1, \ldots, \bar{y}_{n'}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'}) = p(\bar{y}_1, \bar{\tau}_1 \mid n')\, p(\bar{y}_2, \bar{\tau}_2 \mid \bar{y}_1, \bar{\tau}_1, n') \cdots p(\bar{y}_{n'}, \bar{\tau}_{n'} \mid \bar{y}_1, \ldots, \bar{y}_{n'-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n')

      = \sum_j p(\bar{y}_1, \bar{\tau}_1, x_1 = j \mid n') \sum_j p(\bar{y}_2, \bar{\tau}_2, x_2 = j \mid \bar{y}_1, \bar{\tau}_1, n') \cdots \sum_j p(\bar{y}_{n'}, \bar{\tau}_{n'}, x_{n'} = j \mid \bar{y}_1, \ldots, \bar{y}_{n'-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n')

      = \sum_j p(\bar{y}_1, \bar{\tau}_1 \mid x_1 = j)\, P(x_1 = j \mid n') \sum_j p(\bar{y}_2, \bar{\tau}_2 \mid x_2 = j)\, P(x_2 = j \mid \bar{y}_1, \bar{\tau}_1, n') \cdots \sum_j p(\bar{y}_{n'}, \bar{\tau}_{n'} \mid x_{n'} = j)\, P(x_{n'} = j \mid \bar{y}_1, \ldots, \bar{y}_{n'-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n')

      = \sum_j p(\bar{y}_1 \mid \bar{\tau}_1, x_1 = j)\, p(\bar{\tau}_1 \mid x_1 = j)\, P(x_1 = j \mid n') \sum_j p(\bar{y}_2 \mid \bar{\tau}_2, x_2 = j)\, p(\bar{\tau}_2 \mid x_2 = j)\, P(x_2 = j \mid \bar{y}_1, \bar{\tau}_1, n') \cdots \sum_j p(\bar{y}_{n'} \mid \bar{\tau}_{n'}, x_{n'} = j)\, p(\bar{\tau}_{n'} \mid x_{n'} = j)\, P(x_{n'} = j \mid \bar{y}_1, \ldots, \bar{y}_{n'-1}, \bar{\tau}_1, \ldots, \bar{\tau}_{n'-1}, n')

      = \sum_j b_j(\bar{y}_1 \mid \bar{\tau}_1)\, d_j(\bar{\tau}_1)\, u_{1j} \sum_j b_j(\bar{y}_2 \mid \bar{\tau}_2)\, d_j(\bar{\tau}_2)\, u_{2j} \cdots \sum_j b_j(\bar{y}_{n'} \mid \bar{\tau}_{n'})\, d_j(\bar{\tau}_{n'})\, u_{n'j}

      = \prod_{k'=1}^{n'} 1_r'\, B(\bar{y}_{k'} \mid \bar{\tau}_{k'})\, D(\bar{\tau}_{k'})\, u_{k'}

      = \prod_{k'=1}^{n'} b(\bar{y}_{k'} \mid \bar{\tau}_{k'})'\, D(\bar{\tau}_{k'})\, u_{k'}

      = \prod_{k'=1}^{n'} d(\bar{\tau}_{k'})'\, B(\bar{y}_{k'} \mid \bar{\tau}_{k'})\, u_{k'},    (E.2)

where the third line follows from the assumption that each observation and each duration depend only on the current state in an HSMM. The definitions of $b(\cdot)$, $B(\cdot)$, $d(\cdot)$, $D(\cdot)$, and $u_{n'}$ come from Sections D.2 and D.3.
E.3 Specialized RMLE Formulas
Section 2.3 of Chapter 2 gives the basic derivation of the RMLE algorithm. For completeness, we
restate the generalized parameter estimation formulas here, followed by their specialization for each
parameter type in particular HMMs.
Remember that the log-likelihood is defined as

    \ell_n(\varphi) = \frac{1}{n+1} \sum_{k=1}^{n} \log[b(y_k; \varphi)'\, u_k(\varphi)],    (E.3)

where $b(y_n; \varphi)$ is the observation likelihood vector for observation $y_n$, and $u_n(\varphi) = [u_{n1}(\varphi), \ldots, u_{nr}(\varphi)]'$ is the vector of prior state probabilities at time $n$, with $u_{ni}(\varphi) = P(x_n = i \mid y_1, \ldots, y_{n-1})$. This vector can be calculated recursively using

    u_{n+1}(\varphi) = \frac{A(\varphi)'\, B(y_n; \varphi)\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}.    (E.4)
Taking the derivative of the last term in the summation of Equation (E.3) with respect to each $\varphi_l$, we get

    S^{(l)}(\mathcal{Y}_n; \varphi) = \frac{b(y_n; \varphi)'\, w^{(l)}_n(\varphi) + [(\partial/\partial\varphi_l)\, b(y_n; \varphi)]'\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)},    (E.5)

where $\mathcal{Y}_n = (Y_n, u_n(\varphi), w_n(\varphi))$ and $w^{(l)}_n(\varphi) = (\partial/\partial\varphi_l)\, u_n(\varphi)$. The value of $w^{(l)}_n(\varphi)$ can be calculated recursively using

    w^{(l)}_{n+1}(\varphi) = \frac{\partial u_{n+1}(\varphi)}{\partial\varphi_l} = R_1(y_n, \varphi)\, w^{(l)}_n(\varphi) + R^{(l)}_2(y_n, \varphi),    (E.6)

where

    R_1(y_n, u_n(\varphi), \varphi) = A(\varphi)' \left[ I - \frac{B(y_n; \varphi)\, u_n(\varphi)\, 1_r'}{b(y_n; \varphi)'\, u_n(\varphi)} \right] \frac{B(y_n; \varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}    (E.7)

    R^{(l)}_2(y_n, u_n(\varphi), \varphi) = A(\varphi)' \left[ I - \frac{B(y_n; \varphi)\, u_n(\varphi)\, 1_r'}{b(y_n; \varphi)'\, u_n(\varphi)} \right] \frac{[\partial B(y_n; \varphi)/\partial\varphi_l]\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)} + \frac{[\partial A(\varphi)'/\partial\varphi_l]\, B(y_n; \varphi)\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}.    (E.8)

A version of both $S^{(l)}(\mathcal{Y}_n; \varphi)$ and $R^{(l)}_2(y_n, u_n(\varphi), \varphi)$ must be derived separately for each type of parameter in $\ell_n(\varphi)$.
E.3.1 Transition probabilities

For transition probabilities $\varphi_l = a_{ij}(\varphi)$, $\partial B(y_n; \varphi)/\partial\varphi_l$ is zero. Abusing notation slightly, let $l = a_{ij}$ refer to parameter $a_{ij}$ in HMM $\varphi$. Then

    S^{(a_{ij})}(\mathcal{Y}_n; \varphi) = \frac{b(y_n; \varphi)'\, w^{(a_{ij})}_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}    (E.9)

and

    R^{(a_{ij})}_2 = \frac{[\partial A(\varphi)'/\partial\varphi_{a_{ij}}]\, B(y_n; \varphi)\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)},    (E.10)

where $\partial A(\varphi)/\partial\varphi_{a_{ij}}$ is a matrix with a 1 at position $(i, j)$ and zeros elsewhere.
E.3.2 Discrete observation probabilities

For observations drawn from a finite discrete set $V = \{v_1, \ldots, v_s\}$, let $\varphi_l = b_{jk}(\varphi)$ and, as above, let $l = b_{jk}$. Then

    S^{(b_{jk})}(\mathcal{Y}_n; \varphi) =
      \begin{cases}
        \dfrac{b(y_n; \varphi)'\, w^{(b_{jk})}_n(\varphi) + [(\partial/\partial\varphi_{b_{jk}})\, b(y_n; \varphi)]'\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)} & \text{if } y_n = v_k \\[2ex]
        \dfrac{b(y_n; \varphi)'\, w^{(b_{jk})}_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)} & \text{if } y_n \neq v_k
      \end{cases}    (E.11)

and

    R^{(b_{jk})}_2(y_n, u_n(\varphi), \varphi) =
      \begin{cases}
        A(\varphi)' \left[ I - \dfrac{B(y_n; \varphi)\, u_n(\varphi)\, 1_r'}{b(y_n; \varphi)'\, u_n(\varphi)} \right] \dfrac{[\partial B(y_n; \varphi)/\partial\varphi_l]\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)} & \text{if } y_n = v_k \\[2ex]
        0 & \text{if } y_n \neq v_k.
      \end{cases}    (E.12)

Note that even when $y_n \neq v_k$, $R_1(y_n, u_n(\varphi), \varphi)$, and therefore $w^{(l)}_n(\varphi)$ and $S^{(l)}(\mathcal{Y}_n; \varphi)$, are nonzero.
E.3.3 Gaussian observation likelihoods

For the case of continuous observation likelihood pdfs,

    S^{(l)}(\mathcal{Y}_n; \varphi) = \frac{b(y_n; \varphi)'\, w^{(l)}_n(\varphi) + [(\partial/\partial\varphi_l)\, b(y_n; \varphi)]'\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}    (E.13)

and

    R^{(l)}_2(y_n, u_n(\varphi), \varphi) = A(\varphi)' \left[ I - \frac{B(y_n; \varphi)\, u_n(\varphi)\, 1_r'}{b(y_n; \varphi)'\, u_n(\varphi)} \right] \frac{[\partial B(y_n; \varphi)/\partial\varphi_l]\, u_n(\varphi)}{b(y_n; \varphi)'\, u_n(\varphi)}.    (E.14)

Here we assume that the observation likelihoods are given by a multidimensional Gaussian function with dimension $d$, mean vector $\mu$, and covariance matrix $\Sigma$. This likelihood is defined as

    b(y; \theta) = \mathcal{N}(y; \mu(\theta), \Sigma(\theta)) = \frac{1}{(2\pi)^{d/2} |\Sigma(\theta)|^{1/2}} \exp\left[ -\frac{1}{2} (y - \mu(\theta))'\, \Sigma(\theta)^{-1}\, (y - \mu(\theta)) \right],    (E.15)

where $|\Sigma|$ indicates the determinant of $\Sigma$, and $y'$ indicates the transpose of vector $y$. In the formulation here and the derivation below, $a$, $b$, $y$, and $\mu$ are all column vectors, $\Sigma$ is the covariance matrix, and $X$ is a square matrix. For convenience of notation, we will drop the explicit dependence on $\theta$. We assume real values for all calculations. We need to take the derivative of $\mathcal{N}(y; \mu, \Sigma)$ with respect to the elements of the mean vector $\mu$ and covariance matrix $\Sigma$.
E.3.3.1 Mean vector µ

For $\mu$, we can compute all elements at once by taking the vector derivative (gradient), as

    \frac{\partial}{\partial\mu} \mathcal{N}(y; \mu, \Sigma) = \frac{\partial}{\partial\mu} \left( \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \right)

      = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \times \frac{\partial}{\partial\mu} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right]

      = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \times \frac{\partial}{\partial\mu} \left( -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right)

      = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \times \left( \frac{1}{2} \right) \left( \Sigma^{-1} + \Sigma^{-T} \right) (y-\mu)

      = \mathcal{N}(y; \mu, \Sigma) \left( \Sigma^{-1} (y-\mu) \right),    (E.16)

where in step 4 we used the identity $\frac{\partial}{\partial a}(a' X a) = (X + X')\, a$ [131], applied with $a = y - \mu$ and $X = \Sigma^{-1}$ (the chain rule through $y - \mu$ cancels the factor $-\frac{1}{2}$ and the inner minus sign), and in step 5 we used the fact that $\Sigma$ (and therefore $\Sigma^{-1}$) is symmetric.
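Equation (E.16) is easy to verify by finite differences. The sketch below compares the analytic gradient with central differences at an arbitrary test point:

```python
import numpy as np

def gaussian(y, mu, Sigma):
    """Multivariate Gaussian density, Equation (E.15)."""
    d = len(mu)
    diff = y - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / (
        (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

# Arbitrary 2-d test point and parameters.
y = np.array([0.5, -1.0])
mu = np.array([0.2, 0.3])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

# Analytic gradient from Equation (E.16): N(y; mu, Sigma) Sigma^{-1}(y - mu).
grad = gaussian(y, mu, Sigma) * np.linalg.solve(Sigma, y - mu)

# Central finite differences, component by component.
h = 1e-6
fd = np.zeros(2)
for i in range(2):
    e = np.zeros(2); e[i] = h
    fd[i] = (gaussian(y, mu + e, Sigma) - gaussian(y, mu - e, Sigma)) / (2 * h)
```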
E.3.3.2 Covariance matrix Σ

To take the derivative with respect to $\Sigma$, let $g_1(\Sigma) = \left( (2\pi)^{d/2} |\Sigma|^{1/2} \right)^{-1}$ and $g_2(\Sigma) = \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right]$. We then take

    \frac{\partial}{\partial\Sigma} \mathcal{N}(y; \mu, \Sigma) = \frac{\partial}{\partial\Sigma} \left( \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \right)
      = \frac{\partial}{\partial\Sigma} (g_1(\Sigma)\, g_2(\Sigma))
      = g_1(\Sigma)\, \frac{\partial}{\partial\Sigma} g_2(\Sigma) + g_2(\Sigma)\, \frac{\partial}{\partial\Sigma} g_1(\Sigma).    (E.17)

Taking the derivative of $g_1(\Sigma)$ with respect to $\Sigma$,

    \frac{\partial}{\partial\Sigma} g_1(\Sigma) = (2\pi)^{-d/2} \frac{\partial}{\partial\Sigma} |\Sigma|^{-1/2}
      = (2\pi)^{-d/2} \left( -\frac{1}{2} \right) |\Sigma|^{-3/2} \frac{\partial}{\partial\Sigma} |\Sigma|
      = (2\pi)^{-d/2} \left( -\frac{1}{2} \right) |\Sigma|^{-3/2} \times |\Sigma| \left( 2\Sigma^{-1} - \mathrm{diag}(\Sigma^{-1}) \right)
      = -\frac{1}{2} \left( (2\pi)^{d/2} |\Sigma|^{1/2} \right)^{-1} \left( 2\Sigma^{-1} - \mathrm{diag}(\Sigma^{-1}) \right),    (E.18)

where in line 3 we used the identity $\frac{\partial}{\partial X} |X| = |X| \left( 2X^{-1} - \mathrm{diag}(X^{-1}) \right)$ when $X$ is symmetric (see Section F.2), and $\mathrm{diag}(X)$ denotes a square matrix containing the main diagonal of $X$, with zeros off the diagonal.

For $g_2(\Sigma)$,

    \frac{\partial}{\partial\Sigma} g_2(\Sigma) = \frac{\partial}{\partial\Sigma} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right]
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \frac{\partial}{\partial\Sigma} \left( -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right)
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \left( \frac{1}{2} \right) \left( 2\Upsilon - \mathrm{diag}(\Upsilon) \right),    (E.19)

where

    \Upsilon = \Sigma^{-1} (y-\mu)(y-\mu)'\, \Sigma^{-1}.    (E.20)

In the last line, we used the identity $\frac{\partial}{\partial X}\, a' X^{-1} a = -2X^{-1}aa'X^{-1} + \mathrm{diag}(X^{-1}aa'X^{-1})$ when $X$ is symmetric [131, 132] (for this derivation, see also Section F.3).
Substituting Equations (E.18) and (E.19) into Equation (E.17), we get

    \frac{\partial}{\partial\Sigma} \mathcal{N}(y; \mu, \Sigma) = \frac{1}{2} \left( (2\pi)^{d/2} |\Sigma|^{1/2} \right)^{-1} \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \times \left( (2\Upsilon - \mathrm{diag}(\Upsilon)) - (2\Sigma^{-1} - \mathrm{diag}(\Sigma^{-1})) \right)
      = \frac{1}{2}\, \mathcal{N}(y; \mu, \Sigma) \left( 2\bar{\Upsilon} - \mathrm{diag}(\bar{\Upsilon}) \right),    (E.21)

where $\bar{\Upsilon}$ is defined as

    \bar{\Upsilon} = \Upsilon - \Sigma^{-1} = \Sigma^{-1} (y-\mu)(y-\mu)'\, \Sigma^{-1} - \Sigma^{-1}.    (E.22)
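Equation (E.21) uses the symmetric-matrix derivative convention, under which $\sigma_{ij} = \sigma_{ji}$ is a single free variable; a finite-difference check must therefore perturb both off-diagonal entries together. Test values below are arbitrary:

```python
import numpy as np

def gaussian(y, mu, Sigma):
    """Multivariate Gaussian density, Equation (E.15)."""
    d = len(mu)
    diff = y - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / (
        (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

y = np.array([0.5, -1.0])
mu = np.array([0.2, 0.3])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

# Analytic gradient, Equations (E.21)-(E.22).
Si = np.linalg.inv(Sigma)
diff = (y - mu)[:, None]
Ups = Si @ diff @ diff.T @ Si              # Upsilon, Equation (E.20)
Ubar = Ups - Si                            # Upsilon-bar, Equation (E.22)
grad = 0.5 * gaussian(y, mu, Sigma) * (2 * Ubar - np.diag(np.diag(Ubar)))

# Central differences: off-diagonal perturbations touch (i,j) and (j,i)
# at once, since they are one variable of a symmetric matrix.
h = 1e-6
fd = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2))
        E[i, j] = h; E[j, i] = h           # sets the diagonal once when i == j
        fd[i, j] = (gaussian(y, mu, Sigma + E)
                    - gaussian(y, mu, Sigma - E)) / (2 * h)
```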
E.3.3.3 Upper triangular matrix R, for R′R = Σ

A particularly useful form in which to store the covariance information is the upper triangular matrix $R$ of the Cholesky decomposition of $\Sigma$, i.e., $R'R = \Sigma$. This form is convenient for two reasons. First, it is an intermediate form when taking the inverse of a symmetric matrix, i.e., we can write $\Sigma^{-1} = R^{-1}R^{-T}$, where $R^{-T}$ is the inverse of the transpose of $R$. Second, $|\Sigma|^{1/2}$, the square root of the determinant of $\Sigma$, is equal to the product of the diagonal elements of $R$.

Define a modified Gaussian function $\mathcal{N}(y; \mu(\theta), R(\theta))$ as

    \mathcal{N}(y; \mu(\theta), R(\theta)) = \frac{1}{(2\pi)^{d/2} |R(\theta)'R(\theta)|^{1/2}} \exp\left[ -\frac{1}{2} (y - \mu(\theta))'\, R(\theta)^{-1} R(\theta)^{-T}\, (y - \mu(\theta)) \right].    (E.23)

For convenience of notation, we will drop the explicit dependency on $\theta$. The derivative of $\mathcal{N}(y; \mu, R)$ with respect to $\mu$ is the same as before. The derivative with respect to the matrix $R$ is similar to the derivative of $\mathcal{N}(y; \mu, \Sigma)$ with respect to $\Sigma$. As before, let

    h_1(R) = \left( (2\pi)^{d/2} |R'R|^{1/2} \right)^{-1}    (E.24)

and

    h_2(R) = \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right].    (E.25)
We then take

    \frac{\partial}{\partial R} \mathcal{N}(y; \mu, R) = \frac{\partial}{\partial R} \left( \frac{1}{(2\pi)^{d/2} |R'R|^{1/2}} \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right] \right)
      = \frac{\partial}{\partial R} (h_1(R)\, h_2(R))
      = h_1(R)\, \frac{\partial}{\partial R} h_2(R) + h_2(R)\, \frac{\partial}{\partial R} h_1(R).    (E.26)

Taking the derivative of $h_1(R)$ with respect to $R$,

    \frac{\partial}{\partial R} h_1(R) = (2\pi)^{-d/2} \frac{\partial}{\partial R} |R'R|^{-1/2}
      = (2\pi)^{-d/2} \left( -\frac{1}{2} \right) |R'R|^{-3/2} \frac{\partial}{\partial R} |R'R|
      = (2\pi)^{-d/2} \left( -\frac{1}{2} \right) |R'R|^{-3/2} \times 2|R'R|\, R\, (R'R)^{-1}
      = -(2\pi)^{-d/2} |R'R|^{-1/2} \times R\, (R'R)^{-1}
      = -\left( (2\pi)^{d/2} |R'R|^{1/2} \right)^{-1} \times R\Sigma^{-1},    (E.27)

where in line 3 we used the identity $\frac{\partial}{\partial X} |X'X| = 2|X'X|\, X\, (X'X)^{-1}$ for real, nonsingular $X$ [131].
For $h_2(R)$,

    \frac{\partial}{\partial R} h_2(R) = \frac{\partial}{\partial R} \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right]
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right] \frac{\partial}{\partial R} \left( -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right)
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right] \left[ R\, R^{-1}R^{-T} (y-\mu)(y-\mu)'\, R^{-1}R^{-T} \right]
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] \left[ R\, \Sigma^{-1} (y-\mu)(y-\mu)'\, \Sigma^{-1} \right]
      = \exp\left[ -\frac{1}{2} (y-\mu)'\, \Sigma^{-1}\, (y-\mu) \right] R\Upsilon,    (E.28)

where, as before,

    \Upsilon = \Sigma^{-1} (y-\mu)(y-\mu)'\, \Sigma^{-1}.    (E.29)

In line 3, we use the identity $\frac{\partial}{\partial X} (a^T X^{-1}X^{-T} a) = -2X\, X^{-1}X^{-T} a a^T X^{-1}X^{-T}$, for which the derivation appears in Section F.4.
Substituting Equations (E.27) and (E.28) into Equation (E.26), we get

    \frac{\partial}{\partial R} \mathcal{N}(y; \mu, R) = \left( (2\pi)^{d/2} |R'R|^{1/2} \right)^{-1} \exp\left[ -\frac{1}{2} (y-\mu)'\, R^{-1}R^{-T}\, (y-\mu) \right] \times R \left( \Upsilon - \Sigma^{-1} \right)
      = \mathcal{N}(y; \mu, R)\, R\bar{\Upsilon},    (E.30)

where, as before, $\bar{\Upsilon}$ is defined as

    \bar{\Upsilon} = \Upsilon - \Sigma^{-1} = \Sigma^{-1} (y-\mu)(y-\mu)'\, \Sigma^{-1} - \Sigma^{-1}.    (E.31)
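Equation (E.30) can likewise be checked by finite differences. Since the identities used in the derivation hold for a general (not merely triangular) matrix, the check below perturbs every entry of $R$; values are arbitrary:

```python
import numpy as np

def gaussian_chol(y, mu, R):
    """Modified Gaussian of Equation (E.23), parameterized by R with R'R = Sigma."""
    d = len(mu)
    z = np.linalg.solve(R.T, y - mu)       # R^{-T} (y - mu)
    return np.exp(-0.5 * z @ z) / ((2 * np.pi) ** (d / 2)
                                   * np.sqrt(np.linalg.det(R.T @ R)))

y = np.array([0.5, -1.0])
mu = np.array([0.2, 0.3])
R = np.array([[1.0, 0.3], [0.0, 1.2]])     # upper triangular, R'R = Sigma

# Analytic gradient, Equations (E.30)-(E.31).
Sigma = R.T @ R
Si = np.linalg.inv(Sigma)
diff = (y - mu)[:, None]
Ubar = Si @ diff @ diff.T @ Si - Si        # Upsilon-bar
grad = gaussian_chol(y, mu, R) * R @ Ubar  # N(y; mu, R) R Upsilon-bar

# Central finite differences over all entries of R.
h = 1e-6
fd = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2)); E[i, j] = h
        fd[i, j] = (gaussian_chol(y, mu, R + E)
                    - gaussian_chol(y, mu, R - E)) / (2 * h)
```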
APPENDIX F
MATRIX CALCULUS
F.1 Introduction
A few of the derivations in Appendix E depend on matrix derivatives. Some of these derivatives
were taken from other sources [131, 132], but at least one requires some additional derivation not
found elsewhere.
F.2 Preliminaries
For the derivations below, $X$ is assumed to be a square matrix, $(X)_{ij} = x_{ij}$ is the element at position $(i, j)$ of matrix $X$, $a$ is a vector, and $e_i$ is the $i$th column of the identity matrix $I$. Let $X_{ij}$ refer to cofactor $(i, j)$ of matrix $X$. For a vector or matrix $v$, $v^T$ indicates its transpose. Let $X^{-1}$ be the inverse of $X$, and let $X^{-T}$ be the inverse of the transpose of $X$. When taking the inverse, we assume that $X$ is nonsingular.
We will use the following identities below. For nonsymmetric $X$,

    \frac{\partial}{\partial x_{ij}} X = e_i e_j^T,    (F.1)

where $e_i e_j^T$ is a square matrix with a one at position $(i, j)$ and zeros elsewhere. If $X$ is symmetric,

    \frac{\partial}{\partial x_{ij}} X = \begin{cases} e_i e_j^T & \text{if } i = j \\ e_i e_j^T + e_j e_i^T & \text{if } i \neq j. \end{cases}    (F.2)

The derivative of $X^T X$ is given by

    \frac{\partial}{\partial x_{ij}} X^T X = \left( \frac{\partial}{\partial x_{ij}} X^T \right) X + X^T \left( \frac{\partial}{\partial x_{ij}} X \right) = e_j e_i^T X + X^T e_i e_j^T    (F.3)

for nonsymmetric $X$.
For matrix $X$,

    |X| = \sum_j x_{ij} X_{ij}    (F.4)

for any fixed $i$ [132]. Because each cofactor $X_{ij}$ is independent of $x_{ij}$, this implies, for nonsymmetric $X$, that

    \frac{\partial}{\partial x_{ij}} |X| = X_{ij}    (F.5)

and

    \frac{\partial}{\partial X} |X| = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1r} \\ X_{21} & X_{22} & \cdots & X_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ X_{r1} & X_{r2} & \cdots & X_{rr} \end{bmatrix} = |X| X^{-T}    (F.6)

(from [131, 132]). If $X$ is symmetric,

    \frac{\partial}{\partial x_{ij}} |X| = \begin{cases} X_{ij} & \text{if } i = j \\ 2X_{ij} & \text{if } i \neq j \end{cases}    (F.7)

and

    \frac{\partial}{\partial X} |X| = \begin{bmatrix} X_{11} & 2X_{12} & \cdots & 2X_{1r} \\ 2X_{21} & X_{22} & \cdots & 2X_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ 2X_{r1} & 2X_{r2} & \cdots & X_{rr} \end{bmatrix} = |X| \left( 2X^{-T} - \mathrm{diag}(X^{-T}) \right).    (F.8)
F.3 Derivation of $\frac{\partial}{\partial X} a^T X^{-1} a$

This derivation comes from [131]. We start with

    0 = \frac{\partial}{\partial x_{ij}} I = \frac{\partial}{\partial x_{ij}} \left( X X^{-1} \right) = \frac{\partial}{\partial x_{ij}} (X)\, X^{-1} + X\, \frac{\partial}{\partial x_{ij}} (X^{-1}),    (F.9)

which implies

    \frac{\partial}{\partial x_{ij}} X^{-1} = -X^{-1}\, \frac{\partial}{\partial x_{ij}} (X)\, X^{-1}.    (F.10)

Therefore, for nonsymmetric $X$,

    \frac{\partial}{\partial x_{ij}}\, a^T X^{-1} a = -a^T X^{-1}\, \frac{\partial}{\partial x_{ij}} (X)\, X^{-1} a
      = -a^T X^{-1} e_i e_j^T X^{-1} a
      = -a^T X^{-1} e_i \cdot e_j^T X^{-1} a
      = -e_i^T X^{-T} a \cdot a^T X^{-T} e_j
      = -(X^{-T} a a^T X^{-T})_{ij},    (F.11)

which implies that

    \frac{\partial}{\partial X}\, a^T X^{-1} a = -X^{-T} a a^T X^{-T}.    (F.12)
For symmetric $X$, $\frac{\partial}{\partial x_{ij}}\, a^T X^{-1} a$ is the same as Equation (F.11) if $i = j$. If $i \neq j$,

    \frac{\partial}{\partial x_{ij}}\, a^T X^{-1} a = -a^T X^{-1}\, \frac{\partial}{\partial x_{ij}} (X)\, X^{-1} a
      = -a^T X^{-1} (e_i e_j^T + e_j e_i^T) X^{-1} a
      = -a^T X^{-1} e_i \cdot e_j^T X^{-1} a - a^T X^{-1} e_j \cdot e_i^T X^{-1} a
      = -e_i^T X^{-1} a \cdot a^T X^{-1} e_j - e_j^T X^{-1} a \cdot a^T X^{-1} e_i
      = -e_i^T X^{-1} a a^T X^{-1} e_j - e_i^T X^{-1} a a^T X^{-1} e_j
      = -2 e_i^T X^{-1} a a^T X^{-1} e_j
      = -2 (X^{-T} a a^T X^{-T})_{ij},    (F.13)

where in lines four and five we take advantage of the fact that $X$ (and therefore $X^{-1}$) is symmetric. The full derivative for symmetric $X$ is then

    \frac{\partial}{\partial X}\, a^T X^{-1} a = -2X^{-T} a a^T X^{-T} + \mathrm{diag}(X^{-T} a a^T X^{-T}).    (F.14)
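Equation (F.12) for nonsymmetric $X$ is straightforward to confirm numerically; the matrix and vector below are arbitrary, with a diagonally dominant $X$ so that the inverse is well-conditioned:

```python
import numpy as np

a = np.array([1.0, -2.0, 0.5])
X = np.array([[3.0, 0.5, 0.2],
              [0.1, 2.5, 0.4],
              [0.3, 0.2, 3.5]])            # nonsymmetric, well-conditioned

def f(M):
    return a @ np.linalg.solve(M, a)       # a' M^{-1} a

# Analytic gradient, Equation (F.12): -X^{-T} a a^T X^{-T}.
Xi = np.linalg.inv(X)
grad = -Xi.T @ np.outer(a, a) @ Xi.T

# Central finite differences over every entry of X.
h = 1e-6
fd = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3)); E[i, j] = h
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * h)
```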
F.4 Derivation of $\frac{\partial}{\partial X} a^T X^{-1} X^{-T} a$

This derivation is similar to, but more complicated than, the derivation in the previous section. Starting with $\frac{\partial}{\partial X} X^{-1} X^{-T}$, note that

    0 = \frac{\partial}{\partial x_{ij}} I = \frac{\partial}{\partial x_{ij}} \left( X^T X\, X^{-1} X^{-T} \right) = \frac{\partial}{\partial x_{ij}} \left( X^T X \right) X^{-1} X^{-T} + X^T X\, \frac{\partial}{\partial x_{ij}} \left( X^{-1} X^{-T} \right),    (F.15)

which implies

    \frac{\partial}{\partial x_{ij}} \left( X^{-1} X^{-T} \right) = -X^{-1} X^{-T}\, \frac{\partial}{\partial x_{ij}} \left( X^T X \right) X^{-1} X^{-T}.    (F.16)

Therefore,

    \frac{\partial}{\partial x_{ij}} \left( a^T X^{-1} X^{-T} a \right) = a^T\, \frac{\partial}{\partial x_{ij}} \left( X^{-1} X^{-T} \right) a
      = -a^T X^{-1} X^{-T}\, \frac{\partial}{\partial x_{ij}} \left( X^T X \right) X^{-1} X^{-T} a
      = -a^T X^{-1} X^{-T} \left( e_j e_i^T X + X^T e_i e_j^T \right) X^{-1} X^{-T} a
      = -a^T X^{-1} X^{-T} e_j e_i^T X X^{-1} X^{-T} a - a^T X^{-1} X^{-T} X^T e_i e_j^T X^{-1} X^{-T} a
      = -a^T X^{-1} X^{-T} e_j \cdot e_i^T X X^{-1} X^{-T} a - a^T X^{-1} X^{-T} X^T e_i \cdot e_j^T X^{-1} X^{-T} a
      = -e_j^T X^{-1} X^{-T} a \cdot a^T X^{-1} X^{-T} X^T e_i - e_i^T X X^{-1} X^{-T} a \cdot a^T X^{-1} X^{-T} e_j
      = -e_i^T X X^{-1} X^{-T} a a^T X^{-1} X^{-T} e_j - e_i^T X X^{-1} X^{-T} a a^T X^{-1} X^{-T} e_j
      = -2 e_i^T X X^{-1} X^{-T} a a^T X^{-1} X^{-T} e_j
      = -2 \left( X X^{-1} X^{-T} a a^T X^{-1} X^{-T} \right)_{ij}.    (F.17)
The full derivative is then

    \frac{\partial}{\partial X} (a^T X^{-1} X^{-T} a) = -2X X^{-1} X^{-T} a a^T X^{-1} X^{-T}    (F.18)
      = -2X^{-T} a a^T X^{-1} X^{-T}.    (F.19)

In our work, we actually use Equation (F.18) rather than Equation (F.19).
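A quick numerical confirmation of Equations (F.18) and (F.19), which agree since $XX^{-1}X^{-T} = X^{-T}$, again with arbitrary, well-conditioned values:

```python
import numpy as np

a = np.array([1.0, -2.0, 0.5])
X = np.array([[3.0, 0.5, 0.2],
              [0.1, 2.5, 0.4],
              [0.3, 0.2, 3.5]])            # nonsymmetric, well-conditioned

def f(M):
    Mi = np.linalg.inv(M)
    return a @ Mi @ Mi.T @ a               # a' X^{-1} X^{-T} a

Xi = np.linalg.inv(X)
S = Xi @ Xi.T                              # X^{-1} X^{-T}
grad_F18 = -2 * X @ S @ np.outer(a, a) @ S   # Equation (F.18)
grad_F19 = -2 * Xi.T @ np.outer(a, a) @ S    # Equation (F.19)

# Central finite differences over every entry of X.
h = 1e-6
fd = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3)); E[i, j] = h
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * h)
```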
REFERENCES
[1] A. Turing, “Computing machinery and intelligence,” Mind, vol. 59, pp. 433–460, 1950.
[2] G. Stojanov, “Petitage: A case study in developmental robotics,” in Proc. 1st Int. Workshopon Epigenetic Robotics, Lund, Sweden, 2001.
[3] R. A. Brooks, “Achieving artificial intelligence through building robots,” Massachusettes In-stitute of Technology, Artificial Intelligence Laboratory, Tech. Rep. 899, 1986.
[4] R. A. Brooks and L. A. Stein, “Building brains for bodies,” Massachusettes Institute of Tech-nology, Artificial Intelligence Laboratory, Tech. Rep. 1439, 1993.
[5] S. E. Levinson, “The role of sensorimotor function, associative memory and reinforcementlearning in automatic acquisition of spoken language by an autonomous robot,” in Proc. NSFDarpa Workshop on Development and Learning, Michigan State University, Apr. 2000.
[6] J. Krichmar and G. Edelman, “Machine psychology: Autonomous behavior, perceptual cat-egorization and conditioning in a brain-based device,” Cerebral Cortex, vol. 12, pp. 818–830,2002.
[7] Y. Zhang and J. Weng, “Grounded auditory development by a developmental robot,” in Proc.INNS/IEEE Int. Joint Conf. Neural Networks, Washington DC, July 2001, pp. 1059–1064.
[8] J. D. Han, S. W. Zeng, K. Y. Tham, M. Badgero, and J. Weng, “Dav: A humanoid robotplatform for autonomous mental development,” in Proc. 2nd Int. Conf. on Development andLearning, Cambridge, MA, June 2002.
[9] M. Lungarella and G. Metta, “Beyond gazing, pointing, and reaching: A survey of develop-mental robotics,” in Proc. 3rd Int. Workshop on Epigenetic Robotics, 2003.
[10] R. Brooks, C. Breazeal, M. Marjanovic, B. Scassellati, and M. Williamson, “The Cog project:Building a humanoid robot,” in Computation for Metaphors, Analogy and Agents, C. Nehaniv,Ed. Berlin: Springer-Verlag, 1998, pp. 52–87.
[11] P. Varshavskaya, “Behavior-based early language development on a humanoid robot,” in Proc.2nd Int. Workshop on Epigenetic Robotics, 2002.
[12] C. Breazeal and B. Scassellati, “How to build robots that make friends and influence people,”in Proc. Int. Conf. on Intell. Robots and Systems, Kyongju, Korea, 1999.
[13] S. Schaal, “Is imitation learning the route to humanoid robots?” Trends in Cognitive Sciences,vol. 3, no. 6, pp. 233–242, 1999.
147
[14] H. Kozima and H. Yano, “A robot that learns to communicate with human caregivers,” inProc. 1st Int. Workshop on Epigenetic Robotics, Lund, Sweden, 2001.
[15] J. Weng, “A theory for mentally developing robots,” in Proc. 2nd Int. Conf. on Developmentand Learning, Cambridge, MA, June 2002, pp. 131–140.
[16] J. Weng, Y. Zhang, and Y. Chen, “Developing early senses about the world: ‘Object perma-nence’ and visuoauditory real-time learning,” in Proc. INNS/IEEE Int. Joint Conf. NeuralNetworks, vol. 4, July 2003, pp. 2710–2715.
[17] N. Almassy, G. M. Edelman, and O. Sporns, “Behavioral constraints in the development ofneuronal properties: a cortical model embedded in a real world device,” Cerebral Cortex,vol. 8, pp. 346–361, 1998.
[18] O. Sporns and W. H. Alexander, “Neuromodulation in a learning robot: Interactions betweenneural plasticity and behavior,” in Proc. INNS/IEEE Int. Joint Conf. Neural Networks, vol. 4,July 2003, pp. 2789–2794.
[19] A. K. Seth, J. L. McKinstry, G. M. Edelman, and J. L. Krichmar,“Visual binding, reentry andneuronal synchrony in a physically situated brain-based device,” in Proc. 3rd Int. Workshopon Epigenetic Robotics, 2003.
[20] K. Fischer and R. Moratz, “From communicative strategies to cognitive modelling,” in Proc.1st Int. Workshop on Epigenetic Robotics, Lund, Sweden, 2001.
[21] L. Hugues and A. Drogoul, “Shaping of robot behaviors by demonstration,” in Proc. 1st Int.Workshop on Epigenetic Robotics, Lund, Sweden, 2001.
[22] P. R. Cohen, C. Sutton, and B. Burns, “Learning effects of robot actions using temporalassociations,” in Proc. 2nd Int. Conf. on Development and Learning, Cambridge, MA, June2002, pp. 96–101.
[23] I. Fasel, G. O. Deak, J. Triesch, and J. Movellan, “Combining embodied models and empiricalresearch for understanding the development of shared attention,” in Proc. 2nd Int. Conf. onDevelopment and Learning, Cambridge, MA, June 2002, pp. 21–27.
[24] R. A. Grupen, “A developmental organization for robot behavior,” in Proc. 3rd Int. Workshopon Epigenetic Robotics, 2003.
[25] Arrick Robotics, http://www.robotics.com/.
[26] B. Gold and N. Morgan, Speech and Audio Signal Processing. New York: Wiley, 2000.
[27] M. J. Tovee, An Introduction to the Visual System. Cambridge: Cambridge University Press,1996.
[28] W. Zhu and S. E. Levinson, “Edge orientation-based multiview object recognition,” in Proc.IEEE Int’l Conf. on Pattern Recognition, vol. 1, Barcelona, Spain, 2000, pp. 936–939.
[29] W. Zhu, S. Wang, R. S. Lin, and S. E. Levinson, “Tracking of object with SVM regression,”in Proc. IEEE Int. Conf. on Comput. Vision & Pattern Recognition, vol. 2, Hawaii, 2001, pp.240–245.
148
[30] R. S. Lin, “Learning vision-based robot navigation,” M.S. thesis, University of Illinois atUrbana-Champaign, 2004.
[31] D. Li and S. E. Levinson, “A robust linear phase unwrapping method for dual-channel soundsource localization,” in Int. Conf. on Robot. Automat., Washington D.C., May 2002.
[32] D. Li and S. E. Levinson, “A Bayes-rule based hierarchical system for binaural sound sourcelocalization,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, Hong Kong, Apr.2003.
[33] M. Hakozaki, H. Oasa, and H. Shinoda, “Telemetric robot skin,” in Proc. IEEE Int. Conf. onRobot. Automat., Detroit, Michigan, May 1999.
[34] A. Loutfi, S. Coradeschi, T. Duckett, and P. Wide, “Odor source identification by groundinglinguistic descriptions in an artificial nose,” in Proc. SPIE Conf. on Sensor Fusion: Architec-tures, Algorithms and Applications V, vol. 4385, Orlando, Florida, 2001, pp. 273–282.
[35] S. Savoy et al., “Solution-based analysis of multiple analytes by a sensor array: Towardthe development of an electronic tongue,” in SPIE Conf. on Chemical Microsensors andApplications, vol. 3539, Boston, MA, Nov. 1998.
[36] F. W. Edridge-Green, Memory and Its Cultivation. New York: D. Appleton and Co., 1900.
[37] M. H. Ashcraft, Human Memory and Cognition. New York: Harper Collins, 1989.
[38] A. Baddeley, “Memory,” in MIT Encyclopedia of Cognitive Science, R. A. Wilson and F. Keil,Eds. Cambridge, MA: The MIT Press, 1999.
[39] D. L. Schacter and E. Tulving, Eds., Memory Systems 1994. Cambridge, MA: The MITPress, 1994.
[40] D. R. Shanks, The Psychology of Associative Learning. Cambridge: Cambridge UniversityPress, 1995.
[41] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” J.Artif. Intell. Research, vol. 4, pp. 237–285, 1996.
[42] M. Wines, “For sniffing out land mines, a platoon of twitching noses,” The New York Times,p. A1, May 18, 2004.
[43] A. Gopnik, A. N. Meltzoff, and P. K. Kuhl, The Scientist in the Crib. New York: HarperCollins, 1999.
[44] C. Garvey, Play. Cambridge, MA: Harvard University Press, 1990.
[45] J. Kaminski, J. Call, and J. Fischer, “Word learning in a domestic dog: Evidence for ‘fastmapping’,” Science, vol. 304, pp. 1682–1683, June 2004.
[46] B. Breidegard and C. Balkenius, “Speech development by imitation,” in Proc. 3rd Int. Work-shop on Epigenetic Robotics, 2003.
[47] M. Cabido-Lopes and J. Santos-Victor, “Visual transformations in gesture imitation: Whatyou see is what you do,” in Proc. Int. Conf. Robot. Automat., 2003, pp. 2375–2381.
149
[48] M. Kleffner, “A method of automatic speech imitation via warped linear prediction,” M.S. thesis, University of Illinois at Urbana-Champaign, 2003.
[49] W. Zhu and S. E. Levinson, “PQ-learning: An efficient robot learning method for intelligent behavior acquisition,” in Proc. 7th Int. Conf. on Intell. Autonomous Systems, vol. 1, Marina del Rey, CA, Mar. 2002, pp. 404–411.
[50] M. McClain, “The role of exploration in language acquisition for an autonomous robot,” M.S. thesis, University of Illinois at Urbana-Champaign, 2003.
[51] Q. Liu, “Interactive and incremental learning via a multisensory mobile robot,” Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2001.
[52] W. Zhu and S. E. Levinson, “JPDF-based visual concept learning by an autonomous agent,” in Proc. Int. Conf. Vision Interface, 2003.
[53] S. Carey and E. Bartlett, “Acquiring a single new word,” Papers and Reports on Child Language Development, vol. 15, pp. 17–29, 1978.
[54] L. Markson and P. Bloom, “Evidence against a dedicated system for word learning in children,” Nature, vol. 385, pp. 813–815, Feb. 1997.
[55] K. Yip and G. J. Sussman, “Sparse representations for fast, one-shot learning,” in Proc. Nat. Conf. Artif. Intell., 1997.
[56] J. C. Nieh, “Stingless-bee communication,” American Scientist, vol. 87, no. 5, pp. 428–435, Sept. 1999.
[57] S. Laurence and E. Margolis, “Concepts and cognitive science,” in Concepts: Core Readings, S. Laurence and E. Margolis, Eds. Cambridge, MA: The MIT Press, 1999, pp. 3–81.
[58] R. L. Solso, Cognitive Psychology, 4th ed. Boston: Allyn and Bacon, 1995.
[59] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks. Reading, MA: Addison-Wesley, 1989.
[60] J. M. Zurada, Introduction to Artificial Neural Systems. St. Paul, MN: West Publishing Company, 1992.
[61] F. V. Jensen, Bayesian Networks and Decision Graphs. New York: Springer-Verlag, 2001.
[62] H. Pan, Z.-P. Liang, and T. Huang, “Fusing audio and visual features of speech,” in Proc. 2000 Int. Conf. Image Processing, vol. 3, 2000, pp. 214–217.
[63] H. Pan, “A Bayesian fusion approach and its application to integrating audio and visual signals in HCI,” Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2001.
[64] M. Brand, “Coupled hidden Markov models for modeling interacting processes,” MIT Media Lab, Tech. Rep. 405, 1997.
[65] S. M. Chu, “Multimodal fusion with applications to audio-visual speech recognition,” Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2003.
[66] V. Krishnamurthy and G. G. Yin, “Recursive algorithms for estimation of hidden Markov models and autoregressive models with Markov regime,” IEEE Trans. Inform. Theory, vol. 48, no. 2, pp. 458–476, Feb. 2002.
[67] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall PTR, 1993.
[68] P. Boufounos, S. El-Difrawy, and D. Ehrlich, “Hidden Markov models for DNA sequencing,” in Workshop on Genomic Signal Processing and Statistics (GENSIPS 2002), Oct. 2002.
[69] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, “Hidden Markov models in computational biology: Applications to protein modeling,” J. of Molecular Biology, vol. 235, pp. 1501–1531, 1994.
[70] D. Kulp, D. Haussler, M. G. Reese, and F. H. Eeckman, “A generalized hidden Markov model for the recognition of human genes in DNA,” in Proc. 4th Int. Conf. Intell. Syst. Molecular Bio., 1996, pp. 134–142.
[71] R. L. Cave and L. P. Neuwirth, “Hidden Markov models for English,” in Proc. of the Symposium on the Applications of Hidden Markov Models to Text and Speech. Princeton, NJ: IDA-CRD, Oct. 1980, pp. 16–56.
[72] A. B. Poritz, “Linear predictive hidden Markov models and the speech signal,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, 1982, pp. 1291–1294.
[73] A. Ljolje and S. E. Levinson, “Development of an acoustic-phonetic hidden Markov model for continuous speech recognition,” IEEE Trans. Signal Processing, vol. 39, no. 1, pp. 29–39, 1991.
[74] A. Arapostathis and S. I. Marcus, “Analysis of an identification algorithm arising in the adaptive estimation of Markov chains,” Math Control Signals Systems, vol. 3, no. 1, pp. 1–29, 1990.
[75] I. B. Collings, V. Krishnamurthy, and J. B. Moore, “On-line identification of hidden Markov models via recursive prediction error techniques,” IEEE Trans. Signal Processing, vol. 42, no. 12, pp. 3535–3539, Dec. 1994.
[76] F. LeGland and L. Mevel, “Recursive estimation in hidden Markov models,” in Proc. 36th IEEE Conf. Decision Contr., San Diego, CA, Dec. 1997.
[77] U. Holst and G. Lindgren, “Recursive estimation in mixture models with Markov regime,” IEEE Trans. Inform. Theory, vol. 37, no. 6, pp. 1683–1690, Nov. 1991.
[78] V. Krishnamurthy and J. B. Moore, “On-line estimation of hidden Markov model parameters based on the Kullback-Leibler information measure,” IEEE Trans. Signal Processing, vol. 41, no. 8, pp. 2557–2573, Aug. 1993.
[79] T. Ryden, “On recursive estimation for hidden Markov models,” Stochastic Processes and their Applications, vol. 66, pp. 79–96, 1997.
[80] F. LeGland and L. Mevel, “Recursive identification of HMM’s with observations in a finite set,” in Proc. 34th IEEE Conf. Decision Contr., New Orleans, Dec. 1995, pp. 216–221.
[81] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, “An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition,” The Bell System Technical Journal, vol. 62, no. 4, pp. 1035–1074, Apr. 1983.
[82] H. V. Poor, An Introduction to Signal Detection and Estimation. New York: Springer-Verlag, 1994.
[83] F. LeGland and L. Mevel, “Geometric ergodicity in hidden Markov models,” INRIA, Tech. Rep. RR-2991, Sept. 1996.
[84] S. P. Meyn, Markov Chains and Stochastic Stability. London: Springer-Verlag, 1993.
[85] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. New York: Springer-Verlag, 2003.
[86] V. Krishnamurthy and T. Ryden, “Consistent estimation of linear and non-linear autoregressive models with Markov regime,” J. of Time Series Analysis, vol. 19, no. 3, pp. 291–307, 1998.
[87] N. N. Schraudolph, “Local gain adaptation in stochastic gradient descent,” in Proc. 9th Int. Conf. on Artif. Neural Networks, 1999.
[88] N. N. Schraudolph and T. Graepel, “Combining conjugate direction methods with stochastic approximation of gradients,” in Proc. 9th Int. Workshop Artif. Intell. and Statistics, 2003.
[89] Y. Ephraim and N. Merhav, “Hidden Markov processes,” IEEE Trans. Inform. Theory, vol. 48, no. 6, pp. 1518–1569, June 2002.
[90] E. Gassiat and S. Boucheron, “Optimal error exponents in hidden Markov models order estimation,” IEEE Trans. Inform. Theory, vol. 49, no. 4, pp. 964–980, Apr. 2003.
[91] T. Ryden, “Estimating the order of hidden Markov models,” Statistics, vol. 26, pp. 345–354, 1995.
[92] R. J. MacKay, “Estimating the order of a hidden Markov model,” The Canadian Journal of Statistics, vol. 30, no. 4, pp. 573–589, 2002.
[93] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York: Wiley, 2001.
[94] J. D. Ferguson, “Variable duration models for speech,” in Proc. of the Symposium on the Applications of Hidden Markov Models to Text and Speech, J. D. Ferguson, Ed. Princeton, NJ: IDA-CRD, Oct. 1980, pp. 143–179.
[95] M. J. Russell and R. K. Moore, “Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, vol. 10, Apr. 1985, pp. 5–8.
[96] S. E. Levinson, “Continuously variable duration hidden Markov models for automatic speech recognition,” Computer Speech and Language, vol. 1, pp. 29–45, 1986.
[97] S. Fine, Y. Singer, and N. Tishby, “The hierarchical hidden Markov model: Analysis and applications,” Machine Learning, vol. 32, no. 1, pp. 41–62, 1998.
[98] K. Murphy and M. Paskin, “Linear time inference in hierarchical HMMs,” in Advances in Neural Information Processing Systems, 2002.
[99] Z. Ghahramani and G. E. Hinton, “Factorial hidden Markov models,” Machine Learning, vol. 29, pp. 245–273, 1997.
[100] V. Pavlovic, J. M. Rehg, T.-J. Cham, and K. P. Murphy, “A dynamic Bayesian network approach to figure tracking using learned dynamic models,” in Proc. Int. Conf. on Comput. Vision, 1999, pp. 94–101.
[101] Z. Ghahramani and G. Hinton, “Variational learning for switching state-space models,” Neural Computation, vol. 12, pp. 831–864, 2000.
[102] K. S. Fu, Syntactic Methods in Pattern Recognition. New York: Academic Press, 1974.
[103] E. Charniak, Statistical Language Learning. Cambridge, MA: The MIT Press, 1996.
[104] N. Chomsky, “Three models for the description of language,” IEEE Trans. Inform. Theory, vol. 2, no. 3, pp. 113–124, Nov. 1956.
[105] D. Roy, “Grounded spoken language acquisition: Experiments in word learning,” IEEE Trans. Multimedia, vol. 5, no. 2, June 2003.
[106] L. Steels, “Language games for autonomous robots,” IEEE Intell. Syst., pp. 17–22, Sept./Oct. 2001.
[107] L. Steels and F. Kaplan, “Aibo’s first words: The social learning of language and meaning,” Evolution of Communication, vol. 4, no. 1, pp. 3–32, 2001.
[108] T. Oates, Z. Eyler-Walker, and P. R. Cohen, “Using syntax to learn semantics: An experiment in language acquisition with a mobile robot,” University of Massachusetts Computer Science Department, Tech. Rep. 99-35, 1999.
[109] T. Oates, “Grounding knowledge in sensors: Unsupervised learning for language and planning,” Ph.D. dissertation, University of Massachusetts, Amherst, 2001.
[110] B. Burns, C. Sutton, C. Morrison, and P. Cohen, “Information theory and representation in associative word learning,” in Proc. 3rd Int. Workshop on Epigenetic Robotics, 2003.
[111] C. Crangle and P. Suppes, Language and Learning for Robots. Stanford, CA: Center for the Study of Language and Information, 1994.
[112] K. P. Murphy, Y. Weiss, and M. I. Jordan, “Loopy belief-propagation for approximate inference: An empirical study,” in Proc. 15th Conf. Uncertainty in Artif. Intell., K. B. Laskey and H. Prade, Eds., San Mateo, CA, 1999.
[113] IEEE, “IEEE recommended practice for speech quality measurements,” IEEE Trans. Audio Electroacoust., vol. AU-17, no. 3, pp. 225–246, Sept. 1969.
[114] S. E. Levinson (personal communication), 2004.
[115] P. K. Kuhl, “Early language acquisition: Cracking the speech code,” Nature Reviews Neuroscience, vol. 5, Nov. 2004.
[116] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artif. Intell., vol. 101, no. 1-2, pp. 99–134, May 1998.
[117] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Upper Saddle River, NJ: Prentice Hall, 1978.
[118] H. W. Strube, “Linear prediction on a warped frequency scale,” J. of the Acoustical Society of America, vol. 68, pp. 1071–1076, 1980.
[119] U. K. Laine, M. Karjalainen, and T. Altosaar, “Warped linear prediction (WLP) in speech and audio processing,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, vol. III, 1994, pp. 349–352.
[120] J. O. Smith III and J. S. Abel, “Bark and ERB bilinear transforms,” IEEE Trans. Speech Audio Processing, vol. 7, pp. 697–708, 1999.
[121] A. Harma, “Evaluation of a warped linear predictive coding scheme,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, vol. II, 2000, pp. 897–900.
[122] R. Viswanathan and J. Makhoul, “Quantization properties of transmission parameters in linear predictive systems,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 23, pp. 309–321, 1975.
[123] R.-S. Lin (personal communication), 2004.
[124] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and Bayesian restoration of images,” IEEE Trans. Pattern Anal. Machine Intell., vol. 6, no. 6, pp. 721–741, Nov. 1984.
[125] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient belief propagation for early vision,” in Proc. IEEE Int. Conf. on Comput. Vision & Pattern Recognition, vol. 1, 2004, pp. 261–268.
[126] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” Int. J. of Comput. Vision, vol. 40, no. 1, pp. 25–47, 2000.
[127] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Understanding belief propagation and its generalizations,” Mitsubishi Electric Research Laboratories, Inc., Tech. Rep. TR-2001-22, Jan. 2002.
[128] Y. Weiss and W. T. Freeman, “On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs,” IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 726–744, Feb. 2001.
[129] T. Gevers and A. W. M. Smeulders, “Pictoseek: Combining color and shape invariant features for image retrieval,” IEEE Trans. Image Processing, vol. 9, no. 1, pp. 102–119, Jan. 2000.
[130] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. New York: Wiley, 1987.
[131] Matrix Reference Manual, http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html, June 2004.
[132] A. Graham, Kronecker Products and Matrix Calculus With Applications. Chichester, England: Ellis Horwood Limited, 1981.
VITA
Kevin Michael Squire received his BS in computer engineering from Case Western Reserve University in 1995, his MS in electrical engineering from the University of Illinois at Urbana-Champaign
(UIUC) in 1998, and with this dissertation has completed his PhD in electrical engineering at UIUC
in 2004. He has conducted research on artificial intelligence, stochastic modeling, learning, image
processing, and speech and language processing at UIUC and at the Tokyo Institute of Technology,
Tokyo, Japan.