
Machine Learning: Theory, Applications, Experiences

A Workshop for Women in Machine Learning

October 4, 2006, San Diego, CA

http://www.seas.upenn.edu/~wiml/

Workshop Organization

Organizers: Lisa Wainer, University College London
Hanna Wallach, University of Cambridge

Jennifer Wortman, University of Pennsylvania

Faculty advisor: Amy Greenwald, Brown University

Additional reviewers: Maria-Florina Balcan, Melissa Carroll, Kimberley Ferguson, Katherine Heller, Julia Hockenmaier, Rebecca Hutchinson, Kristina Klinkner, Bethany Leffler, Ozgur Simsek, Alicia Peregrin Wolfe, Elena Zheleva

Thanks to our generous sponsors:

Schedule

October 3, 2006

19:30 Workshop dinner

October 4, 2006

08:45 Registration and poster set-up

09:00 Welcome

09:15 Invited talk: A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria
Amy Greenwald, Brown University

09:45 On a Theory of Learning with Similarity Functions
Maria-Florina Balcan, Carnegie Mellon University

10:00 Matrix Tile Analysis
Inmar Givoni, University of Toronto

10:15 Towards Bayesian Black Box Learning Systems
Jo-Anne Ting, University of Southern California

10:30 Coffee break

10:45 Invited talk: Clustering High-Dimensional Data
Jennifer Dy, Northeastern University

11:15 Efficient Bayesian Algorithms for Clustering
Katherine Heller, Gatsby Unit, University College London

11:30 Hidden Process Models
Rebecca Hutchinson, Carnegie Mellon University

11:45 Invited talk: Recent advances in near-neighbor learning
Maya Gupta, University of Washington

12:15 Spotlight talks:

Correcting sample selection bias by unlabeled data
Jiayuan Huang, University of Waterloo

Decision Tree Methods for Finding Reusable MDP Homomorphisms
Alicia Peregrin Wolfe, University of Massachusetts, Amherst

Evaluating a Reputation-based Spam Classification System
Elena Zheleva, University of Maryland, College Park

Improving Robot Navigation Through Self-Supervised Online Learning
Ellie Lin, Carnegie Mellon University

12:30 Lunch

13:00 Poster session 1

13:45 Invited talk: SRL: Statistical Relational Learning
Lise Getoor, University of Maryland, College Park

14:15 Generalized statistical methods for fraud detection
Cecile Levasseur, University of California, San Diego

14:30 Kernels for the Predictive Regression of Physical, Chemical and Biological Properties of Small Molecules
Chloe-Agathe Azencott, University of California, Irvine

14:45 Invited talk: Modeling and Learning User Preferences for Sets of Objects
Marie desJardins, University of Maryland, Baltimore County

15:15 Coffee break

15:30 Efficient Exploration with Latent Structure
Bethany Leffler, Rutgers University

15:45 Efficient Model Learning for Dialog Management
Finale Doshi, MIT

16:00 Transfer in the context of Reinforcement Learning
Soumi Ray, University of Maryland, Baltimore County

16:15 Spotlight talks:

Simultaneous Team Assignment and Behavior Recognition from Spatio-temporal Agent Traces
Gita Sukthankar, Carnegie Mellon University

An Online Learning System for the Prediction of Electricity Distribution Feeder Failures
Hila Becker, Columbia University

Classification of fMRI Images: An Approach Using Viola-Jones Features
Melissa K Carroll, Princeton University

Fast Online Classification with Support Vector Machines
Seyda Ertekin, Penn State University

16:30 Poster session 2

17:15 Open discussion

17:45 Closing remarks and poster take-down

18:00 End of workshop

Invited Talks

A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria
Amy Greenwald, Brown University

No-regret learning algorithms have attracted a great deal of attention in the game theoretic and machine learning communities. Whereas rational agents act so as to maximize their expected utilities, no-regret learners are boundedly rational agents that act so as to minimize their "regret". In this talk, we discuss the behavior of no-regret learning algorithms in repeated games.

Specifically, we introduce a general class of algorithms called no-Φ-regret learning, which includes common variants of no-regret learning such as no-external-regret and no-internal-regret learning. Analogously, we introduce a class of game-theoretic equilibria called Φ-equilibria. We show that no-Φ-regret learning algorithms converge to Φ-equilibria. In particular, no-external-regret learning converges to minimax equilibrium in zero-sum games; and no-internal-regret learning converges to correlated equilibrium in general-sum games. Although our class of no-regret algorithms is quite extensive, no algorithm in this class learns Nash equilibrium.
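
As a rough illustration of the regret notions involved (the notation here is mine, not taken from the talk): if the learner plays actions a_1, ..., a_T, receives reward functions r_1, ..., r_T, and Φ is a set of action transformations φ, then its Φ-regret is

    R_\Phi(T) \;=\; \max_{\phi \in \Phi} \sum_{t=1}^{T} \Big( r_t\big(\phi(a_t)\big) - r_t(a_t) \Big).

External regret corresponds to Φ containing only the constant transformations φ_a(·) = a, and internal regret to the transformations that replace one particular action a with another action b; a no-Φ-regret learner guarantees R_Φ(T)/T → 0.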

Speaker biography:

Dr. Amy Greenwald is an assistant professor of computer science at Brown University in Providence, Rhode Island. Her primary research area is the study of economic interactions among computational agents. Her primary methodologies are game-theoretic analysis and simulation. Her work is applicable in areas ranging from dynamic pricing to autonomous bidding to transportation planning and scheduling. She was awarded a Sloan Fellowship in 2006; she was nominated for the 2002 Presidential Early Career Award for Scientists and Engineers (PECASE); and she was named one of the Computing Research Association's Digital Government Fellows in 2001. Before joining the faculty at Brown, Dr. Greenwald was employed by IBM's T.J. Watson Research Center, where she researched Information Economies. Her paper entitled "Shopbots and Pricebots" (joint work with Jeff Kephart) was named Best Paper at IBM Research in 2000.

Clustering High-Dimensional Data
Jennifer Dy, Northeastern University

Creating effective algorithms for unsupervised learning is important because vast amounts of data preclude humans from manually labeling the categories of each instance. In addition, human labeling is expensive and subjective. Therefore, a majority of existing data is unsupervised (unlabeled). The goal of unsupervised learning or cluster analysis is to group "similar" objects together. "Similarity" is typically defined by a metric or a probability model. These measures are highly dependent on the features representing the data. Many clustering algorithms assume that relevant features have been determined by the domain experts. But, not all features are important. Moreover, many clustering algorithms fail when dealing with high-dimensions. We present two approaches for dealing with clustering in high-dimensional spaces: 1. Feature selection for clustering, through Gaussian mixtures and the maximum likelihood and scatter separability criteria, and 2. Hierarchical feature transformation and clustering, through automated hierarchical mixtures of probabilistic principal component analyzers.

Speaker biography:

Dr. Jennifer G. Dy has been an assistant professor in the Department of Electrical and Computer Engineering at Northeastern University, Boston, MA, since 2002. She obtained her MS and PhD in 1997 and 2001, respectively, from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, and her BS degree in 1993 from the Department of Electrical Engineering, University of the Philippines. She received an NSF CAREER award in 2004. She has been an editorial board member for the journal Machine Learning since 2004 and was publications chair for the International Conference on Machine Learning in 2004. Her research interests include Machine Learning, Data Mining, Statistical Pattern Recognition, and Computer Vision.

Recent advances in near-neighbor learning
Maya R. Gupta, University of Washington

Recent advances in nearest-neighbor learning are shown for adaptive neighborhood definitions, neighborhood weighting, and estimating given nearest-neighbors. In particular, it is shown that weights that solve linear interpolation equations minimize the first-order learning error, and this is coupled with the principle of maximum entropy to create a flexible weighting approach. Different approaches to adaptive neighborhoods are contrasted, the focus being on neighborhoods that form a convex hull around the test point. Standard weighted nearest-neighbor estimation is shown to maximize likelihood, and it is shown that minimizing expected Bregman divergence instead leads to optimal solutions in terms of expected misclassification cost. Applications may include the testing of pipeline integrity, custom color enhancements, and estimation for color management.
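
As a sketch of the "linear interpolation plus maximum entropy" weighting mentioned above (my formulation; x denotes the test point and x_1, ..., x_k its selected neighbors), the weights can be taken as the maximum-entropy solution of the linear interpolation equations:

    \max_{w \ge 0} \; -\sum_{i=1}^{k} w_i \log w_i
    \quad \text{subject to} \quad \sum_{i=1}^{k} w_i x_i = x, \qquad \sum_{i=1}^{k} w_i = 1.

The equality constraints are the linear interpolation equations referred to in the abstract; maximizing entropy selects a unique, maximally noncommittal weight vector when those equations are underdetermined.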

Speaker biography:

Maya Gupta completed her Ph.D. in Electrical Engineering in 2003 at Stanford University as a National Science Foundation Graduate Fellow. Her undergraduate studies led to a BS in Electrical Engineering and a BA in Economics from Rice University in 1997. From 1999-2003 she worked for Ricoh's California Research Center as a color image processing research engineer. In the fall of 2003, she joined the EE faculty of the University of Washington as an Assistant Professor where she also serves as an Adjunct Assistant Professor of Applied Mathematics. More information about her research is available at her group's webpage: idl.ee.washington.edu.

Modeling and Learning User Preferences for Sets of Objects
Marie desJardins, University of Maryland, Baltimore County

Most work on preference learning has focused on pairwise preferences or rankings over individual items. In many application domains, however, when a set of items is presented together, the individual items can interact in ways that increase (via complementarity) or decrease (via redundancy or incompatibility) the quality of the set as a whole.

In this talk, I will describe the DD-PREF language that we have developed for specifying set-based preferences. One problem with such a language is that it may be difficult for users to explicitly specify their preferences quantitatively. Therefore, we have also developed an approach for learning these preferences. Our learning method takes as input a collection of positive examples―that is, one or more sets that have been identified by a user as desirable. Kernel density estimation is used to estimate the value function for individual items, and the desired set diversity is estimated from the average set diversity observed in the collection.
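
A minimal sketch of the estimation step described above, not the authors' DD-PREF implementation: the item feature representation, the per-feature diversity measure, and the quality/diversity trade-off weight alpha are all assumptions made here for illustration.

    import numpy as np
    from scipy.stats import gaussian_kde

    def learn_set_preference(positive_sets):
        """positive_sets: list of 2-D arrays, one row per item, one column per feature."""
        all_items = np.vstack(positive_sets)
        # Value function per feature, estimated by kernel density estimation
        # over the items the user has already labeled as desirable.
        value_kdes = [gaussian_kde(all_items[:, j]) for j in range(all_items.shape[1])]
        # Desired diversity: the average per-feature spread observed in the example sets.
        target_div = np.mean([s.std(axis=0) for s in positive_sets], axis=0)
        return value_kdes, target_div

    def score_set(candidate, value_kdes, target_div, alpha=0.5):
        """Higher is better: item quality under the KDEs plus closeness to the target diversity."""
        quality = np.mean([kde(candidate[:, j]).mean() for j, kde in enumerate(value_kdes)])
        div_gap = np.mean(np.abs(candidate.std(axis=0) - target_div))
        return alpha * quality - (1.0 - alpha) * div_gap

Candidate sets would then be ranked by score_set, which mirrors the idea of trading off per-item value against matching the learned diversity.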

Since this is a new learning problem, I will also describe our new evaluation methodology and give experimental results of the learning method on two data collections: synthetic blocks-world data and a new real-world music data collection.

Joint work with Eric Eaton and Kiri L. Wagstaff.

Speaker biography:

Dr. Marie desJardins is an assistant professor in the Department of Computer Science and Electrical Engineering at the University of Maryland, Baltimore County. Prior to joining the faculty in 2001, Dr. desJardins was a senior computer scientist at SRI International in Menlo Park, California. Her research is in artificial intelligence, focusing on the areas of machine learning, multi-agent systems, planning, interactive AI techniques, information management, reasoning with uncertainty, and decision theory.

SRL: Statistical Relational Learning
Lise Getoor, University of Maryland, College Park

A key challenge for machine learning is mining richly structured datasets describing objects, their properties, and links among the objects. We'd like to be able to learn models that capture both the underlying uncertainty and the logical relationships in the domain. Links among the objects may exhibit patterns that are helpful for many practical inference tasks but are usually hard to capture with traditional statistical models. Recently there has been a surge of interest in this area, fueled largely by interest in web and hypertext mining, but also by interest in mining social networks, security and law enforcement data, bibliographic citations and epidemiological records.

Statistical Relational Learning (SRL) is a newly emerging research area, which attempts to represent, reason and learn in domains with complex relational and rich probabilistic structure. In this talk, I'll begin with a short SRL overview. Then, I'll describe some of my group's recent work, including our work on entity resolution in relational domains.

Joint work with Indrajit Bhattacharya, Mustafa Bilgic, Louis Licamele and Prithviraj Sen.

Speaker biography:

Prof. Lise Getoor is an assistant professor in the Computer Science Department at the University of Maryland, College Park. She received her PhD from Stanford University in 2001. Her current work includes research on link mining, statistical relational learning and representing uncertainty in structured and semi-structured data. Her work in these areas has been supported by NSF, NGA, KDD, ARL and DARPA. In June 2006, she co-organized the 4th in a series of successful workshops on statistical relational learning, http://www.cs.umd.edu/srl2006. She has published numerous articles in machine learning, data mining, database and AI forums. She is a member of AAAI Executive council, is on the editorial board of the Machine Learning Journal and JAIR and has served on numerous program committees including AAAI, ICML, IJCAI, KDD, SIGMOD, UAI, VLDB, and WWW.

Talks

On a Theory of Learning with Similarity Functions
Maria-Florina Balcan, Carnegie Mellon University

Kernel functions have become an extremely popular tool in machine learning. They have an attractive theory that describes a kernel function as being good for a given learning problem if data is separable by a large margin in a (possibly very high-dimensional) implicit space defined by the kernel. This theory, however, has a bit of a disconnect with the intuition of a good kernel as a good similarity function. In this work we develop an alternative theory of learning with similarity functions more generally (i.e., sufficient conditions for a similarity function to allow one to learn well) that does not require reference to implicit spaces, and does not require the function to be positive semi-definite. Our results also generalize the standard theory in the sense that any good kernel function under the usual definition can be shown to also be a good similarity function under our definition. In this way, we provide the first steps towards a theory of kernels that describes the effectiveness of a given kernel function in terms of natural similarity-based properties.
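
One way to state a sufficient condition of the kind described, paraphrased rather than quoted from the paper: a similarity function K is (\epsilon, \gamma)-good for a learning problem if at least a 1 - \epsilon probability mass of examples x satisfies

    \mathbb{E}_{x'}\big[ K(x, x') \mid \ell(x') = \ell(x) \big] \;\ge\; \mathbb{E}_{x'}\big[ K(x, x') \mid \ell(x') \ne \ell(x) \big] + \gamma,

i.e., on average x is noticeably more similar to random examples of its own class than to random examples of the other class. Such a K supports learning by comparing average similarities to labeled samples, and nothing in the condition requires K to be positive semi-definite.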

Joint work with Avrim Blum.

Matrix Tile Analysis
Inmar Givoni, University of Toronto

Many tasks require finding groups of elements in a matrix of numbers, symbols or class likelihoods. One approach is to use efficient bi- or tri-linear factorization techniques including PCA, ICA, sparse matrix factorization and plaid analysis. These techniques are not appropriate when addition and multiplication of matrix elements are not sensibly defined. More directly, methods like bi-clustering can be used to classify matrix elements, but these methods make the overly restrictive assumption that the class of each element is a function of a row class and a column class. We introduce a general computational problem, "matrix tile analysis" (MTA), which consists of decomposing a matrix into a set of non-overlapping tiles, each of which is defined by a subset of usually nonadjacent rows and columns. MTA does not require an algebra for combining tiles, but must search over an exponential number of discrete combinations of tile assignments. We describe a loopy BP (sum-product) algorithm and an ICM algorithm for performing MTA. We compare the effectiveness of these methods to PCA and the plaid method on hundreds of randomly generated tasks. Using double-gene-knockout data, we show that MTA finds groups of interacting yeast genes that have biologically-related functions.

Joint work with Vincent Cheung and Brendan J. Frey.

Towards Bayesian Black Box Learning Systems
Jo-Anne Ting, University of Southern California

A long-standing dream of machine learning is to create black box learning systems that can operate autonomously in home, research and industrial applications. While it is well understood that a universal black box may not be possible, significant progress can be made in specific domains. In particular, we address learning problems in sensor-rich and data-rich environments, as provided by autonomous vehicles, surveillance systems, biological or robotic systems. In these scenarios, the input data has hundreds or thousands of dimensions and is used to make predictions (often in real-time), resulting in a learning system that learns to "understand" the environment.

The goal of machine learning in this domain is to devise algorithms that can efficiently deal with very high dimensional data, usually contaminated by noise, redundancy and irrelevant dimensions. These algorithms must learn nonlinear functions, potentially in an incremental and real-time fashion, for robust classification and regression. In order to achieve black box quality, manual tuning parameters (e.g. as in gradient descent or structure selection) need to be minimized or, ideally, avoided.

Bayesian inference, when combined with approximation methods to reduce computational complexity, suggests a promising route to achieve our goals, since it offers a principled way to eliminate open parameters. In past work, we have started to create a toolbox of methods to achieve our goal of black box learning. In (Ting et al., NIPS 2005), we introduced a Bayesian approach to linear regression. The novelty of this algorithm comes from a Bayesian and EM-like formulation of linear regression that robustly performs automatic feature detection in the inputs in a computationally efficient way. We applied this algorithm to the analysis of neuroscientific data (i.e. the problem of prediction of electromyographic (EMG) activity in the arm muscles of a monkey from spiking activity of neurons in the primary motor and premotor cortex). The algorithm achieves results that are faster by orders of magnitude and higher quality than previously applied methods.
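
To give a flavor of how a Bayesian treatment can prune irrelevant inputs automatically, the sketch below shows generic evidence-style (ARD) updates for linear regression with a separate precision on each weight. It is only illustrative and is not the algorithm of Ting et al. (NIPS 2005); hyperpriors, convergence criteria and numerical safeguards are omitted.

    import numpy as np

    def ard_linear_regression(X, y, n_iter=100, tol=1e-6):
        """EM-style updates for y = X w + noise with independent priors
        w_j ~ N(0, 1/alpha_j); a large alpha_j effectively switches input j off."""
        n, d = X.shape
        alpha = np.ones(d)   # per-weight prior precisions
        beta = 1.0           # noise precision
        for _ in range(n_iter):
            S = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)   # posterior covariance
            m = beta * S @ X.T @ y                               # posterior mean
            alpha_new = 1.0 / (m ** 2 + np.diag(S))              # relevance update
            resid = y - X @ m
            beta = n / (resid @ resid + np.trace(X @ S @ X.T))   # noise precision update
            converged = np.max(np.abs(alpha_new - alpha)) < tol
            alpha = alpha_new
            if converged:
                break
        return m, alpha, beta

Inputs whose alpha_j grows very large end up contributing essentially nothing to the prediction, which is the automatic feature detection effect described above.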

More recently, we introduced a variational Bayesian regression algorithm that is able to perform optimal prediction, given noise-contaminated input and output data (Ting, D'Souza & Schaal, ICML 2006). Traditional linear regression algorithms produce biased estimates when input noise is present and suffer numerically when the data contains irrelevant and/or redundant inputs. Our algorithm is able to effectively handle datasets with both characteristics. On a system identification task for a robot dynamics model, we achieved from 10 to 70% better results than traditional approaches.

Current work focuses on developing a Bayesian version of nonlinear function approximation with locally weighted regression. The challenge is to determine the size of the neighborhood of data that should contribute to the local regression model―a typical bias-variance trade-off problem. Preliminary results indicate that a full Bayesian treatment of this problem can achieve impressive robust function approximation performance without the need for tuning meta parameters. We are also interested in extending this locally linear Bayesian model to an online setting, in the spirit of dynamic Bayesian networks, to offer a parameter-free alternative to incremental learning.

Joint work with Aaron D'Souza, Stefan Schaal, Kenji Yamamoto, Toshinori Yoshioka, Donna Hoffman, Shinji Kakei, Lauren Sergio, John Kalaska, Mitsuo Kawato, Peter Strick, Michael Mistry, Jan Peters, and Jun Nakanishi.

This work will also be in Poster Session 1.

Efficient Bayesian Algorithms for Clustering
Katherine Ann Heller, Gatsby Unit, University College London

One of the most important goals of unsupervised learning is to discover meaningful clusters in data. There are many different types of clustering methods that are commonly used in machine learning including spectral, hierarchical, and mixture modeling. Our work takes a model-based Bayesian approach to defining a cluster and evaluates cluster membership in this paradigm. We use marginal likelihoods to compare different cluster models, and hence determine which data points belong to which clusters. If we have models with conjugate priors, these marginal likelihoods can be computed extremely efficiently.

Using this clustering framework in conjunction with non-parametric Bayesian methods, we have proposed a new way of performing hierarchical clustering. Our Bayesian Hierarchical Clustering (BHC) algorithm takes a more principled approach to the problem than the traditional algorithms (e.g. allowing for model comparisons and the prediction of new data points) without sacrificing efficiency. BHC can also be interpreted as performing approximate inference in Dirichlet Process Mixtures (DPMs), and provides a combinatorial lower bound on the marginal likelihood of a DPM.

We have also explored the task of "clustering on demand" for information retrieval. Given a query consisting of a few examples of some concept, we have proposed a method that returns other items belonging to the concept exemplified by the query. We do this by ranking all items using a Bayesian relevance criterion based on marginal likelihoods, and returning the items with the highest scores. In the case of binary data, all scores can be computed with a single matrix-vector product. We can also use this method as the basis for an image retrieval system. In our most recent work this framework has served as inspiration for a new approach to automated analogical reasoning.
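
For the binary-data case, the "single matrix-vector product" claim can be illustrated roughly as follows: under independent Bernoulli likelihoods with Beta(a, b) priors, the marginal-likelihood-ratio score of every item against the query is linear in the item's bits. The prior values and variable names below are placeholders, and the authors' exact scoring function may differ in details.

    import numpy as np

    def retrieval_scores(X, query_idx, a=2.0, b=2.0):
        """X: (n_items, d) binary matrix; query_idx: indices of the query items.
        Returns one relevance score per item in X."""
        Q = X[query_idx]                    # items exemplifying the concept
        N = Q.shape[0]
        a_post = a + Q.sum(axis=0)          # posterior Beta parameters given the query
        b_post = b + N - Q.sum(axis=0)
        # Per-feature log-odds contributions; the score is linear in each item's bits.
        q = np.log(a_post / a) - np.log(b_post / b)
        c = np.sum(np.log(b_post / b) - np.log((a + b + N) / (a + b)))
        return c + X @ q                    # one matrix-vector product scores every item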

Joint work with Zoubin Ghahramani and Ricardo Silva.

Hidden Process Models
Rebecca Hutchinson, Carnegie Mellon University

We introduce the Hidden Process Model (HPM), a probabilistic model for multivariate time series data. HPMs assume the data is generated by a system of partially observed, linearly additive processes that overlap in space and time. While we present a general formalism for any domain with similar modeling assumptions, HPMs are motivated by our interest in studying cognitive processes in the brain, given a time series of functional magnetic resonance imaging (fMRI) data. We use HPMs to model fMRI data by assuming there is an unobserved series of hidden, overlapping cognitive processes in the brain that probabilistically generate the observed fMRI time series.

Consider for example a study in which subjects in the scanner repeatedly view a picture and read a sentence and indicate whether the sentence correctly describes the picture. It is natural to think of the observed fMRI sequence as arising from a set of hidden cognitive processes in the subject’s brain, which we would like to track. To do this, we use HPMs to learn the probabilistic time series response signature for each type of cognitive process, and to estimate the onset time of each instantiated cognitive process occurring throughout the experiment.

There are significant challenges to this learning task in the fMRI domain. The first is that fMRI data is high dimensional and sparse. A typical fMRI dataset measures approximately 10,000 brain locations over 15-20 minutes (features), with only a few dozen trials (training examples). A second challenge is due to the nature of the fMRI signal: it is a highly noisy measurement of an indirect and temporally blurred neural correlate called the hemodynamic response. The hemodynamic response to a short burst of less than a second of neural activity lasts for 10-12 seconds. This temporal blurring in fMRI makes it problematic to model the time series as a first-order Markov process. In short, our problem is to learn the parameters and timing of potentially overlapping, partially observed responses to cognitive processes in the brain using many features and a small number of noisy training examples.

The modeling assumptions that HPMs make to deal with the challenges of the fMRI domain are: 1) the latent time series is modeled at the level of processes rather than individual time points; 2) processes are general descriptions that can be instantiated many times over the course of the time series; 3) we can use prior knowledge of the form “process instance X occurs somewhere inside the time interval [a, b].” HPMs could apply to any domain in which these assumptions are valid.
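
In rough form (the notation is mine, simplified from the description above), the generative assumption is that the signal at voxel v and time t is a linear superposition of process response signatures, shifted by their onset times, plus noise:

    Y(v, t) \;=\; \sum_{i} W_{\pi_i}\!\big(v,\, t - \tau_i\big) \;+\; \varepsilon(v, t), \qquad \varepsilon(v, t) \sim \mathcal{N}(0, \sigma^2),

where instance i has process type \pi_i and unknown onset time \tau_i constrained to a known interval [a_i, b_i], and W_p is the learned time-series response signature of process type p.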

HPMs address a key open question in fMRI analysis: how can one learn the response signatures of overlapping cognitive processes with unknown timing? There is no competing method to HPMs available in the fMRI community. In our ICML paper, we give the HPM formalism, inference and learning algorithms, and experimental results on real and synthetic fMRI datasets.

Joint work with Tom Mitchell and Indrayana Rustandi.

This work will also be in Poster Session 1.

Generalized statistical methods for fraud detection
Cecile Levasseur, University of California, San Diego

Many important risk assessment system applications depend on the ability to accurately detect the occurrence of key events given a large data set of observations. For example, this problem arises in drug discovery (“Do the molecular descriptors associated with known drugs suggest that a new, candidate drug will have low toxicity and high effectiveness?”) and in credit card fraud detection (“Given the data for a large set of credit card users, does the usage pattern of this particular card indicate that it might have been stolen?”). In many of these domains, little or no a priori knowledge exists regarding the true sources of any causal relationships that may occur between variables of interest. In these situations, meaningful information regarding the circumstances of the key events must be extracted from the data itself, a problem that can be viewed as an important application of data-driven pattern recognition or detection.

The problem of unsupervised data-driven detection or prediction is one of relating descriptors of a large unlabeled database of “objects” to measured properties of these objects, and then using these empirically determined relationships to infer or detect the properties of new objects. This work considers measured object properties that are nongaussian (and comprised of continuous and discrete data), very noisy, and highly nonlinearly related. Data comprised of measurements of such disparate properties are said to be hybrid or of mixed type. As a consequence, the resulting detection problem is very difficult. The difficulties are further compounded because the descriptor space is of high dimension. While many domains lack accurate labels in their database, others like credit card fraud exhibit tagged data. Therefore, the problem of supervised data-driven detection, one relating to a labelled database of objects, is also examined. In addition, by utilizing tagged data, a performance benchmark can be set, enabling meaningful comparisons of supervised and unsupervised approaches.

Statistical approaches to fraud detection are mostly based on modelling the data relying on their statistical properties and using this information to estimate whether a new object comes from the same distribution or not. The statistical modelling approach proposed here is a generalization and amalgamation of techniques from classical linear statistics (logistic regression, principal component analysis and generalized linear models) into a framework referred to as generalized linear statistics (GLS). It is based on the use of exponential family distributions to model the various types (continuous and discrete) of data measurements. A key aspect is that the natural parameter of the exponential family distributions is constrained to a lower dimensional subspace to model the belief that the intrinsic dimensionality of the data is smaller than the dimensionality of the observation space. The proposed constrained statistical modelling is a nonlinear methodology that exploits the split that occurs for exponential family distributions between the data space and the parameter space as soon as one leaves the domain of purely Gaussian random variables. Although the problem is nonlinear, it can be solved by using classical linear statistical tools applied to data that has been mapped into the parameter space that still has a natural, flat Euclidean structure. This approach provides an effective way to exploit tractably parameterized latent-variable exponential-family probability models for data-driven learning of model parameters and features, which in turn are useful for the development of effective fraud detection algorithms.
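
A compact way to write the constrained model described above (again in my notation): each mixed-type observation x is given an exponential-family density whose natural parameter is restricted to a low-dimensional affine subspace,

    p(x \mid \theta) \;=\; h(x)\, \exp\!\big( \theta^{\top} T(x) - G(\theta) \big), \qquad \theta \;=\; V z + b, \quad V \in \mathbb{R}^{d \times q}, \; q < d,

with, for example, Gaussian terms for continuous coordinates and Bernoulli or multinomial terms for discrete ones. Classical linear tools (PCA, Fisher discriminants) are then applied to the latent coordinates z, which live in the flat parameter space, rather than to the raw data.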

The fraud detection techniques proposed here are performed in the parameter space rather than in the data space as has been done in more classical approaches. In the case of a low level of contamination of the data by fraudulent points, a single lower dimensional subspace is learned by using the GLS based statistical modelling on a training set. Given a new data point, it is projected to its image on the lower dimensional subspace and fraud detection is performed by comparing its distance from the training set mean-image to a threshold. We present an example showing that there are domains for which classical linear techniques, such as principal component analysis, applied in the data space perform far from optimally compared to the proposed parameter-space techniques. For cases of data with roughly as many fraudulent as non-fraudulent points, an unsupervised approach to the linear Fisher discriminant is proposed. The GLS based framework enables unsupervised learning of a lower dimensional subspace in the parameter space that separates fraudulent from non-fraudulent data. Fraud detection is performed as in the previous case. In both cases, an ROC curve is generated to assess the performance of the proposed fraud detection methods.

Joint work with Kenneth Kreutz-Delgado and Uwe Mayer.

Kernels for the Predictive Regression of Physical, Chemical and Biological Properties of Small Molecules
Chloe-Agathe Azencott, University of California, Irvine

Small molecules, i.e. molecules composed of a couple of hundred atoms, play a fundamental role in biology, chemistry and pharmacology. Their uses range from the design of new drugs to a better understanding of biological systems; however, establishing their physical, chemical and biological properties through physical experimentation can be very costly. It is therefore essential to develop efficient computational methods to predict these properties.

Kernel methods, and among them support vector machines, appear particularly appropriate for chemical data, for they involve similarity measures that make it possible to embed the data in a high-dimensional feature space where linear methods can be used. Machine learning spectral kernels can be derived from various descriptions of the molecules; we study representations whose dimensionality ranges from 1 to 4, thus obtaining 1D, 2D, 2.5D, 3D and 4D kernels.

Using cross-validation and redundancy reduction techniques on various datasets of small and medium size from the literature, we test the kernels for the prediction of boiling points, melting points, aqueous solubility and octanol/water partition coefficient and compare them against state-of-the-art results.

Spectral kernels derived from the rich and reliable two-dimensional representation of the molecules outperform the other methods on most of the datasets. They seem to be the method of choice, given their simplicity, computational efficiency and prediction accuracy.

Efficient Exploration with Latent Structure
Bethany Leffler, Rutgers University

Developing robot control using a reinforcement-learning (RL) approach involves a number of technical challenges. In our work, we address the problem of learning an action model.

Classical RL approaches assume Markov decision process (MDP) environments, which do not support the critical idea of generalization between states. For an agent to learn the results of its actions for each state, it would have to visit each state and perform each action in that state at least once. In a robot setting, however, it is unrealistic to assume there will be sufficient time to learn about every state of the environment independently; so richer models of environmental dynamics are needed. Our technique for developing such a model is to assume that each state is not unique. In most environments, there will be states that have the same transition dynamics. By developing models where similar states have similar dynamics, it becomes possible for a learner to reuse its experience in one state to more quickly learn the dynamics of other parts of the environment. However, it also introduces an additional challenge―determining which states are similar.

To evaluate the viability of this approach, we constructed an experiment using a four-wheeled Lego Mindstorm robot as the agent. The state space consisted of discretized vehicle locations with a hidden variable of slope (flat or incline), which correlated directly with the action model. The agent had to learn which throttling action to perform in each state to maintain a target speed. In this scenario, the actions did not affect the transitions between states.

To determine similarity between states, the agent executed a selected action several times in each of the vehicle locations. The outcomes of these actions were used to hierarchically cluster the states. Once the states were clustered, the agent then started learning an action model for each state cluster. The advantage of this approach over one that learned a separate action model for each state is that information gathered in several different states can be pooled together. In common environments, there are many more states than state-types; therefore, learning based on clusters drastically reduces learning time. In fact, we were able to prove a worst-case learning time result that formalizes and validates this claim.
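
A toy sketch of the cluster-then-pool idea follows; it is not the authors' implementation, and the outcome summary, the distance threshold and the use of average-linkage clustering are assumptions made for illustration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_states(outcomes, threshold=1.0):
        """outcomes: (n_states, n_trials) array of measured effects of the probe action
        in each state. States with similar outcomes are assumed to share dynamics."""
        summary = np.column_stack([outcomes.mean(axis=1), outcomes.std(axis=1)])
        Z = linkage(summary, method='average')                   # hierarchical clustering
        return fcluster(Z, t=threshold, criterion='distance')    # cluster id per state

    def pooled_models(outcomes, labels):
        """One action model per cluster, estimated from the pooled data of its members."""
        return {c: outcomes[labels == c].ravel().mean() for c in np.unique(labels)}

Because experience is pooled within each cluster, the amount of data needed grows with the number of state types rather than the number of states, which is the effect the worst-case learning time result formalizes.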

If the environment does not have many similar states or if the clustering algorithm groups the states incorrectly, then the benefit of this approach will be minimized. Even in this worst case, however, it is important to note that this algorithm is no more costly than exploring each state individually.

Some limitations of this algorithm arise when states have semi-similar action models. For instance, if two states behave similarly when one action is performed, but not for all the actions, it is possible that the agent would learn incorrectly when following our proposed algorithm. In most robotic environments, however, using our algorithm will greatly reduce the time taken by the agent to determine its action model in all states, thereby increasing the efficiency of the robot.

Joint work with Michael L. Littman, Alexander L. Strehl, and Thomas Walsh.

Efficient Model Learning for Dialog Management
Finale Doshi, MIT

Intelligent planning algorithms such as the Partially Observable Markov Decision Process (POMDP) have succeeded in dialog management applications because of their robustness to the inherent uncertainty of human interaction. Like all dialog planning systems, however, POMDPs require an accurate model of the user (such as the user's different states and what the user might say). POMDPs are generally specified using a large probabilistic model with many parameters; these parameters are difficult to specify from domain knowledge, and gathering enough data to estimate the parameters accurately a priori is expensive.

In this paper, we take a Bayesian approach to learning the user model while simultaneously solving the dialog management problem. First we show that the policy that maximizes the expected reward is the solution of the POMDP taken with the expected values of the parameters. We update the parameter distributions after each test and incrementally update the previous POMDP solution. The update process has a relatively small computational cost, and we test various heuristics to focus computation in circumstances where it is most likely to improve the dialog. We are able to demonstrate a robust dialog manager that learns from interaction data, out-performing a hand-coded model in simulation and in a robotic wheelchair application.
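
A minimal sketch of the kind of Bayesian parameter update involved (illustrative only; the actual system maintains distributions over the full POMDP observation and transition models and incrementally updates the POMDP solution):

    import numpy as np

    class DirichletObservationModel:
        """Dirichlet posterior over p(observation | state, action)."""
        def __init__(self, n_states, n_actions, n_obs, prior=1.0):
            self.counts = np.full((n_states, n_actions, n_obs), prior)

        def update(self, state, action, obs):
            # One observed interaction adds one count.
            self.counts[state, action, obs] += 1

        def expected_model(self):
            # Planning uses the POMDP built from the expected parameter values.
            return self.counts / self.counts.sum(axis=-1, keepdims=True)

After each update, the dialog policy is recomputed (or incrementally adjusted) for the POMDP whose parameters are set to these expected values, matching the claim above that the reward-maximizing policy is the solution of the POMDP with the expected parameters.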

Joint work with Nicholas Roy.

Transfer in the context of Reinforcement Learning
Soumi Ray, University of Maryland, Baltimore County

We are investigating the problem of transferring knowledge learned in one domain to another related domain. Transfer of knowledge from simple domains to more complex domains can reduce the total training time in the complex domains. We are doing transfer in the context of reinforcement learning. In the past, knowledge transfer has been accomplished between domains with the same state and action spaces. Work has also been done where the state and action spaces of the two domains are different but a mapping has been provided by humans. We are trying to automate the mapping from the old domain to the new domain when the state and action spaces are different.

We have two domains D1 and D2, with corresponding state spaces S1 and S2 and action spaces A1 and A2, where |S1| = |S2| and |A1| = |A2|. Our goal is to transfer a policy learned in D1 to D2 so as to speed learning in D2. We first run Q-learning in D1 to produce Q-table Q1. Then we train for a limited time in D2 and generate Q2. The test bed we have used is a 16x16 grid world, with two domains defined over it and four actions: North, South, East and West. In the first domain we trained for 500 iterations and in the second domain we trained for 20 iterations. The two approaches that we have used are as follows.

Our goal is to find the mapping between the state spaces S1 and S2 and the action spaces A1 and A2. In the first approach, we compute the difference between matrices Q1 and Q2 and greedily find a mapping that minimizes this difference. With this mapping we can transfer the Q-values from the completely trained domain D1 to the partially trained domain D2 to speed up learning in D2. We find that it takes fewer steps to learn completely in the second domain when the Q-values are transferred than when learning from scratch. Our second approach finds the mapping that assigns the states with the highest Q-values in domain one to the states with the highest Q-values in domain two. This approach is an improvement over the first: it takes many fewer steps to learn in the second domain using transfer. We are also interested in finding the mapping when S1 and A1 are subsets of S2 and A2 respectively, i.e. |S1| < |S2| and |A1| < |A2|. This can be handled by allowing a single state/action in S1/A1 to map to multiple states/actions in S2/A2.
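
As a rough sketch of the second mapping approach (my code; states are paired by sorting on their best Q-values, and tie-breaking and the action mapping are ignored for brevity):

    import numpy as np

    def map_states_by_q(Q1, Q2):
        """Q1: fully trained Q-table (|S1| x |A|); Q2: partially trained table (|S2| x |A|)
        with |S1| == |S2|. Pair the highest-valued states of D2 with those of D1."""
        order1 = np.argsort(-Q1.max(axis=1))
        order2 = np.argsort(-Q2.max(axis=1))
        return dict(zip(order2, order1))        # state in D2 -> state in D1

    def transfer_q(Q1, Q2, mapping):
        """Initialize D2's Q-table from the mapped D1 values to speed further learning."""
        Q2_init = Q2.copy()
        for s2, s1 in mapping.items():
            Q2_init[s2] = Q1[s1]
        return Q2_init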

Joint work with Tim Oates.

This work will also be in Poster Session 2.

Spotlights (Session 1)

Correcting sample selection bias by unlabeled data
Jiayuan Huang, University of Waterloo

The default assumption in many learning scenarios is that training and test data are independently and identically drawn from the same distribution. When the distributions on the training and test sets do not match, we face the problem commonly referred to as sample selection bias or covariate shift. This problem occurs in many real-world applications, including surveys, sociology, biology and economics. It is not hard to see that, given a skewed selection of the training data, it is impossible to derive a good model that makes accurate predictions on the general target, as the training set might not be representative of the complete population from which the test set is drawn. The predictions are thus biased, potentially increasing the errors. Although there exists previous work addressing this problem, sample selection bias is typically ignored in standard estimation algorithms. In this work, we utilize the availability of unlabeled data to direct a sample selection de-biasing procedure for various learning methods. Unlike most previous algorithms, which try to first recover the sampling distributions and then make appropriate corrections based on the distribution estimate, our method infers the re-sampling weights directly by distribution matching between the training and test sets in feature space, in a non-parametric manner. We do not require the estimation of biased densities or selection probabilities, or any assumption that the probabilities of different classes are known. Our method works by matching distributions between training and test sets in a feature space that can handle high-dimensional data. Our experimental results on many benchmark datasets demonstrate that our method works well in practice. The method also shows good performance in tumor diagnosis using microarrays, so that it promises to be a valuable tool for cross-platform microarray classification.
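
The sketch below is a simplified version of distribution matching in feature space (generic kernel mean matching with box constraints only; the bandwidth, the weight bound B and the dropped equality constraint on the mean weight are simplifications relative to the full method):

    import numpy as np
    from scipy.optimize import minimize

    def rbf(A, B, gamma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def kmm_weights(X_train, X_test, gamma=1.0, B=10.0):
        """Re-sampling weights for the training points so that the weighted training mean
        in feature space matches the (unlabeled) test mean."""
        n_tr, n_te = len(X_train), len(X_test)
        K = rbf(X_train, X_train, gamma)
        kappa = rbf(X_train, X_test, gamma).sum(axis=1) * (n_tr / n_te)
        obj = lambda w: 0.5 * w @ K @ w - kappa @ w     # quadratic mean-matching objective
        grad = lambda w: K @ w - kappa
        res = minimize(obj, np.ones(n_tr), jac=grad,
                       bounds=[(0.0, B)] * n_tr, method='L-BFGS-B')
        return res.x                                    # per-example re-sampling weights

The returned weights re-weight the training loss of whatever learner follows, so that training mimics the test distribution without ever estimating the densities themselves.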

Joint work with Alex Smola, Arthur Gretton, Karsten Borgwardt, Bernhard Scholkopf.

Decision Tree Methods for Finding Reusable MDP Homomorphisms
Alicia Peregrin Wolfe, University of Massachusetts, Amherst

State abstraction is a useful tool for agents interacting with complex environments. Good state abstractions are compact, reusable, and easy to learn from sample data. This paper combines and extends two existing classes of state abstraction methods to achieve these criteria. The first class of methods searches for MDP homomorphisms (Ravindran 2004), which produce models of reward and transition probabilities in an abstract state space. The second class of methods, like the UTree algorithm (McCallum 1995), learns compact models of the value function quickly from sample data. Models based on MDP homomorphisms can easily be extended such that they are usable across tasks with similar reward functions. However, value-based methods like UTree cannot be extended in this fashion. We present results showing a new, combined algorithm that fulfills all three criteria: the resulting models are compact, can be learned quickly from sample data, and can be used across a class of reward functions.

Joint work with Andrew Barto.

Evaluating a Reputation-based Spam Classification System
Elena Zheleva, University of Maryland, College Park

Over the past several years, spam has been a growing problem for the Internet community. It interferes with valid e-mail and burdens both e-mail users and ISPs. While there are various successful automated e-mail filtering approaches that aim at reducing the amount of spam, there are still many challenges to overcome.

Reactive spam filtering approaches classify a piece of e-mail as spam if it has been reported as such by a large volume of e-mail users. Unfortunately, by the time the system responds by blocking the message or automatically placing it in future recipients' spam folders, the spam campaign has already affected a lot of users. The challenge that we consider is whether we can reduce the response time, recognizing a spam campaign at an earlier stage, thus reducing the cost that users and systems incur. Specifically, we are evaluating the predictive power of a reputation-based spam filtering system, which uses the feedback only from trustworthy e-mail users.

In a reputation-based or trust-based spam filtering system, the system identifies a set of users who report spam reliably and trusts their spam reports more than the spam reports of other users. A message coming into the system is classified as spam if enough reliable users report it. This automatic spam filtering approach is vulnerable to malicious users when any anonymous person can subscribe and unsubscribe to the e-mail service. This is the case with most free e-mail providers such as AOL, Hotmail and Yahoo. We show how to overcome this problem in this work.

There are two well-known open-source projects which operate in this framework: Vipul's Razor and Distributed Checksum Clearinghouse. Unfortunately, their reputation systems work only as a part of their commercially available software counterparts and, due to trade secrets, it is not clear how design characteristics such as the reputation definition and metrics affect system performance. More importantly, the spam reports they receive are mostly from authorized users (such as business partner company employees), which reduces the risk of abuse by anonymous users.

The effectiveness of a reputation-based spam filtering system is based on evaluating the following properties: 1) automatic maintenance of a reliable user set over time, 2) timely and accurate recognition of a spam campaign, and 3) having a set of guarantees on the system's vulnerability. In our work, we present results from simulating a reputation-based spam filtering system over a period of time. The evaluation dataset includes all the spam reports received during that period for a particular free e-mail provider. We show how our algorithms effectively reduce spam campaign response time while minimizing system vulnerability.

Joint work with Lise Getoor and Alek Kolcz.

Improving Robot Navigation Through Self-Supervised Online Learning
Ellie Lin, Carnegie Mellon University

In mobile robotics, there are often features that, while potentially powerful for improving navigation, prove difficult to profit from as they generalize poorly to novel situations. Overhead imagery data, for instance, has the potential to greatly enhance autonomous robot navigation in complex outdoor environments. In practice, reliable and effective automated interpretation of imagery from diverse terrain, environmental conditions, and sensor varieties proves challenging. Similarly, fixed techniques that successfully interpret on-board sensor data across many environments begin to fail past short ranges as the density and accuracy necessary for such computation quickly degrade and the features that can be computed from distant data are very domain-specific. We introduce an online, probabilistic model to effectively learn to use these scope-limited features by leveraging other features that, while perhaps otherwise more limited, generalize reliably. We apply our approach to provide an efficient, self-supervised learning method that accurately predicts traversal costs over large areas from overhead data. We present results from field-testing on-board a robot operating over large distances in off-road environments. Additionally, we show how our algorithm can be used offline with overhead data to produce a priori traversal cost maps and detect misalignments between overhead data and estimated vehicle positions. This approach can significantly improve the versatility of many unmanned ground vehicles by allowing them to traverse highly varied terrains with increased performance.

Joint work with B. Sofman, J. Bagnell, N. Vandapel and A. Stentz.

Spotlights (Session 2)

Simultaneous Team Assignment and Behavior Recognition from Spatio-temporal Agent Traces
Gita Sukthankar, Carnegie Mellon University

This research addresses the problem of activity recognition for physically embodied agent teams. We define team activity recognition as the process of identifying team behaviors from traces of agent positions over time; for many physical domains, military or athletic, coordinated team behaviors create distinctive spatio-temporal patterns that can be used to identify low-level action sequences. We focus on the novel problem of recovering agent-to-team assignments for complex team tasks where team composition, the mapping of agents into teams, changes over time. Without a priori knowledge of current team assignments, the behavior recognition problem is challenging since behaviors are characterized by the aggregate motion of the entire team and cannot generally be determined by observing the movements of a single agent in isolation.

To handle this problem, we introduce a new algorithm, Simultaneous Team Assignment and Behavior Recognition (STABR), which generates behavior annotations from spatio-temporal agent traces. STABR leverages information from the spatial relationships of the team members to create sets of potential team assignments at selected time-steps. These spatial relationships are efficiently discovered using a randomized search technique, RANSAC, to generate potential team assignment hypotheses. Sequences of team assignment hypotheses are evaluated using dynamic programming to derive a parsimonious explanation for the entire observed spatio-temporal trace. To prune the number of hypotheses, potential team assignments are fitted to a parameterized team behavior model; poorly fitting hypotheses are eliminated before the dynamic programming phase. The proposed approach is able to perform accurate team behavior recognition without exhaustive search over the partition set of potential team assignments, as demonstrated on several scenarios of simulated military maneuvers.

STABR does not simply assume that agents within a certain proximity should be assigned to the same team; instead it relies on matching static snapshots of agent positions against a database of team formation templates to produce a candidate pool of agent-to-team assignments. This candidate pool of assignments is verified by running a local spatio-temporal behavior detector. The intuition is that the aggregate agent movement for an incorrect team assignment will generally fail to match any behavior model. STABR significantly outperforms agglomerative clustering on the agent-to-team assignment problem for traces with dynamic agent composition (95% accuracy).

The scenarios presented here illustrate the operation of STABR in environments that lack the external cues used by other multi-agent plan recognition approaches, such as landmarks, cleanly clustered agent teams, and extensive domain knowledge. We believe that when such cues are available they can be directly incorporated into STABR, both to improve accuracy and to prune hypotheses. STABR provides a principled framework for reasoning about dynamic team assignments in spatial domains.

Joint work with Katia Sycara.

An Online Learning System for the Prediction of Electricity Distribution Feeder Failures
Hila Becker, Columbia University

We are using machine learning techniques to construct a failure-susceptibility ranking of the feeder cables that supply electricity to the boroughs of New York City. The electricity system is inherently dynamic, so our failure-susceptibility ranking system must be able to adapt to the latest conditions in real time, updating its ranking accordingly. The feeders have a significant failure rate, and many resources are devoted to monitoring, maintenance and repair of feeders. The ability to predict failures allows a shift from reactive to proactive maintenance, thus reducing costs.

The feature set for each feeder includes a mixture of static data (e.g. age and composition of each feeder section) and dynamic data (e.g. electrical load data for a feeder and its transformers). The values of the dynamic features are captured at the time of training and therefore lead to different models depending on the time and day at which each model is trained. Previously, a framework was designed to train models using a new variant of boosting called Martingale Boosting, as well as Support Vector Machines. However, in this framework, an engineer had to decide whether to use the most recent data to build a new model, or use the latest model instead for future predictions.

To avoid the need of human intervention, we have developed an “online” system that determines what model to use by monitoring past performance of previously trained models. In our new framework, we treat each batch-trained model as an expert, and use a measurement of its performance as the basis for reward or penalty of its quality score. We measure performance as a normalized average rank of failures. For example, in a ranking of 50 items with actual failures ranked #4 and #20, the performance is: 1 – (4 + 20) / (2*50) = 0.76.

Our approach builds on the notion of learning from expert advice as formulated in the continuous version of the Weighted Majority algorithm. Since each model is analogous to an expert and our system runs live thus gathering new data and generating new models, we have to keep adding new experts to the existing ensemble throughout the algorithm’s execution. To avoid having to monitor an ever-increasing set of experts, we drop poorly performing experts after each prediction. We had to address the following key issues in our solution: (1) how often and with what weight do we add new experts, and (2) what experts do we drop. Our simulations suggest that using the median of all current models’ weights for new models works best. To drop experts we use a combination of age of the model and past performance. Finally, to make predictions we use a weighted average of the top-scoring experts.

Our system is currently deployed and being tested by New York City’s electricity distribution company. Results are highly encouraging, with 75% of the failures in the summer of 2005 being ranked in the top 26%, and 75% of failures in 2006 being ranked in the top 36%.

Joint work with Marta Arias.

Classification of fMRI Images: An Approach Using Viola-Jones Features
Melissa K. Carroll, Princeton University

There has been growing interest in using Functional Magnetic Resonance Imaging (fMRI) for “mind reading,” particularly in applying machine learning methods to classifying fMRI brain images based on the subject’s instantaneous cognitive state. For instance, Haxby et al. (2001) perform fMRI scans while subjects are viewing images of one of seven classes of objects with the goal of discriminating the brain images based on the class of image being viewed at the time.

Most machine learning approaches used to date for fMRI classification have treated individual voxels as features and ignored the spatial correlation between voxels (Norman et al., 2006). We present a novel method for searching this feature space to generate features that capture spatial information, derived from the Viola and Jones (2001) algorithm for 2D object detection, and apply it to 2D representations of the images. In this method, features are computed corresponding to absolute and relative intensities over regions of varying size and shape, and used by AdaBoost (Schapire and Singer, 1999) to generate a classifier. Figure 1 (http://www.cs.princeton.edu/~mkc/wiml06/Figure1.jpg) shows examples of these features overlaid on an actual 2D representation of the 3D fMRI image. Mean intensities in white regions are subtracted from mean activities in gray regions to compute each feature, which are combined to form the feature vector. One-, two-, three- and four-rectangle features of all 100 size combinations between 1x1 and 10x10 are computed for all positions in the image.
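
To make the rectangle features concrete, here is a generic sketch of a two-rectangle Viola-Jones-style feature computed with an integral image; the positions and sizes are placeholders, and the exact layouts used in the paper are the ones shown in Figure 1.

    import numpy as np

    def integral_image(img):
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, r, c, h, w):
        """Sum of img[r:r+h, c:c+w] recovered from the integral image in O(1)."""
        total = ii[r + h - 1, c + w - 1]
        if r > 0:
            total -= ii[r - 1, c + w - 1]
        if c > 0:
            total -= ii[r + h - 1, c - 1]
        if r > 0 and c > 0:
            total += ii[r - 1, c - 1]
        return total

    def two_rect_feature(img, r, c, h, w):
        """Mean intensity of a 'white' rectangle minus the adjacent 'gray' rectangle."""
        ii = integral_image(img)
        white = rect_sum(ii, r, c, h, w) / (h * w)
        gray = rect_sum(ii, r, c + w, h, w) / (h * w)
        return white - gray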

As Figure 2 (http://www.cs.princeton.edu/~mkc/wiml06/Figure2.jpg) shows, including richer features than the standard one-pixel features can result in improved classification of the Haxby et al. dataset. One potential limitation of the method is that the large feature set it produces conflicts with computational limitations; however, figure 2 shows that even selecting a small random subset of the richer features can result in an increase in classification accuracy by 5% or more, although performance varies across subjects. In addition, the performance of this subset of features can be used to target subsequent feature selection. Future work needs to be performed to develop reliable and valid methods for rating feature importance.

Finally, Figure 3 (http://www.cs.princeton.edu/~mkc/wiml06/Figure3.jpg) shows that confusion among predicted classes occurs most often between classes that are most similar and for which previous classifiers have encountered difficulty, e.g. male faces and female faces. This target space similarity structure could be exploited in future work to improve classification.

1. J. V. Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, (293) 2425-2429.

2. K. A. Norman, S. M. Polyn, G. J. Detre and J. V. Haxby. (2006). Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, in press.

3. R.E. Schapire and Y. Singer. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3): 297-336.

4. P. Viola and M. Jones. (2001). Rapid object detection using a boosted cascade of simple features. CVPR 2001.

Joint work with Kenneth A. Norman, James V. Haxby and Robert E. Schapire.

Fast Online Classification with Support Vector Machines
Seyda Ertekin, Penn State University

In recent years, we have witnessed a significant increase in the amount of data in digital format, due to the widespread use of computers and advances in storage systems. As the volume of digital information increases, people need more effective tools to find, filter and manage these resources. Classification, the assignment of instances (i.e. pictures, text documents, emails, Web sites etc.) to one or more predefined categories based on their content, is an important component in many information organization and management tasks. Support Vector Machines (SVMs) are a popular machine learning method for classification problems due to their theoretical foundation and good generalization performance. However, SVMs have not yet seen widespread adoption in communities working with very large datasets because of the high computational cost of solving the quadratic programming (QP) problem in the training phase. This research presents an online SVM learning algorithm, LASVM, which matches the classification accuracy of state-of-the-art SVM solvers while requiring fewer computational resources: LASVM needs far less main memory and has a much faster training phase. We also show that not all examples in the training set are equally informative. We present methods to select the most informative examples and exploit them to reduce the computational requirements of the learning algorithm, drawing on properties of active learning algorithms to select informative examples efficiently from very large-scale training sets. We also show the benefits of using a non-convex loss function in SVMs for faster training and lower computational requirements.
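The idea of concentrating training effort on informative examples can be illustrated with a simple margin-based selection loop: examples closest to the current decision boundary are added first. This is a schematic sketch under stated assumptions (binary labels, a seed containing both classes, a generic RBF SVM), not the LASVM implementation itself.

```python
import numpy as np
from sklearn.svm import SVC

def margin_based_selection(X, y, n_seed=50, n_rounds=20, batch=20):
    """Iteratively add the pool examples closest to the current margin (sketch)."""
    rng = np.random.default_rng(0)
    # Assumes the random seed set happens to contain both classes.
    selected = set(rng.choice(len(X), size=n_seed, replace=False).tolist())
    for _ in range(n_rounds):
        clf = SVC(kernel="rbf", gamma="scale").fit(X[list(selected)], y[list(selected)])
        pool = np.array([i for i in range(len(X)) if i not in selected])
        if len(pool) == 0:
            break
        # Examples near the decision boundary carry the most information.
        dist = np.abs(clf.decision_function(X[pool]))
        selected.update(pool[np.argsort(dist)[:batch]].tolist())
    # Final model trained on the selected subset only.
    clf = SVC(kernel="rbf", gamma="scale").fit(X[list(selected)], y[list(selected)])
    return clf, sorted(selected)
```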

Joint work with Leon Bottou, Antoine Bordes and Jason Weston.


Posters (Session 1)

Using Decision Trees for Gabor-based Texture Classification of Tissues in Computed Tomography
Alia Bashir, DePaul University

This research is aimed at developing an automated imaging system for classification of tissues in CT images. Classification of tissues in CT scans using shape or gray level information is challenging due to the changing shape of organs in a stack of images and the gray level intensity overlap in soft tissues. However, healthy organs are expected to have a consistent texture within tissues across slices. Given a large enough set of normal-tissue images and a good set of texture features, machine learning techniques can be applied to create an automatic classifier. Previous work from one of the authors explored texture descriptors based on wavelets, ridgelets, and curvelets for the classification of tissues from normal chest and abdomen CT scans. These texture descriptors were able to classify tissues with an accuracy of 85-98%, with curvelet-based texture descriptors performing the best. In this paper we bridge the gap to perfect accuracy by focusing on texture features based on a bank of Gabor filters. The approach consists of three steps: convolution of the regions of interest with a bank of 32 Gabor filters (4 frequencies and 8 orientations), extraction of two Gabor texture features per filter (mean and standard deviation), and creation of a classifier that automatically identifies the various tissues. The data set consists of 2D DICOM images from five normal chest and abdomen CT studies from Northwestern Medical Hospital. The following regions of interest were segmented out and labeled by an expert radiologist: liver, spleen, kidney, aorta, trabecular bone, lung, muscle, IP fat, and SQ fat, for a total of 1112 images. For each image, the feature vector consists of the mean and standard deviation of the 32 filtered images, totaling 64 descriptors. The classification step is carried out using a Classification and Regression Tree (CART) classifier. A decision tree predicts the class of an object (tissue) from the values of predictor variables (texture descriptors) and generates a set of decision rules, which are then used for the classification of each region of interest. Both cross-validation and a random split of the data set into a training set (~65%) and a testing set (~35%) were applied, with no significant difference observed. The optimal tree had a depth of 20, with the minimum parent node size set at 10 and the minimum child node size set at 1. To evaluate the performance of each classifier, specificity, sensitivity, precision, and accuracy rates are calculated from each confusion matrix. Results show that this set of texture features is able to perfectly classify the 9 regions of interest. The Gabor filters’ ability to isolate features at different scales and directions allows for a multi-resolution analysis of texture, which is essential when dealing with the sometimes very subtle differences in the texture of tissues in CT scans. Given this strong performance in the classification of healthy tissues, we plan to apply Gabor texture features to the classification of abnormal tissues.
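A small sketch of the feature pipeline described above (4 frequencies x 8 orientations, mean and standard deviation per filtered image, then a decision tree). The kernel size, sigma and frequency values are illustrative assumptions, not the study's exact filter bank.

```python
import numpy as np
from scipy.signal import fftconvolve
from sklearn.tree import DecisionTreeClassifier

def gabor_kernel(frequency, theta, sigma=3.0, size=15):
    """Real-valued Gabor kernel at a given spatial frequency and orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * frequency * xr)

def gabor_features(roi, frequencies=(0.05, 0.1, 0.2, 0.4), n_orient=8):
    """Mean and std of each of the 32 filter responses -> 64 descriptors."""
    feats = []
    for f in frequencies:
        for k in range(n_orient):
            response = fftconvolve(roi, gabor_kernel(f, k * np.pi / n_orient), mode="same")
            feats.extend([response.mean(), response.std()])
    return np.array(feats)

# rois: list of 2D arrays (segmented regions of interest); labels: tissue classes.
# X = np.array([gabor_features(r) for r in rois])
# clf = DecisionTreeClassifier(min_samples_split=10, min_samples_leaf=1).fit(X, labels)
```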

Joint work with Julie Hasemann and Lucia Dettori.

VOGUE: A Novel Variable Order-Gap State Machine for Modeling Sequences
Bouchra Bouqata, Rensselaer Polytechnic Institute (RPI)

In this paper we present VOGUE, a new state machine that combines two separate techniques for modeling long-range dependencies in sequential data: data mining and data modeling. VOGUE relies on a novel Variable-Gap Sequence mining method (VGS) to mine frequent patterns with different lengths and gaps between elements. It then uses these mined sequences to build the state machine. We applied VOGUE to the task of protein sequence classification on real data from the PROSITE protein families. We show that VOGUE yields significantly better scores than higher-order Hidden Markov Models. Moreover, we show that VOGUE’s classification sensitivity outperforms that of HMMER, a state-of-the-art method for protein classification.

Joint work with Christopher Carothers, Boleslaw K. Szymanski and Mohammed J. Zaki.

GroZi: a Grocery Shopping Assistant for the Blind
Carolina Galleguillos, UC San Diego

Grocery shopping is a common activity that people all over the world perform on a regular basis. Unfortunately, grocery stores and supermarkets are still largely inaccessible to people with visual impairments, who are generally viewed as "high cost" customers. We propose to develop a computer-vision-based grocery shopping assistant, built around a handheld device with haptic feedback, that can detect different products inside a store, thereby increasing the autonomy of blind (or low vision) people in performing grocery shopping.

Our solution makes use of new computer vision techniques for the task of visual recognition of specific products inside a store, as specified in advance on a shopping list. These techniques can take advantage of complementary resources such as RFID, barcode scanning, and sighted guides. We also present a challenging new dataset of images consisting of different categories of grocery products that can be used for object recognition studies.

The use of the system consists of the creation of a shopping list followed by in-store navigation. In order to create a shopping list, we will develop a website accessible to visually impaired people that stores data and images of different products. The website will be augmented with new image templates from the community of users that shop with the device, in addition to images of the same product taken in different stores by different users. This will increase the system's ability to recognize products that change appearance for seasonal or promotional reasons. The navigational task includes finding the correct aisle for the products (based on text detection and character recognition), avoiding obstacles, finding products and checking out.

A typical grocery store carries around 30,000 items, so recognizing a single object is a nontrivial task. Assuming a shopping list is generally less than 1/1000th of this amount (i.e., fewer than 30 items), recognition can be constrained to two phases: detection of objects on a possibly cluttered shelf, and verification of the detected objects against the shopping list. For this task, we intend to use state-of-the-art object recognition algorithms and develop new approaches for fast identification.

Applications of Kernel Minimum Enclosing Ball
Cristina Garcia C., Universidad Central de Venezuela

The minimum enclosing ball (MEB) is a well-studied problem in computational geometry. In this work we describe a generalization of a simple approximate MEB construction, introduced by M. Badoiu and K. L. Clarkson, to a feature space MEB using the kernel trick. The simplicity of the methodology is itself surprising: the MEB algorithm is based only on geometric information extracted from a sample of data points, and just two parameters need to be tuned, the kernel constant and the tolerance on the radius of the approximation. The applicability of the method is demonstrated on anomaly detection and on less traditional scenarios such as 3D object modeling and path planning. Results are encouraging and show that even an approximate feature space MEB is able to induce topology-preserving mappings on arbitrary-dimensional noisy data as efficiently as other machine learning approaches.
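A sketch of the Badoiu-Clarkson iteration carried out in the kernel-induced feature space: the center is kept as a convex combination of mapped points, so squared distances need only kernel evaluations. The RBF kernel and parameter values in the commented example are illustrative assumptions.

```python
import numpy as np

def kernel_meb(K, eps=0.1):
    """Approximate minimum enclosing ball in feature space from a kernel matrix K.

    The center c = sum_i alpha_i phi(x_i) is represented by the weights alpha.
    """
    n = K.shape[0]
    alpha = np.zeros(n)
    alpha[0] = 1.0                          # start the center at the first point
    n_iter = int(np.ceil(1.0 / eps**2))     # number of refinement steps
    for t in range(1, n_iter + 1):
        # ||phi(x_j) - c||^2 = K_jj - 2 (K alpha)_j + alpha^T K alpha
        d2 = np.diag(K) - 2 * K @ alpha + alpha @ K @ alpha
        far = int(np.argmax(d2))
        # Move the center a 1/(t+1) step toward the farthest point.
        alpha *= 1.0 - 1.0 / (t + 1)
        alpha[far] += 1.0 / (t + 1)
    radius = np.sqrt(np.max(np.diag(K) - 2 * K @ alpha + alpha @ K @ alpha))
    return alpha, radius

# Example with an RBF kernel (illustrative parameters):
# X = np.random.rand(200, 3)
# sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
# K = np.exp(-sq / (2 * 0.5**2))
# alpha, r = kernel_meb(K, eps=0.05)
```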

Joint work with Jose Ali Moreno.

Classification With Cumular Trees
Claudia Henry, Antilles-Guyane

The accurate combination of decision trees and linear separators has been shown to provide some of the best off-the-shelf classifiers. We describe a new type of such combination, which we call Cumular (Cumulative Linear) Trees. Cumular Trees are midway between Oblique Decision Trees and Alternating Decision Trees: more expressive than the former, and simpler than the latter. We provide an induction algorithm for Cumular Trees which is, as we show, a boosting algorithm in the original sense. Experimental comparisons against AdaBoost, C4.5 and OC1 show very good results, especially when dealing with noisy data.

Joint work with Richard Nock and Franck Nielsen.

Transient Memory in Reinforcement Learning: Why Forgetting Can be Good for You
Anna Koop, University of Alberta

The vast majority of work in machine learning is concerned with algorithms that converge to a single solution. It is not clear that this is always the most appropriate aim. Consider a sailor adapting to the ship's motion. She may learn two conditional models: one for walking when at sea, and another for walking when on land. She may, when memory resources are limited, learn a best-on-average policy that settles on a compromise among all situations she has encountered. A more flexible approach might be to quickly adapt the walking policy to new situations, rather than seeking one final solution or set of solutions.

We explore two cases of transient memory. In the first case, the rate at which individual parameters change is controlled by meta-parameters. These meta-parameters allow the agent to ignore irrelevant or random features, to converge where features are consistent throughout its experience, and otherwise to adapt quickly to changes in the environment. This approach requires no commitment to the number of parameter sets necessary in a given environment, yet makes the best use of available resources.
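One common way such per-parameter meta-parameters are realized is Sutton's IDBD rule for a linear predictor, shown below purely as an illustration; it is an assumption that this (rather than some other meta-learning scheme) reflects the mechanism used in this work.

```python
import numpy as np

def idbd_update(w, beta, h, x, y, theta=0.01):
    """One IDBD step: each weight w[i] has its own log step size beta[i]."""
    delta = y - w @ x                              # prediction error
    beta += theta * delta * x * h                  # meta-learning of the step sizes
    alpha = np.exp(beta)                           # per-parameter learning rates
    w += alpha * delta * x                         # per-parameter LMS step
    h = h * np.clip(1 - alpha * x**2, 0, None) + alpha * delta * x
    return w, beta, h

# Irrelevant or randomly varying features end up with tiny step sizes, while
# features whose relevance keeps changing retain large rates and adapt quickly.
```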

In the second case, a single solution is stored in long-term parameters, but this solution is used only as the starting point for learning about a specific situation. This is currently being applied to the game of Go. At the beginning of a game, the agent's value function parameters are initialized according to the long-term memory. During the course of a game these parameters are updated by simulating, from each state, thousands of self-play games. The short-term parameters learned in this way are used both for action selection and as the starting point for learning on the next turn, after the opponent has moved. Actual game-play moves are used to update both the short- and long-term memory. At the end of the game, the short-term memory is forgotten and the value function parameters are initialized to the long-term values. This allows the agent to store general knowledge in long-term memory while adapting quickly to the specific situations encountered in the current game.


Predicting Task-Specific Webpages for Revisiting
A. Twinkle E. Lettkeman, Oregon State University

Most web browsers track the history of all pages visited, with the intuition that users are likely to want to return to pages that they have previously accessed. However, the history viewers in web browsers are ineffective for most users, because of the overwhelming glut of webpages that appear in the history. Not only does the history represent a potentially confusing interleaving of many of a user's different tasks, but it also includes many webpages that would provide minimal or no utility to the user if revisited. This paper reports on a technique used to dramatically reduce web browsing histories down to pages that are relevant to the user's current task context and have a high likelihood of being desirable to revisit. We briefly describe how the TaskTracer system maintains an awareness of a user's tasks and semi-automatically segments the web browsing history by task. We then present a technique that is used to predict whether webpages previously visited on a task will be of future value to the user and are worth displaying in the history user interface. Our approach uses a combination of heuristics and machine learning to evaluate the content of a page and interactions with the page to learn a predictive model of webpage relevance for each user task. We show the results of an empirical evaluation of this technique based on user data. This approach could be applied to systems that include tracking of webpage resources to predict future value of resources and to lower costs of finding and reusing webpages to the user. Our findings suggest that prediction of web pages is highly user- and task-specific, and that the choice of prediction algorithms is not obvious. In future work we aim to refine the features used to predict revisitability. We will analyze the effect of better text feature extraction in conjunction with user interest indicators such as reading time, scrolling behavior, and text selection. Preliminary analysis indicates that applying these refinements may increase the accuracy of our prediction models.

Joint work with Simone Stumpf, Jed Irvine and Jonathan Herlocker.

Hyper-parameters auto-setting using regularization path for SVM
Gaëlle Loosli, INSA de Rouen

In the context of classification tasks, Support Vector Machines are now very popular. However, their use by neophyte users is still hampered by the need to supply values for control parameters in order to get the best attainable results. Given clean data, SVM users must mainly make three choices: the type of kernel, its bandwidth and the regularization parameter. It would be convenient to provide users with a push-button SVM able to auto-set its parameters to the best possible values. This paper presents a new method that approaches this goal. Given the importance of this problem for reaping the full benefits of SVMs, many research works have been dedicated to helping set these parameters. Most rely either on outer measures, such as cross-validation, to guide the selection, or on measures embedded in the learning method itself. In place of empirical approaches to setting the control parameters, regularization paths have been proposed and widely studied in recent years, since they provide a smart and fast way to access all the optimal solutions of a problem across all compromises between bias and variance (for regression) or between bias and regularity (for classification). For instance, in the case of classification tasks, as studied in this paper, soft-margin SVMs deal with non-separable problems using slack variables that are parameterized by a slack trade-off (usually denoted C, the regularization parameter). Within the usual formulation of the soft-margin SVM, this trade-off takes values between 0 (random) and infinity (hard margin). The nu-SVM technique reformulates the SVM problem so that C is replaced by a nu parameter taking values in [0,1]. This normalized parameter has a more intuitive meaning: it represents the minimal proportion of support vectors in the solution and the maximal proportion of misclassified points.

However, having the whole regularization path is not enough: the end user still needs to retrieve from it the best values for the regularization parameters. Instead of selecting these values by k-fold cross-validation, leave-one-out, or other approximations, we propose to include the leave-one-out estimator inside the regularization path in order to have an estimate of the generalization error at each step. We explain why this is less expensive than selecting the best parameter a posteriori and give a method to stop learning before reaching the end of the path, to avoid useless effort. Contrary to what is usually done for regularization paths, our method does not start with all points as support vectors. By doing so, we avoid computing the whole Gram matrix at the first step. Then, since the proposed method stops on the path, this extreme non-sparse solution is never attained and thus the whole Gram matrix is never required. One of the main advantages of this is that the approach can be used for large databases.

The Influence of Ranker Quality on Rank Aggregation Algorithms
Brandeis Marshall, Rensselaer Polytechnic Institute

The rank aggregation problem has been studied extensively in recent years with a focus on how to combine several different rankers to obtain a consensus aggregate ranker. We study the rank aggregation problem from a different perspective: how the individual input rankers impact the performance of the aggregate ranker. We develop a general statistical framework based on a model of how the individual rankers depend on the ground truth ranker, within which the performance of different aggregation methods can be studied. The individual rankers, which are the inputs to the rank aggregation algorithm, are statistical perturbations of the ground truth ranker. Through rigorous experimental evaluation, we study how the noise level and misinformation of the input rankers affect the performance of the aggregate ranker. We introduce and study a novel Kendall-tau rank aggregator and a simple aggregator called PrOpt, which we compare to other well-known rank aggregation algorithms such as the average, median and Markov chain aggregators. Our results show that the relative performance of aggregators varies considerably depending on how the input rankers relate to the ground truth.
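A toy version of this experimental setup is sketched below: input rankers are Gaussian perturbations of a ground-truth score vector (an illustrative noise model, not the paper's exact framework), two simple aggregators (average rank and median rank) are formed, and each is scored against the ground truth with Kendall's tau.

```python
import numpy as np
from scipy.stats import kendalltau

def noisy_ranker(truth_scores, noise, rng):
    """Perturb ground-truth scores, then rank items (0 = best)."""
    scores = truth_scores + rng.normal(0, noise, size=len(truth_scores))
    return np.argsort(np.argsort(-scores))

rng = np.random.default_rng(0)
n_items, n_rankers, noise = 50, 7, 0.5
truth_scores = rng.normal(size=n_items)
truth_rank = np.argsort(np.argsort(-truth_scores))

rankers = np.array([noisy_ranker(truth_scores, noise, rng) for _ in range(n_rankers)])
mean_agg = np.argsort(np.argsort(rankers.mean(axis=0)))          # average-rank aggregator
median_agg = np.argsort(np.argsort(np.median(rankers, axis=0)))  # median-rank aggregator

print("mean  :", kendalltau(truth_rank, mean_agg)[0])
print("median:", kendalltau(truth_rank, median_agg)[0])
```

Sweeping the noise level (and biasing some rankers away from the truth) reproduces the kind of comparison the abstract describes.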

Joint work with Sibel Adali and Malik Magdon-Ismail.

Learning for Route Planning under Uncertainty
Evdokia Nikolova, Massachusetts Institute of Technology

We present new complexity results and efficient algorithms for optimal route planning in the presence of uncertainty. We employ a decision theoretic framework for defining the optimal route: for a given source S and destination T in the graph, we seek an ST-path of lowest expected cost where the edge travel times are random variables and the cost is a nonlinear function of total travel time. Although this is a natural model for route planning on real-world road networks, results are sparse due to the analytic difficulty of finding closed form expressions for the expected cost, as well as the computational/combinatorial difficulty of efficiently finding an optimal path, which minimizes the expected cost.

We identify a family of appropriate cost models and travel time distributions that are closed under convolution and physically valid. We obtain hardness results for routing problems with a given start time and cost functions with a global minimum, in a variety of deterministic and stochastic settings. In general the global cost is not separable into edge costs, precluding classic shortest-path approaches. However, using partial minimization techniques, we exhibit an efficient solution via dynamic programming with low polynomial complexity.

We then consider an important special case of the problem, in which the goal is to maximize the probability that the path length does not exceed a given threshold value (deadline). We give a surprising exact n^Θ(log n) algorithm for the case of normally distributed edge lengths, which is based on quasi-convex maximization. We then prove average and smoothed polynomial bounds for this algorithm, which also translate to average and smoothed bounds for the parametric shortest path problem, and extend to a more general non-convex optimization setting. We also consider a number of other edge length distributions, giving a range of exact and approximation schemes.
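For the deadline variant with independent, normally distributed edge lengths, the probability that a fixed path meets the deadline depends only on the path's total mean and variance, which is what makes the quasi-convex formulation possible. The sketch below evaluates that probability for two hypothetical S-T paths; the graph and numbers are illustrative, not data from the paper.

```python
from math import sqrt
from scipy.stats import norm

def on_time_probability(path_edges, deadline):
    """P(total length <= deadline) for independent Normal(mu, sigma^2) edges."""
    mu = sum(m for m, _ in path_edges)
    var = sum(v for _, v in path_edges)
    return norm.cdf((deadline - mu) / sqrt(var))

# Two candidate S-T paths as lists of (mean, variance) edge pairs:
safe  = [(10, 1), (10, 1), (10, 1)]    # longer on average, low variance
risky = [(9, 25), (9, 25), (9, 25)]    # shorter on average, high variance

for name, path in [("safe", safe), ("risky", risky)]:
    print(name, round(on_time_probability(path, deadline=31), 3))
```

With a deadline of 31 the low-variance path wins even though its expected length is larger, illustrating why the optimal path is not simply the one of minimum expected length.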

Our offline algorithms can be adapted to give online learning algorithms via the Kalai-Vempala approach of converting an offline optimization solution into an efficient online one.

Joint work with Matthew Brand, David Karger, Jonathan Kelner and Michael Mitzenmacher.

A Neurocomputational Model of Impaired Imitation
Biljana Petreska, Ecole Polytechnique Federale de Lausanne

This abstract addresses the question of human imitation through convergent evidence from neuroscience, using tools from machine learning. In particular, we consider a deficit in the imitation of meaningless gestures (i.e., hand postures relative to the head) following callosal brain lesion (i.e., disconnected hemispheres). We base our work on the rationale that looking at how imitation is impaired in apraxic patients can unveil its underlying neural principles. We ground the functional architecture and information flow of our model in brain imaging studies. Finally, findings from monkey neurophysiological studies drive the choice of implementation of our processing modules. Our neurocomputational model of visuo-motor imitation is based on self-organizing maps receiving sensory input (i.e., visual, tactile or proprioceptive) with associated activities [1]. We train the connections between the maps with anti-Hebbian learning to account for the transformations required to translate the observation of the visual stimulus to be imitated into the corresponding tactile and proprioceptive information that will guide the imitative gesture. Patterns of impairment of the model, realized by adding uncertainty in the transfer of information between the networks, reproduce the deficits found in a clinical examination of visuo-motor imitation of meaningless gestures [2]. The model makes hypotheses about the type of representation used and the neural mechanisms underlying human visuo-motor imitation. The model also helps us to better understand the occurrence and nature of imitation errors in patients with brain lesions.

[1] B. Petreska, and A.G. Billard. A Neurocomputational Model of an Imitation Deficit following Brain Lesion. In Proceedings of 16th International Conference on Artificial Neural Networks (ICANN 2006), Athens (Greece). To appear.

[2] G. Goldenberg, K. Laimgruber, and J. Hermsdörfer. Imitation of gestures by disconnected hemispheres. Neuropsychologia, 39:1432–1443, 2001.

Joint work with A. G. Billard.


Bayesian Estimation for Autonomous Object Manipulation Based on Tactile Sensors
Anya Petrovskaya, Stanford University

We consider the problem of autonomously estimating the position and orientation of an object from tactile data. When initial uncertainty is high, estimating all six parameters precisely is computationally expensive. We propose an efficient Bayesian approach that is able to estimate all six parameters in both unimodal and multimodal scenarios. The approach is termed Scaling Series sampling, as it estimates the solution region by samples, performing the search using a series of successive refinements that gradually scale the precision from low to high. Our approach can be applied to a wide range of manipulation tasks. We demonstrate its portability on two applications: (1) manipulating a box and (2) grasping a door handle.

Joint work with Oussama Khatib, Sebastian Thrun, Andrew Y. Ng.

Therapist Robot Behavior Adaptation for Post-stroke Rehabilitation Therapy
Adriana Tapus, University of Southern California

Research into Human-Robot Interaction (HRI) for socially assistive applications is in its infancy. Socially assistive robotics, which focuses on social rather than physical interaction between the robot and the human user, has the potential to enhance the quality of life for large populations of users. Post-stroke rehabilitation is one of the largest potential application domains, since stroke is a dominant cause of severe disability in the growing ageing population. In the US alone, over 750,000 people suffer a new stroke each year, with the majority sustaining some permanent loss of movement [Institute06]. This loss of function, termed "learned disuse", can improve with rehabilitation therapy during the critical post-stroke period. One of the most important elements of any rehabilitation program is carefully directed, well-focused and repetitive practice of exercises, which can be passive or active.

Our work focuses on hands-off therapist robots that assist, encourage, and socially interact with patients during their active exercises. Our previous research demonstrated, through real world experiments with stroke patients [Tapus06b, Eriksson05, Gockley06], that the physical embodiment (including shared physical context and physical movement of the robot), the encouragements, and the monitoring play key roles in patient compliance with rehabilitation exercises.

In the current work we investigate the role of the robot’s personality in the hands-off therapy process. We focus on the relationship between the level of extroversion/introversion (as defined in Eysenck Model of personality [Eysenck91]) of the robot and the user, addressing the following research questions: 1. How should we model the behavior and encouragement of the therapist robot as a function of the personality of the user and the number of exercises performed? 2. Is there a relationship between the extroversion-introversion personality spectrum based on the Eysenck model and the challenge based vs. nurturing style of patient encouragement?

To date, little research into human-robot personality matching has been performed. Some of our recent results showed a preference for personality matching between users and socially assistive robots [Tapus06a]. Our therapist robot behavior adaptation system monitors the number of exercises per minute performed by the human patient, indicating the level of engagement and/or fatigue, and changes the robot’s behavior in order to maximize this level. The socially assistive therapist robot (see Figure 1) is equipped with a basis set of behaviors that explicitly express its desires and intentions in a physical and verbal way that is observable to the user/patient. These behaviors involve the control of physical distance, gestural expression, and verbal expression (tone and content). The number of exercises per minute is therefore used as a reward that the system seeks to maximize.

Hands-off robot post-stroke rehabilitation therapy holds great promise of improving patient compliance in the recovery program. Our work aims toward developing and testing a model of compatibility between human and robot personality in the assistive context, based on the PEN theory of personality and toward building a customized therapy protocol. Examining and answering these issues will begin to address the role of assistive robot personality in enhancing patient compliance.

[Eriksson05] Eriksson, J., Matarić, M., J., and Winstein, C. "Hands-off assistive robotics for post-stroke arm rehabilitation", In Proceedings of the International Conference on Rehabilitation Robotics (ICORR-05), Chicago, Illinois, June 2005.

[Eysenck91] Eysenck, H., J. "Dimensions of personality: 16, 5 or 3? Criteria for a taxonomic paradigm", In Personality and individual differences, vol. 12, pp.773-790, 1991.

[Gockley06] Gockley, R., and Matarić, M., J. "Encouraging Physical Therapy Compliance with a Hands-Off Mobile Robot", In Proceedings of the First International Conference on Human Robot Interaction (HRI-06), Salt Lake City, Utah, March 2006.

[Institute06] "Post-Stroke Rehabilitation Fact Sheet", National Institute of Neurological Disorders and Stroke, January 2006.

[Tapus06a] Tapus, A. and Matarić, M., J. (2006) "User Personality Matching with Hands-Off Robot for Post-Stroke Rehabilitation Therapy", In Proceedings of the 10th International Symposium on Experimental Robotics (ISER), Rio de Janeiro, Brazil, July 2006.

[Tapus06b] Tapus, A. and Matarić, M., J. (2006) "Towards Socially Assistive Robotics", International Journal of the Robotics Society of Japan (JRSJ), 24(5), pp. 576- 578, July, 2006.

Joint work with Maja J. Matarić.

Learning How To Teach
Cynthia Taylor, University of California, San Diego

The goal of the RUBI project is to develop a social robot (RUBI) that can interact with children and teach them in an autonomous manner. As part of the project we are currently focusing on the problem of teaching 18-24 month old children skills targeted by the California Department of Education as appropriate for this age group.

In particular we are focusing on teaching the children to identify objects, shapes and colors. We have seven RFID-tagged stuffed toys, in the shapes of common objects like a slice of watermelon or a waffle. RUBI says the name of the object and shows a picture of it on her touch screen, and the children hand her a toy, which she identifies as correct or incorrect. She keeps track of the right and wrong answers for each toy.

RUBI has a touch screen on her stomach she can use to play short videos and play games with the children. By recording when the children touch her stomach, the screen also provides important information about whether or not the children are engaged. She has two Apple iSight cameras for eyes, and runs machine learning software that lets her detect both faces and smiles. The smile detection lets her gauge people’s moods during social interaction and respond accordingly. She has an RFID reader in her right hand, letting her identify RFID-tagged toys.

The machine learning aspect of this problem is how to use the information from her perceptual primitives so as to teach the materials in an effective manner. After each question/answer, RUBI has to decide whether to continue playing her current learning game or switch to another activity, and what question to ask next if she continues playing the game. She also has to decide what to do in situations where she asks a question and does not get an answer for a long period of time. Unlike many standard AI problems like chess, RUBI works in continuous time, with no discrete turns.

We are approaching the problem from the point of view of control theory. Exact solutions to the optimal teaching problem exist for some simple models of learning, such as the Atkinson and Bower learning model. We plan to find approximate solutions to this control problem using reinforcement learning methods. We will complement the formal and computational analysis with an ethnographic study of how human teachers teach children the same task. Our focus will be on understanding both the timing of their interventions and the sources of information they use to adapt their teaching strategies.

Joint work with Paul Ruvolo, Ian Fasel, Javier R. Movellan.

Strategies for improving face recognition in video using machine learning methods
Deborah Thomas, University of Notre Dame

Surveillance cameras are a common feature in many stores and public places, and there are many applications for face recognition from video streams in the area of law enforcement. However, while face recognition from high quality still images has been very successful, face recognition from video is a relatively new area and there is huge room for improvement. Furthermore, when using video as our data, we can exploit the fact that there are multiple frames to choose from to improve recognition performance. So, instead of representing subjects using a single high quality image, they can be represented using a set of frames chosen from a video clip. We want to select as many distinct frames for an individual as possible: this adds diversity to the training space, thereby improving the generalization capacity of the learned face recognition classifier.

In this work, we consider two different approaches. The commonality between the two approaches is Principal Component Analysis (PCA). Given the high dimensionality of the data, PCA is warranted, not only to reduce the dimensionality but also to construct decorrelated dimensions. In our first approach, we use a nearest neighbor algorithm with the Mahalanobis Cosine (MahCosine) distance measure. A pair of images in which the faces differ from each other in pose and expression will have a larger MahCosine distance between them, so we can use this distance as a measure of difference between frames. In the second approach, we project the images into PCA space and then use K-means clustering to group all the frames from one subject, picking one image per cluster to make up the representation set. Here again, images that are similar to each other fall into the same cluster, while more distinct images fall into different clusters. Beyond the difference between frames, we also incorporate a quality metric of the face when picking frames, which yields a higher recognition rate.
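The clustering variant can be sketched as follows: project one subject's frames into PCA space, cluster with K-means, and keep the frame nearest each centroid as that subject's representative set. The number of components and clusters below are illustrative assumptions, and the quality metric mentioned above is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def select_representative_frames(frames, n_components=20, n_clusters=5):
    """frames: (n_frames, height*width) array of one subject's face images.

    Assumes n_frames exceeds n_components and n_clusters.
    """
    proj = PCA(n_components=n_components).fit_transform(frames)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(proj)
    chosen = []
    for k in range(n_clusters):
        members = np.where(km.labels_ == k)[0]
        # Keep the member closest to the cluster centre as that cluster's exemplar.
        d = np.linalg.norm(proj[members] - km.cluster_centers_[k], axis=1)
        chosen.append(int(members[np.argmin(d)]))
    return sorted(chosen)
```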

We demonstrate our approach using two different datasets. First, we compare our approach to the approach used by Lee et al. in 2003 (Video-based Face Recognition Using Appearance Manifolds) and 2005 (Visual Tracking and Recognition using Probabilistic Appearance Manifolds). They use appearance manifolds to represent their subjects and use planes in PCA space for the different poses. We show that our approach performs better than their results from 2003 and comparably to their results from 2005 when using their dataset. We also compare it to using a single high quality image as the gallery representation of the subject. Finally, we demonstrate our approach using a dataset collected at the University of Notre Dame. This set contains data from multiple sensors, at varying quality, and is made up of data taken both indoors and outdoors. To our knowledge, it is also the largest such dataset. We show that our approach yields promising results on this difficult dataset as well.

Joint work with Kevin W. Bowyer, Patrick J. Flynn and Nitesh Chawla.

Machine learning systems for detecting driver drowsiness
Esra Vural, Sabanci University / University of California, San Diego

The advance of computing technology has provided the means for building intelligent vehicle systems, and drowsy driver detection is one of their potential applications. Previous approaches to fatigue detection have focused primarily on blink rate and eye closure. Moreover, chin rests are often employed, preventing analysis of information in head motion. Here we explore detection of driver fatigue by characterizing the dynamics of head motion as well as a number of internal facial movements including blinks, yawns, and other facial movements that may be associated with fatigue. Facial motion is measured from video using a fully automated facial expression analysis system based on the Facial Action Coding System (FACS) (Bartlett et al., in press). This system captures information about eyelid movement and yawning, and also enables exploration of other facial movements that may be associated with fatigue; for example, some subjects raise their eyebrows in an attempt to keep their eyes open. Information about head motion is extracted by an accelerometer, as well as from automatically detected locations of the eyes, nose, and mouth. Dynamical models, taking these measures as input, are employed to characterize the differences between alert and fatigued states. For example, drowsy states may be associated with characteristic head dynamics such as a slow decrease in pitch followed by a jerky upward movement. The system is trained on subjects experiencing fatigue in a driving task. In this task, subjects play a driving video game with a steering wheel. Distance to the center of the lane provides a ground-truth measure of driver performance. At random intervals, a wind effect is applied that drags the car to the right or left and the subject must correct the position of the car. The subject's response time to initiate the steering correction, as well as the time to return the car to the center of the lane, are employed as measures of driver alertness. Support vector machines will be employed to learn the relationship between the facial behavior variables and the measures of driver alertness. Lastly, Bayesian belief networks with a sliding window will be employed to estimate the alertness state of the driver during the video.

The goal of perceiving the driver's expressions and actions is to guide the car/driver system so as to improve safety. We hope this study will elucidate the relationship between facial behavior and fatigue, which is not fully understood at this time. The use of automated expression measurement systems enables exploration of facial dynamics that was previously intractable in studies of facial behavior.

Joint work with Mujdat Cetin, Aytul Ercil, Marian Stewart Bartlett and Javier Movellan.


Posters (Session 2)

Improving Associative Classifiers
Luiza Antonie, University of Alberta, Canada

Classification of objects into predefined classes is an important task in many applications. In our research we focus on associative classifiers, classification systems that use association rules in building their model. These systems discover patterns in the data that are associated with the predefined classes. Association rule-based classifiers have recently emerged as competitive classification systems; however, there are still deficiencies that hinder their performance. We investigate these issues and propose several solutions to overcome them. We study the performance of our classification model on real-life applications where the classes of interest are typically under-represented (e.g., mammography classification, text categorization, preterm birth prediction).

Joint work with Osmar R. Zaiane and Robert C. Holte.

Convex Optimization Techniques for Large-Scale Covariance Selection
Onureena Banerjee, University of California, Berkeley

We consider the problem of fitting a large-scale covariance matrix to multivariate Gaussian data in such a way that the inverse is sparse, thus providing model selection. Beginning with a dense empirical covariance matrix, we solve a maximum likelihood problem with an L1-norm penalty term added to encourage sparsity in the inverse. For models with tens of nodes, the resulting problem can be solved using standard interior-point algorithms for convex optimization, but these methods scale poorly with problem size. We present two new algorithms aimed at solving problems with a thousand nodes. The first, based on Nesterov's first-order algorithm, yields a rigorous complexity estimate for the problem, with a much better dependence on problem size than interior-point methods. Our second algorithm uses block coordinate descent, updating row/columns of the covariance matrix sequentially. Experiments with genomic data show that our method is able to uncover biologically interpretable connections among genes.
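For illustration, a closely related L1-penalized estimator is available off the shelf in scikit-learn. The snippet below is not the authors' Nesterov-based or block coordinate descent code, and the simulated data and penalty value are assumptions, but it optimizes the same kind of penalized Gaussian likelihood and returns a sparse inverse covariance.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Simulated data standing in for a (samples x variables) expression matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))

model = GraphicalLasso(alpha=0.2).fit(X)   # alpha controls sparsity of the inverse
precision = model.precision_               # estimated sparse inverse covariance
edges = np.count_nonzero(np.triu(precision, k=1))
print("nonzero off-diagonal entries (candidate interactions):", edges)
```

In the genomic setting described above, the nonzero off-diagonal entries of the estimated precision matrix are the candidate gene-gene connections.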

Joint work with Laurent El Ghaoui, Alexandre d'Aspremont and Georges Natsoulis.

American Sign Language Recognition in Game Development for Deaf Children
Helene Brashear, Georgia Institute of Technology

CopyCat is an American Sign Language (ASL) game, which uses gesture recognition technology to help young deaf children practice ASL skills. We describe a brief history of the game, an overview of recent user studies, and the results of recent work on the problem of continuous, user-independent sign language recognition in classroom settings. Our database of signing samples was collected from user studies of deaf children playing a Wizard of Oz version of the game at the Atlanta Area School for the Deaf (AASD). Our data set is characterized by disfluencies inherent in continuous signing, varied user characteristics including clothing and skin tones, and illumination changes in the classroom. The dataset consisted of 541 phrase samples and 1,959 individual sign samples of five children signing game phrases from a 22-word vocabulary.

Our recognition approach uses color histogram adaptation for robust hand segmentation and tracking. The children wear small colored gloves with wireless accelerometers mounted on the back of their wrists. The hand shape information is combined with accelerometer data and used to train hidden Markov models for recognition. We evaluated our approach using leave-one-out validation; this technique iterates through each child, training on data from four children and testing on the remaining child's data. We achieved average word accuracies per child ranging from 73.73% to 91.75% for the user-independent models.
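The leave-one-child-out protocol can be expressed with scikit-learn's grouped cross-validation, as sketched below. The classifier is a generic stand-in rather than the HMM recognizer used in the actual system, and the feature and label arrays are assumed inputs.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.ensemble import RandomForestClassifier

def per_child_accuracy(X, y, child_ids):
    """Train on four children, test on the fifth, for every child in turn."""
    accuracies = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=child_ids):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        held_out_child = child_ids[test_idx][0]
        accuracies[held_out_child] = clf.score(X[test_idx], y[test_idx])
    return accuracies

# X: per-sample feature vectors (e.g. hand-shape + accelerometer statistics),
# y: sign labels, child_ids: which of the five children produced each sample.
```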

Joint work with Kwang-Hyun Park, Seungyon Lee, Valerie Henderson, Harley Hamilton and Thad Starner.

Distributed Data Mining on Astronomy Catalogs
Haimonti Dutta, University of Maryland, Baltimore County

The design, implementation and archiving of very large sky surveys is playing an increasingly important role in today's astronomy research. However, these data archives will necessarily be geographically distributed. To fully exploit the potential of these data, we believe users should be offered a more communication-efficient alternative for multi-archive data analysis than first downloading the archives in full to a centralized site.

In this work, we propose a system, DEMAC, for the distributed mining of massive astronomical catalogs. The system is designed to sit on top of the existing national virtual observatory environment and provide tools for distributed data mining (as web services) without requiring datasets to be fully downloaded to a centralized server. To illustrate the potential effectiveness of our system, we develop communication-efficient distributed algorithms for principal component analysis (PCA) and outlier detection. Then, we carry out a case study using distributed PCA for detecting fundamental planes of astronomical parameters. In particular, PCA enables dimensionality reduction within a set of correlated physical parameters, such as a reduction of a 3-dimensional data distribution (in astronomer's observed units) to a planar data distribution (in fundamental physical units). Fundamental physical insights are thereby enabled through efficient access to distributed multi-dimensional data sets.

Joint work with Chris Giannella, Kirk Borne, Ran Wolff and Hillol Kargupta.

Cognitive Component Analysis
Ling Feng, Technical University of Denmark

Cognitive science has attracted a new level of attention from engineers during recent decades. One of the reasons is that people have begun to wonder whether the statistically optimal representations produced by a variety of machine learning methods are aligned with human cognitive activity. The evolution of the human perceptual and cognitive systems is a long-term adaptation process, an ongoing interaction between natural environments and natural selection. In the course of charting the family trees of various species, evolutionary biologists have sought to distinguish between "primitive" and "derived" features: primitive features group agents into alliances, while derived features tend to enlarge the differences among individuals within a group. Wagensberg linked this difference to the importance of independence for successful "life forms": a living individual is part of the world with some identity that tends to become independent of the uncertainty of the rest of the world. Wagensberg also points out that by creating alliances, agents can give up independence for the benefit of a group, which in turn may increase independence for the group as an entity.

Our independence hypothesis has been inspired by intriguing findings obtained with independent component analysis (ICA) algorithms. As a consequence of evolution, the human perceptual system can model complex multi-agent scenery. Humans' ability to use a broad spectrum of cues for analyzing perceptual input and identifying individual agents has been studied and, furthermore, simulated in computers. The resulting theoretically optimal representations achieved by a variety of ICA methods closely resemble representations found in human perceptual systems for visual contrast detection, for visual features involved in color and stereo processing, and for representations of sound features.

As a consequence, Cognitive Component Analysis (COCA) was first defined in 2005 as the process of unsupervised grouping of data such that the resulting group structure is well-aligned with that resulting from human cognitive activity. We investigated the independent cognitive component hypothesis, which asks: do humans also use these information-theoretically optimal 'ICA' methods in more generic and abstract data analysis? COCA has been applied to a broad range of topics to uncover low-level cognitive components. This evidence confirmed that ICA is relevant for representing semantic structure in text, social networks and musical features, and, more strikingly, for representing information embedded in speech signals, such as phoneme, gender, speaker identity, and even height.
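A minimal illustration of the kind of ICA decomposition referred to above, recovering independent sources from linear mixtures with FastICA; the signals here are synthetic and purely for demonstration, not the speech or text data used in this work.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t),                  # smooth oscillation
                np.sign(np.sin(3 * t)),         # square wave
                rng.laplace(size=t.size)]       # sparse, heavy-tailed signal
mixing = rng.normal(size=(3, 3))
observed = sources @ mixing.T                   # what the "sensors" record

ica = FastICA(n_components=3, random_state=0)
recovered = ica.fit_transform(observed)         # estimated independent components
print(recovered.shape)                          # (2000, 3)
```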

Humans transfer strategies from one perceptual domain to another and apply them to more or less distinct categories, such as grouping events or objects. In machine learning, the label structures found by unsupervised learning on some real-world data sets are consistent with the labels derived from human cognition. 'Ray' structures in latent-semantic-analysis-like plots are the key COCA phenomena. Evidence shows that the 'ray' structures discovered by COCA, understood as human cognitive labels, coincide with the labels of the samples in the relevant feature space. The fact that structures found by COCA are aligned with label structures highlights the possibility of using unlabeled data in supervised learning methods.

Joint work with Lars Kai Hansen.

Proto-transfer Learning in Markov Decision Processes Using Spectral Methods
Kimberly Ferguson, University of Massachusetts, Amherst

In this paper we introduce proto-transfer learning, a new framework for transfer learning. We explore solutions to transfer learning within reinforcement learning through the use of spectral methods. Proto-value functions (PVFs) are basis functions computed from a spectral analysis of random walks on the state space graph. They naturally lead to the ability to transfer knowledge and representation between related tasks or domains. We investigate task transfer by using the same PVFs in Markov decision processes (MDPs) with different reward functions. Additionally, our experiments in domain transfer explore applying the Nyström method for interpolation of PVFs between MDPs of different sizes.
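A sketch of how PVFs can be obtained: build the state-space graph of a small grid world, form its graph Laplacian, and take the smoothest eigenvectors as basis functions. The combinatorial Laplacian and the obstacle-free grid below are simplifying assumptions for illustration.

```python
import numpy as np

def grid_adjacency(n):
    """4-neighbour adjacency matrix of an n x n grid world (no obstacles)."""
    A = np.zeros((n * n, n * n))
    for r in range(n):
        for c in range(n):
            i = r * n + c
            for dr, dc in ((1, 0), (0, 1)):
                rr, cc = r + dr, c + dc
                if rr < n and cc < n:
                    j = rr * n + cc
                    A[i, j] = A[j, i] = 1
    return A

def proto_value_functions(A, k=5):
    """The k eigenvectors of the graph Laplacian with smallest eigenvalues."""
    L = np.diag(A.sum(axis=1)) - A            # combinatorial Laplacian D - A
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :k]                     # columns = basis functions over states

pvfs = proto_value_functions(grid_adjacency(10), k=5)
print(pvfs.shape)                             # (100, 5): 5 basis functions on 100 states
```

Because these basis functions depend only on the state-space graph and not on the reward, the same set can be reused across tasks that share the graph, which is the task-transfer setting described above.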

Joint work with Sridhar Mahadevan.

Continuous Typist Recognition Using Machine Learning
Kathryn Hempstalk, University of Waikato, New Zealand

Identifying oneself at login by means of a username/password pair (or perhaps some other identification scheme) is a common form of security for computer systems. However, once the user has logged in, computers do not continue to check whether the same person is using the computer. This is an easy verification task for a human, for changes of identity are easily noticed: simply glancing at the user gives enough information to confirm that he or she is the same as whoever logged in at the beginning of the session. For a computer, the task is much harder.

It would be attractive to verify the user on a continuous basis after they have logged in by monitoring characteristic patterns in their keyboard input―notably timing patterns. Once a sufficiently large sample has been collected, it can be compared against a database of known profiles. If the match between the current input and the logged-in user's profile fails to reach a sufficiently high threshold, the system automatically takes appropriate action―such as locking them out until they provide further identification information. If the match exceeds the threshold, no action is taken and the user continues work as normal.

It is difficult to devise suitable matching procedures, because any particular user's typing pattern fluctuates over time. Effects including stress, alertness, fatigue, mood, injury, tools (e.g. keyboard differences; holding a pen), time of day, distractions and the type of task being performed all affect both what a user types and how they type it. The samples used to build the user profile cannot be expected to cover every possibility, so it becomes the responsibility of the matching algorithm to ensure that each user's profile covers them broadly enough to tolerate typical variations, but tightly enough to exclude impostors.

Most research on typist verification involves using machine learning algorithms for the matching process. We focus on the re-implementation of two continuous typist recognition algorithms. The first algorithm uses LZ78 compression modified to perform prediction. It takes a sequence of key press and release events and their associated times and builds them into a phrase tree. An unknown sample walks the reference phrase tree, accumulating the log-likelihood as it goes. At the end of the walk, the log-likelihood is compared to a threshold value, and based on this comparison the sample is either accepted or rejected.
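A highly simplified sketch of the phrase-tree idea is shown below. It treats keystrokes as discrete symbols only and uses Laplace-smoothed visit counts for scoring; the published algorithm also models press/release timing, so this should be read as an illustration of the data structure rather than a reimplementation.

```python
import math
from collections import defaultdict

def build_phrase_tree(train_symbols):
    """LZ78-style parsing: grow a tree of phrases and count node visits."""
    children = defaultdict(dict)   # node_id -> {symbol: child_id}
    visits = defaultdict(int)      # node_id -> visit count
    next_id, node = 1, 0           # node 0 is the root
    for s in train_symbols:
        visits[node] += 1
        if s in children[node]:
            node = children[node][s]      # extend the current phrase
        else:
            children[node][s] = next_id   # new phrase ends here
            next_id += 1
            node = 0                      # restart at the root
    return children, visits

def log_likelihood(children, visits, symbols, alphabet_size):
    """Walk the tree with an unknown sample, accumulating smoothed log-probabilities."""
    ll, node = 0.0, 0
    for s in symbols:
        total = sum(visits[children[node][c]] for c in children[node])
        if s in children[node]:
            child = children[node][s]
            ll += math.log((visits[child] + 1) / (total + alphabet_size))
            node = child
        else:
            ll += math.log(1.0 / (total + alphabet_size))
            node = 0
    return ll
```

A sample would then be accepted if its log-likelihood under the claimed user's tree exceeds a tuned threshold.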

The second algorithm uses a nearest neighbor approach and holds a database of many users. An unknown sample is compared to reference samples from every known user, including the one being verified. If the sample is closest to the actual user, and is also close enough, the sample is accepted; otherwise, it is rejected.

These two algorithms have been reconstructed from the published description, re-implemented and then evaluated using the data from the original experiments. The process of reconstruction, the comparison of the two systems, and the results of the experiments provide fresh insight into difficulties associated with the problem of continuous typist recognition.

Optimally Predictive States
Kristina Klinkner, Carnegie Mellon University

The prediction of discrete sequential data is an important problem in many fields, including bioinformatics, neuroscience (spike trains), and nonlinear dynamics (symbolic dynamics). Existing prediction methods, with the exception of variable-length Markov model (VLMM) methods, make strong assumptions about the nature of the data-generating process. In this paper, we present an algorithm for the blind construction of asymptotically optimal nonlinear predictors of discrete sequences. These predictors take the form of minimal sufficient statistics, naturally arranged into a hidden Markov model (HMM). We thus secure the many desirable features of HMMs, and hidden-state models more generally, without having to make a priori assumptions about the architecture of the system. Furthermore, our method is strictly more powerful than those based on VLMMs. We also compare our approach to the use of cross-validation to select an HMM architecture, and find our results are at least comparable in terms of accuracy and parsimony, and superior in terms of speed. The source code and documentation for an implementation of this algorithm, Causal State Splitting Reconstruction (CSSR), are available at http://bactra.org/CSSR/. CSSR was published in UAI 2004, and has since been applied successfully to problems in anomaly detection, natural language processing, neuroscience and solid state physics.

Joint work with Cosma Shalizi.

Learning gene regulatory programs with machine learning
Xuejing Li, Columbia University

One central challenge in computational biology is the discovery of the transcriptional regulatory mechanisms underlying the expression of genes in a cell. In particular, many computational efforts have been made to infer gene regulatory networks from high-throughput genomic data. Here we present a novel predictive modeling approach to the study of regulatory networks through a machine learning algorithm, MEDUSA (Motif Element Discovery Using Sequence Agglomeration). MEDUSA integrates gene expression, regulatory sequence and transcription factor occupancy data to build a model of condition-specific transcriptional regulation logic. In addition, MEDUSA discovers binding site motifs in the regulatory sequences that are predictive of gene expression and therefore believed to be functional. MEDUSA is based on boosting, a statistical learning method, which enables the algorithm to search through the high-dimensional feature space of candidate regulators and motifs while avoiding overfitting. MEDUSA has been shown to achieve high prediction accuracy and to yield results consistent with biological knowledge when applied to experimental data from S. cerevisiae (yeast). Extensions of MEDUSA to C. elegans (worm) and Drosophila (fruit fly) are in progress.

KEA++: Semantically Enhanced Keyphrase Extraction Algorithm
Olena Medelyan, University of Waikato

Keyphrases are single words or multi-word lexemes that concisely and accurately describe the subject or an aspect of the subject discussed in a document. They are widely used in large document collections to organize material based on content, provide thematic access, represent search results, and assist with navigation. Keyphrase indexing means assigning keyphrases to documents, which is currently carried out manually. Professional human indexers read the full document and select appropriate descriptors freely or from a controlled vocabulary, according to defined cataloguing rules. This is a time-consuming and expensive process and impossible to perform on the vast number of electronic documents available nowadays.

Most existing approaches to automatic keyphrase indexing rely primarily on frequency-based analysis, which results in low-quality keyphrases. Other approaches are designed for small vocabularies and are not applicable to very large controlled vocabularies or to cases where training data is not available. We propose KEA++, a keyphrase extraction algorithm based on machine learning techniques. It requires dramatically less training data than other approaches and operates on vocabularies of practically any size. Significant improvements over frequency-based keyphrase extraction were achieved by taking into account semantic knowledge encoded in the controlled vocabulary. Further inclusion of semantically based techniques in the indexing process is currently being investigated to improve KEA++'s coverage of the distinct topics discussed in a document. We demonstrate how lexical chains can be applied for this purpose. These are sequences of semantically related words and phrases that span the document's content and represent its discourse structure. Although successfully applied to other areas of natural language processing, these methods have not yet been explored in automatic indexing.
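
To make the machine learning setup concrete, here is a toy sketch in the spirit of KEA-style keyphrase extraction, not the actual KEA++ code: candidate phrases receive two classic features (TF x IDF and relative position of first occurrence), and a Naive Bayes model trained on documents with known keyphrases scores new candidates. KEA++ additionally restricts candidates to a controlled vocabulary and adds semantic features; the corpus, documents, and gold keyphrases below are invented.

    import re
    import math
    from collections import Counter
    from sklearn.naive_bayes import GaussianNB

    def candidates(text, max_len=3):
        # All word n-grams up to length 3 serve as candidate keyphrases.
        words = re.findall(r"[a-z]+", text.lower())
        return [" ".join(words[i:i + n]) for n in range(1, max_len + 1)
                for i in range(len(words) - n + 1)]

    def features(doc, corpus):
        # Two classic KEA features: TF x IDF and relative first occurrence.
        lowered = doc.lower()
        cands = candidates(doc)
        tf = Counter(cands)
        feats = {}
        for c in set(cands):
            df = sum(c in d.lower() for d in corpus)
            tfidf = tf[c] / len(cands) * math.log(len(corpus) / (1 + df))
            first = lowered.find(c) / max(len(lowered), 1)
            feats[c] = [tfidf, first]
        return feats

    corpus = ["the cat sat on the mat", "dogs and cats play", "machine learning for text"]
    train_doc = "machine learning methods for text classification and text mining"
    train_feats = features(train_doc, corpus)
    gold = {"machine learning", "text mining"}          # invented gold keyphrases
    X = list(train_feats.values())
    y = [int(c in gold) for c in train_feats]

    model = GaussianNB().fit(X, y)
    new_feats = features("learning to extract keyphrases from text", corpus)
    ranked = sorted(new_feats, key=lambda c: model.predict_proba([new_feats[c]])[0][1],
                    reverse=True)
    print(ranked[:5])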

With advanced machine learning techniques and access to domain-specific and general semantic knowledge, we hope to show that computers are able to index better than humans. Better indexing does not imply better understanding; it is apparent that computer systems will never be able to compete with human performance in understanding meaning. We define "better" indexing by two criteria: quality and consistency. While indexing quality expresses how well a keyphrase set describes a given document, indexing consistency refers to the quality of indexing across the entire document collection. The more consistent the indexing of distinct documents describing the same topics, the higher the retrieval effectiveness on that collection. An experiment with six professionals who indexed the same ten documents confirmed that humans are highly inconsistent with one another. Direct comparison of KEA++'s current version on these documents reveals that the algorithm is only 11 percentage points less consistent with the human indexers than they are with one another. Indexing consistency is being further investigated by designing a larger experiment with human indexers and comparing the performance of the new semantically enhanced indexing method on the resulting data. To assess indexing quality, an evaluation with human judgments is planned. If both evaluations produce positive results, the hypothesis will be confirmed and manual indexing may become obsolete.
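
The abstract does not specify how inter-indexer consistency is computed; a simple set-overlap measure between two keyphrase sets is one common choice, sketched below purely for illustration (the keyphrase sets are invented).

    def consistency(a, b):
        # Dice-style overlap: 2|A ∩ B| / (|A| + |B|); 1.0 means identical sets.
        a, b = set(a), set(b)
        return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

    human_1 = {"machine learning", "indexing", "controlled vocabulary"}
    human_2 = {"keyphrase extraction", "indexing", "thesauri"}
    algorithm = {"indexing", "controlled vocabulary", "keyphrase extraction"}

    print(consistency(human_1, human_2))    # human vs. human
    print(consistency(algorithm, human_1))  # algorithm vs. human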

A Combined SVM / DBN Approach to Automatic SNP Detection
Sheila M Reynolds, University of Washington

We describe a new approach to finding and genotyping single-nucleotide polymorphisms (SNPs) in fluorescence-based sequence traces obtained from diploid DNA samples (i.e., samples containing both sets of chromosomes). This approach combines two methods commonly used in machine learning: support vector machines (SVMs) and dynamic Bayesian networks (DBNs). The combination of these two methods benefits from the generative, probabilistic nature of the DBN and the discriminative strengths of the SVM.

Several different SVMs are trained to classify each base position as follows: a) error vs. not-error; b) homozygous vs. heterozygous; and c) heterozygous vs. error. These three decisions represent the key challenges in detecting and genotyping SNPs in diploid traces.

When comparing an individual's DNA sequence to the GenBank reference sequence, any difference between the two is either an error or a polymorphism (i.e. a naturally occurring variation). However, errors in the sequencing process occur far more frequently than SNPs (approximately 1 error per 100 bases, as compared to 1 SNP per 1000 bases).

Rather than using the hard classification decision from each SVM, we use the distance from the hyperplane as a "soft" classification score. These soft scores are treated as a single continuous-valued, multi-dimensional observation in a Bayesian network, which learns a Gaussian mixture model of the class-conditional densities. The DBN models the generation of the individual's genotype given the reference, the generation of the observed bases given the fluorescent trace, and the relationship between technical or biological replicates as well as DNA samples from a population of individuals. Because of the small number of possible genotypes, it is computationally feasible to do maximum a posteriori (MAP) rather than maximum likelihood (ML) decoding, and the resulting probability of the most likely genotype given the evidence is a measure of the confidence we can assign to that genotype.
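
A much-simplified, single-position sketch of the idea follows: an SVM's signed distance to the hyperplane is kept as a soft score, a per-class Gaussian mixture models that score, and the class with the highest posterior is reported. The full system uses several SVMs and a DBN over the whole trace; the synthetic features, labels, and priors here are placeholders only.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Toy two-class data standing in for per-base trace features.
    X = np.vstack([rng.normal(-1, 1, (200, 4)), rng.normal(1, 1, (200, 4))])
    y = np.array([0] * 200 + [1] * 200)        # e.g., 0 = homozygous, 1 = heterozygous

    svm = SVC(kernel="rbf").fit(X, y)
    scores = svm.decision_function(X).reshape(-1, 1)   # "soft" scores, not hard labels

    # Class-conditional densities of the soft score, modeled as Gaussian mixtures.
    gmms = {c: GaussianMixture(n_components=2, random_state=0).fit(scores[y == c])
            for c in (0, 1)}
    log_priors = {0: np.log(0.999), 1: np.log(0.001)}  # SNP-like classes are rare

    def map_class(x_new):
        # MAP decoding: prior plus class-conditional log-likelihood of the soft score.
        s = svm.decision_function(x_new.reshape(1, -1)).reshape(-1, 1)
        post = {c: log_priors[c] + gmms[c].score(s) for c in gmms}
        return max(post, key=post.get)

    print(map_class(X[0]))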

This abstract describes work in progress. The most recent results available will be presented at the workshop.

Joint work with William S Noble and Jeff A Bilmes.

Action Selection in Bayesian Reinforcement Learning
Tao Wang, University of Alberta

My research attempts to address on-line action selection in reinforcement learning from a Bayesian perspective. The idea is to develop more effective action selection techniques by exploiting information in a Bayesian posterior, while also selecting actions by growing an adaptive, sparse lookahead tree. I further augment the approach by considering a new value function approximation strategy for the belief-state Markov decision processes induced by Bayesian learning.

Imagine a mobile vendor robot loaded with snacks and bustling around a building, learning where to visit to optimize its profit. The robot must choose wisely between selling snacks somewhere far away from its home or going back to its charger before its battery dies. How could a robot effectively learn to behave from its experience (previous sensations and actions) in such an environment? How could a robot learn to do something useful adaptively and independently instead of relying on detailed human guidance?

Reinforcement learning gives the robot an opportunity to improve its decision making by interacting with the world to obtain evaluative feedback on its actions. The goal of the robot is to maximize the total reward it obtains by taking actions in the environment. Normally, the environment is uncertain; therefore, the outcome of an action is non-deterministic. At each decision-making point the robot faces the challenging issue of how to choose its action while learning. There is a fundamental tradeoff: it could exploit its current knowledge to gain reward by taking actions known to give relatively high reward, or explore the environment to gain information by trying actions whose value is uncertain. This is the well-known problem of balancing exploitation with exploration in reinforcement learning, or more generally, the problem of action selection during reinforcement learning.

Interestingly, there is little consensus on the fundamental question of on-line action selection in reinforcement learning. In my work, I have been investigating a Bayesian approach to action selection in reinforcement learning. The idea I have been exploring is to exploit a Bayesian posterior to make intelligent action selection decisions by constructing and searching a sparse lookahead tree, originally inspired by the idea of sparse sampling. The outcome is a flexible, practical technique, "Bayesian sparse sampling", for improving action selection in episodic reinforcement learning problems.
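
For intuition, here is a toy sketch of the general flavor of posterior-guided sparse lookahead on a Bernoulli bandit (the simplest belief-state MDP), not the paper's algorithm: at each node a few parameters are sampled from the Beta posterior, imagined outcomes update the belief, and the recursion estimates each action's value with a small branching factor.

    import numpy as np

    def lookahead(posteriors, depth, n_samples, gamma, rng):
        # posteriors: list of (alpha, beta) Beta parameters, one per action.
        # Returns (estimated value, best action) under a sparse lookahead.
        if depth == 0:
            means = [a / (a + b) for a, b in posteriors]
            return max(means), int(np.argmax(means))
        best_val, best_act = -np.inf, 0
        for i, (a, b) in enumerate(posteriors):
            total = 0.0
            for _ in range(n_samples):
                theta = rng.beta(a, b)              # sample a parameter from the posterior
                r = rng.binomial(1, theta)          # imagined outcome of taking action i
                updated = list(posteriors)
                updated[i] = (a + r, b + 1 - r)     # belief-state transition
                v, _ = lookahead(updated, depth - 1, n_samples, gamma, rng)
                total += r + gamma * v
            q = total / n_samples
            if q > best_val:
                best_val, best_act = q, i
        return best_val, best_act

    rng = np.random.default_rng(0)
    beliefs = [(1, 1), (5, 2), (2, 5)]              # posteriors for three actions
    value, action = lookahead(beliefs, depth=2, n_samples=5, gamma=0.95, rng=rng)
    print(action, value)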

Recently, I have been working on a new approach to approximate planning in partially observable Markov decision processes (POMDPs) that is based on convex quadratic function approximation. We approximate the optimal value function by a convex upper bound composed of a fixed number of quadratics, and optimize it at each stage by semidefinite programming. I have shown that this approach can achieve approximation quality competitive with current techniques while still maintaining a bounded-size representation of the function approximator. Moreover, an upper bound on the optimal value function can be preserved if required. Overall, the technique requires computation time and space that is only linear in the number of iterations.
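
As a minimal illustration of the semidefinite-programming step, the sketch below fits a single convex quadratic upper bound to value estimates at sampled belief points, keeping the quadratic convex via a positive-semidefinite constraint. The actual method maintains a fixed set of quadratics and optimizes them stage by stage; the belief points and target values here are synthetic.

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    B = rng.dirichlet(np.ones(3), size=40)                 # sampled belief points
    alpha = np.array([[1.0, 0.2, 0.0], [0.0, 0.5, 1.0]])   # toy alpha-vectors
    v = np.max(B @ alpha.T, axis=1)                        # convex target values

    A = cp.Variable((3, 3), PSD=True)   # PSD keeps the quadratic term convex
    c = cp.Variable(3)
    d = cp.Variable()

    # q(b) = b' A b + c' b + d, constrained to upper-bound every sampled value.
    q = cp.hstack([cp.quad_form(b, A) + c @ b + d for b in B])
    problem = cp.Problem(cp.Minimize(cp.sum(q) - v.sum()), [q >= v])
    problem.solve()
    print(problem.value)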

Joint work with Pascal Poupart, Daniel Lizotte, Michael Bowling and Dale Schuurmans.

Wireless Sensing to Support the Diagnosis and Care of Children with Autism
Tracy Westeyn, Georgia Institute of Technology

We present a wireless on-body accelerometer system that provides continuous recognition of autistic self-stimulatory behaviors. This work is a pilot study for a larger exploratory research project investigating the use of ubiquitous sensors to assist in the diagnosis and care of children with autism. The goal of the present study is to provide a proof-of-concept system capable of collecting data from a child with autism and automatically providing indices into that data.

Autism is a developmental disorder affecting a child's social development and ability to communicate. Children with autism will often exhibit behaviors such as vocal stutters and brief bouts of vigorous activity (e.g., violently striking the back of the hands) to cope with everyday life. Depending on the child's level of functioning, these highly individualized, self-stimulatory (“stimming”) behaviors can be disruptive, socially awkward, or even harmful. Caregivers and researchers would like to explore the correlation between these stimming behaviors and environmental factors, behavioral treatments, mood, and other physiological markers.

To assist in this analysis, we aim to automate the recording and analysis of these behaviors. Although it is impractical for a researcher to monitor a given child continuously for episodes of stimming, an intelligent monitoring system could collect daily data from the child and filter it so that just the stimming episodes are highlighted. An automated data collection system may provide insight into a given child's mental and physiological state. It may also provide detailed, quantitative data of a kind that is currently rare in the field.

Our initial results indicate that an automatic indexing system for stimming activity is feasible. Our data set consists of acceleration data generated from a neurotypical adult mimicking autistic stimming behaviors while performing unscripted activities. Seven stimming behaviors and intermediary “non-stimming” activities are modeled using hidden Markov models (HMMs). We explored the performance of these models in both the isolated and continuous settings. The isolated HMM experiments assumed slight noise in data segmentation and achieved accuracy rates of 91.0 percent. In the continuous recognition experiments, exact segmentation of the stimming events was not possible due to minor insertion errors. These fragmentation errors (rapid alternation of classes at the boundaries) produced an overall system accuracy of 68.6 percent. However, we improved segmentation accuracy by using insertion penalties and smoothing during the model alignment process. We achieved a recall rate of 100 percent for the self-stimulatory events (with 92.9 percent precision, including identification of non-self-stimulatory activities).
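
A generic sketch of the isolated-recognition setup follows (not the authors' models or data): one Gaussian HMM per behavior class is trained on isolated accelerometer segments with hmmlearn, and a new segment is labeled by the model with the highest log-likelihood. The class names and synthetic signals are placeholders.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    rng = np.random.default_rng(0)

    def toy_segments(mean, n_seg=20, length=50, dim=3):
        # Synthetic 3-axis accelerometer segments drawn around a class-specific mean.
        return [mean + rng.normal(size=(length, dim)) for _ in range(n_seg)]

    train = {"hand_flap": toy_segments(2.0),
             "rocking": toy_segments(-2.0),
             "non_stimming": toy_segments(0.0)}

    models = {}
    for label, segs in train.items():
        X = np.vstack(segs)
        lengths = [len(s) for s in segs]
        model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)                 # one HMM per behavior class
        models[label] = model

    # Isolated recognition: pick the class whose HMM scores the segment highest.
    test_segment = toy_segments(2.0, n_seg=1)[0]
    prediction = max(models, key=lambda k: models[k].score(test_segment))
    print(prediction)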

Standard accuracy metrics do not always account for the impact that different error types have on applications. For example, a trade-off exists between identifying all instances of stimming and obtaining accurate event boundaries. While some researchers may want monitoring applications to ignore boundary errors, others may find them important. We discuss how Error Division Diagrams (EDDs), a recently introduced metric, can be used to help researchers visually compare the performance of recognition systems to select the system that best suits their needs. We also discuss the types of errors that can occur during continuous recognition and how EDDs can compare systems in terms of these errors.

Joint work with Kristin Vadas, Xuehai Bian, Thad Starner, and Gregory D. Abowd.

Applying CTBNs on Host-level Network Anomaly Detection
Jing Xu, University of California, Riverside

We present an unsupervised method for detecting anomalies (or attacks) in network traffic at the individual host level. We use continuous time Bayesian networks (CTBNs) to build a model of normal behavior. We can then use this model to flag connection patterns that differ from this norm.

Let M1 and M2 be two machines in the network, and let C12 and C21 be two random variables indicating the existence of network connections from M1 to M2 and vice versa. C12 and C21 are both binary variables, which take value 1 if the corresponding connection is on and value 0 otherwise. The training trajectories are composed of the starting and ending times of normal connections between M1 and M2. Similarly, the testing trajectories contain both normal and abnormal connections, and our task is to report the abnormal connections accurately.

During the training phase, we hypothesize that there exist hidden variables that influence the observed variables C12 and C21 as time evolves. We use CTBNs to model the hybrid network composed of hidden and observed variables, and learn both the structure and the parameters of the CTBN models from partially observed training trajectories using Expectation Maximization.

For testing, we compute the KL-divergence between the empirical distribution of the real data and the distribution defined by the learned CTBN. A difficulty arises in calculating the entropy of the empirical distribution, since a testing trajectory is only one sample from the real distribution. We approximate it by splitting it into two components: the entropy of leaving (Hleave) and of staying (Hstay) in some (combined) states. Hleave can be computed from the jump matrix S of the learned CTBN model; Hstay can be approximated by standard m-spacing entropy estimators. We use sliding windows of a fixed number of connections (or a fixed length of time) over the entire testing trajectories, and record the KL-divergence between the connections covered by each window and the learned CTBN model. The KL-divergence value for each individual testing connection is then the weighted average of the values of all the windows that cover that connection. By setting a threshold on the per-connection KL-divergence values, we can report normal and abnormal connections in the testing data.
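
The per-connection aggregation step can be sketched directly (the CTBN learning and per-window KL computation are assumed to have been done elsewhere, and uniform window weights are an assumption, since the abstract leaves the weights unspecified):

    import numpy as np

    def connection_scores(kl_per_window, window_size):
        # kl_per_window[i]: KL divergence of the window starting at connection i,
        # covering `window_size` consecutive connections. Each connection's score
        # is the (uniformly weighted) average over all windows that cover it.
        n_connections = len(kl_per_window) + window_size - 1
        totals = np.zeros(n_connections)
        counts = np.zeros(n_connections)
        for start, kl in enumerate(kl_per_window):
            totals[start:start + window_size] += kl
            counts[start:start + window_size] += 1
        return totals / counts

    # Hypothetical per-window KL values; flag connections above a chosen threshold.
    kl = np.array([0.1, 0.2, 3.0, 2.8, 0.2])
    scores = connection_scores(kl, window_size=3)
    anomalous = np.where(scores > 1.0)[0]
    print(scores, anomalous)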

This method does not require expensive labeling or prior exposure to the attack type. We do not use a feature-based approach, but rather model the timing of connections explicitly with CTBNs. Our system monitors only the connection timings, and yet provides competitive detection rates, as measured on the DARPA intrusion detection dataset.

Joint work with Christian R. Shelton.

Instance-level constrained clustering and its Applications
Hui Yang, Carnegie Mellon University

Instance-level constrained clustering is a semi-supervised process, which provides a flexible framework for incorporating constraints on document attributes, content structure, and self-defined preferences to guide the clustering process. A key component of this semi-supervised clustering algorithm is the use of instance-level constraints that are based on document attributes and the editing styles of near-duplicates. Three types of instance-level clustering constraints are used in our system: must-link (believed to be in the same class), cannot-link (believed to be in different classes), and family-link (possibly in the same class) constraints. Note that the instance-level constraints used in semi-supervised clustering are not the same as the labeled data used in classification or the partially labeled data used in semi-supervised classification; they are pair-wise constraints, which are not sufficient to generate class labels. Though they are weaker than class labels, they provide valuable guidance for the conventional unsupervised clustering process. Moreover, instance-level pair-wise constraints are much easier to generate than class labels, and the approach remains largely unsupervised. This work studies the impact of instance-level constraints on clustering performance, as well as its applications.
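
For readers unfamiliar with instance-level constraints, the sketch below shows a generic COP-KMeans-style clustering loop in which must-link and cannot-link pairs are enforced as hard constraints during assignment; it is not the authors' system, and their soft family-link constraints are omitted.

    import numpy as np

    def violates(i, c, labels, must, cannot):
        # True if assigning point i to cluster c breaks an already-assigned constraint.
        for a, b in must:
            other = b if a == i else a if b == i else None
            if other is not None and labels[other] not in (-1, c):
                return True
        for a, b in cannot:
            other = b if a == i else a if b == i else None
            if other is not None and labels[other] == c:
                return True
        return False

    def constrained_kmeans(X, k, must, cannot, n_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)].astype(float)
        labels = -np.ones(len(X), dtype=int)
        for _ in range(n_iter):
            labels[:] = -1
            for i, x in enumerate(X):
                order = np.argsort(np.linalg.norm(centers - x, axis=1))
                feasible = [c for c in order if not violates(i, c, labels, must, cannot)]
                labels[i] = feasible[0] if feasible else order[0]   # fall back if stuck
            for c in range(k):
                if np.any(labels == c):
                    centers[c] = X[labels == c].mean(axis=0)
        return labels

    data_rng = np.random.default_rng(1)
    X = np.vstack([data_rng.normal(m, 0.3, (20, 2)) for m in (0, 3, 6)])
    print(constrained_kmeans(X, k=3, must=[(0, 1)], cannot=[(0, 20)]))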

Joint work with Jamie Callan.

Identifying protein functional linkages using high order logic relationships
Xin Zhang, Arizona State University

Inferring protein functions from the sequencing of multiple genomes is a challenge in bioinformatics. Fully sequenced genomes from various species provide a large amount of information about the proteins encoded in each organism. The pattern describing the presence or absence of a protein across genomes can be determined by searching for its homologs across the organisms. The phylogenetic profile of a protein is a string of 0s and 1s of length N, where each position indicates the presence or absence of the protein in one of the N sequenced genomes. Proteins that function in the same pathway or structural complex are more likely to have similar profiles. Pellegrini et al. showed that proteins with similar profiles strongly tend to be functionally linked. Hence, the function of an uncharacterized protein can be predicted from the characterized proteins within the same cluster. Previous studies have inferred protein logic relationships using pairwise and triplet logic analysis on phylogenetic profiles. However, pairwise and triplet logic analysis may be limited in its ability to recover complex network structures.

In our research, as an extension of predicting pairwise and triplet logic relationships, we propose a general method for identifying high-order logic relationships among proteins, in which one protein can be predicted from three or more other proteins. Considering the properties of all possible logic functions, we define the notion of a proper function, which captures a precise representation of the logic relationship. We also present a formula to compute the number of proper functions given n (n ≥ 1) inputs (called predictors). As the number of predictor proteins increases, the number of logic functions grows exponentially, so it is not feasible to exhaustively search for the logic functions that best predict the proteins. Instead of considering all possible functions over the predictor proteins, we present a linear-time algorithm for finding the optimal prediction function that minimizes the prediction error. The complexity of the algorithm is O(n), where n is the length of the profile. We apply our method with quartet analysis among proteins in a public dataset, group E of the phylogenetic profiles. With this general framework for learning high-order logic relationships among proteins, we can infer the complex protein functional linkages that arise in cellular networks due to branching, parallel, and alternate pathways. Statistical analysis of the results shows that all of the discovered relationships have p-values of less than 1E-04, and 74.2% of them have p-values of less than 6E-05. We also list the top 10 most frequently observed logic functions in protein quartet analysis. These logic relationships reveal putative functional links among the proteins, which can be used to predict uncharacterized proteins from the logic functions of known ones and can further aid in assigning biological function to uncharacterized proteins.
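
The linear-time idea can be illustrated with a small sketch (a generic reconstruction, not the authors' code): for each combination of predictor values observed across the genomes, the error-minimizing Boolean function simply predicts the majority target value seen with that combination, which requires only one pass over the profile.

    from collections import Counter, defaultdict

    def best_logic_function(predictors, target):
        # One pass over the profile: tally target values per predictor combination,
        # then predict the majority value for each combination.
        votes = defaultdict(Counter)
        for i, t in enumerate(target):
            key = tuple(p[i] for p in predictors)
            votes[key][t] += 1
        table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
        errors = sum(sum(c.values()) - c[table[k]] for k, c in votes.items())
        return table, errors / len(target)

    # Toy example: target protein follows p1 AND (p2 OR p3) over 8 genomes.
    p1 = [1, 1, 0, 1, 0, 1, 1, 0]
    p2 = [1, 0, 1, 1, 0, 0, 1, 1]
    p3 = [0, 1, 0, 1, 1, 0, 1, 0]
    target = [a & (b | c) for a, b, c in zip(p1, p2, p3)]
    table, error_rate = best_logic_function([p1, p2, p3], target)
    print(table, error_rate)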

Joint work with Chitta Baral.
