Dynamic Bayesian Networks for Multimodal Interaction

Dynamic Bayesian Networks for Multimodal Interaction. Tony Jebara, Machine Learning Lab, Columbia University. Joint work with A. Howard and N. Gu.

Transcript of Dynamic Bayesian Networks for Multimodal Interaction

  • Dynamic Bayesian Networks for Multimodal Interaction
    Tony Jebara, Machine Learning Lab, Columbia University
    Joint work with A. Howard and N. Gu

  • Outline
    Introduction: Multi-Modal and Multi-Person
    Bayesian Networks and the Junction Tree Algorithm
    Maximum Likelihood and Expectation Maximization
    Dynamic Bayesian Networks (HMMs, Kalman Filters)
    Hidden ARMA Models
    Maximum Conditional Likelihood and Conditional EM
    Two-Person Visual Interaction (Gesture Games)
    Input-Output Hidden Markov Models
    Audio-Visual Interaction (Conversation)
    Intractable DBNs, Minimum Free Energy, Generalized EM
    Dynamical System Trees
    Multi-Person Visual Interaction (Football Plays)
    Haptic-Visual Modeling (Surgical Drills)
    Ongoing Directions

  • Introduction
    Simplest dynamical systems (single Markovian process): Hidden Markov Model and Kalman Filter
    But multi-modal data (audio, video and haptics) have:
    - Different time scale processes
    - Different amplitude scale processes
    - Different noise characteristics processes
    Also, multi-person data (multi-limb, two-person, group) are:
    - Weakly coupled
    - Conditionally dependent
    Dangerous to slam all time data into one single series
    Find new ways to zipper multiple interacting processes

  • Bayesian Networks
    Also called Graphical Models
    Marry graph theory & statistics
    Directed graph which efficiently encodes a large p(x1,...,xN) as a product of conditionals of each node given its parents
    Avoids storing a huge hypercube over all variables x1,...,xN
    Here, xi is discrete (multinomial) or continuous (Gaussian)
    Split BNs over sets of hidden XH and observed XV variables
    Three basic operations for BNs:
    1) Infer marginals/conditionals of hidden variables (JTA)
    2) Compute likelihood of data (JTA)
    3) Maximize likelihood of the data (EM)
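
The factorization into conditionals can be illustrated with a minimal sketch. The three-node chain A -> B -> C and all of its probability values are hypothetical, chosen only to show how the joint is assembled from per-node tables and how small those tables are next to the full hypercube.

```python
from itertools import product

# Hypothetical 3-node chain A -> B -> C over binary variables (values are made up).
p_a = {0: 0.6, 1: 0.4}                          # p(A)
p_b_given_a = {0: {0: 0.7, 1: 0.3},             # p(B | A), indexed [a][b]
               1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1},             # p(C | B), indexed [b][c]
               1: {0: 0.5, 1: 0.5}}


def joint(a, b, c):
    """p(a, b, c) as the product of each node's conditional given its parents."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


# The factored joint still sums to one over all 2**3 configurations.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))

# Marginal p(C = 1), obtained by summing out A and B.
p_c1 = sum(joint(a, b, 1) for a, b in product((0, 1), repeat=2))


def table_sizes(n, k=2):
    """Entries in a full joint table vs. a chain-structured BN of n k-ary nodes."""
    full = k ** n
    chain = k + (n - 1) * k * k
    return full, chain
```

For a chain of 20 binary nodes, `table_sizes(20)` compares about a million hypercube entries with 78 conditional-table entries.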

  • Bayes Nets to Junction Trees
    1) Bayes Net
    2) Moral Graph
    3) Triangulated
    4) Junction tree
    Workhorse of BNs is the Junction Tree Algorithm

  • Junction Tree Algorithm
    The JTA sends messages from cliques through separators (these are just tables or potential functions)
    Ensures that the various tables in the junction tree graph agree (are consistent) over shared variables (via marginals)
    If cliques V and W already agree over their shared variables, nothing needs to change
    Else: send message from V to W, then send message from W to V
    Then, the cliques agree

  • Junction Tree Algorithm
    On trees, JTA is guaranteed: 1) Init 2) Collect 3) Distribute

    Ends with potentials as marginals or conditionals of hidden variables given data: p(Xh1|Xv), p(Xh2|Xv), p(Xh1, Xh2|Xv)
    And the likelihood p(Xv) is the potential normalizer
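
The init, collect and distribute steps can be sketched on the smallest possible junction tree: two cliques V = {A, B} and W = {B, C} joined by the separator S = {B}. The chain, its probability values and the function name `run_jta` are hypothetical; this is an illustration of the message updates under those assumptions, not the code used in the talk.

```python
from itertools import product

VALS = (0, 1)

# Hypothetical chain A -> B -> C over binary variables (values are made up).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}


def run_jta(observed_c):
    """Two-clique junction tree: V = {A, B}, W = {B, C}, separator S = {B}."""
    # 1) Init: multiply each conditional into one clique; clamp evidence C = observed_c.
    psi_v = {(a, b): p_a[a] * p_b_given_a[a][b] for a, b in product(VALS, VALS)}
    psi_w = {(b, c): p_c_given_b[b][c] * (1.0 if c == observed_c else 0.0)
             for b, c in product(VALS, VALS)}
    phi_s = {b: 1.0 for b in VALS}

    # 2) Collect: message from V to W through the separator.
    new_s = {b: sum(psi_v[a, b] for a in VALS) for b in VALS}
    psi_w = {(b, c): psi_w[b, c] * new_s[b] / phi_s[b] for b, c in psi_w}
    phi_s = new_s

    # 3) Distribute: message from W back to V.
    # (Assumes the old separator entries are nonzero, which holds for this toy model.)
    new_s = {b: sum(psi_w[b, c] for c in VALS) for b in VALS}
    psi_v = {(a, b): psi_v[a, b] * new_s[b] / phi_s[b] for a, b in psi_v}
    phi_s = new_s
    return psi_v, psi_w, phi_s


psi_v, psi_w, phi_s = run_jta(observed_c=1)

# The normalizer of any clique potential is the likelihood of the evidence, p(C = 1).
likelihood = sum(psi_v.values())

# Normalizing a clique potential gives the posterior over its hidden variables.
posterior_ab = {key: value / likelihood for key, value in psi_v.items()}
```

After both messages, summing `psi_v` over A and summing `psi_w` over C give the same table over B, which is the agreement condition on the separator.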

  • Maximum Likelihood with EM
    We wish to maximize the likelihood over θ for learning:

    EM instead iteratively maximizes a lower bound on the log-likelihood:
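
In standard notation, the two statements on this slide can be written as follows. This is a sketch: θ denotes the model parameters, and the symbols l, L and q are assumptions about the slide's notation, with L taken to be the variational free energy so that -L is the lower bound.

```latex
% Maximum likelihood over the parameters, with hidden variables summed out
\[
\theta^{*} = \arg\max_{\theta}\; l(\theta),
\qquad
l(\theta) = \log p(X_V \mid \theta) = \log \sum_{X_H} p(X_H, X_V \mid \theta)
\]

% Lower bound obtained from Jensen's inequality for any distribution q over X_H
\[
l(\theta) \;\ge\; -L(q, \theta)
  = \sum_{X_H} q(X_H)\, \log \frac{p(X_H, X_V \mid \theta)}{q(X_H)}
\]

% EM alternates two minimizations of L
\[
\text{E-step: } q^{(t+1)} = \arg\min_{q} L(q, \theta^{(t)}) = p(X_H \mid X_V, \theta^{(t)}),
\qquad
\text{M-step: } \theta^{(t+1)} = \arg\min_{\theta} L(q^{(t+1)}, \theta)
\]
```

The bound is tight after the E-step, so each full iteration cannot decrease l(θ).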



  • Dynamic Bayes Nets
    Dynamic Bayesian Networks are BNs unrolled in time
    Simplest and most classical examples are:
    - Hidden Markov Model: State Transition Model, Emission Model
    - Linear Dynamical System: State Transition Model, Emission Model
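
The textbook forms of these two models are sketched below. The symbols (s_t for the hidden state, x_t for the observation, A, C, Q, R, μ_j, Σ_j for the parameters) and the choice of Gaussian emissions for the HMM are assumptions for illustration, not taken from the slide.

```latex
% Hidden Markov model: discrete state s_t in {1, ..., K}
\[
p(s_t = j \mid s_{t-1} = i) = A_{ij},
\qquad
p(x_t \mid s_t = j) = \mathcal{N}(x_t;\, \mu_j, \Sigma_j)
\]

% Linear dynamical system (Kalman filter model): continuous state s_t
\[
p(s_t \mid s_{t-1}) = \mathcal{N}(s_t;\, A\, s_{t-1},\, Q),
\qquad
p(x_t \mid s_t) = \mathcal{N}(x_t;\, C\, s_t,\, R)
\]

% Both unroll in time to the same factorization
\[
p(s_{0:T}, x_{0:T}) = p(s_0)\, p(x_0 \mid s_0) \prod_{t=1}^{T} p(s_t \mid s_{t-1})\, p(x_t \mid s_t)
\]
```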

  • Two-Person Interaction
    Interact with single user via p(y|x)
    Learn from two users to get p(y|x)
    Learn from two interacting people (person Y and person X) to mimic interaction via simulated person Y.

    One hidden Markov model for each user: no coupling!

    One time series for both users: too rigid!

  • DBN: Hidden ARMA Model
    Learn to imitate behavior by watching a teacher exhibit it.

    E.g. unsupervised observation of 2-agent interaction

    E.g. track lip motion

    Discover correlations between past action & subsequent reaction

    Estimate p(Y | past X, past Y)


  • DBN: Hidden ARMA Model
    Focus on predicting person Y from the past of both X and Y
    Have multiple linear models mapping the past to the future
    Use a window for the moving average (compressed with PCA)
    But, select among them using S (nonlinear)
    Here, we show only a 2nd order moving average to predict the next Y given the past two Ys, the past two Xs, the current X and a random choice of ARMA linear model
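
The second-order prediction described above can be sketched as a switch over linear models. Everything concrete here is hypothetical: the signals are scalars, the two weight vectors are hand-picked rather than learned, the switch is drawn from a fixed prior instead of being inferred, and the PCA compression of the window is omitted.

```python
import random


def features(y_hist, x_hist, x_now):
    """Stack the past two Ys, the past two Xs and the current X (scalars here)."""
    return [y_hist[-1], y_hist[-2], x_hist[-1], x_hist[-2], x_now]


def predict(models, s, y_hist, x_hist, x_now):
    """Linear prediction of the next Y under the model picked by the switch s."""
    weights, bias = models[s]
    f = features(y_hist, x_hist, x_now)
    return bias + sum(w * v for w, v in zip(weights, f))


def sample_next_y(models, prior, y_hist, x_hist, x_now, rng):
    """Draw the hidden switch S from its prior, then apply that linear model."""
    s = rng.choices(range(len(models)), weights=prior, k=1)[0]
    return s, predict(models, s, y_hist, x_hist, x_now)


def expected_next_y(models, prior, y_hist, x_hist, x_now):
    """Average the per-model predictions under the switch prior."""
    return sum(p * predict(models, s, y_hist, x_hist, x_now)
               for s, p in enumerate(prior))


# Two hand-picked (hypothetical) linear models:
#   model 0 mirrors the other person:     y_t = x_t
#   model 1 continues at constant speed:  y_t = 2*y_{t-1} - y_{t-2}
models = [([0.0, 0.0, 0.0, 0.0, 1.0], 0.0),
          ([2.0, -1.0, 0.0, 0.0, 0.0], 0.0)]
prior = [0.5, 0.5]

y_hist = [1.0, 2.0]      # y_{t-2}, y_{t-1}
x_hist = [0.5, 0.7]      # x_{t-2}, x_{t-1}
x_now = 0.9

pred_mirror = predict(models, 0, y_hist, x_hist, x_now)
pred_velocity = predict(models, 1, y_hist, x_hist, x_now)
expected = expected_next_y(models, prior, y_hist, x_hist, x_now)
switch, sampled = sample_next_y(models, prior, y_hist, x_hist, x_now, random.Random(0))
```

Each model is linear in the window, but the overall map from past to future is nonlinear because the switch chooses which model applies.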

  • Hidden ARMA Features:
    Model skin color as mixture of RGB Gaussians
    Track person as mixture of spatial Gaussians

    But, want to predict only Y from X: be discriminative
    Use maximum conditional likelihood (CEM)

  • Conditional EM
    EM: divide & conquer
    CEM: discriminative divide & conquer
    Only need a conditional? Then maximize conditional likelihood
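
The difference between the two objectives can be written out for a latent-variable model with hidden component m. This is a sketch in assumed notation (x_i inputs, y_i outputs, θ parameters); it states the objectives only and does not reproduce the CEM bounds themselves.

```latex
% Joint log-likelihood, the objective that EM maximizes
\[
l(\theta) = \sum_{i} \log p(x_i, y_i \mid \theta)
          = \sum_{i} \log \sum_{m} p(m, x_i, y_i \mid \theta)
\]

% Conditional log-likelihood, the objective that CEM maximizes
\[
l^{c}(\theta) = \sum_{i} \log p(y_i \mid x_i, \theta)
  = \sum_{i} \Big[ \log \sum_{m} p(m, x_i, y_i \mid \theta)
                 - \log \sum_{m} p(m, x_i \mid \theta) \Big]
\]
```

The second, negated term is why the usual EM lower bound does not apply directly: Jensen's inequality bounds a log-sum from below, whereas the subtracted log-sum has to be bounded from above.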

  • Conditional EM
    CEM vs. EM on p(c|x,y): CEM accuracy = 100%, EM accuracy = 51%
    [Figure panels: EM p(y|x); CEM p(y|x)]

  • Conditional EM for hidden ARMA
    2 users gesture to each other for a few minutes
    Model: Mix of 25 Gaussians, STM: T=120, Dims=22+15
    Estimate prediction discriminatively/conditionally: p(future|past)
    - Nearest Neighbor: 1.57% RMS
    - Constant Velocity: 0.85% RMS
    - Hidden ARMA: 0.64% RMS

  • Hidden ARMA on Gesture
    [Figure panels: SCARE, WAVE, CLAP]

  • DBN: Input-Output HMM
    Similarly, learn a person's response to audio-video stimuli to predict Y (or agent A) from X (or world W)
    Wearable collects audio & video for A, W:
    - Sony Picturebook Laptop
    - 2 Cameras (7 Hz) (USB & Analog)
    - 2 Microphones (USB & Analog)
    - 100 Megs per hour (10$/Gig)

  • DBN: Input-Output HMM
    Consider simulating the agent given the world
    A hidden Markov model on its own is insufficient since it does not distinguish between the input role the world has and the output we need to generate
    Instead, form an input-output HMM
    One IOHMM predicts the agent's audio using all 3 past channels
    One IOHMM predicts the agent's video
    Use CEM to learn the IOHMM discriminatively

  • Input-Output HMM Data
    Video
    - Histogram lighting correction
    - RGB Mixture of Gaussians to detect skin
    - Face: 2000 pixels at 7Hz (X,Y,Intensity)
    Audio
    - Hamming Window, FFT, Equalization
    - Spectrograms at 60Hz
    - 200 bands (Amplitude, Frequency)
    Very noisy data set!
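
The audio front end (Hamming window followed by an FFT per frame) can be sketched as below. The frame length, hop size and test tone are hypothetical, a naive DFT stands in for the FFT to keep the example dependency-free, and the equalization step is omitted.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Magnitudes of the first n//2 + 1 DFT bins (naive O(n^2) DFT for clarity)."""
    n = len(frame)
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(frame)))
            for k in range(n // 2 + 1)]


def spectrogram(signal, frame_len, hop):
    """One windowed magnitude spectrum per frame; rows are time, columns are frequency."""
    window = hamming(frame_len)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        rows.append(magnitude_spectrum([w * v for w, v in zip(window, chunk)]))
    return rows


# Hypothetical test tone: 8 cycles per 64-sample frame, so energy should peak in bin 8.
tone = [math.sin(2 * math.pi * 8 * i / 64) for i in range(256)]
spec = spectrogram(tone, frame_len=64, hop=32)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of `spec` is one time slice; in a real pipeline the rows would then be reduced to a fixed number of bands before modeling.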

  • Video Representation
    - Principal Components Analysis
    - linear vectors in Euclidean space
    - Images, spectrograms, time series vectors.
    - Vectorization is bad, nonlinear

    - Images = collections of (X,Y,I) tuples (pixels)
    - Spectrograms = collections of (A,F) tuples
    therefore...
    - Corresponded Principal Components Analysis

    M are soft permutation matrices

  • Video Representation
    [Figure panels: Original, PCA, CPCA]
    2000 XYI pixels: compress to 20 dims

  • Input-Output HMM
    For agent and world:
    - 1 Loudness scalar
    - 20 Spectro Coeffs
    - 20 Face Coeffs
    Estimate hidden trellis from partial data

  • Input-Output HMM with CEM
    Conditionally model:
    - p(Agent Audio | World Audio, World Video)
    - p(Agent Video | World Audio, World Video)
    Don't care how well we can model world audio and video
    Just as long as we can map it to agent audio or agent video
    Avoids temporal scale problems too (Video 5Hz, Audio 60 Hz)
    Audio IOHMM:
    - CEM: 60-state 82-Dim HMM
    - Diagonal Gaussian Emissions
    - 90,000 Samples Train / 36,000 Test

  • Input-Output HMM with CEM
    TRAINING & TESTING
    - Audio: EM 99.61, CEM 100.58
    - Video: EM -122.46, CEM -121.26
    [Figure: joint likelihood and conditional likelihood; EM (red), CEM (blue)]

    RESYNTHESIS
    - Spectrograms from eigenspace
    - KD-Tree on video coefficients to closest image in training (point-cloud too confusing)

  • Input-Output HMM Results
    [Figure panels: Train, Test]

  • Intractable Dynamic Bayes Nets
    Factorial Hidden Markov Model: Interaction Through Output
    Coupled Hidden Markov Model: Interaction Through Hidden States

  • Intractable DBNs: Generalized EM
    As before, we use a bound on the likelihood:

    But the best q over hidden variables that minimizes KL is intractable!
    Thus, restrict q to only explore factorized distributions
    EM still converges under partial E steps & partial M steps
    [Figure: q(z); -L(q, θ) and l(θ) plotted against θ]
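
The bound and the restriction on q can be written out as follows. This is a sketch in assumed notation: z denotes the hidden variables, x the observed data, θ the parameters, and L is taken to be the variational free energy so that -L is the lower bound on l(θ).

```latex
% Bound on the log-likelihood, valid for any distribution q over the hidden variables
\[
l(\theta) = \log p(x \mid \theta) \;\ge\; -L(q, \theta)
  = \sum_{z} q(z)\, \log \frac{p(z, x \mid \theta)}{q(z)}
\]

% The gap is a KL divergence, so the unrestricted optimum is the exact posterior
\[
l(\theta) + L(q, \theta) = \mathrm{KL}\big( q(z) \,\|\, p(z \mid x, \theta) \big) \;\ge\; 0
\]

% Restricting q to factorized distributions over groups of hidden variables z_c
\[
q(z) = \prod_{c} q_c(z_c)
\]

% Partial steps: any update that does not increase L keeps the bound monotone
\[
L(q^{(t+1)}, \theta^{(t)}) \le L(q^{(t)}, \theta^{(t)}),
\qquad
L(q^{(t+1)}, \theta^{(t+1)}) \le L(q^{(t+1)}, \theta^{(t)})
\]
```

With a restricted q the gap no longer closes to zero, so the procedure improves the bound rather than the exact likelihood.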

  • Intractable DBNs: Variational EM
    Factorial Hidden Markov Model, Coupled Hidden Markov Model
    Now, the q distributions are limited to be chains
    Tractable as an iterative method
    Also known as variational EM or structured mean-field

  • Dynamical System Trees
    How to handle more people and a hierarchy of coupling?
    DSTs consider coupling university staff: students -> department -> school -> university

    Interaction Through Aggregated Community State
    Internal nodes are states. Leaf nodes are emissions.
    Any subtree is also a DST.
    [Figure: DST unrolled over 2 time steps]

  • Dynamical System Trees
    Also apply a generalization of EM and do variational structured mean field for the q distribution.
    Becomes formulaic for any DST topology!
    Code available at http://www.cs.columbia.edu/~jebara/dst

  • DSTs and Generalized EM
    Structured Mean Field:
    Use tractable distribution Q to approximate P
    Introduce variational parameters
    Find Min KL(Q||P)
    [Figure: alternating "Introduce v.p." and "Inference" steps]

  • DSTs for American Football
    [Figure panels: Initial frame of a typical play; Trajectories of players]

  • DSTs for American Football
    ~20 time series of two types of plays (wham and digs)
    Likelihood ratio of models used as classifier
    DST1 puts all players into 1 game state
    DST2 combines players into two teams and then into game
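
Classifying by likelihood ratio can be sketched with any sequence model that returns a log-likelihood. Here a small discrete HMM scored by the forward algorithm stands in for the DSTs, which is an assumption made for brevity; the two toy models, their parameters and the function names are hypothetical.

```python
import math


def hmm_log_likelihood(obs, start, trans, emit):
    """Forward algorithm with per-step rescaling; returns log p(obs) for a discrete HMM."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [emit[j][obs[t]] * sum(alpha[i] * trans[i][j] for i in range(n_states))
                     for j in range(n_states)]
        # Rescale to avoid underflow; assumes the sequence has nonzero probability.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def classify(obs, model_a, model_b):
    """Return 'a' if the log-likelihood ratio favors model_a, else 'b', plus the ratio."""
    ratio = hmm_log_likelihood(obs, *model_a) - hmm_log_likelihood(obs, *model_b)
    return ("a" if ratio > 0 else "b"), ratio


# Two hypothetical class models sharing emissions but differing in dynamics.
emit = [[0.9, 0.1], [0.1, 0.9]]
sticky = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], emit)   # states tend to persist
flippy = ([0.5, 0.5], [[0.1, 0.9], [0.9, 0.1]], emit)   # states tend to alternate

label_1, ratio_1 = classify([0, 0, 0, 0, 1, 1, 1, 1], sticky, flippy)
label_2, ratio_2 = classify([0, 1, 0, 1, 0, 1, 0, 1], sticky, flippy)
```

One model is trained per class, and a test sequence is assigned to whichever model gives it the higher likelihood; the sign of the log ratio is the decision.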

  • DSTs for Gene Networks
    Time series of cell cycle
    Hundreds of gene expression levels over time
    Use given hierarchical clustering
    DST with hierarchical clustering structure gives best test likelihood

  • Robotic Surgery, Haptics & Video
    Davinci Laparoscopic Robot
    Used in hundreds of hospitals
    Surgeon works on console
    Robot mimics movement on (local) patient
    Captures all actuator/robot data as 300Hz time series
    Multi-Channel Video of cameras inside patient

  • Robotic Surgery, Haptics & Video

  • Robotic Surgery, Haptics & Video
    Suturing: Expert vs. Novice
    64 Dimensional Time Series @ 300 Hz
    Console and Actuator Parameters

  • Robotic Surgical Drills Results
    Compress haptic & video data with PCA to 60 dims.
    Collected data from novices and experts and built several DBNs (IOHMMs, DSTs, etc.) of expert and novice for 3 different drills (6 models total).
    Preliminary results:

    Drills: Minefield, Russian Roulette, Suture

  • Conclusion
    Dynamic Bayesian networks are a natural upgrade to HMMs.
    Relevant for structured, multi-modal and multi