# Dynamic Bayesian Networks for Multimodal Interaction

Date posted: 19-Jan-2016


### Transcript of Dynamic Bayesian Networks for Multimodal Interaction

Dynamic Bayesian Networks for Multimodal Interaction
Tony Jebara, Machine Learning Lab, Columbia University
Joint work with A. Howard and N. Gu

Outline

- Introduction: Multi-Modal and Multi-Person
- Bayesian Networks and the Junction Tree Algorithm
- Maximum Likelihood and Expectation Maximization
- Dynamic Bayesian Networks (HMMs, Kalman Filters)
- Hidden ARMA Models
- Maximum Conditional Likelihood and Conditional EM
- Two-Person Visual Interaction (Gesture Games)
- Input-Output Hidden Markov Models
- Audio-Visual Interaction (Conversation)
- Intractable DBNs, Minimum Free Energy, Generalized EM
- Dynamical System Trees
- Multi-Person Visual Interaction (Football Plays)
- Haptic-Visual Modeling (Surgical Drills)
- Ongoing Directions

Introduction

- The simplest dynamical systems model a single Markovian process: the Hidden Markov Model and the Kalman Filter.
- But multi-modal data (audio, video and haptics) have processes with different time scales, different amplitude scales, and different noise characteristics.
- Also, multi-person data (multi-limb, two-person, group) are weakly coupled and conditionally dependent.
- It is dangerous to slam all the data into one single time series: instead, find new ways to zipper multiple interacting processes together.

Bayesian Networks

- Also called graphical models; they marry graph theory & statistics.
- A directed graph efficiently encodes a large p(x1,...,xN) as a product of conditionals of each node given its parents.
- This avoids storing a huge hypercube over all variables x1,...,xN.
- Here, each xi is discrete (multinomial) or continuous (Gaussian).
- Split BNs over sets of hidden XH and observed XV variables.
- Three basic operations for BNs: 1) infer marginals/conditionals of the hidden variables (JTA); 2) compute the likelihood of the data (JTA); 3) maximize the likelihood of the data (EM).
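As a minimal sketch of this factorization (hypothetical numbers, a three-node chain A → B → C with binary variables), the joint can be reassembled from the stored conditionals:

```python
import numpy as np

# A toy sketch of the factorization (hypothetical numbers): a three-node
# chain A -> B -> C with binary variables. The network stores only p(A),
# p(B|A) and p(C|B) instead of the full 2^3 joint table.
pA = np.array([0.6, 0.4])                    # p(A)
pB_A = np.array([[0.7, 0.3], [0.2, 0.8]])    # p(B|A), rows indexed by A
pC_B = np.array([[0.9, 0.1], [0.5, 0.5]])    # p(C|B), rows indexed by B

# Reassemble the joint: p(A,B,C) = p(A) p(B|A) p(C|B)
joint = pA[:, None, None] * pB_A[:, :, None] * pC_B[None, :, :]

print(joint.sum())             # sums to 1: a valid distribution
print(joint.sum(axis=(0, 2)))  # marginal p(B) by summing out A and C
```

Here three small tables (2 + 4 + 4 numbers) replace the 2^3-entry hypercube; the saving grows exponentially with the number of variables.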

Bayes Nets to Junction Trees
The workhorse of BNs is the Junction Tree Algorithm (JTA). The conversion proceeds: 1) Bayes net, 2) moral graph, 3) triangulated graph, 4) junction tree.

Junction Tree Algorithm
The JTA sends messages between cliques through separators (these are just tables, or potential functions): if two cliques already agree, nothing is sent; else a message goes from clique V to W and then from W back to V, after which the cliques agree. This ensures the various tables in the junction tree graph are consistent over their shared variables (via marginals).
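In standard notation (clique potentials φ_V and φ_W with separator potential φ_S; this is the textbook update, not anything specific to these slides), the message from V to W is:

```latex
\phi_S^{*} = \sum_{V \setminus S} \phi_V,
\qquad
\phi_W^{*} = \frac{\phi_S^{*}}{\phi_S}\,\phi_W .
```

After one message in each direction the cliques agree on the separator: \(\sum_{V \setminus S}\phi_V = \sum_{W \setminus S}\phi_W\).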

Junction Tree Algorithm
On trees, the JTA is guaranteed to converge after three phases: 1) initialize, 2) collect, 3) distribute.

It ends with potentials as marginals or conditionals of hidden variables given the data, e.g. p(Xh1|Xv), p(Xh2|Xv), p(Xh1, Xh2|Xv). The likelihood p(Xv) is the potential normalizer.

Maximum Likelihood with EM
We wish to maximize the likelihood over the parameters θ for learning:

EM instead iteratively maximizes a lower bound on the log-likelihood:

The E-step updates q(z) and the M-step updates θ, each raising the bound L(q, θ).
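The bound referred to here is the standard EM decomposition (written with parameters θ, hidden variables X_H and observed X_V):

```latex
\log p(X_V \mid \theta) \;\ge\; \mathcal{L}(q,\theta)
  = \sum_{X_H} q(X_H)\,\log\frac{p(X_V, X_H \mid \theta)}{q(X_H)}

\text{E-step: } q^{(t+1)}(X_H) = p(X_H \mid X_V, \theta^{(t)})
\qquad
\text{M-step: } \theta^{(t+1)} = \arg\max_\theta \mathcal{L}\big(q^{(t+1)}, \theta\big)
```

The E-step makes the bound tight at the current θ; the M-step then raises it, so the log-likelihood never decreases.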

Dynamic Bayes Nets
Dynamic Bayesian networks are BNs unrolled in time. The simplest and most classical examples are the Hidden Markov Model and the Linear Dynamical System, each defined by a state transition model and an emission model.
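The transition and emission models referred to here are, in standard form (hidden state s_t, observation x_t):

```latex
\text{HMM:}\quad p(s_t = j \mid s_{t-1} = i) = A_{ij},
\qquad x_t \sim p(x_t \mid s_t)

\text{LDS:}\quad s_t = A\,s_{t-1} + w_t,\; w_t \sim \mathcal{N}(0, Q);
\qquad x_t = C\,s_t + v_t,\; v_t \sim \mathcal{N}(0, R)
```

The HMM uses a discrete state with a transition matrix A; the LDS (Kalman filter) uses a continuous state with linear-Gaussian dynamics.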

Two-Person Interaction
Interact with a single user via p(y|x), learned from two users: observe two interacting people (person Y and person X) and mimic the interaction via a simulated person Y.

One hidden Markov model for each user: no coupling!

One time series for both users: too rigid!

DBN: Hidden ARMA Model
Learn to imitate a behavior by watching a teacher exhibit it, e.g. unsupervised observation of a two-agent interaction, or tracking lip motion. Discover correlations between past action & subsequent reaction. Estimate p(Y | past X, past Y).

DBN: Hidden ARMA Model

- Focus on predicting person Y from the past of both X and Y.
- Have multiple linear models mapping the past to the future.
- Use a window for the moving average (compressed with PCA).
- But select among the linear models using a switch S (nonlinear).
- Here, we show only a 2nd-order moving average: predict the next Y given the past two Ys, the past two Xs, the current X, and a choice of ARMA linear model.
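The prediction step above can be sketched as follows, under stated assumptions: the data are synthetic, and the switch S is picked per step by best one-step fit rather than inferred with EM, purely for illustration.

```python
import numpy as np

# Toy sketch of the hidden ARMA prediction step: predict the next y from
# a window of past y's and x's, choosing among several linear maps via a
# discrete switch S. Here S is chosen by oracle one-step error, a
# stand-in for the real inference over the hidden switch.
rng = np.random.default_rng(0)
T, n_models = 200, 3

x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(2, T):  # synthetic 2nd-order interacting series
    y[t] = 0.5 * y[t-1] - 0.2 * y[t-2] + 0.3 * x[t] + 0.05 * rng.standard_normal()

# Windowed regressors: [y_{t-1}, y_{t-2}, x_t, x_{t-1}, x_{t-2}]
feats = np.array([[y[t-1], y[t-2], x[t], x[t-1], x[t-2]] for t in range(2, T)])
targets = y[2:]

# Several linear models fit on interleaved subsets (stand-in for EM fits)
models = [np.linalg.lstsq(feats[i::n_models], targets[i::n_models], rcond=None)[0]
          for i in range(n_models)]

preds = np.array([feats @ w for w in models])      # (n_models, N) predictions
best = np.argmin((preds - targets) ** 2, axis=0)   # per-step switch S
y_hat = preds[best, np.arange(len(targets))]
print("RMS:", np.sqrt(np.mean((y_hat - targets) ** 2)))
```

The real model learns the switch distribution jointly with the linear maps; this sketch only shows why multiple linear regimes plus a selector can beat a single linear predictor.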

Hidden ARMA Features: model skin color as a mixture of RGB Gaussians; track the person as a mixture of spatial Gaussians.

But we want to predict only Y from X: be discriminative and use maximum conditional likelihood (CEM).

Conditional EM
EM: divide & conquer. CEM: discriminative divide & conquer. Only need a conditional? Then maximize the conditional likelihood.
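Concretely, where EM climbs the joint log-likelihood, CEM climbs the conditional one, which is a difference of two log-likelihoods:

```latex
\ell_c(\theta) = \sum_i \log p(y_i \mid x_i, \theta)
  = \sum_i \Big[ \log p(x_i, y_i \mid \theta) - \log p(x_i \mid \theta) \Big]
```

This difference is what makes the optimization harder (it is no longer a single lower-boundable log-sum), motivating the CEM iterations.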

Conditional EM: CEM vs. EM
On a p(c|x,y) task, CEM accuracy = 100% while EM accuracy = 51% (compare the learned EM p(y|x) against the CEM p(y|x)).

Conditional EM for hidden ARMA
Two users gesture to each other for a few minutes. Model: mixture of 25 Gaussians; STM: T=120, Dims=22+15. Estimate the prediction discriminatively/conditionally as p(future|past).

| Method | RMS error |
| --- | --- |
| Nearest Neighbor | 1.57% |
| Constant Velocity | 0.85% |
| Hidden ARMA | 0.64% |

Hidden ARMA on Gesture: SCARE, WAVE, CLAP.

DBN: Input-Output HMM
Similarly, learn a person's response to audio-video stimuli to predict Y (or agent A) from X (or world W). A wearable collects audio & video for A and W:

- Sony Picturebook laptop
- 2 cameras (7 Hz) (USB & analog)
- 2 microphones (USB & analog)
- 100 MB per hour ($10/GB)

DBN: Input-Output HMM
Consider simulating the agent given the world. A hidden Markov model on its own is insufficient, since it does not distinguish between the input role the world plays and the output we need to generate. Instead, form an input-output HMM: one IOHMM predicts the agent's audio using all 3 past channels, and another predicts the agent's video. Use CEM to learn the IOHMM discriminatively.
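One standard way to write the IOHMM's conditional model (hidden state s_t, input x_t, output y_t) makes the input/output asymmetry explicit:

```latex
p(y_{1:T} \mid x_{1:T}) = \sum_{s_{1:T}} \prod_{t=1}^{T}
   p(s_t \mid s_{t-1}, x_t)\; p(y_t \mid s_t, x_t)
```

Both the transitions and the emissions are conditioned on the input stream, so the model never spends capacity explaining the world, only mapping it to the agent.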

Input-Output HMM Data
Video: histogram lighting correction; RGB mixture of Gaussians to detect skin; face as 2000 pixels at 7 Hz (X, Y, intensity).
Audio: Hamming window, FFT, equalization; spectrograms at 60 Hz; 200 bands (amplitude, frequency).
A very noisy data set!

Video Representation

- Principal Components Analysis assumes linear vectors in Euclidean space (images, spectrograms, time series as vectors).
- But vectorization is bad: the data are nonlinear.
- Images = collections of (X,Y,I) pixel tuples; spectrograms = collections of (A,F) tuples.
- Therefore: Corresponded Principal Components Analysis (CPCA).

The M are soft permutation matrices.

Video Representation: original vs. PCA vs. CPCA. 2000 XYI pixels compressed to 20 dims.
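A plain-PCA version of this 20-dim compression can be sketched as follows (random stand-in data; CPCA's extra step, solving for the soft pixel correspondences M, is not included):

```python
import numpy as np

# Plain-PCA baseline for compressing frames of 2000 (X,Y,I) pixel tuples
# (6000 numbers each) down to a 20-dim code. Random stand-in data; CPCA
# would additionally align pixels with soft permutation matrices M.
rng = np.random.default_rng(1)
D = rng.standard_normal((200, 6000))   # 200 frames, each 2000 XYI pixels

mu = D.mean(axis=0)
U, s, Vt = np.linalg.svd(D - mu, full_matrices=False)
codes = (D - mu) @ Vt[:20].T           # 20-dim coefficients per frame
recon = codes @ Vt[:20] + mu           # reconstruction from the code
print(codes.shape)                     # (200, 20)
```

The codes are what feed the IOHMM; reconstruction back through the top components is what the resynthesis stage later inverts.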

Input-Output HMM
For agent and world: 1 loudness scalar, 20 spectrogram coefficients, 20 face coefficients. Estimate the hidden trellis from partial data.

Input-Output HMM with CEM
Conditionally model p(Agent Audio | World Audio, World Video) and p(Agent Video | World Audio, World Video). We don't care how well we model world audio and video, just as long as we can map them to agent audio or agent video. This also avoids temporal scale problems (video 5 Hz, audio 60 Hz). Audio IOHMM: CEM with a 60-state, 82-dim HMM, diagonal Gaussian emissions, 90,000 samples train / 36,000 test.

Input-Output HMM with CEM
Conditional likelihoods, EM (red) vs. CEM (blue): audio 99.61 (EM) vs. 100.58 (CEM); video -122.46 (EM) vs. -121.26 (CEM). Spectrograms are drawn from the eigenspace.

Resynthesis (in training & testing): use a KD-tree on the video coefficients to map to the closest image in the training set (rendering the raw point-cloud is too confusing).

Input-Output HMM Results, on train and test data.

Intractable Dynamic Bayes Nets

- Factorial Hidden Markov Model: interaction through the output.
- Coupled Hidden Markov Model: interaction through the hidden states.

Intractable DBNs: Generalized EM
As before, we use a bound on the likelihood:

But the best q over the hidden variables, the one that minimizes the KL divergence, is intractable! Thus, restrict q to explore only factorized distributions. EM still converges under partial E-steps & partial M-steps.

Intractable DBNs: Variational EM
For the factorial HMM and the coupled HMM, the q distributions are now limited to be chains. This is tractable as an iterative method, also known as variational EM or structured mean-field.
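The restricted bound being optimized has the standard structured mean-field form:

```latex
\mathcal{L}(q,\theta) = \log p(X_V \mid \theta)
   - \mathrm{KL}\big( q(X_H) \,\|\, p(X_H \mid X_V, \theta) \big),
\qquad
q(X_H) = \prod_c q_c\big(X_H^{(c)}\big)
```

where each factor q_c is itself a tractable chain, so exact HMM inference can be run on each chain in turn while the others are held fixed.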

Dynamical System Trees
How do we handle more people and a hierarchy of coupling? DSTs consider coupling like a university's structure: students -> department -> school -> university.

Interaction happens through an aggregated community state: internal nodes are states, leaf nodes are emissions, and any subtree is also a DST. (The figure shows the DST unrolled over 2 time steps.)

Dynamical System Trees
Again apply the generalization of EM, doing variational structured mean field for the q distribution. This becomes formulaic for any DST topology! Code available at http://www.cs.columbia.edu/~jebara/dst

DSTs and Generalized EM
Structured mean field: use a tractable distribution Q to approximate P, introduce variational parameters, and minimize KL(Q||P). (The figure alternates between introducing variational parameters and running inference.)

DSTs for American Football
(Figures: the initial frame of a typical play, and the trajectories of the players.)

DSTs for American Football
~20 time series of two types of plays ("wham" and "digs"). The likelihood ratio of the two models is used as the classifier. DST1 puts all players into one game state; DST2 combines players into two teams and then into the game.
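The likelihood-ratio classifier can be sketched under stated assumptions: single Gaussians stand in for the trained DST models, and the play data are random placeholders.

```python
import numpy as np

# Likelihood-ratio classification: fit one model per play type, then
# label a test series by whichever model scores it higher. Single
# Gaussians stand in here for the trained DSTs; the data are synthetic.
def fit_gaussian(X):
    return X.mean(axis=0), np.cov(X.T) + 1e-6 * np.eye(X.shape[1])

def loglik(X, mu, cov):
    d = X - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = np.sum((d @ np.linalg.inv(cov)) * d)
    return -0.5 * (quad + len(X) * (logdet + X.shape[1] * np.log(2 * np.pi)))

rng = np.random.default_rng(2)
wham = rng.normal(0.0, 1.0, (10, 50, 4))   # 10 plays x 50 frames x 4 dims
digs = rng.normal(0.5, 1.0, (10, 50, 4))

model_wham = fit_gaussian(wham.reshape(-1, 4))
model_digs = fit_gaussian(digs.reshape(-1, 4))

test_play = rng.normal(0.5, 1.0, (50, 4))  # an unseen play
ratio = loglik(test_play, *model_digs) - loglik(test_play, *model_wham)
print("classified as:", "digs" if ratio > 0 else "wham")
```

With the DSTs in place of the Gaussians, `loglik` would be the variational lower bound each model assigns to the play's time series.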

DSTs for Gene Networks
Time series of the cell cycle: hundreds of gene expression levels over time. Using a given hierarchical clustering as the DST structure gives the best test likelihood.

Robotic Surgery, Haptics & Video
The da Vinci laparoscopic robot is used in hundreds of hospitals. The surgeon works at a console and the robot mimics the movement on the (local) patient. It captures all actuator/robot data as 300 Hz time series, plus multi-channel video from cameras inside the patient.


Robotic Surgery, Haptics & Video
Suturing, expert vs. novice: a 64-dimensional time series @ 300 Hz of console and actuator parameters.

Robotic Surgical Drills Results
Compress the haptic & video data with PCA to 60 dims. We collected data from novices and experts and built several DBNs (IOHMMs, DSTs, etc.) of expert and novice behavior for 3 different drills (6 models total). Preliminary results:

The three drills: Minefield, Russian Roulette, and Suture.

Conclusion
Dynamic Bayesian networks are a natural upgrade to HMMs, relevant for structured, multi-modal and multi-person data.
