
Dynamic Bayesian Networks for Meeting Structuring

Alfred Dielmann, Steve Renals

(University of Sheffield)

Introduction

GOAL: automatic analysis of meetings through the recognition of "multimodal events", using objective measures and statistical methods.

"Multimodal events" are events which involve one or more communicative modalities, and which represent the behaviour of a single participant or of a whole group.

Multimodal Recognition

[Block diagram: meeting-room sensors (audio, video, …) feed signal pre-processing and feature extraction; specialised recognition systems (speech, video, gestures), supported by models and a knowledge database, perform "multimodal events" recognition, whose output supports information retrieval.]

Group Actions

1. The machine observes group behaviours through objective measures (an "external observer").
2. The results of this analysis are "structured" into a sequence of symbols (a "coding system") that is:
– Exhaustive (covering the entire meeting duration)
– Mutually exclusive (non-overlapping symbols)

We used the coding system adopted by the "IDIAP framework", composed of 5 "meeting actions" derived from different communicative modalities:
• Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard

Corpus

• 60 meetings (a 30x2 set) collected in the "IDIAP Smart Meeting Room":
– 30 meetings are used for training
– 23 meetings are used for testing
– 7 meetings will be used for result validation
• 4 participants per meeting
• 5 hours of multi-channel audio-visual recordings:
– 3 fixed cameras
– 4 lapel microphones + an 8-element circular microphone array
• Meeting agendas are generated a priori and strictly followed, in order to have an average of 5 "meeting actions" per meeting
• Available for public distribution: http://mmm.idiap.ch/

Features (1)

[Diagram: the microphone array feeds beam-forming, which yields speaker turns; the lapel microphones yield prosodic and acoustic features (pitch, baseline energy, rate of speech, …); both feature streams pass through dimension reduction.]

Only features derived from audio are currently used.

Features (2): Speaker Turns

Location-based "speech activities" (from SRP-PHAT beam-forming), kindly provided by IDIAP:

        L1    L2    L3    L4
t-3    0.1   0.4   0.6   0.3
t-2    0.3   0.5   0.5   0.3
t-1    0.2   0.4   0.7   0.2
t      0.2   0.3   0.7   0.1

Speaker turn features are built as products of the activities at locations i, j, k over consecutive frames:

Li(t) * Lj(t-1) * Lk(t-2)
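A minimal sketch of how these products could be computed, assuming the activities come as a (frames x locations) matrix; the 3-frame order matches the formula above, while the function name and shapes are illustrative:

```python
import numpy as np
from itertools import product

def speaker_turn_features(L, order=3):
    """Products of location-based speech activities over consecutive
    frames, one per (i, j, k) combination: L_i(t) * L_j(t-1) * L_k(t-2).

    L     : (T, N) array; L[t, i] is the speech activity at location i
            (e.g. from SRP-PHAT beam-forming) at frame t.
    order : number of consecutive frames combined (3 in the slide).
    """
    T, N = L.shape
    feats = np.empty((T - order + 1, N ** order))
    for t in range(order - 1, T):
        feats[t - order + 1] = [
            np.prod([L[t - d, idx[d]] for d in range(order)])
            for idx in product(range(N), repeat=order)
        ]
    return feats

# The activity values from the table above (rows t-3..t, columns L1..L4):
L = np.array([[0.1, 0.4, 0.6, 0.3],
              [0.3, 0.5, 0.5, 0.3],
              [0.2, 0.4, 0.7, 0.2],
              [0.2, 0.3, 0.7, 0.1]])
print(speaker_turn_features(L).shape)  # (2, 64): 4^3 products per frame
```

The resulting 4^3 = 64-dimensional vector is the kind of output the dimension-reduction step mentioned in Features (1) would presumably compress.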

Features (3)

[Diagram: the lapel microphone feeds a pitch extractor followed by filters (histogram, median and interpolating filters), yielding the pitch; RMS energy is computed directly; MRATE yields the rate of speech. All features are masked using the "speech activity" from mic. array beam-forming.]
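The masking step lends itself to a one-line sketch, assuming frame-aligned features and a binary speech-activity signal; the array names are illustrative:

```python
import numpy as np

def mask_features(features, speech_activity):
    """Zero out prosodic features (pitch, RMS energy, MRATE) on frames
    where the mic.-array beam-forming reports no speech activity.

    features        : (T, D) frame-level feature matrix
    speech_activity : (T,) binary 0/1 mask per frame
    """
    return features * speech_activity[:, None]
```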

Features (4)

We'd like to integrate other features:
• Video: participants' motion features, other blob positions, … (via image processing)
• Audio: ASR transcripts
• Gestures and actions, …
• Other: everything that could be automatically extracted from a recorded meeting

Dynamic Bayesian Networks (1)

Bayesian Networks are a convenient graphical way to describe statistical (in)dependencies among random variables.

[Diagram: a directed acyclic graph over nodes C, S, L, A, F, O, each annotated with a conditional probability table.]

• A directed acyclic graph plus conditional probability tables (CPTs)
• Given a set of known evidence nodes, the probability of the other nodes can be computed through inference
• Given a set of examples, EM learning algorithms (e.g. Baum-Welch) can be used to train the CPTs
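Both operations can be illustrated on a toy two-node network; the graph and all probability values below are invented for illustration (this is not the network in the slide), and inference is done by brute-force enumeration:

```python
# DAG: C -> S, each node carrying a (conditional) probability table.
P_C = {True: 0.3, False: 0.7}                    # prior P(C)
P_S_given_C = {True: {True: 0.8, False: 0.2},    # CPT P(S | C)
               False: {True: 0.1, False: 0.9}}

def joint(c, s):
    """Joint probability factorises along the DAG: P(C, S) = P(C) P(S|C)."""
    return P_C[c] * P_S_given_C[c][s]

# Inference with S observed as evidence: P(C = True | S = True).
num = joint(True, True)
den = sum(joint(c, True) for c in (True, False))
print(num / den)  # ~0.774
```

In an EM iteration, the CPT entries would instead be re-estimated from expected counts over a set of training examples, as Baum-Welch does for HMMs.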

Dynamic Bayesian Networks (2)

DBNs are an extension of BNs to random variables that evolve in time, obtained by:
• Instantiating a static BN for each temporal slice t
• Making the temporal dependencies between variables explicit

[Diagram: the static network over C, S, L, O replicated at slices t = 0, t = 1, …, t = T, with links between consecutive slices.]

Dynamic Bayesian Networks (3)

Hidden Markov Models, Kalman filter models and other state-space models are just a special case of DBNs:

[Diagram: representation of an HMM as an instance of a DBN; hidden states Q0 → … → Qt → Qt+1, each emitting an observation Y0, …, Yt, Yt+1.]
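As a concrete instance, inference in this DBN unrolls into the classic forward recursion, updating one time slice at a time; the toy parameters below are invented:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward pass of a discrete HMM: returns P(Y_0, ..., Y_T).

    pi  : (Q,)  initial distribution P(Q_0)
    A   : (Q,Q) transitions, A[i, j] = P(Q_{t+1} = j | Q_t = i)
    B   : (Q,V) emissions,   B[i, v] = P(Y_t = v | Q_t = i)
    obs : observed symbol sequence Y_0 ... Y_T
    """
    alpha = pi * B[:, obs[0]]
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]  # one DBN time-slice update
    return alpha.sum()

# Toy HMM: 2 hidden states, 2 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0]))
```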

Dynamic Bayesian Networks (4)

Representing HMMs in terms of DBNs makes it easy to create variations on the basic theme:

[Diagrams: a Factorial HMM, in which several hidden chains Q, Z, X jointly generate a single observation stream Y; and a Coupled HMM, in which two hidden chains Q and Z emit their own observation streams Y and V and are coupled across time slices.]

Dynamic Bayesian Networks (5)

The use of BNs and DBNs presents several advantages:

• An intuitive way to represent models graphically, with a standard notation
• A unified theory for a huge number of models
– Connects different models in a structured view
– Makes it easier to study new models
• A unified set of tools (e.g. GMTK) to work with them (training, inference, decoding)
– Maximises resource reuse
– Minimises "setup" time

First Model (1)

"Early integration" of features and modelling through a 2-level Hidden Markov Model.

[Diagram: a two-level DBN; hidden "meeting actions" A0, …, At, At+1, …, AT at the top, hidden sub-states S0, …, St, St+1, …, ST in the middle, and observable feature vectors Y0, …, Yt, Yt+1, …, YT at the bottom.]

First Model (2)

The main idea behind this model is to decompose each "meeting action" into a sequence of "sub-actions" or sub-states. (Note that different actions are free to share the same sub-state.)

The structure is composed of two ergodic HMM chains:
• The top chain links sub-states {St} with "actions" {At}
• The lower one maps the feature vectors {Yt} directly into a sub-state {St}

First Model (3)

• The sequence of actions {At} is known a priori
• The sequence {St} is determined during the training process, and the meaning of each sub-state is unknown
• The cardinality of {St} is one of the model's parameters
• The mapping of observable features {Yt} into hidden sub-states {St} is obtained through Gaussian Mixture Models
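A minimal sketch of the generative process this model assumes, with an ergodic action chain, an action-conditioned sub-state chain, and a single Gaussian per sub-state standing in for the Gaussian Mixture Models; all cardinalities and parameter values are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS, N_SUBSTATES, FEAT_DIM = 5, 8, 4  # |{St}| is a model parameter

# Invented parameters: ergodic action transitions, action-conditioned
# sub-state transitions, and one Gaussian mean per sub-state.
A_trans = rng.dirichlet(np.ones(N_ACTIONS), size=N_ACTIONS)
S_trans = rng.dirichlet(np.ones(N_SUBSTATES), size=(N_ACTIONS, N_SUBSTATES))
means = rng.normal(size=(N_SUBSTATES, FEAT_DIM))

def sample(T):
    a = rng.integers(N_ACTIONS)        # A_0
    s = rng.integers(N_SUBSTATES)      # S_0
    for _ in range(T):
        y = rng.normal(means[s], 1.0)  # Y_t emitted from sub-state S_t
        yield a, s, y
        a = rng.choice(N_ACTIONS, p=A_trans[a])       # top chain: actions
        s = rng.choice(N_SUBSTATES, p=S_trans[a, s])  # lower chain, linked to A

for a, s, y in sample(3):
    print(a, s, np.round(y, 2))
```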

Second Model (1)

Multistream processing of features through two parallel and independent Hidden Markov Models.

[Diagram: two parallel hidden sub-state chains, S^1 (fed by the prosodic features Y^1) and S^2 (fed by the speaker-turn features Y^2), both connected to the "meeting actions" chain A0, …, At, At+1, …, AT; an action counter Ct and a binary "enable transitions" node Et are attached to the action chain.]

Second Model (2)

Each feature group (or modality) Y^m is mapped into an independent HMM chain; therefore every group is evaluated independently and mapped into a hidden sub-state {St^m}.

As in the previous model, there is another HMM layer (A), which represents "meeting actions".

The whole joint sub-state {St^1 × St^2 × … × St^n} is mapped into an action {At}.
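A minimal sketch of that mapping, assuming the action layer's CPT is simply indexed by the Cartesian product of the two sub-states; all cardinalities and values are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
N1, N2, N_ACTIONS = 6, 6, 5   # |S^1|, |S^2|, number of meeting actions

# CPT P(A_t | S_t^1, S_t^2): one distribution per joint sub-state.
P_A_given_S = rng.dirichlet(np.ones(N_ACTIONS), size=(N1, N2))

s1, s2 = 2, 4                 # current sub-states of the two streams
print(P_A_given_S[s1, s2])    # distribution over the 5 meeting actions
```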

Second Model (3)

It is a variable-duration HMM with an explicit enable node:
• At represents "meeting actions" as usual
• Ct counts "meeting actions"
• Et is a binary indicator variable that enables state changes inside the node At

Example frame-by-frame behaviour:

Ct:  1  1  2  2  2
Et:  0  1  0  0  0
At:  8  8  5  5  5

Second Model (4)

• Training: when {At} changes, {Ct} is incremented and {Et} is set high for a single frame (At, Et and Ct are part of the training dataset; see the table above)
• Decoding: {At} is free to change only when {Et} is high, and then according to the state of {Ct}

The behaviours of {Et} and {Ct} learned during the training phase are then exploited during decoding, as sketched below.
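A minimal sketch of the enable-gated transition exploited at decoding time: the CPT P(A_{t+1} | A_t, E_t) copies the action forward when E_t = 0 and transitions freely when E_t = 1 (the transition matrix here is a uniform placeholder; the real one, and the role of C_t, would be learned in training):

```python
import numpy as np

N_ACTIONS = 5
A_trans = np.full((N_ACTIONS, N_ACTIONS), 1.0 / N_ACTIONS)  # placeholder

def p_next_action(a_next, a_now, e_now):
    """CPT P(A_{t+1} | A_t, E_t): deterministic copy when the enable
    bit is low, a free (learned) transition when it is high."""
    if e_now == 0:
        return 1.0 if a_next == a_now else 0.0
    return A_trans[a_now, a_next]

print(p_next_action(3, 3, 0))  # 1.0: no enable, the action must persist
print(p_next_action(2, 3, 1))  # 0.2: enable fired, free transition
```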

Results

Using the two models previously described, the results obtained using only audio-derived features are:

               Corr.  Sub.  Del.  Ins.  AER
First Model    93.2   2.3   4.5   4.5   11.4
Second Model   94.7   1.5   3.8   0.8    6.1

AER = 100 × (Sub + Ins + Del) / (total number of actions)

The Action Error Rate (AER) is equivalent to the Word Error Rate measure used to evaluate speech recogniser performance.

The second model effectively reduces both the number of substitutions and the number of insertions.
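A quick check of the AER formula against the table; since Sub, Del and Ins are already percentages of the total number of actions, the AER is simply their sum (the 11.3 vs 11.4 gap for the first model is presumably rounding in the slide):

```python
def aer(sub, dele, ins, total):
    """Action Error Rate, the analogue of Word Error Rate."""
    return 100.0 * (sub + dele + ins) / total

# Sub/Del/Ins given as percentages, i.e. total = 100:
print(aer(2.3, 4.5, 4.5, 100))  # ~11.3 (first model)
print(aer(1.5, 3.8, 0.8, 100))  # ~6.1  (second model)
```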

Conclusions

• A new approach has been proposed
• The achieved results seem promising, and in the future we'd like to:
– Validate them on the remaining part of the test set (or possibly on an independent test set)
– Integrate other features: video, ASR transcripts, Xtalk, …
– Try new experiments with the existing models
– Develop new DBN-based models

Multimodal Recognition (2)

Knowledge sources:
• Raw audio
• Raw video
• Acoustic features
• Visual features
• Automatic speech recognition
• Video understanding
• Gesture recognition
• Eye gaze tracking
• Emotion detection
• …

Approaches:
• Fusion of different recognisers at an early stage, generating hybrid recognisers (like AVSR)
• Integration of recogniser outputs through a "high-level" recogniser
• A standalone high-level recogniser operating on low-level raw data