
Proceedings of NAACL-HLT 2018, pages 2206–2216, New Orleans, Louisiana, June 1–6, 2018. ©2018 Association for Computational Linguistics

Speaker Naming in Movies

Mahmoud Azab, Mingzhe Wang, Max Smith, Noriyuki Kojima, Jia Deng, Rada Mihalcea
Computer Science and Engineering, University of Michigan

{mazab,mzwang,mxsmith,kojimano,jiadeng,mihalcea}@umich.edu

Abstract

We propose a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in a unified optimization framework. To evaluate the performance of our model, we introduce a new dataset consisting of six episodes of the Big Bang Theory TV show and eighteen full movies covering different genres. Our experiments show that our multimodal model significantly outperforms several competitive baselines on the average weighted F-score metric. To demonstrate the effectiveness of our framework, we design an end-to-end memory network model that leverages our speaker naming model and achieves state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.

1 Introduction

Identifying speakers and their names in movies, and videos in general, is a primary task for many video analysis problems, including automatic subtitle labeling (Hu et al., 2015), content-based video indexing and retrieval (Zhang et al., 2009), video summarization (Tapaswi et al., 2014), and video storyline understanding (Tapaswi et al., 2014). It is a very challenging task, as the visual appearance of the characters changes over the course of the movie due to several factors such as scale, clothing, illumination, and so forth (Arandjelovic and Zisserman, 2005; Everingham et al., 2006). The annotation of movie data with speakers' names can be helpful in a number of applications, such as movie question answering (Tapaswi et al., 2016), automatic identification of character relationships (Zhang et al., 2009), or automatic movie captioning (Hu et al., 2015).

Most previous studies relied primarily on visual information (Arandjelovic and Zisserman, 2005; Everingham et al., 2006), and aimed for the slightly different task of face track labeling;

Figure 1: Overview of our approach for speaker naming. (The figure shows raw subtitle segments with their timestamps and the names extracted from the dialogue as input, a joint optimization step for inference, and the subtitle segments labeled with their speakers' names as output.)

speakers who did not appear in the video frame were not assigned any names, which is common in movies and TV shows. Other available sources of information such as scripts were only used to extract cues about the speakers' names, to associate the faces in the videos with their corresponding character names (Everingham et al., 2006; Tapaswi et al., 2015; Bauml et al., 2013; Sivic et al., 2009); however, since scripts are not always available, the applicability of these methods is somewhat limited.

Other studies focused on the problem of speaker recognition without naming, using the speech modality as a single source of information. While some of these studies attempted to incorporate the visual modality, their goal was to cluster the speech segments rather than name the speakers


(Erzin et al., 2005; Bost and Linares, 2014; Kapsouras et al., 2015; Bredin and Gelly, 2016; Hu et al., 2015; Ren et al., 2016). None of these studies used textual information (e.g., dialogue), which prevented them from identifying speaker names.

In our work, we address the task of speaker naming, and propose a new multimodal model that leverages, in a unified framework, the visual, speech, and textual modalities that are naturally available while watching a movie. We do not assume the availability of a movie script or a cast list, which makes our model fully unsupervised and easily applicable to unseen movies.

The paper makes two main contributions. First, we introduce a new unsupervised system for speaker naming in movies and TV shows that depends exclusively on videos and subtitles, and relies on a novel unified optimization framework that fuses visual, textual, and acoustic modalities for speaker naming. Second, we construct and make available a dataset consisting of 24 movies with 31,019 turns manually annotated with character names. Additionally, we evaluate the role of speaker naming when embedded in an end-to-end memory network model, achieving state-of-the-art performance on the subtitles task of the MovieQA 2017 Challenge.

2 Related Work

The problem of speaker naming in movies has been explored by the computer vision and the speech communities. In the computer vision community, the speaker naming problem is usually considered as a face/person naming problem, in which names are assigned to their corresponding faces on the screen (Everingham et al., 2006; Cour et al., 2010; Bauml et al., 2013; Haurilet et al., 2016; Tapaswi et al., 2015). On the other hand, the speech community considered the problem as a speaker identification problem, which focuses on recognizing and clustering speakers rather than naming them (Reynolds, 2002; Campbell, 1997). In this work, we aim to solve the problem of speaker naming in movies, in which we label each segment of the subtitles with its corresponding speaker name, whether or not the speaker's face appears in the video.

Previous work can be further categorized according to the type of supervision used to build the character recognition and speaker recognition models: supervised vs. weakly supervised models.

In the movie and television domains, utilizing scripts in addition to subtitles to obtain time-stamped speaker information was also studied in (Everingham et al., 2006; Tapaswi et al., 2015; Bauml et al., 2013; Sivic et al., 2009). Moreover, these studies utilized this information to resolve the ambiguity introduced by co-occurring faces in the same frame. Features were extracted over the period of speaking (detected via lip motion on each face), and each face was then assigned a name based on candidate names from the time-stamped script. Thus, these studies used speaker recognition as an essential step to construct cast-specific face classifiers. Tapaswi et al. (2012) extended the face identification problem to include person tracking; they utilized available face recognition results to learn clothing models for characters, in order to identify person tracks without faces.

In (Cour et al., 2010; Haurilet et al., 2016), the authors proposed a weakly supervised model depending on subtitles and a character list. They extracted textual cues from the dialog: first, second, and third person references, such as "I'm Jack", "Hey, Jack!", and "Jack left". Using a character list from IMDB, they mapped these references onto true names using minimum edit distance, and then they ascribed the references to face tracks. Other work removed the dependency on a true character list by determining all names through coreference resolution; however, this work also depended on the availability of scripts (Ramanathan et al., 2014). In our model, we removed the dependency on both the true cast list and the script, which makes it easier to apply our model to other movies and TV shows.

Recent work proposed a convolutional neural network (CNN) and Long Short-Term Memory (LSTM) based learning framework to automatically learn a function that combines both facial and acoustic features (Hu et al., 2015; Ren et al., 2016). Using these cues, they tried to learn matching face-audio pairs and non-matching face-audio pairs. They then trained an SVM classifier on the audio-video pairings to discriminate between the non-overlapping speakers. In order to train their models, they manually identified the leading characters in two TV shows, Friends and The Big Bang Theory (BBT), and collected their face tracks and corresponding audio segments using pre-annotated subtitles. Despite the very high performance reported in these studies, it is very hard to generalize their approach, since it requires a large amount of training data.


On the other hand, talking faces have been used to improve speaker recognition and diarization in TV shows (Bredin and Gelly, 2016; Bost and Linares, 2014; Li et al., 2004). In the case of (Liu et al., 2008), the authors modeled the problem of speaker naming as facial recognition to identify speakers in news broadcasts. This work leveraged optical character recognition to read the broadcasters' names that were displayed on screen, requiring the faces to already be annotated.

3 Datasets

Our dataset consists of a mix of TV show episodes and full movies. For the TV show, we use six full episodes of season one of the BBT. The number of named characters in the BBT episodes varies between 5 and 8 characters per episode, and the background noise level is low. Additionally, we acquired a set of eighteen full movies from different genres, to evaluate how our model works under different conditions. In this latter dataset, the number of named characters ranges between 6 and 37, and it has varied levels of background noise.

We manually annotated this dataset with the character name of each subtitle segment. To facilitate the annotation process, we built an interface that parses the movies' subtitle files, collects the cast list from IMDB for each movie, and then shows one subtitle segment at a time along with the cast list so that the annotator can choose the correct character. Using this tool, human annotators watched the movies and assigned a speaker name to each subtitle segment. If a character name was not mentioned in the dialogue, the annotators labeled it as "unknown." To evaluate the quality of the annotations, five movies in our dataset were double annotated. The Cohen's Kappa inter-annotator agreement score for these five movies is 0.91, which shows a strong level of agreement.

To clean the data, we removed empty segments, as well as subtitle description parts written between brackets, such as "[groaning]" and "[sniffing]". We also removed segments with two simultaneous speakers. We intentionally avoided using any automatic means to split these segments, to preserve the high quality of our gold standard.

Table 1 shows the statistics of the collected data. Overall, the dataset consists of 24 videos with a total duration of 40.28 hours, a net dialogue duration of 21.99 hours, and a total of 31,019 turns spoken by 463 different speakers. Four of the movies

in this dataset are used as a development set to develop supplementary systems and to fine-tune our model's parameters; the remaining movies are used for evaluation.

                          Min     Max     Mean     σ
# characters/video        5       37      17.8     9.55
# subtitle turns/video    488     2212    1302.4   563.06
# words/turn              1       28      8.02     4.157
Subtitle duration (sec)   0.342   9.59    2.54     1.02

Table 1: Statistics on the annotated movie dataset.

4 Data Processing and Representations

We process the movies by extracting several textual, acoustic, and visual features.

4.1 Textual Features

We use the following representations for the textual content of the subtitles:

SkipThoughts uses a Recurrent Neural Network to capture the underlying semantic and syntactic properties, and map them to a vector representation (Kiros et al., 2015). We use their pretrained model to compute a 4,800 dimensional sentence representation for each line in the subtitles.1

TF-IDF is a traditional weighting scheme in information retrieval. We represent each subtitle as a vector of tf-idf weights, where the length of the vector (i.e., the vocabulary size) and the idf scores are obtained from the movie that includes the subtitle.
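As an illustration of this per-movie representation, the following sketch (our own code, not the authors'; the example lines are taken from Figure 1) fits the vocabulary and idf weights on the subtitle lines of a single movie and returns one tf-idf vector per line using scikit-learn.

```python
# Minimal sketch of the per-movie TF-IDF representation: the vocabulary and
# idf weights are computed only from the subtitles of the movie that
# contains the segment.
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_subtitle_vectors(subtitle_lines):
    """Return one tf-idf row vector per subtitle line of a single movie."""
    vectorizer = TfidfVectorizer(lowercase=True)
    matrix = vectorizer.fit_transform(subtitle_lines)  # (#lines, |vocab|)
    return matrix, vectorizer

if __name__ == "__main__":
    lines = ["Jack, must you go?",
             "Time for me to go row with the other slaves.",
             "Good night, Rose."]
    X, vec = tfidf_subtitle_vectors(lines)
    print(X.shape)  # (3, vocabulary size of this movie)
```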

4.2 Acoustic Features

For each movie in the dataset, we extract the audio from the center channel. The center channel is usually dedicated to the dialogue in movies, while the other audio channels carry the surrounding sounds from the environment and the musical background. Although doing this does not fully eliminate the noise in the audio signal, it still improves the speech-to-noise ratio of the signal. When a movie has stereo sound (left and right channels only), we down-mix both channels of the stereo stream into a mono channel.

In this work, we use the subtitle timestamps as an estimate of the boundaries that correspond to the uttered speech segments. Usually, each subtitle corresponds to a segment spoken by a single speaker. We use the subtitle timestamps for segmentation so that we can avoid automatic speaker diarization errors and focus on the speaker naming problem.
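A minimal sketch of how the subtitle timestamps can be turned into speech-segment boundaries is shown below; it assumes standard .srt subtitle files, and the parsing details are our own simplification rather than the paper's pipeline.

```python
# Sketch: read an .srt file and yield (start, end, text) for each subtitle
# turn, so that each turn defines one speech segment.
import re
from datetime import timedelta

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")

def parse_ts(ts):
    h, m, s, ms = map(int, TS.match(ts).groups())
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms)

def parse_srt(path):
    """Yield (start, end, text) for each subtitle turn in the file."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.strip().splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = [parse_ts(t.strip()) for t in lines[1].split("-->")]
        yield start, end, " ".join(lines[2:])

# Usage (hypothetical file name):
# for start, end, text in parse_srt("movie.srt"):
#     ...
```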

1 https://github.com/ryankiros/skip-thoughts


To represent the relevant acoustic information from each spoken segment, we use iVectors, which is the state-of-the-art unsupervised approach in speaker verification (Dehak et al., 2011). While other deep learning-based speaker embedding models also exist, we do not have access to enough supervised data to build such models. We train unsupervised iVectors for each movie in the dataset, using the iVector extractor used in (Khorram et al., 2016). We extract iVectors of size 40 using a Gaussian Mixture Model-Universal Background Model (GMM-UBM) with 512 components. Each iVector corresponds to a speech segment uttered by a single speaker. We fine-tune the size of the iVectors and the number of GMM-UBM components using the development dataset.

4.3 Visual Features

We detect faces in the movies every five frames using the recently proposed MTCNN (Zhang et al., 2016) model, which is pretrained for face detection and facial landmark alignment. Based on the results of face detection, we apply the forward and backward tracker implemented in the Dlib library (King, 2009; Danelljan et al., 2014) to extract face tracks from each video clip. We represent a face track by its best face in terms of detection score, and use the activations of the fc7 layer of the pretrained VGG-Face (Parkhi et al., 2015) network as visual features.

We calculate the distance between the upper lip center and the lower lip center based on the 68-point facial landmark detection implemented in the Dlib library (King, 2009; Kazemi and Sullivan, 2014). This distance is normalized by the height of the face bounding boxes and concatenated across frames to represent the amount of mouth opening. A human usually speaks with lips moving at a certain frequency (3.75 Hz to 7.5 Hz in this work) (Tapaswi et al., 2015). We apply a band-pass filter to amplify the signal of true lip motion in these segments. The overall sum of lip motion is used as the score for the talking face.
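The lip-motion scoring described above can be sketched as follows. This is our own re-implementation; the video frame rate fps (and the minimum track length it implies) is an assumption rather than a detail from the paper.

```python
# Sketch: band-pass filter the normalized mouth-opening signal of one face
# track in the 3.75-7.5 Hz speaking range, and use the filtered energy as the
# talking score.
import numpy as np
from scipy.signal import butter, filtfilt

def talking_score(mouth_opening, face_heights, fps=24.0,
                  low_hz=3.75, high_hz=7.5):
    """mouth_opening, face_heights: per-frame values for one face track."""
    signal = np.asarray(mouth_opening, float) / np.asarray(face_heights, float)
    if signal.size < 16:            # too short for a stable filter estimate
        return 0.0
    nyq = fps / 2.0
    b, a = butter(2, [low_hz / nyq, high_hz / nyq], btype="band")
    filtered = filtfilt(b, a, signal - signal.mean())
    return float(np.abs(filtered).sum())   # higher = more likely talking
```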

5 Unified Optimization Framework

We tackle the problem of speaker naming as a transductive learning problem with constraints. In this approach, we want to use the sparse positive labels extracted from the dialogue and the underlying topological structure of the rest of the unlabeled data. We also incorporate multiple cues extracted from both textual and multimedia information.

A unified learning framework is proposed to enable the joint optimization over the automatically labeled and unlabeled data, along with multiple semantic cues.

5.1 Character Identification and Extraction

In this work, we do not consider the set of character names as given, because we want to build a model that can be generalized to unseen movies. This strict setting adds to the problem's complexity. To extract the list of characters from the subtitles, we use the Named Entity Recognizer (NER) in the Stanford CoreNLP toolkit (Manning et al., 2014). The output is a long list of person names that are mentioned in the dialogue. This list is prone to errors including, but not limited to, nouns that are misclassified by the NER as person names, such as "Dad" and "Aye", names that are irrelevant to the movie, such as "Superman" or named animals, or uncaptured character names.

To clean the extracted name list of each movie, we cluster these names based on string minimum edit distance and their gender. From each cluster, we then pick a name to represent it based on its frequency in the dialogue. The result of this step consists of name clusters along with their distribution in the dialogue; the distribution of each cluster is the sum of the counts of all its members. To filter out irrelevant characters, we run a name reference classifier, which classifies each name mention into a first, second, or third person reference. If a name is only mentioned in the third person throughout the whole movie, we discard it from the list of characters. We also remove any name cluster that has a total count of less than three, which filters out names whose reference types were misclassified.
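The name clustering step can be sketched as follows; this is our own simplified version, the gender check from the paper is omitted, and the distance threshold is an illustrative choice.

```python
# Sketch: group extracted name mentions by small edit distance, represent each
# cluster by its most frequent surface form, and drop clusters mentioned fewer
# than three times.
from collections import Counter

def edit_distance(a, b):
    """Standard Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cluster_names(name_counts, max_dist=2, min_total=3):
    clusters = []  # each cluster is a Counter of surface forms
    for name, count in sorted(name_counts.items(), key=lambda x: -x[1]):
        for cluster in clusters:
            rep = cluster.most_common(1)[0][0]
            if edit_distance(name.lower(), rep.lower()) <= max_dist:
                cluster[name] += count
                break
        else:
            clusters.append(Counter({name: count}))
    return {c.most_common(1)[0][0]: sum(c.values())
            for c in clusters if sum(c.values()) >= min_total}

print(cluster_names({"Sheldon": 40, "Shelden": 2, "Penny": 25, "Superman": 1}))
# {'Sheldon': 42, 'Penny': 25}
```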

5.2 Grammatical Cues

We use the subtitles to extract the name mentions in the dialogue. These mentions allow us to obtain cues about the speaker name and about the absence or presence of the mentioned character in the surrounding subtitles; thus, they affect the probability that the mentioned character is or is not the speaker. We follow the same name reference categories used in (Cour et al., 2010; Haurilet et al., 2016). We classify a name mention into a first (e.g., "I'm Sheldon"), second (e.g., "Oh, hi, Penny"), or third person reference (e.g., "So how did it go with Leslie?"). The first person reference represents a positive constraint that allows us to label the corresponding iVector of the speaker, as well as the speaker's face if it appears during the segment.


The second person reference represents a multi-instance constraint that suggests that the mentioned name is one of the characters present in the scene, which increases the probability of this character being one of the speakers of the surrounding segments. On the other hand, the third person reference represents a negative constraint, as it suggests that the mentioned character is not present in the scene, which lowers the probability of that character being one of the speakers of the next or the previous subtitle segments.

To identify first, second, and third person references, we train a linear support vector classifier. The training data for the reference classifier are extracted and labeled from our development dataset, and the classifier is tuned using 10-fold cross-validation. Table 2 shows the results of the classifier on the test data. The average numbers of first, second, and third person references per movie are 14.63, 117.21, and 95.71, respectively.
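A toy sketch of such a reference classifier is given below. The training examples and features are ours, not the annotated data or feature set used in the paper, but the idea is the same: a linear SVM over the words surrounding a name placeholder.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy hand-labeled examples (hypothetical): the mentioned name is replaced by
# a <NAME> placeholder and the surrounding words serve as features.
train_texts = [
    "I'm <NAME>", "My name is <NAME>", "This is <NAME>",          # first person
    "Oh, hi, <NAME>", "Hey, <NAME>!", "<NAME>, must you go?",     # second person
    "So how did it go with <NAME>?", "<NAME> left yesterday",     # third person
    "I heard <NAME> moved out",                                   # third person
]
train_labels = ["first"] * 3 + ["second"] * 3 + ["third"] * 3

# Keep one-character tokens such as "I", which matter for first person cues.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b"),
    LinearSVC(),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["Good night, <NAME>.", "I'm <NAME>, by the way"]))
```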

                  Precision   Recall   F1-Score
First Person      0.625       0.448    0.522
Second Person     0.844       0.863    0.853
Third Person      0.806       0.806    0.806
Average / Total   0.819       0.822    0.820

Table 2: Performance metrics of the reference classifier on the test data.

5.3 Unified Optimization Framework

Given a set of data points that consist of l labeled2 and u unlabeled instances, we apply an optimization framework to infer the best prediction of speaker names. Suppose we have l + u instances X = {x_1, x_2, ..., x_l, x_{l+1}, ..., x_{l+u}} and K possible character names. We also have the dialogue-based positive labels y_i for instances x_i, where y_i is a K-dimensional one-hot vector and y_i^j = 1 if x_i belongs to class j, for every 1 ≤ i ≤ l and 1 ≤ j ≤ K. To name each instance x_i, we want to predict a vector of naming scores f(x_i) for each x_i, such that argmax_j f^j(x_i) = z_i, where z_i is the ground-truth class for instance x_i.

To combine the positive labels and unlabeled data, we define the objective function for predictions f as follows:2

\[ L_{initial}(f) = \frac{1}{l}\sum_{i=1}^{l} \|f(x_i) - y_i\|^2 + \frac{1}{l+u}\sum_{i=1}^{l+u}\sum_{j=1}^{l+u} w_{ij}\,\|f(x_i) - f(x_j)\|^2 \quad (1) \]

Here w_ij is the similarity between x_i and x_j, which is calculated as the weighted sum of the textual, acoustic, and visual similarities. The inverse Euclidean distance is used as the similarity function for each modality. The weights for the different modalities are selected as hyperparameters and tuned on the development set. This objective leads to a convex loss function, which is easier to optimize over feasible predictions.

2 Note that in our setup, all the labeled instances are obtained automatically, as described above.
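For illustration, the similarity term w_ij of Eq. (1) can be computed as in the sketch below; this is our own code, the modality weights (alphas) stand in for the hyperparameters tuned on the development set, and the feature arrays are random placeholders.

```python
# Sketch: w_ij as a weighted sum of per-modality inverse Euclidean distances.
import numpy as np
from scipy.spatial.distance import cdist

def similarity_matrix(features_by_modality, alphas, eps=1e-6):
    """features_by_modality: list of (n, d_m) arrays, one per modality.
    alphas: one weight per modality."""
    n = features_by_modality[0].shape[0]
    W = np.zeros((n, n))
    for feats, alpha in zip(features_by_modality, alphas):
        dist = cdist(feats, feats)          # pairwise Euclidean distances
        W += alpha / (dist + eps)           # inverse distance as similarity
    np.fill_diagonal(W, 0.0)                # no self-similarity
    return W

# Example with random stand-ins for textual, acoustic, and visual features.
rng = np.random.default_rng(0)
W = similarity_matrix([rng.normal(size=(5, 8)), rng.normal(size=(5, 4)),
                       rng.normal(size=(5, 16))], alphas=[0.5, 0.2, 0.3])
print(W.shape)  # (5, 5)
```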

Besides the positive labels obtained from first person name references, we also introduce other semantic constraints and cues to enhance the power of our proposed approach. We implement the following four types of constraints:

Multiple Instance Constraint. Although the second person references cannot directly provide positive constraints, they imply that the mentioned characters have a high probability of being in the conversation. Following previous work (Cour et al., 2010), we incorporate the second person references as multiple instance constraints in our optimization: if x_i contains a second person reference to name j, we encourage j to be assigned to its neighbors, i.e., the adjacent subtitles with similar timestamps. In the implementation, we simply include multiple instance constraints as a variant of positive labels with decreasing weights s, where s = 1/(l − i) for each neighbor x_l.

Negative Constraint. For the third person references, the mentioned characters may not occur in the conversation, or even in the movie. We therefore treat them as negative constraints, which imply that the mentioned characters should not be assigned to the corresponding instances. This constraint is formulated as follows:

\[ L_{neg}(f) = \sum_{(i,j)\in N} [f^j(x_i)]^2 \quad (2) \]

where N is the set of negative constraints, i.e., pairs (i, j) such that x_i does not belong to class j.

Gender Constraint. We train a voice-based gender classifier using the subtitle segments from the four movies in our development dataset (5,543 segments).


We use the segments for which we know the speaker's name, and manually obtain the ground-truth gender label from IMDB. We extract the signal energy, 20 Mel-frequency cepstral coefficients (MFCCs) along with their first and second derivatives, and time- and frequency-based absolute fundamental frequency (f0) statistics as features to represent each segment of the subtitles. The f0 statistics have been found to improve automatic gender detection performance for short speech segments (Levitan et al., 2016), which fits our case, since the median duration of the dialogue turns in our dataset is 2.6 seconds.

The MFCC features are extracted using a step size of 16 msec over a 64 msec window, using the method from (Mathieu et al., 2010), while the f0 statistics are extracted using a step size of 25 msec over a 50 msec window, the default configuration in (Eyben et al., 2013). We then use these features to train a logistic regression classifier using the Scikit-learn library (Pedregosa et al., 2011). The average accuracy of the gender classifier over a 10-fold cross-validation is 0.8867.
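A rough sketch of this classifier is given below. We use librosa for feature extraction here instead of the Yaafe/openSMILE configuration from the paper, and the exact feature statistics are a simplification; the training data (X, y_gender) would come from the labeled development segments.

```python
# Sketch: per-segment acoustic features (MFCCs, deltas, energy, f0 statistics)
# fed to a logistic regression gender classifier.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def segment_features(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)
    f0 = f0[np.isfinite(f0)]
    f0_stats = ([f0.mean(), f0.std(), np.median(f0), f0.min(), f0.max()]
                if f0.size else [0.0] * 5)
    energy = np.atleast_1d(librosa.feature.rms(y=y).mean())
    return np.concatenate([mfcc.mean(axis=1), d1.mean(axis=1),
                           d2.mean(axis=1), energy, f0_stats])

# X: one feature row per labeled segment; y_gender: 0 = female, 1 = male
# clf = LogisticRegression(max_iter=1000).fit(X, y_gender)
```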

Given the results of the gender classification for the audio segments and character names, we define the gender loss to penalize inconsistency between the predicted gender and the character names:

\[ L_{gender}(f) = \sum_{(i,j)\in Q_1} P_{ga}(x_i)\,(1 - P_{gn}(j))\,f^j(x_i) + \sum_{(i,j)\in Q_2} (1 - P_{ga}(x_i))\,P_{gn}(j)\,f^j(x_i) \quad (3) \]

where P_ga(x_i) is the probability that instance x_i is male, P_gn(j) is the probability that name j is male, Q_1 = {(i, j) | P_ga(x_i) < 0.5, P_gn(j) > 0.5}, and Q_2 = {(i, j) | P_ga(x_i) > 0.5, P_gn(j) < 0.5}.

Distribution Constraint. We automatically analyze the dialogue and extract the number of mentions of each character in the subtitles, using Stanford CoreNLP and string matching to capture names that are missed by the named entity recognizer. We then filter the resulting counts by removing the third person mentions of each name, as we assume that the character does not appear in the surrounding frames. We use the results to estimate the distribution of the speaking characters and their importance in the movies. The main goal of this step is to construct a prior probability distribution for the speakers in each movie.

To encourage our predictions to be consistent with the dialogue-based priors, we penalize the squared error between the distribution of the predictions and the name-mention priors:

\[ L_{dis}(f) = \sum_{j=1}^{K} \Big( \sum_{i} f^j(x_i) - d_j \Big)^2 \quad (4) \]

where d_j is the ratio of mentions of name j in all subtitles.

Final Framework. Combining the loss in Eqn. 1 with the losses for the different constraints, we obtain our unified optimization problem:

\[ f^* = \arg\min_f\; \lambda_1 L_{initial}(f) + \lambda_2 L_{MI}(f) + \lambda_3 L_{neg}(f) + \lambda_4 L_{gender}(f) + \lambda_5 L_{dis}(f) \quad (5) \]

All of the λs are hyperparameters to be tuned on the development set. We also include the constraint that the predictions for the different character names must sum to 1. We solve this constrained optimization problem with projected gradient descent (PGD). The optimization problem in Eqn. 5 is guaranteed to be convex, and therefore projected gradient descent is guaranteed to converge to a global optimum. PGD usually converges after 800 iterations.
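The solver can be sketched as follows. This is our own minimal version: it optimizes only the L_initial term of Eq. (1) (the constraint losses of Eq. (5) would add their gradients inside the same loop), assumes a symmetric similarity matrix W, and uses a standard sort-based projection onto the probability simplex.

```python
import numpy as np

def project_rows_to_simplex(F):
    """Euclidean projection of each row of F onto {p : p >= 0, sum(p) = 1}."""
    n, k = F.shape
    U = -np.sort(-F, axis=1)                      # each row sorted descending
    css = np.cumsum(U, axis=1) - 1.0
    idx = np.arange(1, k + 1)
    rho = (U - css / idx > 0).sum(axis=1)         # size of the support
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(F - theta[:, None], 0.0)

def pgd_speaker_naming(Y, labeled_mask, W, K, lr=0.05, iters=800):
    """Y: (n, K) one-hot labels (rows outside labeled_mask are ignored);
    W: (n, n) symmetric similarity matrix; n = l + u instances."""
    n = W.shape[0]
    l = int(labeled_mask.sum())
    deg = W.sum(axis=1)
    F = np.full((n, K), 1.0 / K)                  # start from uniform scores
    for _ in range(iters):
        grad = np.zeros_like(F)
        # gradient of the labeled term of Eq. (1)
        grad[labeled_mask] = (2.0 / max(l, 1)) * (F[labeled_mask] - Y[labeled_mask])
        # gradient of the graph-smoothness term (symmetric W assumed)
        grad += (4.0 / n) * (deg[:, None] * F - W @ F)
        F = project_rows_to_simplex(F - lr * grad)
    return F                                      # row-wise argmax gives the names
```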

6 Evaluation

We model our task as a classification problem, and use the unified optimization framework described earlier to assign a character name to each subtitle.

Since our dataset is highly unbalanced, with a few main characters usually dominating the entire dataset, we adopt the weighted F-score as our evaluation metric, instead of using an accuracy metric or a micro-averaged F-score. This allows us to take into account that most of the characters have only a few spoken subtitle segments, while at the same time placing emphasis on the main characters. As a side effect, the average weighted F-score sometimes does not lie between the average precision and recall.
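For clarity, this metric corresponds to support-weighted averaging of the per-character scores, as in the following small made-up example using scikit-learn.

```python
# Weighted averaging: per-character precision/recall/F-score weighted by how
# many subtitle segments each character actually speaks.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["Jack", "Jack", "Jack", "Rose", "Rose", "Cal"]
y_pred = ["Jack", "Jack", "Rose", "Rose", "Jack", "Cal"]

p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"weighted P={p:.3f} R={r:.3f} F={f:.3f}")
```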

One aspect that is important to note is that characters are often referred to using different names. For example, in the movie "The Devil's Advocate," the character Kevin Lomax is also referred to as Kevin or Kev. In more complicated situations, characters may even have multiple identities, such as the character Saul Bloom in the movie "Ocean's Eleven," who pretends to be another character named Lyman Zerga. Since our


                               Precision   Recall   F-score
B1: MFMC                       0.0910      0.2749   0.1351
B2: DRA                        0.2256      0.1819   0.1861
B3: Gender-based DRA           0.2876      0.2349   0.2317
Our Model (Skip-thoughts)*     0.3468      0.2869   0.2680
Our Model (TF-IDF)*            0.3579      0.2933   0.2805
Our Model (iVectors)           0.2151      0.2347   0.1786
Our Model (Visual)*            0.3348      0.2659   0.2555
Our Model (Visual+iVectors)*   0.3371      0.2720   0.2617
Our Model (TF-IDF+iVectors)*   0.3549      0.2835   0.2643
Our Model (TF-IDF+Visual)*     0.3385      0.2975   0.2821
Our Model (all)*               0.3720      0.3108   0.2920

Table 3: Comparison between the macro-weighted averages of precision, recall, and F-score for the baselines and our model. * means statistically significant (t-test p-value < 0.05) when compared to baseline B3.

goal is to assign names to speakers, and not necessarily to solve this coreference problem, we consider the assignment of a subtitle segment to any of the speaker's aliases to be correct. Thus, during the evaluation, we map all the characters' aliases from our model's output to the names in the ground truth annotations. Our mapping does not include other referent nouns such as "Dad," "Buddy," etc.; if a segment gets assigned to any such terms, it is considered a misprediction.

We compare our model against three baselines:

B1: Most-frequently mentioned character consists of selecting the most frequently mentioned character in the dialogue as the speaker for all the subtitles. Even though it is a simple baseline, it achieves an accuracy of 27.1%, since the leading characters tend to speak the most in the movies.

B2: Distribution-driven random assignment consists of randomly assigning character names according to a distribution that reflects their fraction of mentions in all the subtitles.

B3: Gender-based distribution-driven random assignment consists of selecting the speaker names based on the voice-based gender detection classifier. This baseline randomly selects a character name that matches the speaker's gender, according to the distribution of mentions of the names in the matching gender category.

The results obtained with our proposed unified optimization framework and the three baselines are shown in Table 3. We also report the performance of the optimization framework using different combinations of the three modalities. The model that uses all three modalities achieves the best results, and outperforms the strongest baseline (B3) by more than 6% absolute in average

Figure 2: For each speech segment, we applied t-SNE (Van Der Maaten, 2014) to its corresponding iVector; points with the same color represent instances with the same character name. (a) The Big Bang Theory; (b) Titanic.

weighted F-score. It also significantly outperforms the combination of the visual and acoustic features, which have frequently been used together in previous work, suggesting the importance of textual features in this setting.

The ineffectiveness of the iVectors might be a result of the background noise and music, which are difficult to remove from the speech signal. Figure 2 shows a t-Distributed Stochastic Neighbor Embedding (t-SNE) (Van Der Maaten, 2014) visualization of the iVectors over the whole BBT show and the movie "Titanic"; t-SNE is a nonlinear dimensionality reduction technique that maps similar vectors to nearby points and dissimilar ones to distant points. In the BBT there is almost no background music or noise, while Titanic has background music in addition to background noise such as the screams of the drowning people. From the graph,


the difference in the quality of the iVector clusters at the different noise levels is clear.

Table 4 shows the effect of adding each component of our loss function to the initial loss L_initial. The performance of the model using only L_initial, without the other parts, is very low due to the sparsity of first person references and the errors introduced by the person reference classifier.

                                 Precision   Recall   F-score
L_initial                        0.0631      0.1576   0.0775
L_initial + L_gender             0.1160      0.1845   0.1210
L_initial + L_negative           0.0825      0.0746   0.0361
L_initial + L_distribution       0.1050      0.1570   0.0608
L_initial + L_MultipleInstance   0.3058      0.2941   0.2189

Table 4: Analysis of the effect of adding each component of the loss function to the initial loss.

In order to analyze the effect of the errors that several of the modules (e.g., the gender and name reference classifiers) propagate into the system, we also test our framework by replacing each one of these components with its ground truth information. As seen in Table 5, the results obtained in this setting show significant improvement with the replacement of each component in our framework, which suggests that additional work on these components will have positive implications on the overall system.

                        Precision   Recall   F-score
Our Model               0.3720      0.3108   0.2920
Voice Gender (VG)       0.4218      0.3449   0.3259
VG + Name Gender (NG)   0.4412      0.3790   0.3645
VG + NG + Name Ref      0.4403      0.3938   0.3748

Table 5: Comparison between versions of our model in which different components are replaced with their ground truth information.

7 Speaker Naming for Movie Understanding

Identifying speakers is a critical task for understanding the dialogue and storyline in movies. MovieQA is a challenging dataset for movie understanding. The dataset consists of 14,944 multiple choice questions about 408 movies. Each question has five answers, and only one of them is correct. The dataset is divided into train, validation, and test splits according to the movie titles; importantly, there are no overlapping movies between the splits. Table 6 shows examples of the questions and answers in the MovieQA dataset.

Figure 3: The diagram describing our Speaker-based Convolutional Memory Network (SC-MemN2N) model. (The diagram shows the subtitles with prepended speaker names embedded with word2vec and a convolutional layer, a speaker and mention mask derived from the characters related to the question, e.g., "Who sees Bruce and Selina together in Florence?" with candidate answers A1: Jones, A2: Gordon, A3: Alfred, A4: Fox, A5: Blake, and a weighted sum over the memory followed by a softmax that produces the prediction.)

The MovieQA 2017 Challenge3 consists of six different tasks, according to the source of information used to answer the questions. Given that for many of the movies in the dataset the videos are not completely available, we develop our initial system so that it relies only on the subtitles; we thus participate in the challenge's subtitles task, which includes the dialogue (without the speaker information) as the only source of information to answer the questions.

To demonstrate the effectiveness of our speaker naming approach, we design a model based on an end-to-end memory network (Sukhbaatar et al., 2015), namely the Speaker-based Convolutional Memory Network (SC-MemN2N), which relies on the MovieQA dataset and integrates the speaker naming approach as a component in the network. Specifically, we use our speaker naming framework to infer the name of the speaker for each segment of the subtitles, and prepend the predicted speaker name to each turn in the subtitles.4 To represent the movie subtitles, we represent each turn in the subtitles as the mean-pooling of a 300-dimension pretrained word2vec (Mikolov et al., 2013) representation of each word in the sentence. We similarly represent the input questions and their corresponding answers. Given a question, we use the SC-MemN2N memory to find an answer. For questions asking about specific characters, we keep the memory slots that have the characters in question as speakers or mentions, and mask out the rest of the memory slots. Figure 3 shows the architecture of our model.

3 http://movieqa.cs.toronto.edu/workshops/iccv2017/
4 We strictly follow the challenge rules, and only use text to infer the speaker names.
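A minimal sketch of how one memory slot can be built from a speaker-prefixed subtitle turn is shown below; this is not the authors' implementation, the embedding file path is a placeholder, and gensim is our choice of loader for the pretrained 300-d word2vec vectors.

```python
import re
import numpy as np
from gensim.models import KeyedVectors

# Path and loader are illustrative; any 300-d word2vec file in the standard
# binary format would do.
w2v = KeyedVectors.load_word2vec_format("word2vec_300d.bin", binary=True)

def memory_slot(speaker_name, subtitle_turn, dim=300):
    """Mean-pooled word2vec vector of the speaker-prefixed subtitle turn."""
    text = (speaker_name + " " + subtitle_turn).lower()
    tokens = re.findall(r"[a-z']+", text)
    vecs = [w2v[t] for t in tokens if t in w2v]   # skip out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

slot = memory_slot("Rose", "I'm flying, Jack.")
print(slot.shape)  # (300,)
```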


Movie: Fargo
Question: What did Mike's wife, as he says, die from?
Answers: A1: She was killed; A2: Breast cancer; A3: Leukemia (correct); A4: Heart disease; A5: Complications due to child birth

Movie: Titanic
Question: What does Rose ask Jack to do in her room?
Answers: A1: Sketch her in her best dress; A2: Sketch her nude (correct); A3: Take a picture of her nude; A4: Paint her nude; A5: Take a picture of her in her best dress

Table 6: Examples of questions and answers from the MovieQA benchmark; the correct answer to each question is marked.

Table 7 includes the results of our system on

the validation and test sets, along with the best systems introduced in previous work, showing that our SC-MemN2N achieves the best performance. Furthermore, to measure the effectiveness of adding the speaker names and masking, we test our model after removing the names from the network (C-MemN2N). As seen from the results, the gain of SC-MemN2N is statistically significant5

compared to a version of the system that does not include the speaker names (C-MemN2N). Figure 4 shows the performance of both the C-MemN2N and SC-MemN2N models by question type. The results suggest that our speaker naming helps the model better distinguish between characters, and that prepending the speaker names to the subtitle segments improves the ability of the memory network to correctly identify the supporting facts from the story that answer a given question.

Method                               val    test
SSCB-W2V (Tapaswi et al., 2016)      24.8   23.7
SSCB-TF-IDF (Tapaswi et al., 2016)   27.6   26.5
SSCB Fusion (Tapaswi et al., 2016)   27.7   -
MemN2N (Tapaswi et al., 2016)        38.0   36.9
Understanding visual regions         -      37.4
RWMN (Na et al., 2017)               40.4   38.5
C-MemN2N (w/o SN)                    40.6   -
SC-MemN2N (Ours)                     42.7   39.4

Table 7: Performance comparison for the subtitles task of the MovieQA 2017 Challenge on both the validation and test sets. We compare our models with the best existing models (from the challenge leaderboard).

8 Conclusion

In this paper, we proposed a unified optimization framework for the task of speaker naming

5 Using a t-test (p-value < 0.05) with 1,000 folds, each containing 20 samples.

Figure 4: Accuracy comparison according to question type.

in movies. We addressed this task under a difficult setup: without a cast list, without supervision from a script, and dealing with the complicated conditions of real movies. Our model includes textual, visual, and acoustic modalities, and incorporates several grammatical and acoustic constraints. Empirical experiments on a movie dataset demonstrated the effectiveness of our proposed method with respect to several competitive baselines. We also showed that an SC-MemN2N model that leverages our speaker naming model can achieve state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.

The dataset annotated with character names introduced in this paper is publicly available from http://lit.eecs.umich.edu/downloads.html.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments and suggestions. This work is supported by a Samsung research grant and by a DARPA grant HR001117S0026-AIDA-FP-045.


References

Ognjen Arandjelovic and Andrew Zisserman. 2005. Automatic face recognition for film character retrieval in feature-length films. In Computer Vision and Pattern Recognition (CVPR).

Martin Bauml, Makarand Tapaswi, and Rainer Stiefelhagen. 2013. Semi-supervised learning with constraints for person identification in multimedia data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Xavier Bost and Georges Linares. 2014. Constrained speaker diarization of TV series based on visual patterns. In IEEE Spoken Language Technology Workshop (SLT).

Herve Bredin and Gregory Gelly. 2016. Improving speaker diarization of TV series using talking-face detection and clustering. In Proceedings of the 24th Annual ACM Conference on Multimedia.

Joseph P. Campbell. 1997. Speaker recognition: A tutorial. Proceedings of the IEEE.

Timothee Cour, Benjamin Sapp, Akash Nagle, and Ben Taskar. 2010. Talking pictures: Temporal grouping and dialog-supervised person recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Martin Danelljan, Gustav Hager, Fahad Khan, and Michael Felsberg. 2014. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference (BMVC).

Najim Dehak, Patrick J. Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet. 2011. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing.

Engin Erzin, Yucel Yemez, and A. Murat Tekalp. 2005. Multimodal speaker identification using an adaptive classifier cascade based on modality reliability. IEEE Transactions on Multimedia.

Mark Everingham, Josef Sivic, and Andrew Zisserman. 2006. "Hello! My name is... Buffy" – automatic naming of characters in TV video. In BMVC.

Florian Eyben, Felix Weninger, Florian Gross, and Bjorn Schuller. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia.

Monica-Laura Haurilet, Makarand Tapaswi, Ziad Al-Halah, and Rainer Stiefelhagen. 2016. Naming TV characters by watching and analyzing dialogs. In IEEE Winter Conference on Applications of Computer Vision (WACV).

Yongtao Hu, Jimmy SJ Ren, Jingwen Dai, Chang Yuan, Li Xu, and Wenping Wang. 2015. Deep multimodal speaker naming. In Proceedings of the 23rd Annual ACM Conference on Multimedia.

Ioannis Kapsouras, Anastasios Tefas, Nikos Nikolaidis, and Ioannis Pitas. 2015. Multimodal speaker diarization utilizing face clustering information. In International Conference on Image and Graphics.

Vahid Kazemi and Josephine Sullivan. 2014. One millisecond face alignment with an ensemble of regression trees. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Soheil Khorram, John Gideon, Melvin McInnis, and Emily Mower Provost. 2016. Recognition of depression in bipolar disorder: Leveraging cohort and person-specific knowledge. In Interspeech.

Davis E. King. 2009. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems.

Sarah Ita Levitan, Taniya Mishra, and Srinivas Bangalore. 2016. Automatic identification of gender from speech. Speech Prosody.

Ying Li, Shrikanth S. Narayanan, and C.-C. Jay Kuo. 2004. Adaptive speaker identification with audiovisual cues for movie content analysis. Pattern Recognition Letters.

Chunxi Liu, Shuqiang Jiang, and Qingming Huang. 2008. Naming faces in broadcast news video by image google. In Proceedings of the 16th ACM International Conference on Multimedia.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations.

Benoit Mathieu, Slim Essid, Thomas Fillon, Jacques Prado, and Gaël Richard. 2010. Yaafe, an easy to use and efficient audio feature extraction software. In Proceedings of the 11th International Society for Music Information Retrieval Conference.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS).

Seil Na, Sangho Lee, Jisung Kim, and Gunhee Kim. 2017. A read-write memory network for movie story understanding. In International Conference on Computer Vision (ICCV).

O. M. Parkhi, A. Vedaldi, and A. Zisserman. 2015. Deep face recognition. In British Machine Vision Conference (BMVC).


F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research.

V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. 2014. Linking people with "their" names using coreference resolution. In European Conference on Computer Vision (ECCV).

Jimmy Ren, Yongtao Hu, Yu-Wing Tai, Chuan Wang, Li Xu, Wenxiu Sun, and Qiong Yan. 2016. Look, listen and learn: A multimodal LSTM for speaker identification. In Thirtieth AAAI Conference on Artificial Intelligence.

Douglas A. Reynolds. 2002. An overview of automatic speaker recognition technology. In Acoustics, Speech, and Signal Processing (ICASSP).

Josef Sivic, Mark Everingham, and Andrew Zisserman. 2009. "Who are you?" – learning person specific classifiers from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems (NIPS).

Makarand Tapaswi, Martin Bauml, and Rainer Stiefelhagen. 2012. "Knock! Knock! Who is it?" Probabilistic person identification in TV series. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Makarand Tapaswi, Martin Bauml, and Rainer Stiefelhagen. 2014. StoryGraphs: Visualizing character interactions as a timeline. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Makarand Tapaswi, Martin Bauml, and Rainer Stiefelhagen. 2015. Improved weak labels using contextual cues for person identification in videos. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Laurens Van Der Maaten. 2014. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research.

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters.

Yi-Fan Zhang, Changsheng Xu, Hanqing Lu, and Yeh-Min Huang. 2009. Character identification in feature-length films using global face-name matching. IEEE Transactions on Multimedia.
