
Towards Audio-Video Based Handwritten Mathematical Content Recognition in Classroom Videos

Smita Vemulapalli

Center for Signal & Image Processing

Georgia Institute of Technology, Atlanta, GA

[email protected]

Monson Hayes

Advanced Imaging Science, Multimedia and Film

Chung-Ang University, Seoul, Korea

[email protected]

Abstract

Recognizing handwritten mathematical content in classroom videos poses a range of interesting challenges. In this paper, we focus on improving the character recognition accuracy in such videos using a combination of video and audio based text recognizers. We propose a two-step assembly consisting of a video text recognizer (VTR) as the primary character recognizer and an audio text recognizer (ATR) for disambiguating, if needed, the output of the VTR. We propose techniques for (1) detecting ambiguity in the output of the VTR so that a combination with the ATR may be triggered only for ambiguous characters, (2) synchronizing the output of the two recognizers for enabling combination, and (3) combining the options generated by the two recognizers using measurement and rank based methods. We have implemented the system using an open source implementation of a character recognizer and a commercially available phonetic word-spotter. Through experiments conducted using video recorded in a classroom-like environment, we demonstrate the improvement in the character recognition accuracy that can be achieved using our approach.

1 Introduction

Recent years have witnessed a rapid increase in the number of e-learning and advanced learning initiatives that either use classroom videos as the primary medium of instruction or make them available online for reference by the students. As the volume of such recorded video content increases, it is amply clear that in order to efficiently navigate through the available classroom videos, there is a need for techniques that can help extract, identify and summarize the content in such videos. In this context, and given the fact that the whiteboard continues to be the preferred and effective medium for teaching complex mathematical and scientific concepts, this paper focuses on improving the character recognition accuracy associated with handwritten mathematical content in classroom videos.

Our solution for improving the character recognition accuracy makes use of a combination of video and audio based text recognizers. The video text recognizer (VTR) acts as the primary character recognizer, and the subsequent step of combination with the audio text recognizer (ATR) is triggered only for the VTR outputs that are determined to be ambiguous. While our research builds on advances made across several areas of signal processing research, such as the extraction and recognition of textual content from videos [12, 5], mathematical content recognition, speech recognition and classifier combination, our focus on the recognition of handwritten mathematical content from classroom videos, and the use of the accompanying audio content to assist in recognition, poses a range of new and interesting challenges. Specifically, to address the challenge of improving the character recognition accuracy using a combination of video and audio based text recognizers, we make the following contributions:

• Ambiguity Detection - In the proposed solution, triggering a combination with the audio for every output of the VTR may sometimes degrade the recognition accuracy due to errors in the ATR output. We propose techniques for restricting the combination to VTR outputs that have a higher chance of being erroneous.

• Audio-Video Synchronization - Synchronizing an occurrence of a character in the video to its occurrence in the audio stream is needed for enabling combination and can be a very challenging problem. We propose and evaluate a number of synchronization techniques.

• Audio-Video Combination - Determining the final output from the options generated by the VTR and the ATR is dependent on a large number of factors such as the option match scores, synchronization, character-specific VTR and ATR accuracies, etc. We investigate various measurement and rank-level combination techniques.

2 System Overview

The end-to-end recognition system (block diagram shown in Figure 1) performs handwritten mathematical content recognition from classroom videos in three distinct stages: the video preprocessing stage, the audio-video (A/V) based character recognition stage and the A/V based structure analysis stage.

Figure 1. System overview. The lower half of the figure depicts the intermediate output from various components of the system. Given an input video, the preprocessing stage outputs a set of segmented and timestamped characters that form the input to the character recognizer, which, for each segmented character, produces a set of video options. The ambiguity detection stage is used to determine the set of ambiguous video options, which undergo a combination with the audio options generated by the audio text recognizer and selected by the synchronization module to produce the final output. The tick marks denote the correct output.

The video preprocessing stage includes all processing required to extract text regions (here, mathematical content) from the video, segment characters using a component labeling algorithm and also generate information such as the timestamp and the location of each segmented character. The A/V based character recognition stage performs video based character recognition followed by audio-assisted character disambiguation. Similarly, the A/V based structure analysis stage (briefly described here for completeness) first performs video based structure analysis (example techniques include [13, 1]) and then audio-assisted structure disambiguation. In this paper, we focus almost exclusively on the character disambiguation component, which involves the three main tasks of ambiguity detection, A/V synchronization and A/V combination; these are discussed in detail in the following sections.

3 Mathematical Model

Data Set. Let $C$ represent the character set recognizable by the VTR and $S$ be the set of segmented character elements $s$ from the given test set of videos, such that each element $s = (s^i, s^t, s^l)$ contains the segmented character's image $s^i$, timestamp $s^t$ and location coordinates $s^l$, and let $G$ be a function that returns its ground truth $G(s) \in C$. For future use, we define an Equals function $E(c_1, c_2)$ which maps to 1 for $c_1 = c_2$ and 0 otherwise. Here, $c_1, c_2 \in C$.

VTR. For any segmented character $s$, the VTR function $V$ is defined as $V(s) = \left((V^c_j(s), V^p_j(s)) \mid \forall\, V^c_j(s) \in C \wedge V^p_j(s) \in [0, 1]\right)$. The function $V(s)$ returns an ordered set of video options arranged in decreasing order of the score $V^p_j(s)$. Here, $V^c_j(s)$ represents the $j$th video option's recognized character name and $V^p_j(s)$ is the corresponding match score generated by the recognizer. In the absence of other inputs, $V^c_1(s)$ is the final recognized output for the segmented character $s$.

ATR. The output of the ATR for the $j$th input video option $V_j(s)$ is an ordered set of audio options and is represented as $A_j(s) = (A_{j,k}(s))$, where $A_{j,k}(s)$ is the $k$th audio option corresponding to the $j$th video option. Each audio option can further be represented as $A_{j,k}(s) = \left((A^c_{j,k}(s), A^p_{j,k}(s), A^t_{j,k}(s)) \mid A^c_{j,k}(s) \in C \wedge A^p_{j,k}(s) \in [0, 1]\right)$. Here, $A^c_{j,k}(s)$ represents the character name of the $j$th video option's $k$th audio option (which is the same as $V^c_j(s)$ in our implementation), $A^p_{j,k}(s)$ is the corresponding match score generated by the ATR, and $A^t_{j,k}(s)$ is the corresponding time of occurrence in the audio segment. The above set is arranged in decreasing order of the score $A^p_{j,k}(s)$.
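
To make the notation above concrete, the following sketch (in Python, with hypothetical class and field names chosen only to mirror the symbols in this section) shows one way the segmented characters and the VTR/ATR options could be represented; it is an illustration, not the data structures used in our implementation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SegmentedChar:
    # s = (s^i, s^t, s^l): image, video timestamp (seconds), location in the frame
    image: object                       # s^i, e.g. a binary image patch
    timestamp: float                    # s^t
    location: tuple                     # s^l, (x, y) coordinates
    ground_truth: Optional[str] = None  # G(s), available only for training data

@dataclass
class VideoOption:
    # j-th VTR option: (V^c_j(s), V^p_j(s)), match score in [0, 1]
    char: str
    score: float

@dataclass
class AudioOption:
    # k-th ATR option for the j-th video option: (A^c_{j,k}, A^p_{j,k}, A^t_{j,k})
    char: str
    score: float
    time: float                         # time of occurrence in the audio segment (seconds)

def equals(c1: str, c2: str) -> int:
    # The Equals function E(c1, c2): 1 if the two characters match, 0 otherwise.
    return 1 if c1 == c2 else 0

def sort_options(options):
    # Both V(s) and A_j(s) are kept in decreasing order of match score.
    return sorted(options, key=lambda o: o.score, reverse=True)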

Ambiguity Detection. The ambiguity detection function $D(S) = S_D \subset S$ is designed to return $S_D$, a subset of characters from the set $S$ that are categorized as ambiguous based on various conditions imposed on the VTR output $V$. The set of non-ambiguous characters is $S - S_D$. Methods for ambiguity detection are described in Section 4.

Synchronization. The A/V synchronization function $Y$ is defined as $Y(s) = (A_{j,k'}(s))$, where $k'$ is the index of the audio option that we consider to be best synchronized with the $j$th video option of the segmented character $s$. Section 5 explains some synchronization techniques and also shows how $k'$ may be determined.

Combination. The A/V combination function $Z$ is defined as $Z(s) = V^c_{j'}(s) \in C$, where $j'$ is the index of the video option that is considered to be correct by the system at the end of the A/V combination process. Section 6 describes how $j'$ may be computed using simple rank and measurement-level combination techniques.

Evaluation. Finally, the character recognition accuracy $\alpha$ of the end-to-end system, computed on the entire test set $S$ with A/V combination employed on the set of ambiguous characters $S_D$ and purely video based character recognition employed on the set of non-ambiguous characters $S - S_D$, can be expressed as:

$$\alpha(S) = \frac{|S - S_D| \times \alpha_V(S - S_D) + |S_D| \times \alpha_Z(S_D)}{|S|} \qquad (1)$$

The character recognition accuracy of the VTR for the set of non-ambiguous characters $S - S_D$ is represented as $\alpha_V(S - S_D)$ and it can be computed as:

$$\alpha_V(S - S_D) = \frac{1}{|S - S_D|} \sum_{\forall s \in S - S_D} E(V^c_1(s), G(s)) \qquad (2)$$

The character recognition accuracy $\alpha_Z$ of the A/V combination based system for the set of ambiguous characters $S_D$ can be expressed as:

$$\alpha_Z(S_D) = \frac{1}{|S_D|} \sum_{\forall s \in S_D} E(Z(s), G(s)) \qquad (3)$$
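
As an illustration of Eqs. (1)-(3), a minimal evaluation sketch is given below; the helpers vtr_top1(s), combine(s) and is_ambiguous(s) are hypothetical stand-ins for $V^c_1(s)$, $Z(s)$ and membership in $S_D$.

def alpha_v(chars, vtr_top1):
    # Eq. (2): VTR accuracy on the non-ambiguous set S - S_D
    if not chars:
        return 0.0
    return sum(1 for s in chars if vtr_top1(s) == s.ground_truth) / len(chars)

def alpha_z(chars, combine):
    # Eq. (3): accuracy of the A/V combination Z(s) on the ambiguous set S_D
    if not chars:
        return 0.0
    return sum(1 for s in chars if combine(s) == s.ground_truth) / len(chars)

def alpha(all_chars, is_ambiguous, vtr_top1, combine):
    # Eq. (1): end-to-end accuracy, weighting the two subsets by their sizes
    ambiguous = [s for s in all_chars if is_ambiguous(s)]
    clear = [s for s in all_chars if not is_ambiguous(s)]
    return (len(clear) * alpha_v(clear, vtr_top1)
            + len(ambiguous) * alpha_z(ambiguous, combine)) / len(all_chars)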

4 Ambiguity Detection

Ambiguity detection involves detecting characters whose VTR output has a higher chance of being erroneous and passing their video options (a selected subset) to the A/V synchronization and A/V combination (AVC) components for improving the recognition accuracy. Our aim here is to increase the possibility of correcting VTR errors as well as to limit errors that may be introduced at the combination stage. Ambiguity detection can be divided into the three main tasks of mapping, thresholding and option selection, which are described below.

4.1 Mapping

When the VTR-generated video option scores are not the best basis for ambiguity detection and classifier combination, we may map (rescore) these values to a new set of values. Some commonly used score normalization techniques are discussed in [7].

Score to conditional probability mapping. Assume we are rescoring a segmented character $s$ and consider the $i$th video option $(V^c_i(s) = X, V^p_i(s) = Y)$. The new score, $V^{p'}_i(s)$, is the conditional probability that the segmented character $s$ is $X$ given that the VTR score corresponding to $X$ is $Y$,

$$V^{p'}_i(s) = \mathrm{Prob}\left(G(s) = X \mid V^p_i(s) = Y\right) \qquad (4)$$

which is estimated from a video training set $S_T$ as follows:

$$V^{p'}_i(s) = \frac{\left|\{s \in S_T \mid \exists j : G(s) = X \wedge V^c_j(s) = X \wedge V^p_j(s) = Y\}\right|}{\left|\{s \in S_T \mid \exists k : V^c_k(s) = X \wedge V^p_k(s) = Y\}\right|} \qquad (5)$$

When the training set is limited, we compute $V^{p'}_i(s)$ for sub-ranges of values instead of specific values of $Y$.
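
A minimal sketch of this rescoring, assuming the training set is enumerated as one (recognized character, VTR score, ground truth) triple per video option of every training character, and using score sub-ranges (bins) as in the limited-training-set case:

from collections import defaultdict

def build_score_to_prob_map(training_samples, num_bins=10):
    # training_samples: iterable of (recognized_char, vtr_score, ground_truth),
    # one triple per video option of each training character.
    # Returns a table mapping (char, score_bin) -> estimate of P(G(s)=char | score in bin).
    counts = defaultdict(lambda: [0, 0])          # (char, bin) -> [matches, total]
    for char, score, truth in training_samples:
        b = min(int(score * num_bins), num_bins - 1)
        counts[(char, b)][1] += 1
        if truth == char:
            counts[(char, b)][0] += 1
    return {key: m / t for key, (m, t) in counts.items() if t > 0}

def rescore(char, score, prob_map, num_bins=10, default=0.5):
    # Map a raw VTR score to the estimated conditional probability of Eq. (4).
    b = min(int(score * num_bins), num_bins - 1)
    return prob_map.get((char, b), default)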

4.2 Thresholding

Simple Threshold Based Methods. Simple threshold techniques are used to determine if a segmented character $s \in S$ is ambiguous or not based on one or more thresholding conditions that are applied to all the characters. The use of an absolute threshold $T_1$ and a relative threshold $T_2$ are given in Eqs. (6) and (7), respectively, as follows:

$$D(S, V) = \{s \in S \mid V^p_1(s) < T_1\} \qquad (6)$$

$$D(S, V) = \{s \in S \mid V^p_2(s) / V^p_1(s) > T_2\} \qquad (7)$$
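
The two simple thresholding rules of Eqs. (6) and (7) can be sketched as follows; options_of(s) is a hypothetical accessor returning the score-sorted video options of a segmented character, and the threshold defaults are illustrative:

def ambiguous_by_absolute_threshold(chars, options_of, t1=0.90):
    # Eq. (6): ambiguous if the top video option's score falls below T1
    return [s for s in chars if options_of(s)[0].score < t1]

def ambiguous_by_relative_threshold(chars, options_of, t2=0.90):
    # Eq. (7): ambiguous if the second-best score is close to the best score
    ambiguous = []
    for s in chars:
        opts = options_of(s)
        if len(opts) > 1 and opts[1].score / opts[0].score > t2:
            ambiguous.append(s)
    return ambiguous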

Character-Specific Threshold Based Methods. To exploit the fact that different characters are recognized with different accuracies, we perform thresholding using a different threshold value $T(V^c_1(s))$ for each character, as follows:

$$D(S, V) = \{s \in S \mid V^p_1(s) < T(V^c_1(s))\} \qquad (8)$$

For each character $c \in C$, the character-specific threshold $T(c)$ is found that maximizes the character-level accuracy of the end-to-end character recognition system $\alpha(S(c), T)$, computed over the set $S(c)$ which consists of all characters in the training set $S_T$ whose top video option is $c$, i.e., $S(c) = \{s \in S_T \mid V^c_1(s) = c\}$. For mathematical simplicity, we assume that only the top $N$ options are passed to the AVC. We calculate $\alpha(S(c), T)$ as the weighted sum of the accuracies of the four disjoint subsets of $S(c)$, namely $TP(c)$, $FN(c)$, $TN(c)$ and $FP(c)$. The set of True Positives $TP(c)$ contains those characters that are correctly recognized by the VTR and also tagged as non-ambiguous and not passed to the AVC; the accuracy for this subset is therefore 1. The set of False Negatives $FN(c)$ refers to those characters that are correctly recognized by the VTR but tagged as ambiguous, and the set of True Negatives $TN(c)$ refers to those that are not correctly recognized by the VTR but contain the correct option within the top $N$ options and are tagged as ambiguous. Both these sets are passed to the AVC and have an $\alpha^N_{F'}$ chance of finally getting the correct video option chosen. Finally, the set of False Positives $FP(c)$ refers to those that are not correctly recognized but have the correct video option within the top $N$ options and have been tagged as non-ambiguous, and are therefore not passed to the AVC. Here, $\alpha^N_{F'}$ represents the constant recognition accuracy of the AVC, assuming that it is given a set of $N$ options out of which one is the correct option. Those characters for which the correct option is not found in the top $N$ options cannot be corrected by our system, as they are not passed to the AVC; good mapping and option selection techniques reduce the number of characters that fall in this category. We calculate $\alpha(S(c), T)$ and the character-specific threshold $T(c)$ as follows:

$$\alpha(S(c), T) = \frac{1 \times |TP(c)| + \alpha^N_{F'} \times (|FN(c)| + |TN(c)|) + 0 \times |FP(c)|}{|TP(c)| + |FN(c)| + |TN(c)| + |FP(c)|} \qquad (9)$$

$$T(c) = \arg\max_T \left(\alpha(S(c), T)\right) \qquad (10)$$
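
The threshold search of Eqs. (8)-(10) can be sketched as a sweep over candidate thresholds, scoring each with Eq. (9). The constant alpha_avc stands for $\alpha^N_{F'}$, the helpers top1_score(s) and rank_of_truth(s) are hypothetical (the latter returns the 1-based position of the ground truth among the video options, or None if absent), and the default parameter values are illustrative only.

def accuracy_for_threshold(chars_claiming_c, top1_score, rank_of_truth, t, n_opts, alpha_avc):
    # Eq. (9): weighted accuracy over TP, FN, TN and FP for one candidate threshold t,
    # computed over S(c), the training characters whose top video option is c.
    tp = fn = tn = fp = 0
    for s in chars_claiming_c:
        r = rank_of_truth(s)
        correct_top1 = (r == 1)
        in_top_n = r is not None and r <= n_opts
        ambiguous = top1_score(s) < t            # Eq. (8) with T(c) = t
        if correct_top1 and not ambiguous:
            tp += 1
        elif correct_top1 and ambiguous:
            fn += 1
        elif in_top_n and ambiguous:
            tn += 1
        elif in_top_n and not ambiguous:
            fp += 1
        # characters whose correct option lies outside the top N cannot be corrected
    total = tp + fn + tn + fp
    return 0.0 if total == 0 else (tp + alpha_avc * (fn + tn)) / total

def character_specific_threshold(chars_claiming_c, top1_score, rank_of_truth,
                                 n_opts=3, alpha_avc=0.6, candidates=None):
    # Eq. (10): pick the threshold that maximizes Eq. (9) for this character.
    candidates = candidates or [i / 100.0 for i in range(101)]
    return max(candidates,
               key=lambda t: accuracy_for_threshold(chars_claiming_c, top1_score,
                                                    rank_of_truth, t, n_opts, alpha_avc))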

4.3 Option Selection

For each ambiguous character, we need to find the right subset of video options that should be passed to the AVC. One needs to determine a subset which has a high chance of containing the correct output, but limits the possibility of an incorrect recognition by the AVC. Some simple option selection techniques include selecting the top $NumOpt$ options and selecting options whose absolute (or relative) score is greater than an absolute threshold $AbsThr$ (or a relative threshold $RelThr$). These different option selection techniques can be expressed as:

$$O(s) = \{c \in C \mid \exists x : V^c_x(s) = c \wedge x \leq NumOpt\} \qquad (11)$$

$$O(s) = \{c \in C \mid \exists x : V^c_x(s) = c \wedge V^p_x(s) > AbsThr\} \qquad (12)$$

$$O(s) = \{c \in C \mid \exists x : V^c_x(s) = c \wedge V^p_x(s) / V^p_1(s) > RelThr\} \qquad (13)$$
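
A sketch of the three option selection rules of Eqs. (11)-(13), operating on a score-sorted list of video options (the threshold values shown are illustrative, not the ones used in our experiments):

def select_top_n(options, num_opt=3):
    # Eq. (11): keep the NumOpt highest-scoring video options
    return [o.char for o in options[:num_opt]]

def select_by_absolute_score(options, abs_thr=0.75):
    # Eq. (12): keep options whose match score exceeds AbsThr
    return [o.char for o in options if o.score > abs_thr]

def select_by_relative_score(options, rel_thr=0.85):
    # Eq. (13): keep options whose score is close enough to the top score
    if not options:
        return []
    top = options[0].score
    return [o.char for o in options if o.score / top > rel_thr]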

5 A/V Synchronization

A/V synchronization, which takes place in the combination stage, refers to the identification of the audio segment in the classroom video that corresponds to the handwritten content. The factors that affect synchronization are described below, followed by a description of the synchronization techniques.

5.1 Factors Affecting Synchronization

Video Timestamping Accuracy. The video timestamping accuracy is a measure of how often the automatically generated video timestamps for the characters fall within a preset time window around the manually generated video timestamps. The manually generated video timestamp refers to the instant when the character is fully visible to an observer. The timestamping accuracy depends on the quality of the preprocessing/timestamping algorithms and tends to deteriorate due to shadows (excluding the occlusion) caused by the instructor standing in front of the whiteboard.

A/V Recording Alignment Factor. The A/V recording alignment factor is a measure of how well the audio and video in the recording are aligned. It is inversely proportional to the average time difference between when the character is first observed by the student and when the corresponding audio is spoken by the instructor. This factor depends on the time difference between the writing and the speaking of the characters and on the amount of occlusion.

VTR Accuracy. The characters in the video that are in the temporal vicinity of the character under investigation are referred to as its neighbors. VTR accuracy, which determines the reliability of the neighbors, is very important as the neighbors can be used to synchronize the video occurrence of a character to its audio occurrence.

ATR Accuracy. The ATR may produce false positives or false negatives that increase the synchronization error. False positives, which are frequent for short search strings, refer to hits generated by the ATR when the character has not been spoken. False negatives refer to the case when the ATR is unable to find the correct audio occurrence of the character.

5.2 Synchronization Techniques

We now describe the set of audio features that form the basis for the proposed synchronization techniques.

F1 - the audio match score generated by the ATR. The pruning threshold used for this feature is represented as $P_P$.

F2 - the automatic video timestamp minus the automatic audio timestamp. The audio pruning window used is represented as $[P_B, P_A]$, where $P_B$ and $P_A$ refer to the amount of time before (usually a negative value) and after the audio timestamp of the audio option under consideration.

F3 - the number of video neighbors from a window $[N_B, N_A]$ that are also found in an audio time window $[W_B, W_A]$, where $N_B$ and $N_A$ refer to the number of neighbors before and after the video option under consideration, and $W_B$ and $W_A$ refer to the amount of time before and after the audio timestamp of the audio option under consideration.

A/V Time Difference Based Technique. This technique operates by first performing a simple pruning of the audio options based on features F1 and F2, followed by a selection of the audio option with the lowest value for F2. Mathematically, this is represented as:

$$Y(s) = \left((A_{j,k'}(s)) \mid k' = \arg\min_k\left(F2(A_{j,k}(s))\right) \wedge F1(A_{j,k}(s)) > P_P \wedge P_B \leq F2(A_{j,k}(s)) \leq P_A\right) \qquad (14)$$
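
A sketch of Eq. (14): audio options are first pruned on F1 (ATR match score) and F2 (video timestamp minus audio timestamp), and the surviving option with the lowest F2 is selected. The parameter defaults below are illustrative, not the values used in our experiments.

def f2(video_time, audio_opt):
    # F2: automatic video timestamp minus automatic audio timestamp
    return video_time - audio_opt.time

def sync_by_time_difference(video_time, audio_options, p_p=0.5, p_b=-6.0, p_a=20.0):
    # Eq. (14): prune on F1 > P_P and P_B <= F2 <= P_A, then take the argmin of F2.
    pruned = [a for a in audio_options
              if a.score > p_p and p_b <= f2(video_time, a) <= p_a]
    if not pruned:
        return None
    return min(pruned, key=lambda a: f2(video_time, a))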

Basic A/V Neighbor Based Technique. This technique, like the previous one, performs a simple pruning followed by a selection of the audio option with the highest value for F3. This technique may be described mathematically as:

$$Y(s) = \left((A_{j,k'}(s)) \mid k' = \arg\max_k\left(F3(A_{j,k}(s))\right) \wedge F1(A_{j,k}(s)) > P_P \wedge P_B \leq F2(A_{j,k}(s)) \leq P_A\right) \qquad (15)$$

Feature Rank Sum Based Technique. After an initial pruning, feature ranks $(R_1, R_2, R_3)$ are assigned to each audio option based on the rank of each of its audio features within the pruned set. For instance, $R_1 = 1$ for an audio option indicates that its feature F1 has the maximum value in the pruned set. After rank assignment, the audio option with the minimum value of the rank sum, $R_1 + R_2 + R_3$, is chosen as the output. $R_2$ and $R_3$ may be used to break ties.
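
The rank-sum selection can be sketched as follows, with the same F1/F2 pruning; the direction of the F2 ranking (smaller time difference ranks better) and the way the neighbor count F3 is supplied (as a callable per option) are assumptions made for the sketch.

def sync_by_rank_sum(video_time, audio_options, f3_of, p_p=0.5, p_b=-6.0, p_a=20.0):
    # Select the pruned audio option with the minimum rank sum R1 + R2 + R3.
    # f3_of: callable returning the neighbor count F3 for a given audio option.
    pruned = [a for a in audio_options
              if a.score > p_p and p_b <= (video_time - a.time) <= p_a]
    if not pruned:
        return None
    # Rank 0 = best here. F1: larger ATR score is better; F2: smaller time
    # difference is better (assumed direction); F3: more confirmed neighbors is better.
    by_f1 = sorted(pruned, key=lambda a: a.score, reverse=True)
    by_f2 = sorted(pruned, key=lambda a: abs(video_time - a.time))
    by_f3 = sorted(pruned, key=lambda a: f3_of(a), reverse=True)
    def ranks(ordering):
        return {id(a): r for r, a in enumerate(ordering)}
    r1, r2, r3 = ranks(by_f1), ranks(by_f2), ranks(by_f3)
    # Ties on the rank sum are broken with R2 and then R3, as described above.
    return min(pruned, key=lambda a: (r1[id(a)] + r2[id(a)] + r3[id(a)],
                                      r2[id(a)], r3[id(a)]))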


Table 1. Ambiguity detection techniques

Mapping | Thresholding       | Option Selection | System Accuracy (%)
No      | No                 | NumOpt=1         | 57.5
No      | No                 | NumOpt=all       | 53.9
No      | Simple (T1=0.90)   | NumOpt=3         | 60.4
No      | Simple (T1=0.90)   | RelThr=0.85      | 62.9
No      | Relative (T2=0.90) | RelThr=0.90      | 64.1
Yes     | Simple (T1=0.90)   | RelThr=0.90      | 65.2

6 A/V Combination

After A/V synchronization, each video option has at most one corresponding audio option. The A/V combination stage generates the final recognized output based on the match scores generated by the two recognizers as well as the computed audio features. Classifier combination approaches have also been applied in several related contexts, for instance, to improve the recognition accuracy of handwriting recognizers [11], speech recognizers [3] and a combination of the two [6]. We have experimented with several rank-level and measurement-level decision making/combination techniques [9], discussed below.

6.1 Combination Techniques

Rank Based Techniques. We have used the rank sum (Borda count [9]), accompanied by a suitable tie-breaking strategy, for the purpose of combination. One of the advantages of using a rank based technique is that there is no need to compute weights for the audio and video components or to normalize the scores generated by different classifiers.
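
A minimal Borda-count sketch for this rank-level combination: each recognizer ranks the candidate characters, the ranks are summed, and the character with the lowest total wins. The tie-breaking policy here (prefer the video rank) is an assumption, one example of a suitable strategy.

def borda_combine(video_chars_ranked, audio_chars_ranked):
    # video_chars_ranked and audio_chars_ranked: candidate characters ordered
    # best-first by the VTR and by the synchronized ATR, respectively.
    v_rank = {c: r for r, c in enumerate(video_chars_ranked)}
    a_rank = {c: r for r, c in enumerate(audio_chars_ranked)}
    worst_a = len(audio_chars_ranked)
    # Lower rank sum is better; ties are broken by the video rank (assumed policy).
    return min(v_rank, key=lambda c: (v_rank[c] + a_rank.get(c, worst_a), v_rank[c]))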

Classifier-Specific Weight Based Techniques. Apart from the simple sum rule, we have made use of a set of classifier-specific weights $[w_V, w_A]$ to combine the scores generated by each of the recognizers. We have used the accuracy values of the two classifiers as their weights for combination.

$$Z(s) = \left(V^c_{j'}(s) \mid j' = \arg\max_j\left(w_V \times V^p_j(s) + w_A \times A^p_{j,k'}(s)\right) \wedge V^c_j(s) \in O(s)\right) \qquad (16)$$

Character-Specific Weight Based Techniques. As the accuracy of each of the classifiers (in this case, a single VTR and a single ATR) may be significantly different for each character, the use of a different set of weights $[w_V(c), w_A(c)]$ for every character has the potential to further improve the recognition accuracy.

$$Z(s) = \left(V^c_{j'}(s) \mid j' = \arg\max_j\left(w_V(V^c_j(s)) \times V^p_j(s) + w_A(A^c_{j,k'}(s)) \times A^p_{j,k'}(s)\right) \wedge V^c_j(s) \in O(s)\right) \qquad (17)$$
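
A sketch of the measurement-level combinations of Eqs. (16) and (17); pairs is assumed to hold the (video option, synchronized audio option) pairs that survive option selection, with None standing in for a video option that has no synchronized audio option, and the weight values and lookups are placeholders.

def combine_classifier_weights(pairs, w_v=0.55, w_a=0.45):
    # Eq. (16): fixed classifier-specific weights applied to the two match scores;
    # a missing audio option contributes a zero audio score (an assumption).
    def score(pair):
        v, a = pair
        return w_v * v.score + (w_a * a.score if a is not None else 0.0)
    return max(pairs, key=score)[0].char

def combine_character_weights(pairs, w_v_of, w_a_of):
    # Eq. (17): per-character weights, e.g. looked up from the character-level
    # accuracies of the VTR and the ATR measured on a training set.
    def score(pair):
        v, a = pair
        audio_part = w_a_of(a.char) * a.score if a is not None else 0.0
        return w_v_of(v.char) * v.score + audio_part
    return max(pairs, key=score)[0].char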

7 Experimental Results

Implementation & Setup. While the system is not tied to a specific character recognizer or speech recognizer, our current implementation uses an extended version of the GNU Optical Character Recognition (GOCR) [4] tool and the Nexidia word spotter [8]. Our video data set [2] was recorded in a classroom-like environment and consists of fairly large training and test sets with more than 6000 characters from more than 100 videos. The character set includes uppercase and lowercase letters, digits and basic algebraic operators.

Table 2. Synchronization techniques

Technique  | Pruning [PB, PA] | Audio Neighbor [WB, WA] | Sync. Acc. Good RA (%) | Sync. Acc. Poor RA (%)
Time Diff. | [-1,5]           | -                       | 86.4                   | 23.1
Time Diff. | [-6,20]          | -                       | 77.2                   | 41.0
Neighbor   | [-6,20]          | [-4,4]                  | 63.6                   | 28.2
Neighbor   | [-6,20]          | [-8,8]                  | 65.9                   | 43.5
Rank Sum   | [-1,5]           | [-8,8]                  | 86.3                   | 23.1
Rank Sum   | [-6,20]          | [-8,8]                  | 81.8                   | 46.1

Ambiguity Detection. Table 1 lists some combinations of mapping, thresholding and option selection techniques that were evaluated [10]. While the results are largely self-explanatory, some things to note are as follows. The baseline system (VTR) has an end-to-end character recognition accuracy of 57.5%, and this degrades to 53.9% if all of the characters are passed through the A/V combination system with all of the options. A simple thresholding technique shows a significant improvement in system accuracy (60.4%), and the use of a relative threshold for option selection is much better than using a fixed number of options. Also, relative thresholding appears to give better performance for this set of videos. Finally, we can see that the best system accuracy (65.2%) is obtained when using the conditional probability mapping technique described in Section 4.1.

Synchronization. Figure 2 plots the difference between the automatic video timestamp $TS^a_V$ and the manual video timestamp $TS^m_V$ against the character index (the character's position in an arrangement ordered by $TS^a_V$) for a single video file with good recording alignment (RA), using two different timestamping techniques, one good and the other poor in terms of timestamping accuracy. The poor timestamping technique used a single binarization threshold during text extraction and timestamping. As a result, it performs well in the first half of the video (upper half of the whiteboard), where there were no shadows caused by the instructor; but in the second half of the video, due to the shadows (excluding occlusions) caused by the instructor, the single threshold is insufficient. The good timestamping technique, on the other hand, makes use of a more sophisticated multiple threshold based text extraction technique and is therefore able to perform well even in the presence of shadows.

Figure 3 plots the difference between the manual video timestamp $TS^m_V$ and the manual audio timestamp $TS^m_A$ against the character index for two separate audio files (with good video timestamping), one with good RA and the other with poor RA. This time difference is significantly larger in the case of poor RA.

The performance of the synchronization techniques proposed in this paper, using videos with good and poor RA, is shown in Table 2.


Figure 2. Timestamping: the difference $TS^a_V - TS^m_V$ (sec) plotted against the character index, for poor and good timestamping.

The videos were timestamped using a good timestamping technique. For the purpose of evaluation, we considered a character to be correctly synchronized if $|TS^m_A - TS^a_A| \leq 2$, where $TS^a_A$ is the automatic audio timestamp determined after synchronization. First, let us compare the synchronization accuracies of the proposed techniques for a pruning window $[P_B, P_A]$ = [-6,20], number of neighbors considered $[N_B, N_A]$ = [2,2] and a neighbor audio window $[W_B, W_A]$ = [-8,8]. As expected, the time difference based technique (77.2%) outperforms the neighbor based technique (65.9%) for good RA, and the neighbor based technique (43.5%) outperforms the time difference based technique (41.0%) for poor RA, while the feature rank sum based technique outperforms the other two techniques for both good RA (81.8%) and poor RA (46.1%). We therefore use the feature rank sum based technique for synchronization in the classifier combination experiments reported next. Also, for good RA and good timestamping, when using techniques based on the time difference or the feature rank sum (which includes the time difference feature), a small pruning window $[P_B, P_A]$ = [-1,5] can prove to be very beneficial (86.3%), but for poor RA the accuracy greatly deteriorates (23.1%). For neighbor based techniques, note that the size of the window $[W_B, W_A]$ is also crucial, and the synchronization accuracy deteriorates if this window is too small.

Combination. The A/V combination techniques described in Section 6.1 were implemented and evaluated, and the results are tabulated in Table 3. For each of the A/V combination techniques, we perform ambiguity detection with a relative threshold of 0.9 for thresholding and a relative threshold of 0.9 for option selection. Note that a significant part of the improvement observed in the system accuracy can be attributed to the ambiguity detection techniques that precede A/V combination. The character-specific weighted sum technique with the weights $w_V$ = 0.55 and $w_A$ = 0.45 shows a 17.9% (relative) improvement compared to the VTR.

Table 3. Combination techniques

Technique                   | System Accuracy (%)
Video Only                  | 57.5
Rank Sum                    | 65.1
Classifier-Specific Weights | 67.4
Character-Specific Weights  | 67.8

Figure 3. Alignment: $|TS^m_V - TS^m_A|$ (sec) plotted against the character index, for poor and good A/V recording alignment.

8 Conclusions & Future Work

We have proposed, implemented and evaluated techniques for ambiguity detection, A/V synchronization and A/V combination. Going forward, we plan to extend these by taking into account the penalties associated with incorrect ambiguity detection, improving the set of audio features and implementing an intelligent neighbor based synchronization. As much of this research relies on the existence of large labeled data sets, we plan to extend our data set with videos recorded by different subjects. Finally, we intend to explore the use of audio information to disambiguate the output of the structure analysis component.

References

[1] R. H. Anderson. Two-dimensional mathematical notations. In Syntactic Pattern Recognition Applications (K. S. Fu, ed.), 1977.
[2] Classroom Videos Dataset. http://users.ece.gatech.edu/~smita/dataset/.
[3] J. Fiscus. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In ASRU, 1997.
[4] GOCR. http://jocr.sourceforge.net/.
[5] L.-W. He and Z. Zhang. Real-time whiteboard capture and processing using a video camera for remote collaboration. IEEE Transactions on Multimedia, 9(1), 2007.
[6] J. Hunsinger and M. Lang. A speech understanding module for a multimodal mathematical formula editor. Proc. Int. Conf. on Acoust., Speech, Signal Proc., 2000.
[7] A. Jain et al. Score normalization in multimodal biometric systems. Pattern Recognition, 38(12), 2005.
[8] Nexidia. http://www.nexidia.com/.
[9] S. Tulyakov et al. Review of classifier combination methods. In Studies in Computational Intelligence: Machine Learning in Document Analysis and Recognition, 2008.
[10] S. Vemulapalli and M. Hayes. Ambiguity detection methods for improving handwritten mathematical character recognition accuracy in classroom videos. Proc. 17th Int. Conf. on DSP, 2011.
[11] W. Wang et al. Combination of multiple classifiers for handwritten word recognition. IWFHR, 2002.
[12] M. Wienecke et al. Toward automatic video-based whiteboard reading. IJDAR, 7(2-3), 2005.
[13] R. Zanibbi et al. Recognizing mathematical expressions using tree transformation. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(11), 2002.
