
Synchronization and Combination Techniques for Audio-Video Based Handwritten Mathematical Content Recognition in Classroom Videos

Smita Vemulapalli
Center for Signal & Image Processing

Georgia Institute of Technology, Atlanta, GA, USA

[email protected]

Monson Hayes
Advanced Imaging Science, Multimedia and Film

Chung-Ang University, Seoul, Korea

[email protected]

Abstract—Recognizing handwritten mathematical content is a challenging problem, and more so when such content appears in classroom videos. However, given the fact that in such videos the handwritten text and the accompanying audio refer to the same content, a combination of a video and an audio based recognizer has the potential to significantly improve the content recognition accuracy. In this paper, using a combination of video and audio based recognizers, we focus on improving the character recognition accuracy in such videos and propose: (1) synchronization techniques for establishing a correspondence between the handwritten and the spoken content, and (2) combination techniques for combining the outputs of the video and audio based recognizers. The current implementation of the system makes use of a modified open source text recognizer and a commercially available phonetic word-spotter. For evaluation purposes, we use videos recorded in a classroom-like environment and our experiments demonstrate the significant improvements (≈ 24% relative increase as compared to the baseline video based recognizer) in character recognition accuracy that can be achieved using our techniques.

Keywords-handwriting recognition, speech recognition, audio-video classifier combination

I. INTRODUCTION

Driven by recent technological advances, and multiple social and economic factors, e-learning and distance education initiatives have witnessed a rapid proliferation and acceptance into the mainstream. Such e-learning initiatives often rely on pre-recorded or live videos from a classroom as the primary means of delivering the content to the students, and the videos, almost invariably, are also made available online for reference by the students. As the volume of such recorded video content increases, efficient navigation through this content is becoming an important concern, which motivates the need for techniques that can be used to index and summarize the content in such videos. The techniques proposed in this paper, given the fact that mathematical and scientific concepts are more effectively taught using a whiteboard, focus on the problem of recognizing handwritten mathematical content in classroom videos.

There is a significant body of research devoted to the topic of mathematical content recognition [1], [2], and to the extraction and recognition of textual content from videos [3], [4]. Although our work is closely related and dependent on the advances made in the aforementioned fields, its focus on the use of an audio based recognizer in combination with a primary, video based recognizer to improve the recognition of handwritten mathematical content in classroom videos offers a range of new and interesting challenges. In this paper, we focus on improving the character recognition accuracy in such videos, and propose techniques for establishing a correspondence between the outputs of the two recognizers and techniques for combining such synchronized output.

Our approach (shown in Figure 1) makes use of a video text recognizer (VTR) which extracts, segments and outputs a set of video options for every segmented character from the input classroom video. Each video option generated by the VTR consists of a character and the corresponding match score, where the match score represents the recognizer's belief in the correctness of the video option. In some cases the match score for the top video option may be high enough, in absolute and/or relative terms, to accept the corresponding character as the recognized output, but in other cases the choice for the final recognized output may not be clear and we consider such characters to be ambiguous [5]. For ambiguous characters (determined by the ambiguity detection stage) our approach makes use of an audio text recognizer (ATR) and synchronization techniques to locate, for each video option, a subset of candidate audio options, which together with the video options are used by the combination stage to generate the set of recognized characters (an example is shown in Figure 2). In an end-to-end setup, the recognized characters are routed to an A/V based structure analysis stage which generates the final output. In this paper, we focus on the synchronization and combination stages and make the following specific contributions:

• Audio-Video Synchronization - In classroom videos, given the occlusions, shadows and the fact that every character that appears on the whiteboard may not appear in the spoken content or may appear several seconds later, establishing correspondence between the audio and the video occurrence of a character can be very challenging. We propose a number of heuristics for establishing this correspondence.

• Audio-Video Combination - The correctness of the final output determined by combining the synchronized video and audio options is dependent on a large number of factors, e.g. the video and audio options' match scores, the synchronization accuracy, the recognizers' accuracy for specific characters, etc. For improving the A/V combination accuracy, we investigate various rank and measurement-based techniques, both at the classifier and the character level, and also explore the use of an ensemble of audio-video based recognizers.




Figure 1. End-to-end system overview


II. MATHEMATICAL MODEL

Data Set. Let $C$ represent the character set recognizable by the VTR and $S$ be the set of segmented character elements $s$ from the given test set of videos such that each element $s = (s_i, s_t, s_l)$ contains the segmented character's image $s_i$, timestamp $s_t$ and location coordinates $s_l$, and $G$ is a function that returns its ground truth $G(s) \in C$. For future use, we define an Equals function $E(c_1, c_2)$ which maps to 1 for $c_1 = c_2$ and 0 otherwise. Here, $c_1, c_2 \in C$.

VTR. For any segmented character $s$, the VTR function $V$ is defined as $V(s) = ((V^c_j(s), V^p_j(s)) \mid \forall j,\ V^c_j(s) \in C \wedge V^p_j(s) \in [0, 1])$. The function $V(s)$ returns an ordered set of video options arranged in decreasing order of $V^p_j(s)$. Here, $V^c_j(s)$ represents the $j$th video option's recognized character name and $V^p_j(s)$ is the corresponding match score generated by the VTR. In the absence of other inputs, $V^c_1(s)$ is the final recognized output for the segmented character $s$.

ATR. The output of the ATR for the $j$th input video option $V_j(s)$ is an ordered set of audio options and is represented as $A_j(s) = (A_{j,k}(s))$, where $A_{j,k}(s)$ is the $k$th audio option corresponding to the $j$th video option. Each audio option can further be represented as $A_{j,k}(s) = ((A^c_{j,k}(s), A^p_{j,k}(s), A^t_{j,k}(s)) \mid A^c_{j,k}(s) \in C \wedge A^p_{j,k}(s) \in [0, 1])$. Here, $A^c_{j,k}(s)$ represents the character name of the $j$th video option's $k$th audio option, which is the same as $V^c_j(s)$ in our implementation, $A^p_{j,k}(s)$ is the corresponding match score generated by the ATR and $A^t_{j,k}(s)$ is the corresponding time of occurrence in the audio segment. The above set is arranged in decreasing order of the score $A^p_{j,k}(s)$.
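For concreteness, the recognizer outputs defined above can be represented by data structures along the following lines. This is only an illustrative sketch; the class and field names are ours and not part of the implementation described in this paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoOption:
    char: str      # V^c_j(s): recognized character name from the set C
    score: float   # V^p_j(s): VTR match score in [0, 1]

@dataclass
class AudioOption:
    char: str      # A^c_{j,k}(s): character name (same as the video option's character)
    score: float   # A^p_{j,k}(s): ATR match score in [0, 1]
    time: float    # A^t_{j,k}(s): time of occurrence in the audio segment (seconds)

@dataclass
class SegmentedCharacter:
    image: object      # s_i: segmented character image
    timestamp: float   # s_t: automatic video timestamp TS^a_V(s)
    location: tuple    # s_l: location coordinates on the whiteboard
    video_options: List[VideoOption] = field(default_factory=list)        # ordered by decreasing score
    audio_options: List[List[AudioOption]] = field(default_factory=list)  # one ordered list per video option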

Ambiguity Detection. The ambiguity detection function $D(S) = S_D \subset S$ is designed to return $S_D$, a subset of characters from the set $S$ that are categorized as ambiguous based on various conditions imposed on the VTR output $V$. The set of non-ambiguous characters is $S - S_D$.

Synchronization. The A/V synchronization function $Y$ is defined as $Y(s) = (A_{j,k'}(s))$, where $k'$ is the index of the audio option that we consider to be best synchronized with the $j$th video option of the segmented character $s$. Section III explains some synchronization techniques and also shows how $k'$ may be determined.

Combination. The A/V combination function $Z$ is defined as $Z(s) = V^c_{j'}(s) \in C$, where $j'$ is the index of the video option that is considered to be correct by the system at the end of the A/V combination process. Section IV describes how $j'$ may be computed using simple rank and measurement-level combination techniques.

Evaluation. Finally, the character recognition accuracy of the end-to-end system, $\alpha$, computed on the entire test set $S$ with A/V combination employed on $S_D$ and purely video based character recognition employed on $S - S_D$, can be expressed as follows:

$\alpha(S) = \frac{|S - S_D| \times \alpha_V(S - S_D) + |S_D| \times \alpha_Z(S_D)}{|S|}$   (1)

$\alpha_V(S - S_D) = \frac{\sum_{\forall s \in S - S_D} E(V^c_1(s), G(s))}{|S - S_D|}$   (2)

$\alpha_Z(S_D) = \frac{\sum_{\forall s \in S_D} E(Z(s), G(s))}{|S_D|}$   (3)
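Equations (1)-(3) translate directly into code. The sketch below is illustrative only; it assumes the ground truth, the top video option and the combined output are available as callables returning character labels.

def equals(c1, c2):
    """E(c1, c2): 1 if the two character labels match, 0 otherwise."""
    return 1 if c1 == c2 else 0

def accuracy_video(non_ambiguous, ground_truth, top_video_option):
    """alpha_V(S - S_D), Equation (2): accuracy of the top VTR option on S - S_D."""
    return sum(equals(top_video_option(s), ground_truth(s)) for s in non_ambiguous) / len(non_ambiguous)

def accuracy_combined(ambiguous, ground_truth, combined_output):
    """alpha_Z(S_D), Equation (3): accuracy of the A/V combination output Z(s) on S_D."""
    return sum(equals(combined_output(s), ground_truth(s)) for s in ambiguous) / len(ambiguous)

def accuracy_end_to_end(non_ambiguous, ambiguous, ground_truth, top_video_option, combined_output):
    """alpha(S), Equation (1): accuracy over S, weighted by the sizes of the two subsets."""
    a_v = accuracy_video(non_ambiguous, ground_truth, top_video_option)
    a_z = accuracy_combined(ambiguous, ground_truth, combined_output)
    return (len(non_ambiguous) * a_v + len(ambiguous) * a_z) / (len(non_ambiguous) + len(ambiguous))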

III. A/V SYNCHRONIZATION

The accuracy of the proposed end-to-end recognition system is critically dependent on the ability of the solution to identify the segment in the audio which corresponds to the handwritten content, termed A/V synchronization. In this paper, we assume that everything that is handwritten is spoken and focus on handling synchronization issues that arise due to occlusions, shadows and the skew between the writing and the utterance of a character. In this section, we first discuss the factors that affect synchronization and then describe the A/V synchronization techniques.

A. Factors Affecting Synchronization

Video Timestamping Accuracy. Our recognition system relies heavily on an automatically determined (by the preprocessing stage) video timestamp, $TS^a_V(s)$, associated with each handwritten character $s$.



[Figure 2 panels: Video Frame; Extracted & Segmented Text; Video Options [character, match score]; Audio Options & Synchronized Option [audio time, match score]; A/V Combination Output (Rank-Sum, Weighted [0.5,0.5], Weighted [0.8,0.2]); Video Time (Seconds).]

Figure 2. Example: A/V Synchronization and Combination. The figure depicts the intermediate output from various components of the A/V based recognition system. Given an input video, the preprocessing stage outputs a set of segmented and timestamped characters that form the input to the character recognizer, which, for each segmented character, produces a set of video options. The figure then shows the various audio options that are generated by the ATR for each video option; one such audio option is chosen by the synchronization stage and is forwarded for combination. We show the output of three different A/V combination techniques for two sample characters.

The video timestamping accuracy is a measure of how often $TS^a_V(s)$ falls within a preset time window around $TS^m_V(s)$, where $TS^m_V(s)$ is the manually determined video timestamp and roughly corresponds to the time instant when the character becomes fully visible to a student watching the video. The timestamping accuracy depends on the quality of the preprocessing algorithms and tends to deteriorate due to shadows and occlusions caused by the instructor. We have observed that a very simple (poor) timestamping technique that uses a single binarization threshold tends to have a much higher timestamping difference, $\delta_T = |TS^a_V(s) - TS^m_V(s)|$, in regions where there are shadows caused by the instructor. When using a more sophisticated timestamping technique (good TS) with multiple binarization thresholds, we observe that $\delta_T$ is low for the entire region of the whiteboard. The $\delta_T$ value discussed here is shown in Figure 3 for both good TS and poor TS for a single video file (between two complete erasures of the whiteboard) with good recording alignment (RA). Here, in the poor TS plot, the high $\delta_T$ that we observe in the second half of the video corresponds to characters written in the lower half of the whiteboard and is due to the shadows caused by the instructor.

Figure 3. Timestamping: $\delta_T = |TS^a_V(s) - TS^m_V(s)|$ (sec) plotted against the character index for poor and good timestamping.

A/V Recording Alignment Factor. The A/V recording alignment factor is a measure of how well the audio and video in the recording are aligned. It is inversely proportional to the average value of the recording time difference, $\delta_R = |TS^m_V(s) - TS^m_A(s)|$, where $TS^m_A(s)$ is the manually determined audio timestamp and corresponds to the time instant when the character $s$ is spoken by the instructor. We define $TS^w_V(s)$ to represent the time instant when the character $s$ is actually written on the whiteboard. This is defined for theoretical purposes and cannot be determined by observing the video due to the presence of occlusions. The $\delta_R$ value depends on how close in time the actual writing and the utterance of the characters are, $|TS^w_V(s) - TS^m_A(s)|$, and also on the amount of occlusion, which introduces an additional delay $|TS^w_V(s) - TS^m_V(s)|$. Figure 4 plots $\delta_R$ against the character index for two separate video files, one with good RA and the other with poor RA. Both these videos have been timestamped using the good TS.

Figure 4. Alignment: $\delta_R = |TS^m_V(s) - TS^m_A(s)|$ (sec) plotted against the character index for poor and good A/V recording alignment.

VTR Accuracy. Since neighbors are a very important feature for synchronization, the reliability of these neighbors from the VTR output is also important. If the accuracy of the VTR is low, then the neighbors have a higher chance of being incorrectly recognized and this impacts the correctness of the neighbor based feature used by the synchronization techniques, resulting in a wrong A/V synchronization.

ATR Accuracy. The ATR may fail to recognize an utterance corresponding to a character or in some cases the ATR may incorrectly detect an utterance when one is not present. Such ATR errors may lead to synchronization errors, either directly for the character in question or indirectly for the neighboring characters.

B. Synchronization Techniques

The basic audio features that form the basis for the proposed synchronization techniques are described below. Other audio features related to the order of neighbors and the character repetitions have not been discussed here as they do not appear in the results reported in this paper.

$F_1$ - the audio match score generated by the ATR.

$F_2$ - the automatic video timestamp minus the automatic audio timestamp. The audio pruning window used is represented as $[P_B, P_A]$, where $P_B$ and $P_A$ refer to the amount of time before and after the audio timestamp of the audio option under consideration.



$F_3$ - the number of neighbors from a certain window $[N_B, N_A]$ in the video which are also found in a certain audio time window $[W_B, W_A]$, where $N_B$ and $N_A$ refer to the number of video neighbors before and after the video option under consideration and $W_B$ and $W_A$ refer to the amount of time before and after the audio timestamp of the audio option under consideration.
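As an illustration (and only that), the three features can be computed for one candidate audio option roughly as follows, using the data structures sketched in Section II; the parameter defaults mirror the windows used in the experiments of Section V.

def compute_features(char, audio_option, all_chars, n_before=2, n_after=2, w_before=8.0, w_after=8.0):
    """Return (F1, F2, F3) for one candidate audio option of the segmented character `char`.

    all_chars is the list of segmented characters ordered by video timestamp;
    [n_before, n_after] is the video neighbor window [N_B, N_A] and
    [w_before, w_after] is the audio time window [W_B, W_A] in seconds.
    """
    f1 = audio_option.score                    # F1: ATR match score
    f2 = char.timestamp - audio_option.time    # F2: automatic video timestamp minus audio timestamp

    # F3: count video neighbors that also have some audio option inside
    # [t - W_B, t + W_A] around the candidate audio option's time t.
    idx = all_chars.index(char)
    neighbors = all_chars[max(0, idx - n_before):idx] + all_chars[idx + 1:idx + 1 + n_after]
    f3 = 0
    for nb in neighbors:
        nb_audio = (a for opts in nb.audio_options for a in opts)
        if any(audio_option.time - w_before <= a.time <= audio_option.time + w_after for a in nb_audio):
            f3 += 1
    return f1, f2, f3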

A/V Time Difference Based Technique. This technique operates by first performing simple pruning of the audio options based on the audio features $F_1$ and $F_2$, followed by the selection of the audio option with the lowest value for $F_2$. Mathematically, this can be represented as follows.

$Y(s) = ((A_{j,k'}(s)) \mid k' = \arg\min_k(F_2(A_{j,k}(s))) \wedge F_1(A_{j,k}(s)) > P_P \wedge P_B \le F_2(A_{j,k}(s)) \le P_A)$   (4)
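A minimal sketch of Equation (4), assuming each candidate is given as a tuple (audio_option, F1, F2, F3); the default value for the score threshold $P_P$ is illustrative, while the window defaults follow Table I.

def prune(scored_options, p_score, p_before, p_after):
    """Keep candidates with F1 > P_P and P_B <= F2 <= P_A.
    scored_options is a list of (audio_option, f1, f2, f3) tuples."""
    return [x for x in scored_options if x[1] > p_score and p_before <= x[2] <= p_after]

def sync_time_difference(scored_options, p_score=0.05, p_before=-6.0, p_after=20.0):
    """Equation (4): among the pruned candidates, select the audio option with the lowest F2."""
    candidates = prune(scored_options, p_score, p_before, p_after)
    if not candidates:
        return None    # no audio option survives pruning
    return min(candidates, key=lambda x: x[2])[0]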

A/V Neighbor Based Technique. This technique, like the one described above, performs simple pruning followed by the selection of the audio option with the highest value for $F_3$. It is mathematically expressed as follows.

$Y(s) = ((A_{j,k'}(s)) \mid k' = \arg\max_k(F_3(A_{j,k}(s))) \wedge F_1(A_{j,k}(s)) > P_P \wedge P_B \le F_2(A_{j,k}(s)) \le P_A)$   (5)
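The neighbor based selection of Equation (5) only changes the final criterion; a sketch reusing the prune helper above:

def sync_neighbor(scored_options, p_score=0.05, p_before=-6.0, p_after=20.0):
    """Equation (5): among the pruned candidates, select the audio option with the highest F3."""
    candidates = prune(scored_options, p_score, p_before, p_after)
    if not candidates:
        return None
    return max(candidates, key=lambda x: x[3])[0]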

Feature Rank Sum Based Technique. After initial pruning, feature ranks ($R_1$, $R_2$, $R_3$) are assigned to the different audio features ($F_1$, $F_2$, $F_3$) based on the rank of the audio feature in the pruned set of audio options for that particular video option. For instance, a value of $R_1 = 1$ for an audio option indicates that the corresponding feature $F_1$ has the maximum value in the pruned set. Once the ranks have been assigned, the audio option with the minimum value of the rank sum (i.e. $R_1 + R_2 + R_3$) is chosen as the output. $R_2$ has been used for tie-breaking.
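A sketch of the feature rank sum selection follows. The ranking direction for $F_2$ (a smaller time difference ranks first) mirrors the time difference technique and is our assumption; ties are broken by $R_2$ as stated above.

def rank(values, best_is_max):
    """Assign ranks 1..n within the pruned set; rank 1 goes to the best value."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=best_is_max)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def sync_feature_rank_sum(scored_options, p_score=0.05, p_before=-6.0, p_after=20.0):
    """Select the pruned audio option with the minimum R1 + R2 + R3; R2 breaks ties."""
    candidates = prune(scored_options, p_score, p_before, p_after)
    if not candidates:
        return None
    r1 = rank([x[1] for x in candidates], best_is_max=True)    # F1: higher score ranks first
    r2 = rank([x[2] for x in candidates], best_is_max=False)   # F2: smaller time difference ranks first (assumed)
    r3 = rank([x[3] for x in candidates], best_is_max=True)    # F3: more matching neighbors ranks first
    best = min(range(len(candidates)), key=lambda i: (r1[i] + r2[i] + r3[i], r2[i]))
    return candidates[best][0]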

IV. A/V COMBINATION

After A/V synchronization, each of the video options for a character has at most one corresponding audio option. The A/V combination stage generates the final recognized output based on the match scores generated by the two recognizers as well as the computed audio features. Classifier combination approaches have also been applied in several related contexts, for instance, to improve the recognition accuracy of handwriting recognizers [6], speech recognizers [7] and a combination of the two [8]. We have experimented with several rank-level and measurement-level decision making/combination techniques [9] such as rank sum and weighted sum rule using classifier-specific weights and character-specific weights, and also combined an ensemble of A/V combination classifiers.

A. Combination Techniques

Rank Based Techniques. Simple rank sum or weighted rank sum may be used for the purpose of combination. This needs to be accompanied by a suitable tie-breaking strategy. One of the advantages of using a rank based technique here is that there is no need to compute weights for the audio and video components or normalize the scores generated by different classifiers. We have implemented rank sum for different subsets of the audio and video recognizer scores and the audio features.
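For illustration, a rank sum over the VTR match score $V$ and the audio feature $F_1$ (the best performing subset in Table II) could look as follows, reusing the rank helper from the synchronization sketch; the tie-breaking choice and data layout are ours.

def combine_rank_sum(video_options, synced_audio):
    """Rank sum combination over [V, F1].

    video_options : list of VideoOption, one per option j
    synced_audio  : list of the synchronized AudioOption (or None) per option j
    Returns the index j' of the winning video option.
    """
    rv = rank([v.score for v in video_options], best_is_max=True)
    ra = rank([a.score if a is not None else 0.0 for a in synced_audio], best_is_max=True)
    # Minimum rank sum wins; the VTR rank breaks ties (our choice).
    return min(range(len(video_options)), key=lambda j: (rv[j] + ra[j], rv[j]))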

Classifier-Specific Weight Based Techniques. Apart from the simple sum rule, we have made use of a set of classifier-specific weights $[w_V, w_A]$ to combine the scores generated by each of the recognizers. The weight $w_V$ is for the VTR and $w_A$ is for the ATR. We have shown the results of weighted combination for different combinations of the classifier-specific weights in Table II.

$Z(s) = (V^c_{j'}(s) \mid j' = \arg\max_j(w_V \times V^p_j(s) + w_A \times A^p_{j,k'}(s)) \wedge V^c_j(s) \in O(s))$   (6)
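A sketch of Equation (6) with fixed classifier-specific weights. Here $O(s)$, the set of options retained by the option selection stage [5], is passed in as a set of allowed character labels; this representation is our assumption.

def combine_classifier_weights(video_options, synced_audio, allowed_chars, w_video=0.8, w_audio=0.2):
    """Equation (6): pick j' maximizing w_V * V^p_j(s) + w_A * A^p_{j,k'}(s) over the allowed options."""
    best_j, best_score = None, float("-inf")
    for j, (v, a) in enumerate(zip(video_options, synced_audio)):
        if v.char not in allowed_chars:    # restrict to the option selection set O(s)
            continue
        score = w_video * v.score + w_audio * (a.score if a is not None else 0.0)
        if score > best_score:
            best_j, best_score = j, score
    return best_j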

Character-Specific Weight Based Techniques. As the accuracy of the VTR and the ATR may be significantly different for each character, the use of a different set of weights for every character could prove to be advantageous. One possible way to compute the video weight $w_V(V^c_j(s))$ (in Equation 7) for the $j$th video option for the character $s$ would be to use the VTR's accuracy-related metrics, such as precision and sensitivity, that correspond to the character label $V^c_j(s)$. The value of the audio weight $w_A(V^c_j(s))$ can either be computed in a similar fashion for the ATR and both the weights normalized to sum to 1, or we can assign $w_A(V^c_j(s)) = 1 - w_V(V^c_j(s))$.

$Z(s) = (V^c_{j'}(s) \mid j' = \arg\max_j(w_V(V^c_j(s)) \times V^p_j(s) + w_A(A^c_{j,k'}(s)) \times A^p_{j,k'}(s)) \wedge V^c_j(s) \in O(s))$   (7)
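A sketch of Equation (7) using per-character VTR precision as $w_V$ and $w_A = 1 - w_V$ (one of the two variants described above); the fallback weight for unseen characters is our choice.

def combine_character_weights(video_options, synced_audio, allowed_chars, vtr_precision):
    """Equation (7): character-specific weights with w_V(c) = VTR precision for character c."""
    best_j, best_score = None, float("-inf")
    for j, (v, a) in enumerate(zip(video_options, synced_audio)):
        if v.char not in allowed_chars:
            continue
        w_v = vtr_precision.get(v.char, 0.5)   # fallback for characters without statistics (our choice)
        w_a = 1.0 - w_v
        score = w_v * v.score + w_a * (a.score if a is not None else 0.0)
        if score > best_score:
            best_j, best_score = j, score
    return best_j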

Classifier Ensemble Based Techniques. This is a two-level combination technique. First, we create an ensemble of classifiers, i.e. the first stage of combination, by using different audio and video combination weights, different subsets of the audio feature set and different combination techniques. After this we prune away those classifiers that show no potential to improve the final accuracy when evaluated on a training set. Finally, we perform the second level of combination, i.e. combining the outputs of the ensemble of classifiers. A number of combination techniques such as majority voting, rank sum and character-specific classifier selection may be employed. We have used rank sum as the second level combination technique and the results are shown in Section V.
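A sketch of the second-level rank sum combination across the ensemble: each first-level classifier ranks the video options of an ambiguous character, and the option with the smallest total rank wins. The data layout is assumed.

def combine_ensemble_rank_sum(per_classifier_ranks):
    """Second-level rank sum combination.

    per_classifier_ranks[c][j] is the rank (1 = best) that first-level classifier c
    assigns to video option j of an ambiguous character. Returns the option index
    with the smallest total rank across the ensemble.
    """
    n_options = len(per_classifier_ranks[0])
    totals = [sum(ranks[j] for ranks in per_classifier_ranks) for j in range(n_options)]
    return min(range(n_options), key=lambda j: totals[j])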

V. EXPERIMENTAL RESULTS

Implementation & Setup. While the system is not tied to a specific character recognizer or speech recognizer, our current implementation makes use of an extended version of the GNU Optical Character Recognition (GOCR) [10] tool and the Nexidia word-spotter [11]. Our video data set [12] has been recorded in a classroom-like environment and consists of a fairly large training and test dataset with more than 6000 characters and more than 100 videos. The character set includes capital and small alphabets, numbers and basic algebraic operators.



Table I
SYNCHRONIZATION TECHNIQUES

Technique            Pruning Time       Neighbor Audio          Synchronization Accuracy (%)
                     Window [P_B, P_A]  Time Window [W_B, W_A]  Good TS,        Poor TS,        Good TS,
                                                                Good Alignment  Good Alignment  Poor Alignment
A/V Time Difference  [-1, 5]            -                       86.4            56.6            23.1
A/V Time Difference  [-6, 20]           -                       77.3            60.2            41.0
A/V Neighbor         [-6, 20]           [-4, 4]                 63.6            47.0            28.2
A/V Neighbor         [-6, 20]           [-8, 8]                 65.9            55.4            43.6
Feature Rank Sum     [-1, 5]            [-8, 8]                 86.4            56.6            23.1
Feature Rank Sum     [-6, 20]           [-8, 8]                 81.8            65.1            46.2


Synchronization. We evaluated the synchronization techniques (Table I) proposed in the paper using videos with a good audio-video recording alignment (RA) that were timestamped using both the poor and the good timestamping (TS) methods. While videos with poor TS or poor RA resulted in regions with large values for $\delta_T$ and $\delta_R$ (≈ 10-20 seconds), the videos with both good TS and good RA have fairly small values for these time differences (≈ 2-5 seconds). For all the experiments discussed here, we have used the number of neighbors window $[N_B, N_A] = [2, 2]$, i.e. two neighbors before and two neighbors after the character under consideration. We have used different pruning time windows and neighbor audio time windows, such as $[P_B, P_A] = [-6, 20]$, which refers to 6 seconds before and 20 seconds after, and $[W_B, W_A] = [-8, 8]$, which refers to 8 seconds before and after the audio timestamp of the character under consideration.

A/V Time Difference Based Technique. The A/V time difference based technique outperformed the A/V neighbor based technique for videos with good TS and good RA, as the best audio option is most likely to be in the vicinity due to the very small $\delta_T$ and $\delta_R$ values. For videos with poor RA (i.e. a large value of $\delta_R$), we notice a significant deterioration in synchronization accuracy. For videos with good RA but poor TS, we saw a deterioration in accuracy but not as significant as for the videos with poor RA. This is due to the fact that only the characters in the lower half of the whiteboard were affected by the shadows. Notice that a small pruning window $[P_B, P_A] = [-1, 5]$ for this technique (and the feature rank sum technique) results in a very high synchronization accuracy (≈ 86.4%), but significantly deteriorates the accuracy for the other techniques. This is because most of the correct audio options were pruned away by the small pruning window.

A/V Neighbor Based Technique. For videos with poor RA, neighbors become very important and we can see that this technique outperforms the time difference technique. The improvement was not very significant due to the fact that the VTR accuracy for the system is 55.0%, so the neighbors were not very reliable. Therefore, the next step for neighbor based techniques would be to select the best neighbors for synchronization. Non-ambiguous characters and characters that are usually well recognized by the ATR (without too many false positives or negatives) would be good choices. One should also observe that for this technique, the neighbor audio window $[W_B, W_A]$ size is also crucial and the synchronization deteriorates if this window is too small ([-4, 4]).

Feature Rank Sum Based Technique. The feature rank sum based technique outperforms both the time difference and neighbor based techniques for most videos. For example, feature rank sum gave a synchronization accuracy of 81.8% compared to 77.3% (time difference based) and 65.9% (neighbor based) for good TS and good RA with $[P_B, P_A] = [-6, 20]$ and $[W_B, W_A] = [-8, 8]$. We utilize this synchronization technique for the combination experiments below.

Combination. Table II shows the results for the various A/V combination techniques described in Section IV.

Baseline and Ambiguity Detection. We observe that the VTR (without the ambiguity detection and the A/V combination) provides a character recognition accuracy of 55.0%. This forms the baseline for future comparisons. For each of the reported experiments, we performed ambiguity detection and option selection [5], both with a relative threshold of 0.9. This divides the test set $S$ into two subsets of ambiguous characters $S_D$ (27.9 + 8.6 + 8.8 + 32.0 = 77.3% of $S$) and non-ambiguous characters $S - S_D$ (18.5 + 4.1 = 22.6% of $S$). The VTR accuracy for the non-ambiguous character set $\alpha_V(S - S_D)$ is 81.7% and for the ambiguous character set $\alpha_V(S_D)$ is 47.1%. It is this $\alpha_V(S_D)$ value that we hope to boost by replacing it with $\alpha_Z(S_D)$ to improve $\alpha(S)$.
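The exact ambiguity criteria are defined in [5]; one plausible reading of the relative threshold used here, shown purely as an illustration, is to flag a character as ambiguous when the runner-up video option scores within a factor of 0.9 of the top option:

def is_ambiguous(video_options, relative_threshold=0.9):
    """Flag a segmented character as ambiguous when the runner-up video option's score
    is at least relative_threshold times the top option's score (illustrative reading
    of the criterion from [5], not its exact definition)."""
    if len(video_options) < 2 or video_options[0].score == 0.0:
        return False
    return video_options[1].score / video_options[0].score >= relative_threshold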

Rank Based Techniques. Table II shows several different rank based techniques. The rank sum for these experiments was based on different subsets of the audio features $F_1$, $F_2$, $F_3$ and the VTR match score $V$. We observe that the rank sum that is based on $V$ and $F_1$ has the highest end-to-end system accuracy of 59.9% for this technique. Notice that the other audio features do not seem to significantly impact the combination stage, which can be attributed to the fact that the test set used here is reasonably well aligned.

Classifier-Specific Weight Based Techniques. For classifier-specific weights (Cfr. Wts.), we used different combinations of $[w_V, w_A]$ and the results tabulated in Table II show that the optimal value of the weights is close to [0.8, 0.2], with $\alpha(S) = 61.9\%$. It is interesting to see how the four ratios of ambiguous characters that are correctly/wrongly recognized by the VTR and the combined recognizer (AV) change when the weights are varied.



Table II
COMBINATION TECHNIQUES

All values are percentages. For every technique below except the VTR baseline, the non-ambiguous subset statistics are constant: #V_C/|S| = 18.5, #V_W/|S| = 4.1, α_V(S − S_D) = 81.7 and α_V(S_D) = 47.1. For the VTR baseline (no ambiguity detection): #V_C/|S| = 55.0, #V_W/|S| = 45.0.

Technique                      #(V_C,AV_C)  #(V_C,AV_W)  #(V_W,AV_C)  #(V_W,AV_W)  α_Z(S_D)  α(S)
VTR (†)                              -            -            -            -          -     55.0
RankSum [V, F1, F2, F3]            27.9          8.6          8.8         32.0       47.5    55.2
RankSum [V, F1, F2]                30.9          5.5          8.6         32.3       51.1    58.0
RankSum [V, F1] (‡)                29.8          6.6         11.6         29.3       53.6    59.9
Cfr. Wts. [0.1, 0.9]               26.8          9.7         12.2         28.7       50.4    57.5
Cfr. Wts. [0.2, 0.8]               27.1          9.4         12.2         28.7       50.7    57.7
Cfr. Wts. [0.3, 0.7]               27.9          8.6         12.2         28.7       51.8    58.6
Cfr. Wts. [0.4, 0.6]               28.2          8.3         12.2         28.7       52.1    58.8
Cfr. Wts. [0.5, 0.5]               28.2          8.3         12.2         28.7       52.1    58.8
Cfr. Wts. [0.6, 0.4]               29.3          7.2         12.2         28.7       53.6    59.9
Cfr. Wts. [0.7, 0.3]               30.9          5.5         11.9         29.0       55.4    61.3
Cfr. Wts. [0.8, 0.2] (♠)           32.3          4.1         11.0         29.8       56.1    61.9
Cfr. Wts. [0.9, 0.1]               34.8          1.7          7.5         33.4       54.6    60.8
Chr. Wts. Sensitivity              28.2          8.3         12.2         28.7       52.1    58.8
Chr. Wts. Precision (♣)            28.5          8.0         17.4         23.5       59.3    64.4
Ensemble (♠, ♣ + one of †/‡)         -            -            -            -        57.1    62.7
Ensemble (†, ‡, ♠, ♣)                -            -            -            -        61.1    65.7
Ensemble (♠, ♣ + one of †/‡)         -            -            -            -        64.3    68.2

V_C: Correct VTR Output, V_W: Wrong VTR Output, AV_C: Correct A/V Combination Output, AV_W: Wrong A/V Combination Output. †, ‡, ♠, ♣ mark the individual classifiers combined in the ensembles.

For high values of $w_V$, the final AV output tends to be very similar to the VTR output and results in a high ratio for $(V_C, AV_C)$ and $(V_W, AV_W)$ and a much smaller ratio of characters changing from $V_C$ to $AV_W$ or from $V_W$ to $AV_C$. As expected, the reverse trend is true when $w_A$ takes a higher value. The improvement in the final result comes from reducing the number of characters in $(V_C, AV_W)$.

Character-Specific Weight Based Techniques. We have generated character-specific weights (Chr. Wts.) in two ways. The first uses the VTR sensitivity of each character as $w_V$ and the second uses the VTR precision of each character as $w_V$. Here, $w_A = 1 - w_V$. Between these two techniques, the precision based weights seem to perform better, with an $\alpha(S)$ of 64.4%. Here, the improvement mostly comes from increasing the number of characters in $(V_W, AV_C)$.

Classifier Ensemble Based Techniques. Finally, for ensemble based techniques, Table II shows that we need to select the correct set of classifiers to combine and that increasing the number of classifiers does not necessarily increase the final system accuracy. For ensemble based techniques, we see a very significant improvement in the end-to-end system character recognition accuracy. The maximum accuracy achieved was 68.2%, which is a 13.2% absolute improvement and a 24.0% relative improvement compared to the VTR.

VI. CONCLUSIONS & FUTURE WORK

In this paper, we focused on techniques for A/V synchronization and A/V combination that can assist in improving the character recognition accuracy for handwritten mathematical content in classroom videos. Going forward, we plan to carefully examine the impact of errors introduced by each component in this multi-stage recognition system, develop synchronization techniques that take into consideration both the accuracy and the order of the neighbors, and develop techniques for audio-assisted structure disambiguation.

REFERENCES

[1] R. Zanibbi et al., "Recognizing mathematical expressions using tree transformation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 11, 2002.

[2] R. H. Anderson, "Two-dimensional mathematical notations," Syntactic Pattern Recognition Applications (K. S. Fu), 1977.

[3] M. Wienecke et al., "Toward automatic video-based whiteboard reading," IJDAR, vol. 7, no. 2-3, 2005.

[4] L.-W. He and Z. Zhang, "Real-time whiteboard capture and processing using a video camera for remote collaboration," IEEE Transactions on Multimedia, vol. 9, no. 1, 2007.

[5] S. Vemulapalli and M. Hayes, "Ambiguity detection methods for improving handwritten mathematical character recognition accuracy in classroom videos," in DSP, 2011.

[6] W. Wang et al., "Combination of multiple classifiers for handwritten word recognition," in IWFHR, 2002.

[7] J. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in ASRU, 1997.

[8] J. Hunsinger and M. Lang, "A speech understanding module for a multimodal mathematical formula editor," in ICASSP, 2000.

[9] S. Tulyakov et al., "Review of classifier combination methods," in Studies in Computational Intelligence: Machine Learning in Document Analysis and Recognition, 2008.

[10] "GOCR," http://jocr.sourceforge.net/.

[11] "Nexidia," http://www.nexidia.com/.

[12] "Classroom Videos Dataset," http://users.ece.gatech.edu/~smita/dataset/.
