
AMBIGUITY DETECTION METHODS FOR IMPROVING HANDWRITTEN MATHEMATICAL CHARACTER RECOGNITION ACCURACY IN CLASSROOM VIDEOS

Smita Vemulapalli¹, Monson Hayes²

¹School of Electrical & Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
²Advanced Imaging Science, Multimedia & Film, Chung-Ang University, Seoul, Korea

ABSTRACT

In classroom videos, recognizing mathematical content handwritten on the whiteboard presents a unique opportunity in the form of the audio content spoken by the instructor. This recognized audio content can be used to improve character recognition accuracy by providing evidence that corroborates or contradicts the output options generated by the primary, video-based recognizer. However, such audio-video disambiguation also has the potential to introduce errors into what may have been correct output from the video-based recognizer. In this paper, we focus on improving character recognition accuracy by developing ambiguity detection methods that determine the set of potentially incorrect outputs from the video-based recognizer and, for each such output, the subset of possibly correct output options that should be forwarded for audio-video character disambiguation. We propose, implement and evaluate a number of such ambiguity detection methods.

Index Terms— Classifier combination, character recognition, multimodal, video processing, speech processing.

1. INTRODUCTION

The rapid proliferation of high-speed networks, coupled with a decline in the cost of storage, has fueled unprecedented growth in the number of e-learning and advanced learning initiatives. Such initiatives rely on the online delivery of pre-recorded classroom videos as the primary medium of instruction, or make them available online for reference by students. As the volume of recorded video content increases, it is becoming clear that efficient navigation through the available classroom videos requires techniques that can extract, identify and summarize their content. In this context, and given that the whiteboard remains the preferred and effective medium for teaching complex mathematical and scientific concepts, this paper focuses on the problem of recognizing handwritten mathematical content in classroom videos.

There is a significant body of research devoted to mathematical content recognition [1, 2], and to the extraction and recognition of textual content from videos [3, 4].

[Fig. 1. Importance of ambiguity detection: accuracy versus number of options for Video Recognizer + Oracle, Video Recognizer + A/V Combination, and Video Recognizer + Ambiguity + A/V Combination.]

Although our work closely relates to and utilizes the advances made in the aforementioned fields, its focus on using an audio-based recognizer in combination with a primary, video-based recognizer to improve character recognition accuracy offers a range of new and interesting challenges. A particularly interesting challenge, among several others, stems from the fact that combination with the audio-based recognizer can introduce errors into the output of the video-based recognizer and even degrade character recognition accuracy relative to the primary recognizer alone. In this paper, we propose a set of ambiguity detection methods that not only limit the errors introduced by the combination stage but also increase the possibility of correcting errors in the output of the video-based recognizer.

Our approach makes use of a video-based recognizer which, for each segmented character from the video, outputs a set of possible video options, where each video option contains a recognized character corresponding to the segmented character and a video match score that reflects the recognizer's belief in the correctness of that option. In many cases, for instance when a significantly high video match score is assigned to the top video option, or by using information derived from the recognizer's past runs over a user-specific training set, one can attempt to determine the most likely correct output from the set of video options generated by the video recognizer. The segmented characters thus considered to be correctly recognized are not forwarded to the combined recognizer, eliminating the possibility of any deterioration (which may be more likely in this case) or improvement in accuracy for this subset of characters. The remaining segmented characters are considered ambiguous and are forwarded for combination. In particular, we make the following contributions:


• Mathematical Model - A formal model capturing the various aspects of the combination system is presented and used to describe ambiguity detection and classifier combination.

• Ambiguity Detection - We propose simple and character-specific thresholding methods, as well as a method that relies on observed accuracy to map the confidence scores of the video options to new values.

• Option Selection - We suggest methods for selecting the video options submitted to the combination stage, and provide preliminary evidence of the impact of option selection on recognition accuracy.

2. SYSTEM OVERVIEW

Our system [5] for handwritten mathematical content recognition in classroom videos is shown as a block diagram in Figure 2. We are interested in the blocks that fall under the audio-video based character recognition part. Our character recognition system is organized as a two-stage assembly. The first stage consists of a video text recognizer (VTR), based on the open-source character recognizer GOCR [6], which is used as the primary character recognizer. The VTR is designed to return a list of possible characters and the corresponding video match scores for each input character. The second stage is the combination stage, which includes the character disambiguation block and the audio text recognizer (ATR). The A/V Combination (AVC) module makes use of the ATR, in this case the Nexidia word spotter [7], to disambiguate the output of the VTR and produce the final recognized character. However, the AVC does not operate on all input characters; it internally utilizes the ambiguity detection module to detect potentially incorrect VTR output and, for such output, forwards to the AVC only the subset of options that have a high likelihood of being the final correct output. VTR output that is classified as correct by the ambiguity detection module is forwarded directly as output of the combination stage. The output of the combination stage serves as input to the structure analysis stage, which is shown here for completeness. In this paper, we focus primarily on the ambiguity detection module. While our current implementation is based on GOCR and the Nexidia word spotter, the methods proposed in this paper are not tied to these implementations and can easily be extended for use with other recognizers. Other multi-stage (serial) classifier combination systems include [8] and [9].

3. MATHEMATICAL MODEL

In this section, we present a formal model for the recognizers that can be used to describe the ambiguity detection and classifier combination approaches. We also mathematically model the recognition accuracies of the separate recognizers (Equations 2 and 3) as well as of the end-to-end system (Equations 4 and 5).

Let C represent the character set recognizable by the VTR (and the ATR), and let S be the set of segmented character elements s from the given test set of videos, such that each element s = (s_i, s_t, s_l) contains the segmented character's image s_i, timestamp s_t and location coordinates s_l; let G be a function that returns its ground truth G(s) ∈ C. For later use, we define an Equals function E(c_1, c_2), where c_1, c_2 ∈ C:

E(c_1, c_2) = \begin{cases} 1, & c_1 = c_2 \\ 0, & \text{otherwise} \end{cases}    (1)

The character recognizer used within the VTR is represented as V. For any segmented character s, the function V(s) = ((V_j^c(s), V_j^p(s)) \mid \forall j : V_j^c(s) \in C \wedge V_j^p(s) \in [0, 1]) returns an ordered set of video options arranged in descending order of the score V_j^p(s). Here, V_j^c(s) is the j-th video option's recognized character and V_j^p(s) is the corresponding match score generated by the recognizer. In the absence of other inputs, V_1^c(s) is the final recognized output for the segmented character s. The character recognition accuracy of the VTR, α_V(S), can be computed as follows:

\alpha_V(S) = \frac{\sum_{s \in S} E(V_1^c(s), G(s))}{|S|}    (2)
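To make the model concrete, the following minimal sketch (hypothetical Python; the names VideoOption, options_per_char and vtr_accuracy are ours, not part of the system) shows one way to represent the ordered video options and to compute α_V(S) as in Equation 2:

from dataclasses import dataclass

@dataclass
class VideoOption:
    char: str     # V_j^c(s): the option's recognized character
    score: float  # V_j^p(s): match score in [0, 1]

def vtr_accuracy(options_per_char, ground_truth):
    # alpha_V(S), Equation 2: the fraction of segmented characters whose
    # top-ranked video option matches the ground truth G(s).
    correct = sum(1 for s, opts in options_per_char.items()
                  if opts and opts[0].char == ground_truth[s])
    return correct / len(options_per_char)

Here options_per_char maps each segmented character to its score-sorted option list, and ground_truth supplies G(s).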

The ambiguity detection function D(S, V) = S_D ⊂ S is designed to return S_D, the subset of characters from S that are categorized as ambiguous based on conditions imposed on the output of the VTR V. Methods for ambiguity detection are described in Section 4. Here, S − S_D represents the set of non-ambiguous characters.

We do not present the model for the ATR, referred to as A, as it is not relevant to this paper; one may refer to [10] for details. We model the combination/fusion function of the AVC as F(s, S, V, A) = c ∈ C, where s ∈ S_D; it makes use of the set of segmented characters S, the output of the VTR V and of the ATR A to return the recognized character corresponding to the segmented character element s. The AVC algorithm used in our system is described in Section 5. The character recognition accuracy of the AVC algorithm, α_F, for the set of ambiguous characters S_D can now be expressed as:

\alpha_F(S, V, A, S_D) = \frac{\sum_{s \in S_D} E(F(s, S, V, A), G(s))}{|S_D|}    (3)

Finally, the character recognition accuracy α computed over the entire test set S, with the AVC employed on the set of ambiguous characters S_D and only the VTR employed on the set of non-ambiguous characters S − S_D, can be expressed as follows:

\alpha(S, V, A, D) = \frac{\alpha_V(S - S_D) \times |S - S_D| + \alpha_F(S, V, A, S_D) \times |S_D|}{|S|}    (4)

The above equation for character recognition accuracy can be modified to show (in Equation 5) the contribution of each stage of ambiguity detection, namely mapping, thresholding and option selection. Here, the sets S − S_D and S_D are the result of thresholding, and they determine which characters are passed only through the VTR and which are forwarded to the AVC.

[Fig. 2. End-to-end system overview.]

α_{V+M} represents the recognition accuracy achievable at the end of the mapping stage; in the case of identity mapping (i.e. no mapping), α_{V+M} is the same as α_V. α_F^{N'} represents the constant recognition accuracy of the AVC given that it receives a set of N options of which one is correct, and R_{AC}^N represents the ratio of ambiguous characters in S_D that have the correct option among the N options forwarded to the AVC. The final character recognition accuracy when passing N options to the AVC is therefore:

\alpha^N(S) = \frac{\alpha_{V+M}(S - S_D) \times |S - S_D| + \alpha_F^{N'} \times R_{AC}^N(S_D) \times |S_D|}{|S|}    (5)
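As a worked sketch of Equation 5 (hypothetical Python with made-up counts, not measurements from our system):

def end_to_end_accuracy(alpha_vm, alpha_f_n, r_ac_n, n_unambiguous, n_ambiguous):
    # alpha^N(S), Equation 5: non-ambiguous characters contribute at the
    # VTR+mapping accuracy; ambiguous ones at the AVC accuracy, discounted
    # by the chance that the correct option was forwarded at all.
    total = n_unambiguous + n_ambiguous
    return (alpha_vm * n_unambiguous
            + alpha_f_n * r_ac_n * n_ambiguous) / total

# e.g. alpha_{V+M} = 0.9 on 600 characters, alpha_F^{N'} = 0.7 and
# R_AC^N = 0.8 on 400 characters:
# (0.9*600 + 0.7*0.8*400) / 1000 = (540 + 224) / 1000 = 0.764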

4. AMBIGUITY DETECTION

In our two-stage character recognition system, one of the most important decisions is when to use the first stage alone and when to also make use of the second stage. If this task is done well, we may see an end-to-end system accuracy that is higher than the recognition accuracy achieved by passing the entire test set through either recognizer alone or through both without ambiguity detection. As can be seen in Figure 2, the ambiguity detection module involves three main operations: mapping, thresholding and option selection. These operations are followed by a suitable A/V combination (AVC) strategy.
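The following hypothetical Python sketch (reusing the VideoOption records from the Section 3 sketch) shows how the three operations might be chained; vtr, mapper, threshold_fn and select_fn are placeholder callables, not our actual modules:

def detect_ambiguous(segmented_chars, vtr, mapper, threshold_fn, select_fn):
    # Mapping, thresholding and option selection in sequence. Characters
    # judged non-ambiguous keep the top VTR option; the rest are forwarded
    # to the AVC together with a selected subset of their options.
    unambiguous, ambiguous = {}, {}
    for s in segmented_chars:
        options = mapper(vtr(s))       # rescore and re-sort the options
        if threshold_fn(options):      # e.g. Equation 6: V_1^p(s) < T_1
            ambiguous[s] = select_fn(options)
        else:
            unambiguous[s] = options[0].char
    return unambiguous, ambiguous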

4.1. Thresholding

4.1.1. Simple Thresholding

A very simple ambiguity detection method (Equation 6) considers the segmented character element s ∈ S to be ambiguous if the score of the first video option V_1^p(s) is less than a predetermined absolute threshold T_1:

D(S, V) = \{s \in S \mid V_1^p(s) < T_1\}    (6)

Another simple condition for ambiguity (Equation 7) compares the difference (or the ratio) between the scores of the first and second video options to a relative threshold T_2. Note that both methods use a single, suitably chosen threshold for all characters.

D(S, V) = \{s \in S \mid V_1^p(s) - V_2^p(s) < T_2\}    (7)
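A minimal sketch of Equations 6 and 7 (hypothetical Python, reusing the VideoOption records from the Section 3 sketch; option lists are assumed sorted by descending score):

def ambiguous_absolute(options_per_char, t1):
    # Equation 6: ambiguous if the top option's score falls below T_1.
    return {s for s, opts in options_per_char.items()
            if opts[0].score < t1}

def ambiguous_relative(options_per_char, t2):
    # Equation 7: ambiguous if the top two scores are closer than T_2.
    return {s for s, opts in options_per_char.items()
            if len(opts) > 1 and opts[0].score - opts[1].score < t2}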

4.1.2. Character-specific thresholding

The use of a single threshold for all characters may not be optimal. A natural progression from a single threshold value for all characters is to use character-specific thresholds T(c) (Equation 8) in place of a single absolute threshold:

D(S, V) = \{s \in S \mid V_1^p(s) < T(V_1^c(s))\}    (8)

For each character c ∈ C, we compute a threshold T(c) (Equation 18) that maximizes the system accuracy α(S(c), T(c)) (defined in Equations 16 and 17) computed over the set S(c), which consists of all characters in the training set S_T whose top option is recognized as c, i.e. S(c) = {s ∈ S_T | V_1^c(s) = c}. Each character in S(c) falls into one of three sets: S_1(c), S_N(c) and S_∞(c). Here, S_1(c) refers to those elements of S(c) whose first option is correct, S_N(c) to those that have the correct option within the first N options but not as the first option, and S_∞(c) to those that do not have the correct option within the first N options and therefore cannot be corrected by our system, since the correct option will not be passed on to the AVC.

S_1(c) = \{s \in S(c) \mid V_1^c(s) = G(s)\}    (9)

S_N(c) = \{s \in S(c) \mid \exists x : V_x^c(s) = G(s) \wedge 1 < x \le N\}    (10)

S_\infty(c) = \{s \in S(c) \mid \exists x : V_x^c(s) = G(s) \wedge x > N\}    (11)

Our aim is to pass as many elements of S_N(c) as possible to the AVC by tagging them as ambiguous; for the elements of S_1(c), the VTR output appears to be correct, and we do not wish to introduce errors by passing them to the AVC. For this, we need to find the character-specific threshold T(c) that maximizes the end-to-end character recognition accuracy α(S(c), T). To compute this accuracy, we define four subsets of S(c) (Equations 12-15), referred to as True Positive TP(c), False Negative FN(c), False Positive FP(c) and True Negative TN(c), given that our aim is to tag elements of S_1(c) as non-ambiguous and those of S_N(c) as ambiguous.

TP(c) = \{s \in S_1(c) \mid V_1^p(s) \ge T\}    (12)

FN(c) = \{s \in S_1(c) \mid V_1^p(s) < T\}    (13)


FP(c) = \{s \in S_N(c) \mid V_1^p(s) \ge T\}    (14)

TN(c) = \{s \in S_N(c) \mid V_1^p(s) < T\}    (15)

To calculate the character-level accuracy α(S(c), T) of the end-to-end character recognition system (Equations 16 and 17), we weight each of the above four subsets by the chance of selecting the correct option from it; it is this α(S(c), T) that we maximize to obtain the optimal character-specific threshold T(c). For TP(c), the correct option is in first place and the character is not passed to the AVC, so the chance of finally getting the correct option is 1. For FN(c) and TN(c), which are passed to the AVC and have the correct option within the top N options, the weight used is α_F^{N'}. Finally, for FP(c), the first option is not the correct option, and since the character is not passed to the AVC this error cannot be corrected.

\alpha(S(c), T) = \frac{1 \times |TP(c)| + \alpha_F^{N'} \times |FN(c)| + 0 \times |FP(c)| + \alpha_F^{N'} \times |TN(c)|}{|TP(c)| + |FN(c)| + |FP(c)| + |TN(c)|}    (16)

\alpha(S(c), T) = \frac{|TP(c)| + \alpha_F^{N'} \times (|FN(c)| + |TN(c)|)}{|TP(c)| + |FN(c)| + |FP(c)| + |TN(c)|}    (17)

T(c) = \arg\max_T \alpha(S(c), T)    (18)
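One plausible brute-force realization of the search in Equation 18 is sketched below (hypothetical Python; representing each training sample as (top_score, rank_of_truth) is our own simplification, where rank_of_truth is the 1-based rank of the ground truth among the video options, or None if it is absent):

def character_threshold(samples, alpha_f_n, n):
    # Search T(c) over the observed top scores, scoring each candidate
    # threshold with Equation 17.
    s1 = [sc for sc, r in samples if r == 1]                        # S_1(c)
    sn = [sc for sc, r in samples if r is not None and 1 < r <= n]  # S_N(c)
    denom = len(s1) + len(sn)   # Eq. 17 denominator: |S_1(c)| + |S_N(c)|
    if denom == 0:
        return 0.0
    best_t, best_alpha = 0.0, -1.0
    for t in sorted({sc for sc, _ in samples} | {0.0, 1.01}):
        tp = sum(1 for sc in s1 if sc >= t)   # Eq. 12: correct and kept
        fn = len(s1) - tp                     # Eq. 13: correct but forwarded
        fp = sum(1 for sc in sn if sc >= t)   # Eq. 14: wrong and kept (lost)
        tn = len(sn) - fp                     # Eq. 15: wrong and forwarded
        alpha = (tp + alpha_f_n * (fn + tn)) / denom   # Eq. 17
        if alpha > best_alpha:
            best_t, best_alpha = t, alpha
    return best_t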

4.2. Mapping

Often, the scores generated by the VTR for the video options are not the best basis for ambiguity detection and classifier combination. We may map (rescore) these values to a new set of values that allow better detection of ambiguous characters, better selection of the options forwarded to the AVC, and better combination with the ATR confidences, with the new normalized scores used in the sum-rule or as weights in the weighted sum. Some commonly used score normalization techniques are discussed in [11].

4.2.1. Score to conditional probability mapping

One mapping (in this case, rescoring) technique that we have used to recompute the score of each video option is to compute the conditional probability (Equations 19 and 20) for the video option and replace its score with this value. Suppose we are rescoring a segmented character s, and consider the i-th video option (V_i^c(s) = X, V_i^p(s) = Y). The new score V_i^{p'}(s) is the conditional probability (computed on the training set S_T) that the segmented character s really is an X given that the VTR score corresponding to X is Y:

V_i^{p'}(s) = \mathrm{Prob}\left(G(s) = X \mid V_i^p(s) = Y\right)    (19)

V_i^{p'}(s) = \frac{|\{s \in S_T \mid \exists j : G(s) = X \wedge V_j^c(s) = X \wedge V_j^p(s) = Y\}|}{|\{s \in S_T \mid \exists k : V_k^c(s) = X \wedge V_k^p(s) = Y\}|}    (20)

Since the scores of the options generated by the VTR have changed, the option set needs to be reordered in decreasing order of the new scores. Although the mapping stage is optional and may be replaced by the identity mapping V_i^{p'}(s) = V_i^p(s), the use of a suitable mapping technique may lead to significant improvements (results are discussed in Section 6). In some cases, such as when the training set is limited, it may be advisable to divide the entire range of scores into sub-ranges and compute the conditional probability given that V_i^p(s) falls in a certain sub-range, rather than given a specific value Y.
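A sketch of this rescoring under the binned (sub-range) variant suggested above (hypothetical Python; the (char, score, is_correct) training representation and the bin count are our own assumptions):

from collections import defaultdict

def fit_score_mapping(training_options, num_bins=10):
    # Estimate the Equation 20 conditional probability per (character,
    # score sub-range) cell on the training set S_T. Each training option
    # is (char, score, is_correct), where is_correct means G(s) == char.
    hits, totals = defaultdict(int), defaultdict(int)
    for char, score, is_correct in training_options:
        b = min(int(score * num_bins), num_bins - 1)
        totals[(char, b)] += 1
        hits[(char, b)] += int(is_correct)

    def remap(char, score):
        # New score V_i^{p'}(s); 0.0 for cells never seen in training.
        b = min(int(score * num_bins), num_bins - 1)
        t = totals[(char, b)]
        return hits[(char, b)] / t if t else 0.0
    return remap

After remapping, each option list must be re-sorted by the new scores, as noted above.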

4.3. Option selection

Using a suitable thresholding method, we can determine which characters are ambiguous and need to be passed to the AVC. The next question is how many video options should be passed to the AVC for each such ambiguous character. A discussion of the optimal number of options can be found in Section 6.

Simple option selection techniques (Equations 21, 22 and 23) include selecting the top NumOpt options, selecting the options whose score exceeds an absolute threshold AbsThr, and selecting the options whose score is within a relative threshold RelThr of the top option's score. In most of our experiments, we have used a fixed number of options. A good mapping technique may improve the chances of choosing the correct options.

O(s) = \{c \in C \mid \exists x : V_x^c(s) = c \wedge x \le NumOpt\}    (21)

O(s) = \{c \in C \mid \exists x : V_x^c(s) = c \wedge V_x^p(s) > AbsThr\}    (22)

O(s) = \{c \in C \mid \exists x : V_x^c(s) = c \wedge V_1^p(s) - V_x^p(s) < RelThr\}    (23)
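Sketches of Equations 21-23 (hypothetical Python, again assuming score-sorted VideoOption lists):

def select_top_n(opts, num_opt):
    # Equation 21: the top NumOpt options.
    return [o.char for o in opts[:num_opt]]

def select_absolute(opts, abs_thr):
    # Equation 22: options scoring above an absolute threshold.
    return [o.char for o in opts if o.score > abs_thr]

def select_relative(opts, rel_thr):
    # Equation 23: options within RelThr of the top option's score.
    return [o.char for o in opts if opts[0].score - o.score < rel_thr]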

5. A/V CLASSIFIER COMBINATION

Although the primary focus of this paper is to motivate, describe and evaluate the ambiguity detection methods used to improve character recognition accuracy, for completeness we briefly explain the audio-video combination techniques that were used along with the ambiguity detection methods to perform character disambiguation. The AVC component (see Figure 2) consists of two stages: synchronization followed by combination. Synchronization is necessary because, for a single video option, the ATR may return multiple audio options, from which we must select (synchronize) the most likely audio option corresponding to the video option. This synchronization is based on audio features such as the difference between the times of occurrence of the character in the video and audio components, the presence of neighboring characters from the video in the neighborhood of the audio occurrence, and the audio match score returned by the ATR. The combination stage generates the final recognized output based on the match scores generated by the two recognizers.
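The sketch below illustrates one possible shape of such a combination (hypothetical Python; the synchronization is reduced to taking the strongest word-spotting hit per character, a deliberate simplification of the feature-based synchronization described above, and the weights are placeholders):

def combine_av(video_opts, audio_hits_for, w_video=0.5, w_audio=0.5):
    # Weighted sum-rule over the forwarded video options: fuse each video
    # match score with the best audio match score for the same character
    # and return the character with the highest fused score.
    best_char, best_score = None, -1.0
    for v in video_opts:
        hits = audio_hits_for(v.char)   # ATR word-spotting hits for v.char
        a_score = max((h.score for h in hits), default=0.0)
        fused = w_video * v.score + w_audio * a_score
        if fused > best_score:
            best_char, best_score = v.char, fused
    return best_char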

Page 5: [IEEE 2011 17th International Conference on Digital Signal Processing (DSP) - Corfu, Greece (2011.07.6-2011.07.8)] 2011 17th International Conference on Digital Signal Processing (DSP)

We have experimented with several rank-level and measurement-level combination techniques [12], such as the Borda count (with a suitable tie-breaking strategy), the simple sum-rule, the weighted sum-rule with classifier-level weights, and the weighted sum-rule with character-level weights; results of preliminary experiments with these techniques are reported in Figure 8. Further details of the audio-video combination techniques used may be found in [5] and [10].
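For illustration, a generic Borda count over per-classifier rankings might look like this (hypothetical Python; the tie_break hook is our own placeholder for the tie-breaking strategy mentioned above):

def borda_count(rankings, tie_break=None):
    # Each classifier contributes a ranked list of candidate characters;
    # a candidate ranked r-th in a list of length n earns n - 1 - r points,
    # and the candidate with the most points overall wins.
    points = {}
    for ranking in rankings:
        n = len(ranking)
        for r, char in enumerate(ranking):
            points[char] = points.get(char, 0) + (n - 1 - r)
    key = (lambda kv: (kv[1], tie_break(kv[0]))) if tie_break else (lambda kv: kv[1])
    return max(points.items(), key=key)[0]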

6. EXPERIMENTS

Implementation & Setup. The current implementation of the system makes use of the GNU Optical Character Recognition tool [6] and the Nexidia word spotter [7]. Our video data set [13] was recorded in a classroom-like environment and consists of a set of character-specific videos (for estimating character-specific thresholds and weights) and a set of test videos (used for evaluation). While the set of character-specific videos is fairly extensive, with more than 4000 characters, the test videos, although substantial in number, are still a work in progress. The character set includes upper- and lower-case letters, digits and basic algebraic operators.

Ambiguity Detection. Figure 3 shows the results for a system that uses identity mapping followed by thresholding with a single absolute threshold; Figure 4 refers to a system that uses identity mapping followed by character-specific thresholds; and Figure 5 refers to one that uses the score-to-conditional-probability mapping followed by thresholding with a single absolute threshold. Option selection in all three experiments uses the top N options.

Simple Thresholding. Figure 3 shows that as the threshold increases, the ratio of ambiguous characters |S_D|/|S| increases, while the set of non-ambiguous characters S − S_D passed only through the VTR shrinks to a smaller set with a higher α_{V+M}(S − S_D). For a fixed value of α_F^{N'} = 0.7, the character recognition accuracy when using the top 3 and top 4 options is shown as α^3(S) and α^4(S), respectively. The maximum accuracy achieved with this technique is α^4(S) = 0.717 for α_F^{4'} = 0.7, an improvement of 0.018 over the original α_{V+M}(S) = 0.699. Here, the mapping used is the identity mapping, so the figure purely reflects the improvement achievable by simple (absolute) thresholding.

Character-specific Thresholding. The character-specific thresholding plot (Figure 4) is quite similar to Figure 3, except that we cannot plot against the threshold, since a different threshold is used for each character; because the character-specific thresholds are computed by maximizing α(S(c)) for the same value of α_F^{N'}, we use α_F^{N'} as the x-axis of the plot. We generate the plot for discrete values of α_F^{N'} between 0.4 and 0.9, represented by the point markers in the plot. Here, the maximum value of α^3(S) (with identity mapping and α_F^{N'} = 0.7) was 0.744, whereas that of the simple thresholding technique was α^3(S) = 0.717.

Mapping. Figure 5 was generated for a system that uses the mapping/rescoring method described in Section 4.2 along with simple thresholding. The plots show very similar behavior, with two exceptions: the values of α_{V+M} and α^3 corresponding to the maximum value of α show a significant increase over the simple thresholding case, reaching 0.906 and 0.754 respectively. Note that the value of α_{V+M} at a threshold of 0, which reflects the improvement caused by mapping alone, is 0.744, significantly higher than the accuracy without mapping, 0.699.

Option Selection. Figure 1 shows that using an additional classifier (an oracle, in this case) that can select the correct recognition output from the video options has the potential to improve recognition accuracy. It also shows that using an optimal number of video options together with an ambiguity detection method prevents degradation of the recognition accuracy. In Figures 3, 4 and 5, we can see that, as expected, the accuracy improves when more options are used, owing to a better chance of having the correct option among those selected (shown in Figure 7); note, however, that this holds for a fixed α_F^{N'} value. The α_F^{N'} term is defined as the probability of picking the correct option given N options, one of which is correct. It decreases as N increases, because a larger number of options gives the AVC too many incorrect options to choose from, and there is a higher chance that one of these incorrect options was actually spoken in the immediate neighborhood of the character under consideration, leading to errors from the ATR. If a purely random (uniform) selection system were used in place of the AVC, we would have α_F^{N'} = 1/N. The values of α_F^{N'} depend on the discriminative power of the ATR on the test set used. For our system, we observed a decrease in α_F^{N'} with increasing N (as expected) but were unable to test extensively enough to provide exact values. Since increasing N tends to increase R_{AC}^N (shown in Figure 7) but decreases α_F^{N'}, it becomes necessary to find the optimal value of N.

Figure 6 shows that, for all the techniques discussed above, the accuracy α_{V+M}(S − S_D) improves as the subset of non-ambiguous characters becomes smaller and more confident. Furthermore, both character-specific thresholding and mapping perform better than simple thresholding in terms of α_{V+M}(S − S_D). The relative flattening of the curve at lower |S − S_D|/|S| values occurs because the remaining non-ambiguous characters have top options of almost identical accuracy, with many score values nearly equally distributed in this range. For instance, with the simple thresholding method at |S − S_D|/|S| ≈ 0.6, only two distinct match score values (0.99 and 1.00) remain in the set for the top video option; the accuracy therefore stays the same while the fraction |S − S_D|/|S| decreases to ≈ 0.4 on account of the removal of a large set of characters with top scores equal to 0.99.

Figure 7 shows that as the ratio of ambiguous characters increases, many characters that could have been non-ambiguous start to fall into the ambiguous set, thereby increasing R_{AC}^N(S_D), the ratio of characters in the ambiguous set S_D that have the correct option within the top N options. Also, as N increases, the ratio R_{AC}^N(S_D) increases, since more options are being considered, leading to a higher chance of having the correct option within the top N options.

[Fig. 3. Simple thresholding: accuracy and fraction vs. threshold, plotting |S−S_D|/|S|, |S_D|/|S|, α_{V+M}(S−S_D), α^3(S) and α^4(S).]

[Fig. 4. Character-specific thresholding: accuracy and fraction vs. α_F^{N'}, plotting |S−S_D|/|S|, |S_D|/|S|, α_{V+M}(S−S_D), α^3(S) and α^4(S).]

[Fig. 5. Mapping: accuracy and fraction vs. threshold, plotting |S−S_D|/|S|, |S_D|/|S|, α_{V+M}(S−S_D), α^3(S) and α^4(S).]

[Fig. 6. Accuracy of VTR+Mapping: α_{V+M}(S−S_D) vs. the fraction |S−S_D|/|S| for simple thresholding (α_F^{N'} = 0.7), mapping (α_F^{N'} = 0.7) and character-specific thresholding.]

[Fig. 7. The ratio R_{AC}^N(S_D) vs. the fraction |S_D|/|S| for simple thresholding, plotting R_{AC}^3(S_D) and R_{AC}^4(S_D).]

[Fig. 8. Weighted combination: accuracy of different weighted combination methods (video recognizer only, sum-rule, classifier-specific weights, character-specific video weights, character-specific audio weights).]

Weighted Combination. Figure 8 shows how the recognition accuracy α(S) varies across different A/V combination weighting strategies, such as the sum-rule, classifier-specific weights and character-specific weights. For details of the weighting schemes used here, refer to [10]. Although the results reported in Figure 8 are based on a fairly small data set and therefore cannot be considered conclusive, they open up interesting possibilities for future work.

7. FUTURE WORK

In the future, we plan to extend the ambiguity detection and A/V classifier combination methods by taking into account the penalties associated with incorrectly labeling video recognizer output as non-ambiguous or ambiguous, by enabling automatic estimation of optimal values for the ambiguity thresholds, and by employing better mapping and option selection strategies. We are also interested in investigating better classifier combination strategies, as well as methods that can use the audio content to disambiguate the structure of the recognized mathematical content.

8. REFERENCES

[1] R. Zanibbi et al., "Recognizing mathematical expressions using tree transformation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 11, 2002.
[2] R. H. Anderson, "Two-dimensional mathematical notations," in Syntactic Pattern Recognition Applications (K. S. Fu, ed.), 1977.
[3] M. Wienecke et al., "Toward automatic video-based whiteboard reading," IJDAR, vol. 7, no. 2–3, 2005.
[4] L.-W. He and Z. Zhang, "Real-time whiteboard capture and processing using a video camera for remote collaboration," IEEE Trans. on Multimedia, vol. 9, no. 1, 2007.
[5] S. Vemulapalli and M. H. Hayes, "Using audio based disambiguation for improving handwritten mathematical content recognition in classroom videos," in DAS, 2010.
[6] "GOCR," http://jocr.sourceforge.net/.
[7] "Nexidia," http://www.nexidia.com/.
[8] S. Madhvanath and V. Govindaraju, "Serial classifier combination for handwritten word recognition," in ICDAR, 1995.
[9] Kr. Ianakiev and V. Govindaraju, "Architecture for classifier combination using entropy measures," in Multiple Classifier Systems, vol. 1857 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2000.
[10] S. Vemulapalli and M. H. Hayes, "Character disambiguation for audio-video based handwritten mathematical content recognition," in submission to ICIP, 2011 (under review).
[11] A. Jain et al., "Score normalization in multimodal biometric systems," Pattern Recognition, vol. 38, no. 12, 2005.
[12] S. Tulyakov et al., "Review of classifier combination methods," in Studies in Computational Intelligence: Machine Learning in Document Analysis and Recognition, 2008.
[13] "Classroom Videos Dataset," http://users.ece.gatech.edu/~smita/dataset/.