AMBIGUITY DETECTION METHODS FOR IMPROVING HANDWRITTEN MATHEMATICAL CHARACTER RECOGNITION ACCURACY IN CLASSROOM VIDEOS
Smita Vemulapalli1, Monson Hayes2
1School of Electrical & Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
2Advanced Imaging Science, Multimedia & Film, Chung-Ang University, Seoul, Korea
ABSTRACT
In classroom videos, recognizing mathematical content handwritten on the whiteboard presents a unique opportunity in the form of audio content spoken by the instructor. This recognized audio content can be used to improve character recognition accuracy by providing evidence in corroboration or contradiction of the output options generated by the primary, video-based recognizer. However, such audio-video based disambiguation also has the potential to introduce errors into what may have been correct output from the video-based recognizer. In this paper, we focus on improving character recognition accuracy by developing ambiguity detection methods that determine the set of potentially incorrect outputs from the video-based recognizer and, for each such output, the subset of possibly correct output options that must be forwarded for audio-video based character disambiguation. We propose, implement and evaluate a number of such ambiguity detection methods.
Index Terms— Classifier combination, character recog-
nition, multimodal, video processing, speech processing.
1. INTRODUCTION
The rapid proliferation of high-speed networks coupled with
a decline in the cost of storage has fueled an unprecedented
growth in the number of e-learning and advanced learning
initiatives. Initiatives like these rely on online delivery of
pre-recorded classroom videos as the primary medium of in-
struction or make them available online for reference by the
students. As the volume of such recorded video content in-
creases, it is becoming clear that in order to efficiently nav-
igate through the available classroom videos there is a need
for techniques that can help extract, identify and summarize
the content in such videos. In this context, and given the fact
that the whiteboard continues to be the preferred and effec-
tive medium for teaching complex mathematical and scien-
tific concepts, this paper focuses on the problem of recogniz-
ing handwritten mathematical content in classroom videos.
There is a significant body of research devoted to the topic
of mathematical content recognition [1, 2], and to the extrac-
tion and recognition of textual content from videos [3, 4]. Al-
though our work closely relates to and utilizes the advances
[Plot: character recognition accuracy vs. number of options (1–8) for three configurations: Video Recognizer + Oracle, Video Recognizer + A/V Combination, and Video Recognizer + Ambiguity + A/V Combination.]
Fig. 1. Importance of ambiguity detection
made in the aforementioned fields, its focus on the use of an
audio based recognizer in combination with a primary, video
based recognizer to possibly improve the character recogni-
tion accuracy offers a range of new and interesting challenges.
A particularly interesting challenge stems from the fact that a combination with the audio-based recognizer can introduce errors into the output of the video-based recognizer and even degrade the character recognition accuracy compared to that of the primary recognizer. In this paper, we propose a set of ambiguity detection methods that can be used not only to limit the errors introduced by the combination stage but also to increase the possibility of correcting the errors in the output of the video-based recognizer.
Our approach makes use of a video based recognizer
which, for each segmented character from the video, outputs
a set of possible video options; where each video option con-
tains a recognized character corresponding to the segmented
character and a video match score that corresponds to the
recognizer's belief in the correctness of the video option. In many cases, for example when a significantly high video match score is assigned to the top video option, or by using information derived from the recognizer's past runs over a user-specific training set, one can attempt to determine the most likely correct output from the set of video options generated by the video recognizer. The segmented characters that are thus considered to be correctly recognized are not forwarded to the combined recognizer, eliminating the possibility of any deterioration (which may be more likely in this case) or improvement in accuracy for this subset of characters. The remaining segmented characters are considered ambiguous and are forwarded for combination. In particular, we make the following contributions:
• Mathematical Model - A formal model capturing the var-
ious aspects of the combination system is presented and
used to describe ambiguity detection and classifier com-
bination.
• Ambiguity Detection - We propose simple and character-specific thresholding methods, as well as a method that relies on observed accuracy to map the confidence scores of the video options to new values.
• Option Selection - We suggest methods for selecting video
options for submission to the combination stage, and pro-
vide preliminary evidence of the impact of option selec-
tion on recognition accuracy.
2. SYSTEM OVERVIEW
Our system [5] for handwritten mathematical content recog-
nition in classroom videos is shown as a block diagram in
Figure 2. We are interested in the blocks that fall under the
audio-video based character recognition part. Our character
recognition system is organized as a two-stage assembly. The
first stage consists of a video text recognizer (VTR), based on
an open source character recognizer GOCR [6], and is used as
the primary character recognizer. The VTR is designed to re-
turn a list of possible characters and the corresponding video
match scores for each input character. The second stage con-
sists of the combination stage which includes the character
disambiguation block and the audio text recognizer (ATR).
The A/V Combination (AVC) module makes use of the ATR,
which in this case is the Nexidia word spotter [7], to disam-
biguate the output from the VTR to produce the final recog-
nized character. However, the AVC does not operate on all the input characters; it relies on the ambiguity detection module to detect potentially incorrect VTR output and, for such output, to forward only the subset of options that have a high likelihood of being the final correct output.
The VTR output that is classified as correct by the ambigu-
ity detection module is directly forwarded as an output of the
combined stage. The output of the combination stage acts as
an input to the structure analysis stage, which is presented
here for completeness. In this paper, we focus primarily on
the ambiguity detection module. While our existing implementation is based on GOCR and the Nexidia word spotter, the methods proposed in this paper are not tied to these implementations and can easily be extended for use with other such recognizers. Other multi-stage (serial) classifier combination systems include [8] and [9].
3. MATHEMATICAL MODEL
In this section, we present a formal model for the recognizers
which can be used to describe the ambiguity detection and
classifier combination approaches. We have also mathemati-
cally modeled the recognition accuracies for the separate rec-
ognizers (Equations 2 and 3) as well as for the end-to-end
system (Equations 4 and 5).
Let C represent the character set recognizable by the VTR (and the ATR), and let S be the set of segmented character elements s from the given test set of videos, such that each element s = (s_i, s_t, s_l) contains the segmented character's image s_i, timestamp s_t and location coordinates s_l, and let G be a function that returns its ground truth G(s) ∈ C. For future use, we define an Equals function E(c_1, c_2), where c_1, c_2 ∈ C:

E(c_1, c_2) = 1 if c_1 = c_2, and 0 otherwise.  (1)
The character recognizer used within the VTR is represented as V. For any segmented character s, the function V(s) = ((V^c_j(s), V^p_j(s)) | ∀ V^c_j(s) ∈ C ∧ V^p_j(s) ∈ [0, 1]) returns an ordered set of video options arranged in descending order of the score V^p_j(s). Here, V^c_j(s) represents the j-th video option's recognized character name and V^p_j(s) is the corresponding match score generated by the recognizer. In the absence of other inputs, V^c_1(s) is the final recognized output for the segmented character s. The character recognition accuracy of the VTR, α_V(S), can be computed as follows:

α_V(S) = ( Σ_{∀s∈S} E(V^c_1(s), G(s)) ) / |S|  (2)
The ambiguity detection function D(S, V) = S_D ⊂ S is designed to return S_D, a subset of characters from the set S that are categorized as ambiguous based on various conditions imposed on the output of the VTR V. Various methods for ambiguity detection are described in Section 4. Here, S − S_D represents the set of non-ambiguous characters.
We do not present the model for the ATR, referred to as A, as it is not relevant to this paper; one may refer to [10] for details. We model the combination/fusion function for the AVC as F(s, S, V, A) = c ∈ C, where s ∈ S_D; it makes use of the set of segmented characters S, the output of the VTR V and the ATR A to return the recognized character corresponding to the segmented character element s. The AVC algorithm used in our system is described in Section 5. The character recognition accuracy of the AVC algorithm α_F for the set of ambiguous characters S_D can now be expressed as:

α_F(S, V, A, S_D) = ( Σ_{∀s∈S_D} E(F(s, S, V, A), G(s)) ) / |S_D|  (3)

Finally, the character recognition accuracy α computed on the entire test set S, with the AVC employed on the set of ambiguous characters S_D and only the VTR employed on the set of non-ambiguous characters S − S_D, can be expressed as follows:

α(S, V, A, D) = ( α_V(S − S_D) × |S − S_D| + α_F(S, V, A, S_D) × |S_D| ) / |S|  (4)
Fig. 2. End-to-end system overview

The above equation for character recognition accuracy can be modified to show (in Equation 5) the contribution of each stage of ambiguity detection, namely mapping, thresholding and option selection. Here, the sets S − S_D and S_D are the result of thresholding, and they determine which characters are passed only through the VTR and which are forwarded to the AVC. α_{V+M} represents the recognition accuracy achievable at the end of the mapping stage; in the case of identity mapping (i.e., no mapping), α_{V+M} is the same as α_V. α_F^{N′} represents the constant recognition accuracy of the AVC, given that it is handed a set of N options out of which one is correct, and R^N_{AC} represents the ratio of ambiguous characters in S_D that have the correct option among the N options forwarded to the AVC. Therefore, the final character recognition accuracy when passing N options to the AVC is:

α^N(S) = ( α_{V+M}(S − S_D) × |S − S_D| + α_F^{N′} × R^N_{AC}(S_D) × |S_D| ) / |S|  (5)
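The decomposition in Equations 4 and 5 can be sketched in a few lines of Python; this is an illustrative sketch with hypothetical names and counts, not the system's implementation:

```python
# Illustrative sketch of the accuracy decomposition in Eqs. 4-5.
# The function name and the sample counts below are hypothetical.

def end_to_end_accuracy(alpha_vm, n_nonambiguous, alpha_f_n, r_ac_n, n_ambiguous):
    """Eq. 5: non-ambiguous characters keep the VTR(+mapping) output, while
    ambiguous ones are resolved by the AVC, which succeeds with probability
    alpha_f_n only when the correct option survives selection (ratio r_ac_n)."""
    total = n_nonambiguous + n_ambiguous
    correct = alpha_vm * n_nonambiguous + alpha_f_n * r_ac_n * n_ambiguous
    return correct / total

# Hypothetical split of a 1000-character test set:
acc = end_to_end_accuracy(0.85, 600, 0.7, 0.8, 400)  # ≈ 0.734
```

Setting r_ac_n = 1 and using the overall AVC accuracy for alpha_f_n recovers Equation 4.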
4. AMBIGUITY DETECTION
In our two-stage character recognition system, one of the most
important decisions is when to use the first stage alone and
when to make use of the second stage. If this task is done
well, we may see an end-to-end system accuracy that is higher
than the recognition accuracy achieved by passing the entire
test set through any one of the recognizers or through both
(without ambiguity detection). As can be seen in Figure 2, the
ambiguity detection module involves three main operations:
mapping, thresholding and option selection. These operations
are followed by a suitable A/V combination (AVC) strategy.
4.1. Thresholding
4.1.1. Simple Thresholding
A very simple ambiguity detection method (Equation 6) considers the segmented character element s ∈ S to be ambiguous if the score of the first video option V^p_1(s) is less than a predetermined absolute threshold value T_1.

D(S, V) = {s ∈ S | V^p_1(s) < T_1}  (6)

Another simple condition for ambiguity (Equation 7) compares the difference (or the ratio) between the scores of the first and second video options to a relative threshold value T_2. Note that both methods discussed above use a single suitable threshold for all characters.

D(S, V) = {s ∈ S | V^p_1(s) − V^p_2(s) < T_2}  (7)
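As a concrete illustration, the two tests in Equations 6 and 7 can be written as below; the option-list representation [(character, score), ...], sorted by descending score, is an assumption for the sketch, not the VTR's actual interface:

```python
# Sketch of the two simple ambiguity tests (Eqs. 6 and 7).
# An option list is modeled as [(char, score), ...], descending by score.

def ambiguous_abs(options, t1):
    """Eq. 6: ambiguous if the top option's score falls below T1."""
    return options[0][1] < t1

def ambiguous_rel(options, t2):
    """Eq. 7: ambiguous if the top two scores are closer than T2."""
    if len(options) < 2:
        return False
    return options[0][1] - options[1][1] < t2

def detect(segmented, vtr, t1):
    """D(S, V): subset of S tagged as ambiguous (absolute-threshold variant)."""
    return {s for s in segmented if ambiguous_abs(vtr(s), t1)}
```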
4.1.2. Character-specific thresholding
The use of a single threshold for all characters may not be the optimal thresholding technique. A natural progression from a single threshold value for all characters is the use of character-specific thresholds T(c) (Equation 8) in place of a single absolute threshold.

D(S, V) = {s ∈ S | V^p_1(s) < T(V^c_1(s))}  (8)
For each character c ∈ C, we compute a threshold T(c) (Equation 18) that maximizes the system accuracy α(S(c), T(c)) (defined in Equations 16 and 17) computed over the set S(c), which consists of all characters in the training set S_T whose top option claims to be a c, i.e., S(c) = {s ∈ S_T | V^c_1(s) = c}. Each character in S(c) falls into one of three sets: S_1(c), S_N(c) and S_∞(c). Here, S_1(c) refers to those elements of S(c) whose first option is correct, S_N(c) refers to those that have the correct option within the first N options but not as the first option, and S_∞(c) refers to those that do not have the correct option in the first N options and therefore cannot be corrected by our system, as they will not be passed on to the AVC.

S_1(c) = {s ∈ S(c) | V^c_1(s) = G(s)}  (9)
S_N(c) = {s ∈ S(c) | ∃x : V^c_x(s) = G(s) ∧ 1 < x ≤ N}  (10)
S_∞(c) = {s ∈ S(c) | ∃x : V^c_x(s) = G(s) ∧ x > N}  (11)
Our aim is to pass as many elements from S_N(c) to the AVC as possible by tagging them as ambiguous; for the elements in S_1(c), the VTR output appears to be correct, and we therefore do not wish to introduce errors by passing them to the AVC. For this we need to find the character-specific threshold value T(c) that maximizes the end-to-end character recognition accuracy α(S(c), T). To compute this accuracy, we define four subsets of S(c) (Equations 12 through 15), referred to as True Positive TP(c), False Negative FN(c), False Positive FP(c) and True Negative TN(c), given that our aim is to tag elements of S_1(c) as non-ambiguous and those of S_N(c) as ambiguous.

TP(c) = {s ∈ S_1(c) | V^p_1(s) ≥ T}  (12)
FN(c) = {s ∈ S_1(c) | V^p_1(s) < T}  (13)
FP(c) = {s ∈ S_N(c) | V^p_1(s) ≥ T}  (14)
TN(c) = {s ∈ S_N(c) | V^p_1(s) < T}  (15)
To calculate the character-level accuracy of the end-to-end character recognition system α(S(c), T) (Equations 16 and 17), we weight each of the above four subsets by the chance of selecting the correct option from it. It is this α(S(c), T) value that we maximize to obtain the optimal value of the character-specific threshold T(c). For TP(c), the correct option is in the first place and the character is not passed to the AVC, so the chance of finally getting the correct option is 1; for FN(c) and TN(c), since they are passed to the AVC and have the correct option within the top N options, the weight used is α_F^{N′}. Finally, for FP(c), the first option is not the correct option, and since the character is not passed to the AVC this error cannot be corrected.

α(S(c), T) = ( 1 × |TP(c)| + α_F^{N′} × |FN(c)| + 0 × |FP(c)| + α_F^{N′} × |TN(c)| ) / ( |TP(c)| + |FN(c)| + |FP(c)| + |TN(c)| )  (16)

α(S(c), T) = ( |TP(c)| + α_F^{N′} × (|FN(c)| + |TN(c)|) ) / ( |TP(c)| + |FN(c)| + |FP(c)| + |TN(c)| )  (17)

T(c) = argmax_T α(S(c), T)  (18)
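The threshold search of Equations 12 through 18 reduces to a one-dimensional sweep per character. A hedged Python sketch, assuming the top-option scores of the training samples have already been split into S_1(c) and S_N(c):

```python
# Sketch of the per-character threshold search (Eqs. 12-18).
# s1_scores / sn_scores: top-option scores of the samples in S1(c) / SN(c);
# these inputs, and the sentinel candidates, are illustrative assumptions.

def best_threshold(s1_scores, sn_scores, alpha_f_n):
    # Candidate thresholds: every observed score, plus sentinels below/above all.
    candidates = sorted(set(s1_scores) | set(sn_scores) | {0.0, 1.01})
    total = len(s1_scores) + len(sn_scores)
    best_t, best_acc = 0.0, -1.0
    for t in candidates:
        tp = sum(1 for v in s1_scores if v >= t)   # Eq. 12
        fn = len(s1_scores) - tp                   # Eq. 13
        tn = sum(1 for v in sn_scores if v < t)    # Eq. 15
        # Eq. 17: FP contributes 0; FN and TN are recovered with prob. alpha_f_n
        acc = (tp + alpha_f_n * (fn + tn)) / total
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

Only score values that actually occur need to be tried as candidates, since α(S(c), T) is piecewise constant between them.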
4.2. Mapping
Often, the scores generated by the VTR for the video options are not the best basis for ambiguity detection and classifier combination. We may map/rescore these values to a new set of values that enable better detection of ambiguous characters, better selection of options to forward to the AVC, and better combination with the ATR confidences, e.g., as normalized scores for the sum-rule or as weights for a weighted sum. Some commonly used score normalization techniques are discussed in [11].
4.2.1. Score to conditional probability mapping
One mapping (in this case, rescoring) technique that we have used to recompute the score for each video option is to compute the value of the conditional probability (defined in Equations 19 and 20) for the video option and replace its score by this conditional probability. Assume we are rescoring for a segmented character s, and consider the i-th video option (V^c_i(s) = X, V^p_i(s) = Y) to explain the conditional probability function. The new score/conditional probability value V^{p′}_i(s) is the conditional probability (computed on the training set S_T) that the segmented character s really is an X given that the VTR score corresponding to X is Y.

V^{p′}_i(s) = Prob( G(s) = X | V^p_i(s) = Y )  (19)

V^{p′}_i(s) = |{s ∈ S_T | ∃j : G(s) = X ∧ V^c_j(s) = X ∧ V^p_j(s) = Y}| / |{s ∈ S_T | ∃k : V^c_k(s) = X ∧ V^p_k(s) = Y}|  (20)
Since the scores of the options generated by the VTR have changed, the set needs to be reordered in decreasing order of the new scores. Although the mapping stage is optional and may be replaced by an identity mapping V^{p′}_i(s) = V^p_i(s), the use of a suitable mapping technique may lead to significant improvements (results discussed in Section 6). In some cases, such as when the training set is limited, it may be advisable to divide the entire range of scores into sub-ranges and compute the value of the conditional probability given that V^p_i(s) falls in a certain sub-range, instead of at a specific value Y.
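A sketch of this rescoring, using the binned (sub-range) variant suggested above for limited training sets; the (ground_truth, options) training layout and all names are illustrative assumptions:

```python
# Sketch of the score-to-conditional-probability mapping (Eqs. 19-20),
# with scores grouped into sub-ranges (bins). Training data is modeled as
# [(ground_truth, [(char, score), ...]), ...] -- an illustrative layout.
from collections import defaultdict

def build_mapper(training, n_bins=10):
    """Return a function (char, score) -> P(G(s) = char | score's bin)."""
    seen = defaultdict(int)     # (char, bin) -> times char appeared as an option
    correct = defaultdict(int)  # (char, bin) -> times that option was correct
    for truth, options in training:
        for char, score in options:
            b = min(int(score * n_bins), n_bins - 1)
            seen[(char, b)] += 1
            if char == truth:
                correct[(char, b)] += 1
    def mapper(char, score):
        b = min(int(score * n_bins), n_bins - 1)
        n = seen[(char, b)]
        return correct[(char, b)] / n if n else 0.0
    return mapper

def rescore(options, mapper):
    """Replace scores and re-sort in decreasing order of the new scores."""
    return sorted(((c, mapper(c, p)) for c, p in options),
                  key=lambda cp: cp[1], reverse=True)
```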
4.3. Option selection
Using a suitable thresholding method, we may determine
which characters are ambiguous and need to be passed to the
AVC. The next question is how many video options should
be passed to the AVC for each of these ambiguous characters.
The discussion for finding the optimal number of options can
be found in Section 6.
Simple option selection techniques (defined in Equations 21, 22 and 23) include selecting the top NumOpt options, selecting options whose score is greater than an absolute threshold AbsThr, and selecting options whose score relative to that of the top option is within a relative threshold RelThr. In most of our experiments, we have used a fixed number of options. A good mapping technique may improve the chances of choosing the correct options.

O(s) = {c ∈ C | ∃x : V^c_x(s) = c ∧ x ≤ NumOpt}  (21)
O(s) = {c ∈ C | ∃x : V^c_x(s) = c ∧ V^p_x(s) > AbsThr}  (22)
O(s) = {c ∈ C | ∃x : V^c_x(s) = c ∧ V^p_1(s) − V^p_x(s) < RelThr}  (23)
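The three selection rules of Equations 21 through 23 can be sketched as follows, again assuming a [(character, score), ...] option list sorted by descending score:

```python
# Sketch of the three option-selection rules (Eqs. 21-23) over an option
# list [(char, score), ...] sorted by descending score.

def top_n(options, num_opt):
    """Eq. 21: keep the first NumOpt options."""
    return [c for c, _ in options[:num_opt]]

def above_abs(options, abs_thr):
    """Eq. 22: keep options scoring above an absolute threshold."""
    return [c for c, p in options if p > abs_thr]

def near_top(options, rel_thr):
    """Eq. 23: keep options within RelThr of the top option's score."""
    top = options[0][1]
    return [c for c, p in options if top - p < rel_thr]
```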
5. A/V CLASSIFIER COMBINATION
Although the primary focus of this paper is to motivate, de-
scribe and evaluate the ambiguity detection methods used to
improve the character recognition accuracy, for complete-
ness we briefly explain the audio-video based combination
techniques that were used along with the ambiguity detec-
tion methods to perform character disambiguation. The AVC
component (see Figure 2) consists of two stages: synchro-
nization followed by combination. The synchronization is
necessary because for a single video option the ATR may
return multiple audio options from which we must select
(synchronize) the most likely audio option corresponding to
the video option. This synchronization is based on audio fea-
tures such as the difference between the time of occurrence of
the character in the video component and the audio compo-
nent, the presence of neighboring characters from the video
in the neighborhood of the audio occurrence, and the audio
match score returned by the ATR. The combination stage
generates the final recognized output based on the match
scores generated by the two recognizers.
We have experimented with several rank and measurement-
level combination techniques [12] such as Borda count (with
a suitable tie breaking strategy), simple sum-rule, weighted
sum-rule using classifier-level weights and weighted sum-rule
using character-level weights and the results of preliminary
experiments with these combination techniques have been
reported in Figure 8. Further details of audio-video combina-
tion techniques used may be found in [5] and [10].
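For illustration, two of the combination rules named above, the simple sum-rule and the Borda count, might look as follows when applied to already-synchronized candidate scores; the dictionary/list inputs and the alphabetical tie-breaker are stand-ins for the system's actual interfaces and tie-breaking strategy:

```python
# Sketch of two combination rules: weighted sum-rule over match scores and
# Borda count over rankings. Inputs are hypothetical stand-ins.

def sum_rule(video_scores, audio_scores, w_video=0.5, w_audio=0.5):
    """Weighted sum-rule over normalized per-candidate match scores."""
    chars = set(video_scores) | set(audio_scores)
    fused = {c: w_video * video_scores.get(c, 0.0)
                + w_audio * audio_scores.get(c, 0.0) for c in chars}
    return max(fused, key=fused.get)

def borda(video_ranking, audio_ranking):
    """Borda count: a candidate earns (n - rank) points per classifier;
    ties fall back on alphabetical order as a stand-in tie-breaker."""
    chars = set(video_ranking) | set(audio_ranking)
    n = len(chars)
    def points(ranking, c):
        return n - ranking.index(c) if c in ranking else 0
    return max(sorted(chars),
               key=lambda c: points(video_ranking, c) + points(audio_ranking, c))
```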
6. EXPERIMENTS
Implementation & Setup. The current implementation of the
system makes use of the GNU Optical Character Recogni-
tion tool [6] and the Nexidia word spotter [7]. Our video data
set [13] has been recorded in a classroom-like environment
and consists of a set of character-specific videos (for esti-
mating character-specific thresholds and weights) and a set of
test videos (used for evaluation). While the set of character-
specific videos is fairly extensive, with more than 4000 char-
acters, the test videos, although substantial in number, are still
a work in progress. The character set includes uppercase and lowercase letters, digits, and basic algebraic operators.
Ambiguity Detection. Figure 3 shows the results for a sys-
tem that uses identity mapping followed by thresholding us-
ing a single absolute threshold, Figure 4 refers to a system that
uses identity mapping followed by character specific thresh-
olds and Figure 5 refers to one that uses the score to condi-
tional probability mapping followed by thresholding with a
single absolute threshold. The option selection for all three
experiments involves the use of the top N options.
Simple Thresholding. Figure 3 shows that as the threshold increases, the ratio of ambiguous characters |S_D|/|S| increases, while the set of non-ambiguous characters S − S_D passed only through the VTR reduces to a smaller set with a higher α_{V+M}(S − S_D). For a fixed value of α_F^{N′} = 0.7, the character recognition accuracy when using the top 3 and top 4 options is shown as α^3(S) and α^4(S), respectively. The maximum accuracy achieved using this technique is α^4(S) = 0.717 for α_F^{4′} = 0.7 (an improvement of 0.018 over the original α_{V+M}(S) = 0.699). Here, the mapping used is an identity mapping, and therefore this figure purely reflects the improvement that can be achieved by simple (absolute) thresholding.
Character-specific Thresholding. The character-specific thresholding plot (Figure 4) is quite similar to Figure 3, except that we cannot plot against the threshold (as we use a different threshold for each character); since we have computed the character-specific thresholds by maximizing α(S(c)) for the same value of α_F^{N′}, we use α_F^{N′} as the x-axis of the plot. We generate the plot for discrete values of α_F^{N′} between 0.4 and 0.9, represented by the point markers in the plot. Here, the maximum value of α^3(S) (with identity mapping and α_F^{N′} = 0.7) was 0.744, while that of the simple thresholding technique was α^3(S) = 0.717.
Mapping. Figure 5 has been generated for a system that uses the mapping/rescoring method described in Section 4.2 along with simple thresholding. The plots show very similar behavior except for two things: the values of α_{V+M} and α^3 corresponding to the maximum value of α show a significant increase compared to the simple thresholding case, with values of 0.906 and 0.754, respectively. Note that the value of α_{V+M} corresponding to a threshold of 0, which reflects the improvement caused by mapping alone, is 0.744 and is significantly higher than the accuracy without mapping, 0.699.
Option Selection. Figure 1 shows that using an additional classifier (an oracle, in this case), which can select the correct recognition output from the video options, has the potential to improve recognition accuracy. It also shows that the use of an optimal number of video options and an ambiguity detection method prevents degradation in the recognition accuracy. In Figures 3, 4 and 5, we can see that, as expected, the accuracy improves when using more options, due to a higher chance of having the correct option among the selected options (shown in Figure 7); but note that this holds for the same α_F^{N′} value. The α_F^{N′} term is defined as the probability of picking the correct option given N options with one of them being correct; it decreases with an increase in N, because increasing the number of options leaves too many incorrect options for the AVC to choose from, and there is a higher chance that one of these incorrect options may actually be spoken in the immediate neighborhood of the character under consideration and therefore lead to errors from the ATR. If a purely random (uniform) selection system is used in place of the AVC, then α_F^{N′} = 1/N. The values of α_F^{N′} depend on the discriminative power of the ATR for the test set used. For our system, we have observed a decrease in α_F^{N′} with an increase in N (as expected) but were unable to test this extensively enough to provide exact values. Since increasing the value of N tends to increase the R^N_{AC} value (shown in Figure 7) but decreases α_F^{N′}, it becomes necessary to find the optimal value of N.
Figure 6 shows that for all the techniques discussed above, the accuracy α_{V+M}(S − S_D) improves with a smaller and more confident subset of non-ambiguous characters. Furthermore, both the character-specific thresholding and mapping methods perform better than simple thresholding in terms of α_{V+M}(S − S_D). The relative flattening of the curve for lower |S − S_D|/|S| values occurs because these non-ambiguous characters have top options with almost identical accuracy, and a large number of the top-option scores are concentrated in this range. For instance, in the case of the simple thresholding method with |S − S_D|/|S| ≈ 0.6, there are only two distinct match score values (0.99 and 1.00) remaining in the set for the top video option. Therefore, the accuracy remains the same while the fraction |S − S_D|/|S| decreases to ≈ 0.4 on account of the removal of a large set of characters with top scores equal to 0.99.
Figure 7 shows that as the ratio of ambiguous characters increases, many characters that could have been non-ambiguous start to fall into the ambiguous set, thereby increasing R^N_{AC}(S_D), which is the ratio of characters in the ambiguous set S_D with the correct option in the top N options. Also, as N increases, the ratio R^N_{AC}(S_D) increases, as more options are being considered, leading to a higher chance of having the correct option within the top N options.

Fig. 3. Simple thresholding (accuracy and the fractions |S − S_D|/|S|, |S_D|/|S| vs. absolute threshold)
Fig. 4. Character-specific thresholding (accuracy and fractions vs. α_F^{N′})
Fig. 5. Mapping (accuracy and fractions vs. absolute threshold)
Fig. 6. Accuracy of VTR+Mapping: α_{V+M}(S − S_D) vs. the fraction |S − S_D|/|S|
Fig. 7. The ratio R^N_{AC}(S_D) vs. the fraction |S_D|/|S|, for N = 3 and 4
Fig. 8. Weighted combination: accuracy of the different weighted combination methods
Weighted Combination. Figure 8 shows how the recognition
accuracy α(S) varies for different A/V combination weight-
ing strategies such as sum-rule, classifier-specific weights
and character-specific weights. For details on the weight-
ing schemes used here, refer to [10]. Although the results
reported in Figure 8 are based on a fairly small data set
and therefore cannot be considered conclusive, they open up
interesting possibilities for future work.
7. FUTURE WORK
In the future, we plan to extend the ambiguity detection and A/V classifier combination methods by taking into account the penalties associated with incorrectly labeling the video recognizer output as non-ambiguous or ambiguous, enabling automatic estimation of optimal values for the ambiguity thresholds, and employing better mapping and option selection strategies. We are also interested in investigating better classifier combination strategies and methods that can use the audio content to disambiguate the structure of the recognized mathematical content.
8. REFERENCES
[1] R. Zanibbi et al., “Recognizing mathematical expres-
sions using tree transformation,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 24, no. 11, 2002.
[2] R. H. Anderson, “Two-dimensional mathematical notations,” in Syntactic Pattern Recognition, Applications (K. S. Fu, ed.), 1977.
[3] M. Wienecke et al., “Toward automatic video-based
whiteboard reading,” IJDAR, vol. 7, no. 2–3, 2005.
[4] L.-W. He and Z. Zhang, “Real-time whiteboard capture
and processing using a video camera for remote collabo-
ration,” IEEE Trans. on Multimedia, vol. 9, no. 1, 2007.
[5] S. Vemulapalli and M. H. Hayes, “Using audio based
disambiguation for improving handwritten mathemati-
cal content recognition in classroom videos,” in DAS,
2010.
[6] “GOCR,” http://jocr.sourceforge.net/.
[7] “Nexidia,” http://www.nexidia.com/.
[8] S. Madhvanath and V. Govindaraju, “Serial classifier
combination for handwritten word recognition,” in IC-
DAR, 1995.
[9] Kr. Ianakiev and V. Govindaraju, “Architecture for clas-
sifier combination using entropy measures,” in Multiple
Classifier Systems, vol. 1857 of Lecture Notes in Com-
puter Science. Springer Berlin / Heidelberg, 2000.
[10] S. Vemulapalli and M. H. Hayes, “Character disam-
biguation for audio-video based handwritten mathemat-
ical content recognition,” in submission to ICIP, 2011
(Under review).
[11] A. Jain et al., “Score normalization in multimodal bio-
metric systems,” Pattern Recognition, vol. 38, no. 12,
2005.
[12] S. Tulyakov et al., “Review of classifier combination
methods,” in Studies in Computational Intelligence:
Machine Learning in Document Analysis and Recogni-
tion. 2008.
[13] “Classroom Videos Dataset,” http://users.ece.gatech.edu/~smita/dataset/.