Synchronization and Combination Techniques for Audio-Video Based Handwritten Mathematical Content Recognition in Classroom Videos

Smita Vemulapalli
Center for Signal & Image Processing, Georgia Institute of Technology, Atlanta, GA, USA

Monson Hayes
Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul, Korea
Abstract—Recognizing handwritten mathematical content is a challenging problem, and more so when such content appears in classroom videos. However, given the fact that in such videos the handwritten text and the accompanying audio refer to the same content, a combination of a video and an audio based recognizer has the potential to significantly improve the content recognition accuracy. In this paper, using a combination of video and audio based recognizers, we focus on improving the character recognition accuracy in such videos and propose: (1) synchronization techniques for establishing a correspondence between the handwritten and the spoken content, and (2) combination techniques for combining the outputs of the video and audio based recognizers. The current implementation of the system makes use of a modified open-source text recognizer and a commercially available phonetic word-spotter. For evaluation purposes, we use videos recorded in a classroom-like environment, and our experiments demonstrate the significant improvement (≈ 24% relative increase as compared to the baseline video based recognizer) in character recognition accuracy that can be achieved using our techniques.
Keywords: handwriting recognition, speech recognition, audio-video classifier combination
I. INTRODUCTION
Driven by recent technological advances and multiple social and economic factors, e-learning and distance education initiatives have witnessed a rapid proliferation and acceptance into the mainstream. Such e-learning initiatives often rely on pre-recorded or live videos from a classroom as the primary means of delivering content to the students, and the videos, almost invariably, are also made available online for reference by the students. As the volume of such recorded video content increases, efficient navigation through this content is becoming an important concern, which motivates the need for techniques that can be used to index and summarize the content in such videos. The techniques proposed in this paper, given the fact that mathematical and scientific concepts are more effectively taught using a whiteboard, focus on the problem of recognizing handwritten mathematical content in classroom videos.
There is a significant body of research devoted to the topic of mathematical content recognition [1], [2], and to the extraction and recognition of textual content from videos [3], [4]. Although our work is closely related to, and dependent on, the advances made in the aforementioned fields, its focus on the use of an audio based recognizer in combination with a primary, video based recognizer to improve the recognition of handwritten mathematical content in classroom videos offers a range of new and interesting challenges. In this paper, we focus on improving the character recognition accuracy in such videos, and propose techniques for establishing a correspondence between the outputs of the two recognizers and techniques for combining such synchronized output.
Our approach (shown in Figure 1) makes use of a video text recognizer (VTR) which extracts, segments and outputs a set of video options for every segmented character from the input classroom video. Each video option generated by the VTR consists of a character and the corresponding match score, where the match score represents the recognizer's belief in the correctness of the video option. In some cases the match score for the top video option may be high enough, in absolute and/or relative terms, to accept the corresponding character as the recognized output, but in other cases the choice for the final recognized output may not be clear, and we consider such characters to be ambiguous [5]. For ambiguous characters (determined by the ambiguity detection stage), our approach makes use of an audio text recognizer (ATR) and synchronization techniques to locate, for each video option, a subset of candidate audio options, which together with the video options are used by the combination stage to generate the set of recognized characters (an example is shown in Figure 2). In an end-to-end setup, the recognized characters are routed to an A/V based structure analysis stage which generates the final output. In this paper, we focus on the synchronization and combination stages and make the following specific contributions:
• Audio-Video Synchronization - In classroom videos, given the occlusions, the shadows, and the fact that a character that appears on the whiteboard may not appear in the spoken content at all, or may appear several seconds later, establishing a correspondence between the audio and the video occurrence of a character can be very challenging. We propose a number of heuristics for establishing this correspondence.
• Audio-Video Combination - The correctness of the final output determined by combining the synchronized video and audio options depends on a large number of factors, e.g., the video and audio options' match scores, the synchronization accuracy, the recognizers' accuracy for specific characters, etc. To improve the A/V combination accuracy, we investigate various rank and measurement-based techniques, both at the classifier and the character level, and also explore the use of an ensemble of audio-video based recognizers.

Figure 1. End-to-end system overview
II. MATHEMATICAL MODEL
Data Set. Let $C$ represent the character set recognizable by the VTR and $S$ be the set of segmented character elements $s$ from the given test set of videos, such that each element $s = (s^i, s^t, s^l)$ contains the segmented character's image $s^i$, timestamp $s^t$ and location coordinates $s^l$, and let $G$ be a function that returns its ground truth $G(s) \in C$. For future use, we define an equals function $E(c_1, c_2)$ which maps to 1 for $c_1 = c_2$ and to 0 otherwise. Here, $c_1, c_2 \in C$.
VTR. For any segmented character $s$, the VTR function $V$ is defined as $V(s) = ((V^c_j(s), V^p_j(s)) \mid \forall j,\ V^c_j(s) \in C \wedge V^p_j(s) \in [0, 1])$. The function $V(s)$ returns an ordered set of video options arranged in decreasing order of $V^p_j(s)$. Here, $V^c_j(s)$ represents the $j$th video option's recognized character name and $V^p_j(s)$ is the corresponding match score generated by the VTR. In the absence of other inputs, $V^c_1(s)$ is the final recognized output for the segmented character $s$.
ATR. The output of the ATR for the $j$th input video option $V_j(s)$ is an ordered set of audio options, represented as $A_j(s) = (A_{j,k}(s))$, where $A_{j,k}(s)$ is the $k$th audio option corresponding to the $j$th video option. Each audio option can further be represented as $A_{j,k}(s) = ((A^c_{j,k}(s), A^p_{j,k}(s), A^t_{j,k}(s)) \mid A^c_{j,k}(s) \in C \wedge A^p_{j,k}(s) \in [0, 1])$. Here, $A^c_{j,k}(s)$ represents the character name of the $j$th video option's $k$th audio option (the same as $V^c_j(s)$ in our implementation), $A^p_{j,k}(s)$ is the corresponding match score generated by the ATR, and $A^t_{j,k}(s)$ is the corresponding time of occurrence in the audio segment. The above set is arranged in decreasing order of the score $A^p_{j,k}(s)$.
Ambiguity Detection. The ambiguity detection function $D(S) = S_D \subset S$ is designed to return $S_D$, the subset of characters from the set $S$ that are categorized as ambiguous based on various conditions imposed on the VTR output $V$. The set of non-ambiguous characters is $S - S_D$.
Synchronization. The A/V synchronization function $Y$ is defined as $Y(s) = (A_{j,k'}(s))$, where $k'$ is the index of the audio option that we consider to be best synchronized with the $j$th video option of the segmented character $s$. Section III explains some synchronization techniques and also shows how $k'$ may be determined.
Combination. The A/V combination function $Z$ is defined as $Z(s) = V^c_{j'}(s) \in C$, where $j'$ is the index of the video option that is considered to be correct by the system at the end of the A/V combination process. Section IV describes how $j'$ may be computed using simple rank and measurement-level combination techniques.
Evaluation. Finally, the character recognition accuracy $\alpha$ of the end-to-end system, computed on the entire test set $S$ with A/V combination employed on $S_D$ and purely video based character recognition employed on $S - S_D$, can be expressed as follows:

$$\alpha(S) = \frac{|S - S_D| \times \alpha_V(S - S_D) + |S_D| \times \alpha_Z(S_D)}{|S|} \quad (1)$$

$$\alpha_V(S - S_D) = \frac{\sum_{\forall s \in S - S_D} E(V^c_1(s), G(s))}{|S - S_D|} \quad (2)$$

$$\alpha_Z(S_D) = \frac{\sum_{\forall s \in S_D} E(Z(s), G(s))}{|S_D|} \quad (3)$$
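For concreteness, these definitions translate directly into code. The following minimal Python sketch computes $\alpha(S)$ as in Equations 1-3; the callables `is_ambiguous`, `vtr_top_char`, `combine` and `ground_truth` are hypothetical stand-ins for the functions $D$, $V^c_1$, $Z$ and $G$ defined above.

```python
def accuracy(test_set, is_ambiguous, vtr_top_char, combine, ground_truth):
    """End-to-end character recognition accuracy alpha(S) of Equations 1-3."""
    ambiguous = [s for s in test_set if is_ambiguous(s)]      # S_D
    clear = [s for s in test_set if not is_ambiguous(s)]      # S - S_D

    # Equation 2: purely video based accuracy on the non-ambiguous set
    correct_v = sum(vtr_top_char(s) == ground_truth(s) for s in clear)
    # Equation 3: A/V combination accuracy on the ambiguous set
    correct_z = sum(combine(s) == ground_truth(s) for s in ambiguous)

    # Equation 1: weighted average over the entire test set S
    return (correct_v + correct_z) / (len(clear) + len(ambiguous))
```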
III. A/V SYNCHRONIZATION
The accuracy of the proposed end-to-end recognition
system is critically dependent on the ability of the solution
to identify the segment in the audio which corresponds to
the handwritten content, termed A/V synchronization. In
this paper, we assume that everything that is handwritten
is spoken and focus on handling synchronization issues that
arise due to occlusions, shadows and the skew between the
writing and the utterance of a character. In this section, we
first discuss the factors that affect synchronization and then
describe the A/V synchronization techniques.
A. Factors Affecting Synchronization
Video Timestamping Accuracy. Our recognition system relies heavily on an automatically determined (by the preprocessing stage) video timestamp $TS^a_V(s)$ associated with each handwritten character $s$. The video timestamping accuracy is a measure of how often $TS^a_V(s)$ falls within a preset time window around $TS^m_V(s)$, where $TS^m_V(s)$ is the manually determined video timestamp; it roughly corresponds to the time instant when the character becomes fully visible to a student watching the video. The timestamping accuracy depends on the quality of the preprocessing algorithms and tends to deteriorate due to shadows and occlusions caused by the instructor. We have observed that a very simple (poor) timestamping technique that uses a single binarization threshold tends to have a much higher timestamping difference, $\delta_T = |TS^a_V(s) - TS^m_V(s)|$, in regions where there are shadows caused by the instructor. When using a more sophisticated timestamping technique (good TS) with multiple binarization thresholds, we observe that $\delta_T$ is low over the entire region of the whiteboard. The $\delta_T$ value discussed here is shown in Figure 3 for both good TS and poor TS for a single video file (between two complete erasures of the whiteboard) with good recording alignment (RA). In the poor TS plot, the high $\delta_T$ observed in the second half of the video corresponds to characters written in the lower half of the whiteboard and is due to the shadows caused by the instructor.

Figure 2. Example: A/V synchronization and combination. The figure depicts the intermediate output from various components of the A/V based recognition system. Given an input video, the preprocessing stage outputs a set of segmented and timestamped characters that form the input to the character recognizer, which, for each segmented character, produces a set of video options. The figure then shows the various audio options that are generated by the ATR for each video option; one such audio option is chosen by the synchronization stage and is forwarded for combination. We show the output of three different A/V combination techniques for two sample characters.
Figure 3. Timestamping difference $\delta_T = |TS^a_V(s) - TS^m_V(s)|$ (sec) versus character index, for poor and good timestamping.
A/V Recording Alignment Factor. The A/V recording alignment factor is a measure of how well the audio and video in the recording are aligned. It is inversely proportional to the average value of the recording time difference, $\delta_R = |TS^m_V(s) - TS^m_A(s)|$, where $TS^m_A(s)$ is the manually determined audio timestamp and corresponds to the time instant when the character $s$ is spoken by the instructor. We define $TS^w_V(s)$ to represent the time instant when the character $s$ is actually written on the whiteboard; this is defined for theoretical purposes, as it cannot be determined by observing the video due to the presence of occlusions. The $\delta_R$ value depends on how close in time the actual writing and the utterance of a character are, $|TS^w_V(s) - TS^m_A(s)|$, and also on the amount of occlusion, which introduces an additional delay $|TS^w_V(s) - TS^m_V(s)|$. Figure 4 plots $\delta_R$ against the character index for two separate video files, one with good RA and the other with poor RA. Both videos have been timestamped using the good TS method.
Figure 4. Recording alignment difference $\delta_R = |TS^m_V(s) - TS^m_A(s)|$ (sec) versus character index, for poor and good A/V recording alignment.
VTR Accuracy. Since neighbors are a very important feature for synchronization, the reliability of the neighbors obtained from the VTR output is also important. If the accuracy of the VTR is low, then the neighbors have a higher chance of being incorrectly recognized, which impacts the correctness of the neighbor based feature used by the synchronization techniques and results in wrong A/V synchronization.
ATR Accuracy. The ATR may fail to recognize an utterance
corresponding to a character or in some cases the ATR
may incorrectly detect an utterance when one is not present.
Such ATR errors may lead to synchronization errors, either
directly for the character in question or indirectly for the
neighboring characters.
B. Synchronization Techniques
The basic audio features that form the basis for the
proposed synchronization techniques are described below.
Other audio features related to the order of neighbors and
the character repetitions have not been discussed here as
they do not appear in the results reported in this paper.
F1 - the audio match score generated by the ATR.
F2 - the automatic video timestamp minus the automatic audio timestamp. The audio pruning window used is represented as $[P_B, P_A]$, where $P_B$ and $P_A$ refer to the amount of time before and after the audio timestamp of the audio option under consideration.
F3 - the number of neighbors from a certain window $[N_B, N_A]$ in the video which are also found in a certain audio time window $[W_B, W_A]$, where $N_B$ and $N_A$ refer to the number of video neighbors before and after the video option under consideration, and $W_B$ and $W_A$ refer to the amount of time before and after the audio timestamp of the audio option under consideration.
A/V Time Difference Based Technique. This technique
operates by first performing simple pruning of the audio
options based on the audio features F1 and F2, followed
by the selection of the audio option with the lowest value
for F2. Mathematically, this can be represented as follows.
$$Y(s) = (A_{j,k'}(s)) \;\big|\; k' = \arg\min_k \big(F_2(A_{j,k}(s))\big) \;\wedge\; F_1(A_{j,k}(s)) > P_P \;\wedge\; P_B \le F_2(A_{j,k}(s)) \le P_A \quad (4)$$

where $P_P$ is the pruning threshold on the audio match score $F_1$.
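Read literally, Equation 4 amounts to a filter followed by an argmin. The sketch below assumes each audio option is a (match score, audio time) pair; the parameter names (`p_score` for $P_P$, `p_before` and `p_after` for $[P_B, P_A]$) are illustrative, not from the paper.

```python
def sync_time_difference(video_ts, audio_options, p_score, p_before, p_after):
    """Pick the audio option best synchronized with a video option (Equation 4)."""
    def f2(opt):
        # F2: automatic video timestamp minus automatic audio timestamp
        return video_ts - opt[1]

    # Prune options with a weak match score (F1) or outside the time window
    pruned = [opt for opt in audio_options
              if opt[0] > p_score and p_before <= f2(opt) <= p_after]
    if not pruned:
        return None  # no audio option survives pruning
    # Select the option with the lowest A/V time difference
    return min(pruned, key=f2)
```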
A/V Neighbor Based Technique. This technique, like the
one described above, performs simple pruning followed by
the selection of the audio option with the highest value for
F3. It is mathematically expressed as follows.
$$Y(s) = (A_{j,k'}(s)) \;\big|\; k' = \arg\max_k \big(F_3(A_{j,k}(s))\big) \;\wedge\; F_1(A_{j,k}(s)) > P_P \;\wedge\; P_B \le F_2(A_{j,k}(s)) \le P_A \quad (5)$$
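The neighbor count $F_3$ is the only non-obvious quantity in Equation 5. One possible way to compute it, assuming the VTR's top characters for the $N_B$ preceding and $N_A$ following segmented characters are available along with the full list of ATR detections as (character, time) pairs (all names here are illustrative):

```python
def neighbor_count(neighbor_chars, atr_detections, audio_time, w_before, w_after):
    """F3: how many video neighbors are also heard near this audio option."""
    count = 0
    for ch in neighbor_chars:
        # A neighbor matches if the ATR heard the same character inside the
        # window [audio_time + W_B, audio_time + W_A], e.g. [W_B, W_A] = [-8, 8]
        if any(c == ch and audio_time + w_before <= t <= audio_time + w_after
               for c, t in atr_detections):
            count += 1
    return count
```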
Feature Rank Sum Based Technique. After initial pruning, feature ranks ($R_1$, $R_2$, $R_3$) are assigned to the different audio features ($F_1$, $F_2$, $F_3$) based on the rank of each audio feature within the pruned set of audio options for that particular video option. For instance, a value of $R_1 = 1$ for an audio option indicates that the corresponding feature $F_1$ has the maximum value in the pruned set. Once the ranks have been assigned, the audio option with the minimum value of the rank sum (i.e., $R_1 + R_2 + R_3$) is chosen as the output; $R_2$ is used for tie-breaking.
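A minimal sketch of the rank-sum selection, applied to an already-pruned option list, is shown below. The paper does not spell out the rank direction for every feature; the sketch assumes higher is better for $F_1$ and $F_3$ and lower is better for $F_2$, consistent with the two techniques above.

```python
def sync_rank_sum(pruned_options):
    """Feature rank sum synchronization over (f1, f2, f3) feature triples."""
    def ranks(values, descending):
        # rank 1 = best value according to the chosen direction
        order = sorted(range(len(values)), key=lambda i: values[i],
                       reverse=descending)
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    r1 = ranks([o[0] for o in pruned_options], descending=True)   # F1: max best
    r2 = ranks([o[1] for o in pruned_options], descending=False)  # F2: min best
    r3 = ranks([o[2] for o in pruned_options], descending=True)   # F3: max best

    # Minimum rank sum wins; R2 breaks ties, as in the paper
    best = min(range(len(pruned_options)),
               key=lambda i: (r1[i] + r2[i] + r3[i], r2[i]))
    return pruned_options[best]
```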
IV. A/V COMBINATION
After A/V synchronization, each of the video options for a character has at most one corresponding audio option. The A/V combination stage generates the final recognized output based on the match scores generated by the two recognizers as well as the computed audio features. Classifier combination approaches have also been applied in several related contexts, for instance, to improve the recognition accuracy of handwriting recognizers [6], speech recognizers [7] and a combination of the two [8]. We have experimented with several rank-level and measurement-level decision making/combination techniques [9], such as rank sum and the weighted sum rule using classifier-specific weights and character-specific weights, and have also combined an ensemble of A/V combination classifiers.
A. Combination Techniques
Rank Based Techniques. Simple rank sum or weighted
rank sum may be used for the purpose of combination. This
needs to be accompanied by a suitable tie-breaking strategy.
One of the advantages of using a rank based technique here
is that there is no need to compute weights for the audio
and video components or normalize the scores generated
by different classifiers. We have implemented rank sum for
different subsets of the audio and video recognizer scores
and the audio features.
Classifier-Specific Weight Based Techniques. Apart from the simple sum rule, we have made use of a set of classifier-specific weights $[w_V, w_A]$ to combine the scores generated by each of the recognizers; the weight $w_V$ is for the VTR and $w_A$ is for the ATR. The results of weighted combination for different values of the classifier-specific weights are shown in Table II.
$$Z(s) = V^c_{j'}(s) \;\big|\; j' = \arg\max_j \big(w_V \times V^p_j(s) + w_A \times A^p_{j,k'}(s)\big) \;\wedge\; V^c_j(s) \in O(s) \quad (6)$$

where $O(s)$ is the subset of video options retained by the option selection stage [5].
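A sketch of Equation 6 follows. The triple layout for the options and the treatment of video options without a surviving synchronized audio option (scored as 0.0) are our assumptions, not prescribed by the paper.

```python
def combine_weighted(options, w_video, w_audio):
    """Classifier-specific weighted combination (Equation 6).

    options: (char, video_score, audio_score) triples, assumed to be the
             subset O(s) retained by option selection; audio_score is
             A^p_{j,k'}(s), or 0.0 when no audio option survived pruning.
    """
    # Pick the video option maximizing the weighted sum of the two scores
    best = max(options, key=lambda o: w_video * o[1] + w_audio * o[2])
    return best[0]
```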
Character-Specific Weight Based Techniques. As the accuracy of the VTR and the ATR may be significantly different for each character, the use of a different set of weights for every character could prove to be advantageous. One possible way to compute the video weight $w_V(V^c_j(s))$ in Equation 7 for the $j$th video option of the character $s$ would be to use the VTR's accuracy-related metrics, such as precision and sensitivity, that correspond to the character label $V^c_j(s)$. The audio weight $w_A(V^c_j(s))$ can either be computed in a similar fashion for the ATR, with both weights normalized to sum to 1, or simply assigned as $w_A(V^c_j(s)) = 1 - w_V(V^c_j(s))$.

$$Z(s) = V^c_{j'}(s) \;\big|\; j' = \arg\max_j \big(w_V(V^c_j(s)) \times V^p_j(s) + w_A(A^c_{j,k'}(s)) \times A^p_{j,k'}(s)\big) \;\wedge\; V^c_j(s) \in O(s) \quad (7)$$
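The precision based variant of Equation 7 could then look as follows; the `vtr_precision` table (per-character VTR precision measured on a training set) is a hypothetical input.

```python
def combine_char_weighted(options, vtr_precision):
    """Character-specific weighted combination (Equation 7)."""
    def weighted(option):
        char, v_score, a_score = option
        w_v = vtr_precision[char]   # w_V(c): precision based video weight
        w_a = 1.0 - w_v             # w_A(c) = 1 - w_V(c)
        return w_v * v_score + w_a * a_score

    # Pick the video option with the highest character-weighted score
    return max(options, key=weighted)[0]
```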
Classifier Ensemble Based Techniques. This is a two-level combination technique. First, we create an ensemble of classifiers (the first level of combination) by using different audio and video combination weights, different subsets of the audio feature set, and different combination techniques. We then prune away those classifiers that show no potential to improve the final accuracy when evaluated on a training set. Finally, we perform the second level of combination, i.e., combining the outputs of the ensemble of classifiers. A number of combination techniques such as majority voting, rank sum and character-specific classifier selection may be employed. We have used rank sum as the second-level combination technique, and the results are shown in Section V.
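A sketch of the second-level rank-sum combination is given below; how characters that are absent from one classifier's ranking are penalized is our assumption.

```python
from collections import defaultdict

def combine_ensemble(classifier_rankings):
    """Second-level rank-sum combination over an ensemble of classifiers.

    classifier_rankings: one ranked character list per first-level
    classifier, best candidate first.
    """
    candidates = {c for ranking in classifier_rankings for c in ranking}
    rank_sum = defaultdict(float)
    for ranking in classifier_rankings:
        for c in candidates:
            # 1-based rank, or one past the end if the classifier omits c
            rank = ranking.index(c) + 1 if c in ranking else len(ranking) + 1
            rank_sum[c] += rank
    # The character with the smallest total rank wins
    return min(candidates, key=lambda c: rank_sum[c])
```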
V. EXPERIMENTAL RESULTS
Implementation & Setup. While the system is not tied to a specific character recognizer or speech recognizer, our current implementation makes use of an extended version of the GNU Optical Character Recognition (GOCR) [10] tool and the Nexidia word-spotter [11]. Our video data set [12] has been recorded in a classroom-like environment and consists of a fairly large training and test dataset with more than 6000 characters and more than 100 videos. The character set includes uppercase and lowercase letters, numbers and basic algebraic operators.

Table I. SYNCHRONIZATION TECHNIQUES

Technique             Pruning Window   Neighbor Audio Time   Synchronization Accuracy (%)
                      [P_B, P_A]       Window [W_B, W_A]     Good TS,    Poor TS,    Good TS,
                                                             Good RA     Good RA     Poor RA
A/V Time Difference   [-1, 5]          -                     86.4        56.6        23.1
A/V Time Difference   [-6, 20]         -                     77.3        60.2        41.0
A/V Neighbor          [-6, 20]         [-4, 4]               63.6        47.0        28.2
A/V Neighbor          [-6, 20]         [-8, 8]               65.9        55.4        43.6
Feature Rank Sum      [-1, 5]          [-8, 8]               86.4        56.6        23.1
Feature Rank Sum      [-6, 20]         [-8, 8]               81.8        65.1        46.2
Synchronization. We evaluated the synchronization techniques (Table I) proposed in the paper using videos with good audio-video recording alignment (RA) that were timestamped using both the poor and the good timestamping (TS) methods. While videos with poor TS or poor RA resulted in regions with large values of $\delta_T$ and $\delta_R$ (≈ 10-20 seconds), the videos with both good TS and good RA have fairly small values for these time differences (≈ 2-5 seconds). For all the experiments discussed here, we have used the neighbor window $[N_B, N_A] = [2, 2]$, i.e., two neighbors before and two neighbors after the character under consideration. We have used different pruning time windows and neighbor audio time windows; for example, $[P_B, P_A] = [-6, 20]$ refers to 6 seconds before and 20 seconds after, and $[W_B, W_A] = [-8, 8]$ refers to 8 seconds before and after the audio timestamp of the character under consideration.
A/V Time Difference Based Technique. The A/V time difference based technique outperformed the A/V neighbor based technique for videos with good TS and good RA, as the best audio option is most likely to be in the vicinity due to the very small $\delta_T$ and $\delta_R$ values. For videos with poor RA (i.e., a large value of $\delta_R$), we notice a significant deterioration in synchronization accuracy. For videos with good RA but poor TS, we saw a deterioration in accuracy, but not as significant as for the videos with poor RA; this is because only the characters in the lower half of the whiteboard were affected by the shadows. Notice that a small pruning window $[P_B, P_A] = [-1, 5]$ for this technique (and the feature rank sum technique) results in a very high synchronization accuracy (≈ 86.4%), but significantly deteriorates the accuracy of the other techniques, because most of the correct audio options were pruned away by the small pruning window.
A/V Neighbor Based Technique. For videos with poor RA, neighbors become very important, and we can see that this technique outperforms the time difference technique. The improvement was not very significant because the VTR accuracy for the system is 55.0%, so the neighbors were not very reliable. Therefore, the next step for neighbor based techniques would be to select the best neighbors for synchronization; non-ambiguous characters and characters that are usually well recognized by the ATR (without too many false positives or negatives) would be good choices. One should also observe that for this technique the size of the neighbor audio window $[W_B, W_A]$ is crucial, and the synchronization deteriorates if this window is too small ([-4, 4]).
Feature Rank Sum Based Technique. The feature rank sum based technique outperforms both the time difference and neighbor based techniques for most videos. For example, feature rank sum gave a synchronization accuracy of 81.8%, compared to 77.3% (time difference based) and 65.9% (neighbor based), for good TS and good RA with $[P_B, P_A] = [-6, 20]$ and $[W_B, W_A] = [-8, 8]$. We utilize this synchronization technique for the combination experiments below.
Combination. Table II shows the results for various A/V
combination techniques described in Section IV.
Baseline and Ambiguity Detection. We observe that the VTR (without the ambiguity detection and the A/V combination) provides a character recognition accuracy of 55.0%. This forms the baseline for the comparisons that follow. For each of the reported experiments, we performed ambiguity detection and option selection [5], both with a relative threshold of 0.9. This divides the test set $S$ into two subsets: ambiguous characters $S_D$ (27.9 + 8.6 + 8.8 + 32.0 = 77.3% of $S$) and non-ambiguous characters $S - S_D$ (18.5 + 4.1 = 22.6% of $S$). The VTR accuracy for the non-ambiguous character set, $\alpha_V(S - S_D)$, is 81.7%, and for the ambiguous character set, $\alpha_V(S_D)$, it is 47.1%. It is this $\alpha_V(S_D)$ value that we hope to boost by replacing it with $\alpha_Z(S_D)$ in order to improve $\alpha(S)$.
Rank Based Techniques. Table II shows several different rank based techniques. The rank sum for these experiments was based on different subsets of the audio features $F_1$, $F_2$, $F_3$ and the VTR match score $V$. We observe that the rank sum based on $V$ and $F_1$ has the highest end-to-end system accuracy, 59.9%, among these rank based techniques. Notice that the other audio features do not seem to significantly impact the combination stage, which can be attributed to the fact that the test set used here is reasonably well aligned.
Classifier-Specific Weight Based Techniques. For classifier-specific weights (Cfr. Wts.), we used different combinations of $[w_V, w_A]$, and the results tabulated in Table II show that the optimal value of the weights is close to [0.8, 0.2], with $\alpha(S) = 61.9\%$. It is interesting to see how the four ratios of ambiguous characters that are correctly/wrongly recognized by the VTR and the combined recognizer (AV) change as the weights are varied.
Table II. COMBINATION TECHNIQUES

                           Non-Ambiguous Chars. S-S_D        Ambiguous Chars. S_D                                                All S
Technique                  #V_C   #V_W   α_V(S-S_D)          (V_C,AV_C)  (V_C,AV_W)  (V_W,AV_C)  (V_W,AV_W)  α_V(S_D)  α_Z(S_D)  α(S)
VTR (†)                    55.0   45.0   55.0                -           -           -           -           -         -         55.0
RankSum [V,F1,F2,F3]       18.5   4.1    81.7                27.9        8.6         8.8         32.0        47.1      47.5      55.2
RankSum [V,F1,F2]          18.5   4.1    81.7                30.9        5.5         8.6         32.3        47.1      51.1      58.0
RankSum [V,F1] (‡)         18.5   4.1    81.7                29.8        6.6         11.6        29.3        47.1      53.6      59.9
Cfr. Wts. [0.1,0.9]        18.5   4.1    81.7                26.8        9.7         12.2        28.7        47.1      50.4      57.5
Cfr. Wts. [0.2,0.8]        18.5   4.1    81.7                27.1        9.4         12.2        28.7        47.1      50.7      57.7
Cfr. Wts. [0.3,0.7]        18.5   4.1    81.7                27.9        8.6         12.2        28.7        47.1      51.8      58.6
Cfr. Wts. [0.4,0.6]        18.5   4.1    81.7                28.2        8.3         12.2        28.7        47.1      52.1      58.8
Cfr. Wts. [0.5,0.5]        18.5   4.1    81.7                28.2        8.3         12.2        28.7        47.1      52.1      58.8
Cfr. Wts. [0.6,0.4]        18.5   4.1    81.7                29.3        7.2         12.2        28.7        47.1      53.6      59.9
Cfr. Wts. [0.7,0.3]        18.5   4.1    81.7                30.9        5.5         11.9        29.0        47.1      55.4      61.3
Cfr. Wts. [0.8,0.2] (♠)    18.5   4.1    81.7                32.3        4.1         11.0        29.8        47.1      56.1      61.9
Cfr. Wts. [0.9,0.1]        18.5   4.1    81.7                34.8        1.7         7.5         33.4        47.1      54.6      60.8
Chr. Wts. Sensitivity      18.5   4.1    81.7                28.2        8.3         12.2        28.7        47.1      52.1      58.8
Chr. Wts. Precision (♣)    18.5   4.1    81.7                28.5        8.0         17.4        23.5        47.1      59.3      64.4
Ensemble {†,♠,♣}           18.5   4.1    81.7                -           -           -           -           47.1      57.1      62.7
Ensemble {†,‡,♠,♣}         18.5   4.1    81.7                -           -           -           -           47.1      61.1      65.7
Ensemble {‡,♠,♣}           18.5   4.1    81.7                -           -           -           -           47.1      64.3      68.2

All quantities are percentages; #V_C, #V_W and the (·,·) columns are fractions of |S|. V_C: correct VTR output; V_W: wrong VTR output; AV_C: correct A/V combination output; AV_W: wrong A/V combination output. The markers †, ‡, ♠, ♣ identify the classifiers combined in the ensemble rows.
For high values of $w_V$, the final AV output tends to be very similar to the VTR output, resulting in high ratios for $(V_C, AV_C)$ and $(V_W, AV_W)$ and a much smaller ratio of characters changing from $V_C$ to $AV_W$ or from $V_W$ to $AV_C$. As expected, the reverse trend holds when $w_A$ takes a higher value. The improvement in the final result comes from reducing the number of characters in $(V_C, AV_W)$.
Character-Specific Weight Based Techniques. We have generated character-specific weights (Chr. Wts.) in two ways: the first uses the VTR sensitivity of each character as $w_V$, and the second uses the VTR precision of each character as $w_V$; in both cases, $w_A = 1 - w_V$. Between these two techniques, the precision based weights perform better, with an $\alpha(S)$ of 64.4%. Here, the improvement mostly comes from increasing the number of characters in $(V_W, AV_C)$.
Classifier Ensemble Based Techniques. Finally, for ensemble based techniques, Table II shows that we need to select the correct set of classifiers to combine, and that increasing the number of classifiers does not necessarily increase the final system accuracy. For ensemble based techniques, we see a very significant improvement in the end-to-end character recognition accuracy: the maximum accuracy achieved was 68.2%, which is a 13.2% absolute and a 24.0% relative improvement over the VTR baseline.
VI. CONCLUSIONS & FUTURE WORK
In this paper, we focused on techniques for A/V synchronization and A/V combination that can assist in improving the character recognition accuracy for handwritten mathematical content in classroom videos. Going forward, we plan to carefully examine the impact of the errors introduced by each component in this multi-stage recognition system, to develop synchronization techniques that take into consideration both the accuracy and the order of the neighbors, and to develop techniques for audio-assisted structure disambiguation.
REFERENCES
[1] R. Zanibbi et al., "Recognizing mathematical expressions using tree transformation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 11, 2002.
[2] R. H. Anderson, "Two-dimensional mathematical notations," Syntactic Pattern Recognition Applications (K. S. Fu, ed.), 1977.
[3] M. Wienecke et al., "Toward automatic video-based whiteboard reading," IJDAR, vol. 7, no. 2-3, 2005.
[4] L.-W. He and Z. Zhang, "Real-time whiteboard capture and processing using a video camera for remote collaboration," IEEE Transactions on Multimedia, vol. 9, no. 1, 2007.
[5] S. Vemulapalli and M. Hayes, "Ambiguity detection methods for improving handwritten mathematical character recognition accuracy in classroom videos," in DSP, 2011.
[6] W. Wang et al., "Combination of multiple classifiers for handwritten word recognition," in IWFHR, 2002.
[7] J. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in ASRU, 1997.
[8] J. Hunsinger and M. Lang, "A speech understanding module for a multimodal mathematical formula editor," in ICASSP, 2000.
[9] S. Tulyakov et al., "Review of classifier combination methods," in Studies in Computational Intelligence: Machine Learning in Document Analysis and Recognition, 2008.
[10] "GOCR," http://jocr.sourceforge.net/.
[11] "Nexidia," http://www.nexidia.com/.
[12] "Classroom Videos Dataset," http://users.ece.gatech.edu/~smita/dataset/.