2011 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PacRim)
Towards Audio-Video Based Handwritten Mathematical Content Recognition in
Classroom Videos
Smita Vemulapalli
Center for Signal & Image Processing
Georgia Institute of Technology, Atlanta, GA
Monson Hayes
Advanced Imaging Science, Multimedia and Film
Chung-Ang University, Seoul, Korea
Abstract
Recognizing handwritten mathematical content in class-
room videos poses a range of interesting challenges. In this
paper, we focus on improving the character recognition ac-
curacy in such videos using a combination of video and au-
dio based text recognizers. We propose a two-step assembly
consisting of a video text recognizer (VTR) as the primary
character recognizer and an audio text recognizer (ATR) for
disambiguating, if needed, the output of the VTR. We pro-
pose techniques for (1) detecting ambiguity in the output of
the VTR so that a combination with the ATR may be trig-
gered only for ambiguous characters, (2) synchronizing the
output of the two recognizers for enabling combination, and
(3) combining the options generated by the two recogniz-
ers using measurement and rank based methods. We have
implemented the system using an open source implementa-
tion of a character recognizer and a commercially available
phonetic word-spotter. Through experiments conducted us-
ing video recorded in a classroom-like environment, we
demonstrate the improvement in the character recognition
accuracy that can be achieved using our approach.
1 Introduction
Recent years have witnessed a rapid increase in the num-
ber of e-learning and advanced learning initiatives that ei-
ther use classroom videos as the primary medium of in-
struction or make them available online for reference by the
students. As the volume of such recorded video content
increases, it is amply clear that in order to efficiently navi-
gate through the available classroom videos, there is a need
for techniques that can help extract, identify and summa-
rize the content in such videos. In this context, and given
the fact that the whiteboard continues to be the preferred
and effective medium for teaching complex mathematical
and scientific concepts, this paper focuses on improving the
character recognition accuracy associated with handwritten
mathematical content in classroom videos.
Our solution for improving the character recognition ac-
curacy makes use of a combination of video and audio
based text recognizers. The video text recognizer (VTR)
acts as the primary character recognizer and the subse-
quent step of combination with the audio text recognizer
(ATR) is triggered only for the VTR outputs that are de-
termined to be ambiguous. While our research utilizes
the advances made across several areas of signal process-
ing research like extraction and recognition of textual con-
tent from videos [12, 5], mathematical content recognition,
speech recognition and classifier combination, our focus on
the recognition of handwritten mathematical content from
classroom videos and the use of accompanying audio con-
tent to assist in recognition pose a range of new and inter-
esting challenges. Specifically, to address the challenge of
improving the character recognition accuracy using a com-
bination of video and audio based text recognizers, we make
the following contributions:
• Ambiguity Detection - In the proposed solution, trigger-
ing a combination with the audio for every output of the
VTR may sometimes result in degrading the recognition
accuracy due to errors in the ATR output. We propose
techniques for restricting the combination to VTR out-
puts that have a higher chance of being erroneous.
• Audio-Video Synchronization - Synchronizing an occur-
rence of a character in the video to its occurrence in the
audio stream is needed for enabling combination and
can be a very challenging problem. We propose and
evaluate a number of synchronization techniques.
• Audio-Video Combination - Determining the final out-
put from the options generated by the VTR and the ATR
is dependent on a large number of factors such as the
option match scores, synchronization, character-specific
VTR and ATR accuracies, etc. We investigate various
measurement and rank-level combination techniques.
2 System Overview
The end-to-end recognition system (block diagram
shown in Figure 1) performs handwritten mathematical
content recognition from classroom videos in three dis-
978-1-4577-0253-2/11/$26.00 ©2011 IEEE
Figure 1. System overview. The lower half of the figure depicts the intermediate output from various components of the system: the segmented characters and timestamps, the video text recognizer output [character, video match-score], the ambiguity flag (Y/N), the synchronized audio text recognizer output [character, audio time, audio match-score], and the disambiguation output. Given an input video, the preprocessing stage outputs a set of segmented and timestamped characters that form the input to the character recognizer, which, for each segmented character, produces a set of video options. The ambiguity detection stage is used to determine the set of ambiguous video options, which undergo a combination with the audio options generated by the audio text recognizer and selected by the synchronization module to produce the final output. The tick marks denote the correct output.
tinct stages: the video preprocessing stage, the audio-video
(A/V) based character recognition stage and the A/V based
structure analysis stage. The video preprocessing stage in-
cludes all processing required to extract text regions (here,
mathematical content) from the video, segment characters
using a component labeling algorithm and also generate in-
formation such as the timestamp and the location of each
segmented character. The A/V based character recognition
stage performs video based character recognition followed
by audio-assisted character disambiguation. Similarly, the
A/V based structure analysis stage (briefly described here
for completeness), first performs video based structure anal-
ysis (example techniques include [13, 1]) and then audio-
assisted structure disambiguation. In this paper, we focus almost exclusively on the character disambiguation component, which involves three main tasks: ambiguity detection, A/V synchronization and A/V combination, discussed in detail in the following sections.
3 Mathematical Model
Data Set. Let $C$ represent the character set recognizable by the VTR and $S$ be the set of segmented character elements $s$ from the given test set of videos, such that each element $s = (s^i, s^t, s^l)$ contains the segmented character's image $s^i$, timestamp $s^t$ and location coordinates $s^l$, and $G$ is a function that returns its ground truth $G(s) \in C$. For future use, we define an equals function $E(c_1, c_2)$ which maps to 1 for $c_1 = c_2$ and 0 otherwise. Here, $c_1, c_2 \in C$.
VTR. For any segmented character $s$, the VTR function $V$ is defined as $V(s) = ((V^c_j(s), V^p_j(s)) \mid \forall V^c_j(s) \in C \wedge V^p_j(s) \in [0,1])$. The function $V(s)$ returns an ordered set of video options arranged in decreasing order of the score $V^p_j(s)$. Here, $V^c_j(s)$ represents the $j$th video option's recognized character name and $V^p_j(s)$ is the corresponding match score generated by the recognizer. In the absence of other inputs, $V^c_1(s)$ is the final recognized output for the segmented character $s$.
ATR. The output of the ATR for the $j$th input video option $V_j(s)$ is an ordered set of audio options and is represented as $A_j(s) = (A_{j,k}(s))$, where $A_{j,k}(s)$ is the $k$th audio option corresponding to the $j$th video option. Each audio option can further be represented as $A_{j,k}(s) = ((A^c_{j,k}(s), A^p_{j,k}(s), A^t_{j,k}(s)) \mid A^c_{j,k}(s) \in C \wedge A^p_{j,k}(s) \in [0,1])$. Here, $A^c_{j,k}(s)$ represents the character name of the $j$th video option's $k$th audio option, which is the same as $V^c_j(s)$ in our implementation, $A^p_{j,k}(s)$ is the corresponding match score generated by the ATR and $A^t_{j,k}(s)$ is the corresponding time of occurrence in the audio segment. The above set is arranged in decreasing order of the score $A^p_{j,k}(s)$.
Ambiguity Detection. The ambiguity detection function $D(S) = S_D \subset S$ is designed to return $S_D$, a subset of characters from the set $S$ that are categorized as ambiguous based on various conditions imposed on the VTR output $V$. The set of non-ambiguous characters is $S - S_D$. Methods for ambiguity detection are described in Section 4.
Synchronization. The A/V synchronization function $Y$ is defined as $Y(s) = (A_{j,k'}(s))$, where $k'$ is the index of the audio option that we consider to be best synchronized with the $j$th video option of the segmented character $s$. Section 5 explains some synchronization techniques and also shows how $k'$ may be determined.
Combination. The A/V combination function $Z$ is defined as $Z(s) = V^c_{j'}(s) \in C$, where $j'$ is the index of the video option that is considered to be correct by the system at the end of the A/V combination process. Section 6 describes how $j'$ may be computed using simple rank and measurement-level combination techniques.
Evaluation. Finally, the character recognition accuracy $\alpha$ of the end-to-end system, computed on the entire test set $S$ with A/V combination employed on the set of ambiguous characters $S_D$ and purely video based character recognition employed on the set of non-ambiguous characters $S - S_D$, can be expressed as:

$$\alpha(S) = \frac{|S - S_D| \times \alpha_V(S - S_D) + |S_D| \times \alpha_Z(S_D)}{|S|} \quad (1)$$

The character recognition accuracy of the VTR for the set of non-ambiguous characters $S - S_D$ is represented as $\alpha_V(S - S_D)$ and can be computed as:

$$\alpha_V(S - S_D) = \frac{1}{|S - S_D|} \sum_{\forall s \in S - S_D} E(V^c_1(s), G(s)) \quad (2)$$

The character recognition accuracy $\alpha_Z$ of the A/V combination based system for the set of ambiguous characters $S_D$ can be expressed as:

$$\alpha_Z(S_D) = \frac{1}{|S_D|} \sum_{\forall s \in S_D} E(Z(s), G(s)) \quad (3)$$
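The evaluation in Eqs. (1)-(3) can be sketched as follows; this is an illustrative reconstruction, not the authors' code, and the function and variable names (`system_accuracy`, `vtr_top1`, `combine`) are assumptions.

```python
# Hypothetical sketch of Eqs. (1)-(3): end-to-end accuracy as a size-weighted
# average of VTR-only accuracy on non-ambiguous characters and A/V-combination
# accuracy on ambiguous ones.

def equals(c1, c2):
    """E(c1, c2): 1 if the two character labels match, else 0."""
    return 1 if c1 == c2 else 0

def system_accuracy(samples, ambiguous, vtr_top1, combine, ground_truth):
    """samples: list of segmented characters S; ambiguous: subset S_D.
    vtr_top1(s) -> V^c_1(s); combine(s) -> Z(s); ground_truth(s) -> G(s)."""
    s_d = [s for s in samples if s in ambiguous]
    s_v = [s for s in samples if s not in ambiguous]
    # Eq. (2): VTR accuracy on the non-ambiguous set S - S_D
    alpha_v = sum(equals(vtr_top1(s), ground_truth(s)) for s in s_v) / max(len(s_v), 1)
    # Eq. (3): A/V combination accuracy on the ambiguous set S_D
    alpha_z = sum(equals(combine(s), ground_truth(s)) for s in s_d) / max(len(s_d), 1)
    # Eq. (1): size-weighted average over the whole test set S
    return (len(s_v) * alpha_v + len(s_d) * alpha_z) / len(samples)

# Toy example: 4 characters, 2 ambiguous; everything is recognized correctly
# except one ambiguous character.
truth = {1: "a", 2: "b", 3: "c", 4: "d"}
vtr   = {1: "a", 2: "b", 3: "c", 4: "d"}
avc   = {3: "c", 4: "x"}
acc = system_accuracy([1, 2, 3, 4], {3, 4}, vtr.get, avc.get, truth.get)
print(acc)  # 0.75
```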
4 Ambiguity Detection
Ambiguity detection involves the detection of characters
whose VTR output has a higher chance of being erroneous
and passing their video options (a selected subset) to the
A/V synchronization and A/V combination (AVC) compo-
nents for improving the recognition accuracy. Our aim here
is to increase the possibility of correcting VTR errors as
well as to limit errors that may be introduced at the combi-
nation stage. Ambiguity detection can be divided into the
three main tasks of mapping, thresholding and option selec-
tion which are described below.
4.1 Mapping
When the VTR generated video option scores are not the
best basis for ambiguity detection and classifier combina-
tion, we may map/rescore these values to a new set of val-
ues. Some commonly used score normalization techniques
have been discussed in [7].
Score to conditional probability mapping. Assume we are rescoring a segmented character $s$ and consider the $i$th video option $(V^c_i(s) = X, V^p_i(s) = Y)$. The new score, $V^{p'}_i(s)$, is the conditional probability that the segmented character $s$ is $X$ given that the VTR score corresponding to $X$ is $Y$:

$$V^{p'}_i(s) = \mathrm{Prob}\left(G(s) = X \mid V^p_i(s) = Y\right) \quad (4)$$

which is estimated from a video training set $S_T$ as follows:

$$V^{p'}_i(s) = \frac{|\{s \in S_T \mid \exists j : G(s) = X \wedge V^c_j(s) = X \wedge V^p_j(s) = Y\}|}{|\{s \in S_T \mid \exists k : V^c_k(s) = X \wedge V^p_k(s) = Y\}|} \quad (5)$$

When the training set is limited, we compute $V^{p'}_i(s)$ for sub-ranges of values instead of specific values of $Y$.
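A minimal sketch of this mapping, with scores binned into sub-ranges as suggested for small training sets, might look as follows; the function names and the bin scheme are our assumptions, not the paper's implementation.

```python
# Illustrative sketch of the score-to-conditional-probability mapping of
# Eqs. (4)-(5), estimated per (character, score sub-range) bin.

def build_mapping(training, num_bins=10):
    """training: iterable of (recognized_char, score, ground_truth_char)
    drawn from every video option in the training set S_T.
    Returns a table: (char, bin) -> P(G(s) = X | V^p(s) in bin)."""
    hits, totals = {}, {}
    for char, score, truth in training:
        key = (char, min(int(score * num_bins), num_bins - 1))
        totals[key] = totals.get(key, 0) + 1
        if char == truth:
            hits[key] = hits.get(key, 0) + 1
    return {k: hits.get(k, 0) / n for k, n in totals.items()}

def rescore(mapping, char, score, num_bins=10):
    """Map a raw VTR score to the estimated conditional probability."""
    key = (char, min(int(score * num_bins), num_bins - 1))
    return mapping.get(key, 0.0)

# Toy training set: the VTR claims "1" with a score near 0.95 four times,
# and is right three times out of four.
train = [("1", 0.95, "1"), ("1", 0.96, "1"), ("1", 0.97, "l"), ("1", 0.94, "1")]
m = build_mapping(train)
print(rescore(m, "1", 0.95))  # 0.75
```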
4.2 Thresholding
Simple Threshold Based Methods. Simple threshold techniques are used to determine whether a segmented character $s \in S$ is ambiguous based on one or more thresholding conditions that are applied to all the characters. The use of an absolute threshold $T_1$ and a relative threshold $T_2$ are given in Eqs. 6 and 7, respectively:

$$D(S, V) = \{s \in S \mid V^p_1(s) < T_1\} \quad (6)$$

$$D(S, V) = \{s \in S \mid V^p_2(s) / V^p_1(s) > T_2\} \quad (7)$$
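These two rules can be sketched in a few lines; this is an illustrative reading of Eqs. (6)-(7), with the threshold defaults taken from the experiments in Section 7 and the function names our own.

```python
# Minimal sketch of the simple thresholding rules: a character is flagged
# ambiguous when its top score is low (absolute threshold T1) or when the
# runner-up score is close to the top score (relative threshold T2).

def ambiguous_absolute(options, t1=0.90):
    """options: VTR output as [(char, score), ...] sorted by score desc."""
    return options[0][1] < t1                                   # Eq. (6)

def ambiguous_relative(options, t2=0.90):
    return len(options) > 1 and options[1][1] / options[0][1] > t2  # Eq. (7)

# One of the Figure 1 examples: "b" scores 1.00 but "6" is a close 0.99,
# so the relative test flags it while the absolute test does not.
opts = [("b", 1.00), ("6", 0.99), ("L", 0.81)]
print(ambiguous_absolute(opts))  # False: top score 1.00 >= 0.90
print(ambiguous_relative(opts))  # True: 0.99 / 1.00 > 0.90
```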
Character-Specific Threshold Based Methods. To exploit the fact that different characters are recognized with different accuracies, we perform thresholding using a different threshold value $T(V^c_1(s))$ for each character:

$$D(S, V) = \{s \in S \mid V^p_1(s) < T(V^c_1(s))\} \quad (8)$$
For each character $c \in C$, the character-specific threshold $T(c)$ is found that maximizes the character-level accuracy of the end-to-end character recognition system $\alpha(S(c), T)$, computed over the set $S(c)$ which consists of all characters in the training set $S_T$ that claim to be a $c$, i.e., $S(c) = \{s \in S_T \mid V^c_1(s) = c\}$. For mathematical simplicity, we assume that only the top $N$ options are passed to the AVC. We calculate $\alpha(S(c), T)$ as the weighted sum of the accuracies of the four disjoint subsets of $S(c)$, namely $TP(c)$, $FN(c)$, $TN(c)$ and $FP(c)$. The set of True Positives $TP(c)$ contains those characters that are correctly recognized by the VTR and also tagged as non-ambiguous and not passed to the AVC; the accuracy for this subset is therefore 1. The set of False Negatives $FN(c)$ refers to those characters that are correctly recognized by the VTR but tagged as ambiguous, and the set of True Negatives $TN(c)$ refers to those that are not correctly recognized by the VTR but contain the correct option within the top $N$ options and are tagged as ambiguous. Both these sets are passed to the AVC and have an $\alpha'_N$ chance of finally getting the correct video option chosen. Finally, the set of False Positives $FP(c)$ refers to those that are not correctly recognized but have the correct video option within the top $N$ options and have been tagged as non-ambiguous and therefore not passed to the AVC. Here, $\alpha'_N$ represents the constant recognition accuracy of the AVC, assuming that it is given a set of $N$ options out of which one is the correct option. Those characters for which the correct option is not found in the top $N$ options cannot be corrected by our system, as they are not passed to the AVC. Good mapping and option selection techniques reduce the number of characters that fall in this category. We calculate $\alpha(S(c), T)$ and the character-specific threshold $T(c)$ as follows:

$$\alpha(S(c), T) = \frac{1 \times |TP(c)| + \alpha'_N \times (|FN(c)| + |TN(c)|) + 0 \times |FP(c)|}{|TP(c)| + |FN(c)| + |TN(c)| + |FP(c)|} \quad (9)$$

$$T(c) = \underset{T}{\mathrm{argmax}}\ (\alpha(S(c), T)) \quad (10)$$
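The threshold search of Eqs. (9)-(10) can be sketched as a grid search; this is a hedged reconstruction where the constant AVC accuracy `alpha_avc` stands in for the paper's $\alpha'_N$, the grid is ours, and tuples of precomputed flags replace the actual recognizer outputs.

```python
# Hedged sketch of the character-specific threshold search: for each
# candidate T, split S(c) into TP/FN/TN/FP per the text, score the split
# with Eq. (9), and keep the T that maximizes it (Eq. (10)).

def split_accuracy(samples, t, alpha_avc):
    """samples: list of (top_score, vtr_correct, correct_in_top_n) for all
    s in S(c), i.e. training characters whose top VTR option is c."""
    tp = fn = tn = fp = 0
    for score, vtr_ok, in_top_n in samples:
        flagged = score < t                  # Eq. (8) with threshold t
        if vtr_ok and not flagged:
            tp += 1                          # correct and kept: accuracy 1
        elif vtr_ok and flagged:
            fn += 1                          # correct but sent to the AVC
        elif in_top_n and flagged:
            tn += 1                          # wrong, but fixable by the AVC
        elif in_top_n and not flagged:
            fp += 1                          # wrong and never sent: accuracy 0
    total = tp + fn + tn + fp
    return (tp + alpha_avc * (fn + tn)) / total if total else 0.0   # Eq. (9)

def best_threshold(samples, alpha_avc=0.7, grid=None):
    grid = grid or [i / 20 for i in range(21)]
    return max(grid, key=lambda t: split_accuracy(samples, t, alpha_avc))  # Eq. (10)

# Two training characters: one confidently correct, one low-scored and wrong
# but recoverable; the best threshold separates them.
samples = [(0.95, True, True), (0.5, False, True)]
print(best_threshold(samples))  # 0.55
```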
4.3 Option Selection
For each ambiguous character, we need to find the right
subset of video options that should be passed to the AVC.
One needs to determine a subset which has a high chance of
having the correct output, but limits the possibility of an in-
correct recognition by the AVC. Some simple option selec-
tion techniques include selecting the top NumOpt number
of options and selecting options whose absolute (or rela-
tive) score is greater than an absolute threshold AbsThr (or
a relative threshold RelThr). These different option selec-
tion techniques can be expressed as:
$$O(s) = \{c \in C \mid \exists x : V^c_x(s) = c \wedge x \leq NumOpt\} \quad (11)$$

$$O(s) = \{c \in C \mid \exists x : V^c_x(s) = c \wedge V^p_x(s) > AbsThr\} \quad (12)$$

$$O(s) = \{c \in C \mid \exists x : V^c_x(s) = c \wedge V^p_x(s)/V^p_1(s) > RelThr\} \quad (13)$$
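The three selection rules above can be sketched as follows; `NumOpt`, `AbsThr` and `RelThr` follow the text, while the function names are our own illustration.

```python
# Minimal sketch of the option-selection rules of Eqs. (11)-(13):
# keep the top-N options, or those above an absolute or relative score.

def select_top_n(options, num_opt):
    """Eq. (11): keep the top NumOpt video options."""
    return [c for c, _ in options[:num_opt]]

def select_absolute(options, abs_thr):
    """Eq. (12): keep options whose score exceeds AbsThr."""
    return [c for c, p in options if p > abs_thr]

def select_relative(options, rel_thr):
    """Eq. (13): keep options whose score relative to the top exceeds RelThr."""
    top = options[0][1]
    return [c for c, p in options if p / top > rel_thr]

# One of the Figure 1 examples: all three rules keep "8" and "g" here.
opts = [("8", 1.00), ("g", 0.90), ("S", 0.78)]
print(select_top_n(opts, 2))        # ['8', 'g']
print(select_absolute(opts, 0.85))  # ['8', 'g']
print(select_relative(opts, 0.85))  # ['8', 'g']
```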
5 A/V Synchronization
A/V synchronization, which takes place in the combi-
nation stage, refers to the identification of the audio seg-
ment in the classroom video that corresponds to the hand-
written content. The factors that affect synchronization are
described below followed by a description of the synchro-
nization techniques.
5.1 Factors Affecting Synchronization
Video Timestamping Accuracy. The video time stamping
accuracy is a measure of how often the automatically gener-
ated video timestamps for the characters fall within a preset
time window around the manually generated video times-
tamps. The manually generated video timestamp refers to
the instant when the character is fully visible to an observer.
The timestamping accuracy depends on the quality of the
preprocessing/timestamping algorithms and tends to deteri-
orate due to shadows (excluding the occlusion) caused by
the instructor standing in front of the whiteboard.
A/V Recording Alignment Factor. The A/V recording
alignment factor is a measure of how well the audio and
video in the recording are aligned. It is inversely propor-
tional to the average time difference between when the char-
acter is first observed by the student and when the corre-
sponding audio is spoken by the instructor. This factor de-
pends on the time difference between the writing and the
speaking of the characters and the amount of occlusion.
VTR Accuracy. The characters in the video that are in the
temporal vicinity of the character under investigation are
referred to as its neighbors. VTR accuracy, which deter-
mines the reliability of the neighbors, is very important as
the neighbors can be used to synchronize the video occur-
rence of a character to its audio occurrence.
ATR Accuracy. The ATR may cause false positives or false
negatives that increase the synchronization error. False pos-
itives, which are high for small search strings, refer to the
hits generated by the ATR when the character has not been
spoken. False negatives refer to the case when the ATR is
unable to find the correct audio occurrence of the character.
5.2 Synchronization Techniques
We now describe the set of audio features that form the
basis for the proposed synchronization techniques.
F1 - the audio match score generated by the ATR. The pruning threshold used for this feature is represented as $P_P$.
F2 - the automatic video timestamp minus the automatic audio timestamp. The audio pruning window used is represented as $[P_B, P_A]$, where $P_B$ and $P_A$ refer to the amount of time before (usually a negative value) and after the audio timestamp of the audio option under consideration.
F3 - the number of video neighbors from a window $[N_B, N_A]$ that are also found in an audio time window $[W_B, W_A]$, where $N_B$ and $N_A$ refer to the number of neighbors before and after the video option under consideration, and $W_B$ and $W_A$ refer to the amount of time before and after the audio timestamp of the audio option under consideration.
A/V Time Difference Based Technique. This technique
operates by first performing a simple pruning of the audio
options based on features F1 and F2, followed by a se-
lection of the audio option with the lowest value for F2.
Mathematically, this is represented as:
$$Y(s) = ((A_{j,k'}(s)) \mid k' = \underset{k}{\mathrm{argmin}}\,(F2(A_{j,k}(s))) \wedge F1(A_{j,k}(s)) > P_P \wedge P_B \leq F2(A_{j,k}(s)) \leq P_A) \quad (14)$$
Basic A/V Neighbor Based Technique. This technique,
like the previous one, performs a simple pruning followed
by a selection of the audio option with the highest value for
F3. This technique may be described mathematically as:
$$Y(s) = ((A_{j,k'}(s)) \mid k' = \underset{k}{\mathrm{argmax}}\,(F3(A_{j,k}(s))) \wedge F1(A_{j,k}(s)) > P_P \wedge P_B \leq F2(A_{j,k}(s)) \leq P_A) \quad (15)$$
Feature Rank Sum Based Technique. After an initial
pruning, feature ranks (R1, R2, R3) are assigned to each
audio feature based on the rank of the audio feature in the
pruned set. For instance, R1 = 1 for an audio option indicates that its feature F1 has the maximum value in the
pruned set. After rank assignment, the audio option with
the minimum value of rank sum, R1 +R2 +R3, is chosen
as the output. R2 and R3 may be used to break ties.
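The feature-rank-sum technique can be sketched as follows; pruning follows Eqs. (14)-(15), but the rank directions for F2 and the simple tie handling via list order are our assumptions, and all names are illustrative.

```python
# Illustrative sketch of feature-rank-sum synchronization: after pruning on
# F1 and F2, each surviving audio option is ranked per feature and the option
# with the smallest rank sum R1 + R2 + R3 wins.

def rank(values, reverse):
    """Rank positions (1 = best) for a list of feature values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks

def rank_sum_sync(options, p_p, p_b, p_a):
    """options: [(f1, f2, f3), ...] per audio option, where f1 is the ATR
    match score, f2 the video-minus-audio time difference, and f3 the
    neighbor count. Returns the index of the chosen option, or None."""
    kept = [(i, o) for i, o in enumerate(options)
            if o[0] > p_p and p_b <= o[1] <= p_a]            # prune on F1, F2
    if not kept:
        return None
    idx = [i for i, _ in kept]
    r1 = rank([o[0] for _, o in kept], reverse=True)    # high F1 is best
    r2 = rank([o[1] for _, o in kept], reverse=False)   # small F2 is best
    r3 = rank([o[2] for _, o in kept], reverse=True)    # many neighbors is best
    sums = [a + b + c for a, b, c in zip(r1, r2, r3)]
    return idx[sums.index(min(sums))]

# The third option is pruned on F1; of the survivors, the second has a better
# score, time difference and neighbor count, so it wins.
opts = [(0.6, 12.0, 1), (0.8, 2.0, 3), (0.2, 1.0, 4)]
print(rank_sum_sync(opts, p_p=0.5, p_b=-6, p_a=20))  # 1
```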
Table 1. Ambiguity detection techniques

Mapping | Thresholding        | Option Selection | System Accuracy (%)
No      | No                  | NumOpt=1         | 57.5
No      | No                  | NumOpt=all       | 53.9
No      | Simple (T1=0.90)    | NumOpt=3         | 60.4
No      | Simple (T1=0.90)    | RelThr=0.85      | 62.9
No      | Relative (T2=0.90)  | RelThr=0.90      | 64.1
Yes     | Simple (T1=0.90)    | RelThr=0.90      | 65.2
6 A/V Combination
After A/V synchronization, each video option has at
most one corresponding audio option. The A/V combina-
tion stage generates the final recognized output based on
the match scores generated by the two recognizers as well
as the computed audio features. Classifier combination
approaches have also been applied in several related con-
texts, for instance, to improve the recognition accuracy of
handwriting recognizers [11], speech recognizers [3] and a
combination of the two [6]. We have experimented with
several rank-level and measurement-level decision mak-
ing/combination techniques [9] discussed below.
6.1 Combination Techniques
Rank Based Techniques. We have used rank sum (Borda
count [9]), accompanied by a suitable tie-breaking strategy,
for the purpose of combination. One of the advantages of
using a rank based technique is that there is no need to com-
pute weights for the audio and video components or to nor-
malize the scores generated by different classifiers.
Classifier-Specific Weight Based Techniques. Apart from
simple sum rule, we have made use of a set of classifier-
specific weights [wV , wA] to combine the scores generated
by each of the recognizers. We have used the accuracy val-
ues of the two classifiers as their weights for combination.
$$Z(s) = (V^c_{j'}(s) \mid j' = \underset{j}{\mathrm{argmax}}\,(w_V \times V^p_j(s) + w_A \times A^p_{j,k'}(s)) \wedge V^c_j(s) \in O(s)) \quad (16)$$
Character-Specific Weight Based Techniques. As the
accuracy of each of the classifiers (in this case, a sin-
gle VTR and a single ATR) may be significantly differ-
ent for each character, the use of a different set of weights
[wV (c), wA(c)] for every character has the potential to fur-
ther improve the recognition accuracy.
$$Z(s) = (V^c_{j'}(s) \mid j' = \underset{j}{\mathrm{argmax}}\,(w_V(V^c_j(s)) \times V^p_j(s) + w_A(A^c_{j,k'}(s)) \times A^p_{j,k'}(s)) \wedge V^c_j(s) \in O(s)) \quad (17)$$
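The weighted-sum combination of Eqs. (16)-(17) can be sketched as below; the weights 0.55/0.45 are taken from Section 7, the character-weight lookup is our assumption for Eq. (17), and an audio score of 0.0 stands in for video options with no surviving audio option.

```python
# Hedged sketch of the weighted-sum A/V combination: each surviving video
# option is scored as w_V * video_score + w_A * audio_score of its
# synchronized audio option; character-specific weights are an optional lookup.

def combine(options, w_v=0.55, w_a=0.45, char_weights=None):
    """options: [(char, video_score, audio_score), ...] for the video options
    in O(s), each paired with the match score of its synchronized audio
    option (0.0 when no audio option survived pruning)."""
    def score(opt):
        c, vp, ap = opt
        wv, wa = char_weights.get(c, (w_v, w_a)) if char_weights else (w_v, w_a)
        return wv * vp + wa * ap
    return max(options, key=score)[0]

# The Figure 1 example: "9" has a lower video score than "q" but a much
# stronger audio match, so the combination flips the decision.
opts = [("q", 0.98, 0.08), ("9", 0.90, 0.59)]
print(combine(opts))  # 9
```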
7 Experimental Results
Implementation & Setup. While the system is not tied to a
specific character recognizer or speech recognizer, our cur-
rent implementation uses an extended version of the GNU
Optical Character Recognition or GOCR [4] tool and the
Nexidia word spotter [8]. Our video data set [2] has been
Table 2. Synchronization techniques

Technique  | Pruning [P_B, P_A] | Audio Neighbor [W_B, W_A] | Sync. Acc. (%) Good RA | Sync. Acc. (%) Poor RA
Time Diff. | [-1,5]  | -      | 86.4 | 23.1
Time Diff. | [-6,20] | -      | 77.2 | 41.0
Neighbor   | [-6,20] | [-4,4] | 63.6 | 28.2
Neighbor   | [-6,20] | [-8,8] | 65.9 | 43.5
Rank Sum   | [-1,5]  | [-8,8] | 86.3 | 23.1
Rank Sum   | [-6,20] | [-8,8] | 81.8 | 46.1

recorded in a classroom-like environment and consists of fairly large training and test datasets with more than 6000 characters from more than 100 videos. The character set includes uppercase and lowercase letters, numbers and basic algebraic operators.
Ambiguity Detection. Table 1 lists some combinations
of mapping, thresholding and option selection techniques
that were evaluated [10]. While the results are largely self-
explanatory, some things to note are as follows. The base-
line system (VTR) has an end-to-end character recognition
accuracy of 57.5% and this degrades to 53.9% if all of the
characters are passed through the A/V combination system
with all of the options. A simple thresholding technique
shows significant improvement in system accuracy (60.4%)
and the use of a relative threshold for option selection is
much better than using a fixed number of options. Also,
relative thresholding appears to give better performance for
this set of videos. Finally, we can see that the best system
accuracy (65.2%) is obtained when using the conditional
probability mapping technique described in Section 4.1.
Synchronization. Figure 2 plots the difference between the automatic video timestamp $TS^a_V$ and the manual video timestamp $TS^m_V$ against the character index (the character's position in an arrangement ordered by $TS^a_V$) for a single video file with good recording alignment (RA), using two different timestamping techniques, one good and the other poor in terms of timestamping accuracy. The poor timestamping
technique used a single binarization threshold during text
extraction and timestamping. As a result, it performs well
in the first half of the video (upper half of the whiteboard)
where there were no shadows caused by the instructor; but
in the second half of the video, due to the shadows (exclud-
ing occlusions) caused by the instructor, the single thresh-
old is insufficient. The good timestamping technique, on
the other hand, makes use of a more sophisticated multiple
threshold based text extraction technique and is therefore
able to perform well even in presence of shadows.
Figure 3 plots the difference between the manual video timestamp $TS^m_V$ and the manual audio timestamp $TS^m_A$ against the character index for two separate audio files (with good video timestamping), one with good RA and the other with poor RA. This time difference is significantly larger in the case of poor RA.
Figure 2. Timestamping: $TS^a_V - TS^m_V$ (sec) vs. character index, for poor and good timestamping.

The performance of the synchronization techniques proposed in this paper using videos with good and poor video RA is shown in Table 2. The videos were timestamped
using a good timestamping technique. For the purpose
of evaluation, we considered a character to be correctly
synchronized if $|TS^m_A - TS^a_A| \leq 2$, where $TS^a_A$ is the automatic audio timestamp determined after synchronization. First let us compare the synchronization accuracies of
the proposed techniques for a pruning window [PB , PA]=[-
6,20], number of neighbors considered [NB , NA]=[2,2] and
a neighbor audio window [WB ,WA]=[-8,8]. As expected,
the time difference based technique (77.2%) outperforms
the neighbor based technique (65.9%) for good RA, and
the neighbor based technique (43.5%) outperforms the time
difference based technique (41.0%) for poor RA while the
feature rank sum based technique outperforms the other
two techniques for both good RA (81.8%) and poor RA
(46.1%). We therefore utilize, for synchronization, the fea-
ture rank sum based technique in the classifier combination
experiments reported next. Also, for good RA and good
timestamping, when using techniques based on time dif-
ference or feature rank sum (which includes the time dif-
ference feature), a small pruning window $[P_B, P_A]$=[-1,5] may prove to be very beneficial (86.3%), but for poor RA and poor timestamping the accuracy greatly deteriorates (23.1%). For neighbor based techniques, note that
the window [WB ,WA] size is also crucial and the synchro-
nization accuracy deteriorates if this window is too small.
Combination. The A/V combination techniques described
in Section 6.1 were implemented and evaluated, and the re-
sults are tabulated in Table 3. For each of the A/V combina-
tion techniques, we perform ambiguity detection with a rel-
ative threshold of 0.9 for thresholding and a relative thresh-
old of 0.9 for option selection. Note that a significant part of
the improvement observed in the system accuracy can be at-
tributed to the ambiguity detection techniques that precede
A/V combination. The character-specific weighted sum
technique with the weights wV =0.55 and wA=0.45 shows
a 17.9% (relative) improvement compared to the VTR.
Table 3. Combination techniques

Technique                   | System Accuracy (%)
Video Only                  | 57.5
Rank Sum                    | 65.1
Classifier-Specific Weights | 67.4
Character-Specific Weights  | 67.8
Figure 3. Alignment: $|TS^m_V - TS^m_A|$ (sec) vs. character index, for poor and good A/V recording alignment.
8 Conclusions & Future Work
We have proposed, implemented and evaluated tech-
niques for ambiguity detection, A/V synchronization and
A/V combination. Going forward, we plan to extend these
by taking into account the penalties associated with incor-
rect ambiguity detection, improve the set of audio features
and implement an intelligent neighbor based synchroniza-
tion. As much of this research relies on the existence of
large labeled data sets, we plan to extend our data set with
videos recorded by different subjects. Finally, we intend
to explore the use of audio information to disambiguate the
output of the structure analysis component.
References
[1] R. H. Anderson. Two-dimensional mathematical notations.
Syntactic Pattern Recognition Applications (K.S. FU), 1977.
[2] Classroom Videos Dataset. http://users.ece.gatech.edu/~smita/dataset/.
[3] J. Fiscus. A post-processing system to yield reduced word er-
ror rates: Recognizer output voting error reduction (ROVER).
In ASRU, 1997.
[4] GOCR. http://jocr.sourceforge.net/.
[5] L.-W. He and Z. Zhang. Real-time whiteboard capture and
processing using a video camera for remote collaboration.
IEEE Transactions on Multimedia, 9(1), 2007.
[6] J. Hunsinger and M. Lang. A speech understanding mod-
ule for a multimodal mathematical formula editor. Proc. Int.
Conf. on Acoust., Speech, Signal Proc., 2000.
[7] A. Jain et al. Score normalization in multimodal biometric
systems. Pattern Recognition, 38(12), 2005.
[8] Nexidia. http://www.nexidia.com/.
[9] S. Tulyakov et al. Review of classifier combination methods.
In Studies in Computational Intelligence: Machine Learning
in Document Analysis and Recognition. 2008.
[10] S. Vemulapalli and M. Hayes. Ambiguity detection meth-
ods for improving handwritten mathematical character recog-
nition accuracy in classroom videos. Proc. 17th Int. Conf. on
DSP, 2011.
[11] W. Wang et al. Combination of multiple classifiers for hand-
written word recognition. IWFHR, 2002.
[12] M. Wienecke et al. Toward automatic video-based white-
board reading. IJDAR, 7(2–3), 2005.
[13] R. Zanibbi et al. Recognizing mathematical expressions us-
ing tree transformation. IEEE Trans. Pattern Analysis and
Machine Intelligence, 24(11), 2002.