Hidden Tree Markov Models for Document
Image Classification
Michelangelo Diligenti∗ Paolo Frasconi† Marco Gori∗
∗ Dept. of Information Engineering, Università di Siena, Italy
† Dept. of Systems and Computer Science, Università di Firenze, Italy
Abstract
Classification is an important problem in image document process-
ing and is often a preliminary step towards recognition, understanding,
and information extraction. In this paper, the problem is formulated in
the framework of concept learning and each category corresponds to the
set of image documents with similar physical structure. We propose a
solution based on two algorithmic ideas. First, we obtain a structured
representation of images based on labeled XY-trees (this representa-
tion informs the learner about important relationships between image
sub-constituents). Second, we propose a probabilistic architecture that
extends hidden Markov models for learning probability distributions de-
fined on spaces of labeled trees. Finally, a successful application of this
method to the categorization of commercial invoices is presented.
Keywords Document classification, Machine Learning, Markovian models,
Structured information.
1 Introduction
In spite of the explosive increase of electronic communication in recent years,
a significant amount of documents is still printed on paper. Image document
processing aims at automating operations normally carried out by individuals
on printed documents, including reading, understanding, extracting informa-
tion, and organizing documents according to given criteria. Operations such
as information extraction can be greatly simplified if some background knowl-
edge about physical layout is available. Layout classification is therefore an
important preliminary step for guiding the subsequent recognition task [6].
One important application is the automatic management of commercial invoice
forms. Here a classifier can be employed to determine which company issued
the invoice and/or which type of form was employed. Besides other obvious
applications (e.g. organizing invoices in separate folders), classification can
also help the construction of information extraction modules, since the pres-
ence, the position, and the type of various pieces of information significantly
varies with issuing company and with the specific form that was employed.
Digital libraries are also a target application domain. In this case, layout
classification can partition scanned pages according to useful criteria such as
presence of titles, advertisements, table-of-contents pages, etc. Again,
data organization or automatic metadata extraction can greatly benefit from
knowing the class of the document under analysis.
In this paper we formulate layout classification as a multiclass classifica-
tion problem. The choice of document representation is very important for this
learning problem. Flat representations (e.g., pixel-based) do not carry robust
information about the position and the number of basic constituents of the
image (such as logos, headers, text fields). More appropriate representations
in this domain involve either the extraction of global features, or the use of
recursive data structures. When global features are extracted, vector-based
machine learning algorithms can be employed for classification. For exam-
ple, the methods in [19, 20] collect several statistics from the bitmap image
and subsequently use oblique decision trees or self-organizing maps for clas-
sification. On the other hand, the use of recursive representations makes it
possible to preserve relationships among the image constituents. For example, in [6, 7]
documents are represented as structured patterns, while in [3] both the phys-
ical and the logical structure are used to recognize documents. The structured
representation suggested in this paper is obtained by extracting an XY-tree
from the document image [17]. Nodes are associated with specific boxes of
the document and edges represent a containment relation. For each box we
run several feature extractors, in order to get a real vector of attributes. The
resulting representation aims at capturing local features of constituents, along
with structural relationships among constituents, and is therefore less sensitive
to interclass variability.
A second important practical aspect in document classification is that many
systems must deal with a relatively large number of classes, and classes nor-
mally vary over time. Discriminant approaches (including multilayer neural
networks and support vector machines) can achieve very good accuracy on
many learning tasks. However, if the set of classes changes, the classifier must
be retrained from scratch. Generative probabilistic models avoid this problem
by learning a class-conditional distribution P(Y | C = c) for each class c. In
this way, given an arbitrary set of classes C (the realizations that the class variable
C can take on) and a generative model for each class c ∈ C, we can obtain the classifier

h(Y) = arg max_{c ∈ C} P(Y | C = c) P(C = c).
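As a concrete illustration of this decision rule, the following Python sketch (not taken from the paper) picks the class maximizing P(Y | C = c) P(C = c); the `log_likelihood` method is a hypothetical placeholder for whatever class-conditional model has been trained:

```python
import math

def classify(y, models, priors):
    """Assign y to arg max_c P(Y|C=c) P(C=c), working in log space.

    `models` maps each class c to an object exposing a (hypothetical)
    log_likelihood(y) method for that class-conditional model;
    `priors` maps each class c to P(C=c).
    """
    best_class, best_score = None, -math.inf
    for c, model in models.items():
        score = model.log_likelihood(y) + math.log(priors[c])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Note that, under this scheme, adding or removing a class only requires fitting or discarding the corresponding model; the remaining models are untouched.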
Based on the above considerations, we suggest a classification method that
(1) uses structured data representations and (2) is based on a generative model.
The combination of these two features requires a new machine learning tool,
since generative models are abundant in the literature for both attribute-value
and sequential representations, but not for hierarchical data structures. On
the other hand, compositional approaches to pattern recognition have been
mainly proposed in conjunction with early structural grammars [10] (lacking
sound methods for training) or discriminant models, such as recursive neural
networks [9].
The generative classifier we propose in this paper is based on a gener-
alization of hidden Markov models (HMM), a powerful tool for probabilis-
tic sequence modeling. The recent view of the HMM as a particular case of
Bayesian networks [21, 2, 15] has provided new theoretical insights and helped
in conceiving extensions of the standard model in a sound and formally elegant
framework. Whereas standard HMMs are commonly employed for learning in
sequential domains, our extension can learn probability distributions on labeled
trees. We call the novel architecture Hidden Tree Markov Model (HTMM)1.
This extension was first suggested in [9] as one of the possible architectures
1 The model should not be confused with Hidden Markov Decision Trees [14], a graphical
model for learning in sequential domains using a tree structured architecture.
for connectionist learning of data structures. In this paper, we develop the
architectural details not included in [9] and we present a solution to a real
commercial invoices classification problem.
2 Hidden tree-Markov models
We now develop a general theoretical framework for modeling probability dis-
tributions defined over spaces of trees. In the following we assume that each
instance, denoted Y, is a labeled m-ary tree. Each node, v, is labeled by a
K-tuple of (discrete or continuous) random variables, denoted Y(v).
2.1 Model definition
Compositionality in a generative stochastic model can be achieved using probabilistic independence. For each tree Y, we can write P(Y) = P(Y(r), Y_1, …, Y_m), where Y(r) is the label at the root node r and the Y_i are random sub-trees. We say that the first-order tree-Markov property is satisfied for Y if

P(Y_i | Y_j, Y(r)) = P(Y_i | Y(r))   ∀ i, j = 1, …, m.   (1)

From this property one can prove by induction that

P(Y) = P(Y(r)) ∏_{v ∈ V\{r}} P(Y(v) | Y(pa[v])),   (2)
where pa[v] denotes the parent of vertex v and V denotes the set of vertices of
Y. Unfortunately, such a compositional formulation of a model has two main
disadvantages. First, the conditional independence assumption (1) is very
unrealistic and does not allow the model to capture correlations that might
exist between any two non adjacent nodes. Second, the parameterization of
P (Y (v)|Y (pa[v])) might be problematic since the number of parameters grows
exponentially with the number of attributes in each label. As in HMMs for
sequences, we thus introduce hidden state variables in charge of “storing” and
“summarizing” distant information. More precisely, let X = {x1, . . . , xn} be a
finite set of states. We assume that each tree Y is generated by an underlying
hidden tree X, a data structure defined as follows. The skeleton of X is
identical to the skeleton of Y. Nodes of X are labeled by hidden state variables,
denoted X(v), taking realizations on X. In this way, P(Y) = Σ_X P(Y, X),
where the sum over X indicates the marginalization over all the hidden trees.
The first-order tree-Markov property in the case of hidden tree models
holds if the two following conditions are met: (1) the first-order tree-Markov
property must hold for the hidden tree X, and (2) ∀v the observation Y (v)
is independent of the rest given X(v). These conditions imply the following
global factorization formula:

P(Y, X) = P(X(r)) P(Y(r) | X(r)) ∏_{v ∈ V\{r}} P(Y(v) | X(v)) P(X(v) | X(pa[v]))   (3)

which can be graphically represented by the belief network shown in Figure 1. The associated stochastic model will be called the hidden tree-Markov model (HTMM).
Figure 1: Belief network for a hidden tree-Markov model. White nodes belong
to the underlying hidden tree X; shaded nodes are observations.
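To make Eq. (3) concrete, here is a small Python sketch (not from the paper) that evaluates the joint probability of a tree whose hidden states are known, under a fully stationary parameterization; the dictionary-based node and parameter representations are illustrative assumptions:

```python
import math

def log_joint(root, prior, transition, emission):
    """Log of Eq. (3): P(Y, X) for a tree whose hidden states are given.

    Each node is assumed to be a dict {'x': state, 'y': label, 'children': [...]};
    prior[x] is P(X(r)=x), transition[(xp, x)] is P(X(v)=x | X(pa[v])=xp),
    and emission[(x, y)] is P(Y(v)=y | X(v)=x) (fully stationary sharing).
    """
    # Root contributes the state prior and its emission term.
    logp = math.log(prior[root['x']]) + math.log(emission[(root['x'], root['y'])])
    stack = [(root, child) for child in root['children']]
    while stack:
        parent, v = stack.pop()
        # Every non-root node contributes one transition and one emission term.
        logp += math.log(transition[(parent['x'], v['x'])])
        logp += math.log(emission[(v['x'], v['y'])])
        stack.extend((v, child) for child in v['children'])
    return logp
```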
2.2 Parameterization
According to Eq. (3) we see that parameters for the HTMM are P (X(r)), a
prior on the root hidden state, P (Y (v)|X(v)), the emission parameters, and
P (X(v)|X(pa[v])), the transition parameters. Unless sharing mechanisms are
defined, these parameters vary with the node being considered, yielding a large
total number of parameters that may quickly lead to overfitting problems. In
HTMM, several forms of stationarity can be assumed, each associated with a
different form of parameter sharing. We say that an HTMM is fully stationary
if it is both transition and emission stationary, i.e. neither P(X(v) | X(pa[v]))
nor P(Y(v) | X(v)) depends on v. In this case the total number of parameters
is n(1 + n + K), where K is the number of parameters in each emission model
P(Y | X = x). We say that the model is locally stationary if it is emission
stationary and P(X(v) | X(pa[v])) = P(X(w) | X(pa[w])) whenever v and w are
both the i-th child of their parent in the tree. This assumption yields
n(1 + nm + K) parameters. Finally, we say the model is level stationary
if P(X(v) | X(pa[v])) = P(X(w) | X(pa[w])) and P(Y(v) | X(v)) = P(Y(w) | X(w))
whenever v and w belong to the same level of the tree. In this case the number
of parameters is n + Ln(n + K), where L is the maximum allowed height of a
tree.
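The three sharing schemes can be summarized by a small helper (a sketch, using exactly the counts stated above):

```python
def htmm_parameter_count(n, m, K, L, stationarity="full"):
    """Number of parameters under the sharing schemes of Section 2.2.

    n: number of hidden states, m: maximum out-degree, K: parameters per
    emission model, L: maximum allowed tree height.
    """
    if stationarity == "full":      # n(1 + n + K)
        return n * (1 + n + K)
    if stationarity == "local":     # n(1 + nm + K)
        return n * (1 + n * m + K)
    if stationarity == "level":     # n + Ln(n + K)
        return n + L * n * (n + K)
    raise ValueError("unknown stationarity assumption")
```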
2.3 Inference and Learning
Since the model is a special case of Bayesian network, the two main algorithms
(inference and parameter estimation) can be derived as special cases of corre-
sponding algorithms for Bayesian networks. Inference consists of computing all
the conditional probabilities of hidden states, given the evidence entered into
the observation nodes (i.e. the labels of the tree). Since the network is singly
connected, the inference problem can be solved either by π-λ propagation [18]
or by the more general junction tree algorithm [12]. In our experiments we
implemented a specialized version of the junction tree algorithm. Given the
special structure of the HTMM network (see Figure 1), no moralization is re-
quired and cliques forming the junction tree are of two kinds: one containing
X(v) and X(pa[v]), the other one containing Y (v) and X(v). In both cases,
only two variables are contained in a clique. It should be remarked that P (Y)
is simply obtained as the normalization factor in any clique after propagation
in the junction tree. Inference in HTMMs is very efficient and runs in time
proportional to the number of nodes in Y and to the number of states in X .
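Our experiments used a specialized junction tree implementation; as an illustrative stand-in, the following sketch computes P(Y) for a fully stationary HTMM by the equivalent upward recursion over the tree (in practice one would work in log space or rescale to avoid underflow). The array layout and node representation are our own assumptions, not the paper's code:

```python
import numpy as np

def tree_likelihood(root, prior, A, emission_prob):
    """P(Y) for a fully stationary HTMM via an upward recursion.

    prior: length-n array with prior[i] = P(X(r)=x_i)
    A:     n x n array with A[i, j] = P(X(v)=x_j | X(pa[v])=x_i)
    emission_prob(y): length-n array with P(Y(v)=y | X(v)=x_i)
    Each node is a dict {'y': label, 'children': [...]}.
    """
    def beta(v):
        # beta[i] = P(observations in the subtree rooted at v | X(v)=x_i)
        b = emission_prob(v['y']).copy()
        for child in v['children']:
            b *= A @ beta(child)   # marginalize the child's hidden state
        return b
    return float(prior @ beta(root))
```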
Each model λi is trained using examples belonging to class ci. Learning is
formulated in the maximum likelihood framework and the presence of hidden
variables requires an iterative algorithm for solving the optimization problem.
In our implementation we have chosen Expectation-Maximization (EM), a
general family of algorithms for maximum likelihood estimation in problems
with missing data [5]. In the case of Bayesian networks (and thus also in
the case of HTMMs), it can be shown that the E-step reduces to computing
expected sufficient statistics for the parameters (i.e., expected transition and
emission counts), and the M-step simply consists of updating parameters with
the normalized expected sufficient statistics (see, e.g., [11] for details).

Figure 2: An invoice instance for each issuing company is shown. Invoices are
highly structured patterns, composed of both graphical and textual fields.
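For a fully stationary HTMM with discrete emissions, the M-step just described amounts to normalizing the expected counts accumulated during the E-step; a minimal sketch (the array names and shapes are our own convention):

```python
import numpy as np

def m_step(expected_trans, expected_emit, expected_root):
    """M-step for a fully stationary HTMM: normalize expected counts.

    expected_trans[i, j]: expected number of parent-child transitions x_i -> x_j
    expected_emit[i, k]:  expected number of emissions of symbol k from state x_i
    expected_root[i]:     expected number of times x_i is the root state
    (all accumulated over the training trees during the E-step).
    """
    A = expected_trans / expected_trans.sum(axis=1, keepdims=True)
    B = expected_emit / expected_emit.sum(axis=1, keepdims=True)
    prior = expected_root / expected_root.sum()
    return prior, A, B
```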
3 Dataset preparation
3.1 Dataset
We chose commercial invoices as a test bed for document classification. We
collected 889 commercial invoices, issued by 9 different companies. The class
of an invoice corresponds to the issuing company. The number of available
instances of different classes in our dataset was rather unbalanced, ranging
between 46 and 191. Some instances are shown in Figure 2. Since the dataset
contains classified information, images in Figure 2 have been edited and critical
fields (such as logos, addresses, and other elements that could lead to the
identification of the involved companies) have been covered by black strips.
It can be seen that invoices are complex patterns, composed of
many semantic fields separated either by lines or by white spaces. Although
the selected invoices contain both graphical and textual information, we made
no attempt to extract text with optical character recognition systems. The
purpose of our experiments is to demonstrate that document images can be
discriminated using layout information only.
3.2 XY-trees extraction
The XY-tree representation is a well known approach for describing the phys-
ical layouts of documents that can preserve relationships among single con-
stituents [17, 16]. The root of an XY-tree is associated with the whole doc-
ument image. The document is then recursively split into regions that are
separated by cut lines. Horizontal and vertical white spaces are alternatively
used for cutting, so regions at even levels in the tree are cut along the X
axis while regions at odd levels are cut along the Y axis. Since the classic
space-based version of the algorithm cannot properly handle lines across the
formatted area, our implementation allows both space and line cuts [4]. Line-
based splits are rejected if the cut line is shorter than a given fraction of the
region size. Moreover, lines always have priority over spaces.
A region is not further divided if neither a horizontal nor a vertical white space
can be found, or if the area of a region is smaller than a given resolution. In our
experiments this resolution was set to 1% of the total document area, yielding
an average tree size of 28 nodes on our dataset. Children are ordered left-to-
right for vertical cuts and top-to-bottom for horizontal cuts. Since the XY-tree
extraction can be strongly affected by salt-and-pepper noise, document images
were preprocessed in order to remove small connected components.
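For illustration, here is a much simplified sketch of the recursive XY cut (white-space cuts only, so it omits the line cuts and the priority rule described above; the image and box conventions are our own assumptions):

```python
import numpy as np

def blank_runs(profile):
    """Return (start, end) pairs of maximal runs where the ink profile is zero."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def xy_cut(img, box=None, horizontal=True, min_area_frac=0.01):
    """Simplified recursive XY cut: split a region at internal white-space runs,
    alternating the cut direction at each level, until regions become too small.

    img: 2-D numpy array with 1 = ink, 0 = background; box: (x0, y0, x1, y1).
    Returns a nested dict {'box': ..., 'children': [...]}; children are ordered
    top-to-bottom or left-to-right depending on the cut direction.
    """
    h, w = img.shape
    if box is None:
        box = (0, 0, w, h)
    x0, y0, x1, y1 = box
    node = {'box': box, 'children': []}
    if (x1 - x0) * (y1 - y0) < min_area_frac * w * h:
        return node                      # below the resolution threshold
    region = img[y0:y1, x0:x1]
    profile = region.sum(axis=1) if horizontal else region.sum(axis=0)
    # cut in the middle of every white-space run that does not touch the border
    cuts = [(s + e) // 2 for s, e in blank_runs(profile) if s > 0 and e < len(profile)]
    if not cuts:
        return node                      # no internal white space (the paper would also try line cuts)
    bounds = [0] + cuts + [len(profile)]
    for a, b in zip(bounds[:-1], bounds[1:]):
        child = (x0, y0 + a, x1, y0 + b) if horizontal else (x0 + a, y0, x0 + b, y1)
        node['children'].append(xy_cut(img, child, not horizontal, min_area_frac))
    return node
```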
3.3 Local feature extraction
The shape of the XY-tree reflects the document layout, but it may not be
discriminant enough for classification. A more informative representation can be
obtained by labeling each node with a set of features (or attributes) extracted
from the region associated with the node. For each region of the document we
used a vector of six features:
1. a flag indicating whether the parent was cut using a line or a space;
2. the four coordinates of the bounding box (in normalized document space
coordinates);
3. the average grey level of the region.
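A sketch of the corresponding label for a node (our own variable names; the paper does not specify the exact normalization):

```python
import numpy as np

def region_features(img, box, cut_by_line):
    """Six-component label for an XY-tree node, as described above.

    img: grey-level page image as a 2-D numpy array;
    box: (x0, y0, x1, y1) region in pixel coordinates;
    cut_by_line: whether the parent region was split by a line (vs. a space).
    Bounding-box coordinates are normalized by the page size; the last
    component is the mean grey level of the region.
    """
    h, w = img.shape
    x0, y0, x1, y1 = box
    mean_grey = float(img[y0:y1, x0:x1].mean())
    return [float(cut_by_line), x0 / w, y0 / h, x1 / w, y1 / h, mean_grey]
```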
Real variables were quantized by running the maximum entropy discretization
algorithm described in [8], obtaining discrete attributes with 10 possible real-
izations each. Note that since the discretization algorithm collects statistics
over the available data, it must be run using only the training instances of each experiment.
4 Implementation and results
We compared HTMMs to the document decision trees (DDT) algorithm de-
scribed in [1]. DDT is an instance-based learner that extends the hierarchical
classifier proposed by Dengel [6, 7]. DDT stores training documents (repre-
sented as XY-trees) into a tree data structure built to perform an efficient
classification. Each leaf of a DDT contains a training example, whereas each
internal node contains the XY-subtrees shared by all the training examples
that are dominated by that node. As in traditional decision trees, classification
is carried out by descending the tree top-down and performing tests at each
internal node. Tests in this case consist of computing similarities between the
input XY-tree and the XY-subtrees contained in the children of the internal
node. Once a leaf is reached, the learner outputs the class of the training
example stored into the leaf.
We used one HTMM, λ_k, for each class, and each HTMM had 11 states
x_1, …, x_11. A document was classified as belonging to class k if P(Y|λ_k) >
P(Y|λ_j) for all j ≠ k. Transitions were restricted by a left-to-right topology, i.e.
we forced P(X(v) = x_i | X(pa[v]) = x_j) = 0 if |i − j| > d. In our experiments
we set d = 2. This helps reduce the number of parameters to
estimate. Moreover, we used a fully stationary model (see Section 2) as a form
of parameter sharing.
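This topology constraint can be realized by zeroing the forbidden entries of the transition matrix before training; a sketch with the values stated above (since the EM updates are multiplicative, entries initialized to zero remain zero throughout training):

```python
import numpy as np

def banded_transition_matrix(n=11, d=2, rng=None):
    """Transition matrix with P(x_i -> x_j) = 0 whenever |i - j| > d.

    n = 11 states and band width d = 2 as in the experiments above; the
    nonzero entries are random placeholders to be re-estimated by EM.
    """
    rng = np.random.default_rng() if rng is None else rng
    A = rng.random((n, n))
    i, j = np.indices((n, n))
    A[np.abs(i - j) > d] = 0.0            # forbid long-range transitions
    A /= A.sum(axis=1, keepdims=True)     # renormalize each row
    return A
```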
One practical difficulty for estimating classification accuracy is that datasets
in the domain of commercial invoices are typically small. Moreover, we are in-
terested in evaluating how accuracy decreases with the number of available
training examples. In the case of small datasets, well known approaches for
obtaining good estimates of classification accuracy are leave-one-out and k-
fold cross validation. In our experiments, in order to take advantage of all the
available instances and, at the same time, to evaluate accuracy as a function
of the training set size, we opted for a modified version of the cross validation
algorithm. The dataset was divided into M groups of instances and for each
N < M we trained the classifier on N groups and tested generalization on
the remaining M −N groups. These operations were iterated with N varying
between 1 and M − 1 and each time all the (M choose N) combinations of groups were
used to carry out an experiment on a new training and test set. In this way
the total number of single training/test experiments is 2^M − 2.
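A sketch of this scheme (the `evaluate` callback is a placeholder for training on the selected groups and measuring test accuracy):

```python
from itertools import combinations

def modified_cross_validation(groups, evaluate):
    """Enumerate all 2^M - 2 training/test splits described above.

    groups: list of M disjoint lists of instances;
    evaluate(train, test): hypothetical callback returning an accuracy.
    For every N in 1..M-1 and every choice of N training groups, the
    remaining M - N groups form the test set.
    """
    M = len(groups)
    results = []   # (N, accuracy) pairs, one per experiment
    for N in range(1, M):
        for train_idx in combinations(range(M), N):
            train = [x for i in train_idx for x in groups[i]]
            test = [x for i in range(M) if i not in train_idx for x in groups[i]]
            results.append((N, evaluate(train, test)))
    return results
```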
Figure 3: Results of the cross-validation technique when the available examples
are divided into 7 and 10 sets (panels (a) "7-fold" and (b) "10-fold", respectively).
Each panel plots classification accuracy against the fraction of examples in the
training set, for both HTMM and DDT.
Two experimental sessions were performed. In the first one, we set M = 7
while in the second one, we set M = 10. The results of the first session
are shown in Figure 3a and the results of the second one are shown in Fig-
ure 3b. HTMMs perform very well on the classification task: using 800 exam-
ples (about 90 examples per class) as a training set, the recognition rate
was as high as 99.28% (a 34% relative error reduction compared to the DDT
algorithm). Accuracy does not rapidly decrease when a smaller number of
examples are used to train the models. The 10-fold results show that when
only 20% of the examples are used to train the models (and the remaining
80% of examples are included in the test set), the accuracy is 95%. DDTs
do not perform as well as HTMMs when the training set is large and it can
cover the statistical variability of the patterns: if 800 examples are inserted
into the DDT, then the correct classification rate is equal to 98.26%. On the
other hand, when the training set is small, DDT classification capabilities out-
perform HTMM ones. DDTs can perform well even if the training set is small
because they can focus their attention on small portions of the input patterns
without trying to model the whole pattern. On the other hand, HTMMs take
into account the entire input pattern. This turns out to be advantageous if
enough training data is provided, but may lead to overfitting if the training set
is too small.

Figure 4: State activation maps for a trained HTMM. Maps are computed on
the document image shown in the upper-left corner; the remaining panels show
the activation map of each hidden state.
4.1 Analysis of hidden states
In the case of sequence modeling, HMM states often have an explicit inter-
pretation in terms of temporal events and the Viterbi algorithm can be used
to “decode” these events by computing the most likely state sequence, given
the observed data. In HTMMs, a similar decoding algorithm can be easily ob-
tained by running the max-calibration procedure [13] after inserting evidence
on the observed variables Y(v). This procedure yields the most likely state
configuration given the observed XY-tree. By observing the decoded state
trees, we noted that several states tend to activate on specific types of image
regions. Thus max-calibration can be helpful for performing image segmentation.

Figure 5: Averaged state activation maps. The document image shown on
the left is one instance of the class associated with the model. Activation maps
are averaged over the test set. Arrows between states indicate the most likely
state transitions.
In Figure 4 we show the state activation maps for a particular invoice image.
The activation map for state x_i is a two-dimensional image superimposed on the
document image that reveals on which regions state x_i is active. In this context
a region R is the rectangular box associated with an XY-tree leaf u. By definition of
XY-tree, R is dominated by all the nodes in the path from the root to u. Maps
were then obtained as follows. First, we ran the max-calibration algorithm,
obtaining for each node v the most likely state x*(v). Then, for each state x_i
and for each region R, the activity of x_i in R was defined as

a(x_i, R) = |{v : x*(v) = x_i and v dominates R}|.
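A sketch of this computation, assuming each node already stores the state x*(v) assigned by max-calibration (node fields and return format are our own conventions):

```python
def activation_maps(root, n_states):
    """Compute a(x_i, R) = |{v : x*(v) = x_i and v dominates R}| for every leaf region R.

    Each node is assumed to be a dict with fields 'x_star' (most likely state),
    'box' (region), and 'children'. Returns a list of (box, counts) pairs, one per
    leaf, where counts[i] is the activity of state x_i on that leaf's region.
    """
    maps = []
    def visit(v, path_counts):
        counts = list(path_counts)        # fresh copy: counts along the root-to-v path
        counts[v['x_star']] += 1          # v dominates its own region and everything below
        if not v['children']:
            maps.append((v['box'], counts))
        for child in v['children']:
            visit(child, counts)
    visit(root, [0] * n_states)
    return maps
```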
By inspecting Figure 4 we note how several states specialize to activate on
specific invoice fields. For example, state 9 is mostly active on the region
containing the customer address, state 2 on the invoice header (logo and in-
formation about the issuing company), states 10 and 11 on the purchases list,
state 3 on some bottom boxes containing subtotals, and so on. So, besides
classification, HTMMs can be used for image segmentation and automatic re-
gion labeling. Region labeling may be useful to guide subsequent information
extraction algorithms.
In Figure 5 we show a set of activation maps averaged over all the test-set
invoices of a given class. Again, it can be noted how certain states specialize
on specific invoice fields. In this example we see that state 2 corresponds to
the purchases list. The state activation decreases while descending the sub-region
of its competence, reflecting the fact that the probability of having purchased at least
n items decreases with n, whereas at least one item is always bought. The
maps also provide interesting insights about how classification is performed.
Arrows in the diagram indicate the highest state transition probabilities after
learning. These transitions can be interpreted as a sort of grammar indicating
how a valid image should be parsed to show that it actually belongs to the
class being modeled.
5 Conclusions
In this paper, we have presented a new document categorization system in
which the documents are represented by XY-trees. The documents are classi-
fied by generalized hidden Markov models which are able to deal with labeled
trees. Some experiments have been carried out on a dataset of commercial
invoices with very promising results. We have shown that the model learns
a high-level semantic task (labeling single fields of the invoices) in order to
achieve the final document recognition. This is the result of combining the
XY-tree representation (which can effectively isolate and model the semantic
relationships among document constituents) with the HTMM capability of
learning hierarchical structure in the data.
References
[1] A. Appiani, F. Cesarini, A. Colla, M. Diligenti, M. Gori, S. Marinai, and
G. Soda, “Automatic document classification and indexing in high-volume
applications,” International Journal on Document Analysis and Recognition,
vol. 4, no. 2, pp. 69–83, 2002.
[2] Y. Bengio and P. Frasconi, “An input output HMM architecture,” in Advances
in Neural Information Processing Systems 7, (G. Tesauro, D. Touretzky, and
T. Leen, eds.), pp. 427–434, The MIT Press, 1995.
[3] R. Brugger, A. Zramdini, and R. Ingold, “Modeling documents for structure
recognition using generalized n-grams,” in Proceedings of ICDAR, 1997.
[4] F. Cesarini, M. Gori, S. Marinai, and G. Soda, “Structured document segmen-
tation and representation by the modified X-Y tree,” in Proceedings of the
International Conference on Document Analysis and Recognition, pp. 563–566,
1999.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from
incomplete data via the EM algorithm,” Journal of the Royal Statistical Society B,
vol. 39, pp. 1–38, 1977.
[6] A. Dengel, “Initial learning of document structure,” in Proceedings of ICDAR,
pp. 86–90, 1993.
[7] A. Dengel and F. Dubiel, “Clustering and classification of document structure:
A machine learning approach,” in Proceedings of ICDAR, pp. 587–591, 1995.
[8] U. M. Fayyad and K. B. Irani, “Multi-interval discretization of continuous-
valued attributes for classification learning,” in Proc. 13th Int. Joint Conf. on
Artificial Intelligence, pp. 1022–1027, Morgan Kaufmann, 1993.
[9] P. Frasconi, M. Gori, and A. Sperduti, “A general framework for adaptive
processing of data structures,” IEEE Trans. on Neural Networks, vol. 9, no. 5,
pp. 768–786, 1998.
[10] R. C. Gonzalez and M. G. Thomason, Syntactic Pattern Recognition. Reading,
Massachusetts: Addison-Wesley, 1978.
[11] D. Heckerman, “Bayesian networks for data mining,” Data Mining and Knowl-
edge Discovery, vol. 1, no. 1, pp. 79–120, 1997.
[12] F. V. Jensen, S. L. Lauritzen, and K. G. Olesen, “Bayesian updating in recursive
graphical models by local computations,” Computational Statistical Quarterly,
vol. 4, pp. 269–282, 1990.
[13] F. Jensen, An Introduction to Bayesian Networks. New York: Springer Verlag,
1996.
[14] M. I. Jordan, Z. Ghahramani, and L. K. Saul, “Hidden Markov decision trees,”
in Advances in Neural Information Processing Systems, (M. C. Mozer, M. I.
Jordan, and T. Petsche, eds.), p. 501, The MIT Press, 1997.
[15] H. Lucke, “Bayesian belief networks as a tool for stochastic parsing,” Speech
Communication, vol. 16, pp. 89–118, 1995.
[16] G. Nagy and M. Viswanathan, “Dual representation of segmented and technical
documents,” in Proceedings of the First International Conference on
Document Analysis and Recognition, pp. 141–151, 1991.
[17] G. Nagy and S. Seth, “Hierarchical representation of optically scanned docu-
ments,” in Proc. Int. Conf. on Pattern Recognition, pp. 347–349, 1984.
[18] J. Pearl, Probabilistic Reasoning in Intelligent Systems : Networks of Plausible
Inference. Morgan Kaufmann, 1988.
[19] C. Shin and D. Doermann, “Classification of document page images based on
visual similarity of layout structures,” in Proc. SPIE Document Recognition
and Retrieval VII 3967, pp. 182–190, 2000.
[20] C. Shin, D. Doermann, and A. Rosenfeld, “Classification of document pages
using structure-based features,” International Journal on Document Analysis
and Recognition, vol. 3, no. 4, pp. 232–247, 2001.
[21] P. Smyth, D. Heckerman, and M. I. Jordan, “Probabilistic independence net-
works for hidden Markov probability models,” Neural Computation, vol. 9, no. 2,
pp. 227–269, 1997.