
Chapter 2

Characterizing layout style using first order Gaussian graphs∗

“We must avoid here two complementary errors: on the one hand that the world has a unique, intrinsic, pre-existing structure awaiting our grasp; and on the other hand that the world is in utter chaos. The first error is that of the student who marvelled at how the astronomers could find out the true names of distant constellations. The second error is that of Lewis Carroll's Walrus who grouped shoes with ships and sealing wax, and cabbages with kings...”

–Reuben Abel, Man is the Measure, New York: Free Press, 1997

In many pattern classification problems the need arises to represent the structure of patterns within a class. Applications for which this is particularly true include character recognition [117, 54], occluded face recognition [3], and document type classification [20, 97, 30]. These problems are not easily modeled using feature–based statistical classifiers, because each pattern must be represented by a single, fixed–length feature vector, which fails to capture its inherent structure. In fact, most local structure information is lost, and patterns with the same global features, but different structure, cannot be distinguished.

Structural pattern recognition attempts to address this problem by describing patterns using grammars, knowledge bases, graphs, or other structural models [86, 74]. Such techniques typically use rigid models of structure within pattern instances to model each class.

A technique that uses structural models, while allowing statistical variation within the structure of a model, was introduced by Wong [117]. He proposes a random graph model in which vertices and edges are associated with discrete random variables taking values over the attribute domain of the graph. The use of discrete densities complicates the learning and classification processes for random graphs. Graph matching is a common tool in structural pattern recognition [18], but any matching procedure for random graphs must take statistical variability into account. Entropy, or the increment in entropy caused by combining two random graphs, is typically used as a distance metric. When computing the entropy of a random graph based on discrete densities it is necessary to remember all pattern graphs used to train it. Also, some problems do not lend themselves to discrete modeling, such as when there is a limited number of training samples, or when it is desirable to learn a model from a minimal amount of training data.

∗Published in Pattern Recognition [8].


In order to alleviate some of the limitations imposed by the use of discrete densities we have developed an extension to Wong's first order random graphs that uses continuous Gaussian distributions to model the variability in random graphs. We call these First Order Gaussian Graphs (FOGGs). The adoption of a parametric model for the densities of each random graph element is shown to greatly improve the efficiency of entropy–based distance calculations. To test the effectiveness of first order Gaussian graphs as a classification tool we have applied them to a problem from the document analysis field, where structure is the key factor in making distinctions between document classes [7].

The rest of the paper is organized as follows. The next section introduces first order Gaussian graphs. Section 2.2 describes the clustering procedure used to learn a graphical model from a set of training samples. Section 2.3 details a series of experiments probing the effectiveness of first order Gaussian graphs as a classifier for a problem from document image analysis. Finally, a discussion of our results and indications of future directions are given in section 2.4.

2.1 Definitions and basic concepts

In this section we introduce first order Gaussian graphs. First we describe how individual pattern instances are represented, and then how first order Gaussian graphs can be used to model a set of such instances.

2.1.1 First order Gaussian graphs

A structural pattern in the recognition task consists of a set of primitive components and their structural relations. Patterns are modeled using attributed relational graphs (ARGs). An ARG is defined as follows:

Definition 1 An attributed relational graph, $G$, over $L = (L_v, L_e)$ is a 4-tuple $(V_G, E_G, m_v, m_e)$, where $V_G$ is a set of vertices, $E_G \subseteq V_G \times V_G$ is a set of edges, $m_v : V_G \longrightarrow L_v$ is the vertex interpretation function, and $m_e : E_G \longrightarrow L_e$ is the edge interpretation function.

In the above definition $L_v$ and $L_e$ are known respectively as the vertex attribute domain and the edge attribute domain. An ARG defined over suitable attribute domains can be used to describe the observed attributes of primitive components of a complex object, as well as attributed structural relationships between these primitives.

To represent a class of patterns we use a random graph. A random graph is essentially identical to an ARG, except that the vertex and edge interpretation functions do not take determined values, but vary randomly over the vertex and edge attribute domains according to some estimated density.


Definition 2 A random attributed relational graph, $R$, over $L = (L_v, L_e)$ is a 4-tuple $(V_R, E_R, \mu_v, \mu_e)$, where $V_R$ is a set of vertices, $E_R \subseteq V_R \times V_R$ is a set of edges, $\mu_v : V_R \longrightarrow \Pi$ is the vertex interpretation function, and $\mu_e : E_R \longrightarrow \Theta$ is the edge interpretation function. $\Pi = \{\pi_i \mid i \in \{1, \ldots, |V_R|\}\}$ and $\Theta = \{\theta_{ij} \mid i, j \in \{1, \ldots, |V_R|\}\}$ are sets of random variables taking values in $L_v$ and $L_e$ respectively.

An ARG obtained from a random graph by instantiating all vertices and edges is called an outcome graph. The joint probability distribution of all random elements induces a probability measure over the space of all outcome graphs. Estimation of this joint probability density, however, quickly becomes impractical for even moderately sized graphs, and we introduce the following simplifying assumptions:

1. The random vertices πi are mutually independent.

2. A random edge $\theta_{ij}$ is independent of all random vertices other than its endpoints $v_i$ and $v_j$.

3. Given values for each random vertex, the random edges $\theta_{ij}$ are mutually independent.

Throughout the rest of the paper we will use $R$ to represent an arbitrary random graph, and $G$ to represent an arbitrary ARG. Computing the probability that $G$ is generated by $R$ requires us to establish a common vertex labeling between the vertices of the two graphs. For the moment we assume that there is an arbitrary isomorphism, $\phi$, from $R$ into $G$ serving to “orient” the random graph to the ARG whose probability of outcome we wish to compute. This isomorphism establishes a common labeling between the nodes in $G$ and $R$, and consequently between the edges of the two graphs as well. Later we will address how to determine this isomorphism separately for training and classification.

Up to this point, our development is identical to that of Wong [117]. In the original presentation, and in subsequent work based on random graph classification [54, 3], discrete probability distributions were used to model all random vertices and edges. For many classification problems, however, it may be difficult or unclear how to discretize continuous features. Outliers may also unpredictably skew the range of the resulting discrete distributions if the feature space is not carefully discretized. Furthermore, if the feature space is sparsely sampled for training, the resulting discrete distributions may be highly unstable without resorting to histogram smoothing to blur the chosen bin boundaries. In such cases it is preferable to use a continuous, parametric model for learning the required densities. For large feature spaces, the adoption of a parametric model may yield considerable performance gains as well.

To address this need we use continuous random variables to model the random elements in our random graph model. We assume that each $\pi_i \sim N(\mu_{\pi_i}, \Sigma_{\pi_i})$, and that the joint density of each random edge and its endpoints is Gaussian as well. We call random graphs satisfying these conditions, in addition to the three first order conditions mentioned earlier, First Order Gaussian Graphs, or FOGGs.

Given an ARG $G = (V_G, E_G, m_v, m_e)$ and a FOGG $R = (V_R, E_R, \mu_v, \mu_e)$, the task is now to compute the probability that $G$ is an outcome graph of $R$. To simplify our notation we let $p_{v_i}$ denote the probability density function of $\mu_v(v_i)$ and $p_{e_{ij}}$ the density of $\mu_e(e_{ij})$.


Furthermore, let $\mathbf{v}_i = m_v(\phi(v_i))$ and $\mathbf{e}_{ij} = m_e(\phi(e_{ij}))$ denote the observed attributes for vertex $v_i$ and edge $e_{ij}$ respectively under isomorphism $\phi$. We define the probability that $R$ generates $G$ in terms of a vertex factor:

$$\mathcal{V}_R(G,\phi) = \prod_{v_i \in V_R} p_{v_i}(\mathbf{v}_i), \qquad (2.1)$$

and an edge factor:

$$\mathcal{E}_R(G,\phi) = \prod_{e_{ij} \in E_R} p_{e_{ij}}(\mathbf{e}_{ij} \mid \mathbf{v}_i, \mathbf{v}_j). \qquad (2.2)$$

The probability that $G$ is an outcome graph of $R$ is then given by:

$$P_R(G,\phi) = \mathcal{V}_R(G,\phi) \times \mathcal{E}_R(G,\phi). \qquad (2.3)$$

Applying Bayes rule, we rewrite (2.2) as:

$$\mathcal{E}_R(G,\phi) = \prod_{e_{ij} \in E_R} \frac{p_{e_{ij}}(\mathbf{v}_i, \mathbf{v}_j \mid \mathbf{e}_{ij})\, p_{e_{ij}}(\mathbf{e}_{ij})}{p_{v_i}(\mathbf{v}_i)\, p_{v_j}(\mathbf{v}_j)}, \qquad (2.4)$$

where we may write the denominator as the product of the two vertex probabilities due to the first order independence assumption. Letting $\delta_R(v_i)$ denote the degree of vertex $v_i$, we can rewrite equation (2.4) as:

$$\mathcal{E}_R(G,\phi) = \frac{\prod_{e_{ij} \in E_R} p_{e_{ij}}(\mathbf{v}_i, \mathbf{v}_j \mid \mathbf{e}_{ij})\, p_{e_{ij}}(\mathbf{e}_{ij})}{\prod_{v_i \in V_R} p_{v_i}(\mathbf{v}_i)^{\delta_R(v_i)}}. \qquad (2.5)$$

After substituting equations (2.1) and (2.5) into (2.3) and noting that the vertex probabilities in (2.1) cancel with the denominator of (2.5) we have:

$$P_R(G,\phi) = \frac{\prod_{e_{ij} \in E_R} p_{e_{ij}}(\mathbf{v}_i, \mathbf{v}_j \mid \mathbf{e}_{ij})\, p_{e_{ij}}(\mathbf{e}_{ij})}{\prod_{v_i \in V_R} p_{v_i}(\mathbf{v}_i)^{\delta_R(v_i)-1}}, \qquad (2.6)$$

and by substituting the joint density for the conditional above:

$$P_R(G,\phi) = \frac{\prod_{e_{ij} \in E_R} \frac{1}{p_{e_{ij}}(\mathbf{e}_{ij})}\, p_{e_{ij}}(\mathbf{v}_i, \mathbf{v}_j, \mathbf{e}_{ij})\, p_{e_{ij}}(\mathbf{e}_{ij})}{\prod_{v_i \in V_R} p_{v_i}(\mathbf{v}_i)^{\delta_R(v_i)-1}} = \frac{\prod_{e_{ij} \in E_R} p_{e_{ij}}(\mathbf{v}_i, \mathbf{v}_j, \mathbf{e}_{ij})}{\prod_{v_i \in V_R} p_{v_i}(\mathbf{v}_i)^{\delta_R(v_i)-1}}. \qquad (2.7)$$

Recalling that we assume each random vertex is Gaussian we write:

$$p_{v_i}(\mathbf{v}_i) = \frac{1}{(2\pi)^{\frac{d_1}{2}} |\Sigma_{v_i}|^{\frac{1}{2}}}\, e^{-\frac{1}{2}(\mathbf{v}_i - \mu_{v_i}) \Sigma_{v_i}^{-1} (\mathbf{v}_i - \mu_{v_i})^t}. \qquad (2.8)$$

Letting $p_{w_{ij}}$ denote the (Gaussian) joint probability of edge $e_{ij}$ and its endpoints $v_i$ and $v_j$, and denoting the concatenation of feature vectors $\mathbf{v}_i$, $\mathbf{v}_j$, and $\mathbf{e}_{ij}$ by $\mathbf{x}_{ij}$, we have:

$$p_{w_{ij}}(\mathbf{x}_{ij}) = \frac{1}{(2\pi)^{\frac{d_2}{2}} |\Sigma_{w_{ij}}|^{\frac{1}{2}}}\, e^{-\frac{1}{2}(\mathbf{x}_{ij} - \mu_{w_{ij}}) \Sigma_{w_{ij}}^{-1} (\mathbf{x}_{ij} - \mu_{w_{ij}})^t}. \qquad (2.9)$$


[Figure: a model graph R, two high probability outcomes, and a low probability outcome.]
Figure 2.1: The probability metric induced by $P_R(G, \phi)$ over the space of outcome graphs. The probability depends not only on structural variations, but also on deviation from the expected value of each corresponding random variable $\pi_i$ and $\theta_{ij}$. This is illustrated spatially; however, the vertex and edge attribute domains need not have a spatial interpretation.

Substituting these into (2.6) and taking the log, we arrive at:

$$\ln P_R(G,\phi) = \sum_{v_i \in V_R} (\delta(v_i) - 1)\left[\tfrac{1}{2}(\mathbf{v}_i - \mu_{v_i})\Sigma_{v_i}^{-1}(\mathbf{v}_i - \mu_{v_i})^t + \ln (2\pi)^{\frac{d_1}{2}}|\Sigma_{v_i}|^{\frac{1}{2}}\right] - \sum_{e_{ij} \in E_R}\left[\tfrac{1}{2}(\mathbf{x}_{ij} - \mu_{w_{ij}})\Sigma_{w_{ij}}^{-1}(\mathbf{x}_{ij} - \mu_{w_{ij}})^t + \ln (2\pi)^{\frac{d_2}{2}}|\Sigma_{w_{ij}}|^{\frac{1}{2}}\right]. \qquad (2.10)$$

This probability and the corresponding log-likelihood are central to the use of first order Gaussian graphs as classifiers. Note that we can promote an ARG to a FOGG by replacing each deterministic vertex and edge with a Gaussian centered on its measured attribute. The covariance matrix for each new random element is selected to satisfy some minimum criterion along the diagonal and can be determined heuristically based on the given problem. Figure 2.1 conceptually illustrates the probability metric induced by equation (2.10) over the space of outcome graphs.
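To make equation (2.10) concrete, the following sketch evaluates the log-likelihood of an ARG under a FOGG for a fixed orientation $\phi$. It is only an illustration: the `fogg` and `graph` accessors (`vertex_params`, `edge_joint_params`, `vertex_feature`, `edge_feature`, `degree`) are hypothetical interfaces assumed for the example, not part of the published method.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log of a multivariate Gaussian density evaluated at x (cf. eqs. 2.8-2.9)."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)   # Mahalanobis term
    return -0.5 * (maha + d * np.log(2.0 * np.pi) + logdet)

def log_outcome_probability(graph, fogg, phi):
    """ln P_R(G, phi) as in eq. (2.10): joint edge densities divided by
    vertex densities raised to (degree - 1)."""
    ll = 0.0
    for (i, j) in fogg.edges:                        # random edges theta_ij
        x_ij = np.concatenate([graph.vertex_feature(phi[i]),
                               graph.vertex_feature(phi[j]),
                               graph.edge_feature(phi[i], phi[j])])
        mean, cov = fogg.edge_joint_params(i, j)     # Gaussian over (v_i, v_j, e_ij)
        ll += log_gauss(x_ij, mean, cov)
    for i in fogg.vertices:                          # random vertices pi_i
        mean, cov = fogg.vertex_params(i)
        ll -= (fogg.degree(i) - 1) * log_gauss(graph.vertex_feature(phi[i]), mean, cov)
    return ll
```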

2.1.2 Technicalities

Before continuing with our development of the clustering and classification procedures for first order Gaussian graphs, we first address a few details that will simplify what follows. These details center primarily around the need to compare and combine two FOGGs during the clustering process.


Null extension

During clustering and classification the need will eventually arise to compare, and possibly combine, two FOGGs of different order. Let $R^1 = (V^1, E^1, \mu^1_v, \mu^1_e)$ and $R^2 = (V^2, E^2, \mu^2_v, \mu^2_e)$ be two first order Gaussian graphs with $n = |V^1|$ and $m = |V^2|$. Furthermore, let $V^1 = \{v_1, \ldots, v_n\}$ and $V^2 = \{u_1, \ldots, u_m\}$. Assume without loss of generality that $m < n$.

We will use the same technique as Wong [117] to extend $R^2$ by adding null vertices to $V^2$. Thus we redefine $V^2$ as:

$$V^2 = V^2 \cup \{u_{m+1}, \ldots, u_n\},$$

where the $u_{m+1}, \ldots, u_n$ are null vertices, i.e. they have no attribute values, but rather act as place holders so that $R^1$ and $R^2$ are of the same order.

Once $R^2$ has been extended in this fashion, both $R^1$ and $R^2$ may be extended to complete graphs through a similar addition of null edges to each graph until edges exist between all vertices. By adding these null graph elements we can treat $R^1$ and $R^2$ as structurally isomorphic, so that we are guaranteed that an isomorphism exists and we must only search for an optimal one.

Our probabilistic model must also be enriched to account for such structural modifications to random graphs. First, note that our densities $p_{v_i}(x)$ modeling the features of each random vertex are actually conditional probabilities:

$$p_{v_i}(x) = p_{v_i}(x \mid \phi(v_i) \text{ is non--null}).$$

After all, we can only update the feature distribution of a vertex, or indeed even evaluate it, when we have new information about an actual non–null outcome. To account for the possibility of a vertex or edge not being instantiated, we will additionally keep track of a priori probabilities of a random element generating a non–null outcome:

$$p_R(v_i) = \text{probability that } v_i \in V_R \text{ is non--null}, \qquad p_R(e_{ij}) = \text{probability that } e_{ij} \in E_R \text{ is non--null}. \qquad (2.11)$$

Thus, whenever we wish to evaluate the probability of a random element $\lambda$ taking the value $x$ we will use $p(\lambda)p_\lambda(x)$, which is intuitively the probability that $\lambda$ exists and takes the value $x$. Whenever a probability for a random vertex or edge must be evaluated on a null value, we will fall back to the prior probability of that element. This is done by optimistically assuming that the null element results from a detection failure, and that the missing feature is the expected value of the random element it is being matched with.

Through the use of such extensions, we can compare graphs with vertex sets of different sizes. For the remainder of the paper we will assume that such an extension has been performed, and that any two graphs under consideration are of the same order.
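As a rough illustration of how such outcomes might be scored in code, the helper below combines the prior $p(\lambda)$ with the Gaussian density $p_\lambda(x)$, falling back to the density evaluated at the element's own mean when the matched outcome is null. The handling of null outcomes is described only in prose above, so this is one plausible reading rather than the published procedure.

```python
import numpy as np
from scipy.stats import multivariate_normal

def element_log_score(prior, mean, cov, x=None):
    """ln[p(lambda) * p_lambda(x)] for one random vertex or edge.
    If the matched outcome is null (x is None), optimistically treat the null
    as a detection failure and evaluate the density at the element's mean."""
    if x is None:
        x = mean                       # missing feature = expected value
    return np.log(prior) + multivariate_normal.logpdf(x, mean=mean, cov=cov)
```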

Entropy of first order Gaussian graphs

In order to measure the quality of a model for a class we require some quantitative measure that characterizes the outcome variability of a first order Gaussian graph. As variability is statistically modeled, Shannon's entropy is well suited for this purpose [95].


We can write the entropy of a first order Gaussian graph as the sum of the contributions of the vertices and edges:

$$H(R) = H(V_R) + H(E_R). \qquad (2.12)$$

Because of the first order assumptions of independence, we can write the vertex and edge entropies as the sum of the entropy in each component. The entropy in each random vertex and edge may be written as the sum of the entropy contributed by the feature density and the prior:

$$H(v_i) = H(\mu_v(v_i)) + H(p(v_i)), \qquad H(e_{ij}) = H(\mu_e(e_{ij})) + H(p(e_{ij})). \qquad (2.13)$$

Equation (2.12) then becomes

$$H(R) = \sum_{v_i \in V_R} H(v_i) + \sum_{e_{ij} \in E_R} H(e_{ij}). \qquad (2.14)$$

For clustering we are primarily interested in the increment of entropy caused by conjoining two graphs. We denote by $R_{1 \oplus 2}(\phi)$ the graph resulting from conjoining the Gaussian random graphs $R_1$ and $R_2$ according to the isomorphism $\phi$ between $R_1$ and $R_2$. Assuming without loss of generality that $H(R_1) \le H(R_2)$, we can write the increment in entropy as:

$$\Delta H(R_1, R_2, \phi) = H(R_{1 \oplus 2}(\phi)) - H(R_1),$$

and then substitute the sum of the component entropies:

$$\Delta H(R_1, R_2, \phi) = \sum_{v_i \in V_{R_{1 \oplus 2}(\phi)}} H(\Delta\mu_v(v_i)) - \sum_{v_i \in V_{R_1}} H(\mu^1_v(v_i)) + \sum_{e_{ij} \in E_{R_{1 \oplus 2}(\phi)}} H(\Delta\mu_e(e_{ij})) - \sum_{e_{ij} \in E_{R_1}} H(\mu^1_e(e_{ij})). \qquad (2.15)$$

We use $\Delta\mu_\lambda(x)$ to denote the density of the random variable $\lambda$ updated to reflect the observations of the corresponding random variable as dictated by the isomorphism $\phi$.

From equations (2.13) and (2.14) we can express the increment in entropy as the sum of the increment in the feature density and the prior distribution. Since we are using Gaussian distributions to model each random element, the entropy of a Gaussian random variable $X$:

$$H(X) = \ln (2\pi e)^{\frac{N}{2}} |\Sigma_X|^{\frac{1}{2}}, \qquad (2.16)$$

will prove useful, as it allows us to compute component entropies directly from the parameters of the corresponding densities. The technique for estimating the parameters of the combined distribution is described next.
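Equation (2.16) translates directly into code; the small helper below is a sketch that computes $H(X)$ from the covariance matrix alone, using a log-determinant for numerical stability.

```python
import numpy as np

def gaussian_entropy(cov):
    """Entropy of an N-dimensional Gaussian, eq. (2.16):
    H(X) = ln[(2*pi*e)^(N/2) * |Sigma_X|^(1/2)]."""
    n = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (n * np.log(2.0 * np.pi * np.e) + logdet)
```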


Parameter estimation

Given two Gaussian random variables $X$ and $Y$, and samples $\{x_1, \ldots, x_n\}$ and $\{y_1, \ldots, y_m\}$ from each distribution, we estimate the Gaussian parameters in the normal fashion:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \Sigma_X = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i x_i^t - n\bar{x}\bar{x}^t\right),$$
$$\bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i, \qquad \Sigma_Y = \frac{1}{m-1}\left(\sum_{i=1}^{m} y_i y_i^t - m\bar{y}\bar{y}^t\right). \qquad (2.17)$$

Assuming that the samples from $X$ and $Y$ are generated by a single Gaussian $Z$, we can compute the Gaussian parameters for $Z$ directly from the estimates of $X$ and $Y$:

$$\bar{z} = \frac{1}{n+m}\left(n\bar{x} + m\bar{y}\right),$$
$$\Sigma_Z = \frac{1}{m+n-1}\left(\sum_{i=1}^{n} x_i x_i^t + \sum_{i=1}^{m} y_i y_i^t - (m+n)\bar{z}\bar{z}^t\right) = \frac{1}{m+n-1}\left((n-1)\Sigma_X + n\bar{x}\bar{x}^t + (m-1)\Sigma_Y + m\bar{y}\bar{y}^t - (m+n)\bar{z}\bar{z}^t\right). \qquad (2.18)$$

Equation (2.18) gives us a fast method for computing the entropy arising from combining two random vertices. It also allows us to compute the parameters of the new distribution without having to remember the samples that were used to estimate the original parameters. When there are too few observations to robustly estimate the covariance matrices, $\Sigma_X$ is chosen to reflect the inherent uncertainty in a single (or very few) observations. This also allows us to promote an ARG to a FOGG by setting each mean to the observed feature, and setting the covariance matrices to this minimal $\Sigma$.
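A sketch of equation (2.18): the parameters of the pooled Gaussian are computed from the two sets of sufficient statistics (mean, covariance, sample count) without revisiting the original samples. Combined with the entropy helper above, the increment in entropy from a merge follows directly from the updated covariance.

```python
import numpy as np

def merge_gaussians(mean_x, cov_x, n, mean_y, cov_y, m):
    """Parameters of the single Gaussian fit to the pooled samples, eq. (2.18)."""
    mean_z = (n * mean_x + m * mean_y) / (n + m)
    scatter = ((n - 1) * cov_x + n * np.outer(mean_x, mean_x) +
               (m - 1) * cov_y + m * np.outer(mean_y, mean_y))
    cov_z = (scatter - (n + m) * np.outer(mean_z, mean_z)) / (n + m - 1)
    return mean_z, cov_z, n + m
```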

2.1.3 Reflections

At this point it is useful to take a step back from the mathematical technicalities presented in the previous subsections and examine their practical importance. By replacing the original discrete random variables with continuous ones, we have eliminated the need to discretize our feature space. This, in conjunction with the adoption of a Gaussian model for each random variable, additionally minimizes the complexity of updating the estimated distribution and entropy of a random element.

Consider the process of conjoining two discrete distributions. In the worst case, every bin in the resulting distribution must be updated. The complexity of this procedure is proportional to the size of the quantized feature space. Computing the increase in entropy caused by joining two discrete distributions has the same complexity. Using Gaussian distributions, however, equations (2.17) and (2.18) allow us to compute the parameters of a new distribution, and equation (2.16) allows us to compute the increment in entropy directly from the parameters of the new distribution. This reduces the complexity to $O(d^2)$, where $d$ is the dimensionality of the feature space.


2.2 Clustering and classification

In this section we describe the technique for synthesizing a first order Gaussian graph to represent a set of input pattern ARGs. The approach uses hierarchical clustering of the input ARGs, which yields a clustering minimizing the entropy of the resulting FOGG(s). Entropy is useful in that it characterizes the intrinsic variability in the distribution of a first order Gaussian graph over the space of possible outcome graphs.

2.2.1 Hierarchical clustering of FOGGs

The concept of the increment in entropy introduced in section 2.1.2 can now be used to devise a clustering procedure for FOGGs. The first step is to derive a distance measure between first order Gaussian graphs that is based on the minimum increment in entropy. Using equation (2.15), the minimum increment of entropy for the merging of two FOGGs can be written:

$$\Delta H(R_1, R_2) = \min_\phi \{\Delta H(R_1, R_2, \phi)\}, \qquad (2.19)$$

where the minimization is taken over all possible isomorphisms $\phi$ between $R_1$ and $R_2$.

At last we have arrived at the need to establish an actual isomorphism between two graphs. Unfortunately this problem is NP-hard, and we must settle for an approximation to the optimal isomorphism. We choose to optimize only over the vertex entropy $H(V_R, \phi)$. This approximation is acceptable for problems where much of the structural information is present in the vertex observations. Edge probabilities are still used in the classification phase, so gross structural deviations will not result in misclassifications.

There are two ways in which the entropy of a vertex may be changed by conjoining it with a vertex in another FOGG. The feature density of the first vertex may be modified to accommodate the observations of the random vertex it is matched with according to $\phi$. Or, when $\phi$ maps $v_i$ to a null vertex, the entropy may be changed due to a decrease in its prior probability of being instantiated in an outcome graph.

Using equation (2.16) we may write the increment in vertex entropy due to the feature distribution as:

$$\Delta H_f(\mu_v(v_i), R_2, \phi) = \ln (2\pi e)^{\frac{N}{2}} \left(|\Sigma_{\Delta\mu(v_i)}|^{\frac{1}{2}} - |\Sigma_{\mu^1(v_i)}|^{\frac{1}{2}}\right). \qquad (2.20)$$

Equation (2.18) gives us a method for rapidly computing the covariance matrix for each random element in the new graph $R_{1 \oplus 2}$, and thus the increment in entropy.

For the increment in prior entropy, we first note that the prior probability $p(v_i)$ of each vertex will be of the form $n_i/N_i$, where $n_i$ is the number of times vertex $v_i$ was instantiated as a non–null vertex, and $N_i$ is the total number of training ARGs combined thus far in a particular cluster. We can then write the change in prior entropy as:

$$\Delta H_p(p(v_i), R_2, \phi) = -\left[p'(v_i)\ln p'(v_i) + (1 - p'(v_i))\ln(1 - p'(v_i))\right] + \left[p(v_i)\ln p(v_i) + (1 - p(v_i))\ln(1 - p(v_i))\right], \qquad (2.21)$$

where

$$p'(v_i) = \begin{cases} \dfrac{n_i + 1}{N_i + 1} & \text{if } \phi(v_i) \text{ is non--null,} \\[2mm] \dfrac{n_i}{N_i + 1} & \text{if } \phi(v_i) \text{ is null.} \end{cases} \qquad (2.22)$$
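A minimal sketch of equations (2.21) and (2.22), assuming the prior of a vertex is tracked as the pair $(n_i, N_i)$; the increment is expressed here as the difference of the two Bernoulli entropies, consistent with the reconstruction of (2.21) above.

```python
import numpy as np

def bernoulli_entropy(p):
    """Entropy of the prior p(v_i) that a vertex is instantiated (non-null)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def prior_entropy_increment(n_i, N_i, matched_non_null):
    """Delta H_p from eqs. (2.21)-(2.22) after conjoining one more graph."""
    p = n_i / N_i
    p_new = (n_i + 1) / (N_i + 1) if matched_non_null else n_i / (N_i + 1)
    return bernoulli_entropy(p_new) - bernoulli_entropy(p)
```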


We solve the optimization problem given by equation (2.19) using the maximum weight matching in a bipartite graph. Given two FOGGs $R_1$ and $R_2$ of order $n$, construct the complete bipartite graph $K_{n,n} = K_n \times K_n$. A matching in $K_{n,n}$ is a subset of edges such that the degree of each node in the resulting graph is exactly one. The maximum weight matching is the matching that maximizes the sum of the edge weights in the matching. By weighting each edge of the bipartite graph with

$$w_{ij} = -\Delta H(\mu_v(v_i), R_2, \phi),$$

with $\Delta H(\mu_v(v_i), R_2, \phi)$ as given in equation (2.20), and solving for the maximum weight matching, we obtain the isomorphism that minimizes the increment in vertex entropy. There exist efficient, polynomial time algorithms for solving the maximum weight matching problem in bipartite graphs [36]. The complexity is $O(n^3)$ using the Hungarian method, where $n$ is the number of vertices in the bipartite graph.
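In practice the maximum weight matching can be computed with an off-the-shelf assignment solver; the sketch below uses scipy's Hungarian-method implementation, treating the entropy increments $\Delta H$ as costs to be minimized (equivalently, weights $w_{ij} = -\Delta H$ to be maximized).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_vertex_matching(delta_H):
    """Approximate isomorphism phi minimizing the total vertex entropy increment.
    delta_H[i, j] is the increment from matching vertex i of R1 with vertex j of
    R2, both graphs already null-extended to the same order."""
    rows, cols = linear_sum_assignment(delta_H)   # Hungarian method, O(n^3)
    phi = dict(zip(rows, cols))                   # phi: vertices of R1 -> vertices of R2
    return phi, delta_H[rows, cols].sum()
```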

Now we may construct a hierarchical clustering algorithm for first order Gaussian graphs. For a set of input ARGs, we desire, on the one hand, a minimal set of first order Gaussian graphs that may be used to model the input ARGs. On the other hand, we also wish to minimize the resulting entropy of each FOGG by preventing unnatural combinations of FOGGs in the merging process.

Algorithm 1 Synthesize FOGG(s) from a set of ARGs
Input: $\mathcal{G} = \{G_1, \ldots, G_n\}$, a set of ARGs, and $h$, a maximum entropy threshold.
Output: $\mathcal{R} = \{R_1, \ldots, R_m\}$, a set of FOGGs representing $\mathcal{G}$.

  Initialize $\mathcal{R} = \mathcal{G}$, promoting each ARG to a FOGG (section 2.1.2).
  Compute $H = [h_{ij}]$, the $n \times n$ distance matrix, with $h_{ij} = \Delta H(R_i, R_j)$.
  Let $h_{kl} = \min h_{ij}$.
  while $|\mathcal{R}| > 1$ and $H(R_k) + h_{kl} < h$ do
    Form the new FOGG $R_{k \oplus l}$, add it to $\mathcal{R}$, and remove $R_k$ and $R_l$ from $\mathcal{R}$.
    Update the distance matrix $H$ to reflect the new and deleted FOGGs.
    Re-compute $h_{kl}$, the minimum entropy increment pair.
  end while

The algorithm should return a set of first order Gaussian graphs, $\mathcal{R}^i = \{R^i_1, \ldots, R^i_{m_i}\}$, that represent the original set of attributed graphs. We will call this set of random graphs the graphical model of the class of ARGs. An entropy threshold $h$ controls the amount of variability allowed in a single FOGG. This threshold parameter controls the tradeoff between the number of FOGGs used to represent a class and the amount of variability, i.e. entropy, allowed in any single FOGG in the graphical model. Algorithm 1 provides pseudocode for the hierarchical synthesis procedure. In the supervised case, the algorithm may be run on each set of samples from the pre-specified classes. For unsupervised clustering, the entire unlabeled set of samples may be clustered. The entropy threshold $h$ may be used to control the number of FOGGs used to represent the class by limiting the maximum entropy allowed in any single FOGG. Figure 2.2 graphically illustrates the learning process for first order Gaussian graphs.

Note that this clustering procedure requires, for the creation of the initial distance matrix alone, the computation of $n(n-1)$ isomorphisms, where $n$ is the number of input ARGs.


[Figure: a set of training graphs $G_i$, the hierarchical clustering step, and the resulting model graph(s) $R_i$.]
Figure 2.2: The learning process for first order Gaussian graphs. A set of sample graphs is synthesized into one or more random graphs which represent the class. The hierarchical clustering process described in Algorithm 1 chooses combinations and isomorphisms that minimize the increment in vertex entropy arising from combining two random graphs.

Subsequent updates of the distance matrix will demand a total of $O(n^2)$ additional isomorphism computations in the worst case. Each isomorphism additionally requires the computation of $m^2$ entropy increments as given in equation (2.20), where $m$ represents the number of vertices in the graphs being compared. Using the techniques derived in subsection 2.1.2 we can exploit the use of Gaussian distributions to greatly improve the efficiency of these computations. This, combined with the use of the bipartite matching approach as an approximation to finding the optimal isomorphism, enhances the overall efficiency of the clustering procedure.
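The loop of Algorithm 1 can be sketched as follows. `promote`, `entropy`, `entropy_increment` and `merge` are assumed helpers built from the pieces above (promotion of an ARG to a FOGG, eqs. 2.14 and 2.19, and the merge of section 2.1.2); for brevity, pairwise increments are recomputed at every pass instead of being cached in a distance matrix, which is simpler but less efficient than the algorithm as stated.

```python
from itertools import combinations

def synthesize_foggs(arg_list, h, promote, entropy, entropy_increment, merge):
    """Agglomerative synthesis of a graphical model from a set of ARGs (Algorithm 1)."""
    R = [promote(g) for g in arg_list]
    while len(R) > 1:
        # candidate pair with the smallest entropy increment
        k, l = min(combinations(range(len(R)), 2),
                   key=lambda kl: entropy_increment(R[kl[0]], R[kl[1]]))
        if entropy(R[k]) + entropy_increment(R[k], R[l]) >= h:
            break                                  # merging would exceed the threshold
        merged = merge(R[k], R[l])
        R = [r for idx, r in enumerate(R) if idx not in (k, l)] + [merged]
    return R
```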

2.2.2 Classification using first order Gaussian graphs

Given a set of graphical models for a number of classes, we construct a maximum likelihood estimator as follows. Let $\mathcal{R}^i = \{R^i_1, \ldots, R^i_{m_i}\}$ be the graphical model for class $i$. Our classifier, presented with an unknown ARG $G$, should return a class label $\omega$ from a set of known classes $\{\omega_1, \ldots, \omega_n\}$.


Algorithm 2 Classification with first order Gaussian graphs
Input: $G$, an unclassified ARG, and $\mathcal{R} = \{\mathcal{R}^1, \ldots, \mathcal{R}^n\}$, graphical models representing pattern classes $\omega_1, \ldots, \omega_n$, with each $\mathcal{R}^i = \{R^i_1, \ldots, R^i_{m_i}\}$ a set of FOGGs.
Output: $\omega_i$ for some $i \in \{1, \ldots, n\}$

  for all $\mathcal{R}^i \in \mathcal{R}$ do
    Set $P_i = 0$.
    for all $R^i_j \in \mathcal{R}^i$ do
      Compute the orientation $\phi$ of $R^i_j$ w.r.t. $G$ that maximizes $P_{R^i_j}(G, \phi)$ (figure 2.3).
      $P_i = \max\{P_{R^i_j}(G, \phi), P_i\}$
    end for
  end for
  $k = \arg\max_i(P_i)$
  return $\omega_k$

Figure 2.3: Computation of the suboptimal isomorphism $\phi$ for two first order Gaussian graphs. Each edge is weighted with $w_{ij}$, whose value is the increment in entropy caused by conjoining the estimated distributions of random vertices $v^1_i$ and $v^2_j$. The same technique will be used for determining an isomorphism for classification as well, but with $w_{ij}$ representing the probability that random vertex $v^1_i$ takes the value $\mathbf{v}^2_j$.

The maximum likelihood classifier returns:

$$\omega_i, \quad \text{where } i = \arg\max_i \Big\{ \max_{1 \le j \le m_i} \max_{\phi} P_{R^i_j}(G, \phi) \Big\},$$

with $P_{R^i_j}(G, \phi)$ as defined in equation (2.3). This procedure is described in more detail in Algorithm 2.


[Table 2.1 shows example page images for four genres: IJNM, JACM, STDV, and TNET.]
Table 2.1: Sample document images from four of the document genres in the test sample.

In establishing the isomorphism $\phi$ for classification, it is useful to use the log-likelihood function given in equation (2.10). We can then use the same bipartite graph matching technique introduced in section 2.2 and shown in figure 2.3. Instead of weighting each edge with the increment in entropy, however, we weight each edge with the log of the vertex factor from equation (2.8):

$$w_{ij} = -\tfrac{1}{2}(\mathbf{v}_j - \mu_{v_i})\Sigma_{v_i}^{-1}(\mathbf{v}_j - \mu_{v_i})^t - \ln (2\pi)^{\frac{d_1}{2}} |\Sigma_{v_i}|^{\frac{1}{2}}.$$

Determining the maximum weight matching then yields the isomorphism that maximizes the likelihood of the vertex densities.

The entropy threshold $h$ required by the training procedure described in Algorithm 1 has quite an influence over the resulting classifier. Setting $h = 0$ would not allow any FOGGs to be combined, resulting in a nearest neighbor classifier. For $h \rightarrow \infty$ all classes will be modeled with a single FOGG, with arbitrarily high entropy allowed in an individual FOGG. It is important to select an entropy threshold that balances the tradeoff between the complexity of the resulting classifier and the entropy inherent within each class.
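A compact sketch of the decision rule in Algorithm 2. Here `outcome_probability(R, G)` is an assumed helper that orients $R$ to $G$ via the bipartite matching above and returns $P_R(G, \phi)$ (or its log); only the argmax over classes and FOGGs is shown.

```python
def classify(G, graphical_models, labels, outcome_probability):
    """Maximum likelihood genre decision over per-class sets of FOGGs."""
    best_label, best_score = None, float("-inf")
    for label, model in zip(labels, graphical_models):
        score = max(outcome_probability(R, G) for R in model)  # best FOGG in the class
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```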

2.3 Experiments

We have applied the technique of first order Gaussian graphs to a problem from the document analysis field. In many document analysis systems it is desirable to identify the type, or genre, of a document before high–level analysis occurs. In the absence of any textual content, it is essential to classify documents based on visual appearance alone. This section describes a series of experiments we performed to compare the effectiveness of FOGGs with traditional feature–based classifiers.

2.3.1 Test data

A total of 857 PDF documents were collected from several digital libraries. The sample contains documents from five different journals, which determine the classes in our classification problem. Table 2.1 gives some example images from four of the genres.

All documents in the sample were converted to images and processed with the ScanSoft TextBridge† OCR system, which produces structured output in the XDOC format.

†TextBridge is a registered trademark of ScanSoft, Inc.


Only the layout information from the first page of a document is used, since it contains most of the genre-specific information. The importance of classification based on the structure of documents is immediately apparent after a visual inspection of the test collection. Many of the document genres have similar, if not identical, global typographical features such as font sizes, font weight, and amount of text.

2.3.2 Classifiers

To compare the effectiveness of genre classification by first order random graphs with traditional techniques, a variety of statistical classifiers were evaluated along with the Gaussian graph classifier. The next two subsections detail the specific classifiers studied.

First order Gaussian graph classifier

In this section we develop our technique for representing document layout structure using attributed graphs, which naturally leads to the use of first order Gaussian graphs as a classifier of document genre. For representing document images, we define the vertex attribute domain to be the vector space of text zone features. A document $D_i$ is described by a set of text zone feature vectors as follows:

$$D_i = \{z^i_1, \ldots, z^i_{n_i}\}, \quad \text{where } z^i_j = (x^i_j, y^i_j, w^i_j, h^i_j, s^i_j, t^i_j). \qquad (2.23)$$

In the above definition of a text zone feature vector,

• $x^i_j$, $y^i_j$, $w^i_j$ and $h^i_j$ denote the center, width and height of the text zone.

• $s^i_j$ and $t^i_j$ denote the average pointsize and number of textlines in the zone.

Each vertex in the ARG corresponds to a text zone in the segmented document image. Edges in our ARG representation of document images are not attributed. The presence of an edge between two nodes is used to indicate the Voronoi neighbor relation [25]. We use the Voronoi neighbor relation to simplify our structural representation of document layout. We are interested in modeling the relationship between neighboring text zones only, and use the Voronoi neighbor relation to identify the important structural relationships within a document.
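As an illustration of this representation, the sketch below builds the vertex features and neighbor edges for one page. It assumes zones are given as $(x, y, w, h, s, t)$ tuples as in equation (2.23), and approximates the Voronoi neighbor relation of [25] with the point Voronoi diagram of the zone centers, using the standard fact that two sites are Voronoi neighbors exactly when they share an edge in the Delaunay triangulation (pages with fewer than three non-collinear zones would need special handling).

```python
import numpy as np
from scipy.spatial import Delaunay

def document_arg(zones):
    """Vertex feature matrix and Voronoi-neighbor edge list for one page."""
    features = np.asarray(zones, dtype=float)      # rows: (x, y, w, h, s, t)
    centers = features[:, :2]
    tri = Delaunay(centers)                        # Delaunay triangulation of zone centers
    edges = set()
    for simplex in tri.simplices:                  # each triangle contributes 3 edges
        for a in range(3):
            for b in range(a + 1, 3):
                edges.add(tuple(sorted((int(simplex[a]), int(simplex[b])))))
    return features, sorted(edges)
```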

Given training samples from a document genre, we construct a graphical model according to Algorithm 1 to represent the genre. The entropy threshold is particularly important for this application. The threshold must be selected to allow variability in document layout arising from minor typographical variations and noisy segmentation, while also allowing for gross structural variations due to common typesetting techniques. For example, one genre may contain both one and two column articles. The threshold should be selected such that the FOGGs representing these distinct layout classes are not combined while clustering.


Statistical classifiers

Four feature–based statistical classifiers were evaluated in comparison with the first order Gaussian graph classifier. The classifiers considered are the 1–NN, linear–oblique decision tree [76], quadratic discriminant, and linear discriminant classifiers. Global page-level features were extracted from the first page of each document. Each document is represented by a 23 element feature vector as:

$$(\underbrace{n_p, n_f, n_{iz}, n_{tz}, n_{iz}, n_{tl}}_{\text{global document features}},\; \overbrace{p_{ta}, p_i, p_{te}, p_{in}, p_{it}, p_r, p_b}^{\text{proportional zone features}},\; \underbrace{h_0, h_1, \ldots, h_9}_{\text{text histogram}}).$$

The features are categorized as follows:

• Global document features, which represent global attributes of the document. The global features we use are the number of pages, fonts, image zones, text zones, and textlines in the document.

• Proportional zone features, which indicate the proportion of document page area classified by the layout segmentation process as being a specific type of image or text zone. The feature vector includes the proportion of page area classified as table, image, text, inverse printing, italic text, roman text, and bold text.

• Text histogram, which is a normalized histogram of pointsizes occurring in the document.

This feature space representation is similar to that used by Shin et al. [97] for their experiments in document genre classification. We do not include any texture features from the document image, however. Note that the features for the vertices in the FOGG classifier discussed in the previous subsection are essentially a subset of these features, with a limited set of features collected locally for each text zone rather than for the entire page.

2.3.3 Experimental results

The first set of experiments we performed was designed to determine the appropriate entropy threshold $h$ for our classification problem and test set. Figure 2.4 gives the learning curves for the first order Gaussian graph classifier over a range of training sample sizes and for several entropy thresholds.

The learning curves indicate that our classifier performs robustly for all but the highest thresholds. This implies that there is intrinsic structural variability in most classes which cannot be represented by a single FOGG. This is particularly true for small training samples. Note, however, that a relatively high threshold ($h = 0.05$) may be used with no performance loss when the sample size is more than 25. This indicates that a smaller and more efficient graphical model may be used to represent each genre if the training sample is large enough.

The second set of experiments provides a comparative evaluation of our classifier with the statistical classifiers described above. Figure 2.5 gives the learning curves of all classifiers evaluated.


[Figure: classification accuracy versus number of training samples, one curve per threshold: h = 0.00005, 0.0005, 0.05, and 0.5.]
Figure 2.4: Classification accuracy over a range of training sizes for various entropy thresholds. The x axis represents the number of training samples selected randomly per class. The y axis represents the estimated classification accuracy for the corresponding number of training samples.

The curves were obtained by averaging the classification results of 50 trials where $n$ samples were randomly selected from each genre for training, and the rest used for testing. These results indicate that the first order Gaussian graph classifier consistently outperforms all of the other classifiers.

2.3.4 Computational efficiency

Figure 2.6 indicates the computation time required for clustering and classification using FOGGs. The times were estimated by averaging over 50 experimental trials. Figure 2.6a shows the time to learn a graphical model of each class as a function of the number of training samples. A plot of $n^2/100$ is also shown for reference. The $n^2$ bound on clustering time indicates that the time required for clustering is dominated by the computation of the initial $n \times n$ distance matrix, with an additional constant factor representing the intrinsic complexity of each class. The constant factor of $1/100$ shown in the figure was determined experimentally, but is clearly class specific and depends on the variance in size of the training ARGs being clustered.


[Figure: classification accuracy versus number of training samples for the FOGG (h = 0.005), 1–NN, linear, quadratic, and decision tree classifiers.]
Figure 2.5: Learning curves for all classifiers. The curves indicate the estimated classification accuracy of each classifier over a range of training sample sizes. Classification accuracy was estimated by averaging over 50 trials for each sample size. The x axis represents the number of training samples selected randomly per class.

Figure 2.6b shows the time required to classify an unknown ARG as a function of the number of FOGGs used to represent each class. As expected, there is a linear dependence between the number of FOGGs in a class and the time required for classification. Again, the constant factor affecting the slope of each performance curve is determined by the complexity of each class.

2.3.5 Analysis

The experimental results presented above indicate that the FOGG classification technique outperforms statistical classifiers on our example application. In understanding the reasons for this it is useful to analyze some specific examples from these experiments. Table 2.2 gives an average confusion matrix for a linear discriminant classifier on our dataset.

Note that the class IJNM contains the highest percentage of confusions. The primary reason for this is that the class contains many structural sub–classes within it. Figure 2.7 provides examples from the three primary structural sub–classes within the IJNM class.


[Figure: (a) time to cluster (seconds) versus number of training samples; (b) time to classify (seconds) versus number of FOGGs per class; one curve per genre (TNET, IJNM, STDV, TOPO, JACM).]
Figure 2.6: Computation time for clustering and classification. (a) gives the average time required for clustering each class as a function of the number of training samples per class. A plot of $n^2/100$ is also shown for reference. (b) shows the average time required to classify an unknown ARG as a function of the number of FOGGs representing each class.

         IJNM    JACM    STDV    TNET    TOPO
IJNM    0.8524  0.0464  0.0250  0.0750  0.0012
JACM    0       0.9718  0       0       0.0282
STDV    0.0540  0.0016  0.9127  0.0310  0.0008
TNET    0.0199  0.0013  0.0584  0.9195  0.0009
TOPO    0.0724  0.0444  0.0047  0       0.8786

Table 2.2: Average confusion matrix for a linear discriminant classifier. Classification accuracy is estimated over 50 random trials.

As shown in the figure, the class contains document images with one, two, and three columns of text. The variance in structural layout composition makes it difficult for statistical classifiers to learn decision boundaries distinguishing this class from others. Similar structural sub–classes can also be seen in the TOPO class, which also accounts for the large number of confusions.

The main advantage of the FOGG approach over purely statistical classifiers lies in its ability to learn this type of subclass structure. Figure 2.8 gives an example snapshot of the clustering procedure on the IJNM class for a random training sample, with the clustering halted at a low entropy threshold of 0.00005. The figure shows how the distinct structural subclasses are represented by individual FOGGs. It is this independent modeling of subclasses that enables the FOGG classifier to outperform statistical methods.


[Figure: three example pages, titled Single Column, Two Column, and Three Column.]
Figure 2.7: Example document images from different structural categories within the IJNM class.

2.4 Discussion

We have described an extension to discrete first order random graphs which uses continuous Gaussian distributions for modeling the densities of random elements in graphs. The technique is particularly appealing for its simplicity in learning and representation. This simplicity is reflected in the ability to learn the distributions of random graph elements without having to worry about the discretization of the underlying feature space, and the ability to do so using relatively few training samples. The learned distributions are also effectively “memoryless” in that we do not have to remember the observed samples in order to update density estimates or compute their entropy.

Since we can quickly compute the increment in entropy caused by joining two distributions directly from the Gaussian parameters, the hierarchical clustering algorithm used to learn a class model is very efficient. The use of an approximate matching strategy for selecting an isomorphism to use when comparing random graphs also enhances the efficiency of the technique while preserving discriminatory power.

The experimental results in the previous section establish the effectiveness of first order Gaussian graphs as a classification technique on real–world problems. The need for modeling structure is evident, as the statistical classifiers fail to capture the sub–structural composition of some classes in our experiments. While it might be possible to enrich the feature space in a way that allows statistical classifiers to handle such structural information, the FOGG classifier is already capable of modeling these structural relationships. Moreover, the FOGG classifier uses a small subset of the feature space used by the statistical classifiers in our experiments. The important difference is that the FOGG classifier models the structural relationship between local feature measurements.

The FOGG classifier requires the selection of an appropriate entropy threshold for training. This threshold is problem and data dependent, and was more or less arbitrarily chosen for our experiments. The results shown in Figure 2.4 indicate that an adaptive entropy thresholding strategy might be effective for balancing the intrinsic entropy in a class with the desire to minimize the complexity of its graphical model. It is also possible that the use of a measure of cross–entropy, such as the Kullback–Leibler divergence [65], might permit the learning of models which simultaneously minimize the intra–class entropy while maximizing inter–class entropy. More research is needed on the subject of entropy thresholding in random graph clustering.


[Figure: the learned FOGGs of the graphical model on one side and the clustered ARGs assigned to each on the other; some clustered and unclustered ARGs are not shown.]
Figure 2.8: Example clusters learned from 30 randomly selected samples from the IJNM class of documents. FOGGs 1 and 2 represent the most commonly occurring two–column documents, while FOGG 6 represents the three–column layout style. A very low entropy threshold was used to halt the clustering procedure at this point for illustrative purposes. The unclustered ARGs have not been allowed to merge with any others at this point because they represent segmentation failures and their inclusion in another FOGG would drastically increase the overall entropy.


While the use of entropy as a clustering criterion has been shown to be effective, the precise relationship between the entropy of a distribution and its statistical properties is not well understood. It is possible that alternate distance measures, such as divergence or JM-distance [108], could be effectively applied to the clustering problem. Such distance metrics have precise statistical interpretations, and might allow for the establishment of theoretical bounds on the expected error of learned graphical models.
