
Context Dependent Segmentation and Matching in Image Databases

Hayit Greenspan
Faculty of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel

Guy Dvir
Faculty of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel

Yossi Rubner
Applied Materials, Israel

[email protected]

July 31, 2003

Abstract

The content of an image can be summarized by a set of homogeneous regions in an appropriate feature space. When exact shape is not important, the regions can be represented by simple “blobs”. Even for similar images, the blob representations of the two images might vary in shape, position, the number of blobs, and the represented features. In addition, separate blobs in one image might correspond to a single blob in the other image and vice versa. In this paper we present the BlobEMD framework as a novel method to compute the dissimilarity of two sets of blobs while allowing for context-based adaptation of the image representation. This results in representations that represent the original images well but at the same time are best aligned with respect to the representations of the context images.

We compute the blobs by using Gaussian mixture modeling and use the Earth Mover’s Distance (EMD) to compute both the dissimilarity of the images and the flow matrix of the blobs between the images. The BlobEMD flow-matrix is used to find optimal correspondences between source and target image representations and to adapt the representation of the source image to that of the target image. This allows for similarity measures between images that are insensitive to the segmentation process and to different levels of detail of the representation. We show applications of this method for content-based image retrieval, image segmentation, and matching models of heavily dithered images with models of full-resolution images.


1 Introduction

Many content-based retrieval works rely on an initial segmentation of the input and archived images. Yet, image segmentation remains one of the more challenging problems in computer vision and is often not well defined, as different contents entail different segmentations of the same image. For example, in some contexts it is more appropriate to segment together all the trees in an image of a forest, while in other contexts each tree should stand on its own. In this work we address the challenge of comparing similar images that are segmented differently and/or are represented at varying levels of resolution, as is the case in dithered images.

The “BlobEMD” framework is proposed in this work as a simultaneous solution to both the image representation problem and the estimation of the distance between images. This coupling allows for context-based model adaptation, where the representation of one image is adjusted based on the representation of a second image - the context. The framework combines an initial transition from image pixels to representative image regions (segments or blobs) via Gaussian mixture modeling (GMM) [2], followed by utilizing the Earth Mover’s Distance measure (EMD) [19] for finding the optimal correspondences between regions in the two images, and extracting an overall image matching measure between two input images. The correspondences between the regions in the two images are used to merge and split the regions, so that they still represent the images well but at the same time bring the two representations to a common context. For example, the problem of image segmentation is treated here as an image-pair (source-target) task. Thus, an image will be segmented differently based on the target image. The suggested framework provides image representations that are more uniform and best aligned between the two images to be matched.

The overall framework of the image representation and matching phases is presented in Figure 1. In section 2 we review some of the related work and motivate the proposed scheme. The BlobEMD framework is presented in section 3. In addition to the distance between two sets of blobs, the BlobEMD results in a flow matrix with correspondences between blobs. In section 4 we focus on the flow-matrix and provide a set of rules for extracting region-correspondences between images and for image model adaptation. Experimental evaluation of the BlobEMD framework, along with its application to context-based image segmentation and robust image matching, is presented in section 5.

Figure 1: A block diagram of the BlobEMD matching system

2 Related Work

Histograms are the classical means of representing image content and are widely used as the chosen image representation [8, 1]. A histogram is a discrete representation of the continuous feature space, generated by a partitioning of the feature space. The partitioning is determined by the feature space chosen (e.g. the color space representation), the quantization scheme chosen (such as uniform or vector quantization), as well as computational and storage considerations. The advantages and disadvantages of color histograms are well studied [23] and many variations exist [16, 22, 13].

Several measures have been proposed for the dissimilarity between two histograms. In general they can be divided into two categories [20, 17]: “bin-by-bin” measures, which compare contents of corresponding histogram bins, and “cross-bin” measures, which enable comparisons across non-corresponding bins. The first category includes the Minkowski-form distance, the histogram intersection (H.I.) measure [23, 20], the χ2 statistic, the Kullback-Leibler (KL) divergence [14, 4], and others. “Cross-bin” measures also combine the feature


space information of the bins (e.g. the dissimilarities between colors represented by the histogram bins). Such measures include the Quadratic-form distance [11], in which a similarity matrix is included to represent similarity between bins. The Earth Mover’s Distance measure [19] extracts dominant modes from a histogram, as a signature, and defines a measure of similarity between signatures. Additional distance measures between histogram representations in an image matching task are evaluated and compared in [19, 17, 20].

The histogram representation has been extended recently to include additional features

as well as spatial information. In [16] each entry of a “joint” histogram contains the number

of pixels in the image that are described by a particular combination of feature values. In

[22] local information is included by dividing an image into five fixed overlapping blocks and

extracting the first three color moments of each block to form a feature vector for the image.

In [13] correlograms are proposed to take into account the local color spatial correlation as

well as the global distribution of the spatial correlation.

Other works in image representation include “region-based” approaches. Image regions

are the basic building blocks in forming the visual content of an image, and thus have great

potential in representing the image content and enabling image matching. In [21] Smith

and Chang store the location of each color that is present in a sufficient amount in regions

computed using histogram backprojection. Ma and Manjunath [15] perform retrieval based

on segmented image regions. The segmentation is not fully automatic as it requires some

parametric tuning and hand pruning of regions. Unsupervised segmentation of an image

into homogeneous regions in feature space, such as the color and texture space, can be found

in the “Blobworld” image representation [2, 3]. In [2] a naive Bayes algorithm is used to

learn image categories from the blob representation in a supervised learning scheme. The

framework suggested entails learning blob-rules per category. Thus, one may argue that

there is a shift to a high-level image description (image labeling). Each query image is next

compared with the extracted category models, and associated with the closest matching

category. In [3] the user composes a query by viewing the Blobworld representation and

selecting the blobs to match along with possible weighting of the blob features. A query may

include a combination (conjunction) of two blobs. In essence, the image matching problem

is shifted to a (one or two) blob matching problem. Each blob is compared with all blobs in

each database image. Spatial information is thus included, yet in a very concise manner. It


should be noted that each blob is represented by a color histogram, thus the representation

is a discrete representation (in the image plane as well as in feature space).

An extension to the Blobworld system, termed “GMM-KL” framework, has recently

been proposed [10]. The set of regions in an image is represented by a continuous Gaussian

mixture model (GMM). Images are next compared and matched via the continuous and

probabilistic KL distance between distributions. The GMM-KL framework achieves strong

matching results between images while addressing the problem of ‘multiple-blob’ to ‘multiple-

blob’ matching. In the current work we similarly extend the Blobworld system to address

the ‘multiple-blob’ matching problem. The continuous GMM representation is used in the

image representation stage following which we utilize the EMD distance measure in the

matching stage. In addition to providing a distance measure between multiple blob sets,

the BlobEMD framework generates a flow-matrix which provides correspondences between

individual source and target blobs. Thus the BlobEMD flow-matrix addresses the region-

correspondence problem between the two images. This information is used for context-based

image model adaptation, as will be exemplified in the following sections.

3 The BlobEMD Framework

In order to measure similarities between images that are represented by homogeneous regions,

we need to define an appropriate dissimilarity measure. This problem is harder when the two

sets of regions don’t have clear correspondences, and often, a region in one image matches

the union of several regions, or parts of regions, in the second image. An example of this can be seen in Figure 8 (a). Both images show a lake and two trees. However, in the left image

the lake is represented by a single region while in the right image it is represented by three

regions. Similarly, the tree-tops in the right image are combined into a single region. In

order for the dissimilarity measure to perform properly, it should solve these cases. This is

done by the BlobEMD framework.

The BlobEMD framework [9] consists of three main steps (see Figure 1): First, each

input image is modeled as a Gaussian mixture distribution in a selected feature space. The

EMD is next utilized for measuring similarity between the respective models of two images.

In addition to the similarity measure between sets of regions, the EMD also returns the


correspondence (flow) between them. The third step uses these correspondences to adapt

one (source) image model based on the model of the second (target) image. Adaptation of the

image models achieves context based modeling and segmentation, and provides better overall

image similarity measures. The three steps are described in more detail in the following

sections.

3.1 Image representation via Gaussian mixture modeling

In the representation phase, each homogeneous region in the image is represented by a

Gaussian distribution and the set of regions in the image is represented by a Gaussian mixture

model (GMM). Pixels are grouped into homogeneous regions in the image plane by grouping

feature vectors in a selected feature space. We use the five-dimensional feature space of color

and space (L, a, b, x, y), where (L, a, b) is the 3-dimensional CIE-Lab color space [24], and

(x, y) is the spatial image plane. We use the CIE-Lab color space as it was designed so that

(short) Euclidean distances between two colors match perceptual similarity. The underlying

assumption is that the image colors and their spatial distribution in the image plane are

generated by a mixture of Gaussians. It should be noted that the representation model is

general, and can incorporate any desired feature space (such as color, texture, shape, etc) or

combination thereof.

The distribution of a random variable X ∈ R^d is a mixture of k Gaussians if its density function is:

    f(x|\theta) = \sum_{j=1}^{k} \alpha_j \frac{1}{\sqrt{(2\pi)^d |\Sigma_j|}} \exp\left\{ -\frac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right\},    (1)

such that the parameter set θ = {α_j, μ_j, Σ_j}_{j=1}^{k} consists of: α_j > 0, \sum_{j=1}^{k} \alpha_j = 1, μ_j ∈ R^d, and Σ_j is a d×d positive definite matrix.
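For concreteness, the mixture density of equation (1) can be evaluated directly; the following NumPy sketch (illustrative code, not part of the original work; the function name is ours) sums the weighted Gaussian components at a point x:

```python
import numpy as np

def gmm_pdf(x, alphas, mus, sigmas):
    """Evaluate the Gaussian mixture density f(x|theta) of Eq. (1)
    at a single point x (a length-d vector)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    total = 0.0
    for a, mu, cov in zip(alphas, mus, sigmas):
        diff = x - mu
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
        expo = -0.5 * diff @ np.linalg.solve(cov, diff)
        total += a * np.exp(expo) / norm
    return total
```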

Given a set of feature vectors x_1, . . . , x_n, the maximum likelihood estimate of θ is:

    \theta_{ML} = \arg\max_{\theta} f(x_1, \ldots, x_n | \theta).    (2)

Since a closed form solution for this maximization problem is not possible, we utilize the

Expectation-Maximization (EM) algorithm [5] as an iterative method to obtain θML (similar

to [3]).


The iterative EM algorithm is initialized via the K-means algorithm [7], and is repeated until the log-likelihood measure increases by less than a predefined threshold (1%) from one iteration to the next. The MDL principle [4] is used to select the number of mixture components (or number of means), k, to best suit the natural number of groups present in the image.
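As an illustration of this kind of model-order selection, one can fit mixtures over a range of k and keep the model with the lowest penalized-likelihood score. The sketch below uses scikit-learn's GaussianMixture with BIC, a criterion closely related to (but not identical with) the MDL rule used in the paper; the function name and parameter choices are our own assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k(features, k_max=6, seed=0):
    """Fit GMMs with k = 1..k_max components and keep the one with
    the lowest BIC score (a penalized-likelihood criterion akin to MDL)."""
    best, best_score = None, np.inf
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed).fit(features)
        score = gmm.bic(features)
        if score < best_score:
            best, best_score = gmm, score
    return best
```

For (L, a, b, x, y) blob modeling, `features` would be the per-pixel five-dimensional vectors; here any (n, d) array works.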

Once we associate a Gaussian mixture model with an image, the image can be viewed as a set of independently identically distributed samples from the Gaussian mixture distribution. Examples of images with their respective models are shown in Figures 8 - 11. Each localized Gaussian mixture is shown as a set of ellipsoids, with each ellipsoid representing the support, mean color, and spatial layout of a particular Gaussian in the image plane. The variability in the number of regions, their layouts, and their colors for input images of similar context is evident in the GMM representation as well as in the image plane.

3.2 The Earth Mover’s Distance (EMD)

In [19] the concept of the Earth Mover’s Distance is introduced as a flexible similarity measure

between multidimensional distributions, and is described in detail therein. Intuitively, given

two distributions represented by sets of weighted features, one can be seen as a mass of

“earth” properly spread in the feature space, the other as a collection of “holes” in that

same space. The EMD measures the least amount of work needed to fill the holes with earth.

Here, a unit of work corresponds to transporting a unit of earth by a unit of ground distance

which is a distance in the feature space. The EMD is based on the transportation problem

[12] and can be solved efficiently by linear optimization algorithms that take advantage of

its special structure.

Formally, let S = {(s_1, w_{s_1}), . . . , (s_m, w_{s_m})} be the first set with m regions, where s_i is the region descriptor and w_{s_i} is the weight of the region; T = {(t_1, w_{t_1}), . . . , (t_n, w_{t_n})} the second set with n regions; and DIST = [dist(s_i, t_j)] the ground distance matrix, where dist(s_i, t_j) is the distance between regions s_i and t_j. The EMD between sets S and T is then

    EMD(S, T) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} \, dist(s_i, t_j)}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}},    (3)

where F = [f_{ij}], with f_{ij} ≥ 0 the flow between s_i and t_j, is the optimal admissible flow from S to T that minimizes the numerator of (3) subject to the following constraints:

    \sum_{j=1}^{n} f_{ij} \le w_{s_i}, \qquad \sum_{i=1}^{m} f_{ij} \le w_{t_j}, \qquad \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\Big( \sum_{i=1}^{m} w_{s_i}, \; \sum_{j=1}^{n} w_{t_j} \Big).

Notice that the two sets can have different total weights. This allows for partial matches

[19]. The EMD results in both a distance measure and the actual flow. Both are used

in our framework.
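The transportation problem above can be solved with any linear-programming routine. The following sketch (an illustration assuming SciPy is available, and far less efficient than the specialized solvers referred to in [19]) encodes the flow variables f_ij, the weight constraints, and the total-flow equality, and returns both the EMD of equation (3) and the flow matrix:

```python
import numpy as np
from scipy.optimize import linprog

def emd(ws, wt, dist):
    """EMD between signatures with weights ws (m,) and wt (n,),
    given the m x n ground-distance matrix dist."""
    m, n = dist.shape
    c = dist.ravel()                      # cost of one unit of flow per arc
    A_ub, b_ub = [], []
    for i in range(m):                    # sum_j f_ij <= ws_i
        row = np.zeros(m * n); row[i*n:(i+1)*n] = 1.0
        A_ub.append(row); b_ub.append(ws[i])
    for j in range(n):                    # sum_i f_ij <= wt_j
        col = np.zeros(m * n); col[j::n] = 1.0
        A_ub.append(col); b_ub.append(wt[j])
    A_eq = [np.ones(m * n)]               # total flow = min(sum ws, sum wt)
    b_eq = [min(ws.sum(), wt.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None))
    flow = res.x.reshape(m, n)
    return (flow * dist).sum() / flow.sum(), flow
```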

3.3 Combining the EMD distance with GMM representation

The EMD distance is combined with the GMM image representation in the BlobEMD framework. The source and target sets (S and T) are the blob sets (GMMs) of the source and target images, and the EMD is used to find correspondences between the blobs, or regions. These correspondences are optimal in the sense that they minimize the overall EMD distance (equation 3) between the images.

Figure 2 shows the bi-partite graph with which the EMD problem is defined and solved.

The source and target images yield two sets of blobs {s_1, . . . , s_m} and {t_1, . . . , t_n}. The source blobs

comprise the vertices of the left-hand side of the bi-partite graph. The target blobs comprise

the right-hand vertices of the graph. Note that each of the two images can be represented

by a different number of blobs. Each connecting arc is weighted by the ground-distance

between the corresponding source and target blob pair. This ground distance, dist(s, t),

can be defined in several ways. Here we use the Frechet distance [6] which is a closed-form

solution to the EMD in the case of two equal weight Gaussians and therefore is a natural

distance for the Gaussian blob representation (see Appendix A). In the EMD algorithm,

each vertex has a description and a weight. In our case the vertex description corresponds

to the feature vector (blob description) and the weight of a vertex is defined by the relative

weight of the corresponding Gaussian, in other words, the relative number of pixels that

correspond to the Gaussian (blob). The source and target weights determine how much flow

can be transferred from the source blob and to the target blob, respectively.
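One standard closed form of the Frechet distance between two Gaussians is d² = ||µ1 − µ2||² + tr(Σ1 + Σ2 − 2(Σ1^{1/2} Σ2 Σ1^{1/2})^{1/2}); the paper's exact formulation is in Appendix A. A small SciPy-based sketch (illustrative, not the authors' code):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet(mu1, cov1, mu2, cov2):
    """Frechet distance between Gaussians N(mu1, cov1) and N(mu2, cov2):
    d^2 = ||mu1 - mu2||^2 + tr(cov1 + cov2 - 2 (cov1^1/2 cov2 cov1^1/2)^1/2)."""
    s1 = sqrtm(cov1)
    cross = sqrtm(s1 @ cov2 @ s1)        # may carry tiny imaginary noise
    d2 = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * np.real(cross))
    return float(np.sqrt(max(d2, 0.0)))
```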

Figure 2: Feature vectors (blob) correspondence using a fully-connected bi-partite graph

The EMD provides an optimal solution to the minimization problem defined on the bi-partite graph, with the constraint that the maximum possible flow is transferred from the source to the target image. The generated solution yields the best match between source and

target blobs of the corresponding source and target images, along with an overall minimal

distance between the images, as defined by equation (3). Solving the minimization problem

results in a generated flow matrix. The flow matrix represents the amount of flow on each

arc of the fully-connected bi-partite graph. Examples of flow-matrices can be seen in Figures

8 - 11. The flow value lies in the interval [0, 1], where 0 indicates that no flow exists through an

arc and 1 indicates that the entire weight of the source image is transferred through the arc

(this situation can occur in the trivial situation in which the source and the target images

consist of a single region each). The flow matrix shows the transformation of each blob in

the source image (rows) to blobs of the target image (columns).

3.4 Image model adaptation

Adaptation of an image model is useful when images are represented in inconsistent ways - for example, under- and over-segmented images in the spatial domain, or dithered images in the color domain. The resultant flow-matrix is used next for context-dependent image model adaptation.

Model adaptation can be applied in one of two possible adaptation modes: (1) Adapt

the representation model of a source image with respect to a second, target image, while

still maintaining similarity to the original model. Here only the source image representation is modified while the target image is unaffected. We hereon refer to this mode as

“source-to-target adaptation”; (2) Adapt both image models to reach the best common mutual representation, keeping their similarities with the respective images. This mode will be

referred to as “mutual adaptation”.

The model adaptation is performed by an iterative process on the GMM models of the

two images by applying a series of merging and splitting steps on the source image GMM,

or on both the source and target image GMMs, depending on the mode used. The rules for

blob merging and blob splitting are based on the BlobEMD flow-matrix and are defined in

detail in the following section. In general, two blobs from one image will be considered for

merging if they flow (almost) entirely to a single blob in the other image. A blob will be

considered for splitting if it flows to several blobs in the other image, and these blobs also

receive flow from other blobs in the first image. Without the second condition the merging

rule would be applicable in the opposite direction - merge the blobs in the other image to

match the blob in the first image. Merging is always preferable to splitting, to simplify the

resulting models.

4 Model Adaptation Rules

The candidate blobs for the merging and splitting are chosen based on the flow matrix that

results from the BlobEMD computation. Candidate blobs for a merge are characterized by

rows (or columns) with a single large value in the same column (or row) of the flow matrix. A candidate blob for splitting is characterized by a row (column) with multiple values such

that for each value, its respective column (row) contains additional non-zero entries.
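These row/column patterns can be scanned mechanically. The sketch below is illustrative; the function names and thresholds are our assumptions, not the paper's values. It flags merge pairs and split candidates from a flow matrix normalized by the source weights:

```python
import numpy as np

def merge_candidates(flow, ws, thresh=0.95):
    """Pairs of source blobs (rows) whose weight flows almost entirely
    into the same target blob (column) -- merge candidates."""
    frac = flow / ws[:, None]            # fraction of each blob's weight per arc
    dominant = [np.argmax(frac[i]) if frac[i].max() > thresh else None
                for i in range(flow.shape[0])]
    pairs = []
    for i in range(len(dominant)):
        for j in range(i + 1, len(dominant)):
            if dominant[i] is not None and dominant[i] == dominant[j]:
                pairs.append((i, j))
    return pairs

def split_candidates(flow, ws, thresh=0.2, eps=1e-9):
    """Source blobs whose flow spreads over several target blobs that
    also receive flow from other source blobs -- split candidates."""
    cands = []
    for i in range(flow.shape[0]):
        targets = np.where(flow[i] / ws[i] > thresh)[0]
        if len(targets) > 1 and all((flow[:, j] > eps).sum() > 1 for j in targets):
            cands.append(i)
    return cands
```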

For blobs in the candidate list to qualify for merging or splitting, three additional conditions need to be met:

1. Similarity in feature space. The BlobEMD finds correspondences between all blobs in

the source and target images in a way that minimizes the global distance between the

two sets of blobs. However, since the EMD process is forced to match all blobs, it often

needs to compromise and match blobs, or parts of blobs, that are rather dissimilar from

each other. We require the respective candidate blobs in the two images to exhibit good

similarity in the feature space. For that we use the same ground distance GDF (·, ·),


that was used for the BlobEMD computation. In this work we usually use the Frechet

distance in L, a, b color space (see Appendix A). In the case of dithered images the

Frechet distance is used in x, y space (as will be shown in section 5.4).

2. Significant spatial overlapping. Even when respective candidate blobs are similar in the

feature space, they might not be spatially close enough. Merging and splitting require

significant spatial overlap of the blobs. For this purpose we define a second ground

distance, GDS(·, ·), which ignores the similarity in the feature space and measures only

the spatial overlap. We require that this measure returns zero when spatially, one blob

completely contains the other (i.e. a small blob inside a large blob). Given two blobs

s and t, consider the corresponding sets of pixels {p_i}_{p_i ∈ s_{2σ}} and {p_j}_{p_j ∈ t_{2σ}}, where s_{2σ} and t_{2σ} are the 2σ projections of the Gaussian blobs on the x, y plane (i.e., all the pixels of the Gaussian blobs within Mahalanobis distance 2σ). We define this distance as

    GDS(s, t) = 1 - \frac{|\{p_i\} \cap \{p_j\}|}{\min(|\{p_i\}|, |\{p_j\}|)},    (4)

where | · | denotes the size of the set.

3. Significant flow. For a merge, we require that nearly all the weights of the candidate

blobs flow to the corresponding target blob. To split a candidate blob, we require that

the resulting blobs are not too small, i.e. the candidate blob has a significant flow to

the corresponding target blobs.
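Over explicit pixel sets, the spatial overlap distance of equation (4) reduces to a few lines; a minimal illustrative sketch:

```python
def gds(pixels_s, pixels_t):
    """Spatial ground distance of Eq. (4): one minus the overlap of the
    two blobs' 2-sigma pixel supports, normalized by the smaller blob."""
    s, t = set(pixels_s), set(pixels_t)
    return 1.0 - len(s & t) / min(len(s), len(t))
```

Note that a small blob fully contained in a large one yields 0, as required.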

The conditions for the merge and split are summarized in Table 1. In the diagrams,

the weight of blob si is denoted by wsi, and the flow between source blob si and target

blob tj, by f(si, tj). The conditions involve several empirical thresholds that are application

and domain dependent (see examples in Section 5). Notice that for the spatial similarity

condition the threshold for the merge, CS1 is different than the threshold for the split, CS2 .

For the merge we demand that the target blob overlaps the two source blobs, while for the

split, we require only partial overlap. In general, CS2 < CS1 . This reasoning also applies to

the threshold of the significant flow condition. For the merge we want Cflow1 to be close to

1, meaning that all the weights of the source blobs flow to the target blob. For the split we


require each of the target blobs to carry a significant amount of the source blob, therefore,

Cflow2 < Cflow1 < 1.

[The original table also contains before/after diagrams of the merge and split operations; these did not survive text extraction.]

Condition                  Merge                       Split

Feature space similarity   GDF(si, tk) < CF            GDF(si, tk) < CF
                           GDF(sj, tk) < CF            GDF(si, tl) < CF

Spatial similarity         GDS(si, tk) < CS1           GDS(si, tk) < CS2
                           GDS(sj, tk) < CS1           GDS(si, tl) < CS2

Significant flow           f(si, tk)/wsi > Cflow1      f(si, tk)/wsi > Cflow2
                           f(sj, tk)/wsj > Cflow1      f(si, tl)/wsi > Cflow2

Table 1: Merge and split conditions.

The model adaptation process consists of several consecutive merging and splitting steps

conducted on the source and target images. Next we describe in detail the merging and

splitting steps. A description of the entire process will follow.

4.1 Blob merging

In the merging process the mixture model is updated, resulting in a smaller set of blobs and

updated feature characteristics. The process is an iterative one, passing through all merging candidate lists and terminating when no additional merging is possible.



Figure 3: Synthetic example of source-to-target merging process. (a) A cross image is the source image (left) that is matched to the target, line image (right); (b) Initial image models (representation layer); (c) Final image models following source model adaptation. Notice that the two blobs in the source image that match the line in the target image were merged together.

The merging process replaces pairs of blobs from the source image with a single new blob.

The new blob’s spatial position and statistics are based on the original source blobs. Given

two blobs: bi = (wi, µi, Σi) and bj = (wj, µj, Σj), the merged blob parameters b = (w, µ, Σ)

are calculated as follows:

    w = w_i + w_j    (5)

    \mu = \frac{w_i}{w} \mu_i + \frac{w_j}{w} \mu_j    (6)

    \Sigma = \frac{w_i}{w} \left( \Sigma_i + \mu_i \mu_i^T \right) + \frac{w_j}{w} \left( \Sigma_j + \mu_j \mu_j^T \right) - \mu \mu^T    (7)

The derivations of these equations can be found in Appendix B.
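Equations (5)-(7) are a moment-preserving merge and translate directly into code; a small NumPy sketch (illustrative, with our own function name):

```python
import numpy as np

def merge_blobs(b1, b2):
    """Moment-preserving merge of two Gaussian blobs, Eqs. (5)-(7).
    Each blob is a (weight, mean, covariance) triple."""
    (wi, mi, Si), (wj, mj, Sj) = b1, b2
    w = wi + wj                                   # Eq. (5)
    mu = (wi * mi + wj * mj) / w                  # Eq. (6)
    Si_2nd = Si + np.outer(mi, mi)                # second moments of each blob
    Sj_2nd = Sj + np.outer(mj, mj)
    cov = (wi * Si_2nd + wj * Sj_2nd) / w - np.outer(mu, mu)  # Eq. (7)
    return w, mu, cov
```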

Figure 3 shows an example of the context-based merging process. An image of a cross is

the source image (left) that is matched to the target, an image of a line (right). The initial

image source models are shown in the center row, and the resulting image models, following source model adaptation, are shown in the bottom row. Perceptually, the image models look more

similar following the merging process.


4.2 Blob splitting

Splitting occurs, for example, in images with a large uniform background that is represented

by a single large blob, or when the segmentation process results in a small number of segments

(under-segmentation). Often, splitting blobs enables the blob parts to be merged with other

blobs in a follow-up merging process.

Hereon we refer to the set of target blobs to which the source blob flows as the “target-blobs” set. Once the target-blobs set is defined per source blob, we wish to split the source

blob into a set of smaller blobs, each corresponding to one of the target blobs in the set. The

splitting process is done as follows:

1. Randomly sample the source blob according to its Gaussian distribution.

2. Each sample x is probabilistically affiliated with each target-blob distribution gj(x|θj),

j = 1, . . . , N .

3. For each target blob j, the set of M samples from the source blob of highest affiliation

to blob j is collected.

4. A Gaussian is learned for each set of M samples.

5. The source image mixture model is updated accordingly.
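The steps above can be sketched for the 1-D case; this is an illustrative simplification (target blobs as (mean, std) pairs, names ours), not the paper's implementation:

```python
import math
import random

def gauss_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used as the affiliation score."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def split_blob(source, targets, n_samples=2000, m=500, rng=random):
    """Split a 1-D source blob into one part per target blob."""
    src_mean, src_std = source
    # Step 1: randomly sample the source blob from its Gaussian distribution.
    samples = [rng.gauss(src_mean, src_std) for _ in range(n_samples)]
    parts = []
    for t_mean, t_std in targets:
        # Steps 2-3: keep the M samples with highest affiliation to this target.
        best = sorted(samples, key=lambda x: gauss_pdf(x, t_mean, t_std),
                      reverse=True)[:m]
        # Step 4: fit a Gaussian to each set of M samples.
        mean = sum(best) / m
        var = sum((x - mean) ** 2 for x in best) / m
        parts.append((mean, math.sqrt(var)))
    return parts  # Step 5: these parts replace the source blob in the mixture.
```

Splitting a broad blob N(0, 3) against targets centered at -2 and +2 produces two parts whose means straddle zero, as intended.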

Figure 4 shows an example of source-to-target context-based splitting on synthetic images. The representation of the input image (top left) is updated according to the given target image (top right). The input representation layer is shown in the center row. The resulting output models, following one step of source model adaptation, are shown in the bottom row. Note that if mutual adaptation were pursued in this case, a merging of the target model would have preceded the splitting of the source model.

4.3 The complete adaptation process

Figure 5 shows a flow chart of the complete adaptation process, an iterative sequence of merging and splitting steps. The adaptation process modifies either the source model or both the source and target models according to the adaptation mode (source-to-target
or mutual adaptation). The update loop terminates once no change is found in the source

model (source-to-target adaptation) or in both the source and the target models (mutual

adaptation mode). An optional post-processing step follows the main update loop. The post-

processing includes an additional source-target merging step followed by an intra-merging

step.

Intra-merging is an additional blob-merging step that is pursued in the mutual adaptation mode for each of the source and target models. It acts as an image smoothing filter: the blob set of each image is checked for pairs of highly similar blobs. Two blobs bi and bj within an image may be merged if they are close in feature space and spatially similar. We use the following criteria: GDF (bi, bj) < 0.05, GDS(bi, bj) < 1.0. The intra-merging step was found to be helpful in cases that result in many small blobs, i.e., when the optimal match still entails a very large set of blobs (such a case may occur if we start with a large set of blobs in each image). The outcome of the adaptation process is a set of newly segmented source and target models with a final updated distance measure between them.
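The control flow of Figure 5 can be sketched schematically; the merge and split passes are left as injected callables that report whether they changed anything, and every name here is ours, not the paper's:

```python
# Schematic of the iterative adaptation loop: repeat merge and split passes
# until the source model (or, in mutual mode, both models) stops changing.

def adapt(source, target, merge_pass, split_pass, mutual=False):
    """Run merge/split passes until convergence and return both models.
    merge_pass/split_pass mutate a model in place and return True if it changed."""
    while True:
        changed = False
        pairs = [(source, target), (target, source)] if mutual else [(source, target)]
        for model, context in pairs:
            changed |= merge_pass(model, context)
            changed |= split_pass(model, context)
        if not changed:
            return source, target
```

An optional post-processing stage (a source-target merge followed by intra-merging) would follow this loop, as described above.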

5 Experimental Results of the BlobEMD Framework

We have described the BlobEMD framework, which consists of three main steps. First, each input image is modeled as a Gaussian mixture distribution in the joint (L, a, b, x, y) feature space. Second, the EMD is utilized for measuring similarity between the respective models of two images; in addition to the similarity measure between sets of regions, the EMD also returns the correspondence (flow) between them. The third step uses these correspondences to adapt the source and target models according to the adaptation mode chosen.
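As a self-contained illustration of the second step (not the solver used in the paper, which handles general signatures [19]): in the special case of two models with the same number of equal-weight blobs, the EMD reduces to an assignment problem that can be brute-forced for tiny models, with a pluggable ground distance:

```python
from itertools import permutations

def emd_equal_weight(src, dst, ground_dist):
    """EMD for two sets of n equal-weight blobs: the minimum average
    ground distance over all one-to-one matchings (brute force, tiny n only)."""
    n = len(src)
    assert n == len(dst)
    return min(sum(ground_dist(src[i], dst[p[i]]) for i in range(n)) / n
               for p in permutations(range(n)))
```

With 1-D "blobs" at {0, 10} and {0.5, 10.5} and an absolute-difference ground distance, the optimal flow pairs nearby blobs and gives distance 0.5, while the crossed matching would cost 10.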

In this section we present an investigative analysis of the BlobEMD framework. We start

with the combination of the first two steps: the GMM representation and the EMD distance

without the merging and splitting steps. We investigate the framework’s robustness in the

image matching task and its application to the image retrieval task. We next illustrate

the utilization of the flow-matrix for model adaptation within several application domains.

These include context-based image segmentation and dithered image matching.

5.1 Robustness to fragmentation in the image representation

Images with semantically similar content may be represented by a differing number of regions via the Gaussian mixture model (parameter k). The goal is to compare and match images regardless of this variability. In [10] we introduced a novel intra-inter class statistical evaluation methodology as a benchmarking measure. The intra-class set pairs image samples with similar content, and the inter-class set pairs images with different content. We use the inter-intra evaluation scheme to evaluate the robustness of the BlobEMD framework to fragmentation in the image representation.

In this experiment we use a random set of 245 images extracted from the COREL database. The ground truth is generated by choosing four mixture representations (four values of k, k = 3, 4, 5, 6) per input image. The "intra-class" distance set is computed as the distances between all combinations of representation models per image. Note that the similarity of the models within the "intra-class" set is an objective one and does not depend on subjective labeling. Overall we have a set of 12 non-zero distances per image. This process is repeated for each of the 245 images in the database, for an overall 12 × 245 distances. A second set of distances is computed across images, with each image represented by the MDL-chosen mixture representation (the optimal k value). We term this set of distances (with 245 × 244 distances) the "inter-class" distance set.
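The construction of the intra-class set can be sketched as follows; `dist` stands in for the BlobEMD distance, and four models per image yield the 4 × 3 = 12 ordered non-zero distances described above:

```python
# Build the intra-class distance set for one image: all ordered pairs of
# its mixture representations, excluding self-distances.

def intra_class_distances(models, dist):
    n = len(models)
    return [dist(models[i], models[j])
            for i in range(n) for j in range(n) if i != j]
```

Repeating this over all 245 images yields the 12 × 245 intra-class distances.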

A histogram of the "intra-class" and "inter-class" distances is plotted in each of the two graphs presented in Figure 6. The graph on the left shows results in the color-only feature space, while the graph on the right shows the distances between images when compared in the combined color and spatial feature space. Two distinct modes are present in both graphs, demonstrating the clear separation between the sets. The "intra-class" distances are narrowly spread at the lower end of the axis (close to zero), as compared to the widely spread and larger distance values of the "inter-class" set. The results indicate the strong similarity between same-class models (the same image with different values of k), regardless of the variability in the representation. The BlobEMD framework is thus robust to fragmentation in the representation space.

5.2 Statistical performance evaluation

We next demonstrate the applicability of the presented framework to the image retrieval task. In addition to the random set of 245 images, an additional set of 70 images was hand-picked as comprising 6 different classes or categories (10 images per class). Labeled categories include: "car", "desert", "field", "monkey", "snow" and "waterfall". Each image in the database is processed to extract the localized Gaussian mixture representation. The BlobEMD with the Frechet ground distance is next computed between each of the images and an input query image. The images are sorted based on the distance and the closest ones are presented as the retrieval results.

Retrieval results are evaluated by precision versus recall (PR) curves. Recall measures the ability to retrieve all relevant or similar information items in the database. It is defined as the ratio between the number of relevant or perceptually similar items retrieved and the total number of relevant items in the database (in our case, 10 relevant images per labeled class). Precision measures the retrieval accuracy and is defined as the ratio between the number of relevant or perceptually similar items and the total number of items retrieved.
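These two definitions translate directly to code; the function and argument names are ours:

```python
# Precision and recall for a ranked retrieval list, as defined above.
# `retrieved` is the ranked list of returned item ids; `relevant` is the
# ground-truth set of relevant item ids.

def precision_recall(retrieved, relevant):
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return precision, recall
```

For instance, with 10 relevant images per class, retrieving 5 items of which 3 are relevant gives precision 0.6 and recall 0.3; a PR curve is obtained by varying the number of items retrieved.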

Precision vs. recall (PR) curves are extracted for each of the 6 categories. A comparison is conducted with global histogram representations and several histogram distance measures, as well as with our earlier work on the GMM-KL framework [10]. In the GMM-KL framework the continuous KL distance is used to measure the distance between two continuous distributions, the two GMMs representing the two image inputs. The definition of the continuous KL distance is given in Appendix A. Histogram measures include the bin-to-bin Euclidean distance (Euc.), the histogram intersection measure (H. I.) and the discrete KL measure (Disc. KL) [23, 20, 17]. A binning of 8 × 8 × 8 is used in the histogram representation. This resolution (512 quantization levels) is commonly found in the literature, and is of the same order of magnitude as (and favorable to) the GMM representation. Curves are presented in Figure 7. Each plot is an average of the results of the 10 query images in the class.

We notice the following points:

1. In most cases retrieval results are better when using color-only features (dashed black line) and slightly worse when spatial features are added (dashed red line). This is consistent with the earlier results shown in Figure 6, and agrees with previous works (e.g. [23]).

2. The BlobEMD framework provides very similar results to the GMM-KL framework. In some cases the BlobEMD is better, and in others the GMM-KL framework gets better results. This behavioral pattern is to be expected, as the two schemes are closely related (with the advantage of the BlobEMD for model adaptation).

3. In all cases, the BlobEMD method provides better performance than histogram-based

methods.

5.3 Context-based image segmentation

In this and the following sections we focus on the model adaptation task. The challenge of image segmentation is treated in this work as an image-pair (source-target) task: an image will be segmented differently based on the context as reflected by the target image. The model adaptation is performed by iteratively applying a series of merging and splitting steps to the GMM of the source image, or to the GMMs of both the source and target images, depending on the adaptation mode used. The rules for blob merging and blob splitting are based on the BlobEMD flow-matrix, as defined in Table 1. In the experiments presented in this section the following thresholds were used:

Merging rules thresholds: CF = 0.2, CS1 = 0.75 and Cflow1 = 0.6. Splitting rules thresholds:

CF = 0.2, CS2 = 0.75 and Cflow2 = 0.01. Thresholds were selected heuristically based on

experimentation.

In Figure 8 we illustrate context-based image model adaptation for adaptive segmentation

and image-pair matching on the Lake image example. In this example, similar semantic

content (“trees next to a lake”) is represented by a different number of regions and region

colors (a). The treetops are separate in one image and merged in the other, while the lake

appears as separate blobs in one image and as a single blob in the other. The initial source

and target image models are shown in (b) with the corresponding flow matrix shown in (c).

The updated source and target image segmentation maps, image models and corresponding

flow matrix are shown in (d), (e) and (f), respectively. Note the resemblance of the two

updated image models in (e) vs. the initial representation in (b). The context-based model

update results in updated image distances. In this example, the BlobEMD distance is 0.08

in the initial representation phase and 0.04 in the final representation. A decrease of 50% in

the distance is achieved via the update process.

A second example is shown in Figure 9. In (a) we show two similar images of a red car. Due to differences in the segmentation process, they result in very different segmentations, as shown in (b) and (c), top. The corresponding GMM models are also significantly different, as shown in (b) and (c), bottom. Using the model adaptation process (source-to-target adaptation), the final modeling and segmentation results are shown in (d). The region-correspondence process, along with merging and splitting, provides an updated model whose segmentation is very similar across the two images (compare (b) and (d)). Note also that the model adaptation results in smoother regions and similar-looking object (car) silhouettes.

5.4 Matching dithered images

Dithered images are images with reduced resolution in color space: due to limitations of the display or printing device, or because of a compression process, only a limited set of discrete colors is used. The perceived color is based on our ability to blend a mixture of sometimes very different colors into coherent colors that are not in the given set, as in the example of the Monkey in Figure 10(a).

When a dithered image is modeled using only the limited set of colors, the resulting model is very different from the model of the original, non-dithered image. Classical techniques such as histograms fail to identify the similarity of the two models. Using the BlobEMD framework we can adapt the dithered image representation according to the target image representation and enable a comparison between them.

The following algorithm characteristics apply for dithered images. The similarity in feature space, GDF, is the Frechet ground distance on (x, y) space only. Here we do not use the color information for the ground distance, as the distance between dithered image colors and their original image colors may be large, while the mixture of the dithered colors may closely resemble the desired color at that location. The merging process in the color feature space is thus critical in this application domain. The criterion for the merging process is in the spatial domain: the blobs to be merged overlap in space (two colored blobs in the dithered image overlap and flow to the same blob in the target image). The thresholds used are the following. Merging rules thresholds: CF = 1.0, CS1 = 0.6 and Cflow1 = 0.6. Splitting rules thresholds: CF = 0.2, CS2 = 0.75 and Cflow2 = 0.01. Thresholds were selected heuristically based on experimentation.
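A hypothetical sketch of the spatial-overlap part of this merging criterion (the exact test used in the paper is defined by the GDS threshold above; here we simply treat each blob's spatial support as a one-sigma disk, and all names are ours):

```python
# Two blobs are considered spatially overlapping when their centers are
# closer than the sum of their spatial spreads (one-sigma radii).

def spatially_overlap(mu_a, sigma_a, mu_b, sigma_b):
    dist = sum((x - y) ** 2 for x, y in zip(mu_a, mu_b)) ** 0.5
    return dist < sigma_a + sigma_b
```

In the dithered case, two source blobs passing this test and flowing to the same target blob become merge candidates.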

Figure 10 shows an example of comparing a target image (top left) with a dithered version (27 colors) as the source (query) image (top right). A zoom-in window is shown and clearly demonstrates the differences between the two input images. Source-to-target adaptation is used. The initial models extracted for the two images are shown at the bottom of (b) and (c), with the corresponding segmentation maps in (b) and (c), top. The differences between the images are again evident in their respective models. The BlobEMD framework enables a model adaptation process with a final updated model that fits the target model in both color and spatial layout (d). Note the strong resemblance between the models of (b) and (d), especially as contrasted with (c).

A second example is presented in Figure 11. The target image is shown top left and a dithered version (27 colors) as a query image is shown top right. An extension to mutual adaptation is shown in (d): here the target image model is adapted as well, for a final result that is a more compact representation of both source and target images. The updated representation results in updated image distances. In this example the BlobEMD distance in the initial representation phase is 0.1. Following source-to-target adaptation the distance reduces to 0.05. In the final mutual adaptation stage the distance value is 0.036. A decrease of more than 50% in the distance is achieved via the update process.

6 Discussion

In this work we present the BlobEMD framework as a simultaneous solution to both the image region correspondence problem and the estimation of an image-pair distance. This coupling allows for context-based model adaptation, where the representation of one image is adjusted based on the representation of a second image: the context.

We present a different approach to the image segmentation problem. Rather than trying to estimate the "true" segmentation of an image, the BlobEMD framework provides for context-dependent image segmentation. The segmentation problem is treated in conjunction with the image matching problem: an image may be segmented differently according to the target image it is being compared with. Context-based image segmentation and image matching are enabled via the EMD flow.

In the BlobEMD framework the image is represented in the continuous domain using GMM statistical modeling. The EMD optimization enables matching of individual model components (Gaussians, or blobs) while providing an overall distance measure between the image distributions. There are interesting distinctions from earlier work: the image is represented via a continuous and probabilistic representation, as opposed to the well-known discrete histogram representation, and global image matching is achieved along with a correspondence mapping of the individual representation components. This mapping is not available in global matching techniques such as the recently proposed GMM-KL framework.

A comparison between the BlobEMD and GMM-KL methods has been presented in the experimentation section. The results demonstrate a strong correlation between the performance of the two approaches. The two approaches share the same representation of the image space, differing in the distance measures used for image matching. The GMM-KL framework is a continuous probabilistic framework throughout, with the continuous KL distance measure used for a statistical comparison between two GMM distributions. The BlobEMD framework provides the global distance measure along with an insight into the correspondences found between individual mixture components, or image regions. This mapping is essential for model adaptation and any other application that relies on region correspondences. The price paid for this inside view is a slight decrease in the accuracy of the global distance measure.

An open theoretical issue for investigation is the definition of an appropriate ground distance for Gaussian (blob) comparison. Both the KL distance and the Frechet distance are defined for equal-weight Gaussians. A challenge remains to find a more exact mathematical formalism for the comparison between nonequal-weight Gaussians, as is the case at hand.

Using the BlobEMD framework, we solve the region correspondence problem across an image pair. The correspondences between the regions in the two images are used to merge and split the regions so that they still represent the images well, but at the same time bring the two representations to a common context. The suggested framework provides for image representations that are more uniform and best aligned between the two images to be matched.

We view this work as a first step in an extensive research effort ahead, in which we augment the region representation vector to include features such as texture, size and shape, in addition to the color feature chosen here. A definition of a hierarchical matching framework is under way. Region correspondences based on low-level features such as color and texture may provide a semantically plausible image segmentation, thus enabling the extension of the feature space to include higher-level, more semantic region characteristics, such as region sizes and shapes. In Figure 9 we see that the model adaptation results in smoother regions and similar-looking object (car) silhouettes. The BlobEMD methodology may provide the means for the much desired transition from regions to silhouettes and shapes.

A Frechet ground distance

The Frechet distance is a special case of the Monge-Kantorovich mass transference problem

[18] which is the basis to the EMD. The general Monge-Kantorovich problem is defined as

\inf \Big\{ \int_{U \times U} c(s, t) \, P(ds, dt) \; : \; P \in \mathcal{P}(P_1, P_2) \Big\} , (8)

where P1 and P2 are two Borel probability measures given on a separable metric space (U, d), and P(P1, P2) is the space of all Borel probability measures P on U × U with fixed marginals P1(·) = P(· × U) and P2(·) = P(U × ·). P1 and P2 are the initial and final distributions, and P is the optimal transference plan, or the flow, as used in this work. c(s, t) is the cost function, which in our work is the Euclidean distance.

The Frechet distance [6] solves the general Monge-Kantorovich problem for the case where s and t are normal distributions with means \mu_s, \mu_t and covariance matrices \Sigma_s, \Sigma_t, respectively:

d^2(s, t) = |\mu_s - \mu_t|^2 + \mathrm{tr}\big[ \Sigma_s + \Sigma_t - 2 (\Sigma_s \Sigma_t)^{1/2} \big] . (9)

It is a closed-form solution to the EMD in the case of two equal weight Gaussians and is

a natural distance for the Gaussian blob representation. Unfortunately, when two Gaussian

blobs have different weights, the Frechet distance is not valid. An extension for the non-equal

weights case is yet to be investigated.
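As an illustration only, Eq. (9) specializes neatly to diagonal covariance matrices, where the matrix square root of $\Sigma_s \Sigma_t$ reduces to an element-wise square root; the function name is ours and variances are passed as lists of diagonal entries:

```python
import math

# Squared Frechet distance (Eq. 9) between two Gaussians with diagonal
# covariances: |mu_s - mu_t|^2 + sum_k (s_k + t_k - 2 sqrt(s_k t_k)).

def frechet_dist2(mu_s, sig_s, mu_t, sig_t):
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))
    trace_term = sum(s + t - 2 * math.sqrt(s * t) for s, t in zip(sig_s, sig_t))
    return mean_term + trace_term
```

In 1-D this is $(\mu_s - \mu_t)^2 + (\sqrt{\sigma_s^2} - \sqrt{\sigma_t^2})^2$, and the distance between identical Gaussians is zero, as expected.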

B Merging blob statistics

Let bi = (wi, µi, Σi) and bj = (wj, µj, Σj) be two blobs to be merged, where wi, wj are the

weights, µi, µj the means, and Σi, Σj the covariance matrices of the blobs. We look for

the blob b = (w, µ, Σ) that represents the statistics of the union of the two sets of pixels

represented by the two blobs.

Let n_i and n_j be the number of pixels represented by blobs b_i and b_j respectively. We have

\mu_i = \frac{1}{n_i} \sum_{p \in b_i} p , \qquad \mu_j = \frac{1}{n_j} \sum_{p \in b_j} p .

Combining the two sets of pixels b_i \cup b_j, we get the combined mean

\mu = \frac{1}{n_i + n_j} \sum_{p \in b_i \cup b_j} p
    = \frac{1}{n_i + n_j} \Big( \sum_{p \in b_i} p + \sum_{p \in b_j} p \Big)
    = \frac{1}{n_i + n_j} (n_i \mu_i + n_j \mu_j)
    = \frac{n_i/n}{(n_i + n_j)/n} \mu_i + \frac{n_j/n}{(n_i + n_j)/n} \mu_j
    = \frac{w_i}{w} \mu_i + \frac{w_j}{w} \mu_j ,

where w = w_i + w_j.

Similarly for the covariance matrices we have

\Sigma_i = \frac{1}{n_i} \sum_{p \in b_i} p p^t - \mu_i \mu_i^t , \qquad \Sigma_j = \frac{1}{n_j} \sum_{p \in b_j} p p^t - \mu_j \mu_j^t .

Combining the two sets of pixels b_i \cup b_j, we get the combined covariance

\Sigma = \frac{1}{n_i + n_j} \sum_{p \in b_i \cup b_j} p p^t - \mu \mu^t
       = \frac{1}{n_i + n_j} \Big( \sum_{p \in b_i} p p^t + \sum_{p \in b_j} p p^t \Big) - \mu \mu^t
       = \frac{1}{n_i + n_j} \big( n_i (\Sigma_i + \mu_i \mu_i^t) + n_j (\Sigma_j + \mu_j \mu_j^t) \big) - \mu \mu^t
       = \frac{n_i/n}{(n_i + n_j)/n} (\Sigma_i + \mu_i \mu_i^t) + \frac{n_j/n}{(n_i + n_j)/n} (\Sigma_j + \mu_j \mu_j^t) - \mu \mu^t
       = \frac{w_i}{w} (\Sigma_i + \mu_i \mu_i^t) + \frac{w_j}{w} (\Sigma_j + \mu_j \mu_j^t) - \mu \mu^t .

References

[1] J. R. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Jain, and C. F. Shu. Virage image search engine: an open framework for image management. In R. Jain (ed.), Symposium on Electronic Imaging: Science and Technology - Storage and Retrieval for Image and Video Databases IV, volume IV, pages 76–87, 1996.

[2] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Region-based image querying. In Proc. of the IEEE Workshop on Content-Based Access of Image and Video Libraries (CVPR'97), pages 42–49, 1997.

[3] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:1026–1038, August 2002.

[4] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.

[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc. B, 39(1):1–38, 1977.

[6] D. C. Dowson and B. V. Landau. The Frechet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12, 1982.

[7] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons Inc., 1973.

[8] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, et al. Query by image and video content: the QBIC system. IEEE Computer, 28(9):23–32, 1995.

[9] H. Greenspan, G. Dvir, and Y. Rubner. Region correspondence for image matching via EMD flow. In Proceedings of the CVPR 2000 Workshop on Content-Based Access of Image and Video Libraries, pages 27–31, 2000.

[10] H. Greenspan, J. Goldberger, and L. Ridel. A continuous probabilistic framework for image matching. Computer Vision and Image Understanding, 84:384–406, December 2001.

[11] J. Hafner, H. Sawhney, W. Equitz, M. Flickner, and W. Niblack. Efficient color histogram indexing for quadratic form distance functions. IEEE Trans. Pattern Analysis and Machine Intelligence, 17(7):729–739, 1995.

[12] F. L. Hitchcock. The distribution of a product from several sources to numerous localities. J. Math. Phys., 20:224–230, 1941.

[13] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih. Image indexing using color correlograms. In Proc. of the IEEE Comp. Vis. and Patt. Rec., pages 762–768, 1997.

[14] S. Kullback. Information Theory and Statistics. Dover, 1968.

[15] W. Ma and B. Manjunath. NeTra: A toolbox for navigating large image databases. In Proceedings of the IEEE Int. Conf. on Image Proc., pages 568–571, 1997.

[16] G. Pass and R. Zabih. Comparing images using joint histograms. Multimedia Systems, 7:234–240, 1999.

[17] J. Puzicha, Y. Rubner, C. Tomasi, and J. M. Buhmann. Empirical evaluation of dissimilarity measures for color and texture. In IEEE International Conference on Computer Vision, pages 1165–1172, 1999.

[18] S. T. Rachev. The Monge-Kantorovich mass transference problem and its stochastic applications. Theory of Probability and its Applications, XXIX(4):647–676, 1984.

[19] Y. Rubner and C. Tomasi. Perceptual Metrics for Image Database Navigation. Kluwer Academic Publishers, Boston, December 2000.

[20] J. R. Smith. Integrated Spatial and Feature Image Systems: Retrieval, Analysis and Compression. PhD thesis, Columbia University, 1997.

[21] J. R. Smith and S.-F. Chang. Integrated spatial and feature image query. Multimedia Systems, 7:129–140, 1999.

[22] M. Stricker and A. Dimai. Spectral covariance and fuzzy regions for image indexing. Machine Vision and Applications, 10(2):66–73, 1997.

[23] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11–32, 1991.

[24] G. Wyszecki and W. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. Wiley, 1982.

Figure 4: Synthetic example of the source-to-target splitting process. (a) The source image (left) is matched to the target image (right); (b) Initial image models (representation layer) and their flow matrix from source to target; (c) Image models after splitting; (d) Final image models after merging, following source model adaptation.

Figure 5: Model adaptation flow chart

(a) (b)

Figure 6: Statistical analysis of intra-class distances (black) vs. inter-class distances (white). (a) (L,a,b) feature space; (b) (L,a,b,x,y) feature space. The x-axis is the BlobEMD distance and the y-axis is the frequency of occurrence of the respective distance in each of the two feature spaces.

[Six precision vs. recall plots: field, snow, car (top row); desert, monkey, waterfall (bottom row). Axes: Recall (x), Precision (y).]

Figure 7: Precision vs. Recall. 315 images in the database. Each plot is an average of the results of the 10 query images in the class. Dashed curves are the BlobEMD results: dashed black for color only; dashed red for color and (x, y). Solid lines are for comparison: black is the PR curve of the GMM-KL framework; the purple, red and green curves correspond to the histogram representation with the Euc., H. I., and Disc. KL distance measures, respectively.

Figure 8: Context-based image representation and matching via BlobEMD. (a) An image pair example; (b) Source and target image models; (c) Corresponding flow matrix; (d) Updated source and target image segmentation maps; (e) Updated source and target image models; (f) Updated flow matrix.

Figure 9: Context-based model adaptation for segmentation. (a) Input images; (b) Target image: segmentation map (top) and GMM representation (bottom); (c) Source image: segmentation map (top) and GMM representation (bottom); (d) Source image after context-based segmentation: adapted segmentation map (top) and adapted GMM representation (bottom).

Figure 10: Context-based model adaptation for dithered image representation. (a) Target image (left) and a dithered version (27 colors only) as a query image (right); a zoom-in window is shown at the bottom; (b) and (c) Target and source image models shown at the bottom, with the corresponding segmentation maps on top; (d) Final updated model using source-to-target adaptation.

Figure 11: Context-based model adaptation for dithered image representation. (a) Original images; (b) Initial image models; (c) Source model adaptation according to the target model (source-to-target adaptation); (d) Mutual adaptation of both source and target models.
