
Texture Modelling with Nested High–order

Markov–Gibbs Random Fields

Ralph Versteegen, Georgy Gimel’farb, Patricia Riddle

Department of Computer Science, The University of Auckland, Auckland 1142, New Zealand

Abstract

Currently, Markov–Gibbs random field (MGRF) image models which include high-order interactions are virtually always built by modelling the responses of a stack of local filters, typically by assuming the responses are independent random fields. The actual interaction structure is specified implicitly, via a bank of linear or non-linear filters of fixed size. Coefficients of the filters are either also hand selected, or optimised by gradient search. In contrast, we learn an explicit high-order MGRF structure by casting the learning process in terms of general exponential families, and investigate the relatively rapid addition of new features by only approximately learning parameters in order to lower running time. Realistic texture modelling is presented as a case study.

Two families of texture features suitable for describing high-order interactions are introduced which generalise local binary patterns (LBPs) — previously not applied as MGRF features — with learned offsets to the surrounding pixels. These are faster to compute and describe quite different properties than linear filters. Several schemes for selecting high-order features by composition or brute-force search are compared. Additionally, a modification of the maximum likelihood is presented as a texture-modelling-specific objective function which aims to improve generalisation ability.

We show the ability of the proposed method to learn texture models for a broad selection of complex textures, succeeding on much of the continuum from stochastic through irregularly structured to near-regular ones.

Email addresses: [email protected] (Ralph Versteegen), [email protected] (Georgy Gimel'farb), [email protected] (Patricia Riddle)

Preprint submitted to Computer Vision and Image Understanding October 16, 2014


Learning interaction structure is beneficial for near-regular and stochastic textures, while those with complex irregular structure provide the most difficulties. Texture models were also quantitatively evaluated on two tasks: grading of synthesised textures by a panel of observers, and comparison against several recent filter-based MGRFs by computer evaluation on a constrained inpainting task.

Keywords: texture; high-order MRF; structure learning; model nesting.

1. Introduction

Texture modelling is central or important to many computer vision and image processing tasks such as image segmentation, inpainting, texture classification or synthesis, anomaly (defect) detection, and image recognition. Although successful specialised algorithms for texture classification, synthesis and segmentation have been developed, generative probabilistic models which offer relatively complete models of statistics of individual textures are appealing. They may be applied not only to all of the above tasks, but also to others where appearance priors or feature extraction are needed, and they are of interest for understanding human vision. Generative models must capture most of the features of a texture that are significant to human perception in order to be successful, whereas texture features used for discrimination can differ.

The most prevalent tools for image and texture modelling are Markovian undirected graphical models, a.k.a. Markov random fields (MRFs). An MRF together with an explicit Gibbs probability distribution (GPD) is called herein a Markov–Gibbs random field (MGRF). MGRFs are particularly popular for image processing tasks involving determining boundaries (as in segmentation) and enforcing smoothness (e.g. in stereoscopic matching and image denoising). In these cases the Markov networks are usually sparse, with the (directly interacting) neighbours of each variable being close by. Some high-order MGRF models have been proposed for such tasks (e.g. [1, 2]), and for binary variables efficient maximum a posteriori (MAP) algorithms exist, such as graph cuts. However, things are different in the domain of image and texture modelling, where inference needs to be performed on real-valued or highly multivalued image variables in dense Markov networks. The networks used in this paper typically have Markov blankets containing 50–100 nearby and distant pixels, and even sampling from the models proves to be difficult.


MGRFs and other probabilistic texture models reduce images to a vector of statistics of image features, which are assumed sufficient to describe the texture. Historically, statistics of pairs of pixels [3–5] were used. However, higher-order MGRFs (which cannot be expressed in terms of lower order ones) have become more common as they are recognised to be necessary for more expressive models of natural images and textures (e.g. [2, 6–9]). The higher-order MGRFs in use nearly exclusively apply linear filters as feature functions, with statistics of the filter responses, such as means and variances [10] or histograms [11], forming a description vector. Higher order interactions in image models allow for abstracting beyond pixels, building upon larger scale image attributes like edges, and for context and complex structures to be captured. In addition, since regularly tiled textures have strong long range correlations between nearby tiles it is natural to learn an interaction structure specific to the texture. Yet it is still almost unheard of in computer vision and image modelling for higher-order MRF structure (i.e. relations between tuples of pixels) to be learned rather than hand selected.

However, selection of high-order features poses significant problems. The cardinality of a space of possible feature functions grows combinatorially in the order, due to both freedom in the specification of the support (input variables), and the need to reduce or manage its high dimensional input domain. In other words the feature should be parameterised with a reasonable number of parameters, hence the popularity of linear filters with fixed square supports. For texture classification many other methods of extracting useful information out of a high dimensional pixel co-occurrence matrix have been investigated (e.g. [12, 13]). Dimensions can be shed by making assumptions such as that images are invariant to contrast and offset changes. However, this work has seemingly not been ported to MRF texture models.

The recent surge of interest in hierarchical filter-based MRFs could be considered a merging of MRFs, especially Boltzmann machines, and the generalisation of models of single image patches using independent components analysis [14] and overcomplete product of experts [15] to larger images by tiling and overlapping the filters. The FRAME [11] model was the first example of texture modelling using filter-based MRFs. Today MRFs including latent variables are popular; they are often inspired by simple and complex cells in the visual cortex and hence have a strong link to artificial neural networks (ANNs). Using multiple layers allows application to high level computer vision tasks where the Markov property does not hold, such as


object recognition and face modelling. In many cases integrating out the latent variables leads to a completely connected, non-Markovian interaction graph. We diverge from this direction to consider inclusion of higher-order features directly, which is neglected in image modelling but more common outside of it, as often variables are qualitative and there is no analogue to (linear) filtering.

In order to tackle these problems, this paper describes a model nesting procedure which greedily adds features to a model and can build higher order features by composing lower order ones. Texture synthesis and inpainting are used as a case study. We investigate texture features which are more efficient to compute than filter responses while being able to describe quite different interactions. Unlike some other works (e.g. [11, 16]) we do not attempt to provide a fixed set of texture features which can distinguish between all textures, but rather learn texture-specific features. This potentially provides compact representations while allowing a large and varied space of descriptors.

Each step of model nesting corrects statistical differences between the training image and samples of the previous model. Attempting to learn the maximum likelihood parameters is very expensive but is not necessary; approximate parameters computed with persistent contrastive divergence [17]/controllable simulated annealing (CSA) [18] provide adequate improvements, while adjusting only a subset of the parameters at a time provides a better chance of identifying parameters from limited training data.

Generative image models have often been evaluated by measuring performance of image denoising and inpainting when the model is used as the prior. But this is a very indirect and incomplete method of measurement. Instead, in this paper several variants of the nesting algorithm were applied to modelling and synthesis of a large and varied set of greyscale textures, and results are presented here and also evaluated by a panel of observers.

Novel contributions of this paper are as follows: (i) A method of efficiently selecting both high-order features and parameters by nesting heterogeneous models while coping with the difficulties of inference with dense MGRF texture models is proposed. The proposed models use no hidden variables, as is currently popular, which eases learning, with MLE parameter learning remaining technically convex. (ii) High-order 'binary pattern' (BP) and 'binary equality' (BE) texture features extending the very popular LBP image descriptors are used as MGRF features. These are quite different from the mainstream high-order linear filtering or Potts potentials, and faster to


compute than responses of linear filters. Even traditional LBPs have apparently never been used in this way despite enormous popularity as image descriptors. Experiments into texture synthesis using MGRFs with LBP histograms as sufficient statistics can provide insight into the visual features actually captured by LBP histograms. (iii) Several schemes for selecting high-order features are compared. The resulting texture models have heterogeneous feature sets composed of pair-wise greylevel co-occurrence or greylevel difference features and binary pattern and binary equality features of up to 13th order. (iv) The ability of the proposed procedure to learn characteristic features across different types of textures is demonstrated with texture synthesis and inpainting results and evaluations by a group of judges. The use of learned long range interactions allows near-regular textures in particular to usually be synthesised well. (v) In order to improve generalisation and to attempt to allow training images that are not completely homogeneous, a variant on the maximum likelihood estimation (MLE) objective was used. We split the training image into pieces and maximise the minimum of the likelihoods of the pieces rather than their product.

2. Related Work

2.0.1. High-order MGRF models

Research in texture analysis has focused on describing textures using the distributions of responses of square linear filters. MGRF texture models utilising filters were introduced with FRAME [8, 11], where the filters were selected from a manually-specified bank. More recently learning the filters themselves (with predetermined fixed supports) has been popularised by the Field-of-Experts (FoE) model [9]. These model the marginal distribution of responses for each filter with hand-picked potentials with few parameters; in the original FoE model they are Student-t distributions. This means that the interaction structure is learned implicitly, as filter coefficients. FoE was extended to bimodal FoE (BiFoE), which uses more informative bimodal potentials, and successfully applied to texture modelling by Heess et al. [7]; several state of the art generative texture models have been built on BiFoE, some using various configurations of hidden variables. Kivinen and Williams [19] improved on BiFoE by using gated MRFs [20], and Luo et al. [21] investigated convolutional deep belief networks (DBNs). However, because of learning difficulties and to reduce required computation all these learned-filter models have been restricted to relatively small filter sizes. These do not directly


capture distant interactions; the largest size used in the mentioned works was 11×11. As a result, while these filter-based MGRFs model certain classes of textures excellently, they are not superior to all other MGRF texture models.

2.0.2. Texture features

Various high-order local texture descriptors have been applied to texture classification (e.g. [12, 13, 22]) going back at least 20 years. For such applications, invariances to contrast, offset, scale, rotation, and deformation are given particular weight. Many sought to directly reduce the dimensionality of high-order co-occurrence histograms through traditional techniques such as spectral clustering of histogram bins [23], vector quantisation [13], Gaussian mixture models [22], and self-organising maps [12]. However, such dimensionality reduction techniques usually require search to map new data points, which likely makes them unsuitably slow as feature functions in MGRFs.

However, perhaps the most popular texture descriptors for discrimination are local binary patterns (LBPs) [24], which compare the intensities of a ring of pixels $p_1,\ldots,p_k$ around a central one $p_0$, and form the bit vector $(p_i > p_0)_{1 \le i \le k}$. These are nearly contrast and offset invariant and can be straightforwardly extended to rotation invariance [25] by merging bins, and to partial scale invariance by using multiple rings. LBPs are also very cheap to compute, hence we generalise them in this paper to 'binary patterns' (BPs) by learning the $p_i$ positions (see Section 4.6). Nosaka et al. [26] introduced co-occurrence statistics of neighbouring LBPs in a way that is rotationally invariant. This is particularly interesting for future extension of the MGRFs in this paper to rotational invariance. LBPs have inspired a number of other variants such as local ternary patterns [27] and local radial index [28]. Liu et al. [29] recently introduced MGRFs, applied to discrimination rather than generation, with features which compute the ordered partition (weak ordering) of grey-levels of tuples of 3 or 4 pixels. These are considerably more general than LBPs, as the orderings $p_1 < p_2 = p_3$ and $p_1 < p_2 < p_3$ are considered different by the feature function. The feature supports (offsets between pixels) were also selected from the training data in [29]: 2nd order cliques using an approach quite different to the one used here, while 3rd order ones were built by combining 2nd order ones, which is also applied in this paper. However, these features do not scale well to higher orders, as the computation time grows quadratically, and the number of distinct orderings grows factorially, e.g. 7087261 distinct configurations of 8 pixels, versus 256 for LBPs.


2.0.3. Structure Learning

An intuitive method for selecting the features of the model is to use greedy heuristic sequential selection, which has been proposed and used by a number of different authors (e.g. [11, 30–33]) in slightly different forms. This alternates between adding one or more features/factors to the current model from a candidate set according to an estimate of the best feature to add, and then finding the new maximum likelihood estimate (MLE) of the parameters (initialised at those from the previous iteration). The most common metric for selecting the best feature to introduce is to select that with the largest 'error' between training image and model synthesis result [11, 31, 32]. This paper describes a very similar model nesting approach; however, parameters corresponding to previous features are kept fixed and an attempt is made at each iteration merely to improve on the base model rather than search for the MLE. Dudík et al. [34] described a similar coordinate-wise descent algorithm, suitable for optimising a wide class of generalised maximum-entropy problems such as MGRFs, which selects a best single parameter to update on each iteration. They state that the standard approach of updating all parameters each iteration "is impractical when the number of features is very large."

Della Pietra et al. [30] treated structure learning in MGRFs (for natural language processing (NLP) problems) as a feature selection problem, and built up higher order features out of lower order ones. Several other authors have since considered selecting MGRF features by gradually composing together low order atomic features ('general-to-specific') (e.g. [35, 36]), or by starting from template-like features (sometimes called 'patterns') which are conjunctions of simple $x_i = c_j$ predicates, derived directly from the training data, and gradually generalising them ('specific-to-general') (e.g. [37]).

A closely related popular method for MGRF structure learning is the use of sparsity-inducing L1 regularisation [32, 38], which forces some feature weights to exactly zero so that they can be removed. Otherwise proceeding like sequential learning, this approach has the advantages that the regularisation combats overfitting, that it allows removing a feature after it has been added, and that it is a convex problem. Recently Chen and Welling [39] suggested the use of spike-and-slab priors (similar to L0 regularisation) instead of L1 regularisation. This has some advantages, but does not also lead to an efficient greedy algorithm with an optimality guarantee.


2.0.4. Texture synthesis

Practical texture synthesis is currently dominated by algorithms which combine pieces of a source image so that the pieces fit together well. This is defined by the match between neighbourhoods of the pieces. Region-growing techniques add one pixel (e.g. [40, 41]) or patch (e.g. [42, 43]) to the image at a time. Algorithms are often multiscale [41, 44, 45]. However, all these synthesis algorithms have the property that they copy large parts of the source image directly into the result, either as an explicit part of the algorithm, or as an emergent property of the algorithms' search for best matching neighbourhoods, due to 'forced moves'.

Another rather successful but far less efficient class of texture synthesis algorithm is based on global optimisation of the image, such as by projection onto the set of images with certain statistics equal to those of the training image, in particular statistics of wavelet decompositions [10, 16]. A codebook of texture patches or primitives can also be used instead [46]. These are implicit MRF texture models as they provide no explicit probability distributions (hence our distinction between MGRFs and MRFs) and so are less broadly applicable than probabilistic models. Hence, although the synthesis results are good, these algorithms in fact attack a different problem than the more statistical one addressed in this paper.

3. Nested Markov–Gibbs Random Fields

This section first provides some definitions, and then defines general exponential distributions as the conceptual and theoretical basis for MGRF model nesting. A description of the nesting algorithm follows.

3.1. Basic notation and definitions

Any probability distribution which is nowhere-zero can be represented as a GPD factorised into Gibbs factors: functions of complete subgraphs (cliques) of the Markov network. In this paper we consider the class of MGRFs with factors (synonymously, potentials) that can be described as the product of a fixed feature function which identifies a certain signal configuration (pattern), and a corresponding weight/parameter. Some other models such as FoE fall into this category only through the introduction of hidden variables to decompose the potentials.


We restrict our scope to modelling of homogeneous textures by repeating the Gibbs factors and cliques across the image to achieve translation invariance. Let $\mathcal{R} \subset \mathbb{Z}^2$, where $\mathbb{Z}$ are the integers, be a finite lattice of image coordinates and $\boldsymbol\alpha = \{\alpha_i : \alpha_i \in \mathbb{Z}^2;\; 1 \le i \le d\}$ be a list of $d$ coordinate offsets with $\alpha_1 = (0, 0)$ fixed. An order $d$ clique family $\mathcal{C}_{\boldsymbol\alpha}$ in $\mathcal{R}$ is the set of all spatially repeated cliques with the offset pattern given by $\boldsymbol\alpha$:
$$\mathcal{C}_{\boldsymbol\alpha} := \{(r_1, \ldots, r_d) : r_1, \ldots, r_d \in \mathcal{R};\; r_i - r_1 = \alpha_i;\; i = 1, \ldots, d\}.$$

Let $\mathbf{g} : \mathcal{R} \to \{0, \ldots, Q-1\}$ be an image on $\mathcal{R}$ with $Q$ possible grey levels. An order $k$ feature is a function $f : \{0, \ldots, Q-1\}^k \to \mathbb{N}_s$ with finite range $\mathbb{N}_s := \{1, \ldots, s\}$ and an associated clique family $\mathcal{C}_f$ given by offset list $\boldsymbol\alpha_f$. The histogram of values of $f$ collected over an image $\mathbf{g}$ is the vector
$$\mathbf{h}_f(\mathbf{g}) = \Big[ \sum_{c \in \mathcal{C}_f} [\![\, l = f(\mathbf{g}_c) \,]\!] \;:\; l \in \mathbb{N}_s \Big] \qquad (1)$$
where $[\![\cdot]\!]$ is the Iverson bracket mapping true $\mapsto 1$, false $\mapsto 0$, and $\mathbf{g}_c$ is the sequence of pixels $\mathbf{g}(r_1), \ldots, \mathbf{g}(r_d)$, $r_i \in c$. The order of a model is defined as the maximum order of any of its features.
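As a concrete illustration of Eq. (1) (our own sketch, not the authors' code; the name `feature_histogram` and the plain looping implementation are assumptions made here), the following collects the histogram of an arbitrary feature function over a translation-invariant clique family given by a list of offsets:

```python
import numpy as np

def feature_histogram(g, offsets, feature, n_bins):
    """Histogram h_f(g) of feature values over the clique family C_f (Eq. (1)).

    g       : 2D integer image with grey levels 0..Q-1
    offsets : list of (dy, dx) offsets alpha_i, with offsets[0] == (0, 0)
    feature : maps a tuple of d grey levels to a bin index in 0..n_bins-1
    """
    h, w = g.shape
    dys = [dy for dy, dx in offsets]
    dxs = [dx for dy, dx in offsets]
    # Keep only pixels r_1 whose whole clique r_1 + alpha_i stays inside the lattice.
    y0, y1 = max(0, -min(dys)), h - max(0, max(dys))
    x0, x1 = max(0, -min(dxs)), w - max(0, max(dxs))
    hist = np.zeros(n_bins, dtype=np.int64)
    for y in range(y0, y1):
        for x in range(x0, x1):
            clique = tuple(g[y + dy, x + dx] for dy, dx in offsets)
            hist[feature(clique)] += 1
    return hist

# Example: second-order grey level difference (GLD) feature at offset (0, 5), Q = 8.
Q = 8
g = np.random.randint(0, Q, size=(64, 64))
gld = lambda c: (c[1] - c[0]) + (Q - 1)   # 2Q - 1 bins
print(feature_histogram(g, [(0, 0), (0, 5)], gld, 2 * Q - 1))
```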

3.2. General exponential distributions

In the sequential structure selection procedure, new features to be added to the current model are selected based on the disagreement between their statistics in the training image and the model's expected statistics, approximated from model samples. In this way the model is modified to correct these errors. When the statistics in question are expectations, these corrections take a simple form.

Let $F_{i+1}$ be a set of feature functions. Define the histogram vector $\mathbf{f}_{i+1}(\mathbf{g})$ to be the concatenation of histograms for the set $F_{i+1}$: $\mathbf{f}_{i+1}(\mathbf{g}) := [\mathbf{h}_f(\mathbf{g}) : f \in F_{i+1}]$.

A general exponential family model [47] is a distribution which can be written in the following form, specified by a base model $p_i(\mathbf{g})$, a vector-valued feature function $\mathbf{f}_{i+1}(\mathbf{g})$, and a parameter vector $\boldsymbol\theta_{i+1}$:
$$p_{i+1}(\mathbf{g}|\boldsymbol\theta_{i+1}) = \frac{1}{Z(\boldsymbol\theta_{i+1})}\, p_i(\mathbf{g}) \exp(-\mathbf{f}_{i+1}(\mathbf{g}) \cdot \boldsymbol\theta_{i+1}) \qquad (2)$$


where $Z(\boldsymbol\theta)$ is a normalisation constant. In our case, the parameters are a concatenation of per-feature parameter vectors $\boldsymbol\theta_{i+1} = [\boldsymbol\theta_f : f \in F_{i+1}]$.

If one is given constraints to be satisfied by a model $p_{i+1}$ in the form of expectations $E_{p_{i+1}}[\mathbf{f}_{i+1}(\mathbf{g})] = \mathbf{f}_{i+1}(\mathbf{g}^\circ)$ (where $\mathbf{g}^\circ$ is given training data), but already has prior information expressed as a base model (in this case the model at the previous iteration, $p_i$), then it is widely known [30] that the probability distribution $p_{i+1}$ which matches the new constraints but deviates from the base model the minimum possible amount (as measured by the Kullback-Leibler divergence; i.e. has maximum entropy relative to $p_i$) has a simple form. It is the above general exponential family model with the maximum likelihood estimate (MLE) of the parameters $\boldsymbol\theta^*_{i+1} := \arg\max_{\boldsymbol\theta} p_{i+1}(\mathbf{g}^\circ|\boldsymbol\theta)$. It can be seen from Eq. (9) that the MLE parameters achieve the desired statistical constraints. Technically, the MLE $\boldsymbol\theta^*_{i+1}$ may include positive or negative infinity as elements, which is not permitted by the usual definition of a GPD. This possibility is easily avoided by slightly smoothing $\mathbf{f}_{i+1}(\mathbf{g}^\circ)$.

Note that there is no need to modify whatever parameters the base model may have, although in works involving sequential MGRF structure learning this is normally done, regarding the new model as an exponential family distribution with concatenated feature vectors.

3.3. Model Nesting

The challenges of both high-order feature and structure selection can be met by the gradual building up of features from pieces, reducing the set of candidate features to a size that can be exhaustively searched at each iteration. One could reasonably expect that when some configuration of pixels in a texture (as recognised by a feature function) is characteristically common, the configuration restricted to a subset of pixels should also be common. Hence we conjecture that in practice high-order feature functions picking out characteristic interactions can be found by building up from lower order features. The nesting algorithm given here greedily adds features to the model which have histograms with maximum disagreement against the training image. There is a significant difference to the traditional sequential algorithm: only the parameters associated with $\mathbf{f}_{i+1}$ are optimised rather than all of them, in order to reduce overfitting.

Given a set of feature functions $F_i$ which have already been selected, define a selector function $C(F_i)$ which provides a candidate set of new features, possibly built upon previous features. The algorithm proceeds stagewise through a sequence of selectors $C_1, \ldots, C_k$, each providing features of


a certain order and type, to avoid the problem of comparing across feature types. Sequential selection simply selects one or more feature functions among $C_j(F_i)$ on which the current model $p_i$ has the largest disagreement (error) against the training data, $e(f) := d(\mathbf{h}_f(\mathbf{g}^\circ), \mathbf{h}_f(\mathbf{g}^{(i)}))$, where $d$ is a distance function on histograms. The next selector in the sequence is switched to after $\max_{f \in C_j(F_i)} d(\mathbf{h}_f(\mathbf{g}^\circ), \mathbf{h}_f(\mathbf{g}^{(i)}))$ becomes too small, and the algorithm terminates when the selectors are exhausted. One possible variant is to use separate training and validation datasets, and to stop when performance (measured either by $d$ or an application-specific external evaluation function such as texture discrimination accuracy) on the validation set begins to decrease. Such a validation test was used in the earlier work [48], in order to detect slightly non-homogeneous textures. This paper takes further the idea that statistical variations across the training data are important: Section 4.1 discusses the selection of features from a set of pieces of the training images rather than, less powerfully, only looking for variation between a single training and validation pair.

Previous implementations of sequential feature selection have used $\ell_1$ [11, 38] or $\ell_2$ distance or 'gain' for $d$. Gain is the increase in information contained in a model (equivalently, decrease in entropy) due to adding a feature to it, and can either be estimated analytically [8] or, at great expense, by actually comparing every possible extended model [30]. Zhu et al. [8] also gave a correction to the gain for theoretical uncertainty in the estimated statistics, which is a form of the Akaike information criterion (AIC), penalising model complexity. However, real uncertainty in estimates found through Markov chain Monte Carlo (MCMC) sampling from a MGRF is typically vastly larger than theoretical estimates of uncertainty because the sampling does not converge (see Section 4.3). This is especially true because the quality, and also statistics, of the approximate samples that we attempt to rapidly draw from the current texture model are highly variable.

We used the Jensen–Shannon divergence (JSD) [49] for the distance function $d$, as a proxy to the additional information content of $\mathbf{h}_f(\mathbf{g}^\circ)$. The JSD is defined as
$$D_{JS}(p \,\|\, q) := \frac{1}{2}\big(D_{KL}(p \,\|\, m) + D_{KL}(q \,\|\, m)\big)$$
where $D_{KL}$ is the Kullback-Leibler divergence (KLD), $p$ and $q$ are probability distributions, and $m$ is the averaged distribution $\frac{1}{2}(p+q)$. The JSD measures the discriminability of two distributions, namely the average certainty (as


measured in bits) with which a sample can be ascribed to one or the other. The JSD is symmetric, bounded in the range [0, 1] bits (when computed using base 2 logarithms) and is popular because it is suitable for comparing two empirically estimated distributions, unlike the KLD, which is often undefined. In practice, the JSD usually provides very similar rankings to the $\ell_1$ distance.
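For reference, a small self-contained implementation of the JSD as used for ranking candidate features (a sketch written for this description; the normalisation of raw count histograms and the epsilon guard are our own choices):

```python
import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in bits) between two histograms.

    p, q are non-negative count or probability vectors of equal length; they are
    normalised here so that raw feature histograms can be passed in directly.
    """
    p = np.asarray(p, dtype=float); p = p / (p.sum() + eps)
    q = np.asarray(q, dtype=float); q = q / (q.sum() + eps)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / (b[mask] + eps)))

    return 0.5 * (kl(p, m) + kl(q, m))
```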

Features $f$ provided by a selector are added to the model until either $e(f)$ falls below a manually selected threshold (we used 0.005 bits), or a maximum number of iterations for that selector has occurred (we used a limit of 8). The latter is more likely because we don't expend the time to learn parameters to closely match the training histograms. The algorithm is summarised in Algorithm 1. Figure 1 gives examples of its operation, starting from the initial model (from which the first sample is drawn), which included only marginal and nearest-neighbour pairwise potentials, and proceeding through two selectors. The first selector returns three pairwise potentials at a time. Frequently, there is no visible improvement from one step to another; however, even the samples getting worse does not indicate that the model is getting worse, due to the unstable sampling process. There is also usually a big jump as each new class of features is added, and decreasing returns thereafter. Figures 3 and 4 show the cliques selected in two of the examples in detail. Determining a suitable stopping point for texture synthesis is difficult, as the histogram distances don't reflect visual proximity to the training image very well, and a texture similarity metric such as improved structural texture similarity (STSIM2) [50] might be a better option.

4. Texture Modelling

The procedure described above is generic and could equally well be ap-plied to learning MGRF models for machine learning tasks outside of com-puter vision. Its refinement specific to texture-modelling, including some ofour implementation details is given below.

Broadly, there are only two ways to achieve desired invariances: by de-signing the models to enforce those invariances (e.g. by using feature func-tions with some invariance property), or by learning them from examples insuitable training data which demonstrates them. We use both approaches,although the later is only preliminarily developed here.


Algorithm 1 Outline of basic MGRF nesting procedure

Require: Initial base MGRF model $p_0$ with feature set $F_0$, training image $\mathbf{g}^\circ$
  $i \leftarrow 0$
  for each selector $C$ in $C_1, \ldots, C_k$ do
    loop
      Obtain set of samples $\{\mathbf{g}_l\}_{1 \le l \le n}$ from model $p_i$
      for each $f \in C(F_i)$ do
        Collect histograms $\mathbf{h}_f(\mathbf{g}^\circ)$ and $\mathbf{h}_f(\mathbf{g}_1), \ldots, \mathbf{h}_f(\mathbf{g}_n)$
        $\mathbf{h}_f(\mathbf{g}) \leftarrow \frac{1}{n} \sum_l \mathbf{h}_f(\mathbf{g}_l)$
      end for
      Pick one or more $f$ with maximal $d(\mathbf{h}_f(\mathbf{g}), \mathbf{h}_f(\mathbf{g}^\circ))$ for some histogram distance measure $d$
      $F_{i+1} \leftarrow F_i \cup \{f\}$
      Form $p_{i+1}$ by adding $f$ to $p_i$ as in Eq. (2)
      Learn parameters $\boldsymbol\theta_{i+1}$ for the new features
      $i \leftarrow i + 1$
    end loop
  end for
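A sketch of the selection step inside the loop of Algorithm 1, reusing the hypothetical `feature_histogram` and `jensen_shannon` helpers from the earlier sketches; the sampling and parameter-learning steps are assumed to be provided elsewhere:

```python
def select_next_features(candidates, g_train, samples, n_pick=1, min_jsd=0.005):
    """Greedily pick the candidate feature(s) whose model histogram (summed over
    the samples) disagrees most with the training histogram, measured by JSD.

    candidates : iterable of (offsets, feature_fn, n_bins) triples from a selector C(F_i)
    samples    : images drawn (approximately) from the current model p_i
    Returns up to n_pick candidates whose disagreement exceeds min_jsd (bits).
    """
    scored = []
    for offsets, feature_fn, n_bins in candidates:
        h_train = feature_histogram(g_train, offsets, feature_fn, n_bins)
        h_model = sum(feature_histogram(s, offsets, feature_fn, n_bins) for s in samples)
        scored.append((jensen_shannon(h_model, h_train), (offsets, feature_fn, n_bins)))
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [cand for score, cand in scored[:n_pick] if score > min_jsd]
```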



Figure 1: Examples of the nesting procedure. At top are pieces of the original textures (Brodatz D9, D101, and 'text', 'snow-field' from Simoncelli), and the progress of feature selection. Each image shows the sample drawn from the model at a nesting iteration. First row shows addition of pairwise GLD features, and second shows addition of 9-th order jagged star features (see Section 4.6). For the grass texture pairwise GLD features capture little information and the procedure switches immediately to higher order.



Figure 2: Further examples of the nesting procedure, for textures 4 and 5 in Figure 1. First row shows addition of pairwise features, and second shows addition of 9-th order 'conjoin' features (see Section 4.6).

4.1. Texture-specific minimum likelihood maximisation

Up to now, we have described model learning in terms of maximum likelihood estimation. This maximises the total probability of the training image, but the sufficient statistics of the image are the mean values of the feature functions over the image. This is completely oblivious to variations across the image, and for this reason some small parts of the training image may actually have low probability. This is particularly true of textures composed of large randomly distributed textons (coherent texture elements like pebbles or grass blades), which greatly reduces the effective amount of training data, and which are thus the most difficult to learn for all kinds of MGRF models. At one place in the image or over the whole image a particular angle or offset between textons may be more common than the average just by chance. Just like humans, sequential feature selection easily sees patterns in noise. We have already made the assumptions that the texture is homogeneous with the Markovian property and that the entire training image is a sample from the texture class, without foreign inclusions. We can thus expect that every piece of the training image of sufficient size (which we term the "scale" of


Figure 3: The shapes of the 2nd and 9th order 'jagged star' cliques selected for D101 (texture 2) in Figure 1, in selection order. A piece of the texture at the same scale is shown at right.


Figure 4: The shapes of the 2nd and 9th order 'conjoin' BP cliques selected for Metal.0004 (texture 5) in Figure 1, in selection order. A piece of the texture at the same scale is shown at the bottom.


the texture) is itself also a visually recognisable sample of the texture class, while those much smaller aren't, and thus the model should also recognise them. The scale will be approximately equal to one to two times the tesselation offset if the texture is regular, related to the size of any textons that the texture is composed of, and in general similar to the spatial extent of the Markov blanket. Indeed, this property holds for all of the training images used in our experiments. Hence it is unsatisfactory that only averages over the training image are considered, and we propose to switch away from the MLE as our objective.

Let $\{\mathbf{g}^\circ_i\}$ be a set of training images. In practice we simply split a single training image $\mathbf{g}^\circ$ up into pieces of size $80 \times 80$ with overlap of 22 pixels. This is related to the maximum offset length that we search for, 40 pixels. We change our optimisation problem to
$$\max_{p,\boldsymbol\theta} \; \min_{\mathbf{g}^\circ_i} \; \ell(\boldsymbol\theta|\mathbf{g}^\circ_i). \qquad (3)$$
It is perhaps desirable to use a smoothed, differentiable version of the objective using a 'soft-min' instead of min,
$$\max_{p,\boldsymbol\theta} \; \sum_i \ell(\boldsymbol\theta|\mathbf{g}^\circ_i)\, w(\mathbf{g}^\circ_i \mid p(\cdot|\boldsymbol\theta)), \qquad (4)$$
where $w$ weights the pieces exponentially according to their proximity to the minimum (which is easy to compute):
$$w(\mathbf{g}^\circ_i|p) := \frac{\exp(-\alpha \ell(\boldsymbol\theta|\mathbf{g}^\circ_i))}{\sum_j \exp(-\alpha \ell(\boldsymbol\theta|\mathbf{g}^\circ_j))} = \frac{\exp(-\alpha \boldsymbol\theta \cdot \mathbf{h}(\mathbf{g}^\circ_i))}{\sum_j \exp(-\alpha \boldsymbol\theta \cdot \mathbf{h}(\mathbf{g}^\circ_j))} \qquad (5)$$

where $\alpha$ is a softness parameter.

There are two changes implied by this objective. Firstly, when selecting features we replace the criterion
$$\arg\max_f \; d(\mathbf{h}_f(\mathbf{g}_{\mathrm{samp}}), \mathbf{h}_f(\mathbf{g}^\circ))$$
with one of
$$\arg\max_f \; \min_i \; d(\mathbf{h}_f(\mathbf{g}_{\mathrm{samp}}), \mathbf{h}_f(\mathbf{g}^\circ_i)), \quad \text{or} \qquad (6)$$
$$\arg\max_f \; \sum_i w(\mathbf{g}^\circ_i|p)\, d(\mathbf{h}_f(\mathbf{g}_{\mathrm{samp}}), \mathbf{h}_f(\mathbf{g}^\circ_i)) \qquad (7)$$


to find features that are characteristic for all parts of the image. Secondly, and less obviously, when optimising parameters, instead of optimising Eq. (3) we exclude the base model $p_{i-1}$ and choose to optimise only the contribution of the model extension which modulates the previous one,
$$p_{+i}(\mathbf{g}|\boldsymbol\theta_i) := \frac{1}{Z(\boldsymbol\theta_i)} \exp(\boldsymbol\theta \cdot \mathbf{h}_f(\mathbf{g})). \qquad (8)$$
Maximising this is normally equivalent to finding the MLE, but when optimising the piecewise minimum probability it means selecting the minimum piece quite differently; the minimum piece is $\mathbf{g}^\circ_{\min} := \arg\min_{\mathbf{g}^\circ_j} p_{+i}(\mathbf{g}^\circ_j|\boldsymbol\theta_i)$ rather than $\arg\min_{\mathbf{g}^\circ_j} p_i(\mathbf{g}^\circ_j|\boldsymbol\theta_i)$. Attempting to maximise $\min_{\mathbf{g}^\circ_j} p(\mathbf{g}^\circ_j|\boldsymbol\theta)$ would be attempting to reach an optimum outside of the hyperplane of actually allowed parameter adjustments. The gradient is then simply $\nabla\ell(\boldsymbol\theta|\mathbf{g}^\circ_{\min})$, or a weighted sum of the $\nabla\ell(\boldsymbol\theta|\mathbf{g}^\circ_i)$ gradients if using the soft-min.
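A short sketch of the soft-min weighting of Eq. (5) (our own code; it works directly on per-piece log-likelihoods so that the shared log Z term cancels, and the stabilising shift by the minimum is an implementation choice):

```python
import numpy as np

def softmin_weights(piece_logliks, alpha=1.0):
    """Soft-min weights w(g_i | p) over training-image pieces (cf. Eq. (5)).

    piece_logliks : per-piece log-likelihoods l(theta | g_i), up to a shared constant
    alpha         : softness parameter; as alpha grows, all weight concentrates on the
                    worst-fitting (minimum-likelihood) piece, recovering Eq. (3)
    """
    l = np.asarray(piece_logliks, dtype=float)
    z = -alpha * (l - l.min())      # shift by the minimum for numerical stability
    w = np.exp(z)
    return w / w.sum()

# The gradient of the soft-min objective is then the w-weighted sum of the
# per-piece gradients of Eq. (9).
```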

4.2. Second order methods

As $q(\mathbf{g}^\circ)$ (the base model factor in Eq. (2)) is fixed w.r.t. $\boldsymbol\theta$ and can be dropped, parameters of general exponential family distributions are no more difficult to learn than without a base model. The gradient vector $\nabla\ell(\boldsymbol\theta|\mathbf{g}^\circ)$ and the Hessian matrix $H(\ell)(\boldsymbol\theta|\mathbf{g}^\circ)$ of the log-likelihood are easily derived as:
$$\nabla\ell(\boldsymbol\theta|\mathbf{g}^\circ) = -\mathbf{h}(\mathbf{g}^\circ) + E_{\boldsymbol\theta}\{\mathbf{h}(\mathbf{g})\} \qquad (9)$$
$$H(\ell)(\boldsymbol\theta|\mathbf{g}^\circ) = -\mathrm{Cov}_{\boldsymbol\theta}\{\mathbf{h}(\mathbf{g})\} \qquad (10)$$
where $E_{\boldsymbol\theta}\{\mathbf{h}(\mathbf{g})\}$ and $\mathrm{Cov}_{\boldsymbol\theta}\{\mathbf{h}(\mathbf{g})\}$ denote respectively the expected value and covariance matrix of the image features $\mathbf{h}(\mathbf{g})$ w.r.t. the distribution $p(\mathbf{g}|\boldsymbol\theta)$. As the covariance matrix is always positive semi-definite, the Hessian is negative semi-definite and the log-likelihood is concave.

Computing the expectation in Eq. (9) exactly is in general intractable. Its approximation using a Markov chain Monte Carlo (MCMC) sampler such as the Gibbs sampler can be highly unreliable, because the energy landscape often has deep local minima which are inescapable in reasonable time-scales.

Given a set of training images $\mathbf{g}^\circ_1, \ldots, \mathbf{g}^\circ_k$, the MLE of the product $\prod_i p(\mathbf{g}^\circ_i|\boldsymbol\theta)$ is
$$\arg\max_{\boldsymbol\theta} \sum_{i \le k} \ell(\boldsymbol\theta|\mathbf{g}^\circ_i) = \arg\max_{\boldsymbol\theta}\Big(-\log Z(\boldsymbol\theta) + \boldsymbol\theta \cdot \frac{1}{k}\sum_i \mathbf{h}(\mathbf{g}^\circ_i)\Big), \qquad (11)$$


hence the MLE is the same as the MLE of a hypothetical (and likely nonexistent) 'mean image' $\mathbf{g}^\circ$ with statistics defined as $\mathbf{h}(\mathbf{g}^\circ) := \frac{1}{k}\sum_i \mathbf{h}(\mathbf{g}^\circ_i)$, although the actual probability-product cannot be simplified to $\mathbf{g}^\circ$.

Given samples drawn from the model $p(\mathbf{g}|\boldsymbol\theta_j)$ we can use the 2nd order Taylor expansion about $\boldsymbol\theta_j$, using estimates of the gradient and Hessian from Eqs. (9) and (10), as an approximation to the log-likelihood. Given additional samples from nearby models $p(\mathbf{g}|\boldsymbol\theta_{j-1}), p(\mathbf{g}|\boldsymbol\theta_{j-2}), \ldots$ we could also incorporate them to attempt to improve our estimates using importance sampling [51]. The Hessian may be reasonably and easily approximated as a diagonal matrix. A Newton step can be performed by inverting the Hessian, while a more numerically stable alternative is to find the maximum of the 2nd order likelihood approximation along the direction of the gradient. See [52] for details.
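The following sketch (ours; the diagonal approximation via per-feature sample variances and the damping constant are assumptions) shows a single approximate Newton step computed from sample estimates of Eqs. (9) and (10):

```python
import numpy as np

def approx_newton_step(h_train, h_samples, damping=1e-3):
    """One approximate Newton step for the new parameters (initialised at zero).

    h_train   : feature histogram of the training image, shape (n_features,)
    h_samples : histograms of images sampled from the base model, shape (n_samples, n_features)
    Uses the gradient of Eq. (9) and a diagonal approximation to the covariance
    in Eq. (10); `damping` guards against near-zero per-feature variances.
    """
    h_samples = np.asarray(h_samples, dtype=float)
    grad = -np.asarray(h_train, dtype=float) + h_samples.mean(axis=0)   # Eq. (9)
    var = h_samples.var(axis=0) + damping                               # diag of Eq. (10)
    return grad / var       # Newton step -H^{-1} grad, with H approximated by -diag(var)
```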

As both the gradient and Hessian are highly approximate and Newton's method takes dangerously large steps, we take a single 2nd order step as an initial approximation to the parameters, and then switch to a stochastic 1st order method which follows only the approximate gradient, as detailed in the next section.

4.3. Parameter learning

The most obvious method to learn the parameters of a MGRF is to perform stochastic gradient descent (SGD), following a noisy approximation to the gradient given by sampling from the model to approximate the expectation in the gradient (Eq. (9)). Sampling could be performed by starting from an initial image (e.g. of noise) and performing Gibbs sampler sweeps over the whole image until its neg-energy converges. However, this method is very slow. Alternatively, a fixed number of sweeps can be performed, which is still useful for synthesising images but optimises the wrong objective.

Rather than repeatedly sampling from the models to approximate the gradient, controllable simulated annealing (CSA) [18] was used to simultaneously tune the parameters and produce approximate samples an order of magnitude or two faster than possible using SGD. An essentially identical procedure was reinvented under the name persistent contrastive divergence (PCD) [17], but is used for learning parameters, uses small stepsizes, and is started from the training data rather than an arbitrary point. CSA alternates Gibbs sampling macrosteps (visiting each pixel in the image once) and changes to parameters according to
$$\Delta\boldsymbol\theta = \boldsymbol\lambda_t \circ (\mathbf{h}(\mathbf{g}_{\mathrm{samp}}) - \mathbf{h}(\mathbf{g}^\circ))$$


Figure 5: Learning parameters for a model of D101 with 25 pairwise GLC features: comparison of samples produced directly by CSA, ACSA, and SGD, and sampling from the models learnt by CSA and ACSA. Vertical axis shows the sum of the normalised $\ell_1$ distances of the 25 histograms (maximum value 50), $\sum_i \|\mathbf{h}_i(\mathbf{g}) - \mathbf{h}_i(\mathbf{g}^\circ)\|_1 / \|\mathbf{h}_i(\mathbf{g})\|$.


Figure 6: Example comparing different methods for learning parameters for a model of D77 with 25 pairwise potentials. Vertical axis as in Figure 5.


where $\circ$ denotes the element-wise Hadamard product, $\mathbf{g}_{\mathrm{samp}}$ is the current sample, $\boldsymbol\lambda_t$ is a vector of the current per-parameter step sizes and $t$ is the iteration number. A Robbins-Monro (RM) sequence $\boldsymbol\lambda_t = \frac{15}{15+t}\mathbf{1}$ was used, where $\mathbf{1}$ is the all-ones vector of the same dimension as $\boldsymbol\theta$. CSA can more rapidly produce an image with statistics which usually match the training statistics fairly well, but also has a strong tendency to oscillate. For this reason we introduce a variant of CSA called accelerated controllable simulated annealing (ACSA), which uses a gain-vector adaptive step size instead of RM steps, as described by Almeida et al. [53], with default parameters and initial step size vector $\boldsymbol\lambda_0 = \mathbf{1}$. This uses different learning rates for different parameters, dampening oscillations by reducing the learning rate for parameters which oscillate, while increasing the learning rate for those that seem to require it. Adaptive step sizes proved to give far faster convergence to the desired statistics than RM steps, and are more robust to the initial step size.
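A minimal sketch of one CSA parameter update (ours; the Gibbs macrostep that produces the persistent sample and the ACSA adaptive step sizes of [53] are omitted):

```python
import numpy as np

def csa_update(theta, h_sample, h_train, t):
    """One CSA parameter update, applied after each Gibbs-sampling macrostep.

    theta    : current parameter vector
    h_sample : feature histograms of the current persistent sample g_samp
    h_train  : feature histograms of the training image
    t        : iteration number; the Robbins-Monro step size is 15 / (15 + t)
    """
    lam = 15.0 / (15.0 + t)
    return theta + lam * (np.asarray(h_sample, float) - np.asarray(h_train, float))
```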

However, CSA is intended for producing a sample with prescribed statistics, not parameter tuning, and the final model produced by CSA can be very poor when sampled from using normal Gibbs sampling. Figure 5 shows an example of the normal behaviour of CSA, ACSA, and SGD, comparing their convergence in statistics over time. All start from the same initial model of D101. CSA and ACSA used an initial step size $\lambda_0 = 1$, while SGD used $\lambda_0 = 10$ and performed 40 Gibbs macrosteps before updating parameters. CSA and ACSA quickly move towards the desired histograms, ACSA almost always considerably outperforming CSA. The models produced by CSA/ACSA were periodically sampled from and evaluated. It was observed that usually parameters learned by CSA and ACSA would improve for a while, much faster than by using SGD, and then diverge again; ACSA often diverges very rapidly. To take advantage of this a scheme was used which runs CSA three times for 50 iterations each, then restarts from a new initial image and resets $t = 0$, a.k.a. PCD with restarting. Figure 6 shows an example of the behaviour of this restarting-CSA scheme, normally being less likely and slower to diverge than CSA or restarting-ACSA.

Parameter learning was done by finding a first approximation to the parameters, as described in Section 4.2, then performing 160 macrosteps of CSA, restarting every 40 sweeps. Samples from the actual model were then obtained by starting from a CSA sample and running a Gibbs sampler for 100 sweeps or until convergence in neg-energy.


4.4. Sampling

An important consideration when drawing samples from a MGRF texture model using an MCMC sampler is the choice of initial image. While the Markov chain converges asymptotically to the model distribution, this can take a number of steps exponential in the number of pixels due to ubiquitous deep local minima. However, it is well known that MCMC converges slowly when highly correlated variables are updated independently, and in image models all nearby pixels are typically highly correlated to one another, making Gibbs sampling inefficient and prone to being trapped in local minima. The most common example of these is the 'crystallisation' of regular patterns growing from two or more areas and meeting without aligning. Gibbs sampling of regular textures starting from noise often leads to crystallisation if the image size is several multiples of the maximum clique size.

There are a number of ways to avoid crystallisation. Adding longer range interactions in order to more strongly enforce global structure is one, though not very desirable. Otherwise, the process can be pushed towards the correct regularity by starting from an initial image with a template, in the form of dots, stripes, or any other unevenness, of the desired tesselation. Such tesselation of near-regular textures can automatically be extracted by computing co-occurrence statistics for different offsets, in the form of a model-based interaction map (MBIM) [52]. Other alternatives are to slowly grow the image by extending the boundaries, starting from a single small 'seed' which is first allowed to converge [48], or starting from a random piece of the training image as a seed. The latter was used for these experiments due to simplicity. If a texture model is to be used for other tasks such as segmentation or classification, better performance could be expected if appropriate learning, sampling and validation methods are used. For example, for texture synthesis we care only about the local minimum that is reached on sampling from the starting sample (a white noise image in this work), and disregard the energy landscape outside of this basin, while for texture classification we want to ensure correct behaviour for distant images too.
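For illustration, a Gibbs-sampling macrostep for the simplest case of a model with only pairwise GLD potentials (our own sketch; the actual models also contain the high-order BP/BE potentials, and the implementation details here are assumptions):

```python
import numpy as np

def gibbs_sweep_gld(g, offsets, thetas, Q, rng):
    """One Gibbs macrostep (every pixel visited once) for a pairwise GLD MGRF,
    p(g) proportional to exp(-sum of theta[(g[r2] - g[r1]) + Q - 1] over all cliques).

    offsets : list of (dy, dx) pairwise offsets, one per clique family
    thetas  : matching list of parameter vectors, each of length 2Q - 1
    rng     : numpy.random.Generator, e.g. np.random.default_rng(0)
    """
    h, w = g.shape
    levels = np.arange(Q)
    for y in range(h):
        for x in range(w):
            energy = np.zeros(Q)
            for (dy, dx), theta in zip(offsets, thetas):
                theta = np.asarray(theta)
                ny, nx = y + dy, x + dx          # pixel acts as r1 of the clique
                if 0 <= ny < h and 0 <= nx < w:
                    energy += theta[(g[ny, nx] - levels) + Q - 1]
                ny, nx = y - dy, x - dx          # pixel acts as r2 of the clique
                if 0 <= ny < h and 0 <= nx < w:
                    energy += theta[(levels - g[ny, nx]) + Q - 1]
            p = np.exp(-(energy - energy.min()))  # conditional distribution over grey levels
            g[y, x] = rng.choice(Q, p=p / p.sum())
    return g
```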

For unguided texture synthesis after learning a model we used ACSA, and for inpainting we used plain Gibbs sampling. In practice CSA and ACSA can only be used for unguided synthesis, as they are harmful for inpainting, where generalisation (e.g. differing lighting) may be required. Finding optimal parameters which produce good unguided synthesis results requires far more fine tuning to eliminate unwanted energy minima which are approached very slowly during MCMC. In theory, if optimal parameters are learned then


MCMC sampling should converge to samples with the same statistics as those achieved by CSA. Hence examining CSA samples provides an evaluation of the nesting procedure's ability to select features without too much concern for the additional issue of learning parameters, and of its expected performance given unlimited time for optimising parameters. Note however that more efficient algorithms have been developed for creating images matching desired statistics by making modifications attempting to directly move towards them [10, 16, 54]. These don't necessarily have any relation to sampling from a MGRF distribution.

Final synthesis image size was 180 × 180 plus trimmed margins of up to 65 pixels depending on feature sizes, 300 iterations were used, and all parameters in the model were free.

4.5. Binary patterns

In order to reduce the space of candidate potentials, we define families of potentials that are parameterised only by the shape of their supports, with unparameterised feature functions. The simplest such order $k$ feature on an image of $Q$ pixel grey-levels is the trivial grey level co-occurrence (GLC) feature $f_{\mathrm{GLC}}(x_1, \ldots, x_k) := x_1 Q^0 + \ldots + x_k Q^{k-1}$ with $Q^k$ bins. GLC features proved to have poor generalisation ability; while they performed well for texture synthesis, when used for inpainting tasks they remembered the original contrast and offset of the training image rather than matching the contrast of the surrounding inpainting frame. Previous researchers have found that histograms of the pairwise grey level difference (GLD) — defined for $k = 2$ as $f_{\mathrm{GLD}}(x_1, x_2) := x_2 - x_1$, with $2Q - 1$ bins — encode the large majority of the information in a pairwise GLC histogram [13, 55], confirmed by our own comparisons of GLC and GLD features for texture synthesis (see Section 5.1). Hence we used GLD instead of GLC features to capture pairwise interactions.

The high-order features investigated were binary patterns (BPs), which are a generalisation of LBPs with learned offsets from the central pixel, and the complementary binary equalities (BEs). The binary pattern feature function thresholds the grey levels of the pixels in each clique against a certain distinguished pixel (no longer necessarily in the centre): $f_{\mathrm{BP}}(x_0, \ldots, x_d) := \sum_{1 \le i \le d} 2^{i-1} [\![\, x_0 < x_i \,]\!]$. The binary equality feature functions are defined as $f_{\mathrm{BE}}(x_0, \ldots, x_d) := \sum_{1 \le i \le d} 2^{i-1} [\![\, |x_0 - x_i| \le c \,]\!]$. Whether using $Q = 8$ or $Q = 16$ grey levels we used the equality threshold $c = 1$. No interpolation between pixels was performed, nor were histogram bins combined as in


uniform LBPs [25]. We write $\mathrm{GLC}^k$, $\mathrm{GLD}^k$, $\mathrm{BP}^k$ and $\mathrm{BE}^k$ to indicate order $k$ grey level co-occurrence, grey level difference, binary pattern and binary equality features, respectively.
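The BP and BE feature functions themselves are trivial to implement; a sketch compatible with the earlier hypothetical `feature_histogram` helper (the clique tuple is assumed to start with the distinguished pixel $x_0$, followed by the surrounding pixels):

```python
def binary_pattern(clique):
    """BP feature: bit i-1 is set when the distinguished pixel x0 is darker than x_i.
    For k surrounding pixels the bin index lies in 0 .. 2**k - 1."""
    x0 = clique[0]
    return sum(1 << (i - 1) for i, xi in enumerate(clique[1:], start=1) if x0 < xi)

def binary_equality(clique, c=1):
    """BE feature: bit i-1 is set when |x0 - x_i| <= c (c = 1 in the experiments)."""
    x0 = clique[0]
    return sum(1 << (i - 1) for i, xi in enumerate(clique[1:], start=1) if abs(x0 - xi) <= c)
```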

4.6. Building up features

As the base model, we used a MGRF with a first order factor to model marginal statistics and the two nearest neighbour GLD features with offsets ((0, 0), (1, 0)) and ((0, 0), (0, 1)). As an exception, marginal potentials were not used for the inpainting experiment in Section 5.2, which improved the models' ability to match the contrast of the surrounding frame. However, for unguided synthesis matching the original histogram is desirable. Figure 7 shows a few examples of models learnt without a 1st order marginal potential, and having only GLD and BE potentials. Because these textures have 180° rotational symmetry the GLD features can't distinguish the texture from its inverse image. Adding a marginal potential usually breaks this symmetry; however, it seems that this did not work for D104 (4th last row of Figure 11), which has a nearly symmetric marginal histogram.

All features were constrained to clique families with at most 40 pixels distance between two points. The initial candidate set (selector) $C_1$ was always comprised of all $\mathrm{GLD}^2$ features within this circular window. Three $\mathrm{GLD}^2$ features are added at a time to speed learning. Thereafter different possibilities for $C_2$ were considered, as listed below. Whenever we create a set of BP features we also add all binary equality features to the candidate set. In practice if the texture contains flat regions then a number of BE features are selected, otherwise all or nearly all the selected features will be BPs.

• $\mathrm{GLC}^3$/$\mathrm{GLD}^3$ features with cliques created by combining two of the selected $\mathrm{GLC}^2$/$\mathrm{GLD}^2$ cliques end-to-end, in any of the four possibilities.

• 'Combined' $\mathrm{BP}^5$ features built out of the set of characteristic offsets $\{r_i\}$ occurring in the previously selected $\mathrm{GLD}^2$ and $\mathrm{BP}^5$ features. Each possible selection of two offsets $r_1, r_2$, and for each $i = 1, 2$ each choice of either using mirrored offsets $r_i, -r_i$ or halved offsets $r_i/2, -r_i/2$, provided the set of four-offset candidates. $\mathrm{BP}^5$ features were also added two at a time.

• 'Conjoined' $\mathrm{BP}^9$ features built by selecting (by exhaustive search) four $\mathrm{BP}^3$ features with symmetric pairs of offsets $(r_i, -r_i)$ of maximal JSD between training and sample histograms and then combining them into a single clique family with shape $(r_1, -r_1, \ldots, r_4, -r_4)$. Examples can be seen in Figure 4.

Figure 7: Samples produced by models that could not distinguish a texture from its inverse.

• 'Star' $\mathrm{BP}^9$ and $\mathrm{BE}^9$ features with 8 surrounding pixels evenly spaced on a circle centred on the central pixel, and rounded to the nearest $\mathbb{Z}^2$ lattice point. Considering all rotations $\phi$ and radii from 1 to 20 pixels provided a fixed set of 144 candidates.

• ‘Jagged star’ (jag-star) BPk and BEk features, generalisations of star BPk features with k − 1 surrounding pixels alternately and evenly spaced on two circles of radii d0, d1 centred on the central pixel, with offsets at

      (x_i, y_i) = d_{i mod 2} (cos(2πi/(k−1) + φ), sin(2πi/(k−1) + φ)),

  where d_{i mod 2} is d1 if i is odd and d0 if i is even. These more general configurations include square as well as circular patterns. Considering a dense subset of rotations and radii from 1 to 20 pixels provided a fixed set of 2638 candidates. Examples can be seen in Figure 3. (A small sketch generating these offsets follows this list.)
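To make the jag-star construction concrete, here is a minimal sketch (the helper name is hypothetical) that generates the rounded offsets for a given order k, radii d0, d1 and rotation φ, following the formula above.

    import numpy as np

    def jag_star_offsets(k, d0, d1, phi):
        # Offsets of a jag-star clique of order k: the k - 1 surrounding pixels
        # alternate between circles of radius d0 (even i) and d1 (odd i),
        # rotated by phi and rounded to the nearest lattice point.
        offsets = []
        for i in range(k - 1):
            d = d1 if i % 2 else d0
            angle = 2 * np.pi * i / (k - 1) + phi
            offsets.append((int(np.rint(d * np.cos(angle))),
                            int(np.rint(d * np.sin(angle)))))
        return offsets

    # One of the 9th-order candidate configurations (radii and rotation
    # chosen here only for illustration):
    print(jag_star_offsets(k=9, d0=3, d1=6, phi=0.0))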

5. Evaluation

Experiments in this paper were conducted with a set of grey-scale digitised photographs of natural and approximately spatially homogeneous textures sourced from several popular databases (downloadable from [56–59]). Textures were selected which were diverse, difficult to model, homogeneous, and without periodicity or other features on a scale longer than 40 pixels (all images were kept at their original scales, with the exception of the inpainting experiment in Section 5.2). The databases used were the popular Brodatz photo album [60]; the “NewTex” dataset of natural textures included with MeasTex, a framework for standardised texture discriminator evaluation; and the MIT VisTex database. We also compared results against Portilla and Simoncelli [16] on some additional original and synthesised textures provided by them.

Textures were subjectively categorised into three classes according to structure: stochastic (apparently described by only local interactions, e.g. sand, water), near-regular (tiled but with random defects, e.g. weaves), and irregular (containing large-scale elements with unpredictable placement or shapes, e.g. marbles).

Each image was quantised to Q = 8 grey levels. Except for those textures on which we compared to other results, the images were quantised using contrast-limited adaptive histogram equalization [61] with 16 × 16 tiles and a contrast clipping limit of 0.03. Adaptive histogram equalization mostly avoids the situation where most of the image is mapped to only one or two grey levels, and also lessens shadows and gradients, which hinder the recovery of long-range interactions. The low Q value used speeds up Gibbs sampling.
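A rough sketch of this preprocessing using scikit-image's equalize_adapthist (the tile-size interpretation and the final quantisation step are our assumptions about details not fully specified above):

    import numpy as np
    from skimage import io, exposure

    Q = 8  # number of grey levels used for the synthesis experiments

    def preprocess(path, clip_limit=0.03, tile=16):
        # CLAHE followed by quantisation to Q grey levels.  The tile size is
        # our reading of the "16 x 16 tiles" above; adjust if needed.
        img = io.imread(path, as_gray=True)              # floats in [0, 1]
        eq = exposure.equalize_adapthist(img, kernel_size=tile,
                                         clip_limit=clip_limit)
        return np.minimum((eq * Q).astype(np.uint8), Q - 1)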

Source code and results for all experiments are freely available from the website accompanying the paper¹.

5.1. Synthesis

Of the total collection of textures, 42 of the most diverse and interesting examples were selected and modelled using each of the several alternative feature sets described in Section 4.6. Training images were 256 × 256 in size and a 180 × 180 image was synthesised using 300 iterations of ACSA. Different pairwise features were selected by each model, as they were learned from scratch rather than from a shared pairwise base model; this likely increased the variability in the results. Figures 8–11 show our results. As a baseline, we compare with some of the results of Portilla and Simoncelli [16], downloadable from [59]; their method, in addition to being easily available, is still today one of the most successful attempts to model texture statistically, and has recently been extended to colour textures. Recall that, similarly to our method, [16] attempts to produce images matching certain statistics rather than directly sampling from an MGRF. On the other hand, all of the recent filter-based MRF texture models [7, 19–21, 62] have only been demonstrated on a few simple, highly regular textures with short repetition lengths and texton scales. Unfortunately the lack of difficult synthesis examples or source code for these previously published approaches hinders comparison, though [21] achieved visually better results on the difficult D76 and D103/D104 textures by using three layers of hundreds of filters.

¹ To be uploaded to http://www.ivs.auckland.ac.nz/texture modelling.



Figure 8: Synthesis results. Columns are: 1: original texture. 2: 2nd order GLDs (common base model shared by all others). 3: 5th order ‘conjoin’ BP/BE features. 4: 9th order ‘simple’ BP/BE features. 5: 9th order jagged star BP/BE selection. 6: 13th order jagged star BP/BE selection. 7: Result from [16], if available.



Figure 9: Synthesis results continued. See Figure 8.



Figure 10: Synthesis results continued. See Figure 8.



Figure 11: Synthesis results continued. See Figure 8.


Figure 12: Variants on jag-star BP9/BE9 models. Rows are: First: training image. Second: jag-star model; identical to column 5 of Figures 8–9. Third: jag-star models built on GLC2 rather than GLD2 potentials. Fourth: jag-star models without any pairwise potentials at all.


From the results, it can be seen that the MGRFs perform best on near-regular textures, as tessellation is easily handled, and on stochastic textures, for which the apparent dependency of the actual texture is on a short scale and thus a natural fit for an MGRF. On several such textures the method of Portilla and Simoncelli is outperformed, especially where it violated fixed tessellation offsets, although more frequently it gave a better result, apparently most significantly because it captures high-frequency features (such as thin branches and speckles) that ours do not. In addition, sampling from an MGRF adds unwanted high-frequency noise. Complex irregular textures, such as the pebbles at the top of Figure 11, are the most difficult for all methods, and in these cases hidden variables to distinguish different local contexts, as in gated MRFs, are probably necessary to visually replicate the texture. In general, the models with only pairwise potentials perform worst, the other higher-order models are considerably better, and the 13th-order jag-star ones are best, followed by the 9th-order jag-star features. However, it is not easy to separate the different classes of models. Results for star BPs (not shown) were close to those for jag-star, but generally inferior. Possible reasons for the superiority of the star and jag-star potentials over the conjoin BP9s are that spacing out the offsets provides more information about the neighbourhood, and that the candidate cliques are compared directly rather than being determined by a search through a different class. Finally, a number of the samples show artifacts at the edges of the image, probably due to the distant boundary of the MGRF. Although a considerable distance from the edges of the random field was trimmed after sampling, the artifacts that occur there can have long-range effects. When the texture regularity allows it, toroidal image boundaries would be highly beneficial.

We also compared the effect of replacing the pairwise GLD features with GLC features, or removing them entirely. Some examples are shown in Figure 12. Generally the effect of excluding pairwise potentials is small, or even undetectable, as the higher-order features can seemingly capture the same statistics, which is against expectations. The difference between GLD and GLC features is even smaller, though over the whole texture set the use of GLC features as a base seems to be slightly more powerful.

Figures 13 and 14 show some examples of inpainting a hole in a given surrounding frame, using the same models as in Figures 8–11. Inpainting is in some ways an easier problem than synthesis. An advantage of MGRF texture models is that inpainting non-rectangular holes is trivial, whereas methods such as patch-based synthesis are not applicable. Crucially, ACSA was not used; instead the models were sampled from with Gibbs sampling, meaning that how well tuned the parameters are matters greatly. This resulted in poor results for a number of models; unguided Gibbs sampling from these models typically converges to something not resembling the texture. On a number of textures, such as the first three, the order-13 jag-star models underperform the order-9 ones, possibly as a result of having to estimate 4096 parameters per potential rather than only 256.
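The procedure amounts to a conditional Gibbs sweep restricted to the hole. A minimal sketch, assuming a hypothetical model object that returns an unnormalised conditional energy for each of the Q grey levels of a pixel (the surrounding frame is left untouched):

    import numpy as np

    def gibbs_inpaint(model, image, hole_mask, Q=16, sweeps=100, rng=None):
        # Fill only the masked pixels by Gibbs sampling, keeping the frame fixed.
        # `model.conditional_energies(image, x, y)` is a hypothetical callback
        # returning an array of Q unnormalised energies, one per grey level.
        rng = rng or np.random.default_rng()
        ys, xs = np.nonzero(hole_mask)
        image = image.copy()
        for _ in range(sweeps):
            for x, y in zip(xs, ys):
                energies = model.conditional_energies(image, x, y)
                p = np.exp(-(energies - energies.min()))
                image[y, x] = rng.choice(Q, p=p / p.sum())
        return image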

Total running times (with a single-threaded implementation) for both learning and synthesising, across the different selector combinations under comparison, varied from 30 to 120 seconds for pairwise models to typically 10 minutes for 9th-order and 15 minutes for 13th-order models. Collecting histograms in order to compare and select features, and synthesis using ACSA, both account for large fractions of the running time.

5.1.1. Human evaluations

Preliminary synthesis results for an earlier version of the algorithm than the one described in this paper were evaluated by a panel of observers. For each of 25 textures (a subset of the 42 above) the samples from each model were scored by 17 people on a scale from 0 for no resemblance to 9 for indistinguishable. The observers were presented with the training image and up to 5 synthesised images for that texture at a time, in a random order, and were instructed to make rapid decisions rather than perform careful inspection.

The main difference between the current algorithm and the one evaluated by the judges was the use of a different stopping rule, which frequently and erroneously rejected all or nearly all high-order features, causing bad results on average but very similar ones to those presented here in the best case. Additionally, GLC instead of GLD features were used for pairwise potentials, and BEs were not used, though these are not prevalent in the later experiments anyway. Table 1 presents the results, summarised by the average rankings of the synthesised images for each feature set combination by texture category. The BP5 features outperforming the 9th-order ones on stochastic textures may be an artifact of the earlier stopping rule, which usually excluded the higher-order ones altogether on these textures.



Figure 13: Examples of inpainting a 100 × 100 hole in a 200 × 200 image. The columns are: 1: original texture. 2: inpainting frame. 3: using a model with 2nd order GLD features. 4: using 5th order ‘combined’ BP/BE features. 5: using 9th order ‘conjoin’ BP/BE features. 6: using 9th order jagged star BP/BE selection. 7: using 13th order jagged star BP/BE selection.



Figure 14: Further examples of inpainting. See Figure 13.

Figure 15: Inpainting examples for a 9th-order jag-star model of D6, D21, D53 and D77. The left-hand column of each pair shows the worst inpainting result over 20 repetitions, the right-hand column the best result.


Table 1: Average human evaluations of texture synthesis results by texture category (mean ± standard deviation).

                 Stochastic   Near-regular   Irregular   Total
GLD2             3.7±1.8      3.6±1.9        3.1±1.5     3.5±1.8
GLD3             4.2±1.9      4.0±1.9        3.6±1.6     4.0±1.9
BP5              4.7±1.9      4.5±2.0        4.0±1.9     4.5±2.0
Conjoin BP9      3.8±1.9      4.8±2.1        3.5±1.7     4.3±2.0
Star BP9         4.0±1.9      5.6±2.0        4.4±1.8     4.8±1.9

Table 2: Results of inpainting evaluation; MSSIM scores against ground truth (mean ± standard deviation).

                       D6            D21           D53           D77
Efros & Leung [40]     0.85 ± 0.03   0.86 ± 0.03   0.86 ± 0.06   0.60 ± 0.08
TmPoT [19]             0.86 ± 0.02   0.87 ± 0.01   0.86 ± 0.02   0.77 ± 0.03
TssRBM [21]            0.89 ± 0.02   0.91 ± 0.01   0.92 ± 0.02   0.76 ± 0.03
2-layer DBN [21]       0.89 ± 0.03   0.91 ± 0.02   0.92 ± 0.03   0.77 ± 0.02
cGRBMs [62]            0.91 ± 0.02   0.93 ± 0.01   0.93 ± 0.01   0.78 ± 0.03

GLD2                   0.30 ± 0.06   0.77 ± 0.03   0.78 ± 0.02   0.66 ± 0.04
Conjoin BP9/BE9        0.58 ± 0.03   0.75 ± 0.04   0.82 ± 0.09   0.70 ± 0.04
Jag-star BP9/BE9       0.81 ± 0.05   0.81 ± 0.07   0.88 ± 0.02   0.73 ± 0.03

5.2. Inpainting

Additionally, we compared against results from three recent leading hierarchical MGRF texture models [19, 21, 62], whose authors quantitatively evaluated probabilistic texture models through the specific texture inpainting task described below. By focusing on four highly regular textures with small feature sizes, it is reasonable to compare the synthesised pixels in the inpainted region to the original ones, since they should be nearly aligned. This is not applicable to other texture types, so it is a very narrow test.

In the task a 76 × 76 pixel piece of Brodatz textures D6, D21, D53, or D77 is provided as an inpainting frame with a 54 × 54 pixel hole cut out of the centre. An algorithm fills the hole in a way consistent with the frame, and the difference against the ground truth is measured using the mean structural similarity index (MSSIM) [63]. MSSIM has been found experimentally to be a good measure of perceptual difference between images. D6 and D53 were scaled down so that small filters could be used successfully by the earlier papers. The top half of each image was used for training, and the bottom half for testing. Since the learning procedure is stochastic, the inpainting is repeated with random inpainting frames from the test region and the average MSSIM score reported.
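For reference, the MSSIM score for a single trial can be computed with scikit-image's structural_similarity (a sketch; the exact SSIM parameters used in the cited works may differ):

    import numpy as np
    from skimage.metrics import structural_similarity

    def mssim(ground_truth, inpainted):
        # Mean SSIM between the ground-truth region and the inpainted result.
        gt = ground_truth.astype(np.float64)
        out = inpainted.astype(np.float64)
        return structural_similarity(gt, out, data_range=gt.max() - gt.min())

    # Repeated over randomly chosen test frames, the mean and standard
    # deviation of these scores give one row of Table 2.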

Our results are presented in Table 2, alongside the best previously published results and results from [19] for the Efros and Leung texture synthesis algorithm [40], using 19 × 19 patch sizes. For this experiment we used Q = 16 grey levels. The new models failed to match the previous ones for several reasons, although Figure 15 shows that the visual quality of the results is not actually very poor, except for D6. Most significantly, the model parameters were not carefully tuned for inpainting, as we continued to rely on CSA for learning. The cause of the poor performance on D6 is not known; surprisingly, all models we attempted produced bad results for it. The comparison of ground truth with 256 grey levels against models using only 16 levels gives a small penalty not shared by the previous works, as does the noise produced by Gibbs sampling; smoothing the results gave an increase of about 0.02 in the scores for D77.

6. Conclusion

While Markov random fields have long been applied to image modelling, there has been little prior investigation in this context into the practical learning of interaction structure, with apparently only 2nd- to 4th-order MGRFs previously learned. The rapid model nesting approach introduced in this paper has the same theoretical foundation on maximum entropy distributions as previous sequential structure learning approaches, but emphasises approximation and speed.

Furthermore, for nearly two decades MGRF texture and image models based on square linear filters have been almost exclusively studied, due to their superior performance over the primitive pairwise models that preceded them. However, we have shown that other types of high-order features, such as the generic binary pattern features introduced in this paper, are viable alternatives to linear filters. These BP and BE features have several advantages over filters: they are offset and nearly contrast invariant, and are still powerful even when of relatively low order. The presented synthesis and inpainting experiments showed that 9th-order BP features are sufficient to capture many important visual aspects of textures, whereas 9th-order 3 × 3 filters can capture comparatively very little of interest. These relatively low orders reduce computation costs, while no filter coefficients need to be selected or learned.

The purpose of our synthesis experiments is not to challenge the mainstream, highly effective and efficient texture synthesis algorithms, but to validate our probabilistic models. There are, however, still many textures which our models have failed to capture, especially complex irregular ones, as well as fine details being lost in general. In order to tackle these in future we plan to extend our models by including additional layers of feature selectors for diverse types of statistics, particularly co-occurrences between the outputs of feature functions at previous levels, as has been used in several texture models which used filters as features. A better stopping rule for the nesting procedure, robust to the highly variable model samples, would be crucial if more feature sets are used, in order to keep the number of nesting iterations low. Use of a texture similarity metric such as STSIM2 [50] may be appropriate for determining actual progress. In future work we also hope to improve the generalisation ability of the models, for example with invariance to small deformations of the image. This is closely related to the selection of complex composite features which can describe more powerful abstractions.

References

[1] A. M. Ali, A. A. Farag, G. L. Gimel’farb, Optimizing binary MRFs with higher order cliques, in: D. Forsyth, P. Torr, A. Zisserman (Eds.), Proc. European Conf. on Computer Vision (ECCV 2008), Vol. 5304 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2008, pp. 98–111. URL http://dx.doi.org/10.1007/978-3-540-88690-7_8

[2] N. Komodakis, N. Paragios, Beyond pairwise energies: Efficient optimization for higher-order MRFs, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'09), 2009, pp. 2985–2992. doi:10.1109/CVPR.2009.5206846.

[3] G. R. Cross, A. K. Jain, Markov random field texture models, IEEE Trans. on Pattern Analysis and Machine Intelligence 5 (1) (1983) 25–39. doi:10.1109/TPAMI.1983.4767341.

[4] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. on Pattern Analysis and Machine Intelligence 6 (1984) 721–741.

[5] A. Gagalowicz, S. D. Ma, Sequential synthesis of natural textures, Computer Vision, Graphics, and Image Processing 30 (3) (1985) 289–315. doi:10.1016/0734-189X(85)90162-8. URL http://www.sciencedirect.com/science/article/pii/0734189X85901628

[6] G. Gimel’farb, Quantitative description of spatially homogeneous textures by characteristic grey level co-occurrences, Australian Journal of Intelligent Information Processing Systems 6 (2000) 46–53.

[7] N. Heess, C. K. I. Williams, G. E. Hinton, Learning generative tex-ture models with extended Fields-of-Experts, in: Proc. British MachineVision Conf. (BMVC’09), 2009.

[8] S. C. Zhu, Y. Wu, D. Mumford, Minimax entropy principle and itsapplication to texture modeling, Neural Computation 9 (8) (1997) 1627–1660.

[9] S. Roth, M. J. Black, Fields of Experts, Int. J. of Computer Vision 82 (2)(2009) 205–229. doi:10.1007/s11263-008-0197-6.

[10] D. J. Heeger, J. R. Bergen, Pyramid-based texture analysis/synthesis,in: Proc. 22nd Conf. on Computer Graphics and Interactive Tech-niques, SIGGRAPH ’95, ACM, New York, 1995, pp. 229–238.doi:10.1145/218380.218446.URL http://doi.acm.org/10.1145/218380.218446

[11] S. C. Zhu, Y. Wu, D. Mumford, Filters, random fields and maximumentropy (FRAME): Towards a unified theory for texture modeling, Int.J. of Computer Vision 27 (2) (1998) 107–126.

[12] K. Valkealahti, E. Oja, Reduced multidimensional co-occurrencehistograms in texture classification, Pattern Analysis and Ma-chine Intelligence, IEEE Transactions on 20 (1) (1998) 90–94.doi:10.1109/34.655653.


[13] T. Ojala, K. Valkealahti, E. Oja, M. Pietikäinen, Texture discrimination with multidimensional distributions of signed gray-level differences, Pattern Recognition 34 (3) (2001) 727–739. doi:10.1016/S0031-3203(00)00010-8.

[14] A. Hyvärinen, E. Oja, Independent component analysis: algorithms and applications, Neural Networks 13 (4) (2000) 411–430.

[15] Y. W. Teh, M. Welling, S. Osindero, G. E. Hinton, Energy-based models for sparse overcomplete representations, J. of Machine Learning Research 4 (2003) 1235–1260.

[16] J. Portilla, E. P. Simoncelli, A parametric texture model based on joint statistics of complex wavelet coefficients, Int. J. of Computer Vision 40 (1) (2000) 49–70. doi:10.1023/A:1026553619983. URL http://dx.doi.org/10.1023/A:1026553619983

[17] T. Tieleman, Training restricted Boltzmann machines using approximations to the likelihood gradient, in: Proc. 25th Int. Conf. on Machine Learning (ICML'08), ACM, 2008, pp. 1064–1071.

[18] G. Gimel’farb, Image Textures and Gibbs Random Fields, Kluwer Academic Publishers, Dordrecht, 1999.

[19] J. J. Kivinen, C. Williams, Multiple texture Boltzmann machines, in: N. D. Lawrence, M. A. Girolami (Eds.), Proc. 15th Int. Conf. on Artificial Intelligence and Statistics, 2012, pp. 638–646 vol. 2. URL http://jmlr.csail.mit.edu/proceedings/papers/v22/kivinen12/kivinen12.pdf

[20] M. Ranzato, V. Mnih, G. E. Hinton, Generating more realistic images using gated MRFs, in: Advances in Neural Information Processing Systems 23 (NIPS'10), Curran Associates, 2010, pp. 2002–2010. URL http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips23/NIPS2010_1163.pdf

[21] H. Luo, P. L. Carrier, A. Courville, Y. Bengio, Texture modeling with convolutional spike-and-slab RBMs and deep extensions, J. of Machine Learning Research 31 (2013) 415–423.


[22] G. Sharma, S. Hussain, F. Jurie, Local higher-order statistics (LHS) for texture categorization and facial analysis, in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Computer Vision – ECCV 2012, Vol. 7578 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2012, pp. 1–12.

[23] W.-H. Liao, T.-J. Young, Texture classification using uniform extended local ternary patterns, in: 2010 IEEE Int. Symposium on Multimedia (ISM), 2010, pp. 191–195. doi:10.1109/ISM.2010.35.

[24] T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on featured distributions, Pattern Recognition 29 (1) (1996) 51–59. doi:10.1016/0031-3203(95)00067-4.

[25] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. on Pattern Analysis and Machine Intelligence 24 (7) (2002) 971–987. doi:10.1109/TPAMI.2002.1017623.

[26] R. Nosaka, C. Suryanto, K. Fukui, Rotation invariant co-occurrence among adjacent LBPs, in: J.-I. Park, J. Kim (Eds.), Computer Vision – ACCV 2012 Workshops, Vol. 7728 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2013, pp. 15–25.

[27] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, in: S. K. Zhou, W. Zhao, X. Tang, S. Gong (Eds.), Analysis and Modeling of Faces and Gestures, Vol. 4778 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2007, pp. 168–182.

[28] Y. Zhai, D. Neuhoff, T. Pappas, Local radius index – a new texture similarity feature, in: IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 1434–1438. doi:10.1109/ICASSP.2013.6637888.

[29] N. Liu, G. Gimel’farb, P. Delmas, Y. H. Chan, Texture modelling with generic translation- and contrast/offset-invariant 2nd–4th order MGRFs, in: Proc. 28th Int. Conf. on Image and Vision Computing New Zealand (IVCNZ 2013), 2013.


[30] S. Della Pietra, V. Della Pietra, J. Lafferty, Inducing features of random fields, IEEE Trans. on Pattern Analysis and Machine Intelligence 19 (4) (1997) 380–393. doi:10.1109/34.588021.

[31] A. Zalesny, Analysis and synthesis of textures with pairwise signal interactions, Tech. Rep. KUL/ESAT/PSI/9902, Katholieke Universiteit Leuven (1999).

[32] S. Perkins, K. Lacker, J. Theiler, I. Guyon, A. Elisseeff, Grafting: Fast, incremental feature selection by gradient descent in function space, J. of Machine Learning Research 3 (2003) 1333–1356.

[33] J. Zhu, N. Lao, E. P. Xing, Grafting-light: fast, incremental feature selection and structure learning of Markov random fields, in: Proc. 16th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2010, pp. 303–312.

[34] M. Dudík, S. J. Phillips, R. E. Schapire, Maximum entropy density estimation with generalized regularization and an application to species distribution modeling, J. of Machine Learning Research 8 (2007) 1217–1260. URL http://dl.acm.org/citation.cfm?id=1314498.1314540

[35] M. Schmidt, K. Murphy, Convex structure learning in log-linear models: Beyond pairwise potentials, in: Proc. Int. Workshop on Artificial Intelligence and Statistics, 2010.

[36] A. McCallum, Efficiently inducing features of conditional random fields, in: Proc. 19th Conf. on Uncertainty in Artificial Intelligence (UAI'03), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003, pp. 403–410. URL http://dl.acm.org/citation.cfm?id=2100584.2100633

[37] L. Mihalkova, R. J. Mooney, Bottom-up learning of Markov logic network structure, in: Proc. 24th Int. Conf. on Machine Learning, ICML '07, ACM, New York, 2007, pp. 625–632. doi:10.1145/1273496.1273575.

[38] S. Lee, V. Ganapathi, D. Koller, Efficient structure learning of Markov networks using L1-regularization, in: Advances in Neural Information Processing Systems 19 (NIPS'06), 2006.


[39] Y. Chen, M. Welling, Bayesian structure learning for Markov random fields with a spike and slab prior, in: Proc. 28th Conf. on Uncertainty in Artificial Intelligence, AUAI Press, 2012, pp. 174–184.

[40] A. A. Efros, T. K. Leung, Texture synthesis by non-parametric sampling, in: Proc. Seventh IEEE Int. Conf. on Computer Vision, 1999, pp. 1033–1038 vol. 2. doi:10.1109/ICCV.1999.790383.

[41] L.-Y. Wei, M. Levoy, Fast texture synthesis using tree-structured vector quantization, in: Proc. 27th Conf. on Computer Graphics and Interactive Techniques, SIGGRAPH '00, ACM Press/Addison-Wesley Publishing, New York, 2000, pp. 479–488. doi:10.1145/344779.345009.

[42] A. A. Efros, W. T. Freeman, Image quilting for texture synthesis and transfer, in: Proc. 28th Conf. on Computer Graphics and Interactive Techniques, SIGGRAPH '01, ACM, New York, 2001, pp. 341–346. doi:10.1145/383259.383296.

[43] V. Kwatra, A. Schödl, I. Essa, G. Turk, A. Bobick, Graphcut textures: Image and video synthesis using graph cuts, ACM Trans. on Graphics, SIGGRAPH 2003 22 (3) (2003) 277–286.

[44] K. Popat, R. W. Picard, Novel cluster-based probability model for texture synthesis, classification, and compression, in: Visual Communications '93, Int. Society for Optics and Photonics, 1993, pp. 756–768.

[45] R. Paget, I. Longstaff, Texture synthesis via a noncausal nonparametric multiscale Markov random field, IEEE Trans. on Image Processing 7 (6) (1998) 925–931. doi:10.1109/83.679446.

[46] G. Peyré, Sparse modeling of textures, J. of Mathematical Imaging and Vision 34 (1) (2009) 17–31. doi:10.1007/s10851-008-0120-3.

[47] E. T. Jaynes, Information theory and statistical mechanics, Phys. Rev. 106 (1957) 620–630. doi:10.1103/PhysRev.106.620.


[48] R. Versteegen, G. Gimel’farb, P. Riddle, Learning high-order generative texture models, in: Proc. 29th Int. Conf. on Image and Vision Computing New Zealand (IVCNZ 2014), 2014.

[49] J. Lin, Divergence measures based on the Shannon entropy, IEEE Trans. on Inf. Theory 37 (1) (1991) 145–151. doi:10.1109/18.61115.

[50] J. Zujovic, Perceptual texture similarity metrics, Ph.D. thesis, Northwestern University (2011).

[51] X. Descombes, R. Morris, J. Zerubia, M. Berthod, Estimation of Markov random field prior parameters using Markov chain Monte Carlo maximum likelihood, IEEE Trans. on Image Processing 8 (7) (1999) 954–963. doi:10.1109/83.772239.

[52] G. Gimel’farb, D. Zhou, Texture analysis by accurate identification of a generic Markov–Gibbs model, in: H. Bunke, A. Kandel, M. Last (Eds.), Applied Pattern Recognition, Vol. 91, Springer Berlin Heidelberg, 2008, pp. 221–245.

[53] L. B. Almeida, T. Langlois, J. D. Amaral, A. Plakhov, On-line Learning in Neural Networks, Cambridge University Press, New York, 1998, Ch. Parameter Adaptation in Stochastic Optimization, pp. 111–134. URL http://dl.acm.org/citation.cfm?id=304710.304730

[54] S.-C. Zhu, X. Liu, Y. Wu, Exploring texture ensembles by efficient Markov chain Monte Carlo – toward a “trichromacy” theory of texture, IEEE Trans. on Pattern Analysis and Machine Intelligence 22 (6) (2000) 554–569. doi:10.1109/34.862195.

[55] M. Unser, Sum and difference histograms for texture classification, IEEE Trans. on Pattern Analysis and Machine Intelligence 8 (1) (1986) 118–125. doi:10.1109/TPAMI.1986.4767760.

[56] Brodatz textures, http://www.ux.uis.no/~tranden/brodatz.html, accessed: 2013-06-26.

[57] MeasTex image texture database and test suite, http://www.texturesynthesis.com/meastex/meastex.html, accessed: 2013-09-02.


[58] Vision Texture, http://vismod.media.mit.edu/vismod/imagery/VisionTexture/, accessed: 2013-09-02.

[59] J. Portilla, E. P. Simoncelli, Representation and synthesis of visual texture, http://www.cns.nyu.edu/~lcv/texture/, accessed: 2014-10-10.

[60] P. Brodatz, Textures: a Photographic Album for Artists and Designers, Dover, New York, 1966.

[61] K. Zuiderveld, Contrast limited adaptive histogram equalization, in: Graphics Gems IV, Academic Press Professional, 1994, pp. 474–485.

[62] Q. Gao, S. Roth, Texture synthesis: From convolutional RBMs to efficient deterministic algorithms, in: P. Fränti, G. Brown, M. Loog, F. Escolano, M. Pelillo (Eds.), Structural, Syntactic, and Statistical Pattern Recognition, Vol. 8621 of LNCS, Springer Berlin Heidelberg, 2014, pp. 434–443. doi:10.1007/978-3-662-44415-3_44.

[63] Z. Wang, A. Bovik, H. Sheikh, E. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. on Image Processing 13 (4) (2004) 600–612. doi:10.1109/TIP.2003.819861.
