
    Probabilistic Modeling for Semantic Scene Classification

    Matthew R. Boutell

    URCS Technical Report 862

    May, 2005

This thesis was proposed in April, 2003. The dissertation itself will be published in May, 2005. While the dissertation subsumes and modifies much of the material in this proposal, I have made it available (as URCS TR 862) as a historical supplement to document the details and results of the MASSES (Material and Spatial Experimental Scenes) prototype.


Probabilistic Modeling for Semantic Scene Classification

    by

    Matthew R. Boutell

    Thesis Proposal

    for the Degree

    Doctor of Philosophy

    Supervised by

    Christopher M. Brown

    Department of Computer Science

    The College

    Arts and Sciences

    University of Rochester

    Rochester, New York

    2005


    Abstract

Scene classification, the automatic categorization of images into semantic classes such as beach, field, or party, is useful in applications such as content-based image organization and context-sensitive digital enhancement. Most current scene-classification systems use low-level features and pattern recognition techniques; they achieve some success on limited domains.

Several contemporary classifiers, including some developed in Rochester, incorporate semantic material and object detectors. Classification performance improves because the gap between the features and the image semantics is narrowed. We propose that spatial relationships between the objects or materials can help by distinguishing between certain types of scenes and by mitigating the effects of detector failures. While past work on spatial modeling has used logic- or rule-based models, we propose a probabilistic framework to handle the loose spatial relationships that exist in many scene types.

To this end, we have developed MASSES, an experimental testbed that can generate virtual scenes. MASSES can be used to experiment with different spatial models, different detector characteristics, and different learning parameters. Using a tree-structured Bayesian network for inference on a series of simulated natural scenes, we have shown that the presence of key materials can effectively distinguish certain scene types. However, spatial relationships are needed to disambiguate other types of scenes, achieving a gain of 7% in one case.

However, our simple Bayes net is not expressive enough to model the faulty detection at the level of individual regions. As future work, we propose first to evaluate full (DAG) Bayesian networks and Markov Random Fields as potential probabilistic frameworks. We then plan to extend the chosen framework for our problem. Finally, we will compare our results on real and simulated sets of images with those obtained by other systems using spatial features represented implicitly.


    Table of Contents

Abstract

List of Tables

List of Figures

1 Introduction
  1.1 Motivation
  1.2 The Problem of Scene Classification
    1.2.1 Scene Classification vs. Full-scale Image Understanding
    1.2.2 Scene Classification vs. Object Recognition
  1.3 Past Work in Scene Classification
  1.4 Statement of Thesis
    1.4.1 Philosophy
  1.5 Summary of Preliminary Work
  1.6 Organization of Proposal

2 Related Work
  2.1 Design Space of Scene Classification
    2.1.1 Features
    2.1.2 Learning and Inference Engines
  2.2 Scene Classification Systems
    2.2.1 Low-level Features and Implicit Spatial Relationships
    2.2.2 Low-level Features and Explicit Spatial Relationships
    2.2.3 Mid-level Features and Implicit Spatial Relationships
    2.2.4 Semantic Features without Spatial Relationships
    2.2.5 Semantic Features and Explicit Spatial Relationships
    2.2.6 Summary of Scene Classification Systems
  2.3 Options for Computing Spatial Relationships
    2.3.1 Computing Qualitative Spatial Relationships
    2.3.2 Computing Quantitative Spatial Relationships
  2.4 Probabilistic Graphical Models
    2.4.1 Bayesian Networks
    2.4.2 Markov Random Fields
    2.4.3 Relative Merits

3 Methodology
  3.1 Statistics of Natural Images
    3.1.1 Ground Truth Collection Process
    3.1.2 Scene Prototypes
    3.1.3 Spatial Relationships
    3.1.4 Scene-specific Spatial Relationship Statistics
  3.2 Experimental Environment
    3.2.1 Advantages of Working in Simulation
    3.2.2 MASSES Prototype
    3.2.3 Simulating Faulty Detectors
    3.2.4 Background Regions

4 Experimental Results
  4.1 Best-case Detection in MASSES
  4.2 Best-case Detection on Beach Photographs
  4.3 Faulty Detection on Beach Photographs

5 Proposed Research
  5.1 Research Plan
    5.1.1 Why a New Inference Mechanism?
    5.1.2 Bayes Nets vs. MRFs
    5.1.3 Extend the Chosen Framework
    5.1.4 Analyze the Effect of Detector Quality
    5.1.5 Evaluate Explicit Spatial Relationships
    5.1.6 Evaluate Semantic Features
    5.1.7 Generalize to Other Scene Classes
    5.1.8 Explore Potential Long-term Directions
  5.2 Issues Not Addressed in This Thesis
  5.3 Research Schedule

6 Acknowledgments

Bibliography

A Natural Scene Statistics

B Detector Characteristics


    List of Tables

2.1 Options for features to use in scene classification.
2.2 Potential classifiers to use in scene classification.
2.3 Related work in scene classification, organized by feature type and use of spatial information.
3.1 Scene class definitions.
3.2 Distribution resulting from offline sampling procedure.
4.1 MASSES with best-case material detection: Accuracy with and without spatial information.
4.2 MASSES with best-case material detection: Accuracy with and without spatial information.
4.3 MASSES with faulty material detection: Accuracy with and without spatial information.


    List of Figures

1.1 Content-ignorant color balancing can destroy the brilliance of sunset images, such as those pictured, which have the same global color distribution as indoor, incandescent-illuminated images.
2.1 A Bayes net with a loop.
2.2 Portion of a typical two-layer MRF. In low-level computer vision problems, the top layer (black) represents the external evidence of the observed image while the bottom layer (white) expresses the a priori knowledge about relationships between parts of the underlying scene.
3.1 Ground-truth labeling of a beach scene. Sky, water, and sand regions are clearly shown.
3.2 Prototypical beach scenes. (a) A simple beach scene without background objects. (b) Because we make no attempt to detect it, we consider the sailboat to be background. (c) A more complicated scene: a developed beachfront. (d) A scene from a more distant field-of-view. (e) A crowded beach.
3.3 Prototypical urban scenes. (a) The most common urban scene, containing sky, buildings, and roads. (b),(c) The sky is not simply above the buildings in these images. (d) Roads are not necessary. (e) Perspective views induce varied spatial relationships. (f) Close views can preclude the presence of sky.
3.4 An example of spatial relationships involving a split region.
3.5 The MASSES environment. Statistics from labeled scenes are used to bootstrap the generative model, which can then produce new virtual scenes for training or testing the inference module.
3.6 Single-level Bayesian network used for MASSES.
3.7 Sampling the scene type yields class C. Then we sample to find the materials present in the image, in this case, M1, M3, and M4. Finally, we sample to find the relationships between each pair of these material regions.
3.8 Bayesian network subgraph showing relationship between regions and detectors.
4.1 Images incorrectly classified due to spatial relationships.
4.2 Images incorrectly classified using faulty detectors and no spatial relationships. The actual materials are shown in the top row; the detected materials are shown below each.
4.3 Images incorrectly classified using faulty detectors.
4.4 Images incorrectly classified using faulty detectors.
5.1 Classifier input (labeled regions) and output (classification and confidence).
5.2 An image in which a region is mis-detected, creating contradictory spatial relationships in the material-based inference scheme.
5.3 Proposed DAG Bayesian network. Note that separate material and region layers are needed.


    1 Introduction

Semantic scene classification, the process of categorizing images into semantic classes such as beaches, sunsets, or parties, is a useful endeavor. As humans, we can quickly determine the classification of a scene, even without recognizing every one of the details present. Even the gist of a scene is worth much in terms of communication.

    1.1 Motivation

Automatic semantic classification of digital images finds many applications. We describe two major ones briefly: content-based image organization and retrieval (CBIR) and digital enhancement.

With digital libraries growing in size so quickly, accurate and efficient techniques for CBIR become more and more important. Many current systems allow a user to specify an image and then search for images similar to it, where similarity is often defined only by color or texture properties. Because a score is computed on each image in the potentially-large database, this approach is somewhat inefficient (though individual calculations vary in complexity).

Furthermore, this so-called query by example has often proven to return inadequate results [68]. Sometimes the match between the retrieved and the query images is hard to understand, while other times, the match is understandable, but contains no semantic value. For instance, with simple color features, a query for a rose can return a picture of a man wearing a red shirt, especially if the background colors are similar as well.

Knowledge of the semantic category of a scene helps narrow the search space dramatically [37]. If the categories of the query image and the database images have been assigned (either manually or by an algorithm), they can be exploited to improve both efficiency and accuracy. For example, knowing what constitutes a party scene allows us to consider only potential party scenes in our search and thus helps to answer the query "find photos of Mary's birthday party". This way, search time is reduced, the hit rate is higher, and the false alarm rate is expected to be lower. Visual examples can be found in [76].

Knowledge about the scene category can also find application in digital enhancement [73]. Digital photofinishing processes involve three steps: digitizing the image if necessary (if the original source was film), applying enhancement algorithms, and outputting the image in either hardcopy or electronic form. Enhancement consists primarily of color balancing, exposure enhancement, and noise reduction. Currently, enhancement is generic (i.e., without knowledge of the scene content). Unfortunately, while a balancing algorithm might enhance the quality of some classes of pictures, it degrades others.

Take color balancing as an example. Photographs captured under incandescent lighting without flash tend to be yellowish in color. Color balancing removes the yellow cast. However, applying the same color balancing to a sunset image (containing the same overall yellow color) destroys the desired brilliance (Figure 1.1).

Figure 1.1: Content-ignorant color balancing can destroy the brilliance of sunset images, such as those pictured, which have the same global color distribution as indoor, incandescent-illuminated images.

Other images that are negatively affected by color balancing are those containing skin-type colors. Correctly balanced skin colors are important to human perception [64], and it is important to balance them. However, causing non-skin objects with similar colors to look like skin is a conspicuous error.

Rather than applying generic color balancing and exposure adjustment to all images, knowledge of the scene's semantic classification allows us to customize them to the scene. Following the example above, we could retain or boost sunset scenes' brilliant colors while reducing a tungsten-illuminated indoor scene's yellowish cast.

    1.2 The Problem of Scene Classification

On one hand, isn't scene classification preceded by image understanding, the holy grail of vision? What makes us think we can achieve results? On the other hand, isn't scene classification just an extension of object recognition, for which many techniques have been proposed with varying success? How is scene classification different from these two related fields?


1.2.1 Scene Classification vs. Full-scale Image Understanding

As usually defined, image understanding is the process of converting pixels to predicates: (iconic) image representations to another (symbolic) form of knowledge [2]. Image understanding is the highest (most abstract) processing level in computer vision [71], as opposed to image processing techniques, which convert one image representation to another. (For instance, using a mask to convert raw pixels to an edge image is much more concrete than identifying the expression on a person's face in the image!) Lower-level image processing techniques such as segmentation are used to create regions that can then be identified as objects. The control strategies used to order the processing steps can vary [3]. The end result desired is for the vision to support high-level reasoning about the objects and their relationships to meet a goal.

While image understanding in unconstrained environments is still very much an open problem [71; 77], much progress is currently being made in scene classification. Because scenes can often be classified without full knowledge of every object in the image, the goal is not as ambitious. For instance, if a person recognizes trees at the top of a photo, grass on the bottom, and people in the middle, he may hypothesize that he is looking at a park scene, even if he cannot see every detail in the image. Or on a different level, if there are lots of sharp vertical and horizontal edges, he may be looking at an urban scene.

It may be possible in some cases to use low-level information, such as color or texture, to classify some scene types accurately. In other cases, perhaps object recognition is necessary, but not necessarily of every object in the scene. In general, classification seems to be an easier problem than unconstrained image understanding; early results have confirmed this for certain scene types in constrained environments [74; 77]. Scene classification is a subset of the image understanding problem, and can be used to ease other image understanding tasks [75]. For example, knowing that a scene is of a beach constrains where in the scene one should look for people.

    Obtaining image understanding in unconstrained environments is a lofty goal,

    and one worthy of pursuit. However, given the state of image understanding, we

    see semantic scene classification as a necessary stepping-stone in pursuit of the

    grail.

1.2.2 Scene Classification vs. Object Recognition

However, scene classification is a different beast than object recognition. Detection of rigid objects can rely upon geometrical relationships within the objects, and various techniques [21; 63] can be used to achieve invariance to affine transforms and changes in scene luminance. Detection of non-rigid objects is less constrained physically, since the relationships are looser [12]. Scene classification is even less constrained, since the components of a scene are varied. For instance, while humans might find it easy to recognize a scene of a child's birthday party, the objects and people that populate the scene can vary widely, and the cues that determine the birthday scene class (such as special decorations, articles marked with the age of the child, and facial expressions on the attenders) can be subtle. Even the more obvious cues, like a birthday cake, may be difficult to determine.

Again, the areas of scene classification and object recognition are related; knowing the identity of some of the scene's objects will certainly help to classify the scene, while knowing the scene type affects the expected likelihood and location of the objects it contains.


    1.3 Past Work in Scene Classification

Most of the current systems primarily use low-level features to classify scenes and achieve some success on constrained problems. These systems tend to be exemplar-based, in which features are extracted from images, and pattern recognition techniques are used to learn the statistics of a training set and to classify novel test images. Very few systems are model-based, in which the expected configuration of the objects in the scenes is specified by a human expert.

    1.4 Statement of Thesis

The limited success of scene classification systems using low-level features forces us to look for other solutions. Currently, good semantic material detectors and object recognizers are available [70; 38; 63] and have begun to be successfully applied to scene classification [37]. However, the presence or absence of certain objects is not always enough to determine a scene type. Furthermore, object detection is still developing and is far from perfect. Faulty detection causes brittle rule-based systems to break.

Our central claim is that spatial modeling of semantic objects and materials can be used to disambiguate certain scene types as well as to mitigate the effects of faulty detectors. Furthermore, an appropriate probabilistic inference mechanism must be developed to handle the loose spatial structure found in real images.

Current research into spatial modeling relies on (fuzzy) logic and subgraph matching [44; 83]. While we have found no research that incorporates spatial modeling in a probabilistic framework, we argue that a probabilistic approach would be more appropriate. First, logic (even fuzzy variants) is not equipped to handle exceptions efficiently [50], a concern we address in more detail in Section 2.4. Second, semantic material detectors often yield belief in the materials. While it is not obvious how to use belief values, it seems desirable to exploit the uncertainty in calculating the overall belief in each scene type. A nice side effect of true belief-based calculation is the ease with which a "don't know" option can be added to the classifier: simply threshold the final belief value.
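As an illustration, here is a minimal sketch of such a reject option; the class names, beliefs, and threshold are invented for illustration:

```python
def classify_with_reject(beliefs, threshold=0.6):
    """Return the most likely scene class, or "don't know" if the
    winning belief falls below the threshold.

    beliefs: dict mapping scene class -> posterior belief (sums to 1).
    """
    best_class = max(beliefs, key=beliefs.get)
    if beliefs[best_class] < threshold:
        return "don't know"
    return best_class

# Hypothetical posteriors from a scene classifier:
print(classify_with_reject({"beach": 0.48, "urban": 0.42, "field": 0.10}))  # don't know
print(classify_with_reject({"beach": 0.81, "urban": 0.12, "field": 0.07}))  # beach
```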

The first interesting problem is the appropriate choice of a probabilistic framework. Both Bayesian networks [66] and Markov Random Fields [30; 16; 15] have been applied to other vision problems in the past.

We also propose to investigate the effects of spatial modeling. In our experimentation, we plan to compare the following:

1) Baseline (no spatial relationships). Use naive Bayes classification rules using the presence or absence of materials only (a minimal sketch of this baseline appears after this list).

2a) Qualitative spatial relationships. Incorporate relations such as "above" or "beside" between regions. This would be appropriate for use in a Bayesian network framework.

2b) Quantitative spatial relationships. Use distance and direction between regions. This may potentially be more accurate, due to the increase in the information used, but requires more training data. These may work particularly well within a Markov Random Field framework.
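To make the baseline concrete, here is a minimal sketch of naive Bayes classification over material presence alone. The class priors, conditional probabilities, and material names are invented for illustration, not taken from the thesis:

```python
import math

# Hypothetical model parameters: P(class) and P(material present | class).
priors = {"beach": 0.5, "urban": 0.5}
p_present = {
    "beach": {"sky": 0.9, "water": 0.8, "sand": 0.7, "building": 0.1},
    "urban": {"sky": 0.8, "water": 0.1, "sand": 0.05, "building": 0.9},
}

def classify(detected):
    """Naive Bayes over material presence/absence: each material is
    treated as conditionally independent given the scene class."""
    scores = {}
    for c in priors:
        log_p = math.log(priors[c])
        for material, p in p_present[c].items():
            log_p += math.log(p if material in detected else 1.0 - p)
        scores[c] = log_p
    # Normalize the log scores into posterior beliefs.
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

print(classify({"sky", "water", "sand"}))  # high belief in beach
print(classify({"sky", "building"}))       # high belief in urban
```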

    One of our major foreseen contributions will be to validate the hypothesized

    gain due to spatial modeling.

    1.4.1 Philosophy

The success of our approach seems to hinge on the strength of the underlying detectors. Consider two scenarios. First, if the detectors are reasonably accurate, then we can expect to overcome some faults using spatial relationships. However, if they are extremely weak, we would be left with a very ambitious goal: from a very pessimistic view of the world (loose spatial structure and weak detectors), pull order from the chaos and accurately apply a discrete label to a configuration of objects.

In this latter case, prior research suggests that the task is not promising. For instance, Selinger found that if an object could be recognized with moderate success using a single camera view, additional views could improve recognition substantially. However, if the single view gave weak detection, then multiple views could not remedy the problem. She states [62] (p. 106):

    The result is in concert with general expertise in the field of recognition concerning the difficulty of leveraging multiple sources of weak evidence into strong hypotheses.

Therefore, while we cannot expect to use our technique to classify scenes for which the detectors are completely inaccurate, we stand a reasonable chance of improving accuracy if the detectors are reasonably strong themselves.

    1.5 Summary of Preliminary Work

We have performed our experiments in a simulated abstract world. The materials and spatial relationships used are based on statistics captured from real scenes. This provides us with a rich test bed, in which we can develop algorithms, compare approaches, quickly experiment with parameters, and explore what-if situations.

With a prototype scene simulator we developed, using a single-level, tree-structured Bayesian network for inference on a series of simulated natural scenes, we have shown that the presence of key materials can effectively distinguish certain scene types. However, spatial relationships are needed to disambiguate other types of scenes, achieving a gain of 7% in one case.


    However, when simulating faulty detectors, we found that the network is not

    expressive enough to capture the necessary information, actually leading to lower

    accuracy when spatial relationships were used.

    1.6 Organization of Proposal

    Chapter 2 contains an overview of the relevant literature in scene classification,

    spatial modeling, and probabilistic frameworks. In Chapter 3, we describe our

    methodology, both for the detector and the simulator. Chapter 4 is a summary of

    our experiments and results (using best-case and faulty detectors). We conclude

    in Chapter 5, in which we describe our research plan and proposed contributions.


    2 Related Work

    Scene classification is a young, emerging field. The first section of this chapter

    is taken in large part from our earlier review of the state of the art in scene

    classification [6]; because this thesis is a work in progress, there is much overlap

    between the two. Here we focus our attention on systems using approaches directly

    related to our proposed thesis. Readers desiring a more comprehensive survey or

    more detail are referred to the original review.

    All systems classifying scenes must extract appropriate features and use some

    sort of learning or inference engine to classify the image. We start by outlining

    the options available for features and classifiers. We then present a number of

    systems which we have deemed to be good representations of the field.

    We augment our review of the literature by discussing two computational

    models of spatial relationships and then discussing in detail two graphical models

    we could use for probabilistic inference: Bayesian Networks and Markov Random

    Fields, each of which will be explored in the thesis.


    2.1 Design Space of Scene Classification

The literature reveals two approaches to scene classification: exemplar-based and model-based. On one hand, exemplar-based approaches use pattern recognition techniques on vectors of low-level image features (such as color, texture, or edges) or semantic features (such as sky, faces, or grass). The exemplars are thought to fall into clusters, which can then be used to classify novel test images, using an appropriate distance metric. Most systems use an exemplar-based approach, perhaps due to recent advances in pattern recognition techniques. On the other hand, model-based approaches are designed using expert knowledge of the scene, such as its expected configuration. A scene's configuration is the layout (relative locations and sizes) of its objects. While it seems as though this should be very important, very little research has been done in this area.

In either case, appropriate features must be extracted for accurate classification. What makes a certain feature appropriate for a given task? For pattern classification, one wants the inter-class distances to be maximized and the intra-class distances to be minimized. Many choices are intuitive, e.g., edge features should help separate city and landscape scenes [78].

    2.1.1 Features

    In our review [6], we described features we found in similar systems, or which

    we thought could be potentially useful. Table 2.1 is a summary of that set of

    descriptions.

    2.1.2 Learning and Inference Engines

Pattern recognition systems classify samples represented by feature vectors (see a good review in [28]). Features are extracted from each of a set of training images, or exemplars. In most classifiers, a statistical inference engine then extracts information from the processed training data. Finally, to classify a novel test image, this type of system extracts the same features from the test image and compares them to those in the training set [18]. This exemplar-based approach is used by most of the current systems.

Table 2.1: Options for features to use in scene classification.

Color: Histograms [72], Coherence vectors [49], Moments [79]
Texture [51]: Wavelets [42; 65], MSAR [73], Fractal dimension [71]
Filter Output: Fourier & discrete cosine transforms [46; 73; 74; 75; 77], Gabor [59], Spatio-temporal [53]
Edges: Direction histograms [77], Direction coherence vectors
Context Patch: Dominant edge with neighboring edges [63]
Object Geometry: Area, Eccentricity, Orientation [10]
Object Detection: Output from belief-based material detectors [37; 66], rigid object detectors [63], face detectors [60]
IU Output: Output of other image understanding systems, e.g., Main Subject Detection [66]
Context: Within images (scale, focus, pose) [75]; Between images (adjacent images on film or video)
Mid-level: Spatial envelope features [47]
Meta-data: Time-stamp, Flash firing, Focal length, text [36; 24]
Statistical Measures: Dimensionality reduction [18; 58]

The classifiers used in these types of systems differ in how they extract information from the training data. In Table 2.2, we present a summary of the major systems used in the realm of scene classification.

    2.2 Scene Classification Systems

As stated, many of the systems proposed in the literature for scene classification are exemplar-based, but a few are model-based, relying on expert knowledge to model scene types, usually in terms of the expected configuration of objects in the scene. In this section, we describe briefly some of these systems and point out some of their limitations en route to differentiating our proposed method. We organize the systems by feature type and use of spatial information, as shown in Table 2.3. Features are grouped into low-level, mid-level, and high-level (semantic) features, while spatial information is grouped into those that model the spatial relationships explicitly in the inference stage and those that do not.

    2.2.1 Low-level Features and Implicit Spatial Relationships

A number of researchers have used low-level features sampled at regular spatial locations (e.g., blocks in a rectangular grid). In this way, spatial features are encoded implicitly, since the features computed on each location are mapped to fixed dimensions in the feature space.
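A minimal sketch of this implicit encoding, assuming numpy and an invented grid size (the mean color of each block, concatenated in fixed order):

```python
import numpy as np

def grid_color_features(image, grid=(4, 4)):
    """Encode spatial layout implicitly: compute the mean color of each
    block in a rectangular grid and concatenate the results, so each block
    maps to fixed dimensions of the feature vector.

    image: H x W x 3 array.
    """
    h, w, _ = image.shape
    gh, gw = grid
    feats = []
    for i in range(gh):
        for j in range(gw):
            block = image[i * h // gh:(i + 1) * h // gh,
                          j * w // gw:(j + 1) * w // gw]
            feats.extend(block.reshape(-1, 3).mean(axis=0))
    return np.array(feats)  # length = gh * gw * 3

# Example: a random 64x64 RGB image yields a 48-dimensional vector.
print(grid_color_features(np.random.rand(64, 64, 3)).shape)  # (48,)
```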


Table 2.2: Potential classifiers to use in scene classification.

1-Nearest-Neighbor (1NN): Classifies a test sample with the same class as the exemplar closest to it in the feature space.

K-Nearest-Neighbor (kNN) [18]: Generalization of 1NN in which the sample is given the label of the majority of the k closest exemplars.

Learning Vector Quantization (LVQ) [31; 32]: A representative set of exemplars, called a codebook, is extracted. The codebook size and learning rate must be chosen in advance.

Maximum a Posteriori (MAP) [77]: Combines the class likelihoods (which must be modeled, e.g., with a mixture of Gaussians) with class priors using Bayes' rule.

Support Vector Machine (SVM) [8; 61]: Finds an optimal hyperplane separating two classes. Maps data into higher dimensions, using a kernel function, to increase separability. The kernel and associated parameters must be chosen in advance.

Artificial Neural Networks (ANN) [1]: Function approximators in which the inputs are mapped, through a series of linear combinations and non-linear activation functions, to outputs. The weights are learned using a technique called backpropagation.
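As one concrete instance of these exemplar-based classifiers, here is a minimal k-nearest-neighbor sketch; the training exemplars and their 2-D features are invented for illustration:

```python
from collections import Counter

def knn_classify(exemplars, x, k=3):
    """k-Nearest-Neighbor: label a test sample with the majority class
    of the k closest training exemplars (squared Euclidean distance).

    exemplars: list of (feature_vector, label) pairs.
    """
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(exemplars, key=lambda e: dist(e[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 2-D features (e.g., mean edge strength, mean saturation):
train = [((0.9, 0.2), "city"), ((0.8, 0.3), "city"),
         ((0.2, 0.8), "landscape"), ((0.3, 0.7), "landscape")]
print(knn_classify(train, (0.85, 0.25)))  # city
```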


Table 2.3: Related work in scene classification, organized by feature type and use of spatial information.

Low-level features, implicit/no spatial information: Vailaya et al.; Oliva et al.; Szummer & Picard; Serrano et al.; Paek & Chang; Carson et al.; Wang et al.
Low-level features, explicit spatial information: Lipson et al.; Ratan & Grimson; Smith & Li
Mid-level features, implicit/no spatial information: Oliva et al.
High-level (semantic) features, implicit/no spatial information: Luo et al.; Song & Zhang
High-level (semantic) features, explicit spatial information: Mulhem et al.; Proposed Method

The problems addressed include indoor vs. outdoor classification (Szummer and Picard [73], Paek and Chang [48], and Serrano et al. [65]), outdoor scene classification (Vailaya et al. [77]), and image orientation detection [79; 80].¹

The indoor vs. outdoor classifiers' accuracy approaches 90% on tough (e.g., consumer) image sets. On the outdoor scene classification problem, mid-90% accuracy is reported. This may be due to the use of constrained data sets (e.g., from the Corel stock photo library), because on less constrained (e.g., consumer) image sets, we found the results to be lower. The generalizability of the technique is also called into question by the discrepancies in the numbers reported for image orientation detection by some of the same researchers [79; 80].

¹ While image orientation detection is a different level of semantic classification, many of the techniques used are similar.


    Pseudo-Object Features

The Blobworld system, developed at Berkeley, was designed primarily for content-based indexing and retrieval. However, it is applied to the scene classification problem in [9]. The researchers segment the image and use statistics computed for each region (e.g., color, texture, location with respect to a 3×3 grid) without performing object recognition. Admittedly, this is a more general approach for scene types containing no recognizable objects. However, we can hope for more using object recognition. Finally, a maximum likelihood classifier performs the classification.

Wang's SIMPLIcity (Semantics-sensitive Integrated Matching for Picture LIbraries) system [80] also uses segmentation to match pseudo-objects. The system uses a fuzzy method called Integrated Region Matching to effectively compensate for potentially poor segmentation, allowing a region in one image to match with several in another image. However, spatial relationships between regions are not used, and the framework is used only for CBIR, not scene classification.

    2.2.2 Low-level Features and Explicit Spatial Relationships

The systems above either ignore spatial information or encode it implicitly using a feature vector. However, other bodies of research imply that spatial information is valuable and should be encoded explicitly and used by the inference engine. In this section, we review this body of research, describing a number of systems using spatial information to model the expected configuration of the scene.

    Configural Recognition

Lipson, Grimson, and Sinha at MIT use an approach they call configural recognition [34; 35], using relative spatial and color relationships between pixels in low resolution images to match the images with class models.

  • 8/14/2019 05.Tr862.Probabilistic Modeling for Semantic Scene Classification

    28/115

    17

The specific features extracted are very simple. The image is smoothed and subsampled at a low resolution (ranging from 8×8 to 32×32). Each pixel represents the average color of a block in the original image; no segmentation is performed. For each pixel, only its luminance, RGB values, and position are extracted.

The hand-crafted models are also extremely simple. For example, a template for a snowy mountain image is a blue region over a white region over a dark region; one for a field image is a large bluish region over a large greener region. In general, the model contains relative x- and y-coordinates, relative R-, G-, B-, and luminance values, and relative sizes of regions in the image.

The matching process uses the relative values of the colors in an attempt to achieve illumination invariance. Furthermore, using relative positions mimics the performance of a deformable template: as the model is compared to the image, the model can be deformed by moving the patch around so that it best matches the image. A model-image match occurs if any one configuration of the model matches the image. However, this criterion may be extended to include the degree of deformation and multiple matches depending on how well the model is expected to match the scene.
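To give the flavor of such a template, here is a minimal sketch of a "blue region over white region over dark region" snowy-mountain check on a subsampled image. The band boundaries, thresholds, and test image are invented, and this rigid version omits the deformation the actual models allow:

```python
import numpy as np

def is_snowy_mountain(img):
    """Check a configural model on a low-resolution image: a bluish band
    over a bright (white) band over a dark band, using only relative
    color comparisons between bands.

    img: N x N x 3 array of averaged RGB blocks, values in [0, 1].
    """
    n = img.shape[0]
    top, mid, bot = img[:n // 3], img[n // 3:2 * n // 3], img[2 * n // 3:]
    sky = top.reshape(-1, 3).mean(axis=0)
    snow = mid.reshape(-1, 3).mean(axis=0)
    ground = bot.reshape(-1, 3).mean(axis=0)
    return bool(sky[2] > sky[0]                 # top band relatively blue
                and snow.mean() > ground.mean() # middle brighter than bottom
                and snow.mean() > 0.6)          # middle bright overall

# Hypothetical 9x9 image: blue sky over white snow over dark rock.
img = np.zeros((9, 9, 3))
img[:3] = [0.3, 0.5, 0.9]
img[3:6] = [0.9, 0.9, 0.9]
img[6:] = [0.2, 0.15, 0.1]
print(is_snowy_mountain(img))  # True
```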

Classification is binary for each classifier. On a test set containing 700 professional images (the Corel Fields, Sunsets and Sunrises, Glaciers and Mountains, Coasts, California Coasts, Waterfalls, and Lakes and Rivers CDs), the authors report recall using four classifiers: fields (80%), snowy mountains (75%), snowy mountains with lakes (67%), and waterfalls (33%). Unfortunately, exact precision numbers cannot be calculated from the results given.

The strength of the system lies in the flexibility of the template, in terms of both luminance and position. However, one limitation the authors state is that each class model captured only a narrow band of images within the class and that multiple models were needed to span a class.


Learning the Model Parameters In a follow-up study, Ratan and Grimson [54] used the same model, but learned the model parameters from exemplars. They reported results similar to those of the hand-crafted models used by Lipson. However, the method was computationally expensive [83].

Combining Configurations with Statistical Learning In another variation on the previous research, Yu and Grimson adapt the configural approach to a statistical, feature-vector based approach, treating configurations like words appearing in a document [83]. Set representations, e.g., attributed graphs, contain parts and relations. In this framework, the configurations of relative brightness, positions, and sizes are subgraphs. However, inference is computationally costly.

Vector representations allow for efficient learning of visual concepts (using the rich theory of supervised learning). Encoding configural information in the features overcomes the limited ability of vector representations to preserve relevant information about spatial layout [83].

Within a CBIR framework with two query images, configurations are extracted as follows. Because configurations contained in both images are most informative, an extension of the maximum clique method is used to extract common subgraphs from the two images. The essence of the method is that configurations are grown from the best matching pairs (e.g., highly contrasting regions) in each image. During the query process, the common configurations are broken into smaller parts and converted to a vector format, in which feature i corresponds to the probability that sub-configuration i is present in the image.

A naive (i.e., single-level, tree-structured) Bayesian network is trained on-line for image retrieval. A set of query images is used for training, with likelihood parameters estimated by EM. Database images are then retrieved in order of their posterior probability.

On a subset of 1000 Corel images, a single waterfall query is shown to have better retrieval performance than other measures such as color histograms, wavelet coefficients, and Gabor filter outputs.

Note that the spatial information is explicitly encoded in the features, but is not used directly in the inference process.

In the subgraph extraction process above, if extracting a common configuration from more than two images is desired, one can use Hong et al.'s method [26].

    Composite Region Templates (CRT)

CRTs are configurations of segmented image regions [69]. The configurations are limited to those occurring in the vertical direction: each vertical column is stored as a region string, and statistics are computed for various sequences occurring in the strings. While this is an interesting approach, one unfortunate limitation of their experimental work is that the training and testing sets were both extremely small.
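A minimal sketch of the region-string idea, over an invented label grid; each column's labels are collapsed into a top-to-bottom sequence:

```python
def region_strings(label_grid):
    """Composite Region Template flavor: for each vertical column of a
    labeled image, record the top-to-bottom sequence of region labels,
    collapsing runs of the same label."""
    strings = []
    for col in zip(*label_grid):  # iterate over columns
        s = [col[0]]
        for label in col[1:]:
            if label != s[-1]:
                s.append(label)
        strings.append(tuple(s))
    return strings

# Hypothetical 4x3 label grid of a beach scene (rows are top to bottom):
grid = [["sky",   "sky",   "sky"],
        ["sky",   "sky",   "water"],
        ["water", "water", "water"],
        ["sand",  "sand",  "sand"]]
print(region_strings(grid))
# [('sky', 'water', 'sand'), ('sky', 'water', 'sand'), ('sky', 'water', 'sand')]
```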

    2.2.3 Mid-level Features and Implicit Spatial Relationships

Oliva and Torralba [46; 47] propose what they call a scene-centered description of images. They use an underlying framework of low-level features (multiscale Gabor filters), coupled with supervised learning, to estimate the spatial envelope properties of a scene. They classify images with respect to verticalness (vertical vs. horizontal), naturalness (vs. man-made), openness (presence of a horizon line), roughness (fractal complexity), busyness (sense of clutter in man-made scenes), expansion (perspective in man-made scenes), ruggedness (deviation from the horizon in natural scenes), and depth range.

Images are then projected into this 8-dimensional space, in which the dimensions correspond to the spatial envelope features. They measure their success first on individual dimensions through a ranking experiment. They then claim that their features are highly correlated with the semantic categories of the images (e.g., highway scenes are open and exhibit high expansion), demonstrating some success on their set of images. It is unclear how their results generalize.

One observation they make is that their scene-centered approach is complementary to an object-centered approach like ours.

    2.2.4 Semantic Features without Spatial Relationships

    Semantic Features for Indoor Vs. Outdoor Classification

Luo and Savakis extended the method of [65] by incorporating semantic material detection [37]. A Bayesian network was trained for inference, with evidence coming from low-level (color, texture) features and semantic (sky, grass) features. Detected semantic features (which are not completely accurate) produced a gain of over 2%, and best-case (100% accurate) semantics gave a gain of almost 8% over low-level features alone. The network used conditional probabilities of the form P(sky present | outdoor). While this work showed the advantage of using semantic material detection for certain types of scene classification, it stopped short of using spatial relationships.
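To illustrate the form of inference at a single node of such a network, here is Bayes' rule applied to one piece of semantic evidence; all probabilities are invented for illustration, not taken from [37]:

```python
# Hypothetical parameters: a prior and conditional probabilities of the
# form P(sky present | outdoor), as used as evidence in such a network.
p_outdoor = 0.5
p_sky_given_outdoor = 0.8
p_sky_given_indoor = 0.1

# Bayes' rule: P(outdoor | sky present).
p_sky = (p_sky_given_outdoor * p_outdoor
         + p_sky_given_indoor * (1 - p_outdoor))
p_outdoor_given_sky = p_sky_given_outdoor * p_outdoor / p_sky
print(round(p_outdoor_given_sky, 3))  # 0.889
```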

    Semantic Features for Image Retrieval

    Song and Zhang investigate the use of semantic features within the context of

    image retrieval [70]. Their results are impressive, showing that semantic features

    greatly outperform typical low-level features, including color histograms, color

    coherence vectors, and wavelet texture for retrieval.

They use the illumination topology of images (using a variant of contour trees) to identify image regions, and combine this with other features to classify the regions into semantic categories such as sky, water, trees, waves, placid water, lawn, and snow.


    While they do not apply their work directly to scene classification, their success

    with semantic features confirms our hypothesis that they help bridge the semantic

    gap between pixel-representations and high-level understanding.

    2.2.5 Semantic Features and Explicit Spatial Relationships

Mulhem, Leow, and Lee [44] present a novel variation of fuzzy conceptual graphs for use in scene classification. Conceptual graphs are used for representing knowledge in logic-based applications, since they can be converted to expressions of first-order logic. Fuzzy conceptual graphs extend this by adding a method of handling uncertainty.

A fuzzy conceptual graph is composed of three elements: a set of concepts (e.g., mountain or tree), a set of relations (e.g., smaller than or above), and a set of relation attributes (e.g., the ratio of two sizes). Any of these elements that contains multiple possibilities is called fuzzy, while one that does not is called crisp.

Model graphs for prototypical scenes are hand-crafted, and contain crisp concepts and fuzzy relations and attributes. For example, a mountain-over-lake scene must contain a mountain and water, but the spatial relations are not guaranteed to hold. A fuzzy relation such as smaller than may hold most of the time, but not always.

Image graphs contain fuzzy concepts and crisp relations and attributes. This is intuitive: while a material detector calculates the boundaries of objects and can therefore calculate relations (e.g., to the left of) between them, it can be uncertain as to the actual classification of the material (consider the difficulty of distinguishing between cloudy sky and snow, or between rock and sand). The ability to handle uncertainty on the part of the material detectors is an advantage of this framework.


Two subgraphs are matched using graph projection, a mapping such that each part of a subgraph of the model graph exists in the image graph, and a metric for linearly combining the strength of match between concepts, relations, and attributes. A subgraph isomorphism algorithm is used to find the subgraph of the model that best matches the image.

The basic idea of the algorithm is to decompose the model and image into arches (two concepts connected by a relation), seed a subgraph with the best matching pair of arches, and incrementally add other model arches that match well.

They found that the image matching metric worked well on a small database of two hundred images and four scene models (of mountain/lake scenes) generated by hand. Fuzzy classification of materials was done using color histograms and Gabor texture features. The method of generating confidence levels for the classification is not specified.

    While the results look promising for mountain/lake scenes, it remains to be

    seen how well this approach will scale to a larger number of scene types.

    2.2.6 Summary of Scene Classification Systems

    Referring back to the summary of prior work in semantic scene classification given

    in Table 2.3, we see that our work is closest to that of Mulhem, et al., but differs in

    one key aspect: while theirs is logic-based, our proposed method is founded upon

    probability theory, leading to principled methods of handling variability in scene

    configurations. Our proposed method also learns the model parameters from a

    set of training data, while theirs are fixed.


    2.3 Options for Computing Spatial Relationships

    If we are to utilize spatial relationships between regions, we need a method for

    computing and encoding these relationships. We discuss both qualitative and

    quantitative spatial relationships.

    2.3.1 Computing Qualitative Spatial Relationships

We start by considering three models of computing qualitative spatial relationships: Attentional Vector Sum, a biologically-inspired model; Weighted Walkthroughs, a model developed for occluded regions; and a hybrid model produced for efficient computation.

    Attentional Vector Sum

Regier and Carlson [55] propose a computational model of spatial relations based on human perception. They consider up, down, left, and right to be symmetric, and so focus their work on the above relation.

They call the reference object the landmark and the located object the trajector. For example, the ball (trajector) is above the table (landmark). The model is designed to handle 2D landmarks, but only point trajectors. However, the researchers state that they are in the process of extending the model.

    Four models are compared:

1. Bounding box (BB). A is above B if it is higher than the landmark's highest point and between its leftmost and rightmost points. The strength of the match varies depending on the height and how centered it is above the object; three parameters govern how quickly the match drops off, via sigmoidal functions. (A minimal sketch of this model appears after this list.)


2. Proximal and Center-of-Mass (PC). Here, the projection is defined based on the angle formed by the y-axis and the line connecting the trajector to the landmark. Connecting it to the closest point on the landmark gives the proximal angle, and to the centroid gives the center-of-mass angle. This model has four parameters: a gain, the slope and y-intercept of the piecewise function for the goodness of the angle, and the relative weight given to the components corresponding to the two angles.

3. Hybrid Model (PC-BB). This model extends the PC model by adding the BB model's height term. The height term captures the presence of a grazing line at the top of the landmark, an effect that was observed experimentally. The model has four parameters: the slope, y-intercept, and relative weight of the PC model, plus the gain on the height function's sigmoid.

4. Attentional Vector Sum (AVS). This model incorporates two human perceptual elements:

   Attention. Visual search for a target in a field of distractors is slow when targets differ from distractors only in the spatial relation among their elements (i.e., they do not exhibit pop-up). Therefore, they require attention.

   Vector sum. Studies of orientation cells in the monkey cortex show that directions were modeled by a vector sum of the cells.

In the AVS model, the angle between the landmark and the trajector is calculated as the weighted sum of angles between the points in the landmark area and the trajector. The weights in the sum are related to attention. The center of attention on the landmark is the point closest to the trajector; its angle receives the most weight. As the landmark points get further from the center of attention, they are weighted less, dropping off exponentially.


Lastly, the BB model's height function is used again (for which they can give no physiological or perceptual basis; it is included only because it was observed experimentally). The model has four parameters: the width of the beam of attention, the slope and y-intercept of the linear function relating angle to match strength (as in the PC model), and the gain on the height function's sigmoid.
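To make the flavor of these models concrete, here is a minimal sketch of the bounding-box model (model 1) for a point trajector. The sigmoid gains are invented, and the published model's three fitted drop-off parameters are collapsed into two here:

```python
import math

def sigmoid(x, gain=1.0):
    return 1.0 / (1.0 + math.exp(-gain * x))

def above_bb(trajector, bbox, g_height=2.0, g_side=4.0):
    """Bounding-box model of 'above': the match is strong when the point
    trajector is higher than the landmark's highest point and horizontally
    between its leftmost and rightmost points, dropping off sigmoidally.

    trajector: (x, y) with y increasing upward; bbox: (xmin, ymin, xmax, ymax).
    """
    x, y = trajector
    xmin, ymin, xmax, ymax = bbox
    height_term = sigmoid(y - ymax, g_height)    # above the top edge?
    side_term = (sigmoid(x - xmin, g_side)       # right of the left edge,
                 * sigmoid(xmax - x, g_side))    # and left of the right edge
    return height_term * side_term

landmark = (0.0, 0.0, 4.0, 1.0)
print(above_bb((2.0, 3.0), landmark))  # high: centered and well above
print(above_bb((6.0, 3.0), landmark))  # low: off to the side
```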

Optimal parameters for each model were found by fitting the model with another researcher's data set. A series of experiments was then performed to distinguish between the models. The AVS model fit each experiment's data most consistently.

The AVS method also gives a measure of how "above" one region is compared to another. This measure may potentially be used to our advantage.

However, as given, AVS may be too computationally expensive. Where ni is the number of points in region i, finding the points on the perimeter of each region is O(n1 + n2), giving p perimeter points. Finding the closest point between landmark and trajector yields O(p) distance computations. Integrating over each region yields O(n1 · n2) distances. However, if using a larger step size would not substantially reduce accuracy, we could reduce the computation significantly.

    Weighted Walkthroughs

Berretti et al. [5] developed a technique named weighted walkthroughs to calculate spatial relations. The method is designed to compare segmented regions created by color backpropagation, and therefore has the advantage of handling landmarks or trajectors that are made of multiple regions. This may be important in natural images, where large regions are sometimes occluded.

The method is straightforward: consider two regions A and B. All pairs of points (a, b) in the set S = {(a, b) | a ∈ A, b ∈ B} are compared (a walkthrough of each of the regions). For some pairs, a will lie northeast of b; for others, a will lie SE, and so on for the four quadrants. The fraction of pairs falling in each of the four quadrants is computed, giving four weights: w_NE, w_NW, w_SE, and w_SW.

Finally, these can be converted to above/below, left/right, and diagonality measures: above = w_NE + w_NW, right = w_NE + w_SE, and diagonality = w_NE + w_SW.

One advantage of this method is its ability to handle 2D, occluded (i.e., disconnected) landmarks and trajectors.
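The computation is easy to sketch. The version below assumes regions are given as arrays of (x, y) pixel coordinates with +y pointing north; pairs lying exactly on an axis are simply ignored here, whereas a full implementation would assign them fractional weights:

    import numpy as np

    def weighted_walkthroughs(A, B):
        """Fractions of point pairs (a, b), a in A, b in B, with a lying in
        each quadrant (NE, NW, SE, SW) relative to b."""
        dx = A[:, 0][:, None] - B[:, 0][None, :]   # x_a - x_b for all pairs
        dy = A[:, 1][:, None] - B[:, 1][None, :]   # y_a - y_b for all pairs
        w_NE = np.mean((dx > 0) & (dy > 0))
        w_NW = np.mean((dx < 0) & (dy > 0))
        w_SE = np.mean((dx > 0) & (dy < 0))
        w_SW = np.mean((dx < 0) & (dy < 0))
        return w_NE, w_NW, w_SE, w_SW

    w_NE, w_NW, w_SE, w_SW = weighted_walkthroughs(
        np.array([[3, 9], [6, 9]]), np.array([[4, 2], [5, 2]]))
    above = w_NE + w_NW          # 1.0: region A lies entirely above B
    right = w_NE + w_SE          # 0.5: A straddles B horizontally
    diagonality = w_NE + w_SW    # 0.5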

    Hybrid Approach

    In their research, Luo and Zhu [40] use a hybrid approach, combining the bounding

    box and weighted walkthrough methods. The method was designed for modeling

    spatial relations between materials in natural scenes, and so favors above/below

    calculations. It skips weighted walkthroughs when object bounding boxes do not

    overlap. It does not handle some obscure cases correctly, but is fast and correct

    when used in practice.

The final decision of which spatial relationship model would be most appropriate for my work depends in large part on whether the AVS method can be extended

    to 2D trajectors while being made computationally tractable. One answer may

    be to work on a higher conceptual level than individual pixels.

    2.3.2 Computing Quantitative Spatial Relationships

    While computationally more expensive and possibly too sensitive, more detailed

    spatial information may be necessary to distinguish some scene types. Rather

than just encoding the direction (such as "above") in our knowledge framework,

    we could incorporate a direction and a distance. For example, Rimey encoded

    spatial relationships using an expected area net [56].


    In the limit, we may wish to model the class-conditional spatial distributions as

    continuous variables. There is a body of literature addressing the issue of efficient

    inference of continuous variables in graphical models (e.g. Bayes Nets), if the

    distributions are assumed Gaussian (e.g., [33]). Felzenszwalb and Huttenlocher

    [19] used Gaussian models in their object recognition system, which we will review

    in the next section.
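As a hedged illustration of what such a continuous model might look like, the sketch below scores the 2-D offset between a trajector and a landmark centroid under a class-conditional Gaussian whose mean and covariance would be learned from training scenes; the function and its parameterization are hypothetical:

    import numpy as np

    def gaussian_spatial_score(offset, mean, cov):
        """Log-likelihood of a trajector-landmark centroid offset under a
        learned 2-D Gaussian spatial model."""
        d = offset - mean
        inv_cov = np.linalg.inv(cov)
        log_norm = np.log((2 * np.pi) ** 2 * np.linalg.det(cov))
        return -0.5 * (d @ inv_cov @ d + log_norm)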

    Another option, if a lattice-structured Markov Random Field framework is

    used (also discussed in the next section), would be to use the spatial relationships

    that arise from the lattice structure.

    2.4 Probabilistic Graphical Models

    Early research involving spatial relationships between objects used logic [2]. This

    approach was not particularly successful on natural scenes: while logic is certain,

    life is uncertain. In an attempt to overcome this limitation, more recent work has

    extended the framework to use fuzzy logic [44].

However, Pearl [50] argues that logic cannot be extended to the uncertainty of life, where many rules have exceptions. The rules of Boolean logic contain no method of combining exceptions. Furthermore, logical inference proceeds in stages, allowing for efficient computation; we would like to handle uncertain evidence incrementally as well. But unless one makes strict independence assumptions, this is impossible with logic, and computing the effect of evidence in one global step is intractable.

Logic-based (syntactic or rule-based) systems combine beliefs numerically. The uncertainty of a formula is calculated as a combination of the uncertainties of its sub-formulas. Computationally, this approach mirrors the process of logical inference, leading to an efficient, modular scheme: rules can be combined regardless of other rules and regardless of how each rule was derived. Semantically, however, these assumptions are too strong except under the strictest independence assumptions, and they cause the following problems.

1. Bidirectional inferences. Semantically, if we have the rule fire → smoke and we observe smoke, then fire should become more plausible. However, in a rule-based system, using the rule in both directions would introduce a loop.

2. Limits of modularity. Consider the rules alarm → burglar and alarm → earthquake. Due to the modular nature of logic, if alarm becomes more plausible, then burglar should become more plausible as well. However, under plausible reasoning, if we add evidence for earthquake, then alarm becomes more plausible while burglar becomes less plausible (the earthquake explains away the alarm), which corresponds with human intuition.

3. Correlated evidence. The rules of logic cannot handle multiple pieces of evidence originating from a single source. As an example, one should not independently increase one's belief in an event based on many local news stories that merely echo the Associated Press.

    Some attempts have been made to overcome this last limitation, such as bounds

    propagation or user-defined combination; however, each approach introduces fur-

    ther difficulties.

We are fully aware that there is not universal philosophical agreement with Pearl regarding the superiority of probability over logic. (Witness the heated rebuttals to Cheeseman's argument for probability [14] by the logic community!) Still, we think his arguments are sound.

    Specifically, Pearl argues for a graphical model-based approach founded on

    probability calculus. While he elaborated on Bayesian Networks in [50], we also

    consider Markov Random Fields (MRF), another probabilistic graphical model


    that has been used primarily for low-level vision problems (finding boundaries,

    growing regions), but has recently been used for object detection.

In general, graphical probability models provide a distinct advantage in problems of inference and learning: the explicit representation of statistical independence assumptions. In a graphical model, nodes represent random variables and edges represent dependencies between those variables. Ideally, nodes are connected by an edge if and only if their variables are directly dependent; however, many models capture only one direction of this implication.

Sparse graphs, in particular, benefit from the message-passing algorithms used to propagate evidence around the network. While calculating a joint probability distribution takes exponential space in general (and marginals are difficult to compute), these calculations are much cheaper in certain types of graphs, as we will see.

    2.4.1 Bayesian Networks

    Bayesian (or belief) networks are used to model causal probabilistic relationships

    [13] between a system of random variables. The causal relationships are repre-

    sented by a directed acyclic graph (DAG) in which each link connects a cause (the

    parent node) to an effect (the child node). The strength of the link between

    the two is represented as the conditional probability of the child given the parent.

    The directed nature of the graph allows conditional independence to be specified;

    in particular, a node is conditionally independent of all of its non-successors, given

    its parent(s).

    The independence assumptions allow the joint probability distribution of all

    of the variables in the system to be specified in a simplified manner, particularly

    if the graph is sparse.

    Specifically, the network consists of four parts, as follows [66]:


Prior probabilities are the initial beliefs about the root node(s) in the network when no evidence is presented.

Each node has a conditional probability matrix (CPM) associated with it, representing the causality between the node and its parents. These can be assigned by an expert or learned from data.

Evidence is the input presented to the network. Nodes can be instantiated (by setting the belief in one of their hypotheses to 1) or set to fractional (uncertain) beliefs (via virtual evidence [50]).

Posteriors are the output of the network. Their values are calculated from the product of priors and likelihoods arising from the evidence (as in Bayes rule).
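A toy numeric illustration of how these four parts combine, with made-up numbers for a hypothetical two-class scene node and a single material detector:

    import numpy as np

    # Hypothetical two-node network: Scene -> SkyDetector.
    prior = np.array([0.7, 0.3])        # priors: P(scene) over {beach, indoor}
    cpm = np.array([[0.9, 0.1],         # CPM: P(detector | scene); rows index
                    [0.2, 0.8]])        # the scene, columns the detector output
    likelihood = cpm[:, 0]              # evidence: the detector reports "sky"
    posterior = prior * likelihood      # Bayes rule: posterior ∝ prior × likelihood
    posterior /= posterior.sum()        # approx [0.913, 0.087]: beach more likely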

    The expressive power, inference schemes and associated computational com-

    plexity all depend greatly on the density and topology of the graph. We discuss

three categories: tree, polytree, and general DAG.

    Trees

    If the graph is tree-structured, with each node having exactly one parent node,

each node's exact posterior belief can be calculated quickly and in a distributed

    fashion using a simple message-passing scheme. Feedback is avoided by separating

    causal and diagnostic (evidential) support for each variable using top-down and

    bottom-up propagation of messages, respectively.

    The message-passing algorithm for tree-structured Bayesian networks is simple

    and allows for inference in polynomial time. However, its expressive power is

    somewhat limited because each effect can have only a single cause. In human

    reasoning, effects can have multiple potential causes that are weighed against one

    another as independent variables [50].


    Causal Polytrees

    A polytree is a singly-connected graph (one whose underlying undirected graph

    is acyclic). Polytrees are a generalization of trees that allow for effects to have

    multiple causes.

    The message-passing schemes for trees generalize to polytrees, and exact pos-

    terior beliefs can be calculated. One drawback is that each variable is conditioned

on the combination of its parents' values. Estimating the values in the condi-

    tional probability matrix may be difficult because its size is exponential in the

    number of parent nodes. Large numbers of parents for a node can induce con-

    siderable computational complexity, since the message involves a summation over

    each combination of parent values.

    Models for multicausal interactions, such as the noisy-OR gate, have been

    developed to solve this problem. They are modeled after human reasoning and

    reduce the complexity of the messages from a node to O(p), linear in the number

    of its parents. The messages in the noisy-OR gate model can be computed in

    closed form (see [50]).
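For concreteness, here is a minimal sketch of the standard noisy-OR combination, in which q[i] is the probability that parent i alone produces the effect; the single loop makes the O(p) cost evident:

    def noisy_or(q, parent_on):
        """P(effect = 1 | parents): each active parent independently
        fails to cause the effect with probability 1 - q[i]."""
        p_all_fail = 1.0
        for q_i, on in zip(q, parent_on):
            if on:
                p_all_fail *= (1.0 - q_i)
        return 1.0 - p_all_fail

    p = noisy_or(q=[0.9, 0.7], parent_on=[True, True])   # 1 - 0.1*0.3 = 0.97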

    A nice summary of the inference processes for trees and polytrees given in [50]

    can be found in [66].

    General Directed Acyclic Graphs

    The most general case is a DAG that contains undirected loops. While a DAG

    cannot contain a directed cycle, its underlying undirected graph may contain a

    cycle, as shown in Figure 2.1.

    Loops cause problems for Bayesian networks, both architectural and semantic.

    First, the message passing algorithm fails, since messages may cycle around the

    loop. Second, the posterior probabilities may not be correct, since the conditional

    independence assumption is violated. In Figure 2.1, variables B and C may be


Figure 2.1: A Bayes net with a loop. (Node A is the parent of B and C, which are both parents of D.)

    conditionally independent given their common parent A, but messages passed

    through D from B will also (incorrectly) affect the belief in C.

    There exist a number of methods for coping with loops [50]. Two methods,

clustering and conditioning, are tractable only for sparse graphs. Another method,

    stochastic simulation, involves sampling the Bayesian network. We use a simple

    top-down version, called logic sampling, as a generative model and describe it in

    Section 3.2.2. However, it is inefficient in the face of instantiated evidence, since

    it involves rejecting each sample that does not agree with the evidence.
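A minimal sketch of logic sampling over binary nodes, with rejection used to condition on evidence, makes the inefficiency plain: every sample that contradicts the evidence is discarded. The node ordering and CPT encoding are assumptions of this sketch:

    import random

    def logic_sample(order, parents, cpt):
        """One top-down sample; cpt[n] maps parent values -> P(n = 1)."""
        vals = {}
        for n in order:                       # nodes in topological order
            p1 = cpt[n](tuple(vals[p] for p in parents[n]))
            vals[n] = 1 if random.random() < p1 else 0
        return vals

    def estimate(order, parents, cpt, evidence, query, n=20000):
        """P(query = 1 | evidence), discarding inconsistent samples."""
        kept = hits = 0
        for _ in range(n):
            s = logic_sample(order, parents, cpt)
            if all(s[k] == v for k, v in evidence.items()):
                kept += 1
                hits += s[query]
        return hits / kept if kept else float('nan')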

Finally, the methods of belief propagation and generalized belief propagation, in which the loops are simply ignored, have been applied with success in many cases [82] and are worth further investigation. We discuss these methods in the context

    of MRFs in the next section.

    Applications of Bayesian Networks

    In computer vision, Bayesian networks have been used in many applications in-

    cluding indoor vs. outdoor image classification [37; 48], main subject detection

    [66], and control of selective perception [57]. An advantage of Bayesian networks


    is that they are able to fuse different types of sensory data (e.g. low-level and

    semantic features) in a well-founded manner.

    2.4.2 Markov Random Fields

    Markov Random Fields (MRFs), or Markov Networks, model a set of random

    variables as nodes in a graph. Dependencies between variables are represented

    by undirected arcs between the corresponding nodes in the graph. The topology

of the network explicitly identifies independence assumptions: the absence of an arc between two nodes indicates that the nodes are assumed to be conditionally independent given their neighbors. MRFs are used extensively for problems in low-

    level computer vision and statistical physics. MRFs provide a framework to infer

    underlying global structure from local observations.

    We now discuss the basic concepts of MRFs, drawing from our in-house review

    [7] of the typical treatments in the literature [30; 16; 22; 15].

Random Field A set of random variables X = {x_i}.

Graphical Model A random field X may be represented as a graphical model G = (Q, E) composed of a set of nodes Q and edges E connecting pairs of nodes. A node i ∈ Q represents the random variable x_i ∈ X. An edge (i, j) ∈ E connecting nodes i and j indicates a statistical dependency between random variables x_i and x_j. More importantly, the lack of an edge between two graph nodes indicates an assumption of independence between the nodes given their neighbors.

Configuration For a random field X of size n, a configuration ω of X assigns a value (x_1 = ω_1, x_2 = ω_2, . . . , x_n = ω_n) to each random variable x_i ∈ X. P(ω) is the probability density function over the set Ω = {ω} of all possible configurations of X.


Neighborhood Relationship We define a neighborhood relationship N on a field X as follows. Let E be a set of ordered pairs representing connections (typically probabilistic dependencies) between elements of X. Then for any x_i, x_j ∈ X, x_j ∈ N_i ⇔ (x_i, x_j) ∈ E.

Markov Property [30]

A variable x_i in a random field X satisfies the Markov property if

P(x_i = ω_i | x_j = ω_j, j ≠ i) = P(x_i = ω_i | x_j = ω_j, j ∈ N_i)

These probabilities are called local characteristics and intuitively describe a locality condition, namely that the value of any variable in the field depends only on its neighbors.

Positivity Condition The positivity condition states that for every configuration ω, P(X = ω) > 0.

    Markov Random Field Any random field satisfying both the Markov property

    and the positivity condition. Also called a Markov Network.

Two-Layer MRF [23; 16; 22]

"Two-layer" describes the network topology of the MRF. The top layer represents the input, or evidence, while the bottom layer represents the relationships between neighboring nodes (Figure 2.2).

In typical computer vision problems, inter-level links between the top and bottom layers enforce compatibility between the image evidence and the underlying scene. Intra-level links in the scene layer leverage a priori knowledge about relationships between parts of the underlying scene to enforce consistency between neighboring scene nodes [16].


Figure 2.2: Portion of a typical two-layer MRF. In low-level computer vision problems, the top layer (black) represents the external evidence of the observed image while the bottom layer (white) expresses the a priori knowledge about relationships between parts of the underlying scene.

    Pairwise MRF [23; 16; 82]

    In a pairwise MRF, the joint distribution over the MRF is captured by a set

    of compatibility functions that describe the statistical relationships between

    pairs of random variables in the MRF. For inferential purposes, this means

    that the graphical model representing the MRF has no cliques larger than

    size two.

Compatibility Functions The statistical dependency between two random variables x_i, x_j in a random field is characterized by a compatibility function ψ_{i,j}(ω_i, ω_j) that scores every possible pair of hypotheses (x_i = ω_i, x_j = ω_j). As an example, consider a link (i, j) in a graphical model G connecting nodes i and j. If there are three possible outcomes for x_i and two possible outcomes for x_j, the compatibility function relating i and j is a 3 × 2 matrix M.

Depending upon the problem, compatibilities may be characterized by the joint distribution of the two variables (or, equivalently, by both conditional distributions, since p(x, y) may be obtained from p(x|y) and p(y|x)). For some problems for which the joint is unobtainable, a single conditional distribution suffices (e.g., for a problem in which p(x|y) is known but p(y|x) cannot be computed).

    Inference

    In typical low-level computer vision applications of MRFs, what is desired from

    the inference procedure is the MAP estimate of the true scene (the labeling), given

    the observed data (the image). We have identified two complementary approaches

    in the literature for calculating the MAP estimate: deterministic techniques and

    Monte Carlo techniques (described later in this section).

    We start by reviewing two deterministic techniques: Belief Propagation and

    Highest Confidence First. The Highest Confidence First algorithm finds local

    maxima of the posterior distribution by using the principle of least commitment

[43], while belief propagation is an inexact inference procedure that applies message-passing algorithms, with success, to loopy networks by simply ignoring the loops.

    Highest Confidence First (HCF) [16]

The HCF algorithm is used for MAP estimation, finding local maxima of the posterior distribution. It is a deterministic procedure founded on the principle of least commitment. Scene nodes connected to image nodes with the strongest external evidence (i.e., those for which the likelihood of the maximum-likelihood hypothesis is large relative to the others) are committed first, since they are unlikely to change (based on compatibility with neighbors). Nodes with weak evidence commit later, based primarily on their compatibility with their committed neighbors.

Using an edge-modeling MRF as an example, large intensity gradients might constitute strong evidence at some of the network's nodes. The nodes with strong evidence should influence scene nodes with weaker evidence (via edge

    continuity constraints) more than the other way around.


Belief Propagation [82; 22; 29]

The Belief Propagation (BP) algorithm is a message-passing algorithm for probabilistic networks. It is a generalization of a number of inference algorithms such as the forward-backward algorithm, the Viterbi algorithm, Pearl's algorithm for Bayesian polytrees, and the Kalman filter.

At each iteration, each node computes a belief, which is a marginal of the joint probability. The belief is a function of the local compatibilities φ_i(x_i) (local evidence nodes, which are constant, can be subsumed into these) and the incoming messages m_{ji} from neighboring nodes:

b_i(x_i) = k φ_i(x_i) ∏_{j ∈ N(i)} m_{ji}(x_i)

The messages m_{ij} are computed from the compatibilities of the message's sender and recipient nodes and from the previous messages sent by the sender's other neighbors:

m_{ij}(x_j) = Σ_{x_i} φ(x_i, y_i) ψ(x_i, x_j) ∏_{k ∈ N(i)\j} m_{ki}(x_i)

Intuitively, the incoming messages represent combined evidence that has already propagated through the network.

In the rare case that the graph contains no loops, it can be shown that the marginals are exact. In loopy networks the evidence is double-counted, but some experimental work suggests that, at least for certain problems, the approximations are still good [81].

One can calculate the MAP estimate at each node by replacing the summations in the messages with max.
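The update equations translate almost directly into code. Below is a minimal sketch for discrete variables on a pairwise MRF; the array layout is an assumption, and the normalization is added only for numerical stability (it does not change the fixed points):

    import numpy as np

    def bp_message(psi_ij, phi_i, other_msgs):
        """Message m_{ij}: psi_ij[x_i, x_j] is the pairwise compatibility,
        phi_i[x_i] the local evidence at i, and other_msgs the messages
        m_{ki} from i's neighbors k != j."""
        prod = phi_i.copy()
        for m in other_msgs:
            prod *= m
        msg = psi_ij.T @ prod        # sum over x_i of psi(x_i, x_j) * prod(x_i)
        return msg / msg.sum()

    def belief(phi_i, incoming_msgs):
        """b_i: local evidence times all incoming messages, normalized."""
        b = phi_i.copy()
        for m in incoming_msgs:
            b *= m
        return b / b.sum()

Replacing the summation with a maximum, e.g. np.max(psi_ij * prod[:, None], axis=0), gives the max-product variant used for MAP estimation, as noted above.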

    Generalized belief propagation [82]


    Generalized belief propagation (GBP) uses messages from clusters of nodes

    to other clusters of nodes. Since these messages are expected to be more

    informative, the performance is also expected to increase.

GBP is theoretically justified: GBP is related to Kikuchi approximations in a manner analogous to the relationship between BP and the Bethe free energy. A set of beliefs gives a GBP fixed point in a graph if and only if the beliefs are local stationary points of the Kikuchi free energy (for details of free-energy minimization techniques, see [82]).

GBP has been found to perform much better than BP on graphs with short loops. The drawback is that the complexity is exponential in the cluster size; but again, if the graph has short loops (and thus necessitates only small clusters), the increase in complexity can be minimal.

Pearl's clustering algorithm is a special case of GBP, with clusters that are usually large, chosen to overlap in a fixed manner. It obtains increased accuracy, but at increased complexity.

    An advantage of GBP is that it can be used to vary the cluster size in order

    to make a trade-off between accuracy and complexity.

    Inference on Tree-Structured MRFs

    Felzenszwalb and Huttenlocher [19] use tree-structured MRFs for recognition of

    objects such as faces and people. They model their objects as a collection of

    parts appearing in a particular spatial arrangement. Their premise is that in a

    part-based approach, recognition of individual parts is difficult without context,and needs spatial context for more accurate performance.

    They model the expected part locations using a tree-structured two-layer

    MRF. In the scene layer, the nodes represent parts and the connections repre-


    sent general spatial relationships between the parts. However, rather than using

    the typical square lattice, they use a tree.

    Inference in the MRF is both exact and efficient, due to the tree structure.

    Their MAP estimation algorithm is based on dynamic programming and is very

    similar in flavor to the Viterbi algorithm for Hidden Markov Models. In fact, the

    brief literature in the field on using Hidden Markov Models for object and people

    detection [20] might be better cast in an MRF framework.

    Monte Carlo Methods

    Our treatment is taken in large part from [7; 45; 41; 23].

    Monte Carlo methods are used for sampling. The goal is to characterize a

    distribution using a set of well-chosen samples. These can be used for approximate

    MAP estimation, computing expectations, etc., and are especially helpful when

    the expectations cannot be calculated analytically.

    How the representative samples are drawn and weighted depends on the Monte

    Carlo method used. One must keep in mind that the number of iterations of the

    various algorithms that are needed to obtain independent samples may be large.

Monte Carlo Integration Monte Carlo integration is used to compute expectations of functions over probability distributions. Let p(x) be a probability distribution and a(x) a function of interest. We assume that the expectation ∫ a(x) p(x) dx cannot be evaluated analytically, but that p(x) is easy to sample from (e.g., Gaussian). To compute a Monte Carlo estimate of ∫ a(x) p(x) dx, we first create a representative sample of X's from p(x). There will be many X's from the regions of high probability density for p(x) (intuitively, the X's that should be common in the real world). We then calculate a(x) for each x in the sample. The average value of a(x) closely approximates the expectation.

    A key insight into this concept can be stated as follows:


The expectation of a function of random variables depends not only on the function's values, but on how often the random variables take those values!

    The drawback is that our assumption that p(x) is easy is often not valid;

    p(x) is often not easy to sample from, and so we need a search mechanism

    to draw good samples. Furthermore, we must be careful that this search

    mechanism does not bias the results.
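In the easy case, the procedure is only a few lines. The sketch below takes p(x) to be a standard Gaussian and a(x) = x^2, so the true expectation is the variance, 1, and the estimate is easy to check:

    import numpy as np

    rng = np.random.default_rng(0)
    xs = rng.normal(size=100_000)      # representative sample from p(x)
    estimate = np.mean(xs ** 2)        # Monte Carlo estimate of E[a(X)], ~1.0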

Markov Chain Monte Carlo (MCMC) Methods [45; 41]

In Markov chain Monte Carlo (MCMC) methods, the samples are drawn from the end of a random walk.

A Markov chain is a series of random variables X_0, X_1, . . . , X_t in which a locality condition is satisfied, that is,

P(X_{t+1} | X_t, X_{t-1}, . . . , X_0) = P(X_{t+1} | X_t)

The chain can be specified using the initial probability p_0(x) = P(X_0) and the transition probabilities P(X_{t+1} | X_t). The transition probability of moving from state x to state y at time t is denoted T_t(x, y), which can be summarized in a transition matrix T_t.

We consider homogeneous Markov chains, those in which the transition probabilities are constant, i.e., T_t = T for all t.

    We take an initial distribution across the state space (which, in the case of

    MRFs, is the set of possible configurations of the individual variables).

This distribution is multiplied by the matrix of transition probabilities repeatedly, each iteration yielding a new distribution. The theory of Markov chains guarantees that the distribution will converge to the true distribution if the chain is ergodic (i.e., it converges to the same stationary distribution regardless of the initial distribution).


    In practice, one generates a sample from the initial probability distribution

(e.g., uniform) and then moves through the state space stochastically: a random walk guided by the values in the transition matrix. Since the distri-

    bution is guaranteed to converge, at the end of the random walk, the sample

    should be from the actual distribution.

    The number of steps needed before convergence is reached is bounded by

    theoretical results, but varies in practice.

There are a number of MCMC algorithms:

    Gibbs Sampler. [23; 45; 41]

    In the Gibbs sampling algorithm, each step of the random walk is taken

    along one dimension, conditioned on the present values of the other

    dimensions. In a MRF problem, it is assumed that the conditional

    probabilities are known, since they are local (by the Markov property).

    This method was developed in the field of physics, but was first applied

    to low-level computer vision problems by Geman and Geman on the

    problem of image restoration [23]. Geman and Geman furthermore

    combined Gibbs sampling with simulated annealing to obtain not just

    a sample, but the MAP estimate of their distribution.

Their application of Gibbs sampling is also called stochastic relaxation (so as to differentiate it from deterministic relaxation techniques); a code sketch of one Gibbs sweep appears after this list.

    Metropolis Sampler. [45; 41]

    In each iteration of the Metropolis algorithm, one makes a small change

    from the current state, and accepts the change based on how good

    (probabilistically) the new state is compared to the old one.

Metropolis-Hastings Sampler. [41]

A generalization of the Metropolis algorithm that allows asymmetric proposal distributions.
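As promised under the Gibbs sampler above, here is a minimal sketch of one Gibbs sweep over binary variables; local_conditional is an assumed callback that evaluates the Markov-property conditional P(x_i = 1 | neighbors of i):

    import random

    def gibbs_sweep(x, local_conditional):
        """Resample each variable in turn from its local conditional; repeated
        sweeps form the random walk whose end yields a sample from the field."""
        for i in range(len(x)):
            p1 = local_conditional(i, x)      # P(x_i = 1 | current neighbors)
            x[i] = 1 if random.random() < p1 else 0
        return x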


    Importance Sampling [45; 41]

To sample from a distribution f(x), sample from a simpler distribution g(x) and weight each sample by the ratio f(x)/g(x). One caveat is that g(x) should have heavy tails (e.g., Cauchy), because g(x) should not equal zero anywhere the original distribution has non-zero probability.
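A self-normalized sketch (so that f need only be known up to a constant); f, a, g_sample, and g_pdf are assumed vectorized callables:

    import numpy as np

    def importance_estimate(a, f, g_sample, g_pdf, n=100_000):
        """Estimate E_f[a(X)] using samples from g, weighted by f/g."""
        xs = g_sample(n)                 # draws from the simpler distribution g
        w = f(xs) / g_pdf(xs)            # importance weights
        return np.sum(w * a(xs)) / np.sum(w)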

    Rejection Sampling [45; 41]

To sample from a distribution f(x), sample from another, similar distribution g(x) such that a constant multiple c of g(x) bounds the true distribution: f(x) ≤ c·g(x). Generate a point x from g(x) and accept it with probability f(x)/(c·g(x)), repeating until a point is accepted.
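A corresponding sketch, with f, g_sample, g_pdf, and the bound c assumed to satisfy f(x) ≤ c·g(x) everywhere:

    import random

    def rejection_sample(f, g_sample, g_pdf, c):
        """Draw one exact sample from f via the proposal g."""
        while True:
            x = g_sample()
            if random.random() < f(x) / (c * g_pdf(x)):
                return x                  # accepted; otherwise try again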