Probabilistic generative models for machine vision

Machine Vision BackgroundRegion Annotation by Hierarchical Region Topic Discovery

Probabilistic Generative Models for MachineVision

Davide Bacciu

bacciu@di.unipi.itDipartimento di Informatica

Università di Pisa

Corso di Intelligenza ArtificialeProf. Alessandro Sperduti

Università degli Studi di PadovaA.A. 2008/2009

Davide Bacciu Probabilistic Generative Models for Machine Vision

Outline

1 Machine Vision BackgroundIntroductionVisual Content RepresentationMachine Vision Applications

2 Region Annotation by Hierarchical Region Topic DiscoveryVisual Content RepresentationHierarchical Probabilistic Latent Semantic AnalysisExperimental Evaluation

IntroductionVisual Content RepresentationMachine Vision Applications

Introduction

Visual content representationVisual features - how to identify and describe the single bitsof visual informationImage representation - how to describe the visualinformation within a picture

Machine vision applicationsScene and object recognitionImage annotation

Focus on Probabilistic Generative Models and the Bag ofWords image representation

Visual Features Descriptors

How do we describe the single bits of visual information?

Global descriptorsColor histogramsShape featuresTextureSpatial Envelope

Local descriptorsGabor filtersDifferential filtersDistribution-baseddescriptors

Scale Invariant Feature Transform (SIFT)

Calculate gradient magnitudeand orientation at a givenkeypoint (x , y) and scale σCompute 3D histogram ofmagnitude orientations in a4× 4 neighborhood of thekeypointAngles are quantized to 8directionsRobust to geometric andillumination distortion

Visual Features Detectors

How do we identify the interesting/informative parts of animage?

Feature detectorsSegment/Blob detectorsCorner/Interest point detectorsAffine invariant keypoint detectorsDense sampling

Image Representation

Straightforward for global descriptorsAn image corresponds to its descriptorImages are confronted, classified and indexed based ontheir global color and/or texture histograms

Requires an additional step for local descriptorsLocal descriptors provide only information on a portion ofthe sceneProbabilistic modeling of segmentsBag of words

Bag of Words Document Representation

Count the occurrences of each dictionary word in yourdocumentRepresent the document as a vector of word counts

Definition (Bag of Words Assumption)

Word order is not relevant for determining document semantics

Bag of Words Image Representation

Each image is a document and each visual patch is a wordCount the occurrences of each dictionary visual word(visterm) in your imageRepresent the image as a vector of visterm counts(histogram)

Codebook

CollectionImage

Image Collection

Vector

k−means

Input Image

Local patches

Quantization

Image Understanding Applications

Images are typically represented as vector of featuresDiscover the semantics of an image from its vectorialrepresentation

Object recognition (e.g. Airplane)Scene recognition (e.g. Forest)Image annotation (e.g. Sky, Tree, Boat, Sea, ...)

Scene and Object Recognition

Structured approachesObject/scene categories arerepresented as collections ofconnected parts characterized bydistinctive appearance and spatialpositionPart-based probabilistic models

Weakly structured approachesBased on bag-of-word assumptionLatent aspect models

Place Context

PixelContext

Part Context

Object Context

Latent Aspect Discovery

Rooted into text processingProbabilistic models describing the generative process ofimage content using hidden (i.e. latent) variablesModel likelihood is used to measure the fit of the unknownparameters with respect to the observationsModel parameters are fitted by Expectation Maximization(EM) or variational methods

Probabilistic Latent Semantic Analysis (pLSA) (I)

P(d) P(z|d) P(w|z)

Observations are collected in the integer matrix n(wj ,di)

Every observation is associated to a latent topic k bymeans of the hidden variable zk

Each hidden topic k denotes a particular visual theme (e.g.an object)

Probabilistic Latent Semantic Analysis (pLSA) (II)

P(d) P(z|d) P(w|z)

pLSA is trained by Expectation Maximization (EM)Conditional independence is used to decompose modellikelihood

L(N|θ) =D∏

W∏j=1

P(wj ,di)n(wj ,di )

P(wj ,di) = P(di)P(wj |di) = P(di)K∑

P(zk |di)P(wj |zk )

An Object Recognition Example

Caltech-4 Dataset: cars, faces, motorbikes and airplanes on acluttered backgroundFit a pLSA model with 7 latent topics (4 objects + 3 background)Determine the most likely topic of each visual word in image di

P(zk |wj ,di) =P(zk |di)P(wj |zk )∑K

k=1 P(zk |di)P(wj |zk )

Figure: Images by Sivic et al 2005Davide Bacciu Probabilistic Generative Models for Machine Vision

Latent Dirichlet Allocation (LDA)

pLSA provides a generative model only for the training dataLDA treats P(zk |di) as a latent random variable, samplingit from a Dirichlet distribution P(θ|α)

Maximizes data likelihood

P(w |φ, α, β) =

∫ ∑z

P(w |z, φ)P(z|θ)P(θ|α)P(φ|β)dθ

Hierarchical Latent Topic Models...

Dependent Pachinko Allocation Model (Li and McCallum, 2006)

Image Annotation

Learn association between image content and textTypically exploits region-based image representation

Image segmentationColor and texture descriptors

Non-probabilistic image annotationInformation retrieval approaches (e.g. vector based)Multiple instance learning

Probabilistic Image AnnotationMachine translation approachProbabilistic generativeLatent topic models

Latent Aspect Models in Image Annotation

GM-LDA CORR-LDA

Based on a Mixture of Gaussians modeling of the Bd imageregions and a multinomial distribution for annotations Md

Image Annotation - Is it Worth the Hassle?

The Web is full of annotated multimedia information

How Informative is Image Representation?

BOW assumption is strong in the visual domain: spatialcontext matters!Region-based representation is often too coarse andunstable.How can we retain the robustness of BOW whileintroducing spatial context?

Visual Content RepresentationHierarchical Probabilistic Latent Semantic AnalysisExperimental Evaluation

Introduction

Overtake flat region-based and keypoint-basedapproachesStructured representation conveying richer semantics anddiscriminative power than a BOW model without the rigidityof a part-based approachKey idea

Introduce a multi-resolution representation of visual contentdrawing both on region-based and keypoint-based modelsIntroduce a multi-layered structure at the semantic level oflatent topic discovery

Multi-Resolution Image Representation

Multi-resolution V describing visual information at differentspatial granularityBOW representation is based on assumption that spatialdistribution of visual keywords can be neglected whendetermining the overall image appearance

Biased view holding mostly for corpora characterized bysmall variability of the pictorial contentNeed of structured approach that can isolate semanticallydifferent image portions

Text representationDocuments are collections of paragraphsParagraphs are collections of words

Region-Keypoints Representation

bag−of−words

bag−of−regions

segmented image ...

Use an unsupervised learning model to extract perceptuallycoherent portions of the image rl

Use dense sampling to extract a set of local visual descriptors

Hierarchical Region-Topic Probabilistic LatentSemantic Analysis (HRPLSA)

Complete model likelihood

LC(N|θ) =D∏

Ri∏l=1

K∑k=1

P(zk |di)Zilk

Wl∏j=1

T∑p=1

P(tp|zk )(φjp)Tkjp

Model Fitting

HRPLSA parameters are fitted by applying ExpectationMaximization to the complete log-likelihood, e.g.

P(tp|zk ) =

∑Di=1

∑Ril=1 P(zk |di , rl)

∑Wlj=1 niljP(tp|zk ,wj)∑D

i ′=1∑Ri′

l ′=1 P(zk |di ′ , rl ′)ni ′l ′·

Model parameters θP(zk |di) - Document-specific mixing proportions of regiontopicsP(tp|zk ) - Mixing proportions of keyword topics given regionaspectsφjp - Multinomial keyword-topic distribution

Probabilistic inference is extended to images outside thetraining pool by folding-in

Modeling Images and Words by HRPLSA(HRPLSA-ANN)

streetlight

awning

window

sidewalk

plants

building

P (a|t)

P (a|z)

w1 w2 wW1

w2 wWR

Connection between region-topics (keyword-topics) and word islearned by maximizing constrained annotation log-likelihood

Lcal(N|θ′) =D∑

F∑f=1

mfi logK∑

P(af |zk )P(zk |di)

Probabilistic InferenceA Few Examples

Most likely visual aspect within an image di

z i = arg maxzk∈Z

P(zk |di)

Most likely visual aspect z∗ for an image segment rl

z∗ = arg maxzk∈Z

P(zk |di , rl)

Most likely annotation a∗ for region rl in image dnew

a∗ = arg maxaf∈A

K∑k=1

P(af |zk )P(zk |dnew )

Unsupervised Discovery and Segmentation of VisualClasses

Microsoft Research Cambridge dataset (MSRC-B1)240 images and 9 visual classesGround truth segmentation of visual classes

Image segments are assigned to the most likelydiscovered aspect and segmentation overlap is measured

ρor =GTo ∩ Rr

GTo ∪ Rr.

Semantic Segmentation

Discovery of semantic visual aspects corrects segmentationerrors

MSRC-B1 Semantic Segmentation Result

Table: Segmentation Overlap on the MSRC-B1 Dataset

Algorithm Airplanes Bicycles Buildings Cars Cows Faces Grass Tree Sky Average

dLDA 0.43 0.50 0.16 0.45 0.52 0.44 0.60 0.71 0.74 0.51LDA10 0.11 0.06 0.09 0.14 0.11 0.40 0.41 0.69 0.41 0.27LDA15 0.08 0.56 0.06 0.14 0.14 0.43 0.57 0.62 0.41 0.33LDA20 0.10 0.50 0.40 0.15 0.58 0.45 0.54 0.46 0.50 0.40LDA25 0.14 0.52 0.21 0.17 0.48 0.44 0.45 0.59 0.50 0.39

HPRLSA1015 0.32 0.37 0.59 0.68 0.74 0.44 0.87 0.81 0.90 0.62

HPRLSA1020 0.35 0.56 0.65 0.67 0.74 0.47 0.89 0.47 0.91 0.63

HPRLSA1025 0.35 0.52 0.75 0.63 0.73 0.50 0.89 0.71 0.86 0.66

HPRLSA1520 0.36 0.49 0.56 0.60 0.68 0.52 0.89 0.77 0.91 0.64

HPRLSA1525 0.42 0.52 0.61 0.68 0.69 0.45 0.90 0.61 0.92 0.65

HPRLSA2025 0.44 0.43 0.48 0.52 0.67 0.48 0.88 0.75 0.91 0.61

HPRLSA1025 0.16 0.32 0.15 0.40 0.43 0.22 0.28 0.33 0.38 0.30

HPRLSA1525 0.24 0.31 0.13 0.38 0.37 0.19 0.25 0.21 0.45 0.28

Comparative analysis

HRPLSA

Multi-segmentation Latent Dirichlet Allocation (LDA)

Hierarchical Latent Dirichlet Allocation (hLDA)Davide Bacciu Probabilistic Generative Models for Machine Vision

MSRC-B1 Segmentation Example I

MSRC-B1 Segmentation Example II

MSRC-B1 Segmentation Example III

Image Annotation

Washington Ground Truth dataset854 images manually annotated with 392 words427 training/test images

LabelMe dataset2688 images annotated with ∼ 400 words800 training / 1888 test images

Comparative analysisHRPLSA-ANNpLSA-FEATURESAnnotation Propagation (AP)

Precision/recall analysis

Washington DatasetImage Annotation

Precision/Recall Plot

0 0.1 0.2 0.3 0.4 0.5 0.6 0.70.35

recall

plsa245

hrplsa245

plsa188

hrplsa188

plsa158

hrplsa158

Comparative precision/recall results on the Washington dataset

Washington DatasetImage Annotation Example

Washington DatasetRegion Annotation Example (I)

Examples of good HRPLSA-ANN annotations

clouds

people

skyclouds

clouds

clouds50

tree treetree

treetree

treesidewalk

building

window

treetree

building

Washington DatasetRegion Annotation Example (II)

Examples of bad HRPLSA-ANN annotations

flower

treetree

cloudsclouds clouds

skysky

cloudshouse

peopletree

people

overcast50

Washington DatasetAnnotated Segments

LabelMe DatasetImage Annotation

Precision/Recall Plot

0 0.1 0.2 0.3 0.4 0.5 0.6 0.70.35

Recall

plsa60

hrplsa60,80

plsa25

hrplsa25,50

LabelMe DatasetExample of Annotated Images

LabelMe DatasetRecalling Specific Terms

Balcony

precision recall0

balcony

hrplsaplsaap

precision recall0

hrplsaplsaap

LabelMe DatasetRegion Annotation Issue I

Region annotation tends to predict only a limited number ofwords

ing car

ad sea

window

0 100 200 3000

400coast

0 100 200 3000

2000forest

0 100 200 3000

1000highway

0 100 200 3000

2000 city

0 100 200 3000

1000mountain

0 100 200 3000

1000country

0 100 200 3000

2000street

0 100 200 3000

2000tallbuilding

LabelMe DatasetRegion Annotation Issue II

Distribution of region-topics in mountain caption

Topic 40

Topic 20

Topic 11

The unbalanced distribution of term occurrences introduces a bias inthe generation of the segment caption that yields to distinct regionaspects being associated to the same frequently co-occurring word

LabelMe DatasetAnnotated Segments

ConclusionConcluding RemarksBibliography and Online Resources

Wrapping up...

Inspiration can be sought in related application fieldsBag of words representationGenerative models for textNeed to pay attention to the underlying assumptions of amodel!

Hierarchical multi-resolution latent aspect model (HRPLSA)Unsupervised semantic segmentation of visual contentMulti-level image captioning

Future challengesGeneralize multi-resolution representation and topichierarchy beyond a two-layered structureVideo/multimedia understanding

Online ResourcesA notable introductory course to object recognition (with code)

http://people.csail.mit.edu/torralba/shortCourseRLOC/

Code repositories

http://www.cs.ubc.ca/~lowe/keypoints/ (OriginalSIFT software by D. Lowe)

http://vlfeat.org/ (SIFT, MSER and VQ routines)

www.robots.ox.ac.uk/~vgg/research/affine/ (Affinedetectors and more)

http://www.di.ens.fr/~russell/projects/mult_seg_discovery/index.html (Multiple segmentation LDA)

Image annotation tool and repository

http://labelme.csail.mit.edu/ (LabelMe)

Bibliography

D. Bacciu, A Perceptual Learning Model to Discover the Hierarchical Latent Structure of Image Collections,Ph.D. Thesis, 2008.

J. Sivic and A. Zisserman. Video google: a text retrieval approach to object matching in videos.Proceedings of the Ninth IEEE International Conference on Computer Vision, ICCV 2003, pages1470–1477, vol.2, Oct. 2003.

J. Sivic, B.C. Russell, A.A. Efros, A. Zisserman, and W.T. Freeman. Discovering objects and their location inimages. Proceedings of the Tenth IEEE International Conference on Computer Vision, ICCV 2005,370–377, Oct. 2005.

J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros. Unsupervised discovery of visualobject class hierarchies. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,2008.

K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. M. Blei, and M. I. Jordan. Matching words and pictures.Journal of Machine Learning Research, vol. 3, 1107–1135, Feb. 2003.

D. M. Blei and M. I. Jordan. Modeling annotated data. Proceedings of the International Conference onResearch and Development in Information Retrieval, SIGIR 2003, Aug. 2003.

F. Monay and D. Gatica-Perez. Modeling semantic aspects for cross-media image indexing. IEEETransactions on Pattern Analysis and Machine Intelligence, 29(10):1802–1817, Oct. 2007.

Thank You

Probabilistic generative models for machine vision

Education

Transcript of Probabilistic generative models for machine vision

Probabilistic Topic Modelscocosci.princeton.edu/tom/papers/SteyversGriffiths.pdf · 2008. 12. 19. · topic model is a generative model for documents: it specifies a simple probabilistic

Multimodal Probabilistic Model-Based Planning for Human ... · Multimodal Probabilistic Model-Based Planning ... instead to learn a generative model for human behavior ... another

Learning a Probabilistic Latent Space of Object Shapes …people.csail.mit.edu/tfxue/papers/nips2016_shape.pdfLearning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial

Generative Probabilistic Models of Goal-Directed Users in ... · Generative Probabilistic Models of Goal-Directed Users in Task-Oriented Dialogs Aciel Eshky T H E U NIVE R S I T Y

Classification: Probabilistic Generative Model

Learning recipe ingredient space using generative ...flavourspace.com/the_story_behind/wp-content/uploads/2014/07/flav… · iments using generative probabilistic models on recipe

Modals, conditionals, and probabilistic generative modelsweb.stanford.edu/~danlass/Paris-lectures/L1-slides.pdf · 2019. 12. 3. · Modals, conditionals, and probabilistic generative

Probabilistic Typology: Deep Generative Models of …jason/papers/cotterell+eisner.acl17.pdf · Probabilistic Typology: Deep Generative Models of Vowel Inventories ... (Gordon,2016).

Generative Multi -View Human Action Recognition › wp-content › uploads › 2020 › 04 › ...Center for Research In Computer Vision CAP 6412 – Advanced Computer Vision Generative

Using generative probabilistic models for multimedia retrieval · USING GENERATIVE PROBABILISTIC MODELS FOR MULTIMEDIA RETRIEVAL P RO E F S C H R I F T ter verkrijging van de graad

Generative Adversarial Networks in Computer Vision: A ...Generative Adversarial Networks in Computer Vision: A Survey and Taxonomy ZHENGWEI WANG, V-SENSE, School of Computer Science

Generative Probabilistic Models for Retrieval of Documents ...pto/proposal.pdf · Generative Probabilistic Models for Retrieval of Documents with Structure and Annotations Ph.D. Thesis

Exact Inference for Generative Probabilistic Non ...homepages.inf.ed.ac.uk/scohen/emnlp11nonproj.pdf · Exact Inference for Generative Probabilistic Non-Projective Dependency Parsing

Bayesian Optimization for Probabilistic Programs · Probabilistic programming systems (PPS) allow probabilistic models to be represented in the form of a generative model and statements

Generative probabilistic models for multimedia retrieval ... · This paper1 presents the use of generative probabilistic models for multimedia retrieval. We estimate Gaussian mixture

MODELING EARLY VISION: PROBABILISTIC COMPUTATION … · 2019-04-20 · MODELING EARLY VISION: PROBABILISTIC COMPUTATION USING SPIKING NEURONS, POPULATION CODES, AND CUDA by DANIEL

Probabilistic Forecasting of Sensory Data with Generative Adversarial Networks – ForGAN · 2019-04-01 · Digital Object Identiﬁer Probabilistic Forecasting of Sensory Data with

MODELING EARLY VISION: PROBABILISTIC …web.cecs.pdx.edu/~mm/MastersTheses/DanCoatesThesis.pdfMODELING EARLY VISION: PROBABILISTIC COMPUTATION USING SPIKING NEURONS, POPULATION CODES,

Probabilistic Solution of III-Posed Problems in Computational Vision

Progress on Generative Adversarial Networksmac.xmu.edu.cn/valse2017/ppt/APR/12GAN_zwm.pdf · Progress on Generative Adversarial Networks Wangmeng Zuo Vision Perception and Cognition