
Collective Vision: Using Extremely Large Photograph Collections

Mark Lenz

CameraNet Seminar

University of Wisconsin – Madison

January 26, 2010

Acknowledgments: These slides combine and modify slides provided by Yantao Zheng et al. (National University of Singapore/Google)

Introduction

• Distributed Collaboration

• Google Goggles

– Personal object recognition

• World-Wide Landmark Recognition

• Building Rome in a Day

– Distributed matching and reconstruction

Distributed Collaboration

• Disaster or emergency

– Time is of the essence

• Telecommunication networks down

• No maps or GPS

What can we do to help ourselves and those around us?

Mobile Phones for Distributed Collaboration

• Camera for collecting visual information

• Ad-hoc wireless LAN

– e.g. Bluetooth

Goals:

– Determine location, exits and hazardous paths

Have I or someone else been here before?

Model Scenarios

• Firefighters

• Trapped miners

• Natural disasters

– Large population exodus

– Building collapse

Multiple agents collaborating to traverse an unknown environment

Google Goggles

• Visual search using a picture as the query

• Combination of algorithms

– Object recognition

– Optical character recognition

– Geo-location (GPS & compass)

• Identify

– Books and products

– Businesses and landmarks

A World-Wide Landmark Recognition Engine with Web Learning

• Goal: Build a landmark recognition engine at earth-scale

Challenge I

No list of the world's landmarks exists. We only have noisy data on the Internet:

– Tourist web articles

– Tourist photos

– Geographical location

Challenge II

How to learn landmark visual models

Image search engine

Photo-sharing websites

Challenge III

• Efficiency

– Learning from enormous data

– Recognizing from a huge model

Discovering landmarks in the world

Two approaches:

– Photos on photo-sharing websites: geo-tagged

– Online tourist articles: landmark name

Learning landmarks from GPS-Tagged photos

GPS-tagged photos

20M images from picasa.com and panoramio.com

Geo-clustering

geo cluster = landmarks?

validate by photo authors

Noisy image pool

Visual clustering

Graph clustering based on local features

Validate by photo authors

Analyzing text tags

Compute frequency of n-grams of text tags
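
As a rough illustration of the n-gram frequency step, the sketch below counts word n-grams over the text tags of the photos in one cluster. The function name `ngram_counts`, the `max_n` limit and the sample tags are assumptions made for this example, not details from the slides; one plausible use of the counts is to pick a frequent n-gram as the cluster's name.

```python
from collections import Counter

def ngram_counts(tag_lists, max_n=3):
    """Count word n-grams (n = 1..max_n) across the text tags of a cluster."""
    counts = Counter()
    for tags in tag_lists:
        for tag in tags:
            words = tag.lower().split()
            for n in range(1, max_n + 1):
                for i in range(len(words) - n + 1):
                    counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical tags for three photos in one cluster.
cluster_tags = [
    ["Eiffel Tower", "Paris"],
    ["eiffel tower at night"],
    ["Paris", "tower"],
]
counts = ngram_counts(cluster_tags)
print(counts.most_common(4))
```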

Premise: Landmark photos are

• geographically adjacent

• visually similar

• uploaded by different users
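
The geographic and multi-author parts of this premise can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 200 m linking radius, the single-linkage grouping, the minimum of two authors, and the sample photos are all assumptions.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def geo_clusters(photos, radius_km=0.2):
    """Group photos whose GPS tags are linked by hops of at most radius_km
    (single-linkage grouping via union-find; O(n^2), fine for a toy example)."""
    parent = list(range(len(photos)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(photos)):
        for j in range(i + 1, len(photos)):
            if haversine_km(photos[i]["gps"], photos[j]["gps"]) <= radius_km:
                parent[find(i)] = find(j)

    clusters = {}
    for i, photo in enumerate(photos):
        clusters.setdefault(find(i), []).append(photo)
    return list(clusters.values())

def validated(clusters, min_authors=2):
    """Keep clusters whose photos come from at least min_authors distinct users."""
    return [c for c in clusters if len({p["author"] for p in c}) >= min_authors]

# Hypothetical photos: three near one spot by two users, two elsewhere by one user.
photos = [
    {"gps": (48.8584, 2.2945), "author": "alice"},
    {"gps": (48.8585, 2.2946), "author": "bob"},
    {"gps": (48.8583, 2.2944), "author": "alice"},
    {"gps": (40.6892, -74.0445), "author": "carol"},
    {"gps": (40.6893, -74.0446), "author": "carol"},
]
landmark_candidates = validated(geo_clusters(photos))
print(len(landmark_candidates))
```

Requiring several distinct authors is what discards the second group here: many photos from a single user say little about a place's popular appeal.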

Landmarks from GPS-Tagged photos

• ~20 million GPS-tagged photos

• 140k geo-clusters and 14k visual clusters

• 2240 landmarks from 812 cities in 104 countries

– biased distribution, mostly in Europe

United States 263

Spain 194

Italy 183

France 141

United Kingdom 136

Greece 51

Portugal 48

Russia 45

Austria 42

Learning landmarks from tourist web articles

Explore article corpus in wikitravel.com

Assume a geographical hierarchy

Landmark mining = named entity extraction

HTML is a structure tree

– Node: an HTML tag

– Value: text

Classify each tree node based on semantic clues embedded in the document structure


Heuristic rules:

– Nodes are in a "To See" or "See" section

– Nodes are children of "bullet list" nodes

– Nodes indicate bold font format
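
The three heuristic rules can be approximated with Python's standard `html.parser`. This is only a sketch under stated assumptions: it treats `<h2>` elements as section headings, `<li>` as bullet-list children and `<b>`/`<strong>` as bold text, and the sample page (including "Cafe Example") is made up; the real article markup may differ.

```python
from html.parser import HTMLParser

class LandmarkCandidateParser(HTMLParser):
    """Collect bold text inside list items of a "See" / "To See" section."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.in_see_section = False
        self.li_depth = 0
        self.bold_depth = 0
        self.heading_text = []
        self.bold_text = []
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True
            self.heading_text = []
        elif tag == "li":
            self.li_depth += 1
        elif tag in ("b", "strong"):
            self.bold_depth += 1
            if self.bold_depth == 1:
                self.bold_text = []

    def handle_endtag(self, tag):
        if tag == "h2":
            # A finished heading decides whether the following nodes are in "See".
            self.in_heading = False
            title = "".join(self.heading_text).strip().lower()
            self.in_see_section = title in ("see", "to see")
        elif tag == "li" and self.li_depth > 0:
            self.li_depth -= 1
        elif tag in ("b", "strong") and self.bold_depth > 0:
            self.bold_depth -= 1
            if self.bold_depth == 0:
                text = "".join(self.bold_text).strip()
                # All three rules must hold: See section, list child, bold.
                if text and self.in_see_section and self.li_depth > 0:
                    self.candidates.append(text)

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text.append(data)
        elif self.bold_depth > 0:
            self.bold_text.append(data)

page = """
<h2>Understand</h2><p><b>Paris</b> is the capital of France.</p>
<h2>See</h2>
<ul>
  <li><b>Eiffel Tower</b>, Champ de Mars.</li>
  <li><b>Louvre</b>, an art museum.</li>
  <li>Walk along the Seine.</li>
</ul>
<h2>Eat</h2><ul><li><b>Cafe Example</b></li></ul>
"""
parser = LandmarkCandidateParser()
parser.feed(page)
candidates = parser.candidates
print(candidates)
```

The extracted names are only candidates; as the next slide lines say, they still have to be validated by visual models.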

Extract all named entities as landmark candidates

Validate by visual models


~7000 landmarks from 787 cities in 145 countries

More evenly distributed

Unsupervised learning of landmark images

Geo-clusters

Landmarks from tour articles

Noisy image pool

Visual clustering

Premise: photos of the same landmark should be visually similar

Clustering based on local features

Validate and clean models

Visual model validates landmarks!

Photo vs. non-photo classifier to filter out noisy images


Local Feature Detection

• Find invariant and robust features

• Create distinctive feature descriptions

Laplacian-of-Gaussian (LoG)

• Scale-invariant edge detection

• Gaussian image filter to remove noise

• Laplacian filter to find areas of rapid change
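
To make the two filtering steps concrete, the sketch below builds a single sampled LoG kernel (Gaussian smoothing and the Laplacian combined in one filter) and applies it to a tiny image in pure Python. The kernel size, `sigma`, zero padding and the test image are assumptions chosen for illustration; a real detector would search over several scales.

```python
import math

def log_kernel(sigma, radius):
    """Sampled Laplacian-of-Gaussian kernel, shifted so that it sums to zero."""
    size = 2 * radius + 1
    k = [[0.0] * size for _ in range(size)]
    for y in range(-radius, radius + 1):
        for x in range(-radius, radius + 1):
            r2 = x * x + y * y
            k[y + radius][x + radius] = (
                -1.0 / (math.pi * sigma ** 4)
                * (1.0 - r2 / (2.0 * sigma ** 2))
                * math.exp(-r2 / (2.0 * sigma ** 2))
            )
    mean = sum(map(sum, k)) / (size * size)
    return [[v - mean for v in row] for row in k]

def filter2d(image, kernel):
    """Correlate image with kernel, treating pixels outside the image as 0."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy + r][dx + r]
            out[y][x] = acc
    return out

# A 9x9 dark image with a single bright pixel at row 4, column 5.
image = [[0.0] * 9 for _ in range(9)]
image[4][5] = 1.0
response = filter2d(image, log_kernel(sigma=1.0, radius=3))

# With this sign convention a bright spot gives a strongly negative response,
# so the interest point is where the response is smallest.
peak = min(((y, x) for y in range(9) for x in range(9)),
           key=lambda p: response[p[0]][p[1]])
print(peak)
```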

Local Feature Description

• Invariant and distinctive description

• Texture described by a 118-dimensional Gabor wavelet descriptor

Object matching based on local features

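$\mathrm{Sim}(I_i, I_j)$ = image match score

Image representation

– Interest points: Laplacian-of-Gaussian (LoG) filter

– Local feature: Gabor wavelets

match score = $1 - P_{FP}(I_i, I_j)$

– $P_{FP}(I_i, I_j)$: probability that the match of $I_i$ and $I_j$ is a false positive

– $\sum_{k=m}^{n} \binom{n}{k} p^{k} (1-p)^{n-k}$: probability that at least $m$ out of $n$ features match, if the match is a false positive

– $p$: probability of a feature match by chance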
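
The chance-match probability described above is a binomial tail, which is easy to compute directly. In the sketch below, `prob_at_least_m_by_chance` follows from the slide's wording; `match_score` is a hypothetical way to turn it into a score via Bayes' rule, and its prior and its assumption that a true match always yields at least m feature matches are mine, not the slides'.

```python
from math import comb

def prob_at_least_m_by_chance(n, m, p):
    """P(at least m of n features match) when each matches by chance with prob p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(m, n + 1))

def match_score(n, m, p, prior_true_match=0.5):
    """Hypothetical score: 1 - P(false positive | at least m of n features matched).

    Assumes a true match always produces at least m feature matches and uses
    an assumed prior; neither assumption comes from the slides.
    """
    p_obs_given_false = prob_at_least_m_by_chance(n, m, p)
    p_false = 1.0 - prior_true_match
    p_fp = (p_obs_given_false * p_false
            / (p_obs_given_false * p_false + prior_true_match))
    return 1.0 - p_fp

# More matched features out of the same n make a chance match less plausible.
print(match_score(n=20, m=2, p=0.05))
print(match_score(n=20, m=10, p=0.05))
```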

Constructing match region graph

Image matching

• Node is a match region

• 2 types of edges:

– match edge: measures match confidence

– overlap region edge: measures spatial overlap

Graph clustering on match regions

Distance between any two regions = shortest path connecting them
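
Shortest-path distances on such a graph can be computed with Dijkstra's algorithm. The sketch below is a generic version: the graph, the region names and the edge lengths are invented, and how match confidence and spatial overlap are converted into edge lengths is not specified in the slides.

```python
import heapq

def shortest_path_lengths(graph, source):
    """Dijkstra: shortest-path distance from source to every reachable node.

    graph maps node -> {neighbor: non-negative edge length}.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, length in graph[node].items():
            nd = d + length
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

# Hypothetical match region graph; edge lengths are illustrative only.
regions = {
    "r1": {"r2": 0.2, "r3": 0.9},
    "r2": {"r1": 0.2, "r3": 0.3},
    "r3": {"r1": 0.9, "r2": 0.3, "r4": 0.1},
    "r4": {"r3": 0.1},
    "r5": {},
}
dist_from_r1 = shortest_path_lengths(regions, "r1")
print(dist_from_r1)
```

Regions with no connecting path (here `r5`) simply do not appear in the result, which corresponds to an infinite distance.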

Why hierarchical agglomerative clustering rather than K-means, GMM, etc.?

Because we have no a priori knowledge of the number of clusters.

Intuitively, each cluster should correspond to one aspect of a landmark.
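
The point about not knowing the number of clusters can be seen in a small sketch: agglomerative clustering stops at a distance threshold instead of a cluster count. The single-linkage rule, the threshold and the 1-D toy data below are assumptions for illustration; the slides do not say which linkage is used.

```python
def agglomerative_clusters(items, distance, max_distance):
    """Single-linkage agglomerative clustering.

    Starts with one cluster per item and repeatedly merges the two closest
    clusters until the closest pair is farther apart than max_distance.
    """
    clusters = [[item] for item in items]

    def linkage(a, b):
        return min(distance(x, y) for x in a for y in b)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy 1-D "regions": the number of clusters is not given in advance.
points = [0.0, 0.1, 0.25, 5.0, 5.2, 9.0]
clusters = agglomerative_clusters(points, lambda a, b: abs(a - b), max_distance=0.5)
print(clusters)
```

K-means or a GMM would need the cluster count (three here) as an input, whereas the threshold lets it emerge from the data.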

Agglomerative hierarchical clustering

Match region graph → Visual clusters

Visual cluster example

Corcovado, Rio de Janeiro, Brazil

Acropolis, Athens, Greece

Visual cluster validation and cleaning

• Validate by authors or hosting websites of images

– Reflects the popular appeal of landmarks

• Filter out non-photographic images, such as maps and logos

– Train an AdaBoost classifier

– Features: color histogram, Hough transform, etc.

• Clean clusters by detecting large-area human faces

Efficiency issues

Issue 1: learning landmark images

– 21.4M photos

– Parallel computing to learn true landmark images

– Efficient hierarchical clustering

Issue 2: recognizing landmarks

– Recognition engine: ~5000 landmarks

– Indexing local features of the query image for matching: kd-tree indexing

– Query time: ~0.2 sec on a P4 computer
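
A kd-tree index for nearest-neighbor lookup of feature vectors can be sketched in a few lines. This toy version does exact search on random 4-dimensional vectors; the dimensionality, the data and the function names are assumptions, and a production system with high-dimensional descriptors would more likely use an approximate search.

```python
import random

def build_kdtree(points, depth=0):
    """Build a kd-tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Return the stored point closest to query (exact search)."""
    if node is None:
        return best
    point, axis, left, right = node
    if best is None or sq_dist(point, query) < sq_dist(best, query):
        best = point
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best so far.
    if diff ** 2 < sq_dist(best, query):
        best = nearest(far, query, best)
    return best

# Hypothetical 4-dimensional "descriptors" standing in for local features.
random.seed(0)
descriptors = [tuple(random.random() for _ in range(4)) for _ in range(200)]
tree = build_kdtree(descriptors)
query = (0.5, 0.5, 0.5, 0.5)
match = nearest(tree, query)
print(match)
```

The pruning test is what saves time: whole subtrees are skipped when they lie farther from the query than the best match found so far.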

Experiments: statistics of learned landmarks

              From photos   From articles   Total

Landmark #    2240          3246            5486

City #        812           626             1259

Country #     104           130             144

small overlap: 174 landmarks shared

China: 101 landmarks. Under-counted! Why?

U.S.: high Internet penetration rate and an enormous number of tour sites

Evaluation of landmark image learning

• Randomly select 1000 visual clusters

• 68 (6.8%) are outliers: maps, logos, human photos

• Apply photographic vs. non-photographic classifier

• 37 outliers remain: 6.8% → 3.7%

Evaluation of landmark recognition

• Positive testing images:

– 728 images from 124 landmarks

• Negative testing images:

– Caltech-256 (30,524) + Pascal VOC 07 (9,986) = 40,510 images

• For positive images:

– 417 images detected to be landmarks

– 337/417 (80.8%) are correct

– Identification rate: 337/728 (46.3%)

• For negative images:

– 463 images detected to be landmarks

– False acceptance rate: 1.1%
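
The reported rates follow directly from the counts on this slide; the short check below recomputes them. The variable names are my own, and the false acceptance rate is assumed to be detections divided by all negative test images.

```python
positives_total = 728        # positive test images
positives_detected = 417     # positives detected as some landmark
positives_correct = 337      # detections with the correct landmark
negatives_total = 30524 + 9986
negatives_detected = 463     # negatives wrongly detected as landmarks

correct_among_detected = positives_correct / positives_detected
identification_rate = positives_correct / positives_total
false_acceptance_rate = negatives_detected / negatives_total

print(f"correct among detected: {correct_among_detected:.1%}")
print(f"identification rate:    {identification_rate:.1%}")
print(f"false acceptance rate:  {false_acceptance_rate:.1%}")
```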

Landmarks can be similar!

Falsely detected images

• Match is technically correct, but the match region is not a landmark

– A problem of model generation

• Match is technically false, due to visual similarity

– A problem of image features and the matching mechanism
