Collective Vision: Using Extremely Large Photograph Collections Mark Lenz CameraNet Seminar University of Wisconsin – Madison January 26, 2010 Acknowledgments: These slides combine and modify slides provided by Yantao Zheng et al. (National University of Singapore/Google)

Page 1:

Collective Vision: Using Extremely Large Photograph Collections

Mark Lenz

CameraNet Seminar

University of Wisconsin – Madison

January 26, 2010

Acknowledgments: These slides combine and modify slides provided by Yantao Zheng et al. (National University of Singapore/Google)

Page 2:

Introduction

• Distributed Collaboration

• Google Goggles
  – Personal object recognition

• World-Wide Landmark Recognition

• Building Rome in a Day
  – Distributed matching and reconstruction

Page 3:

Distributed Collaboration

• Disaster or emergency
  – Time is of the essence

• Telecommunication networks down

• No maps or GPS

What can we do to help ourselves and those around us?

Page 4:

Mobile Phones for Distributed Collaboration

• Camera for collecting visual information

• Ad-hoc wireless LAN
  – e.g. Bluetooth

Goals:
  – Determine location, exits and hazardous paths

Have I or someone else been here before?

Page 5:

Model Scenarios

• Firefighters

• Trapped miners

• Natural disasters
  – Large population exodus
  – Building collapse

Multiple agents collaborating to traverse an unknown environment

Page 6:

Google Goggles

• Visual search using a picture as the query

• Combination of algorithms
  – Object recognition
  – Optical character recognition
  – Geo-location (GPS & compass)

• Identify
  – Books and products
  – Businesses and landmarks

Page 7:

A World-Wide Landmark Recognition Engine with Web Learning

• Goal: Build a landmark recognition engine at earth-scale

Page 8:

Challenge I

No list of the world's landmarks exists. We only have noisy data on the Internet:

• Tourist web articles

• Tourist photos

• Geographical location

Page 9:

Challenge II

How do we learn visual models of landmarks?

Image search engine

Photo-sharing websites

Page 10:

Challenge III

• Efficiency
  – Learning from enormous data
  – Recognizing from a huge model

Page 11:

Discovering landmarks in the world

Two approaches:

• Geo-tagged photos on photo-sharing websites

• Landmark names in online tourist articles

Page 12:

Learning landmarks from GPS-Tagged photos

GPS-tagged photos

20M images from picasa.com and panoramio.com

Geo-clustering

geo cluster = landmarks?

validate by photo authors

Noisy image pool

Visual clustering

Graph clustering based on local features

Validate by photo authors

Analyzing text tags

Compute frequency of n-grams of text tags
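The n-gram counting step can be sketched as follows. This is only an illustration: the function name `ngram_counts`, the sample tags, and the rule for picking a name (most frequent n-gram, ties broken in favour of the longer one) are assumptions and are not taken from the slides.

```python
from collections import Counter

def ngram_counts(tag_strings, max_n=3):
    """Count word n-grams (n = 1..max_n) across the text tags of one cluster."""
    counts = Counter()
    for text in tag_strings:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical tags attached to the photos of a single cluster.
tags = ["Eiffel Tower at night", "eiffel tower", "Paris Eiffel Tower", "my holiday"]
counts = ngram_counts(tags)

# Assumed selection rule: highest count first, then the longest n-gram.
best_name = max(counts, key=lambda g: (counts[g], len(g.split())))
```

Breaking ties toward longer n-grams keeps a full phrase ahead of its individual words when they occur equally often.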

Premise: Landmark photos are

• geographically adjacent

• visually similar

• uploaded by different users
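A minimal sketch of the geo-clustering and author-validation idea, assuming a simple distance threshold. The 0.5 km radius, the minimum of two distinct authors, the record layout and all function names are hypothetical. The pairwise loop is quadratic, so it only illustrates the idea and would not scale to 20M photos.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def geo_clusters(photos, radius_km=0.5):
    """Single-linkage grouping: photos closer than radius_km share a cluster."""
    parent = list(range(len(photos)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(photos)):
        for j in range(i + 1, len(photos)):
            if haversine_km(photos[i]["gps"], photos[j]["gps"]) < radius_km:
                parent[find(i)] = find(j)

    clusters = {}
    for i, photo in enumerate(photos):
        clusters.setdefault(find(i), []).append(photo)
    return list(clusters.values())

def landmark_candidates(photos, min_authors=2, radius_km=0.5):
    """Keep only geo-clusters whose photos come from enough distinct authors."""
    return [c for c in geo_clusters(photos, radius_km)
            if len({p["author"] for p in c}) >= min_authors]

# Hypothetical sample data: three users near one spot, one user elsewhere.
photos = [
    {"gps": (48.8584, 2.2945), "author": "alice"},
    {"gps": (48.8583, 2.2950), "author": "bob"},
    {"gps": (48.8586, 2.2940), "author": "carol"},
    {"gps": (43.0731, -89.4012), "author": "dave"},
    {"gps": (43.0732, -89.4010), "author": "dave"},
]
candidates = landmark_candidates(photos)
```

The second cluster is dropped because both of its photos come from the same author, which mirrors the "uploaded by different users" premise.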

Page 13:

Landmarks from GPS-Tagged photos

• ~20 million GPS-tagged photos

• 140k geo-clusters and 14k visual clusters

• 2240 landmarks from 812 cities in 104 countries
  – Biased distribution, mostly in Europe

Country           Landmarks
United States     263
Spain             194
Italy             183
France            141
United Kingdom    136
Greece            51
Portugal          48
Russia            45
Austria           42

Page 14:

Learning landmarks from tourist web articles

Explore article corpus in wikitravel.com

Assume a geographical hierarchy

Landmark mining = named entity extraction

HTML is a structure tree
  – Node: an HTML tag
  – Value: text

Classify each tree node based on semantic clues embedded in the document structure

Page 15:

Learning landmarks from tourist web articles

Heuristic rules:

• Nodes are in the "To See" or "See" section

• Nodes are children of "bullet list" nodes

• Nodes indicate bold font format

Extract all named entities as landmark candidates

Validate by visual models
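The three heuristic rules can be sketched with Python's standard `html.parser` module. The class name, the heading tags treated as section titles, and the sample page are assumptions made for illustration. Collecting bold text stands in here for the named entity extraction step, which the slides do not detail.

```python
from html.parser import HTMLParser

class LandmarkCandidateParser(HTMLParser):
    """Collect bold text inside bullet lists of a "See" / "To See" section."""
    SECTIONS = {"see", "to see"}

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.in_target_section = False
        self.list_depth = 0
        self.bold_depth = 0
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.in_heading = True
        elif tag in ("ul", "ol"):
            self.list_depth += 1
        elif tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self.in_heading = False
        elif tag in ("ul", "ol"):
            self.list_depth = max(0, self.list_depth - 1)
        elif tag in ("b", "strong"):
            self.bold_depth = max(0, self.bold_depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_heading:
            # Rule 1: remember whether the current section is "See" / "To See".
            self.in_target_section = text.lower() in self.SECTIONS
        elif self.in_target_section and self.list_depth and self.bold_depth:
            # Rules 2 and 3: inside a bullet list and in bold font.
            self.candidates.append(text)

# Hypothetical page fragment in the style of a travel article.
page = """
<h2>See</h2>
<ul><li><b>Eiffel Tower</b>, iron lattice tower.</li><li><b>Louvre</b> museum</li></ul>
<h2>Eat</h2>
<ul><li><b>Cafe X</b></li></ul>
"""
parser = LandmarkCandidateParser()
parser.feed(page)
parser.close()
candidates = parser.candidates
```

Candidates found this way would still need the visual-model validation named on the slide.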

Page 16:

Learning landmarks from tourist web articles

~7000 landmarks from 787 cities in 145 countries

More evenly distributed

Page 17:

Unsupervised learning of landmark images

Geo-clusters

Landmarks from tour articles

Noisy image pool

Visual clustering

Premise: photos of the same landmark should be similar

Clustering based on local features

Validate and clean models

Visual model validates landmarks!

Photo vs. non-photo classifier to filter out noisy images

Page 18:

Local Feature Detection

• Find invariant and robust features

• Create distinctive feature descriptions

Page 19:

Laplacian-of-Gaussian (LoG)

• Scale-invariant edge detection

• Gaussian image filter to remove noise

• Laplacian filter to find areas of rapid change
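The two filtering steps can be combined into a single kernel. Below is a minimal pure-Python sketch: the kernel is the Laplacian of a Gaussian up to a constant factor, shifted so its entries sum to zero. The function names, the 3-sigma kernel radius and the toy image are assumptions. A scale-invariant detector would repeat this over several values of `sigma`, which is not shown.

```python
import math

def log_kernel(sigma, radius=None):
    """Sampled Laplacian-of-Gaussian kernel (unnormalized), made zero-sum."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))
    kernel = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            r2 = x * x + y * y
            row.append((r2 - 2 * sigma ** 2) / sigma ** 4
                       * math.exp(-r2 / (2 * sigma ** 2)))
        kernel.append(row)
    size = 2 * radius + 1
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def log_response(image, sigma):
    """Correlate image (list of rows) with the LoG kernel; borders stay 0."""
    k = log_kernel(sigma)
    r = len(k) // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(r, w - r):
            out[y][x] = sum(k[dy + r][dx + r] * image[y + dy][x + dx]
                            for dy in range(-r, r + 1)
                            for dx in range(-r, r + 1))
    return out

# Toy image: a bright 3x3 blob on a dark 15x15 background.
image = [[0.0] * 15 for _ in range(15)]
for y in range(6, 9):
    for x in range(6, 9):
        image[y][x] = 1.0
response = log_response(image, sigma=1.0)

# With this sign convention a bright blob gives a strongly negative response.
strongest = min((response[y][x], (y, x)) for y in range(15) for x in range(15))[1]
```

On flat regions the zero-sum kernel returns a response near zero, so only areas of rapid intensity change stand out.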

Page 20:

Local Feature Description

• Invariant and distinctive description

• Texture from a 118-dimensional Gabor wavelet descriptor

Page 21:

Object matching based on local features

$\mathrm{Sim}(I_i, I_j)$ = image match score

Image representation

Interest points: Laplacian-of-Gaussian (LoG) filter

Local feature: Gabor wavelets

match score = $1 - P_{FP}(I_i, I_j)$

$P_{FP}(I_i, I_j)$: probability that the match of $I_i$ and $I_j$ is a false positive

Probability of at least $m$ out of $n$ features matching, if the match is a false positive

Probability of a feature match by chance
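The formulas on this slide did not survive in the transcript, so the following is only a sketch under stated assumptions. It assumes that features match by chance independently with probability `p`, so "at least m out of n features match" follows a binomial tail. It also uses that tail directly as the false-positive probability, which may be simpler than the model on the original slide. The function names are hypothetical.

```python
from math import comb

def prob_at_least(m, n, p):
    """P(at least m of n independent features match), each with probability p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(m, n + 1))

def match_score(m, n, p_chance):
    """Assumed score: 1 minus the chance-level probability of >= m matches."""
    return 1.0 - prob_at_least(m, n, p_chance)

# Hypothetical numbers: 10 of 100 features matched, 1% chance-match rate.
score = match_score(10, 100, 0.01)
```

Under these assumptions, many matched features at a low chance-match rate push the score close to 1, while a handful of matches among many features leaves it low.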

Page 22:

Constructing match region graph

Image matching

• Node is a match region

• 2 types of edges:
  – Match edge: measures match confidence
  – Overlap region edge: measures spatial overlap

Page 23:

Graph clustering on match regions

Distance between any two regions = shortest path connecting them

Why hierarchical agglomerative clustering rather than K-means, GMM, etc.?

Because we have no a priori knowledge of the number of clusters.

Intuitively, each cluster should correspond to one aspect of a landmark.

Agglomerative hierarchical clustering

Match region graph → Visual clusters
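A small sketch of clustering on shortest-path distances, under stated assumptions: edge weights are already expressed as lengths (how match confidence and spatial overlap map to a length is not given on the slide), linkage is single-link, and merging stops at a hypothetical `max_dist` threshold. No cluster count is fixed in advance, in line with the rationale above.

```python
import math

def shortest_paths(n, edges):
    """Floyd-Warshall over an undirected weighted graph given as {(u, v): w}."""
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        d[u][v] = d[v][u] = min(d[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def agglomerative(dist, max_dist):
    """Merge the closest pair of clusters until their distance exceeds max_dist."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        if best[0] > max_dist:
            break
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a] |= merged
    return clusters

# Hypothetical match region graph: regions 0-2 and 3-4 are tightly linked.
edges = {(0, 1): 0.2, (1, 2): 0.3, (3, 4): 0.1, (2, 3): 5.0}
dist = shortest_paths(5, edges)
visual_clusters = agglomerative(dist, max_dist=1.0)
```

The weak edge between regions 2 and 3 is longer than the threshold, so the two groups remain separate clusters.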

Page 24:

Visual cluster example

Corcovado, Rio de Janeiro, Brazil

Acropolis, Athens, Greece

Page 25:

Visual cluster validation and cleaning

• Validate by authors or hosting websites of images
  – Reflects the popular appeal of landmarks

• Filter out non-photographic images, like maps and logos
  – Train an AdaBoost classifier
  – Features: color histogram, Hough transform, etc.

• Clean clusters by detecting large-area human faces

Page 26:

Efficiency issues

Issue 1: learning landmark images

21.4M photos

• Parallel computing to learn true landmark images

• Efficient hierarchical clustering

Recognition engine: ~5000 landmarks

Issue 2: recognizing landmarks

Query image

• Indexing local features for matching
  – kd-tree indexing

• Query time: ~0.2 sec on a P4 computer
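To illustrate kd-tree indexing of feature vectors, here is a compact exact nearest-neighbour sketch in pure Python. The dictionary-based node layout, the function names and the 2-D sample points are assumptions. The descriptors in the talk are 118-dimensional, and exact kd-tree search tends to lose its advantage in high dimensions, so a production system would likely rely on an approximate variant.

```python
def build_kdtree(points, depth=0):
    """Recursively split points (equal-length tuples) on the median of one axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Return the stored point closest to query."""
    if node is None:
        return best
    if best is None or sq_dist(query, node["point"]) < sq_dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)
    return best

# Hypothetical 2-D "descriptors" standing in for local feature vectors.
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
match = nearest(tree, (9, 2))
```

The pruning test in `nearest` is what saves work compared with scanning every stored descriptor.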

Page 27:

Experiments: statistics of learned landmarks

               From photos   From articles   Total
Landmark #     2240          3246            5486
City #         812           626             1259
Country #      104           130             144

Small overlap: 174 landmarks shared

China: 101 landmarks. Under-counted! Why?

U.S.: high Internet penetration rate & enormous tour sites

Page 28:

Evaluation of landmark image learning

• Randomly select 1000 visual clusters

• 68 (0.68%) are outliers: maps, logos, human photos

• Apply photographic vs. non-photographic classifier

• 37 outliers remain: 0.68% => 0.37%

Page 29:

Evaluation of landmark recognition

• Positive testing images:
  – 728 images from 124 landmarks

• Negative testing images:
  – Caltech-256 (30,524) + Pascal VOC 07 (9,986) = 40,510 images

• For positive images:
  – 417 images detected to be landmarks
  – 337/417 (80.8%) are correct
  – Identification rate: 337/728 (46.3%)

• For negative images:
  – 463 images detected to be landmarks
  – False acceptance rate: 1.1%

Landmarks can be similar!
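The reported rates follow directly from the counts on this slide. The short check below uses only those counts; the variable names are chosen here for illustration.

```python
positives, detected_pos, correct = 728, 417, 337
negatives = 30524 + 9986          # Caltech-256 + Pascal VOC 07
detected_neg = 463

precision = correct / detected_pos             # correct among detected positives
identification_rate = correct / positives      # correct among all positives
false_acceptance_rate = detected_neg / negatives

print(f"{precision:.1%} {identification_rate:.1%} {false_acceptance_rate:.1%}")
# prints: 80.8% 46.3% 1.1%
```

About 43% of the positive images (311 of 728) are not detected as landmarks at all, which accounts for most of the gap between the 80.8% and 46.3% figures.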

Page 30:

False detected images

• Match is technically correct, but the match region is not a landmark
  – A problem of model generation

• Match is technically false, due to visual similarity
  – A problem of image features and the matching mechanism