Mapping the World’s Photos
Written by: David Crandall, Lars Backstrom, Daniel Huttenlocher and Jon Kleinberg (2009)
Presented by: Dror Fadida


  • Slide 1
  • Mapping the World’s Photos. Written by: David Crandall, Lars Backstrom, Daniel Huttenlocher and Jon Kleinberg (2009). Presented by: Dror Fadida.
  • Slide 2
  • Introduction. Photo-sharing sites on the Internet contain billions of publicly accessible images taken virtually everywhere on Earth. These images are annotated with various forms of information, including geo-location, time, photographer and textual tags. Every image has visual attributes as well.
  • Slide 3
  • Goal. Organizing a global collection of images using all of these sources of information. The main idea: geospatial information provides an important source of structure that can be directly integrated with visual and textual-tag content for organizing a global-scale photo collection.
  • Slide 4
  • Motivation. We will see how the techniques developed in this paper could be quite useful in management and organization applications such as: automatically suggesting geo-tags, summarizing large collections of images with one representative image, and automatically mining the information latent in very large sets of images.
  • Slide 5
  • Previous work. Existing work has focused primarily either on structure, such as analyses of the social network ties between photographers, or on content, such as studies of image tagging. In contrast, our goal is to investigate the interplay between structure and content, using text tags and image features for content analysis and geospatial information for structural analysis.
  • Slide 6
  • Methods. In this presentation we will cover the following methods: mean shift, the linear support vector machine (SVM), SIFT features, and spectral clustering.
  • Slide 7
  • Dataset. The dataset was collected by downloading images and photo metadata from Flickr.com, giving a large and unbiased sample of geo-tagged photos. Using a crawler we retrieved 60,742,971 photos taken by 490,048 Flickr users. Keeping only photos for which the geo-location tags are accurate to within about a city block, 33,393,835 photos remain, taken by 307,448 users. The total size of the database is nearly two terabytes.
  • Slide 8
  • Resolutions. We consider two spatial resolutions in defining locations: a metropolitan-area scale, in which we resolve locations down to roughly 100 kilometers, and a landmark scale, in which we resolve locations down to roughly 100 meters. The kernel function chosen in the paper is a uniform (flat) kernel.
  • Slide 9
  • Resolutions, example: metropolitan-area scale vs. landmark scale (illustration).
  • Slide 10
  • Two main tasks. 1. Estimating where a photo was taken based on its content, using both image attributes and text tags. 2. Showing what is being photographed at a given location, by selecting representative images for that location.
  • Slide 11
  • Finding and characterizing locations using mean shift. We want to automatically find popular places at which people take photos; a popular place is one with a high number of distinct photographers who have taken a photo there. Process, for each scale: bucket the lat-long values in degrees for each photo; for each photographer, sample a single photo from each bucket; then perform the mean shift procedure, seeded by sampling a photo from each bucket and using a uniform disc as the kernel.
  • Slide 12
  • Mean shift. Mean shift is a general non-parametric mode-finding/clustering procedure. There are no embedded assumptions on the shape of the distribution nor on the number of modes/clusters. It operates by directly estimating the gradient of the probability density from the samples. In our case we use the lat-long values in degrees for each photo, treating them as points in the plane, and bucket them at the corresponding spatial scale: 1 degree for the metropolitan scale (100 km) and 0.001 degree for the landmark scale (100 m). [Konstantinos G. Derpanis, Mean Shift Clustering, August 15, 2005. http://www.cse.yorku.ca/~kosta/CompVis_Notes/mean_shift.pdf]
  • Slides 13-19
  • Intuitive description (animation): given a distribution of identical billiard balls, place a region of interest, compute its center of mass, shift the region along the mean shift vector, and repeat. Objective: find the densest region. [Yaron Ukrainitz & Bernard Sarel, http://www.wisdom.weizmann.ac.il/~vision/.../mean_shift/mean_shift.ppt]
  • Slide 20
  • Mean shift calculation (the slide shows the update formula as an image).
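  For reference, the standard flat-kernel mean shift update (the textbook formula, not copied from the slide) replaces the current estimate x by the mean of the samples inside a disc of radius h around it:

      m(x) = (1 / |N_h(x)|) * sum of x_i over x_i in N_h(x),   where N_h(x) = { x_i : ||x_i - x|| <= h },

  and the procedure iterates x <- m(x) until it stops moving, i.e. converges to a mode of the density.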
  • Slide 21
  • At a given scale, for each photographer we sample a single photo from each bucket. We then perform the mean shift procedure at each scale separately, seeding by sampling a photo from each bucket and using a uniform disc as the kernel. Seeded from many initial points, the trajectory from each starting point converges to a mode of the distribution. We characterize the magnitude of each peak by simply counting the number of points in the support area of the kernel centered at the peak; this is effectively the number of distinct photographers who took photos at that location.
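  A minimal sketch of this procedure in Python, assuming the photos have already been reduced to one sampled (lat, lon) point per photographer per bucket; the function and variable names are illustrative, not the authors' code:

```python
import numpy as np

def mean_shift_peak(points, seed, radius, tol=1e-7, max_iter=100):
    """Follow the flat-kernel mean shift trajectory from `seed` to a density mode.

    points: (n, 2) array of (lat, lon) samples, one per photographer per bucket.
    radius: kernel radius in degrees, e.g. ~1.0 (metropolitan) or ~0.001 (landmark).
    """
    x = np.asarray(seed, dtype=float)
    for _ in range(max_iter):
        in_window = np.linalg.norm(points - x, axis=1) <= radius
        if not np.any(in_window):
            break
        new_x = points[in_window].mean(axis=0)   # mean of the points inside the disc
        if np.linalg.norm(new_x - x) < tol:      # converged to a mode
            break
        x = new_x
    # Magnitude of the peak: number of points in the kernel's support area,
    # which here approximates the number of distinct photographers.
    support = int(np.sum(np.linalg.norm(points - x, axis=1) <= radius))
    return x, support

# Usage: seed the procedure from one sampled photo per bucket, e.g.
# peaks = [mean_shift_peak(points, s, radius=1.0) for s in seeds]
```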
  • Slide 22
  • Clustering example. The attraction basin is the region for which all trajectories lead to the same mode. (Figures: 2D space representation; final clusters.) [Yaron Ukrainitz & Bernard Sarel, http://www.wisdom.weizmann.ac.il/~vision/.../mean_shift/mean_shift.ppt]
  • Slide 23
  • Location clustering results. Table 1: clustering results at the metropolitan scale, showing the most photographed places on Earth ranked by the number of distinct photographers. The textual description of each cluster was generated automatically; for nearly all of the clusters, the first tag is a city name, with the remaining tags indicating the state.
  • Slide 24
  • Location clustering results, cont. Table 2: the seven most photographed landmarks on Earth, and the top seven landmarks in each of the top 25 metropolitan-scale areas, found using mean shift clustering.
  • Slide 25
  • Location clustering results, cont. Table 4: cities ranked according to the saliency of their landmarks. Some cities seem to have a small number of landmarks at which most photos are taken. Some popular tourist cities show up at the top of this ranking, such as Agra, Jerusalem, Prague and Rome. However, other popular tourist cities such as London, Paris and New York have large numbers of photos not taken at landmarks and thus are not ranked highly by this measure. The bottom end of the list contains places that either lack dominant landmarks or where Flickr usage is likely sufficiently high among the resident population that photos are not concentrated at a few locations.
  • Slide 26
  • Estimating location from visual features and tags. Visual features. Strength: they are inherent to the photo itself. Weakness: automatically finding and interpreting visual features is a very challenging problem. Textual features. Strength: textual tags are very easy to interpret. Weakness: they are only available if a human user has added them, and even then they can be irrelevant to geo-classification.
  • Slide 27
  • Visual features. The idea is to identify salient key points in an image that are likely to be stable across a range of image transformations such as scaling, rotation, and perspective distortion. We use SIFT for key point detection. For a typical image, SIFT produces several hundred feature points, each one represented by a 128-dimensional vector. To reduce computational cost we create a visual vocabulary of 1,000 visual words, and each photo is then labeled by the visual words it contains.
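  A sketch of this bag-of-visual-words pipeline, using OpenCV's SIFT and scikit-learn's k-means as stand-ins (these library choices and function names are mine, not the paper's):

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def sift_descriptors(image_path):
    """Return the (n_keypoints, 128) SIFT descriptor matrix for one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.zeros((0, 128), dtype=np.float32)

def build_vocabulary(training_paths, n_words=1000):
    """Cluster all training descriptors into a vocabulary of visual words."""
    all_desc = np.vstack([sift_descriptors(p) for p in training_paths])
    return MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(all_desc)

def bag_of_words(image_path, vocab, n_words=1000):
    """Encode one photo as a histogram of its nearest visual words."""
    desc = sift_descriptors(image_path)
    if len(desc) == 0:
        return np.zeros(n_words)
    words = vocab.predict(desc)
    return np.bincount(words, minlength=n_words)
```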
  • Slide 28
  • Textual features. We encode the textual features using a simple vector: any textual tag occurring in more than 2 training exemplars is included as a dimension of the feature vector. The dimensionality of the feature vectors therefore depends on the number of distinct tags found in the training set, between 500 and 3,000.
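  A sketch of this encoding, assuming a binary indicator per tag (the slide does not say whether tags are binary or counted; the names are illustrative):

```python
from collections import Counter
import numpy as np

def build_tag_index(training_tag_lists):
    """Keep any tag that occurs in more than 2 training photos."""
    counts = Counter(tag for tags in training_tag_lists for tag in set(tags))
    kept = sorted(tag for tag, c in counts.items() if c > 2)
    return {tag: i for i, tag in enumerate(kept)}

def tag_vector(tags, tag_index):
    """Binary feature vector over the retained tag vocabulary."""
    v = np.zeros(len(tag_index))
    for tag in tags:
        if tag in tag_index:
            v[tag_index[tag]] = 1.0
    return v
```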
  • Slide 29
  • SVM - Support Vector Machine. In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other.
  • Slide 30
  • SVM - Support Vector Machine. Given data points, each belonging to one of two classes, the goal is to decide which class a new data point will be in. Linear SVM: given some training data, a set of n points of the form (x_i, y_i), i = 1, ..., n, where each x_i is a p-dimensional vector and y_i is either +1 or -1, we want to find the maximum-margin hyperplane that divides the points. http://en.wikipedia.org/wiki/Support_vector_machine
  • Slide 31
  • Linear SVM. Any hyperplane can be written as the set of points x satisfying w · x + b = 0, where w is the normal vector to the hyperplane and b is a scalar parameter.
  • Slide 32
  • Linear SVM (illustration). http://www.cs.tau.ac.il/~bchor/SEM05/IgorSVM+Phosphorylation.ppt
  • Slide 33
  • Linear SVM. We need to solve the following constrained problem: minimize (1/2)||w||^2 subject to y_i (w · x_i + b) >= 1 for all i. By introducing Lagrange multipliers a_i >= 0, the constrained problem can be expressed as the Lagrangian L(w, b, a) = (1/2)||w||^2 - sum_i a_i [ y_i (w · x_i + b) - 1 ], minimized over w and b and maximized over the a_i.
  • Slide 34
  • Linear SVM. The solution can be expressed as a linear combination of the training vectors, w = sum_i a_i y_i x_i, where only a few of the a_i are non-zero. The corresponding x_i are exactly the support vectors, which lie on the margin and satisfy y_i (w · x_i + b) = 1. This problem can now be solved by standard quadratic programming techniques.
  • Slide 35
  • Back to the images. We select a set of k landmarks and build a model for each of them by training a classifier using photos taken at the landmark versus those taken elsewhere. We train a separate SVM for each of the k landmarks, where the positive exemplars are the photos taken at that landmark and the negative exemplars are those taken at the k - 1 other landmarks. To perform geo-location classification on a given test photo, we run each of the k classifiers on it and choose the landmark with the highest score.
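  A minimal sketch of this one-vs-rest setup with scikit-learn's LinearSVC (the paper's exact SVM implementation and parameters are not specified here):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_landmark_classifiers(X, y, k):
    """One SVM per landmark: positives are that landmark's photos, negatives the rest.

    X: (n_photos, n_features) feature matrix (tag vectors and/or visual-word histograms).
    y: landmark index in {0, ..., k-1} for each training photo.
    """
    classifiers = []
    for landmark in range(k):
        clf = LinearSVC().fit(X, (y == landmark).astype(int))
        classifiers.append(clf)
    return classifiers

def classify_photo(x, classifiers):
    """Run all k classifiers and choose the landmark with the highest SVM score."""
    scores = [clf.decision_function(x.reshape(1, -1))[0] for clf in classifiers]
    return int(np.argmax(scores))
```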
  • Slide 36
  • Training sets vs. testing sets. We split our photo dataset into training and testing portions by partitioning the set of photographers, which avoids the possibility that highly similar photos by the same user appear as both test and training images. As mentioned earlier, we use all the SIFT features in the training set to create a visual vocabulary of 1,000 words by vector quantization. Each image is then represented by a 1000-dimensional vector indicating how many times each visual word occurs in the image.
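  A sketch of splitting by photographer rather than by photo; GroupShuffleSplit is my choice of tool here, while the paper simply partitions the set of photographers:

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_photographer(photo_ids, photographer_ids, test_fraction=0.5):
    """Return train/test index arrays such that no photographer spans both sides."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_fraction, random_state=0)
    train_idx, test_idx = next(splitter.split(photo_ids, groups=photographer_ids))
    return train_idx, test_idx
```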
  • Slide 37
  • Geo-location results. The figure shows the correct classification rate using textual tags, visual features, and the combination of the two: classification results for the ten most photographed landmark-scale locations in each of the ten most photographed metropolitan-scale regions. The baseline is the rate of guessing uniformly at random.
  • Slide 38
  • Geo-location results, cont. Using textual tags alone is typically 4-6 times better than the baseline. Using visual features alone performs considerably worse than using textual tags, but is still 3-4 times better than the baseline. It is somewhat surprising that the two together outperform text features alone by a significant margin.
  • Slide 39
  • Geo-location results, cont. The same classification task for clusters of cities. The performance on higher-ranked cities is generally better than on lower-ranked cities, due to the greater number of training exemplars; possibly there are also certain properties of the more highly photographed cities that make them more easily classifiable visually.
  • Slide 40
  • Geo-location results, cont. The 25- and 50-way landmark classification tasks for the top 10 cities. The performance of the visual classifier degrades roughly linearly as the number of landmarks increases, while the textual and combined classifiers degrade quite slowly.
  • Slide 41
  • Geo-location results, cont. We use the same training and classification paradigm, but for clusters of photos at the metropolitan scale. Textual tag features remain quite distinctive at this scale and hence perform well (56.83%). Visual features, on the other hand, are not useful (12.72%); hence we cannot distinguish between metropolitan-scale photos using visual features alone. This result is intuitive: there is relatively little that visually separates a typical scene in one city from a typical scene in another.
  • Slide 42
  • Adding temporal information. Time provides another dimension along which photographs can be connected together. Photos taken at nearby places at nearly the same time are very likely to be related. Temporal information can be exploited both to recover interesting facts about human behavior and to geo-locate photos more accurately.
  • Slide 43
  • Geo-tagged and time-stamped photos. Every time a photo is taken, we have an observation of where a particular person is at a particular moment in time, creating something like a GPS tracking device. By aggregating this data over many people, we can reconstruct the typical pathways that people take as they move around a geospatial region. We plotted the geolocated coordinates of sequences of images taken by the same user, sorted by time, and no more than 30 minutes apart.
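  A sketch of forming those movement traces: sort one user's photos by timestamp and break the sequence whenever consecutive photos are more than 30 minutes apart (field names are illustrative):

```python
from datetime import timedelta

def user_trajectories(photos, max_gap=timedelta(minutes=30)):
    """photos: list of dicts with 'time' (datetime), 'lat' and 'lon' for one user."""
    ordered = sorted(photos, key=lambda p: p["time"])
    trajectories, current = [], []
    for photo in ordered:
        if current and photo["time"] - current[-1]["time"] > max_gap:
            trajectories.append(current)   # gap too large: start a new trace
            current = []
        current.append(photo)
    if current:
        trajectories.append(current)
    return [[(p["lat"], p["lon"]) for p in t] for t in trajectories]
```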
  • Slide 44
  • Visualization of photographer movement in Manhattan and the San Francisco Bay Area (figure).
  • Slide 45
  • Improving classification performance. We revisit the landmark classification problem of the last section, adding temporal information to the textual and visual features. In classifying a photo, we also examine the photos taken by the same photographer within 15 minutes before and after the picture was taken. We compute the classification distances for each of the k SVM classifiers, sum the scores from the different images together to produce a single k-vector, and then make the classification decision using that vector.
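  A sketch of this temporal voting step, building on the hypothetical classifiers from the earlier SVM sketch; `feature_vector` stands in for whatever tag/visual encoding is used:

```python
import numpy as np
from datetime import timedelta

def classify_with_temporal_context(photo, user_photos, feature_vector, classifiers,
                                   window=timedelta(minutes=15)):
    """Sum the k SVM scores over the photographer's photos taken within +/- 15 minutes.

    user_photos is assumed to include the test photo itself.
    """
    neighbors = [p for p in user_photos
                 if abs(p["time"] - photo["time"]) <= window]
    total = np.zeros(len(classifiers))
    for p in neighbors:
        x = feature_vector(p).reshape(1, -1)
        total += np.array([clf.decision_function(x)[0] for clf in classifiers])
    return int(np.argmax(total))   # landmark with the highest summed score
```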
  • Slide 46
  • Performance on the landmark classification task with and without temporal information (figure).
  • Slide 47
  • Findings. For the classifiers that use only textual tags, the improvement is small, since many Flickr users appear to label groups of consecutive photos with the same tags. For the visual features, however, temporal information improves the results dramatically: photographers take multiple pictures of the same landmark, and thus neighboring frames provide good visual evidence of where the photos were taken. For all of the cities, the best performance is achieved by using the full combination of textual, visual and temporal information.
  • Slide 48
  • Representative images. Given our ability to automatically find and generate textual descriptions of cities and landmarks, it is natural to ask whether it is possible to extract visual descriptions as well. Given a set of photos known to be taken near a landmark, we wish to automatically select a canonical image of the landmark. This problem is non-trivial because the subject of most photos taken near a landmark is actually not the landmark itself, so simple techniques like random selection do very poorly.
  • Slide 49
  • Intuition. People take photos because they think a subject is visually interesting, pleasing, or distinctive. It is as if photos of a landmark are votes for what the visual representation of the landmark should be. Thus we find representative images by looking for subsets of photos that are visually very similar, and choosing an image from among the most salient subset.
  • Slide 50
  • Reduction of the problem. We pose canonical image selection as a graph problem. We construct a graph in which each node represents a photo, and between each pair of nodes is an edge whose weight indicates the degree of visual similarity between the two photos (using the Euclidean distance between SIFT descriptors). The goal is to find a tightly connected cluster of photos that are highly similar; to do this we use a spectral clustering technique. Finally, we choose as the canonical image for each cluster the one corresponding to the node with the largest weighted degree.
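  A sketch of this pipeline with scikit-learn's SpectralClustering; the similarity matrix, the choice of two clusters, and the use of cluster size as the saliency criterion are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def canonical_image(similarity, n_clusters=2):
    """similarity: (n_photos, n_photos) symmetric, non-negative similarity matrix."""
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed",
                                random_state=0).fit_predict(similarity)
    # Take the most salient cluster (here: the largest one)...
    best = np.argmax(np.bincount(labels))
    members = np.where(labels == best)[0]
    # ...and within it, the node with the largest weighted degree.
    degrees = similarity[np.ix_(members, members)].sum(axis=1)
    return members[np.argmax(degrees)]   # index of the chosen canonical photo
```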
  • Slides 51-54
  • Background on spectral clustering, with figures adapted from Matthias Hein and Ulrike von Luxburg (August 2007). A graph consists of vertices and edges.
  • Slide 55
  • Simple example: a graph with 4 nodes and 2 clusters. We put edges between every pair of objects in the same cluster, and put no edges across clusters. The adjacency matrix of the graph is block diagonal, and its leading eigenvectors are supported on the individual blocks, so the eigenvectors identify the clusters.
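  A tiny worked version of this 4-node example (illustrative, not taken from the slide):

```python
import numpy as np

# Two 2-node clusters: an edge inside each cluster, none across clusters.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
top2 = eigvecs[:, -2:]                 # eigenvectors of the 2 largest eigenvalues
print(np.round(top2, 3))
# Rows of `top2` give each node a 2-d spectral embedding: nodes 0 and 1 share one
# embedding, nodes 2 and 3 share another, so clustering the rows recovers the two clusters.
```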
  • Slide 56
  • Example, cont. If we permute the matrix by swapping rows and columns (i.e., relabeling the nodes), the correspondingly permuted eigenvectors again identify the clustering.
  • Slide 57
  • Reminder: given a set of photos known to be taken near a landmark, we wish to automatically select a canonical image of the landmark. We find representative images by looking for subsets of photos that are visually very similar, and choosing an image from among the most salient subset. Results: http://www.cs.cornell.edu/~crandall/photomap/
  • Slide 58
  • Conclusion. We introduced techniques for analyzing a global collection of geo-referenced photographs. We saw techniques to automatically identify places that people find interesting to photograph. We used classification methods for predicting these locations from visual, textual and temporal features. Finally, we demonstrated that representative photos can be selected automatically.
  • Slide 60
  • REFERENCES
    David Crandall, Lars Backstrom, Daniel Huttenlocher and Jon Kleinberg. Mapping the World's Photos. http://www.cs.cornell.edu/~crandall/papers/mapping09www.pdf
    Konstantinos G. Derpanis. Mean Shift Clustering, August 15, 2005. http://www.cse.yorku.ca/~kosta/CompVis_Notes/mean_shift.pdf
    Support vector machines. http://en.wikipedia.org/wiki/Support_vector_machine
    Support vector machines. http://www.cs.tau.ac.il/~bchor/SEM05/IgorSVM+Phosphorylation.ppt
    Matthias Hein and Ulrike von Luxburg. Spectral clustering slides, August 2007. http://www1.idc.ac.il/toky/seminarIP-08/.../SpectralClustering.ppt
    A Very Simple Explanation of Spectral Clustering. http://www.akrish.net/blog/2012/03/16/simple-spectral-clustering/
    Mapping the World's Photos figures. http://www.cs.cornell.edu/~crandall/photomap/