Mapping the World’s Photos
Written by: David Crandall, Lars Backstrom, Daniel Huttenlocher and Jon Kleinberg (2009)
Presented by: Dror Fadida


  • Slide 1
  • Mapping the World’s Photos. Written by: David Crandall, Lars Backstrom, Daniel Huttenlocher and Jon Kleinberg (2009). Presented by: Dror Fadida.
  • Slide 2
  • Introduction. Photo-sharing sites on the Internet contain billions of publicly accessible images taken virtually everywhere on Earth. These images are annotated with various forms of information, including geo-location, time, photographer and textual tags. Every image has visual attributes as well.
  • Slide 3
  • Goal. Organizing a global collection of images using all of these sources of information. The main idea: geospatial information provides an important source of structure that can be directly integrated with visual and textual-tag content for organizing a global-scale photo collection.
  • Slide 4
  • Motivation. We will see how the techniques developed in this paper could be quite useful in management and organization applications such as: automatically suggesting geo-tags, summarizing large collections of images with one representative image, and automatically mining the information latent in very large sets of images.
  • Slide 5
  • Previous work. Existing work has focused primarily either on structure, such as analyses of the social network ties between photographers, or on content, such as studies of image tagging. In contrast, our goal is to investigate the interplay between structure and content, using text tags and image features for content analysis and geospatial information for structural analysis.
  • Slide 6
  • Methods. In this presentation we will cover the following methods: mean shift, the linear support vector machine (SVM), SIFT features, and spectral clustering.
  • Slide 7
  • Dataset. The dataset was collected by downloading images and photo metadata from Flickr.com, giving a large and unbiased sample of geo-tagged photos. Using a crawler we retrieved 60,742,971 photos taken by 490,048 Flickr users. Keeping only photos for which the geo-location tags are accurate to within about a city block, 33,393,835 photos remain, taken by 307,448 users. The total size of the database is nearly two terabytes.
  • Slide 8
  • Resolutions. We consider two spatial resolutions in defining locations: a metropolitan-area scale, in which we resolve locations down to roughly 100 kilometers, and a landmark scale, in which we resolve locations down to roughly 100 meters. The kernel function chosen in the paper is a uniform (flat) kernel.
  • Slide 9
  • Resolutions, example: metropolitan-area scale vs. landmark scale (illustration).
  • Slide 10
  • Two main tasks. 1. Estimating where a photo was taken based on its content, using both image attributes and text tags. 2. Showing what is being photographed at a given location, by selecting representative images for that location.
  • Slide 11
  • Finding and characterizing locations using mean shift. We want to automatically find popular places at which people take photos; a popular place is one with a high number of distinct photographers who have taken a photo there. Process, for each scale: bucket the lat-long values in degrees for each photo; for each photographer, sample a single photo from each bucket; then perform the mean shift procedure, seeded by sampling a photo from each bucket and using a uniform disc as the kernel.
  • Slide 12
  • Mean shift. Mean shift is a general non-parametric mode-finding/clustering procedure. There are no embedded assumptions on the shape of the distribution nor on the number of modes/clusters. It operates by directly estimating the gradient of the probability density from the samples. In our case we use the lat-long values in degrees for each photo, treating them as points in the plane, and bucket them at the corresponding spatial scale: 1 degree for the metropolitan scale (100 km) and 0.001 degree for the landmark scale (100 m). [Konstantinos G. Derpanis, Mean Shift Clustering, August 15, 2005. http://www.cse.yorku.ca/~kosta/CompVis_Notes/mean_shift.pdf]
  • Slides 13-19
  • Intuitive description (animation): given a distribution of identical billiard balls, place a region of interest, compute its center of mass, shift the region along the mean shift vector, and repeat. Objective: find the densest region. [Yaron Ukrainitz & Bernard Sarel, http://www.wisdom.weizmann.ac.il/~vision/.../mean_shift/mean_shift.ppt]
  • Slide 20
  • Mean shift calculation (the slide shows the update formula as an image).
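  For reference, the standard flat-kernel mean shift update (the textbook formula, not copied from the slide) replaces the current estimate x by the mean of the samples inside a disc of radius h around it:

      m(x) = (1 / |N_h(x)|) * sum of x_i over x_i in N_h(x),   where N_h(x) = { x_i : ||x_i - x|| <= h },

  and the procedure iterates x <- m(x) until it stops moving, i.e. converges to a mode of the density.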
  • Slide 21
  • At a given scale, for each photographer we sample a single photo from each bucket. We then perform the mean shift procedure at each scale separately, seeding by sampling a photo from each bucket and using a uniform disc as the kernel. Seeded from many initial points, the trajectory from each starting point converges to a mode of the distribution. We characterize the magnitude of each peak by simply counting the number of points in the support area of the kernel centered at the peak; this is effectively the number of distinct photographers who took photos at that location.
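  A minimal sketch of this procedure in Python, assuming the photos have already been reduced to one sampled (lat, lon) point per photographer per bucket; the function and variable names are illustrative, not the authors' code:

```python
import numpy as np

def mean_shift_peak(points, seed, radius, tol=1e-7, max_iter=100):
    """Follow the flat-kernel mean shift trajectory from `seed` to a density mode.

    points: (n, 2) array of (lat, lon) samples, one per photographer per bucket.
    radius: kernel radius in degrees, e.g. ~1.0 (metropolitan) or ~0.001 (landmark).
    """
    x = np.asarray(seed, dtype=float)
    for _ in range(max_iter):
        in_window = np.linalg.norm(points - x, axis=1) <= radius
        if not np.any(in_window):
            break
        new_x = points[in_window].mean(axis=0)   # mean of the points inside the disc
        if np.linalg.norm(new_x - x) < tol:      # converged to a mode
            break
        x = new_x
    # Magnitude of the peak: number of points in the kernel's support area,
    # which here approximates the number of distinct photographers.
    support = int(np.sum(np.linalg.norm(points - x, axis=1) <= radius))
    return x, support

# Usage: seed the procedure from one sampled photo per bucket, e.g.
# peaks = [mean_shift_peak(points, s, radius=1.0) for s in seeds]
```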
  • Slide 22
  • Clustering example. The attraction basin is the region for which all trajectories lead to the same mode. (Figures: 2D space representation; final clusters.) [Yaron Ukrainitz & Bernard Sarel, http://www.wisdom.weizmann.ac.il/~vision/.../mean_shift/mean_shift.ppt]
  • Slide 23
  • Location clustering results. Table 1: clustering results at the metropolitan scale, showing the most photographed places on Earth ranked by the number of distinct photographers. The textual description of each cluster was generated automatically; for nearly all of the clusters, the first tag is a city name, with the remaining tags indicating the state.
  • Slide 24
  • Location clustering results, cont. Table 2: the seven most photographed landmarks on Earth, and the top seven landmarks in each of the top 25 metropolitan-scale areas, found using mean shift clustering.
  • Slide 25
  • Location clustering results, cont. Table 4: cities ranked according to the saliency of their landmarks. Some cities seem to have a small number of landmarks at which most photos are taken. Some popular tourist cities show up at the top of this ranking, such as Agra, Jerusalem, Prague and Rome. However, other popular tourist cities such as London, Paris and New York have large numbers of photos not taken at landmarks and thus are not ranked highly by this measure. The bottom end of the list contains places that either lack dominant landmarks or where Flickr usage is likely sufficiently high among the resident population that photos are not concentrated at a few locations.
  • Slide 26
  • Estimating location from visual features and tags. Visual features. Strength: they are inherent to the photo itself. Weakness: automatically finding and interpreting visual features is a very challenging problem. Textual features. Strength: textual tags are very easy to interpret. Weakness: they are only available if a human user has added them, and even then they can be irrelevant to geo-classification.
  • Slide 27
  • Visual features. The idea is to identify salient key points in an image that are likely to be stable across a range of image transformations such as scaling, rotation, and perspective distortion. We use SIFT for key point detection. For a typical image, SIFT produces several hundred feature points, each one represented by a 128-dimensional vector. To reduce computational cost we create a visual vocabulary of 1,000 visual words, and each photo is then labeled by the visual words it contains.
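  A sketch of this bag-of-visual-words pipeline, using OpenCV's SIFT and scikit-learn's k-means as stand-ins (these library choices and function names are mine, not the paper's):

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def sift_descriptors(image_path):
    """Return the (n_keypoints, 128) SIFT descriptor matrix for one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.zeros((0, 128), dtype=np.float32)

def build_vocabulary(training_paths, n_words=1000):
    """Cluster all training descriptors into a vocabulary of visual words."""
    all_desc = np.vstack([sift_descriptors(p) for p in training_paths])
    return MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(all_desc)

def bag_of_words(image_path, vocab, n_words=1000):
    """Encode one photo as a histogram of its nearest visual words."""
    desc = sift_descriptors(image_path)
    if len(desc) == 0:
        return np.zeros(n_words)
    words = vocab.predict(desc)
    return np.bincount(words, minlength=n_words)
```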
  • Slide 28
  • Textual features. We encode the textual features using a simple vector: any textual tag occurring in more than 2 training exemplars is included as a dimension of the feature vector. The dimensionality of the feature vectors therefore depends on the number of distinct tags found in the training set, between 500 and 3,000.
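  A sketch of this encoding, assuming a binary indicator per tag (the slide does not say whether tags are binary or counted; the names are illustrative):

```python
from collections import Counter
import numpy as np

def build_tag_index(training_tag_lists):
    """Keep any tag that occurs in more than 2 training photos."""
    counts = Counter(tag for tags in training_tag_lists for tag in set(tags))
    kept = sorted(tag for tag, c in counts.items() if c > 2)
    return {tag: i for i, tag in enumerate(kept)}

def tag_vector(tags, tag_index):
    """Binary feature vector over the retained tag vocabulary."""
    v = np.zeros(len(tag_index))
    for tag in tags:
        if tag in tag_index:
            v[tag_index[tag]] = 1.0
    return v
```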
  • Slide 29
  • SVM - Support Vector Machine. In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other.
  • Slide 30
  • SVM - Support Vector Machine. Given data points, each belonging to one of two classes, the goal is to decide which class a new data point will be in. Linear SVM: given some training data, a set of n points of the form (x_i, y_i), i = 1, ..., n, where each x_i is a p-dimensional vector and y_i is either +1 or -1, we want to find the maximum-margin hyperplane that divides the points. http://en.wikipedia.org/wiki/Support_vector_machine
  • Slide 31
  • Linear SVM. Any hyperplane can be written as the set of points x satisfying w · x + b = 0, where w is the normal vector to the hyperplane and b is a scalar parameter.
  • Slide 32
  • Linear SVM (illustration). http://www.cs.tau.ac.il/~bchor/SEM05/IgorSVM+Phosphorylation.ppt
  • Slide 33
  • Linear SVM. We need to solve the following constrained problem: minimize (1/2)||w||^2 subject to y_i (w · x_i + b) >= 1 for all i. By introducing Lagrange multipliers a_i >= 0, the constrained problem can be expressed as the Lagrangian L(w, b, a) = (1/2)||w||^2 - sum_i a_i [ y_i (w · x_i + b) - 1 ], minimized over w and b and maximized over the a_i.
  • Slide 34
  • Linear SVM. The solution can be expressed as a linear combination of the training vectors, w = sum_i a_i y_i x_i, where only a few of the a_i are non-zero. The corresponding x_i are exactly the support vectors, which lie on the margin and satisfy y_i (w · x_i + b) = 1. This problem can now be solved by standard quadratic programming techniques.
  • Slide 35
  • Back to the images. We select a set of k landmarks and build a model for each of them by training a classifier using photos taken at the landmark versus those taken elsewhere. We train a separate SVM for each of the k landmarks, where the positive exemplars are the photos taken at that landmark and the negative exemplars are those taken at the k - 1 other landmarks. To perform geo-location classification on a given test photo, we run each of the k classifiers on it and choose the landmark with the highest score.
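  A minimal sketch of this one-vs-rest setup with scikit-learn's LinearSVC (the paper's exact SVM implementation and parameters are not specified here):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_landmark_classifiers(X, y, k):
    """One SVM per landmark: positives are that landmark's photos, negatives the rest.

    X: (n_photos, n_features) feature matrix (tag vectors and/or visual-word histograms).
    y: landmark index in {0, ..., k-1} for each training photo.
    """
    classifiers = []
    for landmark in range(k):
        clf = LinearSVC().fit(X, (y == landmark).astype(int))
        classifiers.append(clf)
    return classifiers

def classify_photo(x, classifiers):
    """Run all k classifiers and choose the landmark with the highest SVM score."""
    scores = [clf.decision_function(x.reshape(1, -1))[0] for clf in classifiers]
    return int(np.argmax(scores))
```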
  • Slide 36
  • Training sets vs. testing sets. We split our photo dataset into training and testing portions by partitioning the set of photographers, which avoids the possibility that highly similar photos by the same user appear as both test and training images. As mentioned earlier, we use all the SIFT features in the training set to create a visual vocabulary of 1,000 words by vector quantization. Each image is then represented by a 1000-dimensional vector indicating how many times each visual word occurs in the image.
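  A sketch of splitting by photographer rather than by photo; GroupShuffleSplit is my choice of tool here, while the paper simply partitions the set of photographers:

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_photographer(photo_ids, photographer_ids, test_fraction=0.5):
    """Return train/test index arrays such that no photographer spans both sides."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_fraction, random_state=0)
    train_idx, test_idx = next(splitter.split(photo_ids, groups=photographer_ids))
    return train_idx, test_idx
```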
  • Slide 37
  • Geo-location results. The figure shows the correct classification rate using textual tags, visual features, and the combination of the two: classification results for the ten most photographed landmark-scale locations in each of the ten most photographed metropolitan-scale regions. The baseline is the rate of guessing uniformly at random.
  • Slide 38
  • Geo-location results, cont. Using textual tags alone is typically 4-6 times better than the baseline. Using visual features alone performs considerably worse than using textual tags, but is still 3-4 times better than the baseline. It is somewhat surprising that the two together outperform text features alone by a significant margin.
  • Slide 39
  • Geo-location results, cont. The same classification task for clusters of cities. The performance on higher-ranked cities is generally better than on lower-ranked cities, due to the greater number of training exemplars; possibly there are also certain properties of the more highly photographed cities that make them more easily classifiable visually.
  • Slide 40
  • Geo-location results, cont. The 25- and 50-way landmark classification tasks for the top 10 cities. The performance of the visual classifier degrades roughly linearly as the number of landmarks increases, while the textual and combined classifiers degrade quite slowly.
  • Slide 41
  • Geo-location results, cont. We use the same training and classification paradigm, but for clusters of photos at the metropolitan scale. Textual tag features remain quite distinctive at this scale and hence perform well (56.83%). Visual features, on the other hand, are not useful (12.72%); hence we cannot distinguish between metropolitan-scale photos using visual features alone. This result is intuitive: there is relatively little that visually separates a typical scene in one city from a typical scene in another.
  • Slide 42
  • Adding temporal information. Time provides another dimension along which photographs can be connected together. Photos taken at nearby places at nearly the same time are very likely to be related. Temporal information can be exploited both to recover interesting facts about human behavior and to geo-locate photos more accurately.
  • Slide 43
  • Geo-tagged and time-stamped photos. Every time a photo is taken, we have an observation of where a particular person is at a particular moment in time, creating something like a GPS tracking device. By aggregating this data over many people, we can reconstruct the typical pathways that people take as they move around a geospatial region. We plotted the geolocated coordinates of sequences of images taken by the same user, sorted by time, and no more than 30 minutes apart.
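  A sketch of forming those movement traces: sort one user's photos by timestamp and break the sequence whenever consecutive photos are more than 30 minutes apart (field names are illustrative):

```python
from datetime import timedelta

def user_trajectories(photos, max_gap=timedelta(minutes=30)):
    """photos: list of dicts with 'time' (datetime), 'lat' and 'lon' for one user."""
    ordered = sorted(photos, key=lambda p: p["time"])
    trajectories, current = [], []
    for photo in ordered:
        if current and photo["time"] - current[-1]["time"] > max_gap:
            trajectories.append(current)   # gap too large: start a new trace
            current = []
        current.append(photo)
    if current:
        trajectories.append(current)
    return [[(p["lat"], p["lon"]) for p in t] for t in trajectories]
```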
  • Slide 44
  • Visualization of photographer movement in Manhattan and the San Francisco Bay Area (figure).
  • Slide 45
  • Improving classification performance. We revisit the landmark classification problem of the last section, adding temporal information to the textual and visual features. In classifying a photo, we also examine the photos taken by the same photographer within 15 minutes before and after the picture was taken. We compute the classification distances for each of the k SVM classifiers, sum the scores from the different images together to produce a single k-vector, and then make the classification decision using that vector.
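  A sketch of this temporal voting step, building on the hypothetical classifiers from the earlier SVM sketch; `feature_vector` stands in for whatever tag/visual encoding is used:

```python
import numpy as np
from datetime import timedelta

def classify_with_temporal_context(photo, user_photos, feature_vector, classifiers,
                                   window=timedelta(minutes=15)):
    """Sum the k SVM scores over the photographer's photos taken within +/- 15 minutes.

    user_photos is assumed to include the test photo itself.
    """
    neighbors = [p for p in user_photos
                 if abs(p["time"] - photo["time"]) <= window]
    total = np.zeros(len(classifiers))
    for p in neighbors:
        x = feature_vector(p).reshape(1, -1)
        total += np.array([clf.decision_function(x)[0] for clf in classifiers])
    return int(np.argmax(total))   # landmark with the highest summed score
```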
  • Slide 46
  • Performance on the landmark classification task with and without temporal information (figure).
  • Slide 47
  • Findings. For the classifiers that use only textual tags, the improvement is small, since many Flickr users appear to label groups of consecutive photos with the same tags. For the visual features, however, temporal information improves the results dramatically: photographers take multiple pictures of the same landmark, and thus neighboring frames provide good visual evidence of where the photos were taken. For all of the cities, the best performance is achieved by using the full combination of textual, visual and temporal information.
  • Slide 48
  • Representative images. Given our ability to automatically find and generate textual descriptions of cities and landmarks, it is natural to ask whether it is possible to extract visual descriptions as well. Given a set of photos known to be taken near a landmark, we wish to automatically select a canonical image of the landmark. This problem is non-trivial because the subject of most photos taken near a landmark is actually not the landmark itself, so simple techniques like random selection do very poorly.
  • Slide 49
  • Intuition. People take photos because they think a subject is visually interesting, pleasing, or distinctive. It is as if photos of a landmark are votes for what the visual representation of the landmark should be. Thus we find representative images by looking for subsets of photos that are visually very similar, and choosing an image from among the most salient subset.
  • Slide 50
  • Reduction of the problem. We pose canonical image selection as a graph problem. We construct a graph in which each node represents a photo, and between each pair of nodes is an edge whose weight indicates the degree of visual similarity between the two photos (using the Euclidean distance between SIFT descriptors). The goal is to find a tightly connected cluster of photos that are highly similar; to do this we use a spectral clustering technique. Finally, we choose as the canonical image for each cluster the one corresponding to the node with the largest weighted degree.
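  A sketch of this pipeline with scikit-learn's SpectralClustering; the similarity matrix, the choice of two clusters, and the use of cluster size as the saliency criterion are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def canonical_image(similarity, n_clusters=2):
    """similarity: (n_photos, n_photos) symmetric, non-negative similarity matrix."""
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed",
                                random_state=0).fit_predict(similarity)
    # Take the most salient cluster (here: the largest one)...
    best = np.argmax(np.bincount(labels))
    members = np.where(labels == best)[0]
    # ...and within it, the node with the largest weighted degree.
    degrees = similarity[np.ix_(members, members)].sum(axis=1)
    return members[np.argmax(degrees)]   # index of the chosen canonical photo
```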
  • Slides 51-54
  • Background on spectral clustering, with figures adapted from Matthias Hein and Ulrike von Luxburg (August 2007). A graph consists of vertices and edges.
  • Slide 55
  • Simple example: a graph with 4 nodes and 2 clusters. We put edges between every pair of objects in the same cluster, and put no edges across clusters. The adjacency matrix of the graph is block diagonal, and its leading eigenvectors are supported on the individual blocks, so the eigenvectors identify the clusters.
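  A tiny worked version of this 4-node example (illustrative, not taken from the slide):

```python
import numpy as np

# Two 2-node clusters: an edge inside each cluster, none across clusters.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
top2 = eigvecs[:, -2:]                 # eigenvectors of the 2 largest eigenvalues
print(np.round(top2, 3))
# Rows of `top2` give each node a 2-d spectral embedding: nodes 0 and 1 share one
# embedding, nodes 2 and 3 share another, so clustering the rows recovers the two clusters.
```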
  • Slide 56
  • Example, cont. If we permute the matrix by swapping rows and columns (i.e., relabeling the nodes), the correspondingly permuted eigenvectors again identify the clustering.
  • Slide 57
  • Reminder: given a set of photos known to be taken near a landmark, we wish to automatically select a canonical image of the landmark. We find representative images by looking for subsets of photos that are visually very similar, and choosing an image from among the most salient subset. Results: http://www.cs.cornell.edu/~crandall/photomap/
  • Slide 58
  • Conclusion. We introduced techniques for analyzing a global collection of geo-referenced photographs. We saw techniques to automatically identify places that people find interesting to photograph. We used classification methods for predicting these locations from visual, textual and temporal features. Finally, we demonstrated that representative photos can be selected automatically.
  • Slide 60
  • REFERENCES
    David Crandall, Lars Backstrom, Daniel Huttenlocher and Jon Kleinberg. Mapping the World's Photos. http://www.cs.cornell.edu/~crandall/papers/mapping09www.pdf
    Konstantinos G. Derpanis. Mean Shift Clustering, August 15, 2005. http://www.cse.yorku.ca/~kosta/CompVis_Notes/mean_shift.pdf
    Support vector machines. http://en.wikipedia.org/wiki/Support_vector_machine
    Support vector machines. http://www.cs.tau.ac.il/~bchor/SEM05/IgorSVM+Phosphorylation.ppt
    Matthias Hein and Ulrike von Luxburg. Spectral clustering slides, August 2007. http://www1.idc.ac.il/toky/seminarIP-08/.../SpectralClustering.ppt
    A Very Simple Explanation of Spectral Clustering. http://www.akrish.net/blog/2012/03/16/simple-spectral-clustering/
    Mapping the World's Photos figures. http://www.cs.cornell.edu/~crandall/photomap/