[IEEE 2010 IEEE International Symposium on Multimedia (ISM) - Taichung, Taiwan...



Event Clusters Detection on Flickr Images using a Suffix-Tree Structure

Massimiliano Ruocco and Heri Ramampiaro
Department of Computer and Information Science

Norwegian University of Science and Technology

Trondheim, Norway

{ruocco,heri}@idi.ntnu.no

Abstract—Image clustering is a problem that has been treated extensively in both Content-Based (CBIR) and Text-Based (TBIR) Image Retrieval Systems. In this paper, we propose a new image clustering approach that takes annotations, time, and geographical position into account. Our goal is to develop a clustering method that allows an image to be part of an event cluster. We extend a well-known clustering algorithm called Suffix Tree Clustering (STC), which was originally developed to cluster text documents using document snippets. To be able to use this algorithm, we consider an annotated image as a document. Then, we extend it to also include time and geographical position. This appears to be particularly useful on images gathered from online photo-sharing applications such as Flickr, where image tags are often subjective and incomplete. For this reason, clustering based on textual annotations alone is not enough to capture all the context information related to an image. Our approach addresses this challenge. In addition, we propose a novel algorithm to extract event clusters. The algorithm is evaluated using an annotated dataset from Flickr, and a comparison between different granularities of time and space is provided.

Keywords-Event Detection; Image Clustering; Event Clustering; Suffix Tree; Image Annotation;

I. INTRODUCTION

The proliferation of web photo-sharing applications such as Flickr¹ has resulted in large amounts of personal photos that are available for public access. Today Flickr contains 4 billion² photos, of which 100 million are geotagged³. To access these photos, users often do browsing or text-based search, which are often imprecise and time-consuming processes. Thus, both processes can benefit from image clustering by (1) allowing for better and more user-friendly navigation, and (2) reducing the search space and improving the re-ranking of the search results.

In online photo-sharing applications, images may be accompanied by textual annotations, also known as photo tags, and geographical information. A tag-based clustering application has already been proposed in Flickr⁴. However, this clustering algorithm does not take time or geographical information into account; it is only based on statistics computed on the tags.

¹http://www.flickr.com
²http://blog.flickr.net/en/2009/10/12/4000000000/
³http://code.flickr.com/blog/2009/02/04/100000000-geotagged-photos-plus/
⁴http://blog.flickr.net/en/2005/08/01/the-new-new-things/

An image may belong to several semantic levels. As an example, a photo of a group of people in front of the Tour Eiffel during the New Year's celebration may belong to three semantic layers: the first layer is related to the event of the New Year's celebration, identified by the time information. The second layer represents the fact that it is a tourist attraction, and as such is strictly related to the visual features. The third layer is connected to the geographic location – e.g., the Paris area. This means that an image can be placed in a two-dimensional space, represented by time and space. This calls for a new clustering approach. Traditional text-based clustering only considers image annotations, which can be inaccurate and often insufficient due to their subjectivity and limited coverage.

The main objective of this paper is to overcome this limitation by including both time and location dimensions as well as the annotations in the event extraction for the clustering process. We achieve this by extracting event clusters from large datasets such as Flickr images using temporal and spatial information, and by extending an incremental clustering algorithm called the Suffix Tree Clustering (STC) algorithm [1]. There are two main reasons for choosing STC. First, due to its incremental nature, STC is particularly suitable for dynamic and large amounts of data such as Flickr image collections. Second, STC has been shown to have good performance characteristics [1]. This makes it useful in dealing with the extraction of relevant event clusters. As we will discuss later, this process requires frequent data access and comparison of clusters. Because access to an STC (tree) structure is logarithmic – i.e., O(log n) per lookup, where n is the number of nodes in the STC tree – it is ideal for this type of application.

Most work in event detection has mainly been done on text documents. An event can here be defined as "something happening in a certain place at a certain time" [2]. From this perspective, the main contributions of this paper are as follows. First, we suggest a new event extraction-based clustering approach that addresses the above challenges and limitations of image annotations by taking the noise in the annotations and their different semantic levels into account. As part of this, we extended a dictionary tailored to Flickr images. Using this dictionary, we can analyze image tags and remove noise from the annotations. As a result, we gain a reduced term space. Second, we propose to extend the well-known clustering algorithm STC to deal

2010 IEEE International Symposium on Multimedia

978-0-7695-4217-1/10 $26.00 © 2010 IEEE

DOI 10.1109/ISM.2010.16


with not only the textual annotations, but also time and location information. Here we use the set of annotations of an image as the snippet of a document and apply it in the clustering process. Third, a clustering algorithm such as STC was originally proposed for text document clustering only. We believe that the way we use and extend this algorithm in our approach is unique. This, in combination with event extraction on a large set of images, is in itself a contribution. Finally, we analyze the behavior of the algorithm using different granularities of time and space, and evaluate its performance.

The remainder of this paper is organized as follows. First, to put our work in perspective, Section II discusses other work related to ours. Second, in Section III we formally define the problem addressed in this paper. Third, Section IV gives an overview of the STC algorithm and briefly discusses how it is used in our work. Fourth, Section V describes in detail the principle behind our approach. Fifth, Section VI presents and discusses the results from the evaluation of our method. Finally, in Section VII we conclude our paper and present directions for further research.

II. RELATED WORK

The event detection topic has its origin in the TDT (Topic Detection and Tracking) project [3]. In this section, we discuss how event detection has been used in other work related to our approach.

Focusing on approaches that detect events from image tags in the Flickr collection, the work by Chen and Roy [4] seems to be most related to our approach. Like ours, they use temporal and spatial information (GPS position as latitude and longitude) to analyze the tags. In contrast to ours, they perform a wavelet transform on the data to reduce the noise. Three main steps are performed: First, they do an event tag detection to extract tags related to events, based on their temporal/spatial distribution. Second, they generate events to group tags related to the same event together, by using semantic similarity and considering the spatial distance between two tags. This distance is computed based on the KL-divergence between the two densities representing the two tags. As part of this step, a pre-processing step is performed to differentiate tags related to periodic and aperiodic events. Then, the retrieved events are linked to a set of photos (still differentiating periodic and aperiodic events) after determining the time and location of each photo. This kind of approach is called a Feature-Pivot approach [5]. Here the main focus is on the determination of bursty events – i.e., events that are "hot" in a certain period of time – in a chronologically ordered text stream. Bursty events consist of a set of bursty features, and are mainly used for assisting text classification. The presented approach aims to extract bursty features to detect the related bursty events. In addition to Feature-Pivot approaches, there is another set of approaches that focus on the document content. These approaches can be categorized as Document-Pivot approaches [2], [6]–[8]. As opposed to Feature-Pivot approaches, the main idea is to group/cluster documents based on events extracted from the document contents. Thus, a group of similar documents will form an event.

Other works addressing event detection from Flickr images are presented in [9] and [10]. In the former, the focus is on the extraction of location and event semantics for tags assigned to Flickr photos. In the latter, the authors present an approach for detecting event tags in a user's photo collection. However, here only temporal features are considered. The idea is to capture the picture-taking behavior of the user, using time series – i.e., the number of images per day. By analyzing this behavior, their algorithm can extract significant events from the image collections. They use the ARIMA (Auto-Regressive Integrated Moving Average) model to represent picture-taking behavior over time. This model combines an auto-regressive component, which captures the correlation with previous values in the time series, with a moving average. They then use this information to detect changes in the picture-taking behavior over time.

Event clustering on media-sharing sites was proposed in [11], where a social media document is considered as a set of features. The authors defined a similarity metric for each feature, and used this to group social documents associated with one specific event together.

Apart from the above, most previous papers have focused on detecting events from document streams rather than image collections. In particular, the two main detection tasks – retrospective detection and online detection – are defined in [2]. The former aims to discover previously unidentified events, while the latter aims to identify new events from a stream of news. Existing papers on the first approach are [6], [7], [12], [13], while an approach to retrospective detection is proposed in [2]. In this work the authors developed a group-average clustering (GAC) algorithm for retrospective detection and an Incremental Clustering (INCR) algorithm, a single-pass algorithm for both retrospective and online detection. For a given set of documents, the GAC algorithm produces a forest of cluster trees. The idea is to exploit the temporal proximity of the different stories, giving more priority to grouping temporally consecutive stories. Our approach can be related to this work in that we also focus on the detection of events in a set of documents gathered in temporal windows. However, our documents are represented by image features and their contextual information, not only text document content.

Some other approaches have been proposed that use information other than time in the event detection process. In [14], [15] location information is used to improve the effectiveness of TDT. In the first work, a retrospective event detection on unstructured history documents is presented. Statistical measures are used to analyze the frequency of co-occurrence between date and place names over sentences and paragraphs. Based on this information, events can be extracted and ranked. In the second work, the authors present


an analysis of the contribution of place and time information in the event tracking domain. Here, place names are extracted automatically from newspaper articles. The main idea of this work is to apply named entity recognition to gather place information from documents. This approach has not been extensively treated in TDT. However, while TDT can indeed benefit from the place information, the authors recognize that place names are difficult to process because of the incurred ambiguities – e.g., Washington as a person vs. a place name.

Focusing on approaches that use visual features to track events, few have been proposed. An example is the work by Li and Fei-Fei [16]. In this paper, the authors proposed a system to recognize events in static images using visual content. Classification is performed on pictures of sport events by interpreting the semantic elements in the image, and is done by integrating scene and object categorization. Another example is the work by Loui and Savakis [17]. In this paper, the authors proposed a novel approach for the automatic generation of albums (also called automatic albuming) from a collection of personal photos. In particular, the presented system includes an event clustering algorithm that works on two levels, date/time and visual content. The goal of automatic albuming is to help users organize their pictures into a story. A story is here an organized set of photos with the appropriate context information that participates in the interpretation of the photos. Personal photos are then classified using the combination of date and time, and the correlation between picture visual contents.

Finally, in [18] event categorization is treated as a multi-class classification problem. The focus is first on discovering and mining compositional features, which are then applied for classification. GPS traces are used as series of geotags, fused with visual content, in order to classify images into predefined events. This approach uses an AdaBoost classifier for the classification.

III. PROBLEM DEFINITION

Let F be our data collection, consisting of a collection of images downloaded from Flickr, from a certain geographical area and over a period of time. Each image I ∈ F will be represented in the following way:

I = {T, g, dt} (1)

where T represents the set of annotations for the image, including the title of the image, dt denotes the date and time when the photo was taken, and g = (lat, lon) is the pair of real numbers representing latitude and longitude coordinates. All this information is available in every Flickr image we gathered.

As mentioned in Section I, an event may be represented with three parameters: the time when the photo was taken, the geographical location, and the tags. Images relating to an event can be grouped within the same time slot, within the same geographical area, and preferably associated with one or more tags. However, the converse does not hold, meaning that we cannot necessarily group images in time and geographical area and tag the cluster of images as a specific event. To illustrate, a group of photos of the Tour Eiffel taken on a certain day can hardly be used to describe an event. An event detector that does not take this into account would risk creating a very high number of false positives. To further eliminate false positive hits, we also need the following assumption. For an event, the set of images taken in an area at a given time with a particular tag is the same set of images taken in the same area with the same tag (without taking the time into account). In this case, the event is identified by the pattern p_i = [Time_i Geo_i Tag_ij].
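The image representation of Eq. (1) and the event pattern above can be sketched as a small data model. This is only an illustration: the class and field names below are our own choices, not structures from the paper.

```python
from dataclasses import dataclass

# Hypothetical sketch of the image representation I = {T, g, dt} from Eq. (1).
@dataclass
class FlickrImage:
    tags: frozenset        # T: the set of annotations, including title terms
    g: tuple               # (lat, lon) pair of real numbers
    dt: str                # date/time when the photo was taken (ISO string here)

def event_pattern(time_bucket: str, geo_bucket: str, tag: str) -> tuple:
    """An event is identified by the pattern p_i = [Time_i Geo_i Tag_ij]."""
    return (time_bucket, geo_bucket, tag)

img = FlickrImage(tags=frozenset({"newyear", "paris"}),
                  g=(48.8584, 2.2945), dt="2009-01-01T00:05")
p = event_pattern("Jan2009", "48.86:2.29", "newyear")
```

The pattern tuple makes the assumption explicit: one event corresponds to exactly one combination of a time slot, a geographical cell, and a tag.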

The goal of our system is to extract the set of events E = {e1, ..., en} from a set of geo-tagged and labelled images F in accordance with our assumption. Every ei = {I1, ..., Ik} will group together the set of pictures representing an event. Every ei will also be labelled. The labelling process consists of two steps: (1) selection of terms directly induced from the Suffix Tree data structure to the clusters, and (2) selection of terms inferred by the analysis of the subtrees adjacent to an event cluster.

IV. PRELIMINARY

The core of our system is based on an incremental algorithm called Suffix Tree Clustering (STC), mainly used for clustering documents. This type of clustering algorithm was first used in [1] and has since been widely used for clustering text documents and web documents [19]. To our knowledge, this is the first time this algorithm is used for clustering a large dataset such as Flickr images. STC is snippet tolerant, meaning that it can produce high quality clusters from document snippets instead of whole documents (see also below). This was also the main reason we chose to use this algorithm in our work. Here, every set of annotations of an image will be considered as the snippet for the clustering process. STC consists of three logical steps: document cleaning, identifying base clusters using a Suffix Tree, and merging the base clusters. Each cluster is labelled using the Suffix Tree structure. The designated clustering algorithm has the following features:

• Phrase-based model: this clustering algorithm tends to group documents with common sub-phrases together. STC does not treat documents as bags of words but as sequences of words.

• Linear time: the construction of the Suffix Tree can be done in linear time with respect to the number of items [20].

• Snippet-tolerance: in document clustering, the algorithm only uses snippets rather than entire documents to produce high quality clusters. Snippets are short representative parts of a text document. In our experiments an image is represented by the title and the annotations, which in turn form a set of words representing that image.

• Overlap: documents may be shared between different clusters. This means that they can belong to different topics. In the image context, an image can contain several semantic concepts and may be labelled or classified with different labels/concepts.

Compared to existing work, the uniqueness of our approach is the way we use a fast, incremental algorithm originally proposed for document clustering to group together large collections of similar images. This is achieved by expanding the annotations such that we can both deal with incomplete annotations and apply more semantics to the clustering process.

V. EXTRACTING EVENTS FROM EVENT CLUSTERS

The basic idea underlying our approach is as follows. First, following the discussion in the previous section, image clustering is performed on the document snippets extracted from the image tags. The clustering process produces a set of candidate event clusters. These are groups of images that most likely represent events. This set of clusters will be filtered following our hypothesis to produce the event clusters. After this, overlapping clusters are merged into one cluster. Finally, the event clusters are labelled by analyzing the temporally/geographically adjacent clusters.

A. System Overview

Figure 1 shows how our system is built up. It can be divided into two parts: an offline part and an online part. The former deals with the offline data collection and construction of support structures, while the latter deals with the construction of clusters and extraction of events. This part also includes a refining step – mentioned above – to merge adjacent subtree clusters, with event annotations and the event itself.

Figure 1. Overview of the system

In addition to the main parts, the system also has submodules that we elaborate on below.

B. Cleaning

The first step after acquiring the data is a preprocessing step, consisting of cleaning the image annotations. Every representation of an image I = (T, g, dt) is transformed into I′ = (T′, g, dt), where all stopwords are removed from the annotation and stemming is performed. Many annotations in Flickr images have semantically irrelevant terms, or terms so common that they do not contribute much to the discrimination of clusters. For example, camera names like Nikon, Canon or other terms like jpeg and geotag are very common and may safely be removed. Other common terms that can be omitted are those referring to time information such as January, February or the short versions Jan, Feb. Further, we can safely remove frequent words containing digits such as Oct2009 and 12May. Although they contain intrinsic temporal information, they are not useful in the clustering process. For this reason, we decided to remove all terms containing digits. The list of terms in the stopword vocabulary can be found in Table I.

As can be inferred here, the main benefit of this step is the reduction of the space needed to construct the Suffix Tree. In addition, stopword removal also contributes to improving the quality of the clustering process by avoiding noise in the annotations. Further, by applying stemming we may both reduce the search space and improve the retrieval performance [21]. Here, we apply the Porter Stemmer algorithm [22] to reduce inflected and derived words to their stems.
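The cleaning step can be sketched as follows. This is a minimal illustration: the tiny STOPWORDS set stands in for the paper's extended Flickr stopword vocabulary (Table I), and stem() is a trivial suffix-stripper standing in for the Porter stemmer that is actually used.

```python
import re

# Stand-in for the extended stopword vocabulary (camera names, time terms, ...).
STOPWORDS = {"nikon", "canon", "jpeg", "geotag", "january", "jan"}

def stem(term: str) -> str:
    # Trivial suffix-stripping stand-in for the Porter stemmer [22].
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def clean_tags(tags):
    out = []
    for t in tags:
        t = t.lower()
        if t in STOPWORDS:
            continue                 # semantically irrelevant or too common
        if re.search(r"\d", t):
            continue                 # drop terms containing digits (Oct2009, 12May)
        out.append(stem(t))
    return out

print(clean_tags(["Nikon", "Oct2009", "buildings", "venice"]))
# ['building', 'venice']
```

The digit filter implements the rule stated above: any term containing a digit is removed, even though it may carry intrinsic temporal information.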

C. Annotation Expansion

After the cleaning is done, the tag set of an image is reduced in size. Thus, I′ = (T′, g, dt) is now smaller than I. However, to further improve the clustering process, we have to extend the tags again. In this step we extend the tags by including the information about time and location from the EXIF data. Note that this expansion will not have any negative effects on the performance of the algorithm, since we can construct the Suffix Tree in linear time, and search within the tree is logarithmic.

Formally, let I′ = (T′, g, dt) be the image representation, where T′ = {t′1, t′2, ..., t′l} is the set of tags associated with the picture I′. Each t′i ∈ T′ may be a term or a sequence of terms. A tag extension process is then applied to I′. The output of that process is the set I″ = (T″, g, dt), where T″ = {t″1, t″2, ..., t″l} and

t″i = s1(dt) + s2(g) + t′i (2)

Figure 2 shows an example of how tag expansion is performed. The original tag set is shown on the left side: T′ = {[venice], [architecture]}. After the extension with the time-string and location-string, the set becomes: T″ = {[Oct2008 45.44:12.33 venice], [Oct2008 45.44:12.33 architecture]}. The functions s1 and s2 denote the transformation from date and geographical position to strings – i.e., a time and space discretization.

The functions s1 and s2 thus define the granularity at which time and space are divided. In Section VI, we analyze the effect on event extraction performance when the granularities of time and space vary.
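The expansion of Eq. (2) can be sketched directly. Assumed here: a month-level s1 and a two-decimal s2, which reproduce the example of Figure 2; these are only one possible choice of granularity, which Section VI varies.

```python
from datetime import datetime

# One possible time discretization s1: month granularity, e.g. "Oct2008".
def s1(dt: datetime) -> str:
    return dt.strftime("%b%Y")

# One possible space discretization s2: two-decimal lat/lon cells, e.g. "45.44:12.33".
def s2(g: tuple) -> str:
    lat, lon = g
    return f"{lat:.2f}:{lon:.2f}"

# Eq. (2): t''_i = s1(dt) + s2(g) + t'_i, applied to every tag of the image.
def expand_tags(tags, dt, g):
    prefix = [s1(dt), s2(g)]
    return [prefix + list(t) for t in tags]

dt = datetime(2008, 10, 5)
g = (45.44, 12.33)
print(expand_tags([["venice"], ["architecture"]], dt, g))
# [['Oct2008', '45.44:12.33', 'venice'], ['Oct2008', '45.44:12.33', 'architecture']]
```

Because every expanded tag starts with the time and space strings, images from the same time slot and geographical cell share a common phrase prefix in the Suffix Tree.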


Figure 2. Example of annotation expansion

D. Suffix Tree Construction

At this point, the set T″ in the triplet denoting I″ will be the snippet representing the image. The Suffix Tree will be built on this set and stored on the file system. This is illustrated in Figure 3. In accordance with the original definition of Suffix Trees, our Suffix Tree for a given string S is a compact trie containing all the suffixes of S. A Suffix Tree must satisfy the following properties:

• It is a rooted directed tree.
• Each internal node, other than the root, has at least two children.
• Each edge leaving a particular node is labelled with a non-empty substring of S of which the first symbol is unique among all first symbols of the edge labels of the edges leaving this particular node.
• For each suffix s of S, there exists a suffix-node whose label equals s.

As mentioned before, every image is represented by a collection of tags. Each tag is formed by one or more terms, and each Suffix Tree groups all terms with the same suffix. Every node, representing a possible cluster, is called a base cluster and has a unique label. For a specific node, the label is composed of the sequence of labels of all the nodes that have to be passed through when traversing the tree from the root to the node.
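The relationship between phrases and base clusters can be illustrated with a deliberately naive enumeration: every node of the Suffix Tree corresponds to a phrase (the concatenated edge labels from the root), and its base cluster is the set of documents containing a suffix that starts with that phrase. A real implementation would build the compact trie in linear time [20]; the quadratic sketch below is only meant to show which base clusters arise.

```python
from collections import defaultdict

def base_clusters(snippets):
    """Map each phrase (tuple of words from the root) to its base cluster.

    Naive stand-in: enumerate every prefix of every word-level suffix.
    """
    clusters = defaultdict(set)              # phrase -> set of document ids
    for doc_id, words in snippets.items():
        for i in range(len(words)):          # every suffix of the word sequence
            suffix = words[i:]
            for j in range(1, len(suffix) + 1):
                clusters[tuple(suffix[:j])].add(doc_id)
    return clusters

# Two expanded snippets sharing the same time/space prefix (cf. Figure 2).
snips = {1: ["Oct2008", "45.44:12.33", "venice"],
         2: ["Oct2008", "45.44:12.33", "architecture"]}
bc = base_clusters(snips)
print(bc[("Oct2008", "45.44:12.33")])        # both images share this prefix node
```

Here the node labelled with the time and geo strings groups both images, which is exactly the kind of base cluster the event detection step inspects.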

E. Event Detection

All the base clusters labelled with a time tag and geo tag are possible events. We can detect them by traversing the Suffix Tree. When this is finished, a set of base clusters Si can be extracted. Each such cluster will then be tagged with the sequence [tagtime taggeo tagtext]. This means that each cluster is a group of images based on a certain time slice and a certain geographical square.

Further, as mentioned before, not all extracted clusters can be events. To avoid false positive hits, a further step is needed. Let S′i, labelled as [taggeo tagtext], be the cluster extracted from the Suffix Tree. Our hypothesis is that if a cluster Si contains a series of images related to an event, then this series will be the same set of images as in the cluster S′i. This image set will also be the collection of photos taken at geographical position taggeo and tagged with the tag tagtext.

Figure 3. Suffix Tree construction

This hypothesis always holds because a tag representing an event can only belong to a single combination of date/time and a geographical area. It still holds even if we have a situation where an object (image object) or a place appears in several images taken over a long period of time. Although the tags for these images will have the same geo tags but different time tags, their combinations are still unique. To capture this, in our tree structure, the images in Si need to be a subset of the set of images in S′i.
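The false-positive filter above reduces to a simple set comparison: a [time, geo, tag] base cluster is kept as an event only if its image set equals the set of images carrying the same [geo, tag] combination regardless of time. The sets below are illustrative stand-ins for the clusters extracted from the tree.

```python
def is_event(time_geo_tag_cluster: set, geo_tag_cluster: set) -> bool:
    """S_i qualifies as an event only if it equals S'_i.

    S_i (images with [time, geo, tag]) is by construction a subset of
    S'_i (images with [geo, tag]); equality means the tag's occurrences
    are concentrated in a single time slot.
    """
    return time_geo_tag_cluster == geo_tag_cluster

# A landmark photographed across many months is not an event ...
eiffel_one_slot = {10, 11}            # images with [May2009, geo, "eiffel"]
eiffel_all_time = {10, 11, 12, 13}    # images with [geo, "eiffel"] over all time
assert not is_event(eiffel_one_slot, eiffel_all_time)

# ... while a one-off happening is: all its photos fall in one time slot.
concert = {20, 21, 22}
assert is_event(concert, set(concert))
```

This mirrors the Tour Eiffel example: the landmark's photos spread over many time slots, so its time-sliced cluster is a strict subset and is rejected.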

F. Event Clusters Labelling

Let E = {e1, ..., en} be the set of events extracted in the previous step. Each event will be a base cluster and consists of a set of images labelled with lei = [s1(dt) + s2(g) + tki], where tki is directly derived from the tree data structure.

Then the subtree associated with the event cluster ei will be analyzed to expand the tag label, by traversing the subtree from the root to each leaf and collecting the labels on the branches/nodes.

Now, the issue to be solved is that there may be two event clusters ej and ek representing the same event. This happens when events tagged with a set of terms on the same subject appear in different clusters. The Suffix Tree Clustering algorithm (STC) may produce overlapping clusters. Therefore, one or more pictures can belong to more than one group, and thus appear in more than one cluster. A way to solve this issue is as follows. First we analyze the clusters to see if their tags can be related to the same event. If this is true, then we merge them into a single cluster. Otherwise, we keep them separate. The label for the new cluster will be a label derived from merging the labels of the overlapping clusters.

To be more specific, we perform two merging steps:
• merge two or more clusters representing the same event from the extracted set of clusters E into one cluster.
• add the set of images not previously included in E and not respecting the hypothesis in Section V-E to the event clusters.


As can be inferred from this, all the semantically similar event clusters of E are merged into a single cluster. This similarity is computed using the following similarity function:

Ψ(ei, ej) = |ei ∩ ej| / min(|ei|, |ej|)    (3)

This means that Ψ(ei, ej) measures the degree of overlap between two event clusters ei and ej. Inspired by [1], we will build an event cluster graph to facilitate the merging of similar clusters, based on function (3). The nodes of this graph will be the event clusters. If two nodes for two event clusters ei and ej have the similarity value Ψ(ei, ej) = 1, then we add an edge between ei and ej. So, to find out which clusters to merge into single clusters, we traverse the graph to find all connected nodes and merge these.
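Under this reading, the merging step amounts to finding connected components among clusters linked by full overlap of the smaller cluster (Ψ = 1). A minimal sketch under our own naming, with a union-find standing in for the explicit graph traversal:

```python
# Sketch of the cluster merging step: clusters are sets of image ids,
# psi is our reading of Eq. (3), and linked clusters are unioned.

def psi(ei, ej):
    """Overlap similarity between two non-empty clusters (sets of ids)."""
    return len(ei & ej) / min(len(ei), len(ej))

def merge_clusters(clusters, threshold=1.0):
    """Union all clusters connected by edges with psi >= threshold."""
    n = len(clusters)
    parent = list(range(n))               # union-find over cluster indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if psi(clusters[i], clusters[j]) >= threshold:
                parent[find(i)] = find(j)  # add edge i--j

    merged = {}
    for i, c in enumerate(clusters):
        merged.setdefault(find(i), set()).update(c)
    return list(merged.values())
```

With Ψ = 1 as the edge condition, {1, 2, 3} and {2, 3} end up in one component and are merged, while a disjoint {7, 8} stays separate.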

Next, we will traverse the Suffix Tree to find all events not in E that have the same geo and time prefix. Each such event ei will be a candidate for merging. Now we can use the above similarity function to decide whether to merge or not.

VI. EVALUATION

In this section, we present the results of our experiments. The main goal is to test the performance of our method. First, after a brief presentation of our dataset, we will investigate the effects of cleaning the tags, including the use of an extended stopword list for stopword removal. Second, we will analyze the performance (in terms of accuracy) of the extraction of the event clusters. Finally, we will evaluate the performance over different space and time granularities.

A. Data Set

We use the Flickr API5 to collect our data set from Flickr. We gathered all the images taken in the period from 12 June 2008 to 11 June 2010, a temporal range of 729 days. Only geotagged pictures were considered. All the pictures come from an area of 290 square kilometers in the San Francisco area. The spatial area is a rectangle with a minimum latitude of 37.6817, maximum latitude of 37.8229, minimum longitude of -122.5495 and maximum longitude of -122.3408 (see Figure 4). The total number of images collected for our dataset is 342357.

For each image in this dataset we considered the title ofthe picture as part of the tag set.

Let tag be a single annotation of the set of tags for the picture – e.g., Tour Eiffel is one tag – and let term be a single word of the tag – i.e., the tag Tour Eiffel contains two terms. This means that in our dataset there are 2943870 tags, of which 28875 are unique. Further, the number of terms is 4839082, of which 26610 are unique. Every image has an average of 8.6 tags. Each term is used by 43 users on average, and the maximum term usage is by 316 users, for the word california. Finally,

5 http://www.flickr.com/services/api

Figure 4. San Francisco Area considered for gathering Flickr photos

Temporal terms:       january, february, march, april, may, june, july, august, september, october, november, december, jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec, summer, winter, autumn, fall, spring
Camera related terms: agfa, canon, nikon, tamron, sony, powershot, pentax, eos, reflex, polaroid, epson
General noise:        flickr, photo, image, picture, geotagged, geotag, geo, jpeg, cc, stockphoto, resolution, iphone, ipod, img, dsc, raw, lat, lon, jpg, gif

Table I. EXTENDED STOPWORD LIST

4655 tags are used by only one user, and every tag is used on average by 2.21 users.

Regarding the term picture frequency, the most frequent term california was used in 218716 pictures, and 2956 terms are used in only one picture.

B. Data Set Tag Cleaning

Again, we need the cleaning process to remove noise and to keep the most representative tags for a picture. In this experiment we first removed all stopwords using a standard English stopword list. Then, we extended this list with the most common terms, as listed in Table I, and omitted them as well. Next, all tags used by only one user were removed. We also removed those consisting of terms used in more than 10000 pictures. Further, we only kept terms/tags containing letters and removed those with digits. Finally, stemming was applied to each term using the Porter Stemmer algorithm.
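The cleaning steps can be sketched as a small filtering pipeline. This is our own illustration, not the paper's code: the stopword list shown is a tiny excerpt, the helper names are ours, and real Porter stemming (e.g. via NLTK) is replaced by an identity placeholder to keep the sketch dependency-free.

```python
# Hypothetical sketch of the tag cleaning pipeline: stopword removal,
# letters-only filtering, single-user tag removal, and removal of terms
# used in more than 10000 pictures.

from collections import Counter

STOPWORDS = {"the", "a", "of", "flickr", "photo", "geotagged", "california"}
MAX_PICTURE_FREQ = 10000    # drop terms used in > 10000 pictures
MIN_USERS_PER_TAG = 2       # drop tags used by only one user

def stem(term):
    return term              # placeholder for the Porter stemmer

def clean(picture_tags, tag_users, term_picture_freq):
    """picture_tags: terms annotating one picture;
    tag_users: Counter of distinct users per tag;
    term_picture_freq: Counter of pictures per term."""
    kept = []
    for term in picture_tags:
        term = term.lower()
        if term in STOPWORDS:
            continue
        if not term.isalpha():                    # letters only, no digits
            continue
        if tag_users[term] < MIN_USERS_PER_TAG:
            continue
        if term_picture_freq[term] > MAX_PICTURE_FREQ:
            continue
        kept.append(stem(term))
    return kept
```

A camera-file tag like "dsc001" is dropped by the letters-only rule, while an overly frequent term such as "california" is caught by the extended stopword list.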

Table II shows the effect of the reduction of the number of tags and terms after the cleaning process, in terms of reduced space usage for the Suffix Tree data structure.

Note that after the tag cleaning process is completed, we may safely remove all images that no longer have tags.

C. Space and Time Granularity Analysis

As mentioned previously, we extended the tags with time and location strings. The two functions s1 and s2 transform


                 Before Cleaning   After Cleaning
#Images          342357            305476
#Tags            2943870           103899
#unique Tags     28875             1142
#Terms           4839082           103899
#unique Terms    26610             1142
Avg. Tags/Pict.  8.599             0.30
Avg. Terms/Tag   1.64              1.0

Table II. EFFECT OF THE CLEANING PROCESS ON THE TAGS AND TERMS

the real values of space and time into two strings. Then, the time is discretized into a time window slice and the location coordinates into a square.

Our analysis of the algorithm performance will be done on two different time granularities and four different discretizations of the space. We consider time slices of 1 day and 1 week. For the location we will consider squares of 0.001, 0.002, 0.005 and 0.01 decimal precision in latitude and longitude units – i.e., squares with side lengths of approximately 111, 222, 555 and 1000 m.
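One possible reading of the discretization functions is that s1 truncates a timestamp to the start of its time slice and s2 snaps coordinates to the corner of their grid square. The sketch below follows this reading; the exact string formats and the function signatures are our assumptions, not taken from the paper.

```python
# Hypothetical sketch of the s1/s2 encoding functions: time is binned
# into fixed-size slices, coordinates into fixed-size squares.

from datetime import datetime, timedelta

def s1(dt, slice_days=1):
    """Map a timestamp to the start of its time window, as a string."""
    epoch = datetime(1970, 1, 1)
    slices = (dt - epoch) // timedelta(days=slice_days)  # whole slices
    start = epoch + slices * timedelta(days=slice_days)
    return start.strftime("%Y-%m-%d")

def s2(lat, lon, precision=0.001):
    """Map coordinates to their grid square (side ~111 m at 0.001 deg)."""
    q = lambda x: round((x // precision) * precision, 6)  # snap to grid
    return f"{q(lat)}_{q(lon)}"
```

All photos falling into the same day (or week) and the same square then share identical s1 and s2 prefixes, which is what places them under a common branch of the Suffix Tree.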

We will evaluate the clustering performance using information retrieval metrics. This is because we use the clustering results to retrieve event clusters, and to rank them according to a score function based on the cardinality of each cluster. Such an evaluation has been done before in [19] and [1]. Unfortunately, we do not have any ground truth to evaluate our results. For this reason we manually checked each cluster to decide whether it is an event cluster or not. For this purpose we evaluated the top-20 clusters extracted from the execution. The precision values were then computed at each rank level. This computation was done for each combination of granularity of time and space. Thus, we have 8 top-20 lists, which we compare against each other.
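The precision-at-rank computation over a manually judged top-20 list is straightforward; the following toy illustration (our own, with a hypothetical judgement list) shows how a single column of Table III is produced.

```python
# Precision at rank k over a ranked list of manual judgements:
# True = the cluster was judged to be an event cluster.

def precision_at(judgements, k):
    """Fraction of event clusters among the first k ranked clusters."""
    top = judgements[:k]
    return sum(top) / len(top)

# e.g. a ranked list where the 5th cluster is not an event:
judged = [True, True, True, True, False]
p5 = precision_at(judged, 5)      # 4 events in the top 5 -> 0.8
```

This matches the pattern in Table III, where, for instance, 4 events among the top 5 clusters yields a precision of 80%.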

Table III shows the results from our experiments with both 1 day and 1 week granularity over the different space granularities. We can observe from these results that we got the best performance with the smallest location square and the smallest time slice. The results also show that the top-20 precision decreased when the dimensions increased. Focusing on the overall result, based on top-20 precision our experiments have shown that we are able to get satisfactory clustering performance. None of our experiments has given us precision values less than 70% (two 75% and three 70% using the smallest square).

VII. CONCLUSIONS

We have presented a novel approach to extract event clusters from a large set of Flickr images. In this approach we have shown that we can consider a Flickr picture as a document snippet, since the annotations are short textual descriptions, especially after the noise has been removed. Here we analyzed the content of the annotations to identify the noise. This includes the removal of stopwords and other common words that do not contribute well to cluster discrimination. Partly as a result of this, we were able to successfully use an incremental clustering approach that has previously been used to cluster text documents only. This means that our hypothesis with respect to the use of a text clustering algorithm on extended tag sets has been verified.

Our evaluation has shown that this approach has good potential. We believe a top-20 precision of 75% for the extracted clusters gives a good indication of this. This also supports our observation that each extracted cluster has high consistency with respect to the number of false positive hits. Further, our experiments have shown that we get the best results with the smallest time slice and the smallest square side dimension, but the results are still satisfactory with the largest square and time slice dimensions.

The main drawback of our approach is that, due to the lack of a good evaluation dataset, we were unable to evaluate our approach against any prelabelled cluster dataset and ground truth. As part of our future work, we will therefore develop a dataset that can be labelled by different users and try to use this in our evaluation. Then subjective tests will be performed to evaluate the quality of the clusters. Other future improvements will go in the direction of more sophisticated tree and cluster analysis for the annotation induction of the clusters and the extension of Flickr picture annotations.

REFERENCES

[1] O. Zamir and O. Etzioni, "Web document clustering: a feasibility demonstration," in SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM, 1998, pp. 46–54.

[2] Y. Yang, T. Pierce, and J. Carbonell, "A study of retrospective and on-line event detection," in SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM, 1998, pp. 28–36.

[3] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang, "Topic detection and tracking pilot study final report," in Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 1998, pp. 194–218.

[4] L. Chen and A. Roy, "Event detection from flickr data through wavelet-based spatial analysis," in CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management. New York, NY, USA: ACM, 2009, pp. 523–532.

[5] G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu, "Parameter free bursty events detection in text streams," in VLDB '05: Proceedings of the 31st international conference on Very large data bases. VLDB Endowment, 2005, pp. 181–192.

[6] J. Allan, R. Papka, and V. Lavrenko, "On-line new event detection and tracking," in SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM, 1998, pp. 37–45.


              100 m               200 m               500 m               1000 m
          1 Day     1 Week    1 Day     1 Week    1 Day     1 Week    1 Day     1 Week
#Cluster  #Ev Prec  #Ev Prec  #Ev Prec  #Ev Prec  #Ev Prec  #Ev Prec  #Ev Prec  #Ev Prec
 1         1 100%    1 100%    1 100%    1 100%    1 100%    1 100%    1 100%    1 100%
 2         2 100%    2 100%    2 100%    2 100%    2 100%    2 100%    2 100%    1  50%
 3         3 100%    3 100%    3 100%    3 100%    3 100%    3 100%    3 100%    2  67%
 4         4 100%    4 100%    4 100%    4 100%    3  75%    4 100%    3  75%    2  50%
 5         4  80%    5 100%    4  80%    5 100%    4  80%    4  80%    4  80%    2  40%
 6         5  83%    6 100%    5  83%    6 100%    4  67%    5  83%    4  67%    2  33%
 7         5  71%    6  86%    5  71%    6  86%    5  71%    5  71%    5  71%    3  43%
 8         6  75%    7  88%    6  75%    7  88%    5  63%    5  63%    5  63%    4  50%
 9         7  78%    7  78%    7  78%    8  89%    5  56%    6  67%    5  56%    5  56%
10         8  80%    7  70%    7  70%    8  80%    6  60%    6  60%    6  60%    6  60%
11         9  82%    8  73%    8  73%    8  73%    7  64%    6  55%    7  64%    7  64%
12        10  83%    9  75%    8  67%    9  75%    8  67%    7  58%    8  67%    7  58%
13        11  85%   10  77%    9  69%    9  69%    9  69%    8  62%    9  69%    8  62%
14        11  79%   11  79%   10  71%   10  71%    9  64%    9  64%   10  71%    9  64%
15        11  73%   12  80%   11  73%   10  67%   10  67%   10  67%   10  67%   10  67%
16        12  75%   12  75%   12  75%   11  69%   11  69%   10  63%   11  69%   11  69%
17        13  76%   13  76%   12  71%   11  65%   12  71%   10  59%   12  71%   12  71%
18        13  72%   13  72%   13  72%   12  67%   13  72%   11  61%   13  72%   13  72%
19        14  74%   13  68%   14  74%   13  68%   13  68%   12  63%   13  68%   13  68%
20        15  75%   14  70%   15  75%   14  70%   14  70%   13  65%   13  65%   14  70%

Table III. PRECISION OF EVENT CLUSTER EXTRACTION OVER DIFFERENT GRANULARITIES OF SPACE AND TIME

[7] T. Brants, F. Chen, and A. Farahat, "A system for new event detection," in SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM, 2003, pp. 330–337.

[8] D. Trieschnigg and W. Kraaij, "Hierarchical topic detection in large digital news archives: Exploring a sample based approach," Journal of Digital Information Management, vol. 3, no. 1, 2005.

[9] M. Das and A. C. Loui, "Detecting significant events in personal image collections," in ICSC '09: Proceedings of the 2009 IEEE International Conference on Semantic Computing. Washington, DC, USA: IEEE Computer Society, 2009, pp. 116–123.

[10] T. Rattenbury, N. Good, and M. Naaman, "Towards automatic extraction of event and place semantics from flickr tags," in SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM, 2007, pp. 103–110.

[11] H. Becker, M. Naaman, and L. Gravano, "Learning similarity metrics for event identification in social media," in WSDM '10: Proceedings of the third ACM international conference on Web search and data mining. New York, NY, USA: ACM, 2010, pp. 291–300.

[12] K. Zhang, J. Zi, and L. G. Wu, "New event detection based on indexing-tree and named entity," in SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM, 2007, pp. 215–222.

[13] M. Hu, A. Sun, and E.-P. Lim, "Event detection with common user interests," in WIDM '08: Proceeding of the 10th ACM workshop on Web information and data management. New York, NY, USA: ACM, 2008, pp. 1–8.

[14] D. A. Smith, "Detecting events with date and place information in unstructured text," in JCDL '02: Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries. New York, NY, USA: ACM, 2002, pp. 191–196.

[15] Y. Jin, S. H. Myaeng, and Y. Jung, "Use of place information for improved event tracking," Inf. Process. Manage., vol. 43, no. 2, pp. 365–378, 2007.

[16] L.-J. Li and L. Fei-Fei, "What, where and who? classifying events by scene and object recognition," in IEEE International Conference on Computer Vision, 2007, pp. 1–8.

[17] A. C. Loui and A. E. Savakis, "Automated event clustering and quality screening of consumer pictures for digital albuming," IEEE Transactions on Multimedia, vol. 5, no. 3, pp. 390–402, 2003.

[18] J. Yuan, J. Luo, H. Kautz, and Y. Wu, "Mining gps traces and visual words for event classification," in MIR '08: Proceeding of the 1st ACM international conference on Multimedia information retrieval. New York, NY, USA: ACM, 2008, pp. 2–9.

[19] J. A. Gulla, H. O. Borch, and J. E. Ingvaldsen, "Contextualized clustering in exploratory web search," pp. 184–207, 2008.

[20] E. Ukkonen, "On-line construction of suffix trees," Algorithmica, vol. 14, no. 3, pp. 249–260, September 1995.

[21] J. Kamps, C. Monz, M. de Rijke, and B. Sigurbjornsson, "Language-dependent and language-independent approaches to cross-lingual text retrieval," in CLEF, 2003, pp. 152–165.

[22] M. F. Porter, "An algorithm for suffix stripping," pp. 313–316, 1997.
