Clustering and Exploring Search Results using Timeline Constructions Authors : Omar AlonsoOmar...

75
Clustering and Exploring Search Results using Timeline Constructions Authors : Omar Alonso (University of California) Michael Gertz (University of Heidelberg, Germany) Ricardo Baeza -Yates (Yahoo! Research, Spain) Presenter : Zhong-Yong lustering and exploring search results using timeline constructions mar Alonso, Michael Gertz, Ricardo Baeza-Yates ages: 97-106 IKM 09 itation : 8 1

Transcript of Clustering and Exploring Search Results using Timeline Constructions Authors : Omar AlonsoOmar...

1

Clustering and Exploring Search Results using Timeline Constructions

Authors : Omar Alonso (University of California)Michael Gertz (University of Heidelberg, Germany)Ricardo Baeza-Yates (Yahoo! Research, Spain)Presenter : Zhong-Yong

Clustering and exploring search results using timeline constructions Omar Alonso, Michael Gertz, Ricardo Baeza-Yates Pages: 97-106 CIKM 09Citation : 8

2

• 這篇論文一開始會先將所有的 document做成temporal document profile (explicit, implicit and relative三種時間 ),然後將所有文件的時間通通拿出來排序,依據排序結果來決定該用何種timeline(year, month,…)當成群的 label。

• 假設是使用 year當成群的 label, 那每一年都是一個 cluster, 每一個 doc就能屬於好多個 cluster。

• Cluster中的排序是依據 doc與所下 query的similarity去計算。

• 當使用者點開 cluster時, cluster中還能以month繼續排序下去,這是 exploratory search的方法。

• 此篇論文的衡量方式是讓使用者覺得滿不滿意,還有 precision來當衡量指標。

3

Outline

• 1. Introduction• 2. Related work• 3. Exploration scenarios• 4. Time annotated document model• 5. Timeline construction and document

exploration• 6. Prototype• 7. Evaluation and results• 8. Conclusions and future work

4

Introduction (1/5)

• Time plays a central role in any information space, and it has been studied in other areas like information extraction, question-answering, and summarization.

• A look at any of the current search engines shows that temporal aspects of documents are exclusively used to sort the hit list by date, which is primarily the date a Web page has been created or last modified.

5

Introduction (2/5)

• Hit list clustering has emerged as an alternative mechanism to present similar documents without requiring the user to go through hundreds of items.

• For example :• If one would like to know the earliest or most

recent paper(s) on that topic or even the period of time when the topic was “popular”, organizing relevant documents along some kind of a timeline would be very helpful.

6

Introduction (3/5)

• Similar scenarios can be envisioned for exploring a news repository.

• For example, how would one search for news about acquisitions a company has made before a particular date or in a particular time period?

7

Introduction (4/5)

• It is important to identify diverse types of temporal information associated with documents.

• A temporal expression can be explicit, such as “May 20, 2007” or “December 5th, 2005”, implicit, such as “New Years Eve 2006” or “Labor Day 2001”, or relative to a point of narration, such as “yesterday” or “in two weeks”.

8

Introduction (5/5)

• In this paper :• 1. Utilize temporal information extracted from

the documents.• 2. Make this information explicit in the form of

temporal document profiles (tdp).• 3. Arrange documents in the form of clusters

along a timeline supporting multiple time granularities.

9

Related work (1/8)

• There is some research on using time for a different search applications but only little work has been done on exploiting temporal information associated with documents for clustering and exploring search results.

10

Related work (2/8)

• New research has emerged for future retrieval where temporal information can be used for searching the future.

R. Baeza-Yates. Searching the Future. In SIGIR Workshop MF/IR, 2005. (Citation : 11)

The idea is to use news information to obtain future possible events and then search events related to our current (or future) information needs.

11

Related work (3/8)

• Google has added the view : timeline feature to display search results along a timeline, allowing a limited exploration of a hit list.

12

Related work (4/8)

• Another technique related to the approach in this paper is hit list clustering.

• Hit list clustering groups search results into categories that are derived from the actual search.

O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proc. of 21st International ACM SIGIR Conference, ACM, 46–54, 1998. (C: 811)

13

Related work (5/8)

• Current hit list clustering engines like Vivisimo rely on a separate search engine that provides some information like Web page title, URL, and document snippets for the construction of the clusters.

14

Related work (6/8)

16

Related work (8/8)

• In contrast to most of the approaches mentioned above, the authors establish a solid foundation to combine the aspects of :

• 1. Extracting various types of temporal information from documents.

• 2. Clustering and organizing document based on temporal data.

• 3. Visualizing such information in an exploratory search interface that helps users to study.

17

Exploratory Scenarios (1/6)

• Start the research by conducting a series of user surveys about timelines.

• In the first user study, performed a survey among 30 persons (graduate students and faculty) regarding temporal information.

• The same user survey was conducted using AMT(Amazon Mechanical Turk), using a crowdsourcing paradigm.

• 50 people responded to the survey.

18

19

Exploratory Scenarios (2/6)

1. Do you think current timelines for organizing or clustering search results (such as in Google’s timeline) are useful for some of your daily search activities?

• 76% answered “yes” for question 1.

20

Exploratory Scenarios (3/6)

2. Do you use (or would use) timelines to explore search results?

• 71% answered “yes” for question 2.

21

Exploratory Scenarios (4/6)

3. Please indicate some search scenarios where you use timelines or would like to use timelines to organize search results.Three main categories

22

Given the information need.

23

Exploratory Scenarios (5/6)

4. Please give some examples of search scenarios where current search engines do not sufficiently support the concept of timelines to organize and explore search results?

24

Identify the lack of support in current search engines

25

Exploratory Scenarios (6/6)

5. What other features would you like to see in the context of timelines?

26

Identify presentation and exploration as the main categories where users see the value in using timelines for search.

27

Time annotated document model4.1 Time and Timelines (1/3)

• As the basis for anchoring documents in time, the authors assume a discrete representation of time based on the Gregorian Calendar, with a single day being an atomic time interval called chronon.

• The base timeline, denoted Td, is an interval of consecutive day chronons.

• For example, the sequence “March 12, 2002; March 13, 2002; March 14, 2002”.

28

Time annotated document model4.1 Time and Timelines (2/3)

• Contiguous sequences of chronons can be grouped into larger units called granules, such as weeks, months, years, or decades.

• An example of a week chronon in Tw is “3rd week of 2005”.

29

Time annotated document model4.1 Time and Timelines (3/3)• Assume the four timelines T = {Td, Tw, Tm, Ty}.

• Relationship Tj>>Ti if timeline Tj is composed of granules of timeline Ti.

• There are Ty>>Td, Ty>>Tm, Tm>>Td, but not Tm>>Tw as months are composed of days and not weeks.

• For two chronons, ti, tjЄT, ti≠tj , and then either ti<T tj or tj<T ti.

• For example, for the two day chronons ti = “March 12, 2004” and tj = “January 5, 2004”, tj <Td

ti holds.

30

Time annotated document model4.2 Temporal Expressions (1/5)

• The first type of such information is the document metadata, which appears as the date a document d belongs to D has been created or last modified.

• Denote as document timestamp d.ts.

31

Time annotated document model 4.2 Temporal Expressions (2/5)

• The second type of temporal information is a little bit more involved as it relates to the linguistic analysis of the textual content of documents.

32

Time annotated document model 4.2 Temporal Expressions (3/5)

• Explicit temporal expressions : describe chronons in some timeline, such as an exact date or year.

• For example, “December 2004” is an explicit expression that is anchored in the timeline Tm.

F. Schilder and C. Habel. (2001) From Temporal Expressions to Temporal Information: Semantic Tagging of News Messages. In ACL’01 Workshop on Temporal and Spatial Information Processing, 1–8. (C : 107)

33

Time annotated document model 4.2 Temporal Expressions (4/5)

• Implicit temporal expression : such as names of holidays or events.

• For example, the token sequence “Columbus Day 2006” in the text of a document can be mapped to the expression “October 12, 2006”.

• In general, implicit temporal expressions require that at least a year chronon appears in the context of a named event.

34

Time annotated document model 4.2 Temporal Expressions (5/5)

• Relative temporal expressions : represent temporal entities that can only be anchored in a timeline in reference to another explicit or implicit, already anchored temporal expression.

• For example, the expression “today” alone cannot be anchored in any timeline. However, it can be anchored if the document is known to have a creation date as a reference.

35

Time annotated document model4.3 Temporal document profiles (1/4)

• The process of entity extraction is a function denoted tdp (temporal document profile).

• : denotes the set of explicit, implicit, and relative temporal expressions.

• C : the set of chronons from timelines in T={Td, Tw, Tm, Ty}.

• P : the set of positions of temporal expressions in a document.

36

Time annotated document model 4.3 Temporal document profiles (2/4)

Describe the explicit temporal expressions that have been determined in d with their normalized chronons and positions in d.

37

Time annotated document model 4.3 Temporal document profiles (2/4)

Describe implicit temporal expressions.

38

Time annotated document model 4.3 Temporal document profiles (2/4)

Corresponds to the timestamp d.ts of the document d. It is assumed that every document has such a timestamp. For example, if it is known that the document creation times are exact, then the document timestamp should be considered as an explicit temporal expression.

39

Time annotated document model 4.3 Temporal document profiles (2/4)

Describe relative temporal expressions.

40

Time annotated document model 4.3 Temporal document profiles (3/4)

• There are some important properties of a temporal document profile that need to be recognized.

• 1. All chronons ci, i = 1 . . . l, are normalized. That is, all chronons that are elements of the same timeline which belongs to T have the same format.

• For example, all day chronons that have been associated with temporal expressions are represented in the day/month/year format, such as “15/04/1966”.

41

Time annotated document model 4.3 Temporal document profiles (4/4)

• 2. A chronon c can be associated with many explicit, implicit, and relative temporal expressions.

• In fact, the same chronon can even occur several times in a single profile tdp(d) but then at different positions in the document d.

42

A brief summary

• 4.1 Define the time, timeline and the relationship about time.

• 4.2 Define the type of time :• Explicit, implicit and relative• 4.3 Define the expression of the dtp.

43

Timeline construction and document exploration

• Assume that for a query term q against a document collection D, the retrieval algorithm determines a hit list Lq = [d1, d2, . . . , dk] of k documents.

• Given such a hit list, the temporal document profiles are used to construct a time outline for the documents first.

• The documents are then clustered along this timeline based on their document profiles.

44

Timeline construction and document exploration5.1 Constructing a time outline (1/2)

• The first step in organizing documents along a multiple-granularity timeline is to construct a time outline for the documents in the hit list Lq.

• For this, all chronons are extracted from the temporal document profiles of the documents in Lq.

• Denote this multi-set of chronons ch(Lq), defined as follows:

• Note that the elements in ch(Lq) may come from different timelines.

45

Timeline construction and document exploration 5.1 Constructing a time outline (2/2)

• The range, for example, Lq contains a document with a temporal expression mapped to the year 1974 (as lower bound) and another document with a temporal expression mapped to the year 2007 (as upper bound). Ty is chosen as time outline for Lq.

• Time outline is a timeline representation that describes the temporal range of documents in Lq, independent of the “temporal distribution” of documents along this timeline.

46

For exampled1 : 11/11/1990, 11/12/1990, 11/13/1990d2 : 4/1/1992, 4/2/1992d3 : 5/17/1996, 5/20/1996d4 : 9/1/2000, 9/21/2000d5: 3/10/1997ch(Lq)

Find the upper bound and lower bound.

Ty is chosen as time outline for Lq.

47

Timeline construction and document exploration5.2 Document clustering (1/3)

• The timeline chosen as time outline for Lq is used to normalize the chronons in ch(Lq), here according to Ty.

• Denote such a type of normalization of a chronon c based on a time granule g belongs to {y, m, w, d} as normg(c).

• For example, normy(“15/4/1966“) = “1966“, and normm(“15/4/1996“) = “4/1996“).

48

Timeline construction and document exploration5.2 Document clustering (2/3)

• The labels for the initial document clusters for Lq and time granule g are then determined by the following set.

• Assume there are l cluster labels y1, y2, . . . , yl, yj belongs to Ty in chy(Lq) among which the precedence relationship > holds.

49

Timeline construction and document exploration5.2 Document clustering (3/3)

• The documents in a cluster yj , denoted cluster(yj), are then determined as follows:

• There is a main cluster for each document in Lq.• For example, if the chronons associated with d

refer to n different years, the main cluster for d, denoted c_main(d), would be the year for which d has the most chronons.

50

For exampled1 : 11/11/1990, 11/12/1990, 1/1/1991 11/13/1990d2 : 4/1/1992, 4/2/1992d3 : 5/17/1996, 5/20/1996d4 : 9/1/2000, 9/21/2000d5: 3/10/1997ch(Lq)

Find the upper bound and lower bound.

Ty is chosen as time outline for Lq.

Normalize by Ty.

d1 : 1990d1 : 1991d2 : 1992d3 : 1996d4 : 2000d5: 1997chy(Lq)

6 clusters.

The c_main(d1) is the cluster with label 1990.

51

Timeline construction and document exploration5.3 Ranking documents in a cluster (1/4)

• Documents in a cluster should be ranked to reflect the relevance of documents in cluster(yj) with respect to both the cluster label yj and query terms q.

• The ranking of documents in cluster(yj) takes the cluster label yj and thus time into account.

• Key to such a ranking is the distance of the query terms q to the temporal expressions in the documents in cluster(yj).

52

Timeline construction and document exploration5.3 Ranking documents in a cluster (2/4)

• Let matche(d, yj) denote the number of times the query term q occurs together with an explicit temporal expression e, (e, c, p) belongs to tdp(d).

• Let matchi(d, yj) and matchr(d, yj) denote the number of matches with respect to implicit and relative temporal expressions in tdp(d).

• The more often q occurs together with explicit temporal expressions matching yj in d’s sentences, the more relevant document d is in that cluster.

53

Timeline construction and document exploration5.3 Ranking documents in a cluster (3/4)

• In order to deal with scenarios in which no single temporal expression in d matches yj , the authors simply assume that rank(d, yj) = δi.

54

Timeline construction and document exploration5.3 Ranking documents in a cluster (4/4)

• Given two document d, d’ belongs to cluster(yj). d is ranked higher than d’ in cluster(yj), denoted d >yj

d’, if either of the following two conditions holds :

• This “re-ranking” is an important property of the TCluster algorithms and reflects the algorithm’s focus on temporal information.

55

For exampleExplicit temporal expressions :d1 : 11/11/1990, 11/12/1990, 1/1/1991 11/13/1990d2 : 4/1/1992, 4/2/1992d3 : 5/17/1996, 5/20/1996d4 : 9/1/2000, 9/21/2000d5: 3/10/1997ch(Lq)

Find the upper bound and lower bound.

Ty is chosen as time outline for Lq.

Normalize by Ty.

d1 : 1990d1 : 1991d2 : 1992d3 : 1996d4 : 2000d5: 1997chy(Lq)

6 clusters.

The c_main(d1) is the cluster with label 1990.

The ranking of d1 : (assume δi=0.3, δr=0.2)rank(d1, 1990) = 3+0.3+0.2 ,=3.5rank(d1, 1991) = 1+0.3+0.2 ,=1.5rank(d1, 1992) = 0+0.3+0.2 ,=0.5rank(d1, 1996) = 0+0.3+0.2 ,=0.5rank(d1, 1997) = 0+0.3+0.2 ,=0.5rank(d1, 2000) = 0+0.3+0.2 ,=0.5

56

Timeline construction and document explorationCluster exploration

• The exploration typically starts with the timeline constructed in the first step of the TCluster Algorithm, here clusters with year chronon labels cluster(yj), and occurs in a “drill-down” fashion.

• Assume the user chooses the timeline Tm to refine a particular cluster(yj).

57

Timeline construction and document explorationCluster exploration

• The TCluster algorithm then first constructs a time outline for this particular cluster and its documents, based on months.

• Then, the documents in cluster(yj) are partitioned into clusters labeled by month/year chronons, and ranked in respective clusters.

58

Timeline construction and document explorationTemporal Snippets

• A temporal snippet can be considered as a document preview but one that outlines the main events in a document.

• This means that the snippet should contain the query term and the cluster label.

• The TSnippet algorithm consists of candidate sentence selection and sentence ranking for producing adequate snippets.

O. Alonso, R. Baeza-Yates, and M. Gertz. Effectiveness of Temporal Snippets. WSSP Workshop, WWW Madrid, 2009. (C : 3)

59

Prototype

• The prototype realizes two main components :• (1) A back-end that supports corpora processing and storage as

well as index creation.• (2) A processing unit that realizes query processing and

timeline construction and also implements the user interface to the system.

• Corpora processing involves taking a document collection and processing the documents using the Alembic part of speech tagger (POS tagger).

• GUTime temporal tagger recognizes the extents and normalized values of temporal expressions, producing an XML document.

60

PrototypeDocument annotation pipeline

• Given a set of documents :• 1. Extract time-related metadata from the

documents. This is either the creation or last modified date for a document file or the timestamp provided by the Web server for Web pages.

• 2. Run a POS tagger on every document.• 3. Run a temporal expression tagger on the

POS-tagged version of the document.

61

PrototypeExploratory User Interface

62

PrototypeExploratory User Interface

63

Evaluation and ResultsEvaluation guidelines

• There are two important features of a timeline that the authors are interested in: precision and presentation.

• Precision is defined as the fraction of the retrieved documents that are relevant.

• Presentation involves displaying the timeline in a graphical form.

64

Evaluation and ResultsEvaluation guidelines

• 1. Part of the well-known Internet directory DMOZ.

• 2. A document collection in which temporal expression have been manually annotated using TimeML.

• 3. Relevance assessment for queries using a graded relevance scale.

65

DMOZ

DMOZ is a multilingual open content directory, which is constructed and maintained by a community of volunteers.

66

TimeML

67

Evaluation and ResultsDMOZ

• For the experiment, took the DMOZ category “World Cup” in the larger category soccer/football.

• While this DMOZ entry has only 5 categories that represent the past four tournaments and the forthcoming one, TCluster determined 16 more clusters.

68

69

Evaluations and ResultsTime bank

• Uses the TimeBank 1.2 corpus, which contains news articles that have been annotated using TimeML.

• The objective of the experiments was to show that temporal expressions, if correctly identified and made readily accessible to TCluster, can significantly improve the precision and granularity of timelines along which documents are clustered.

70

Evaluations and ResultsTime bank

• Run a set of search queries against this annotated TimeBank document.

• Selected 20 random queries, the usage of temporal expressions shows a 50% increase in the number of clusters discovered by TCluster.

• The more temporal expressions are explicit, the better the precision of the document clustering.

71

Evaluations and ResultsRelevance Evaluation using AMT• Evaluate the quality of search results using TCluster in combination with

temporal snippets.• Selected 10 random informational queries for Wikipedia featured articles.• Each query was evaluated by 11 different workers, asked workers to

evaluate if the search results were relevant using the following scale (excellent = 5, I don’t know = 1):

72

Evaluations and ResultsRelevance Evaluation using AMT

• The average response was 4.04% (with an 80% agreement level) which indicates that workers found results to be good most of the time.

• Also performed the same evaluation but using the top-10 most active topics on Twitter where the average response was 4.33% (with an 80% agreement level).

73

Evaluations and ResultsAdditional User Feedback

• For both data sets (Wikipedia and Twitter), users provided useful comments.

74

Conclusions and future work (1/2)

• The TCluster algorithm provides great flexibility and allows users to explore clusters of search result documents that are organized along well-defined timelines.

• The prototypical implementation shows that the proposed technique can effectively be realized using existing tools and systems.

75

Conclusions and future work (2/2)

• 1. Future work primarily concerns investigating alternative document ranking techniques in Tcluster.

• 2. Study the weighting of relative temporal expressions.

• 3. Different sentence distance functions for determining the rank of documents in a cluster.