Determining Relevance of Imprecise Temporal Intervals for Cultural Heritage Information Retrieval

Tomi Kauppinen∗,ᵃ, Glauco Mantegariᵇ, Panu Paakkarinenᵃ, Heini Kuittinenᵃ, Eero Hyvonenᵃ, Stefania Bandiniᶜ

ᵃ Semantic Computing Research Group (SeCo), The Aalto University School of Science and Technology and University of Helsinki, Finland

ᵇ QUA SI Doctoral and Advanced Research Program on the Information Society, University of Milano-Bicocca, Italy

ᶜ Complex Systems and Artificial Intelligence Research Centre (CSAI), University of Milano-Bicocca, Italy

Abstract

Time is an essential concept in cultural heritage applications. Instances of temporal concepts such as time intervals are used for the annotation of cultural objects and also for querying datasets containing information about these objects. Hence it is important to match query and annotation intervals by examining their similarity or closeness. One of the problems is that in many cases time intervals are imprecise. For example, the boundaries of the “Pre-Roman age” and the “Roman age” are inherently imprecise and it may be difficult to distinguish them with clear-cut intervals. In this paper we apply the fuzzy set theory to model imprecise time intervals in order to determine relevance of the relationship between two time intervals. We present a method for matching query and annotation intervals based on their weighted mutual overlapping and closeness. We present 1) methods for calculating these weights to produce a combined measure and 2) results of comparing the combined measure with human evaluators as a case study. The case study takes into consideration archaeological temporal information, which is in most cases inherently fuzzy, and therefore offers a particularly complex and challenging scenario. The results show that our new combined measure that utilizes different weighted measures together in rankings, performs the best in terms of precision and recall. It should be used when ranking annotation intervals according to a given query interval in cultural heritage information retrieval. Our approach intends to be generalizable: overlapping and closeness may be calculated between any two fuzzy temporal intervals. The presented procedure of using user evaluation results as a basis for assigning weights for overlapping and closeness could potentially be used to reveal weights in other domains and purposes as well.

∗ Corresponding author. Address: Department of Media Technology, PL 5500, The Aalto University School of Science and Technology, 02150 TKK, Finland. Tel: +35804513355. Fax: +35804513356. Mobile: +358407156945.

Email addresses: [email protected] (Tomi Kauppinen), [email protected] (Glauco Mantegari), [email protected] (Panu Paakkarinen), [email protected] (Heini Kuittinen), [email protected] (Eero Hyvonen), [email protected] (Stefania Bandini)

Preprint submitted to International Journal of Human-Computer Studies, March 19, 2010

Key words: fuzzy sets, time, information retrieval, cultural heritage

1. Introduction

Time is one of the central concepts in ontologies representing the world, and hence should also be centric in annotations and queries of the Semantic Web. Time is especially important for managing historical collections, for example in visualizing them on a timeline (Hyvonen et al., 2009; Schreiber et al., 2006).

However, representing time in Semantic Web ontologies is not straightforward because the question of when a certain time was or will be is often uncertain, subjective or vague (Nagypal and Motik, 2003). For example, it may not be known when exactly a given archaeological artifact was manufactured (uncertainty), when “The Middle Ages” was according to opinions of different historians (subjectivity), or when the spring starts (vagueness, imprecision).

In addition, transitions between different phases, such as historical periods, are usually complex processes which are not identifiable by clear-cut dates, even if conventional calendric markers are mostly used in order to simplify historical sequences. All these elements are at the basis of imprecise temporal representations, especially in the cultural heritage, historical and archaeological contexts.

Nevertheless, representations of time are needed for representing and matching annotations and queries. The definition of temporal intervals and their relations is crucial in the context of archaeological chronologies. Checking whether two time intervals have something in common allows for answering queries like “find all artifacts manufactured around the middle of the 1st century B.C.”. At the same time, the often inherently fuzzy nature of historical and archaeological temporal information makes this scenario particularly complex and challenging.

In this paper we adapt the idea (Nagypal and Motik, 2003; Visser, 2004) of using fuzzy sets (Zadeh, 1965) for the representation of temporal annotations and queries. The structure of the paper is as follows. In Section 2 we present a new method for matching queries and annotations using their weighted mutual overlapping and closeness. In Section 3 we present a case study where the method has been implemented, and in Section 4 its evaluation with human test subjects. The results in Section 5 suggest that a measure that combines different single, weighted measures together performs best in terms of precision and recall, and should hence be used when ranking annotation intervals with respect to a given query interval. In Section 6 we provide a discussion about the results, and Section 7 provides discussion about the related work. Finally, Section 8 concludes the paper.

2. Representing Fuzzy Temporal Intervals

2.1. Representing and Reasoning about Temporal Overlappings

To represent and calculate overlappings between temporal intervals we use fuzzy sets (Zadeh, 1965), resulting in fuzzy temporal intervals (Nagypal and Motik, 2003; Visser, 2004). Fuzzy set theory enables modeling imprecise time ranges, such as “around 1950”, that have vague boundaries. In fuzzy set theory the grade of membership µ of an item x in a given set A is a value in the range [0,1], whereas in traditional set theory an item x either belongs to a given set A or not. In other words, in fuzzy set theory x more or less belongs to the set A.

Following Visser (2004) we define a fuzzy temporal interval T that represents imprecision of a time period. The fuzzy temporal interval T is a quadruple $\langle T_{fuzzybegin}, T_{begin}, T_{end}, T_{fuzzyend} \rangle$, where $T_{fuzzybegin}$ is used to explicate the earliest start of that period, $T_{begin}$ the latest start, $T_{end}$ the earliest end and finally $T_{fuzzyend}$ the time when the time period has ended for sure. Figure 1 shows an example of a fuzzy temporal interval of Equation (1) modeled using fuzzy sets. It is a subjective expression of the period “from around the beginning of the I century B.C. to the first half of the I century A.D.”. As one can observe, there are 10 years from $T_{fuzzybegin}$ to $T_{begin}$ and 14 years from $T_{end}$ to $T_{fuzzyend}$.

$$
\begin{aligned}
T_{fuzzybegin} &= 105\ \text{B.C.} \\
T_{begin} &= 95\ \text{B.C.} \\
T_{end} &= 43\ \text{A.D.} \\
T_{fuzzyend} &= 57\ \text{A.D.}
\end{aligned}
\tag{1}
$$

Figure 1: The period “from around the beginning of the I century B.C. to the first half of the I century A.D.” represented as a fuzzy temporal interval. (Plot of membership µ, from 0 to 1, against time: µ rises from 0 at 105 B.C. to 1 at 95 B.C. and falls from 1 at 43 A.D. to 0 at 57 A.D.)
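The quadruple of Equation (1) can be illustrated with a short sketch. It assumes linear ramps between the fuzzy and crisp boundaries (the trapezoid of Figure 1) and, as a simplification that is not part of the original text, encodes B.C. years as negative numbers while ignoring the missing year zero. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyInterval:
    """Trapezoidal fuzzy temporal interval; B.C. years are negative (assumption)."""
    fuzzy_begin: float  # earliest possible start, membership 0
    begin: float        # latest start, membership reaches 1
    end: float          # earliest end, membership starts to fall
    fuzzy_end: float    # period has ended for sure, membership 0

    def membership(self, year: float) -> float:
        """Grade of membership of a year, a value in [0, 1]."""
        if self.begin <= year <= self.end:
            return 1.0
        if self.fuzzy_begin < year < self.begin:
            return (year - self.fuzzy_begin) / (self.begin - self.fuzzy_begin)
        if self.end < year < self.fuzzy_end:
            return (self.fuzzy_end - year) / (self.fuzzy_end - self.end)
        return 0.0


# Equation (1): 105 B.C., 95 B.C., 43 A.D., 57 A.D.
period = FuzzyInterval(-105, -95, 43, 57)
print(period.membership(-100))  # halfway up the left ramp
print(period.membership(50))    # halfway down the right ramp
```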

We will also make use of Left-Right notation (LR) (Dubois and Prade, 1988) because of its suitability for arithmetic operations. The example in Figure 1 in LR-notation is given in Equation (2).

$$
\begin{aligned}
T_{LR} &= (T_{begin},\; T_{end},\; T_{begin} - T_{fuzzybegin},\; T_{fuzzyend} - T_{end})_{LR} \\
       &= (95\ \text{B.C.},\; 43\ \text{A.D.},\; 10,\; 14)_{LR}
\end{aligned}
\tag{2}
$$
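As a minimal sketch of Equation (2), the conversion from the quadruple to LR-notation is a pair of subtractions. The function name `to_lr` and the signed-year encoding (B.C. as negative numbers) are assumptions made for illustration.

```python
def to_lr(fuzzy_begin, begin, end, fuzzy_end):
    """Return (begin, end, alpha, beta): the core plus left and right spreads."""
    alpha = begin - fuzzy_begin  # width of the left fuzzy extension
    beta = fuzzy_end - end       # width of the right fuzzy extension
    return (begin, end, alpha, beta)


# Equation (1) in signed years: 105 B.C., 95 B.C., 43 A.D., 57 A.D.
t_lr = to_lr(-105, -95, 43, 57)
print(t_lr)  # (-95, 43, 10, 14)
```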

We use the intersection of two fuzzy intervals in order to calculate how much these two intervals proportionally overlap. The proportional overlap function

$$
o_t : A, Q \rightarrow p \in [0, 1] \tag{3}
$$

tells how much two fuzzy intervals A and Q overlap. It is represented in terms of temporal overlap $o_t = overlaps(A, Q) = |A \cap Q| / |A|$. Hence the overlap function answers the question “How much does a given query interval Q overlap with an annotation interval A?”.

Similarly, the overlappedBy function answers the question “How much is a given query interval Q overlapped by an annotation interval A?”. Hence we define the proportional overlappedBy function as $o_b : A, Q \rightarrow p \in [0, 1]$ and represent it as $o_b = overlappedBy(A, Q) = |A \cap Q| / |Q|$.

The universe X in our case is defined to be infinite, which means that |A|, i.e. the cardinality¹ of A, is defined by $|A| = \int_x \mu_A(x)\,dx$. Hence we get:

$$
o_t = overlaps(A, Q) = \frac{|A \cap Q|}{|A|} = \frac{\int_x \mu_{A \cap Q}(x)\,dx}{\int_x \mu_A(x)\,dx} \tag{4}
$$

Calculating $o_t$ (or $o_b$) intuitively amounts to computing and dividing the integral areas of the two membership functions in the formula.

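Figure 2 depicts a case where the fuzzy temporal interval Q = “Roman age” intersects with another fuzzy temporal interval A = “Pre-Roman age”. These were modeled subjectively² by a domain expert as follows:

$$
\begin{aligned}
A_{fuzzybegin} &= 510\ \text{B.C.} \\
A_{begin} &= 490\ \text{B.C.} \\
A_{end} &= 222\ \text{B.C.} \\
A_{fuzzyend} &= 89\ \text{B.C.}
\end{aligned}
\tag{5}
$$

$$
\begin{aligned}
Q_{fuzzybegin} &= 222\ \text{B.C.} \\
Q_{begin} &= 89\ \text{B.C.} \\
Q_{end} &= 452\ \text{A.D.} \\
Q_{fuzzyend} &= 569\ \text{A.D.}
\end{aligned}
\tag{6}
$$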

Figure 2: A fuzzy temporal interval “Pre-Roman age” intersects with another fuzzy temporal interval “Roman age”. (Plot of membership µ, from 0 to 1, against time, with ticks at 510, 500, 490, 222 and 89 B.C. and at 452 and 569 A.D.)

¹ See Zimmermann (1996), page 16, for a discussion about cardinalities of fuzzy sets.

² The criteria for the representation of the fuzzy dates represented in the figure are a personal choice of a domain expert, and are related to the specific chronology of the ancient history of the case study city.

As an example, the values of the functions overlaps and overlappedBy between the time intervals Q = “Roman age” and A = “Pre-Roman age” are calculated below.

The value for overlaps is obtained by calculating

$$
o_t = overlaps = \frac{|A \cap Q|}{|A|} = 0.0974\ldots \approx 0.1 \tag{7}
$$

Intuitively speaking, this means that “Roman age” overlaps around 10 percent of “Pre-Roman age”. Similarly, the same intersection $|A \cap Q|$ corresponds to around 5% of the other period, “Roman age” (denoted with Q), and hence

$$
o_b = overlappedBy = \frac{|A \cap Q|}{|Q|} = 0.0505\ldots \approx 0.05 \tag{8}
$$
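The two ratios can be approximated numerically. The sketch below assumes trapezoidal membership functions, the minimum as fuzzy intersection, signed years (B.C. negative, no correction for the missing year zero) and a midpoint-rule integration; all function names are hypothetical. Under these assumptions the results round to the same 0.1 and 0.05 as Equations (7) and (8), although the further digits need not coincide with the reported ones, which depend on the numerical conventions used.

```python
def membership(interval, year):
    """Trapezoidal membership; interval = (fuzzy_begin, begin, end, fuzzy_end)."""
    fb, b, e, fe = interval
    if b <= year <= e:
        return 1.0
    if fb < year < b:
        return (year - fb) / (b - fb)
    if e < year < fe:
        return (fe - year) / (fe - e)
    return 0.0


def area(mu, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of mu over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(mu(lo + (i + 0.5) * width) for i in range(steps)) * width


def overlaps(a, q):
    """|A ∩ Q| / |A|, using min as the fuzzy intersection."""
    lo, hi = min(a[0], q[0]), max(a[3], q[3])
    inter = area(lambda x: min(membership(a, x), membership(q, x)), lo, hi)
    return inter / area(lambda x: membership(a, x), lo, hi)


def overlapped_by(a, q):
    """|A ∩ Q| / |Q|: the same intersection, normalized by the query."""
    return overlaps(q, a)


pre_roman = (-510, -490, -222, -89)  # A, Equation (5)
roman = (-222, -89, 452, 569)        # Q, Equation (6)
print(round(overlaps(pre_roman, roman), 2))       # roughly 0.1
print(round(overlapped_by(pre_roman, roman), 2))  # roughly 0.05
```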

Overlap values can potentially be used as measures of A’s relevance given Q. However, a potential problem of using just the overlap to measure relevance between Q and A is that it gives the value 0 in cases where A and Q are close but still do not overlap. Because of their closeness, A and Q might still have mutual relevance. For this reason we present next how a closeness function c will be used together with the overlaps and overlappedBy values to calculate the relevance between Q and A.

The closeness function c between Q and A is a defuzzified value of the distance D. D is a fuzzy temporal interval and is calculated using the arithmetic operation of fuzzy subtraction ⊖ (see Dubois and Prade, 1988, page 50) between a fuzzy query interval Q and a fuzzy annotation interval A, defined in LR-notation as follows.

$$
D = Q \ominus A = (Q_{begin} - A_{end},\; Q_{end} - A_{begin},\; \alpha_Q + \beta_A,\; \beta_Q + \alpha_A)_{LR} \tag{9}
$$

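For the example intervals, the fuzzy extensions α and β of Q and A are:

$$
\begin{aligned}
\alpha_Q &= Q_{begin} - Q_{fuzzybegin} = 89\ \text{B.C.} - 222\ \text{B.C.} = 133 \\
\beta_Q &= Q_{fuzzyend} - Q_{end} = 569\ \text{A.D.} - 452\ \text{A.D.} = 117 \\
\alpha_A &= A_{begin} - A_{fuzzybegin} = 490\ \text{B.C.} - 510\ \text{B.C.} = 20 \\
\beta_A &= A_{fuzzyend} - A_{end} = 89\ \text{B.C.} - 222\ \text{B.C.} = 133
\end{aligned}
\tag{10}
$$

and finally we get

$$
\begin{aligned}
D = Q \ominus A &= (89\ \text{B.C.} - 222\ \text{B.C.},\; 452\ \text{A.D.} - 490\ \text{B.C.},\; 133 + 133,\; 117 + 20)_{LR} \\
                &= (133,\; 942,\; 266,\; 137)_{LR}
\end{aligned}
\tag{11}
$$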

The distance D in the above example between a fuzzy query interval Q and a fuzzy annotation interval A is hence the fuzzy temporal interval $D = (133, 942, 266, 137)_{LR}$.
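Equations (9) to (11) can be checked with a few lines of code. The sketch assumes LR tuples of the form (begin, end, α, β) and signed years with B.C. as negative numbers; the function name is hypothetical.

```python
def lr_subtract(q, a):
    """Fuzzy subtraction Q ⊖ A for LR tuples (begin, end, alpha, beta)."""
    q_begin, q_end, q_alpha, q_beta = q
    a_begin, a_end, a_alpha, a_beta = a
    return (
        q_begin - a_end,    # smallest core distance
        q_end - a_begin,    # largest core distance
        q_alpha + a_beta,   # left spreads add up crosswise
        q_beta + a_alpha,   # right spreads add up crosswise
    )


roman = (-89, 452, 133, 117)       # Q in LR-notation
pre_roman = (-490, -222, 20, 133)  # A in LR-notation
distance = lr_subtract(roman, pre_roman)
print(distance)  # (133, 942, 266, 137)
```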

To obtain the value for the closeness function c (the closeness measure) we follow the next steps.

1. Distance D is defuzzified to a crisp value $d_{df}$, e.g. by calculating the Center of Area (COA) (see Zimmermann, 1996, pages 212–214) and taking its absolute value. Defuzzification is a procedure where a fuzzy set is transformed to a crisp number. By using COA we aim to take into account also the fuzzy extensions of the temporal intervals, i.e. the parts where µ is in the range [0, 1). COA is by definition the center of the area of the whole trapezoid, i.e. in our case the whole area between $T_{fuzzybegin}$ and $T_{fuzzyend}$. By contrast, the defuzzification method Mean of Maxima (MOM) (Zimmermann, 1996) only considers the core of the fuzzy set where µ = 1, i.e. in our case the area between $T_{begin}$ and $T_{end}$. In our example, we get $d_{df} = 503.11$ for the defuzzified value of $D = (133, 942, 266, 137)_{LR}$.

2. The defuzzified distance $|d_{df}|$ is normalized to $d_n$ so that its value lies in the range [0,1]. The normalization is done by dividing $|d_{df}|$ by the maximum $d_{max}$ of all the defuzzified distances between all examined temporal interval pairs (in our case $d_{max} \approx 1111$). After this step, the closer $d_n$ is to 0, the closer the two intervals are considered to be to each other.

3. The closeness measure c is calculated as $c = 1 - d_n$. As a result, c gets values close to 1 (i.e. better values³) when $d_n$ is close to 0 (i.e. when the intervals are close to each other).

The full equation for calculating closeness c from the defuzzified distance $d_{df}$ is thus

$$
c = closeness = 1 - \frac{|d_{df}|}{|d_{max}|}
$$

In our example, we get the value $c = 1 - \frac{503.11}{1111.11} \approx 0.55$.
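The defuzzification and normalization steps can be sketched as follows. The centroid formula used here is the standard closed form for the x-coordinate of the centroid of a trapezoid with corners a ≤ b ≤ c ≤ d; treating it as the COA of the LR interval is an assumption of this sketch, as are the function names. The formula is undefined for a crisp point (a = b = c = d), which the sketch does not handle.

```python
def center_of_area(lr):
    """Centroid of a trapezoidal fuzzy number given as (m1, m2, alpha, beta)."""
    m1, m2, alpha, beta = lr
    a, b, c, d = m1 - alpha, m1, m2, m2 + beta  # trapezoid corners
    numerator = (c * c + c * d + d * d) - (a * a + a * b + b * b)
    denominator = 3 * ((c + d) - (a + b))
    return numerator / denominator


def closeness(d_df, d_max):
    """c = 1 - |d_df| / |d_max|; values near 1 mean the intervals are close."""
    return 1 - abs(d_df) / abs(d_max)


d_df = center_of_area((133, 942, 266, 137))  # distance D of Equation (11)
print(round(d_df, 2))                        # 503.11
print(round(closeness(d_df, 1111.11), 2))    # 0.55
```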

³ Values for overlaps and overlappedBy are similarly considered better when close to 1.

We then combine the three measures (closeness c, overlaps $o_t$ and overlappedBy $o_b$) into one relevance measure r that takes values in the range [0,1] and provides a way to explicate their weighting:

$$
r = \frac{w_c \cdot c + w_{ot} \cdot o_t + w_{ob} \cdot o_b}{w_c + w_{ot} + w_{ob}} \tag{12}
$$

where $w_c$, $w_{ot}$ and $w_{ob}$ refer to the weights used for the measures. In effect, the combined measure r is a weighted average of the three individual measures. The calculation of the weights is described in detail in Section 5.2.
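Equation (12) translates directly into code. The sketch below uses a hypothetical function name; the example call plugs in the rounded example values from this section together with the weights reported in Section 5.2 (0.13 for closeness, 0.73 for overlaps, 0 for overlappedBy), purely as an illustration.

```python
def relevance(c, ot, ob, wc, wot, wob):
    """Weighted average of closeness, overlaps and overlappedBy (Equation 12)."""
    total = wc + wot + wob
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return (wc * c + wot * ot + wob * ob) / total


# Rounded example values: c = 0.55, ot = 0.1, ob = 0.05.
r = relevance(0.55, 0.1, 0.05, wc=0.13, wot=0.73, wob=0.0)
print(round(r, 3))
```

Because the weights are normalized by their sum, r stays in [0,1] whenever the three measures do and the weights are non-negative.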

3. The Case Study

3.1. Case Study: Ancient Milan

In the evaluation of the method we considered a dataset from the “Ancient Milan” project (Bandini et al., 2009). “Ancient Milan” aims at improving access to information concerning the archaeological heritage of Milan (Italy) through the utilization of Semantic Web and Web 2.0 techniques and tools. The data comes from heterogeneous data sources, which are integrated by means of the CIDOC CRM (Crofts et al., 2009) ontology.

Temporal information is provided mostly in the form of chronologies that are related to events that happened in ancient times (such as the production of artifacts), or in modern or contemporary times (such as the phases of the research processes carried out by archaeologists and cultural heritage professionals). The case study takes into consideration the first kind of chronologies, those of ancient times.

Here temporal intervals are defined by:

• references to “absolute chronology”⁴, in terms of centuries and/or parts of centuries: e.g. the production of an artifact is attributed to sometime in the time span ranging from the end of the I century B.C. to the middle of the II century A.D.;

• references to historical periods: e.g. an artifact is attributed to the “Roman age”.

⁴ A general distinction exists in Archaeology between absolute and relative chronologies. Absolute chronologies are based on a temporal scale of measurement, usually the calendar; therefore, events and periods are temporally qualified by measurement units, such as years, expressing their location on the calendar and their duration. On the contrary, relative chronologies are based on qualitative temporal relationships between events and periods, such as precedence, contemporaneity, succession, etc.

3.1.1. About References to Absolute Chronology

References to absolute chronology are provided as highly consistent labels assuming one of the following forms:

• a single reference to a century: e.g. “I century B.C.”

• a single reference to a part of a century: e.g. “middle I century B.C.”

• a combination of these, e.g.:

– “I century B.C. - IV century A.D.”

– “end I century B.C. - I century A.D.”

– “end III century A.D. - beginning IV century A.D.”

– “third quarter I century B.C. - VI century A.D.”

At a more general level, each label in the dataset is the result of the combination of a few atomic elements, expressing temporal information with different degrees of imprecision, from parts of centuries to the era according to the Gregorian calendar. The possible combinations and values these elements can assume are expressed in the following set of specifications in the Backus-Naur Form (BNF):

<label> ::= <reference> | <reference> “-” <reference>

<reference> ::= <century> “century” <era>
              | <part-of-century> <century> “century” <era>

<part-of-century> ::= “first half” | “second half”
                    | “beginning” | “middle” | “end”
                    | “first quarter” | “second quarter”
                    | “third quarter” | “last quarter”

<century> ::= “I” | “II” | “III” | “IV” | “V” | “VI”

<era> ::= “BC” | “AD”
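A label that follows this grammar can be recognized with a single regular expression per reference. The following is a minimal sketch and not the parser used in the case study: the function names are hypothetical, and accepting both “BC”/“AD” (as in the grammar) and “B.C.”/“A.D.” (as in the example labels) is an assumption.

```python
import re

PARTS = ("first half", "second half", "beginning", "middle", "end",
         "first quarter", "second quarter", "third quarter", "last quarter")
CENTURIES = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}

REFERENCE = re.compile(
    r"^(?:(?P<part>" + "|".join(PARTS) + r") )?"   # optional part of century
    r"(?P<century>VI|IV|V|III|II|I) century "      # Roman numeral I to VI
    r"(?P<era>B\.?C\.?|A\.?D\.?)$"                 # era, with or without dots
)


def parse_reference(text):
    """Return (part, century, era); part is None for a whole century."""
    match = REFERENCE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid reference: {text!r}")
    era = "BC" if match["era"].startswith("B") else "AD"
    return (match["part"], CENTURIES[match["century"]], era)


def parse_label(label):
    """Split a label into its one or two references and parse each of them."""
    references = label.split(" - ")
    if len(references) > 2:
        raise ValueError(f"too many references in label: {label!r}")
    return [parse_reference(ref) for ref in references]


print(parse_label("end I century B.C. - I century A.D."))
# [('end', 1, 'BC'), (None, 1, 'AD')]
```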

Figure 3 provides a schema that exemplifies different levels of imprecision denoting labels (X) with reference to the 1st century B.C. and the 1st century A.D. (Y).

Figure 3: A depiction of the possible references to absolute chronology showing different levels of imprecision. (Schema: a time axis running from past to now across the 2nd and 1st centuries B.C. and the 1st and 2nd centuries A.D.; for each of the 1st centuries, labels are stacked by level of imprecision, from higher to lower: the century; first half and second half; beginning, middle and end; first, second, third and last quarter.)

3.1.2. About References to Historical Periods

Historical periods in the dataset of the case study are defined by imprecise temporal intervals. The link from the intervals to actual calendric dates can assume different forms: from the use of conventional dates (related to e.g. major historical events) to more generic indications (e.g. a whole century). The definition of the temporal boundaries of these intervals is made even more difficult by the fact that transitions from one historical period to another are generally continuous phenomena, and the duration of these transitions may vary a lot from case to case.

For example, the shift in the control of Milan from the Celts to the Romans happened quite gradually during more than a century, a period when elements belonging to both cultural spheres were present. Therefore, in order to define a flexible model for temporal intervals like the Celtic and the Roman periods, a model based on fuzzy sets was adopted.

3.2. Determining Fuzzy Temporal Intervals from Label Information

The annotations in our case study dataset provided a solid base for determining fuzzy temporal intervals from the time labels using a combination of regular expressions and string parsing techniques.

In order to explicate the four temporal parameters needed for the representation of fuzzy temporal intervals as described in Section 2.1, the label was split into atomic elements. Figure 4 depicts examples of the guidelines for creating fuzzy temporal intervals. These examples include e.g. “the beginning of a century”, “the second half of a century”, or “the last quarter of a century”. For each type of atomic element its fuzziness was defined: e.g. the boundaries of a century were considered more fuzzy than boundaries of a smaller period such as the first quarter of a century.

Figure 4: Guidelines for creating fuzzy temporal intervals. (Four plots of membership µ, from 0 to 1, over the years 100 to 200, one per level of granularity. The axis ticks appear to give the following fuzzy boundaries: century, 90–110 and 190–210; first half and second half, 93–107, 143–157 and 193–207; beginning, middle and end, 95–105, 128–138, 161–171 and 195–205; first to last quarter, 97–103, 122–128, 147–153, 172–178 and 197–203.)

To create these atomic elements, each label was first checked to determine whether it consisted of a single interval or of two intervals (one representing the fuzzy begin interval and the other the fuzzy end interval). In the case of a single interval, the consistently formed label was processed in reverse order, i.e. from right to left. This order was chosen because the further right in the label string an atomic element is, the coarser its granularity. For example, in the case of “first half II century A.D.” the temporal element “A.D.” determines the Gregorian era, whereas the second element from the right, “II century”, identifies a century within the defined era, and the leftmost part, “first half”, specifies a part of that century. As the elements are processed from right to left, the granularity gets finer. Finer granularity in our proposal aims to lead to more precise representations of the fuzzy boundaries.

If the label was formed of two intervals, both intervals were handled separately and were later combined. As a result the combined interval consisted of the fuzzified begin of the first interval and the fuzzified end of the second interval.

The resulting fuzzy temporal intervals were saved as RDF triples⁵ where each of the four temporal properties (like $T_{fuzzybegin}$) describes the temporal instance.
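The construction of a fuzzy interval from parsed label elements can be sketched as below. The spans and margins in `PART_SPANS` are assumptions read off the axis ticks of Figure 4 (a boundary at year x with margin m is fuzzified to the range x − m to x + m); the exact guidelines of the case study may differ. The function names, the signed-year encoding (B.C. negative) and the tuple format (part, century, era) are hypothetical as well.

```python
# Assumed readings of Figure 4: (start offset, end offset, margin) in years,
# with offsets measured from the earliest year of the century.
PART_SPANS = {
    None: (0, 100, 10),            # whole century
    "first half": (0, 50, 7),
    "second half": (50, 100, 7),
    "beginning": (0, 33, 5),
    "middle": (33, 66, 5),
    "end": (66, 100, 5),
    "first quarter": (0, 25, 3),
    "second quarter": (25, 50, 3),
    "third quarter": (50, 75, 3),
    "last quarter": (75, 100, 3),
}


def reference_to_interval(part, century, era):
    """Return (fuzzy_begin, begin, end, fuzzy_end) for one parsed reference."""
    start, stop, margin = PART_SPANS[part]
    origin = (century - 1) * 100 if era == "AD" else -century * 100
    lo, hi = origin + start, origin + stop
    return (lo - margin, lo + margin, hi - margin, hi + margin)


def label_to_interval(references):
    """Fuzzified begin of the first reference, fuzzified end of the last one."""
    first = reference_to_interval(*references[0])
    last = reference_to_interval(*references[-1])
    return (first[0], first[1], last[2], last[3])


# "beginning I century B.C. - first half I century A.D."
print(label_to_interval([("beginning", 1, "BC"), ("first half", 1, "AD")]))
# (-105, -95, 43, 57)
```

Under these assumed margins the example reproduces the quadruple of Equation (1): 105 B.C., 95 B.C., 43 A.D. and 57 A.D.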

4. The Evaluation

4.1. Evaluation Setting: Participants, Materials and Methods

In order to measure the correlation between different measures and human opinions we used the following evaluation setting. In total, twelve human evaluators were given the task of rating each of the annotation intervals with respect to each of the query intervals.

The query periods represent relevant historical phases in the ancient history of Milan, while the annotation periods refer to the temporal intervals when artifacts were produced according to archaeologists. The latter is generally called the “chronological attribution” of archaeological artifacts; in addition, the chronological attribution of structures and monuments has been integrated where available. Therefore, evaluators were asked to assess the relevancy of the chronological attributions of artifacts with respect to a given historical period; in more intuitive terms, this corresponds to evaluating the attribution of an artifact to a given historical period.

Eight of the evaluators were domain experts, with backgrounds in the fields of history, archaeology and museology; four evaluators were considered average users. The mixed group of evaluators was chosen for two reasons. On the one hand, expert evaluation offers the possibility of analyzing the results provided by the system by comparison to the results expected by professionals, and of fine-tuning the retrieval mechanisms accordingly; on the other hand, evaluations by average users provide the possibility to understand how non-professionals perceive the interaction with the system and the relevance of its retrieval capabilities.

Twelve query intervals have been taken into consideration:

• Six intervals belong to the periodization proposed by Caporusso et al. (2007): the origins, the city of Insubres, from the Insubres to the Romans, between the end of the Roman Republic and the first centuries of the Roman Empire, the age of Maximianus, the Christian city. This periodization offers the advantage of being specific to the city, and, at the same time, easily understandable with little background knowledge.

• Six intervals have been added by the domain expert, for their relevance in the context of the “Milano Antica” project: the Pre-Roman age, the late Republican period, the Roman age, Milan capital of the Roman Empire, the late Roman age, the Late Antique.

⁵ http://www.w3.org/RDF/

The selected query periods mostly refer to the Roman age, which represents the biggest and most significant phase in the ancient history of Milan. They show heterogeneous durations and degrees of temporal uncertainty; therefore, they provide an interesting and varied scenario for the evaluation of the system.

Evaluators rated annotation intervals with respect to query intervals with star ratings ranging from one to ten. Evaluators were instructed that all query/annotation pairs with no explicit rating would be treated as having zero stars.

4.2. An Interface for Enabling the Evaluation Setting

In order to efficiently evaluate the suitability of the calculated relevance measures, we created a Simile Timeline⁶ representation of the historical and chronological periods of the case study data, see Figure 5. The interface is divided into two bands. The upper band shows the query periods, in the form of grey bars with light grey boundaries representing the fuzzy begin and fuzzy end. Users select one query period by clicking on its bar, and the lower band reconfigures itself in order to display the annotation periods. When users select an annotation period, the system displays the artifacts (on the right side of the interface) whose chronological attribution coincides with the selected annotation period. The selected query and annotation periods become highlighted; on the upper-right side, users can then attribute a star rating assessing the relevancy of the annotation period with respect to the query period. The evaluation happens in a highly interactive and visual way. The average time employed for evaluating all the possible combinations of the query and annotation periods (nearly 800 ratings) was four hours, i.e. 18 seconds per rating.

⁶ http://www.simile-widgets.org/timeline/

Figure 5: Interface to evaluate relevance measures. (Screenshot with three callouts: clicking a query period expands the timeline to reveal relevant time periods; clicking a period on the expansion timeline will list artifacts that are annotated to that period; the user can evaluate the suitability of the expansion by giving the query-annotation pair a star rating.)

5. Results

5.1. Ratings from Evaluators

Twelve evaluators (E1 . . . E12) evaluated all 12 query intervals with respect to all 66 unique annotation intervals as described in the previous section. In other words, each query/annotation pair was rated by each of the same twelve evaluators. Recall that all query/annotation pairs with no explicit rating were treated as having zero stars. The user agreement about relevance levels between queries and annotations was analyzed using weighted kappa (Cohen, 1968). Weighted kappa coefficients are commonly used to quantify inter-rater reliability when ordinal categories are used in rating. Weighted kappa assigns less weight to agreement if ordinal categories are further apart. For example, a disagreement of eight stars versus nine stars in rating would still be credited with partial agreement, whereas a disagreement of ten versus zero stars would be considered as no agreement at all. A weighted kappa coefficient of 1.0 means maximum possible agreement and a weighted kappa coefficient of 0.0 means agreement no better than expected by chance.

In our case the average of the weighted kappas calculated pairwise was 0.83, ranging from 0.64 up to 0.94. A closer examination of the pairwise kappas revealed that among the twelve evaluators one evaluator gave notably different relevance assessments than the others. In other words, her agreement with the other evaluators was lower than the other evaluators’ agreement with each other. However, because the eleven other evaluators agreed more closely about the relevance assessments, the average of the kappas was 0.83. Values between 0.81 and 1.0 are considered a very good agreement (Altman, 1991). Because of this level of user agreement we decided to use the average of the user opinions as the gold standard for weighting the relevance measures, and for comparing the final results with. These will be discussed in the next sections.
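For readers unfamiliar with the statistic, the following sketch computes a weighted kappa for two raters over the eleven star categories (0 to 10). It assumes linear disagreement weights |i − j|; the weighting scheme actually used in the study is not stated in the text, and the function name is hypothetical. The statistic is undefined when both raters always use one and the same category, a case the sketch does not handle.

```python
def weighted_kappa(ratings_a, ratings_b, categories=11):
    """Weighted kappa for two raters, linear disagreement weights (assumption)."""
    n = len(ratings_a)
    observed = [[0.0] * categories for _ in range(categories)]
    for i, j in zip(ratings_a, ratings_b):
        observed[i][j] += 1 / n  # joint proportion of the rating pair (i, j)
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(categories))
           for j in range(categories)]
    pairs = [(i, j) for i in range(categories) for j in range(categories)]
    # Observed and chance-expected disagreement, weighted by distance in stars.
    disagreement = sum(abs(i - j) * observed[i][j] for i, j in pairs)
    expected = sum(abs(i - j) * row[i] * col[j] for i, j in pairs)
    return 1 - disagreement / expected


# Illustrative ratings only, not data from the study.
print(weighted_kappa([0, 5, 10, 3], [0, 5, 10, 3]))  # 1.0, perfect agreement
print(weighted_kappa([0, 10], [10, 0]))              # -1.0, systematic disagreement
```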

5.2. Weights for a Combined Relevance Measure

In order to calculate how each of the relevance measures (calculated using the methods described in Section 2.1) should be weighted, we applied multiple linear regression to obtain regression coefficients for each relevance measure, using the averaged evaluator opinion as response observations. More precisely, we ran linear regression to obtain weights for different combinations (e.g. all three measures, or just overlaps and closeness, or just overlaps and overlappedBy, and so on). Precision and recall analysis (see the next section for an outline of it) showed that a measure that combines the weighted overlaps and closeness measures (without overlappedBy) performs best. Linear regression gave the values $w_c = 0.13$ (closeness) and $w_{ot} = 0.73$ (overlaps) to be further used to weight the measures as described in Section 2.1. We also tried a measure that combines all three weighted measures, including overlappedBy. However, it gave about the same results as the combination of only the two weighted measures overlaps and closeness. Thus we applied Occam’s razor and did not include overlappedBy as a part of the combined measure, i.e. we use $w_{ob} = 0$ in this setting.
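The regression step for the two-measure combination can be sketched with ordinary least squares. The sketch assumes a model without intercept term and ratings scaled to [0,1]; neither detail is stated in the text. The function name is hypothetical and the input values are synthetic, constructed only to show that known weights are recovered.

```python
def fit_two_weights(closeness, overlaps, ratings):
    """Least-squares weights (w_c, w_ot) for rating ≈ w_c * c + w_ot * ot."""
    scc = sum(c * c for c in closeness)
    soo = sum(o * o for o in overlaps)
    sco = sum(c * o for c, o in zip(closeness, overlaps))
    scy = sum(c * y for c, y in zip(closeness, ratings))
    soy = sum(o * y for o, y in zip(overlaps, ratings))
    det = scc * soo - sco * sco  # determinant of the 2x2 normal equations
    if det == 0:
        raise ValueError("the two measures are collinear")
    w_c = (scy * soo - soy * sco) / det
    w_ot = (soy * scc - scy * sco) / det
    return w_c, w_ot


# Synthetic values for illustration only, generated from known weights.
c_values = [0.2, 0.5, 0.9, 0.4]
ot_values = [0.0, 0.3, 0.8, 0.1]
y_values = [0.13 * c + 0.73 * o for c, o in zip(c_values, ot_values)]
w_c, w_ot = fit_two_weights(c_values, ot_values, y_values)
print(round(w_c, 2), round(w_ot, 2))  # 0.13 0.73
```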

Figure 6: Scatter plot of relevance ranking based on combination of weighted measures. (Axes: rating by evaluators, x, against ranking by combined measure, y, both from 0 to 1.)

Figure 7: Scatter plot of relevance ranking based on intersection confidence. (Axes: rating by evaluators, x, against ranking by intersection confidence, y, both from 0 to 1.)

The correlation between the average of the evaluator ratings (x-axis) and the combination of these two weighted measures (y-axis) is visualized as a scatter plot graph in Figure 6. This reveals that the correlation between the combined measure and the evaluator ratings is strong. Figure 7 shows a correlation graph between the evaluator ratings and the intersection confidence (Nagypal and Motik, 2003) for a comparison. The intersection confidence evaluates the degree (in range [0,1]) to which the temporal relationship called “intersects” exists between two fuzzy temporal intervals. More precisely, the intersection confidence expresses the confidence that $A \cap Q \neq \emptyset$, i.e. the confidence that the intersection of A and Q is not empty. Intersection confidence was calculated using Equation (13).

$$
\text{Intersection confidence} = \sup \min(A, Q) \tag{13}
$$

where min(A, Q) is the fuzzy intersection of A and Q, and sup min(A, Q) (supremum) is the maximum confidence of membership of any time point in the intersection.
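Equation (13) can be approximated by sampling the two membership functions on a grid. The sketch assumes trapezoidal membership functions and signed years (B.C. negative); the function names and the grid resolution are hypothetical choices, and the result is an approximation of the supremum rather than an exact value.

```python
def membership(interval, year):
    """Trapezoidal membership; interval = (fuzzy_begin, begin, end, fuzzy_end)."""
    fb, b, e, fe = interval
    if b <= year <= e:
        return 1.0
    if fb < year < b:
        return (year - fb) / (b - fb)
    if e < year < fe:
        return (fe - year) / (fe - e)
    return 0.0


def intersection_confidence(a, q, steps=10_000):
    """Approximate sup min(A, Q) by evaluating both memberships on a grid."""
    lo, hi = min(a[0], q[0]), max(a[3], q[3])
    width = (hi - lo) / steps
    return max(
        min(membership(a, lo + i * width), membership(q, lo + i * width))
        for i in range(steps + 1)
    )


pre_roman = (-510, -490, -222, -89)
roman = (-222, -89, 452, 569)
print(round(intersection_confidence(pre_roman, roman), 2))  # 0.5
```

For the “Pre-Roman age” and “Roman age” example the two ramps cross at membership 0.5, so under these assumptions the intersection confidence is about 0.5, while the proportional overlaps of Section 2.1 are only about 0.1 and 0.05.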

From Figures 6 and 7 it can be clearly seen that the combination of weighted measures of Figure 6 has a more linear correlation with the users’ opinions than that of the intersection confidence of Figure 7.

5.3. Precision and Recall Analyzed

Precision figures at the standard 11 recall levels were calculated in order to evaluate quantitatively the performance of different measures. First of all, we compared the performance of the individual measures overlaps, overlappedBy and closeness, and the combined measure (combination of overlaps and closeness). We also calculated the performance of the intersection confidence (Nagypal and Motik, 2003). As a baseline we used a binary overlap measure between the crisp intervals: if the crisp areas of the intervals overlap then the value is 1 (relevant) and if not then the value is 0 (non-relevant).

The averaged user ratings were considered as the gold standard. Ratings in the range 1–10 stars were considered as relevant, and annotations that got 0 stars (after rounding the averaged user opinions to the closest star rating) were considered as non-relevant for a given query. Figure 8 depicts the resulting precision-recall curve. The combined measure clearly performed best, the closeness measure was second, and overlaps and overlappedBy competed for the third place. However, overlaps is mostly just above overlappedBy. The intersection confidence measure and the baseline got the lowest values.

Figure 8: Average recall versus precision figures for each measure used to rank the results. (Plot: Recall on the x-axis, Precision on the y-axis, both 0–100; curves for the combined measure, overlaps, closeness, intersection confidence, baseline and overlappedBy.)
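The binarization of the averaged ratings and the precision figures at the 11 standard recall levels can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the toy ratings are hypothetical, ties at .5 are rounded up, and interpolated precision (the maximum precision at any recall at or above the level) is used, which is the common convention but is not confirmed by the text.

```python
import math


def is_relevant(avg_stars):
    """Round the averaged rating to the closest star; 1-10 stars count as relevant.

    Rounding .5 upwards is an assumption; the tie rule is not specified.
    """
    return math.floor(avg_stars + 0.5) >= 1


def eleven_point_precision(ranked_relevance):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    ranked_relevance: booleans for one query, best-ranked annotation first.
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return [0.0] * 11
    points = []  # (recall, precision) at each rank where a relevant item appears
    hits = 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


# Hypothetical averaged star ratings, listed in the order a measure ranked them.
ranked_avg_stars = [7.2, 0.3, 4.0, 0.0, 1.6]
relevance = [is_relevant(s) for s in ranked_avg_stars]
curve = eleven_point_precision(relevance)
print(curve)
```

Averaging such curves over all queries would give one precision-recall curve per measure, comparable in form to Figure 8.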

The standard calculation of precision and recall evaluates the performance of methods based on binary relevant vs. non-relevant values. However, the evaluation in our case also contained multiple-grade relevance assessments in the range of 0 to 10 stars. For this reason we also report the results of using generalized precision and recall (Kekalainen and Jarvelin, 2002), which are intended to reward methods that retrieve highly relevant documents. The generalized precision and recall applied to our case are defined as follows. Let R be the set of n annotation periods retrieved from the knowledge base K = {a1, a2, . . . , aN} in response to a query Q, R ⊆ K. Let the annotation periods ai in the knowledge base have relevance scores r(ai) in the range [0,1]. Generalized precision is then computed using Equation (14) and generalized recall using Equation (15).

\[ \text{Generalized precision} = \frac{\sum_{a \in R} r(a)}{n} \tag{14} \]

\[ \text{Generalized recall} = \frac{\sum_{a \in R} r(a)}{\sum_{a \in K} r(a)} \tag{15} \]
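Equations (14) and (15) can be written out directly. The sketch below is illustrative only: the function names, the annotation identifiers and the relevance scores are hypothetical, and mapping star ratings to [0,1] by dividing by 10 is an assumption rather than a detail stated here.

```python
def generalized_precision(retrieved, relevance):
    """Equation (14): sum of r(a) over the retrieved set R, divided by n = |R|."""
    n = len(retrieved)
    if n == 0:
        return 0.0
    return sum(relevance[a] for a in retrieved) / n


def generalized_recall(retrieved, relevance):
    """Equation (15): sum of r(a) over R, divided by the sum of r(a) over K."""
    total = sum(relevance.values())
    if total == 0:
        return 0.0
    return sum(relevance[a] for a in retrieved) / total


# Hypothetical knowledge base K with relevance scores r(a) in [0, 1],
# e.g. averaged star ratings divided by 10 (an assumption).
relevance = {"a1": 1.0, "a2": 0.5, "a3": 0.0, "a4": 0.5}
retrieved = ["a1", "a2"]  # the retrieved set R for one query

gp = generalized_precision(retrieved, relevance)
gr = generalized_recall(retrieved, relevance)
print(gp, gr)
```

Unlike the binary case, retrieving a highly rated annotation raises both values more than retrieving a weakly rated one, which is the behaviour the generalized measures are meant to reward.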

Figure 9 depicts the resulting generalized precision-recall curve. Here, the combined measure had the highest values, overlaps was second, the intersection confidence third, and closeness fourth. The baseline and overlappedBy had the lowest values. Note that generalized precision-recall curves often fall considerably lower than traditional precision-recall curves. Finally, the differences between the rankings given by each type of measure were found to be statistically significant using the Friedman test (p=0). The Friedman test is a non-parametric statistical test intended to determine whether differences between rankings are significant (Friedman, 1937). It makes no assumptions about how the data are distributed (e.g. whether or not they are normally distributed), and hence it was chosen in our case. A post-hoc analysis was run as a pairwise Friedman test between all combinations of rankings in order to verify the statistical significance of the differences. The analysis showed that the pairwise differences between rankings were also statistically significant (p=0).
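For readers unfamiliar with the Friedman test, the following sketch computes its test statistic from scratch. It is a simplified illustration, not the analysis that was run: the data layout (one row per query-annotation pair, one column per measure) and the toy scores are assumptions, no tie correction is applied, and the p-value (obtained from a chi-square distribution with k − 1 degrees of freedom) is left out.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_statistic(rows):
    """Friedman chi-square statistic (without tie correction).

    rows: one list per block, each holding one score per compared method.
    """
    n = len(rows)
    k = len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical scores of three measures on four query-annotation pairs.
scores = [
    [0.90, 0.50, 0.10],
    [0.80, 0.60, 0.20],
    [0.70, 0.40, 0.30],
    [0.95, 0.55, 0.05],
]
statistic = friedman_statistic(scores)
print(statistic)
```

In this toy data every row orders the three measures identically, so the statistic reaches its maximum n(k − 1) for n = 4 and k = 3.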

6. Discussion

The results of comparing the rankings given by the method and the ratings given by the evaluators seem promising, as there is an apparent correlation between the rankings and the ratings. The precision and recall curve shows that the new combined measure performed best. Among the single measures, closeness performed well in the traditional precision and recall analysis. This might be because the other measures quantify the level of overlap (or confidence) in different ways and simply do not register cases in which intervals are quite close but do not overlap.


Figure 9: Generalized precision-recall curve depicting the results. (Plot: Recall on the x-axis, 0–100; Precision on the y-axis, 20–100; curves for the combined measure, overlaps, closeness, intersection confidence, baseline and overlappedBy.)

The generalized precision and recall curve shows that the combined measure is again the best. However, this time the overlaps measure is second. So it seems that, of the single measures, the overlaps measure is very important: the more the annotation interval was immersed within the query interval, the better the rating the evaluators gave to the query-annotation pair. All in all, the best results were obtained by combining the weighted overlaps and closeness measures.

However, it should be noted that the scope of our case study was narrow: the case study concerned a restricted spatial and cultural area. Moreover, the case study took into consideration only the chronological information which the references to intervals carried, without addressing the characteristics and value of cultural attributions.

Some periods are also open to different degrees of subjective temporal characterization, reflecting e.g. different scientific opinions. For example, the “Late Roman age”, while evidently ending together with the end of the Roman age (which in turn is defined by a fuzzy end), can be characterized differently with respect to its beginning, on the basis of e.g. different scholarly opinions on the meaning of “late”.

From the modeling point of view the presented approach is generalizable to many different domains (in addition to the cultural heritage domain) and scales (e.g. “the beginning of the work week” vs. “the glacial age”). In essence, fuzzy temporal intervals are intended for modeling the imprecision of temporal expressions, independent of scale or theme. The same applies to the proposed relationships: overlaps, overlappedBy, and closeness may be determined between any two fuzzy temporal intervals in the same manner as was shown for query and annotation periods. The weighted combination of overlaps, overlappedBy, and closeness is also generalizable to different domains. However, the weights might vary between domains (e.g. personal planning vs. history of natural sciences). The presented procedure for assigning weights through a user evaluation and linear regression resulted in a combined measure that provided the best results in terms of precision and recall in our cultural heritage case study. Hence we expect that a similar procedure could be used to reveal weights for temporal relationships for information retrieval tasks in other domains of interest. However, this should of course be confirmed by a future study. For example, one research question could be whether evaluators consider the usefulness of overlap and closeness of time periods in information retrieval independently of domain and scale.

7. Related Work

The general properties of time ontologies have been explored e.g. in Vila (1994). There have been discussions e.g. on whether the basic primitive is the interval (period) or the point. Other properties of time are characterized by whether time is discrete or dense, bounded or unbounded, and what type of precedence the time ontology allows: linear, branching, parallel or circular (cyclic). Allen (1983) has presented a set of 13 primitive, mutually exclusive interval relations that cover every possible simple qualitative relationship that may exist between a pair of crisp intervals.

Visser (2004) has identified the following types of boundaries for periods: 1) exact boundaries, 2) persistent boundaries, 3) unknown boundaries and 4) fuzzy boundaries. Our interest was in boundary types 1) and 4): the presented method can be used to find the annotation intervals that are relevant to a query interval.

Several projects dealing with cultural heritage and the web present methods for the representation of temporal information and tools for retrieving it, see e.g. Hyvonen et al. (2009) and Schreiber et al. (2006). Johnson (2008) reviews the approach adopted in the TimeMap-ECAI project and provides an updated set of proposals and advancements towards the integration of Web 2.0 techniques and tools. In particular, he introduces some considerations about new modalities for developing interactive timelines concerning historical events.

Supporting chronological reasoning in archaeology is the focus of Doerr et al. (2004), who introduce a formal framework that is the core of the temporal representation in the CIDOC CRM (Crofts et al., 2009). This framework relies on the idea of “events as meetings” and discusses relevant issues such as temporal indeterminacy; however, being a theoretical and high-level work, it does not go into the details of the actual use of the theory in real application scenarios (which can be found in the documentation of projects using the CIDOC CRM). There is also earlier work (Accary–Barbier and Calabretto, 2008) that describes a specific case in the field of digital libraries, dealing with the possibility of comparing different temporal models of knowledge in archaeological documentation that may emerge from fuzzy or even contrasting chronologies.

Schockaert and Cock (2008) examined problems related to reasoning about qualitative and metric temporal relations between fuzzy time periods. Schockaert et al. (2008) discuss how to measure e.g. the degree to which “the beginning of a fuzzy temporal interval A” is long before “the beginning of a fuzzy temporal interval B”. Their approach uses fuzzy orderings between time points (e.g. between the beginning of A and the beginning of B) to define temporal relations. The difference is that in our approach we aimed at measuring the closeness of the whole of A to the whole of B using fuzzy subtraction. Moreover, the generic goal in Schockaert et al. (2008) was to provide a fuzzification of Allen’s temporal interval relations. In our approach the goal was to determine the relevance between imprecise temporal intervals, and to provide a combined measure for ranking results according to a query. While Schockaert et al. (2008) discussed the usefulness of their approach within the context of question answering systems, our context was cultural heritage information retrieval. Another approach, provided by Ohlbach (2004), can also be used for expressing imprecise temporal relations. However, it does not provide any handling of the closeness of two fuzzy temporal intervals.

Nagypal and Motik (2003) introduced a mechanism to evaluate whether e.g. a crisp temporal relationship called intersects holds between two fuzzy temporal intervals. The result is a value expressing the level of this confidence. We also evaluated its usability for relevance calculation. However, as we showed, the combined measure (overlaps and closeness together) was better in terms of precision and recall. Visser (2004) proposed calculating the overlap between two fuzzy temporal intervals, but did not provide any evaluation results confirming the usability of the overlap relation. Moreover, closeness was neither considered nor tested in that study.

8. Conclusions

The proposal and experiments presented in this paper provide an approach for matching time intervals. The imprecision of temporal intervals was modeled using fuzzy sets, and a method was developed in order to obtain a better match between human and machine interpretations of periods in information retrieval. For this we proposed to calculate the overlap and closeness between annotation and query intervals, and showed how these measures can be combined. The models and the method were tested in a real environment with data from the archaeology domain. Twelve evaluators rated each of the annotation intervals with respect to each query interval. These results were used as a basis for analyzing, by means of linear regression, the correlation between the ranking given by the method and the ratings given by the human evaluators. The results of the linear regression were further used as weights to fine-tune the method. Both the standard precision and recall test and the generalized precision and recall test showed that the combination of the overlaps and closeness measures (the combined measure) performed better than any of the single measures alone. The method could be used e.g. for suggesting items from approximately the same period as the reference query period and also for ranking the relevance of more distant periods of time.

Acknowledgments

Discussions with the members of the Semantic Computing Research Group (SeCo) have greatly influenced the ideas and the research setting presented here. We gratefully acknowledge Stina Westman and Mari Laine-Hernandez for valuable comments. Moreover, the insightful comments from three anonymous referees provided useful suggestions to improve the content of the paper. Our research was done in the EU project SmartMuseum⁷ supported within the IST priority of the Seventh Framework Programme for Research and Technological Development and in the National Semantic Web Ontology Project in Finland⁸ (FinnONTO) 2003–2007, 2008–2010 funded mainly by the Finnish Funding Agency for Technology and Innovation (Tekes). We gratefully acknowledge the support by Regione Lombardia, the SIRBeC team and the SIRBeC partners in providing the dataset for this work.

⁷ http://smartmuseum.eu/

References

Accary–Barbier, T., Calabretto, S., October 2008. Building and Using Temporal Knowledge in Archaeological Documentation. Journal of Intelligent Information Systems 31 (2), pp. 147–159.

Allen, J. F., 1983. Maintaining knowledge about temporal intervals. Communications of the ACM 26 (11), pp. 832–843.

Altman, D. G., 1991. Practical Statistics for Medical Research. Chapman & Hall.

Bandini, S., Locatelli, M., Mantegari, G., Simone, C., Vizzari, G., 4–6 November 2009. An integrated approach to the valorization of cultural heritage. In: Proceeding of the 2009 AICA Conference. Rome.

Caporusso, D., Donati, M. T., Masseroli, S., Tibiletti, T., 2007. Immagini di Mediolanum. Civiche Raccolte Archeologiche e Numismatiche di Milano, Milano.

Cohen, J., 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin.

Crofts, N., Doerr, M., Gill, T., Stead, S., Stiff, M., March 2009. Definition of the CIDOC Conceptual Reference Model Version 5.0.1. Tech. rep., ICOM/CIDOC CRM Special Interest Group.

Doerr, M., Plexousakis, D., Kopaka, K., Bekiari, C., 2004. Supporting Chronological Reasoning in Archaeology. In: Proceedings of Computer Applications and Quantitative Methods in Archaeology CAA’04. Prato, Italy.

⁸ http://www.seco.tkk.fi/projects/finnonto/


Dubois, D., Prade, H., 1988. Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum Press, New York.

Friedman, M., 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32 (200), pp. 675–701.

Hyvonen, E., Makela, E., Kauppinen, T., Alm, O., Kurki, J., Ruotsalo, T., Seppala, K., Takala, J., Puputti, K., Kuittinen, H., Viljanen, K., Tuominen, J., Palonen, T., Frosterus, M., Sinkkila, R., Paakkarinen, P., Laitio, J., Nyberg, K., 2009. CultureSampo – Finnish culture on the Semantic Web 2.0. Thematic perspectives for the end-user. In: Proceedings, Museums and the Web 2009, Indianapolis, USA.

Johnson, I., 2008. Mapping the fourth dimension: a ten year retrospective. Archeologia e Calcolatori 19, pp. 31–43.

Kekalainen, J., Jarvelin, K., 2002. Using graded relevance assessments in IR evaluation. Journal of the American Society for Information Science and Technology 53 (13), pp. 1120–1129.

Nagypal, G., Motik, B., 2003. A fuzzy model for representing uncertain, subjective, and vague temporal knowledge in ontologies. In: CoopIS/DOA/ODBASE. pp. 906–923.

Ohlbach, H. J., 2004. Relations between fuzzy time intervals. In: Proceedings of the 11th International Symposium on Temporal Representation and Reasoning (TIME 2004). Tatihou Island, Normandie, France, pp. 44–51.

Schockaert, S., Cock, M. D., 2008. Temporal reasoning about fuzzy intervals. Artificial Intelligence 172 (8-9), pp. 1158–1193.

Schockaert, S., Cock, M. D., Kerre, E. E., 2008. Fuzzifying Allen’s temporal interval relations. IEEE Transactions on Fuzzy Systems 16 (2), pp. 517–533.

Schreiber, G., Amin, A., van Assem, M., de Boer, V., Hardman, L., Hildebrand, M., Hollink, L., Huang, Z., van Kersen, J., de Niet, M., Omelayenko, B., van Ossenbruggen, J., Siebes, R., Taekema, J., Wielemaker, J., Wielinga, B., 2006. MultimediaN E-Culture demonstrator. In: Proceedings of the Fifth International Semantic Web Conference (ISWC’06). No. 4273 in Lecture Notes in Computer Science. USA, pp. 951–958.

Vila, L., 1994. A survey on temporal reasoning in artificial intelligence. AI Communications 7 (1), pp. 4–28.

Visser, U., 2004. Intelligent information integration for the Semantic Web. Springer-Verlag, Berlin Heidelberg, New York.

Zadeh, L. A., 1965. Fuzzy sets. Information and Control 8 (3), pp. 338–353.

Zimmermann, H.-J., 1996. Fuzzy Set Theory and its Applications, 3rd Edition. Kluwer, Dordrecht.
