Anchor Modeling27 Feb Paul

Anchor Modeling

O. Regardt

Teracom, Kaknastornet, 102 52 Stockholm

L. Ronnback

re!sight, Kungsholms Strand 179, 112 48 Stockholm

M. Bergholtz, P. Johannesson, P. Wohed

DSV, Stockholms Universitet, Kista

Abstract

Maintaining and evolving data warehouses is a complex, error prone, and time consuming activity. The mainreason for this state of affairs is that the environment of a data warehouse is in constant change, while thewarehouse itself needs to provide a stable and consistent interface to information spanning extended periodsof time. In this paper, we propose a modeling technique, called Anchor Modeling, that offers non-destructiveextensibility mechanisms, thereby enabling robust and flexible management of changes. A key benefit ofAnchor Modeling is that changes in a data warehouse environment only require extensions, not modifications,to the data warehouse. This ensures that existing data warehouse applications will remain unaffected by theevolution of the data warehouse, i.e. existing views and functions will not have to be modified as a resultof changes in the warehouse model. We provide a formal and technology independent definition of anchormodels and show how anchor models can be realised as relational databases. We also investigate performanceissues and report on a number of lab experiments, which show that under certain conditions anchor databasesperform substantially better than databases constructed using traditional modeling techniques.

Keywords: anchor modeling, database modeling, normalization, 6NF, data warehousing, agile development,temporal databases, table elimination

1. Introduction

Maintaining and evolving data warehouses is a complex, error prone, and time consuming activity. Themain reason for this state of affairs is that the environment of a data warehouse is in constant change, whilethe warehouse itself needs to provide a stable and consistent interface to information spanning extendedperiods of time. Sources that deliver data to the warehouse change continuously over time and sometimesdramatically. The information retrieval needs, such as analytical and reporting needs, also change. In orderto address these challenges, data models of warehouses have to be modular, flexible, and track changes inthe handled information [16]. However, many existing warehouses suffer from having a model that does notfulfill these requirements. One third of implemented warehouses have at some point, usually within the firstfour years, changed their architecture, and less than a third quotes their warehouses as being a success [24].

In this paper, we propose a modeling technique, called Anchor Modeling, that offers non-destructiveextensibility mechanisms, thereby enabling robust and flexible representations of changes. A positiveconsequence of this is that all previous versions of a schema will be available at any given time as parts of

Email addresses: [email protected] (O. Regardt), [email protected] (L. Ronnback), [email protected](M. Bergholtz), [email protected] (P. Johannesson), [email protected] (P. Wohed)

Preprint submitted to DKE February 27, 2010

pajo

Överstruket

pajo

Ersättningstext

indicate

pajo

Överstruket

pajo

Ersättningstext

Using these mechanisms,

the complete schema [2]. Using Anchor Modeling results in data models in which only small changes areneeded when large changes occur in the environment, like adding or switching a source system or analyticaltool. This reduced need for redesign extends the longevity of a data warehouse, reduces the implementationtime, and simplifies maintenance [23].

A key benefit of Anchor Modeling is that changes in a data warehouse environment only require extensions,not modifications, to the data model. Applications thereby remain unaffected by the evolution of the datamodel and thus do not have to be modified [21]. Furthermore, evolution through extensions rather thanmodifications results in modularity, making it possible to decompose data models into small, stable andmanageable components. This modularity is also of great value in agile development where short iterationsare required. It is possible to construct an initial model with a small number of agreed upon business terms,which later can be seamlessly extended into a final model. This way of working can improve on the state ofthe art in data warehouse design, where close to half of current data warehouse projects are either behindschedule or over budget [24, 4], partly due to having a too large initial project scope.

An anchor model that is realised as a relational database schema will have a high degree of normalization,provide reuse of data and offer the ability to store historical data. The high decomposition of the databasestems from the fact that attributes become separate tables in the schema. This differs significantly from thecurrently popular approach of using de-normalized multidimensional models [15] for data warehouses. Eventhough the origin of Anchor Modeling were the requirements found in such environments, it is a genericmodeling approach, also suitable for other types of systems.

The paper is organized as follows. Section 2 defines Anchor Modeling in a technology independent wayand proposes a naming convention, Section 3 introduces a running example, Section 5 shows how an anchormodel can be realised as a relational database, Section 4 suggests a number of Anchor Modeling guidelines,and physical implementation is described in Section 6. Section 7 investigates performance issues of anchordatabases and introduces a number of conditions that influence performance. In Section 8 advantagesof Anchor Modeling are discussed, Section 9 contrasts Anchor Modeling to alternative approaches in theliterature, and Section 10 concludes the paper and gives directions for further research.

2. Basic Notions of Anchor Modeling

In this section, we introduce the basic notions of Anchor Modeling by first explaining them informallyand then giving formal definitions using the relational model. The basic building blocks in Anchor Modelingare anchors, knots, attributes, and ties. A meta model for Anchor Modeling in UML can be seen in Figure 1.

Definition 2.1 (Identities). Let I be an infinite set of symbols, which are used as identities.

Definition 2.2 (Data type). Let D be a data type. The domain of D is a set of data values.

Definition 2.3 (Time type). Let T be a time type. The domain of T is a set of time values.

2.1. AnchorsAn anchor represents a set of entities, such as a set of actors or events. See Figure 2.

Definition 2.4 (Anchor). An anchor A is a string. An extension of an anchor is a subset of I.

An example of an anchor is AC Actor, see Figure 2. An example of an extension of AC Actor is {#4711,#4712, #4713}, a set of identities.

2.2. KnotsA knot is used to represent a fixed set of entities that do not change over time. While anchors are used

to represent arbitrary entities, knots are used to manage properties that are shared by many instances ofsome anchor. A typical example of a knot is GEN Gender, see Figure 3, which includes two instances, ‘Male’and ‘Female’. This property, gender, is shared by many instances of the AC Actor anchor, thus using a knotminimizes redundancy. Rather than repeating the strings a single bit per instance is sufficient.

2

pajo

Infogad text

mainly

pajo

Överstruket

pajo

Ersättningstext

suggests

pajo

Anteckning

Detta avsnitt behöver fixas avseende numreringen

pajo

Överstruket

pajo

Överstruket

pajo

Ersättningstext

is given

pajo

Infogad text

, expressed in UML,

2..*

STATIC TIEKNOTTEDSTATIC TIE

ATTRIBUTE TIE

ANCHOR

1..*

1

domainANCHOR ROLE1 consists_of

1

0..*

1

0..*

type

KNOTTED TIEKNOT ROLE

1..* 11..* 1

consists_ofKNOTTED ATTRIBUTE

KNOT

1

0..*

1

0..*

type

1

0..*

1

0..*

range

HISTORIZED TIEKNOTTEDHISTORIZED TIE

KNOTTED HISTORIZEDATTRIBUTE

KNOTTED STATICATTRIBUTE

TIME TYPE

1

0..*

1

0..*

time range

1

0..*

1

0..*

1

0..*

1

0..* time range

time range

STATICATTRIBUTE

HISTORIZEDATTRIBUTE

10..*

10..*

time range

DATA TYPE

10..*

range

1

0..*

1

0..*

range

0..*

1

0..*

range

Figure 1: A meta model for Anchor Modeling in UML.

AC_Actor

Figure 2: An anchor is shown as a filled square.

GEN_Gender

Figure 3: A knot is shown as an outlined square.

Definition 2.5 (Knot). A knot K is a string. A knot has a domain, which is I. A knot has a range, whichis a data type D. An extension of a knot K with range D is a bijective relation over I× D.

An example of a knot is GEN Gender, see Figure 3. Its domain is I and its range is the data type string.An example of an extension of this knot is the set of identity-value pairs {〈#0, ‘Male’〉, 〈#1, ‘Female’〉}.

2.3. AttributesAttributes are used to represent properties of anchors. We distinguish between four kinds of attributes:

static, historized, knotted static, and knotted historized, see Figure 4. A static attribute is used to representarbitrary properties of entities (anchors), where it is not needed to keep the history of changes to the attributevalues. A historized attribute is used when changes of the attribute values need to be recorded. A value isconsidered valid until it is replaced by one with a later time. Valid time [6] is hence represented as an openinterval with an explicitly specified beginning. The interval is implicitly closed when an instance with a latervalid time is added for the same anchor instance. A knotted static attribute is used to represent relationshipsbetween anchors and knots, i.e. to relate an anchor to properties that can take on only a fixed, typicallysmall, number of values. Finally a knotted historized attribute is used when the relation to a value in theknot is not stable but may change over time.

Definition 2.6 (Static Attribute). A static attribute BS is a string. A static attribute BS has an anchor Afor domain and a data type D for range. An extension of a static attribute BS is a relation over I× D.

An example of a static attribute is ST LOC Stage Location, see Figure 4, with domain ST Stage andthe data type date as range. An example of an extension of this static attribute is the set of identity-valuepairs {〈#55, ‘Maiden Lane’〉, 〈#56, ‘Drury Lane’〉}.

Definition 2.7 (Historized Attribute). A historized attribute BH is a string. A historized attribute BH

has an anchor A for domain, a data type D for range, and a time type T as time range. An extension of ahistorized attribute BH is a relation over I× D× T.

3

pajo

Överstruket

pajo

Ersättningstext

attribute value

pajo

Anteckning

Säger man "value in the knot" eller "knot value"?

pajo

Överstruket

pajo

Ersättningstext

another

ST_LOC_Stage_Location

(a)

ST_NAM_Stage_Name

(b)

GEN_Gender

AC_GEN_Actor_Gender

(c)

PLV_ProfessionalLevel

AC_PLV_Actor_ProfessionalLevel

(d)

Figure 4: A static attribute (a) is shown as an outlined circle and a historized attribute (b) is shown as acircle with a double outline. A knotted static attribute (c) is shown as an outlined circle and a knottedhistorized attribute (d) is shown as a circle with a double outline, both of which must reference a knot.

An example of a historized attribute is ST NAM Stage Name, see Figure 4, with the domain ST Stage,the data type string as range, and the time type date as time range. An example of an extension of ahistorized attribute is the set of identity-value-time point triples {〈#55, ‘The Globe Theatre’, 1599-01-01〉,〈#55, ‘Shakespeare’s Globe’, 1997-01-01〉, 〈#56, ‘Cockpit’, 1609-01-01〉}.

Definition 2.8 (Knotted Static Attribute). A knotted static attribute BKS is a string. A knotted staticattribute BKS has an anchor A for domain and a knot K for range. An extension of a knotted staticattribute BKS is a relation over I× I.

An example of a knotted static attribute is AC GEN Actor Gender, see Figure 4, with domain AC Actorand range GEN Gender. An example of an extension of a knotted static attribute is the set of identity-identitypairs {〈#4711, #0〉, 〈#4712, #1〉}.

Definition 2.9 (Knotted Historized Attribute). A knotted historized attribute BKH is a string. A knottedhistorized attribute BKH has an anchor A for domain, a knot K for range, and a time type T for time range.An extension of a knotted historized attribute BKH is a relation over I× I× T.

An example of a knotted historized attribute is AC PLV Actor ProfessionalLevel, see Figure 4, withdomain AC Actor, range PLV ProfessionalLevel, and the time type date for time range. An example ofan extension of a knotted historized attribute is the set of identity-identity-time point triples {〈#4711, #4,1999-04-21〉, 〈#4711, #5, 2003-08-21〉, 〈#4712, #3, 1999-04-21〉}.

2.4. TiesA tie represents associations between two or more anchor entities and optional knot entities. Similarly to

attributes, ties come in four variants, static, historized, knotted static, and knotted historized. See Figure 5.As the same entity may appear more than once in a tie a occurrences need to be qualified using the conceptof roles.

Definition 2.10 (Anchor Role). An anchor role is a string. Every anchor role has a type, which is ananchor.

An example of an anchor role is atLocation, with type ST Stage.

Definition 2.11 (Knot Role). A knot role is a string. Every knot role has a type, which is a knot.

An example of a knot role is having, with type PAT ParentalType.

Definition 2.12 (Static Tie). A static tie TS is a set of anchor roles. An instance tS of a static tieTS = {R1, . . . , Rn} is a set of pairs 〈Ri, vi〉, i = 1, . . . , n, where vi ∈ I and n ≥ 2. An extension of a statictie TS is a set of instances of TS.

4

pajo

Infogad text

,

pajo

Överstruket

pajo

Infogad text

at least two

PE_wasHeld_ST_atLocation

(a)

ST_atLocation_PR_isPlaying

(b)

PAT_ParentalType

AC_parent_AC_child_PAT_having

(c)

RAT_Rating

AC_part_PR_in_RAT_got

(d)

Figure 5: A static tie (a) is shown as a filled diamond and a historized tie (b) is shown as a filled diamondwith an extra outline. A knotted static tie (c) is shown as a filled diamond and a knotted historized tie (d) isshown as a filled diamond with an extra outline, both of which must reference at least one knot.

An example of a static tie is PE wasHeld ST atLocation = {wasHeld, atLocation}, see Figure 5, wherethe type of wasHeld is PE Performance and the type of atLocation is ST Stage. An example of an extensionof a static tie is a set of sets of role-identity pairs {{〈wasHeld, #911〉, 〈atLocation, #55〉}, {〈wasHeld, #912〉,〈atLocation, #55〉}, {〈wasHeld, #913〉, 〈atLocation, #56〉}}.

Definition 2.13 (Historized Tie). A historized tie TH is a set of anchor roles and a time type T. Aninstance tH of a historized tie TH = {R1, . . . , Rn, T} is a set of pairs 〈Ri, vi〉, i = 1, . . . , n and a time pointp, where vi ∈ I, p ∈ T, and n ≥ 2. An extension of a historized tie TH is a set of instances of TH .

An example of a historized tie is ST atLocation PR isPlaying = {atLocation, isPlaying, date}, seeFigure 5, where the type of atLocation is ST Stage and the type of isPlaying is PR Program. An exampleof an extension of a historized tie is a set of sets of role-identity pairs and a time point {{〈atLocation,#55〉, 〈isPlaying, #17〉, 2003-12-13}, {〈atLocation, #55〉, 〈isPlaying, #23〉, 2004-04-01}, {〈atLocation, #56〉,〈isPlaying, #17〉, 2003-12-31}}.

Definition 2.14 (Knotted Static Tie). A knotted static tie TKS is a set of anchor and knot roles. Aninstance tKS of a static tie TKS = {R1, . . . , Rn, S1, . . . , Sm} is a set of pairs 〈Ri, vi〉, i = 1, . . . , n and 〈Sj , wj〉,j = 1, . . . ,m, where vi ∈ I, wj ∈ I, n ≥ 2, and m ≥ 1. An extension of a knotted static tie TKS is a set ofinstances of TKS.

An example of a knotted static tie is AC parent AC child PAT having = {parent, child, having}, seeFigure 5, where the type of parent and child is AC Actor and the type of having is PAT ParentalType. Anexample of an extension of a knotted static tie is a set of sets of role-identity pairs {{〈parent, #4711〉, 〈child,#4713〉, 〈having, #1〉}, {〈parent, #4712〉, 〈child, #4713〉, 〈having, #0〉}}.

Definition 2.15 (Knotted Historized Tie). A knotted historized tie TKH is a set of anchor and knot rolesand a time type T. An instance tKH of a historized tie TKH = {R1, . . . , Rn, S1, . . . , Sm, T} is a set of pairs〈Ri, vi〉, i = 1, . . . , n, 〈Sj , wj〉, j = 1, . . . ,m, and a time point p, where vi ∈ I, wj ∈ I, p ∈ T, n ≥ 2, andm ≥ 1. An extension of a knotted historized tie TKH is a set of instances of TKH .

An example of a knotted historized tie is AC part PR in RAT got = {part, in, got, date}, see Figure 5,where the type of part is AC Actor, the type of in is PR Program and the type of got is RAT Rating. Anexample of an extension of a knotted historized tie is a set of sets of role-identity pairs and a time point{{〈part, #4711〉, 〈in, #17〉, 〈got, #8〉, 2005-03-12}, {〈part, #4711〉, 〈in, #17〉, 〈got, #10〉, 2008-08-29},{〈part, #4712〉, 〈in, #17〉, 〈got, #10〉, 2006-10-17}}.

Definition 2.16 (Identifier). Let T be a (static, historized, knotted, or knotted historized) tie. An identifierfor T is a subset of T. Further, if T is a historized or knotted historized tie, where T is the time type in T,every identifier for T must contain T.

5

pajo

Infogad text

at least two

pajo

Infogad text

at least two

pajo

Infogad text

at least two

pajo

Infogad text

more

An identifier is similar to a key in relational databases, i.e. it should be a minimal set of roles thatuniquely identifies instances of a tie.

Definition 2.17 (Anchor Schema). An anchor schema is a 13-tuple 〈A,K,BS ,BH ,BKS ,BKH ,RA,RK , TS ,TH , TKS , TKH , I〉, where A is a set of anchors, K is a set of knots, BS is a set of static attributes, BH is a setof historized attributes, BKS is a set of knotted static attributes, BKH is a set of knotted historized attributes,RA is a set of anchor roles, RK is a set of knot roles, TS is a set of static ties, TH is a set of historized ties,TKS is a set of knotted static ties, TKH is a set of knotted historized ties, and I is a set of identifiers. Thefollowing must hold for an anchor schema:

(i) for every attribute B ∈ BS ∪ BH ∪ BKS ∪ BKH , domain(B) ∈ A(ii) for every attribute B ∈ BKS ∪ BKH , range(B) ∈ K

(iii) for every anchor role RA ∈ RA, type(RA) ∈ A(iv) for every knot role RK ∈ RK , type(RK) ∈ K(v) for every tie T in TS ∪ TH ∪ TKS ∪ TKH , T ⊆ RA ∪RK

Definition 2.18 (Anchor Model). Let A = 〈A,K,BS ,BH ,BKS ,BKH ,RA,RK , TS , TH , TKS , TKH , I〉 be ananchor schema. An anchor model for A is a quadruple 〈extA, extK , extB , extT 〉, where extA is a functionfrom anchors to extensions of anchors, extK is a function from knots to extensions of knots, extB is afunction from attributes to extensions of attributes, and extT is a function from ties to extensions of ties.Let proji be the i th projection map. An anchor model must fulfil the following conditions:

(i) for every attribute B ∈ BS ∪ BH ∪ BKS ∪ BKH , proj1(extB(B)) ⊆ extA(domain(B))(ii) for every attribute B ∈ BKS ∪ BKH , proj2(extB(B)) ⊆ proj1(extK(range(B)))(iii) for every tie T and anchor role RA ∈ T , {v | 〈RA, v〉 ∈ extT (T )} ⊆ extA(type(RA))(iv) for every tie T and knot role RK ∈ T , {v | 〈RK , v〉 ∈ extT (T )} ⊆ proj1(extK(type(RK)))(v) every extension of a tie T ∈ TS ∪ TH ∪ TKS ∪ TKH shall respect the identifiers in I, where an extension

extT (T ) of a tie respects an identifier I ∈ I if and only if it holds that whenever two instances t1 andt2 of extT (T ) agree on all roles in I then t1 = t2.

2.5. Naming ConventionEntities in an anchor model can be named in any manner, but having a naming convention outlining how

names should be constructed increases understandability and simplifies working with a model. If the sameconvention is used consistently over many installations the induced familiarity will speed up recognition byseveral orders of magnitude. A good naming convention should fulfil a number of criteria, some of whichmay conflict with others. Names should be short, but long enough be intelligible. They should be unique,but have parts in common for related entities. They should be unambiguous, without too many rules havingto be introduced. Further, in order to be used in many different representations, the less “exotic” charactersthey contain the better.

Definition 2.19 (Naming functions). The following are naming functions, further defined in Figure 6:

(i) NI : A ∪K → string is a total injection(ii) NIA : BS ∪ BH ∪ BKS ∪ BKH ∪ TS ∪ TH ∪ TKS ∪ TKH → string is a total injection

(iii) NIK : BKS ∪ BKH ∪ TKS ∪ TKH → string is a total injection(iv) ND : BS ∪ BH ∪ K → string is a total injection(v) NT : BH ∪ BKH ∪ TH ∪ TKH → string is a total injection

The convention suggested in the table in Figure 6 use only regular letters, numbers, and the underscorecharacter. This ensures that the same names can be used in a variety of representations, such as in a relationaldatabase, in XML, or in a programming language. Names consist of two parts, a mnemonic followed by adescriptor. From the mnemonic, both the type of entity and relations to other entities can be deduced, and italso serves as a unique identifier for an entity, which can be used as a shorter alias for the full name. While

6

pajo

Anteckning

Borde man inte säga något om time range också?

pajo

Överstruket

pajo

Infogad text

significantly

pajo

Infogad text

to

pajo

Infogad text

more

pajo

Överstruket

pajo

Infogad text

s

Parts

NT(T ) ::= f : MT , dl, vfNIK(T ) ::= f : MK, dl, id, dl, f : rK

NIA(T ) ::= f : MA, dl, id, dl, f : rA

NT(B) ::= f : MB, dl, vfND(B) ::= f : BNIK(B) ::= f : MK, dl, idNIA(B) ::= f : MA, dl, idND(K) ::= f : KNI(K) ::= f : MK, dl, idNI(A) ::= f : MA, dl, id

Entities

T ::= f : MTB ::= f : MB, dl, f : DBK ::= f : MK, dl, f : DKA ::= f : MA, dl, f : DA

Descriptors

DB ::= f : DA, dl, uwordsDK ::= uwordsDA ::= uwords

Mnemonics

MT ::= f : MA, dl, f : rA, (dl, f : MA, dl, f : rA)+, (dl, f : MK, dl, f : rK)*MB ::= f : MA, dl, (upper | number){3}MK ::= (upper | number){3}MA ::= (upper | number){2}

Roles

rK ::= lwordsrA ::= lwords

Primitives

vf ::= ‘ValidFrom’id ::= ‘ID’dl ::= ‘ ’lwords ::= (lower)+, (uword)?uwords ::= (upper, (lower | number)*)+number ::= [0-9]upper ::= [A-Z]lower ::= [a-z]

Figure 6: A suggested naming convention described using Extended Backus-Naur Form and a contextsensitive function f that fetches already defined strings while respecting the connections between entities inan anchor model.

the mnemonic helps with the overview it does little for understanding what a model actually represents. Thedescriptors, on the other hand, should be intelligible enough to give this understanding.

The naming convention will yield unique but not unambiguous names for entities, i. e. the order ofstrings in a tie may change. Further, the data type parts are named identically to the entity in which theyare contained and foreign keys keep the same name as the key they are referencing.

7

pajo

Infogad text

more

3. Running Example

The scenario in this example is based on a business arranging stage performances. It is an extensionof the example discussed in [14]. An anchor schema for this example is shown in Figure 7. The businessentities are the stages on which performances are held, programs defining the actual performance, i e what isperformed on the stage, and actors who carry out the performances. Stages have two attributes, a name anda location. The name may change over time, but not the location. Programs just have name, which describesthe act. If a program changes name it is considered to be a different act. Actors have a name, but since thismay be a pseudonym it could change over time. They also have a gender, which we assume does not change.The business also keeps track of an actor’s professional level, which changes over time as the actor gainsexperience. Actors get a rating for each program they have performed. If they perform consistently betteror worse, the rating changes to reflect this. Stages are playing exactly one program at any given time andthey change program every now and then. To simplify casting the parenthood between actors is needed sothat it is possible to determine if an actor is the mother or father of another actor. A performance is anevent bound in space and time that involves a stage on which a certain program is performed by one or moreactors. It is uniquely identified by when and where it was held. The managers of the business also need toknow how large the audience was at every performance and the total revenue gained from it.

Four anchors PE Performance, ST Stage, PR Program and AC Actor capture the present entities.Attributes such as PR NAM Program Name and PE DAT Performance Date capture properties of thoseentities. Some attributes, e.g. AC NAM Actor Name and ST NAM Stage Name are historized, to capturethe fact that they are subject to changes. The fact that an actor has a gender, which is one of two values, iscaptured through the knot GEN Gender and a knotted attribute called AC GEN Actor Gender. Similarly,since the business also keeps tracks of the professional level of actors, the knot PLV ProfessionalLevel andthe knotted attribute AC PLV Actor ProfessionalLevel are introduced, which in addition is historized tocapture attribute value changes.

Furthermore, the relationships between the anchors are captured through ties. In the example thefollowing ties are introduced: PE in AC wasCast, PE wasHeld ST atLocation, PE at PR wasPlayed, ST at-Location PR isPlaying, AC part PR in RAT got, and AC parent AC child PAT having to capture the ex-isting binary relationships.

The historized tie ST atLocation PR isPlaying is used to capture the fact that stages change programs.The tie AC part PR in RAT got is knotted to show that actors get ratings on the programs they are playingand historized for capturing the changes in these ratings.

A small black circle on a tie edge in Figure 7 indicates that the connected anchor is part of the primarykey for the tie, while a white circle indicates that it is not.

4. Guidelines for Designing Anchor Models

Anchor Modeling has been used in a number of industrial projects in the insurance business and retailbusiness, and for the most part in data warehouse environments. Based on this experience and a number ofwell-known modeling problems such as reification of relationships [11], power types [8], update-anomaliesin databases due to functional dependencies between too many attributes modeled within one entity inconceptual schemas [30], the following guidelines have been formulated. These provide a deeper understandingof the approach as well as support the building of anchor models in a way that gives desired properties.

4.1. Modeling Core Entities and TransactionsCore entities in the domain of interest should be represented as anchors in an anchor model. A well-known

problem in conceptual modeling is to determine whether a transaction should be modeled as a relationship oras an entity [11]. In anchor models, the question is formulated as determining whether a transaction shouldbe modeled as an anchor or as a tie. When a transaction has some property, like PE DAT Performance Datein Figure 7, it should be modeled as an anchor. It can be modeled as a tie only if the transaction has noproperties.

8

pajo

Infogad text

a

pajo

Infogad text

,

pajo

Överstruket

pajo

Ersättningstext

The

pajo

Anteckning

De två sista styckena skall ej vara egna stycken utan slås ihop med stycket innan

pajo

Överstruket

pajo

Ersättningstext

results in

AC_Actor

GEN_Gender

AC_GEN_Actor_Gender

PLV_ProfessionalLevel

PAT_ParentalType

AC_parent_AC_child_PAT_having

RAT_Rating AC_part_PR_in_RAT_got

PR_Program

ST_atLocation_PR_isPlaying

ST_Stage

AC_PLV_Actor_ProfessionalLevel

AC_NAM_Actor_Name

ST_NAM_Stage_Name ST_LOC_Stage_Location

PE_Performance

PE_wasHeld_ST_atLocation

PE_at_PR_wasPlayed

PE_in_AC_wasCast

Anchor Knot (Knotted) Static Attribute

(Knotted) Historized Attribute

(Knotted) Static Tie

(Knotted) Historized Tie

Figure 7: An example anchor model illustrating different modeling concepts.

Guideline 1: Use anchors for modeling core entities and transactions. A transaction can only be modeledas a tie if it has no properties.

4.2. Modeling Attribute ValuesHistorized attributes are used when versioning of attribute values are of importance. A data warehouse,

for instance, is not solely built to integrate data but also to keep a history of changes that have taken place.In anchor models, historized attributes take care of versioning by coupling versioning/history information toan attribute value.Guideline 2a: Use a historized attribute if versioning of attribute values are of importance, otherwise use astatic attribute.

A knot represents a type with a fixed set of instances that do not change over time. Conceptually aknot can be thought of as a an anchor with exactly one static attribute combined into a single entity. Inmany respects, knots are similar to the concept of power types [8], which are types with a fixed and small

9

set of instances representing categories of allowed values. In Figure 4 the anchor AC Actor gets its genderattribute via a knotted static attribute, AC GEN Actor Gender, rather than storing the actual gender value(i.e. the string ‘Male’ or ‘Female’) of an actor directly in a static attribute. The advantage of using knots isreuse, as attributes can be specified through references to a knot identifier instead of a value. The latter isundesirable because of the redundancy it introduces, i.e. long attribute values have to be repeated resultingin increased storage requirements and update anomalies.Guideline 2b: Use a knotted attribute if attribute values represent categories or can take on only a fixedsmall set of values, otherwise use a regular attribute.

Guidelines 2a and 2b may be combined, i.e. when both attribute values represent categories and theversioning of these categories are of importance.Guideline 2c: Use a knotted historized attribute if attribute values represent categories or a fixed small setof values and the versioning of these are of importance.

4.3. Modeling RelationshipsA static tie is used for relationships that do not change over time. For example, the actors who took part

in a certain performance will never change. Typically, anchors modeling transactions or events statically relateto other anchors, i.e. a transaction or event happens at a particular time and does not change afterwards. Ahistorized tie is used for relationships that change over time. For example, the program played at a specificstage will change over time. At any point in time, exactly one relationship will be valid.Guideline 3a: Use a historized tie if a relationship may change over time, otherwise use a static tie.

A knotted tie is used for relationships where the instances fall within certain categories. For example, ifwe have two anchors, AC Actor and PR Program, a relationship between the two may be categorized as‘Good’, ‘Bad’ or ‘Mediocre’ indicating how well the actor performed the program. A knot RAT Rating isused to model such categories.Guideline 3b: Use a knotted tie if the instances of a relationship belong to certain categories, otherwise usea regular tie.

Guidelines 3a and 3b can also be combined in the case when the category a relationship belongs to maychange over time.Guideline 3c: Use a knotted historized tie if the instances of a relationship belong to certain categories andthe relationship may change over time.

4.4. Modeling Changing StatesTo be able to manage states of instances in the extensions of anchors, attributes, and ties in an anchor

model, knots have to be used that model the different states. For example, the example in Section 3 could beextended to handle the situations when an actor resigns, when an actor stops having a professional level dueto inactivity, or when a stage is not playing any program at all. The proper way to handle such situations inan anchor model is to introduce a knot encoding values of the type ‘valid’ or ‘not valid’. Since historizedattributes and ties only model the valid time of a relationship they cannot capture the fact that an attributevalue or relationship is no longer valid.Guideline 4a: Use a knotted historized attribute if an entity has changing states.

For example the knot ACT Active with the extension {〈#0, ‘Resigned’〉, 〈#1, ‘Active’〉} and a knottedhistorized attribute AC ACT Actor Active can be used to determine whether an actor is still active or not.Guideline 4b: Use an additional knotted historized attribute if an attribute has changing states, otherwiseuse a single attribute.

For example the knot PLE ProfessionalLevelExpired with the extension {〈#0, ‘Expired’〉, 〈#1, ‘Valid’〉}and a knotted historized attribute AC PLE Actor ProfessionalLevelExpired can be used to determine whetheran actor’s professional level has expired or not.

10

pajo

Infogad text

static

pajo

Överstruket

pajo

Överstruket

pajo

Ersättningstext

Static ties are

pajo

Överstruket

pajo

Ersättningstext

Knotted ties are

pajo

Överstruket

pajo

Ersättningstext

static

pajo

Infogad text

s,

pajo

Överstruket

pajo

Anteckning

Vad betyder det att en entitet har föränderliga tillstånd?

pajo

Anteckning

Vad betyder det att ett attribut har föränderliga tillstånd?

Guideline 4c: Use a knotted historized tie if a relationship has changing states, otherwise use a (knotted)static tie.

For example the knot PST PlayingStatus with the extension {〈#0, ‘Started’〉, 〈#1, ‘Stopped’〉} and aknotted historized tie ST atLocation PR isPlaying PST currentStatus can be used to determine the durationswith which programs are playing at the stages.

4.5. Modeling Large RelationshipsA lot of the advantages of Anchor Modeling such as no partial attributes [? ] (which translate to no

null-values in a relational database), no update anomalies in relational databases based on Anchor Modelingetc. relies on a high degree of decomposition, where attributes are modelled as entities of their own andrelationships are made as small as possible.

For any tie having three or more roles, it may in some cases be necessary to model it as several smaller ties.Three of the ties in the example PE in AC wasCast, PE wasHeld ST atLocation, and PE at PR wasPlayed,see Section 3, could have been modeled as a single tie PE in AC wasCast ST atLocation PR wasPlayed.However, this is not possible if the instances may be only partially complete, i.e. lacking information forsome of the roles. By breaking this relationship into three ties, the information about which actors partookin a certain performance can be added independently and at different times from where it was held and whatwas played.Guideline 5a: Use several ties, if partially complete instances may result from a large relationship.

Another tie AC part PR in RAT got modeling the rating that an actor has got for playing a part in a cer-tain program could be extended to having two different kinds of ratings: artistic performance and technical per-formance, both of which could be modeled as references to one and the same knot RAT Rating with the exten-sion {〈#1, ‘Bad’〉, 〈#2, ‘Mediocre’〉〈#2, ‘Good’〉}. The new tie AC part PR in RAT artistic RAT technicalhaving the extension {{〈part, #4711〉, 〈in, #17〉, 〈artistic, #2〉, 〈technical, #2〉, 2005-03-12}, {〈part, #4711〉,〈in, #17〉, 〈artistic, #3〉, 〈technical, #2〉, 2008-08-29}}}, suffers from the modeling counterpart of updateanomalies [30], e.g. we will not know to which of the two ratings (artistic or technical) the historizationdate refers. The situation occurs in historized ties when more than one role is not in the identifier for thetie. If this situation occurs it is better to model the relationship as several ties for which those that arestill historized have exactly one role outside the identifier. In the example given this would correspond tobreaking up the tie into two ties AC part PR in RAT artistic and AC part PR in RAT technical, for whichthere can be no doubt as to what has changed between versions.Guideline 5b: Include in a (knotted) historized tie at most one role outside the identifier.

For a knotted historized tie having one or more knot roles, it is sometimes desirable to remodel the tiesuch that it only contains anchor roles. This can be achieved by the introduction of a new anchor on whichthe knots instead appear through knotted attributes and a role for that anchor that replaces the knot rolesin the tie.

Consider a situation in which we have a knotted historized tie of high cardinality with all anchor roles and allbut one knot role in its identifier. For every new version that is added only the historization part of the identifierchange and the rest of it will be duplicated, and if that remainder is long it leads to unnecessary duplication ofdata. For example, let an extension of the tie be {{〈RA1, #1〉, . . . , 〈RAn, #n〉, 〈RK , #21〉, 2010-02-25 10:52:21},{〈RA1, #1〉, . . . , 〈RAn, #n〉, 〈RK , #22〉, 2010-02-25 10:52:22}}. A remodeling to a static tie is possible. Thenew tie has {{〈RA1, #1〉, . . . , 〈RAn, #n〉, 〈RAnew, #538〉}}, the introduced anchor has {#538}, and the intro-duced knotted historized attribute has {〈#538, #21, 2010-02-25 10:52:21〉, 〈#538, #22, 2010-02-25 10:52:22〉}as extensions. The new model duplicates less data and makes the retrieval of the history over changes in theknotted identities (and values if joined) optional, but it also puts the values “farther away” from the tie.Whether or not to remodel will depend on the amount of duplication and how fast this duplication will occurby the speed with which new versions arrive.Guideline 5c: Replace the knot roles in a knotted historized tie with an anchor role and introduce its typeas a new anchor with corresponding knotted historized attributes if it has one or more knot roles outside theidentifier and the introduction of new versions leads to excessive duplication of data.

11

pajo

Anteckning

Vad betyder det att ett förhållande har föränderliga tillstånd?

pajo

Överstruket

pajo

Ersättningstext

Many

pajo

Anteckning

Det här stycket behöver fixas till

5. From Anchor Model to Relational Database

An anchor schema can be transformed into a relational database schema through the translation rulesgiven below. For each construct in the anchor model there exists a translation rule that maps the constructinto a relational database construct such as relation schema (table definition), relation (table), column, rowetc. The names used in the relational database are derived from the naming convention in Section 2.5.The highly decomposed relational database schemas that result from anchor models facilitate traceabilitythrough metadata, capturing information such as creator, source, and time of creation. Although important,metadata is not discussed further since its use does not differ from that of other modeling techniques.

5.1. AnchorsAn anchor A is mapped to a relation scheme A(S), where S = NI(A). The domain of S is I. S is the

primary key of the relation scheme A(S).

AC Actor

AC ID (PK)

#4711#4712#4713

integer

Figure 8: Example rows from the anchor AC Actor.

For example the anchor AC Actor with integer as domain, see Figure 8, corresponds to the relationscheme AC Actor(AC ID).

5.2. KnotsA knot K with range D is mapped to a relation scheme K(I, V ), where I = NI(K) and V = ND(K). The

domain of I is I, and the domain of V is D. I is the primary key.

GEN Gender

GEN ID (PK) GEN Gender

#0 ‘Male’#1 ‘Female’

bit string

Figure 9: Example rows from the knot GEN Gender.

For example the knot GEN Gender with string as range and an extension over bit × string, seeFigure 9, corresponds to the relation scheme GEN Gender(GEN ID, GEN Gender).

5.3. Static AttributesA static attribute BS with domain A and range D is mapped to a relation scheme B(I, V ), where

I = NIA(A) and V = ND(B). The domain of I is I and the domain of V is D. The column I is a primarykey of B and a foreign key with respect to the relation scheme A(S).

For example the attribute PE DAT Performance Date with domain PE Performance, datetime asrange, and an extension over integer × datetime, see Figure 10, corresponds to the relation schemePE DAT Performance Date(PE ID, PE DAT Performance Date).

12

pajo

Överstruket

pajo

Ersättningstext

an

pajo

Överstruket

pajo

Ersättningstext

schema

pajo

Överstruket

pajo

Ersättningstext

it

pajo

Överstruket

pajo

Ersättningstext

scheme

pajo

Infogad text

a

pajo

Överstruket

pajo

Infogad text

or

pajo

Överstruket

pajo

Överstruket

pajo

Ersättningstext

schemas

PE DAT Performance Date

PE ID (PK, FK) PE DAT Performance Date

#911 1994-12-10 20:00#912 1994-12-15 21:30#913 1994-12-22 19:30

integer datetime

Figure 10: Example rows from the static attribute PE DAT Performance Date.

5.4. Historized AttributesA historized attribute BH with domain A, range D and time range T is mapped to a relation scheme

B(I, V, P ), where I = NIA(A), V = ND(B) and P = NT(B). The domain of I is I, the domain of V is Dand the domain of P is T. The columns I and P together constitute a primary key of the relation scheme B.The column I is a foreign key with respect to the relation scheme A(S).

AC NAM Actor Name

AC ID (PK, FK) AC NAM Actor Name AC NAM ValidFrom (PK)

#4711 ‘Arfle B. Gloop’ 1972-08-20#4711 ‘Voider Foobar’ 1999-09-19#4712 ‘Nathalie Roba’ 1989-09-21

integer string date

Figure 11: Example rows from the historized attribute AC NAM Actor Name.

For example the historized attribute AC NAM Actor Name with AC Actor as domain, date as timerange, and an extension over integer× string× date, see Figure 11, corresponds to the relation schemeAC NAM Actor Name(AC ID, AC NAM Actor Name, AC NAM ValidFrom).

5.5. Knotted Static AttributesA knotted static attribute BKS with domain A and range K is mapped to a relation scheme B(I1, I2),

where I1 = NIA(A) and I2 = NIK(K). The domain of I1 is I and the domain of I2 is I. The column I1 is aprimary key of B and a foreign key with respect to the relation scheme A(S). The column I2 is a foreign keywith respect to the relation scheme K(I, V ).

AC GEN Actor Gender

AC ID (PK, FK) GEN ID (FK)

#4711 #0#4712 #1#4713 #1

integer bit

Figure 12: Example rows from the knotted static attribute AC GEN Actor Gender.

For example the knotted static attribute AC GEN Actor Gender with AC Actor as domain, GEN Genderas range, and an extension over integer×bit, see Figure 12 corresponds to the relation scheme AC GEN Ac-tor Gender(AC ID, GEN ID).

5.6. Knotted Historized AttributesA knotted historized attribute BKH with domain A, range K, and time range T is mapped to a relation

scheme B(I1, I2, P ), where I1 = NIA(A), I2 = NIK(K) and P = NT(B). The domain of I1 is I, the domain

13

of I2 is I and the domain of P is T. The columns I1 and P together constitute a primary key of the relationscheme B. The column I1 is a foreign key with respect to the relation scheme A(S). The column I2 is aforeign key with respect to the relation scheme K(I, V ).

AC PLV Actor ProfessionalLevel

AC ID (PK, FK) PLV ID (FK) AC PLV ValidFrom (PK)

#4711 #4 1999-04-21#4711 #5 2003-08-21#4712 #3 1999-04-21

integer tinyint date

Figure 13: Example rows from the knotted historized attribute AC PLV Actor ProfessionalLevel.

For example the knotted historized attribute AC PLV Actor ProfessionalLevel with AC Actor as domain,PLV ProfessionalLevel as range, date as time range, and an extension over integer×tinyint×date, see Fig-ure 13, corresponds to the relation scheme AC PLV Actor ProfessionalLevel(AC ID, PLV ID, AC PLV Valid-From).

5.7. Static TiesA static tie TS = {R1, . . . , Rn}, n ≥ 2, is mapped to a relation scheme T (M1, . . . ,Mn) using an order-

preserving map, such that Ri is mapped to Mi given a total order on TS . Every column Mi is a foreign keywith respect to the relation scheme Ai(S), where Ai is the type of the anchor role Ri and Mi = NIA(Ai).The primary key of the relation scheme T is I where I is the identifier of the static tie TS .

PE in AC wasCast

PE ID in (PK, FK) AC ID wasCast (PK, FK)

#911 #4711#912 #4711#913 #4712

integer integer

Figure 14: Example rows from the static tie PE in AC wasCast.

For example the static tie PE in AC wasCast = {in, wasCast} having the type of in as the anchorPE Performance, the type of wasCast as the anchor PE Performance, and an extension over integer ×integer, see Figure 14, corresponds to the relation scheme PE in AC wasCast(PE ID in, AC ID wasCast),with the primary key PE ID in, AC ID wasCast.

5.8. Historized TiesA historized tie TH = {R1, . . . , Rn, P}, n ≥ 2 is mapped to a relation scheme T (M1, . . . ,Mn, P ) using an

order-preserving map, such that Ri is mapped to Mi given a total order on TH . Every column Mi is a foreignkey with respect to the relation scheme Ai(S), where Ai is the type of the anchor role Ri, Mi = NIA(Ai),and P is a column with the domain T. The primary key of the relation scheme T is I where I is the identifierof the historized tie TH .

For example the historized tie ST atLocation PR isPlaying = {atLocation, isPlaying, date} having thetype of atLocation as the anchor ST Stage, the type of isPlaying as the anchor PR Program, and an extensionover integer× integer× date, see Figure 15, corresponds to the relation scheme ST atLocation PR is-Playing(ST ID atLocation, PR ID isPlaying, ST atLocation PR isPlaying ValidFrom), with the primarykey ST ID atLocation, ST atLocation PR isPlaying ValidFrom.

14

ST atLocation PR isPlaying

ST ID atLocation (PK, FK) PR ID isPlaying (FK) ST atLocation PR isPlaying ValidFrom (PK)

#55 #17 2003-12-13#55 #23 2004-04-01#56 #17 2003-12-31

integer integer date

Figure 15: Example rows from the historized tie ST atLocation PR isPlaying.

5.9. Knotted Static TiesA knotted static tie TKS = {R1, . . . , Rn, S1, . . . , Sm}, n ≥ 2 and m ≥ 1, is mapped to a relation scheme

T (M1, . . . ,Mn, N1, . . . , Nm) using an order-preserving map, such that Ri is mapped to Mi and Sj is mappedto Nj given a total order on TKS . Every column Mi is a foreign key with respect to the relation schemeAi(S), where Ai is the type of the anchor role Ri and Mi = NIA(Ai). Every column Nj is a foreign key withrespect to the relation scheme Kj(I, V ), where Kj is the type of the knot role Sj and Nj = NIK(Kj). Theprimary key of the relation scheme T is I where I is the identifier of the knotted static tie TKS .

AC parent AC child PAT having

AC ID parent (PK, FK) AC ID child (PK, FK) PAT ID having (PK, FK)

#4711 #4713 #1#4712 #4713 #0#4712 #4714 #0

integer integer bit

Figure 16: Example rows from the knotted static tie AC parent AC child PAT having.

For example the knotted static tie AC parent AC child PAT having = {parent, child, having} havingthe type of parent and child as the anchor AC Actor, the type of having as the knot PAT ParentalType,and an extension over integer× integer× bit, see Figure 16, corresponds to the relation scheme AC par-ent AC child PAT having(AC ID parent, AC ID child, PAT ID having), with the primary key AC ID parent,AC ID child, PAT ID having.

5.10. Knotted Historized TiesA knotted historized tie TKH = {R1, . . . , Rn, S1, . . . , Sm, P}, n ≥ 2 and m ≥ 1, is mapped to a relation

scheme T (M1, . . . ,Mn, N1, . . . , Nm) using an order-preserving map, such that Ri is mapped to Mi and Sj ismapped to Nj given a total order on TKH . Every column Mi is a foreign key with respect to the relationscheme Ai(S), where Ai is the type of the anchor role Ri and Mi = NIA(Ai). Every column Nj is a foreignkey with respect to the relation scheme Kj(I, V ), where Kj is the type of the knot role Sj and Nj = NIK(Kj).The column P has the domain T. The primary key of the relation scheme T is I where I is the identifier ofthe knotted historized tie TKH .

AC part PR in RAT got

AC ID part (PK, FK) PR ID in (PK, FK) RAT ID got (FK) AC part PR in RAT got ValidFrom (PK)

#4711 #17 #8 2005-03-12#4711 #17 #10 2008-08-29#4712 #17 #10 2006-10-17

integer integer tinyint date

Figure 17: Example rows from the knotted historized tie AC part PR in RAT got.

15

For example the knotted historized tie AC part PR in RAT got = {part, in, got, date} having the type ofpart as the anchor AC Actor, the type of in as the anchor PR Program, the type of got as the knot RAT Rating,and an extension over integer × integer × tinyint × date, see Figure 17, corresponds to the relationscheme AC part PR in RAT got(AC ID part, PR ID in, RAT ID got, AC part PR in RAT got ValidFrom),with the primary key AC ID part, PR ID in, AC part PR in RAT got ValidFrom.

6. Physical Implementation

In this section the physical implementation of an anchor database is discussed.

6.1. Indexes in an Anchor DatabaseIndexes in general, and clustered indexes in particular, are used to delimit the range of a table scan or

remove the need for a table scan altogether as well as removing or delimiting the need to sort searched-for-data.A clustered index is an index with a sort order equal to the physical sort order on disc of the table to whichthe clustered index refers. In this section we will discuss what indexes on the various constructs in an anchordatabase will look like and why it is important to have these indexes from a performance point of view. SeeFigure 21 to see how some indexes are created for different table types in an anchor database.

6.1.1. Indexes on Anchor and Knot TablesSince both anchor and knot tables are referenced from a number of other tables (attribute and tie tables),

a unique clustered index should be created on the primary key in each anchor and knot table to ensure thatthe number of records that have to be scanned in the queried tables will not be too large, e.g. the number oftime-costly disc-accesses will be minimal. During query execution a unique index over the values in a knottable helps to improve performance when a condition is set on the column. Searching can then stop once amatch has been found.

6.1.2. Indexes on Attribute TablesAn attribute table contains a foreign key against an anchor table and optionally information on historization.

A composite unique clustered index should be created on the foreign key column in ascending order togetherwith the historization column in descending order1. This means that on the physical media where the tabledata is stored the latest version for any given identity comes first. This will ensure more efficient retrieval ofdata since a good query optimizer will realize that a sort operation is unnecessary. The optimizer will alsoutilize the created clustered indexes when joining attribute tables with their anchor table, ensuring that nosorting has to be made since both the anchor and attribute tables are already physically ordered on disc.This also means that the accesses to the storage media can be done in large sequential chunks, provided thatdata is not fragmented on the storage media itself.

6.1.3. Indexes on Tie TablesTie tables contain two or more foreign keys against an anchor table, one or more foreign keys towards

knot tables, and optionally information on historization. Differently from anchor, knot and attribute tables,the indexes on tie tables are not given. A tie table should always have a unique clustered index over theprimary key, however the order in which the columns appear can not be unambiguously determined from theidentifier of the tie. An order has to be selected that fits the need best, perhaps according to what type ofqueries will be the most common. In some cases an additional index with another ordering of the columnsmay be necessary.

In our running example in Section 3 one historized knotted tie AC part PR in RAT got models thecurrent rating that an actor has got for performing a certain program, together with the time at which this

1Note that the indexing order of the columns is of less importance. Database engines read data in large chunks into memoryand reading forwards or backwards in memory is equally effective. We do, however, in order to be consistent, use the sameordering in all of our indexes.

16

pajo

Överstruket

pajo

Överstruket

pajo

Ersättningstext

shows

pajo

Infogad text

,

pajo

Anteckning

Vad betyder det?

rating became relevant. The unique clustered index of the corresponding tie table is composed of the foreignkeys against the anchor tables in ascending order, together with the historization column in descending order.This index will ensure that we do not have to scan the entire AC Actor table for each row of the entirePR Program table in order to find the values that we are looking for.

6.1.4. Partitioning of TablesIn addition to indexes, access to the tables in the anchor database may be improved through partitioning.

For example, if the number of actors and programs are very large and you mostly access the ones withthe highest rating (see the tie table above), you can further improve performance by partitioning the tietable based on the rating. Partitioning, if the database engines support it, will split the underlying data onthe physical media into as many chunks as the cardinality of the knot. This allows for having frequentlyaccessed partitions on faster discs and less data to scan during query execution. The two best candidates forpartitioning in an anchor database are knot values and the time type used for historization (i.e. we assumethat users tend to access certain values of knots more frequently than other values, as well as certain datesbeing of more or less importance).

6.2. Loading PracticesWhen loading data into an anchor database a zero update strategy is used. This means that only insert

statements are allowed and that data is always added, never updated. Delete statements are allowed onlywhen applied to remove erroneous data. A complete history is thereby stored for accurate information [20].If you have non-persistent entities or relations in your model you should model these with a state knot. Thatway you will be keeping a history of when they ceased to exist. You must keep this strategy in mind whenbuilding your model.

A zero update strategy together with properly maintained metadata also alleviates the need for transactions.Through metadata it is always possible to find the rows that belong to a certain batch, came from a specificsource system, were loaded by a particular account, or were added between some given dates. Instead ofhaving a fault recovery system in place every time data is added, it can be prepared and used only after anerror is found. Since nothing has been updated, the introduced errors are in the form of rows, and these caneasily be found. Scripts can be made that allow for removal of erroneous rows, regardless of when they wereadded. Data will never change once it is in place. Its integrity is guaranteed.

A disadvantage that is avoided by not using updates is their higher cost in terms of performance whencompared to insert statements. By not doing updates there is also no row locking on existing rows, a factthat lessens the impact on read performance when writing is done in the database.

6.3. Views and FunctionsDue to the large number of tables and the added complexity from handling historical data, an abstraction

layer in the form of views and functions is added to simplify querying in an anchor database. It de-normalizesthe anchor database and retrieves data from a given temporal perspective. There are three different typesof views and functions for each anchor table corresponding to the most common use cases when queryingdata: latest view, point-in-time function, and interval function [26, 20], all of which are based on an abstractcomplete view. Views and functions for tie tables are created in a way analogous with those for anchor tables.See the right side of Figure 21 for a couple of examples of how these views and functions are created.

6.3.1. Complete ViewThe complete view of an anchor table is a de-normalization of it and its associated attribute tables. It is

constructed by left outer joining the anchor table with all its associated attribute tables.

6.3.2. Latest ViewThe latest view of an anchor table is a view based on the complete view, where only the latest values for

historized attributes are included. In order to find the latest version a sub-select is added that ensures thatthe time point used for historization is the latest one for each identity. See Figure 18.

17

pajo

Överstruket

pajo

Ersättningstext

for

pajo

Överstruket

pajo

Infogad text

ing

pajo

Överstruket

pajo

Anteckning

alleviates the need for transactions Vad betyder det?

pajo

Överstruket

pajo

Infogad text

, thereby guaranteeing its integrity.

pajo

Överstruket

pajo

Överstruket

pajo

Infogad text

to

select * from lST Stage

ST ID ST NAM Stage Name ST NAM ValidFrom ST LOC Stage Location

#55 ‘Shakespeare’s Globe’ 1997-01-01 ‘Maiden Lane’#56 ‘Cockpit’ 1609-01-01 ‘Drury Lane’

integer string date string

Figure 18: Example rows from the latest view lST Stage for the anchor ST Stage.

6.3.3. Point-in-time FunctionThe point-in-time function is a function for an anchor table with a time point as an argument returning

a data set. It is based on the complete view where for each attribute only its latest value before or at thegiven time point is included. A sub-select is added that ensures that the historization time is the latest oneearlier than or on the given time point for each identity. See Figure 19.

select * from pST Stage(‘1608-01-01’) where ST NAM Stage Name is not null


#55 ‘The Globe Theatre’ 1599-01-01 ‘Maiden Lane’


Figure 19: Example rows from the point-in-time function pST Stage given the time point ‘1608-01-01’ forthe anchor ST Stage.

The absence of a row for the stage on Drury Lane (see the previous Figure 18) in the returned data set ofFigure 19 indicates that no data is present for it (which in this case depends on the fact that the theatre wasnot yet built in 1608). Without the condition on the column ST NAM Stage Name a row with a null valuein that column would have been shown.

6.3.4. Interval FunctionThe interval function is a function for an anchor table taking two time points as arguments and returning

a data set. It is based on the complete view where for each attribute only values between the given timepoints are included. Here the sub-select must ensure that the time point used for historization lies within thetwo provided time points. See Figure 20.

select * from dST Stage(‘1598-01-01’, ‘1998-01-01’)


#55 ‘The Globe Theatre’ 1599-01-01 ‘Maiden Lane’#55 ‘Shakespeare’s Globe’ 1997-01-01 ‘Maiden Lane’#56 ‘Cockpit’ 1609-01-01 ‘Drury Lane’


Figure 20: Example rows from the interval function dST Stage in the interval given by the time points‘1598-01-01’ and ‘1998-01-01’ for the anchor ST Stage.

6.3.5. Natural Key ViewThe natural key of the anchor PE Performance is composed of when and where it was held. When it

was held can be found in the attribute PE DAT Performance Date and where it was held in the attributeST LOC Stage Location. A performance is uniquely identified by the values in these two attributes. However,these attributes belong to different anchors, which is commonplace in an anchor model. The parts that makeup a natural key may be found in attributes spread out over several anchors.

18

1 create table AC Actor (2 AC ID int not null,3 primary key clustered (4 AC ID asc5 )6 ) ;7 create table GEN Gender (8 GEN ID tinyint not null,9 GEN Gender varchar(42) not null unique,

10 primary key clustered (11 GEN ID asc12 )13 ) ;14 create table AC GEN Actor Gender (15 AC ID int not null foreign key16 references AC Actor ( AC ID ),17 GEN ID tinyint not null foreign key18 references GEN Gender ( GEN ID ),19 primary key clustered ( AC ID asc )20 ) ;21 create table AC NAM Actor Name (22 AC ID int not null foreign key23 references AC Actor ( AC ID ),24 AC NAM Actor Name varchar(42) not null,25 AC NAM ValidFrom date not null,26 primary key clustered (27 AC ID asc,28 AC NAM ValidFrom desc29 )30 ) ;31 create table PR Program (32 PR ID int not null,33 primary key clustered (34 PR ID asc35 )36 ) ;37 create table RAT Rating (38 RAT ID tinyint not null,39 RAT Rating varchar(42) not null unique,40 primary key clustered (41 RAT ID asc42 )43 ) ;44 create table AC part PR in RAT got (45 AC ID part int not null foreign key46 references AC Actor ( AC ID ),47 PR ID in int not null foreign key48 references PR Program ( PR ID ),49 RAT ID got tinyint not null foreign key50 references RAT Rating ( RAT ID ),51 AC part PR in RAT got ValidFrom date not null,52 primary key clustered (53 AC ID part asc,54 PR ID in asc,55 AC part PR in RAT got ValidFrom desc56 )57 ) ;

58 create view lAC part PR in RAT got as59 select60 [AC PR RAT].AC ID part,61 [AC PR RAT].PR ID in,62 [AC PR RAT].RAT ID got,63 [RAT].RAT Rating,64 [AC PR RAT].AC part PR in RAT got ValidFrom65 from66 AC part PR in RAT got [AC PR RAT]67 left join68 RAT Rating [RAT]69 on70 [RAT].RAT ID = [AC PR RAT].RAT ID got71 where72 [AC PR RAT].AC part PR in RAT got ValidFrom = (73 select74 max(sub.AC part PR in RAT got ValidFrom)75 from76 AC part PR in RAT got [sub]77 where78 [sub].AC ID part = [AC PR RAT].AC ID part79 and80 [sub].PR ID in = [AC PR RAT].PR ID in81 ) ;82 create function pAC Actor (83 @timepoint date84 ) returns table return85 select86 [AC].AC ID,87 [AC GEN].GEN ID,88 [GEN].GEN Gender,89 [AC NAM].AC NAM Actor Name,90 [AC NAM].AC NAM ValidFrom91 from92 AC Actor [AC]93 left join94 AC GEN Actor Gender [AC GEN]95 on96 [AC GEN].AC ID = [AC].AC ID97 left join98 GEN Gender [GEN]99 on

100 [GEN].GEN ID = [AC GEN].GEN ID101 left join102 AC NAM Actor Name [AC NAM]103 on104 [AC NAM].AC ID = [AC].AC ID105 and106 [AC NAM].AC NAM ValidFrom = (107 select108 max([sub].AC NAM ValidFrom)109 from110 AC NAM Actor Name [sub]111 where112 [sub].AC ID = [AC].AC ID113 and114 [sub].AC NAM ValidFrom <= @timepoint115 ) ;

Figure 21: Definitions in Transact-SQL, illustrating how primary keys, clustered indexes, and other constraintsare defined in order to enable table elimination.

When data is added to an anchor database it has to be determined if it is for an already known entity orfor a new one. This is done by looking at the natural key, and as this often has to be done frequently, viewsfor each anchor table is provided that act as a look-ups for its natural keys. Such a view joins the necessaryanchor, attribute, and tie tables in order to translate natural key values into their corresponding identifiers.Note that it is possible for a single anchor to have several differently composed natural keys. Parts of the

19

key may also consist of historized attributes, in which case the view may have several rows for the sameidentifier. An example of a natural key view can be seen in Figure 22.

select * from nPE Performance

PE ID PE DAT Performance Date ST LOC Stage Location

#911 1994-12-10 20:00 ‘Maiden Lane’#912 1994-12-15 21:30 ‘Maiden Lane’#913 1994-12-22 19:30 ‘Drury Lane’

integer date string

Figure 22: Example rows from a natural key view nPE Performance for the anchor PE Performance, mappingthe identity PE ID to its natural key, which is composed of the attributes PE DAT Performance Date andST LOC Stage Location.

6.4. Utilizing Table EliminationModern query optimizers utilize a technique called table (or join) elimination [18], which in practice

implies that tables containing not selected attributes in queries are automatically eliminated. The optimizerwill remove table T from the execution plan of a query if the following two conditions are fulfilled:

(i) no column from T is explicitly selected(ii) the number of rows in the returned data set is not affected by the join with T

In order to take advantage of table elimination primary and foreign keys have to be defined in the database.Some examples showing how this is done for anchor, knot, attribute and tie tables can be seen in the leftside of Figure 21. The views and functions defined in Section 6.3 are created in order to take advantage oftable elimination, see the right side of Figure 21 for a couple of examples. The anchor table is used as theleft table in the view (or function) with the attribute tables left outer joined. The left join ensures that thenumber of rows retrieved is at least as many as in the anchor table, provided that no conditions are set inthe query. Furthermore, since the join is based on the primary key in the attribute table, uniqueness is alsoensured, hence the number of resulting rows is equal to the number of the rows in the anchor table. If for ananchor having some attribtues, a subset of these are used in a query over the defined views or functions, theothers can be eliminated and not touched during execution.

If a condition on the values in an attribute table is set in the query, such that it selects a proper subsetof the rows, the foreign key constraint ensures that all rows must have a corresponding instance in theanchor table. The optimizer can use this fact to eliminate the anchor table and work with the subset as anintermediate result set. Once this is done, the same logic as above applies, and attributes not present in thequery can be eliminated.

Typical queries only retrieve a small number of attributes, which implies that table elimination isfrequently used, yielding reduced access time.

7. Performance in Anchor Databases

There are several considerations when it comes to performance in anchor databases, in particular howthey perform when reading and writing data in data warehouse and OLTP environments. Experiments havebeen carried out that test the behaviour of reading data in data warehouse environments, using queries thatscan large amounts of data. As for writing data in data warehouse environments as well as reading andwriting data in OLTP environments we do not have any experimental results but only our experience fromactual running environments. In these cases, performance has not been an issue.

20

7.1. Performance Influencing ConditionsTo summarize the results when reading data in a data warehouse environment there are situations where

an anchor database performs relatively better and worse than databases in which other modeling techniqueshave been used. Given a body of information and the intended searches over it, the following conditionscomparatively improves performance in an anchor database with respect to a less normalized model. It isassumed that this body of information can be described using entities, identifiers, attributes and values.

(a) The data is time dependent and its changes need to be kept.(b) The data is sparse, i. e. if many entities lack some of their attribute values.(c) The ratio between unique data values and their occurrences is low.(d) The ratio between the amount of data and the amount of data residing in identifiers is high.(e) There are many identifiers.(f) There are many attributes.(g) Most searches address only a small proportion of the attributes in the entities over which the search is

done.(h) Most searches include conditions that impose bounds on values.

For data warehouse environments, many of the conditions above are typically fulfilled. It is often requiredthat a history is kept, that data is sparse by nature or by asynchronous arrival, that there are data thatcan be used for categorization and have only a few unique values, that some data values are of considerablelength, that there are many rows in some tables, that some entities have many attributes, that most queriesuse only a few attributes, and that most queries have conditions limiting the number of rows in the resultsets. However, these conditions can be fulfilled to different degrees. Therefore, we carried out an experimentin order to determine at what degree an anchor database performs better than a less normalized database.

Scenario

Condition 1 2 3 4

(a) Number of versions per instance 1 1 1 2(b) Degree of sparsity 0% 0% 50% 50%(c) Number of knots per attributes 0 1/3 1/3 1/3(d) Average data/identifiers ratio 1/4 3/1 3/1 3/1

Tested combinations

(e) Number of rows 0.1, 1, 10 million(f) Number of attributes 6, 12, 24(g) Number of queried attributes 1, 1/3, 2/3, all(h) Number of conditioned attributes none, all

Figure 23: A table showing to what degree different conditions were fulfilled in the four scenarios used in theexperiment and which combinations have been tested for each scenario. The scenarios and combinationsyield a total of 288 tests.

7.2. Experimental VerificationThe query times for varying degrees of condition fulfillment have been measured in a series of tests in

which an anchor database was compared with a less normalized database containing the same information.The experiment consisted of four scenarios, see Figure 23 for which the data and requirements changed. Ineach scenario a number of combinations were tested. In total 288 measurements per model were made andthe query times compared2. The queries were run against the latest view in an anchor database and in a

2The tests were performed using an Intel Core 2 Quad 2.6GHz CPU, 8GB of RAM, and two SATA hard disks. The databaseused was Microsoft SQL Server 2008, set up with data and tempdb on separate disks. Scripts and detailed results are availablefrom http://www.anchormodeling.com.

21

pajo

Överstruket

pajo

Ersättningstext

compared

pajo

Överstruket

database where the anchor and all its attribute tables have been de-normalized into a single table. Ratiosthat compare query times in the anchor database with those for the single table were calculated and can beseen in the tables in Figure 24.

In the first scenario the information consisted solely of one byte attributes together with a four byte key.The number of unique attribute values were assumed to be at least as many as the number of identifiers,which prevented us from using knotted attribute tables in the anchor database. In practice there will ofcourse be several repetitions of values since one byte can store much fewer unique values than the maximumnumber of rows in the tests, but for the sake of getting a high ratio between unique values and the number oftheir occurrences we decided not to make use of this fact. In the second scenario, the information consistedof six different data types together with a four byte key. The average size of a data value was 12 bytes. Inthe cases when the database had 12 and 24 attributes, these six types were repeated. In the third scenario,to simulate sparseness half of the rows in every attribute table found in the second scenario was removed,but the all instances in the anchor table kept. Finally, the fourth scenario was based on the third with theaddition of historization. One out of every six attributes was historized and one extra version was added foreach instance of the key in the table, effectively doubling the number of rows.

7.3. Explanation of the ResultsFrom the results it is obvious that there are situations where an anchor database outperforms a less

normalized database as well as situations where it does not. The more conditions that are fulfilled to somedegree the better the anchor database performs. In some cases, a single condition fulfilled to a degreehigh enough is sufficient for an anchor database to outperform the single table database. This behaviourcan be seen in the graphs in Figure 25, which are extrapolated from the results of the experiment. Notethat the number of measurements in some cases are few and that the graphs should be seen as showing anapproximative behaviour. In the graphs a single condition is allowed to change, while all others remain fixed.What is seen in the graphs can be explained as follows.

In (a), see Figure 25a, historizing a column value in the single table database will duplicate information.In an anchor database, only the table for the affected attribute will gain extra rows as new versions areadded. The impact on performance is evident when using time dependent queries, such as finding the latestversion or the version that was valid at a certain point in time. This is due to the fact that self-joins orsub-selects have to be made and the narrow and optimally ordered attribute tables perform much betterthan the wide single table, even though it had indexes on the historization columns.

In (b), see Figure 25b, sparse data is represented by the absence of a row in an anchor database, therebyreducing the amount of data that needs to be scanned during a query. In the single table database,“holes”in the data will be represented by null values, which a query manager must inspect in order to take theappropriate action. Therefore, the query time is constant for the single table database, and decreasing downto zero as all rows go absent in the anchor database.

In (c), see Figure 25c, the knot construction is beneficial for performance when there is a small number ofunique values shared by many instances, and these values are longer than the key introduced in the knot.Rather than having to repeat a redundant value, a knot table can be used and its key referenced using asmaller data type3. If all attribute tables are knotted, the query time becomes significantly shorter in theanchor database compared to the single table database, in which no changes are made and the query timeremains constant.

In (d), see Figure 25d, for every instance of an entity its identity is propagated into many tables in ananchor database, increasing the total size of the database. The less the overhead is relative the informationnot stored in keys, the less the negative performance impact will be. The single table is not affected, since allattributes reside in the same row as the identity, compared to several attribute tables. If the amount of datais much larger than the amount residing in keys the query time in an anchor database will approach that inthe single table database.

3Although not examined in the experiment, in cases where a knot is unsuitable, perhaps due to values that cannot bepredicted, the candidate attribute table can be compressed. This provides a higher degree of selectivity than can be achieved ina corresponding single table database. Note that compressed tables trade off less disk space for higher cpu utilization.

22

pajo

Överstruket

pajo

Ersättningstext

had

pajo

Överstruket

pajo

Infogad text

. However,

pajo

Överstruket

pajo

Ersättningstext

clear

pajo

Infogad text

being

pajo

Överstruket

pajo

Ersättningstext

shown

pajo

Överstruket

pajo

Infogad text

s

pajo

Överstruket

pajo

Överstruket

pajo

Infogad text

absence of

pajo

Överstruket

pajo

Infogad text

to

pajo

Infogad text

,

pajo

Överstruket

pajo

Ersättningstext

this

Scenario Number of attributes in the query and millions of rows

Modeledattributes

Single attribute 1/3 of attributes 2/3 of attributes All attributes

0.1 1 10 0.1 1 10 0.1 1 10 0.1 1 10

1 6 −2 −2 −2 −2 −4 −3 −4 −8 −6 −4 −9 −912 1 −2 1 −3 −6 −4 −5 −13 −10 −7 −17 −1624 −2 −2 1 −6 −10 −7 −8 −21 −7 −9 −27 −16

2 6 1 +2 +2 −7 −2 1 −7 −3 −5 −5 −2 −1212 +2 +3 +4 −6 −2 −3 −8 −2 −12 −9 −4 −2624 +2 +6 +5 −3 −2 −6 −5 −4 −6 −7 −14 −22

3 6 1 +2 +3 −2 1 1 −3 1 1 −3 −2 −412 +2 +4 +5 −2 1 1 −3 −2 −5 −4 −2 −1324 +3 +7 +8 −2 1 −3 −3 −2 −3 −5 −3 −13

4 6 +2 +2 +3 1 +2 +2 1 1 +2 −2 1 112 +2 +4 +3 1 +2 +2 1 1 −2 −2 1 −524 +4 +6 +6 1 1 1 1 1 1 −2 1 −4

(a)

Scenario Number of attributes in the query and millions of rows

Modeledattributes

Single attribute 1/3 of attributes 2/3 of attributes All attributes

0.1 1 10 0.1 1 10 0.1 1 10 0.1 1 10

1 6 −2 1 −2 −2 −2 −2 −2 −2 −3 −2 −2 −412 −2 −3 1 1 −2 −2 −2 −3 −4 −4 −4 −724 1 −3 1 −2 −4 −4 −2 −6 −4 −4 −8 −6

2 6 +2 1 +3 +2 1 +2 1 −2 1 1 −3 −212 +2 +2 +5 1 −2 1 −2 −4 −3 −2 −4 −424 +3 +3 +10 −2 −2 −2 −3 −4 −2 −3 −4 −3

3 6 +2 1 +4 +2 +5 +5 +2 −2 +2 +2 −2 112 +2 +2 +7 1 1 +3 1 −2 1 1 −3 124 +3 +3 +14 1 −2 +2 −2 −2 +2 −2 −3 1

4 6 +3 +3 +7 +3 +7 +9 1 1 +3 1 1 +212 +4 +4 +8 +2 +2 +3 1 1 1 1 1 124 +8 +11 +14 +2 1 +2 +2 1 +2 1 1 1

(b)

Figure 24: Query times in an anchor database compared to a de-normalized single table database for querieswithout (a) and with (b) conditions. When the number is positive +n the anchor database is n times fasterthan the single table database, when negative −n the anchor database is n times slower, and when 1 theyare roughly the same.

In (e), see Figure 25e, the larger size of a row in the single table compared to the normalized tables inthe anchor database causes more data having to be read during a query, provided that not all attributes areused in the query. Eventually the time it takes to read this data from the single table will be longer than thetime it takes to do a join in the corresponding anchor database. Furthermore, the less attributes that areused in the query the sooner the anchor database will overtake the single table database.

In (f), see Figure 25f, the query time in an anchor database is independent of the number of attributetables for a query that does not change between tests. In contrast, for the single table the amount of datathat needs to be scanned during a query grows with the number of attributes it contains and queries take

23

pajo

Överstruket

pajo

Infogad text

s

pajo

Infogad text

those of

(a) (b) (c) (d)

(e) (f) (g) (h)

Figure 25: Extrapolated graphs comparing query times in an anchor database (dashed line) to a de-normalizedsingle table (solid line) for (a) increased level of historization, (b) increased sparsity, (c) decreased number ofunique values, (d) increased amount of data per key, (e) increased number of rows, (f) increased numberof attributes in the model, (g) increased number of attributes in the query, and (h) increased number ofconditions in the query.

longer to execute. As long as a query does not use all attributes the anchor database will be faster than thesingle table. Furthermore, the more rows that are scanned the sooner the anchor database will overtake thesingle table database, due to the difference in scanned data size being multiplied by the number of rows.

In (g), see Figure 25g, the performance will increase for the anchor database the fewer attributes that areused in the query. This is thanks to table elimination, which effectively reduces the number of joins havingto be done and the amount of data having to be scanned. If the attributes after table elimination are fewenough, the extra cost of the remaining joins with narrow attribute tables will take less time than scanningthe wide single table, in which data not queried for is also contained.

In (h), see Figure 25h, for every added condition limiting the number of rows in the final result set, theintermediate result sets stemming from the join operations in an anchor database will become smaller andsmaller. In other words, progressively less data will have to be handled, which speeds up any remaining joins.In the single table database, the whole table will be scanned regardless of the limiting conditions, unless it isindexed. However, it is not common that a wide table has indexes matching every condition, which is whyno such indexes were used in the experiment.

7.4. Conclusions from the ExperimentOne limitation of the experiment is the small number of steps in which the degrees of condition fulfilment

are allowed to vary. Due to the large number of conditions, a set of tests had to be chosen that wouldyield indicative results without taking too much time to actually perform. A deeper investigation into therelationships between the conditions may be done as further research. The behaviour in other RDBMS couldbe different and therefore need to be verified. Furthermore, the performance on different types of hardwareshould also be measured and compared.

The conclusion is that in many situations anchor databases are less I/O intense under normal queryingoperations than in less normalized databases. This makes anchor databases suitable in situations where I/Otends to become a bottleneck, such as in data warehousing [17]. Some of the results can also be carried overto the domain of OLTP, where queries often pinpoint and retrieve a small amount of data through conditions.In a situation where an anchor database performs worse, the impact of that disadvantage will have to beweighed against other advantages when using Anchor Modeling.

24

pajo

Infogad text

,

pajo

Infogad text

s

pajo

Överstruket

pajo

Ersättningstext

of

8. Benefits

The Anchor Modeling approach offers several benefits. The most important of them are categorized andlisted in the following subsections.

8.1. Ease of ModelingSimple concepts and notation Anchor models are constructed using a small number of simple concepts

(Section 2). This simplicity and the use of modeling guidelines (Section 4) reduce the number of optionsavailable when solving a modeling problem, thereby reducing the risk of introducing errors in an anchormodel.

Historization by design Managing different versions of information is simple, as Anchor Modeling offers na-tive constructs for information versioning in the form of historized attributes and ties (Sections 2.3, 2.4).

Iterative and incremental development Anchor Modeling facilitates iterative and agile development,as it allows independent work on small subsets of the model under consideration, which later can beintegrated into a global model. Changing requirements is handled by additions without affecting theexisting parts of a model (cf. bus architecture [15]).

Reduced translation logic The symbols introduced to graphically represent tables in an anchor model(Section 2) can also be used for conceptual and logical modeling. This gives a near 1-1 relationshipbetween all levels of modeling, which simplifies, or even eliminates, the need for translation logic inorder to move between them.

Reusability and automation The small number of modeling constructs together with the naming conven-tion (Section 2.5) in an anchor model yield a high degree of structure, which can be taken advantage ofin the form of reusability and automation. For example, ready-made templates for recurring tasks canbe made and automatic code generation is possible, speeding up development.

8.2. Simplified Database MaintenanceEase of attribute changes In an anchor database, data is historized on attribute level rather than row

level. This facilitates tracing attribute changes directly instead of having to analyze an entire row inorder to derive which of its attributes have changed. In addition, the predefined views and functions(Section 6.3) also simplify temporal querying. With the additional use of metadata it is also possible totrace when and what caused an attribute value to change.

Absence of null values There are no null values in an anchor database. This eliminates the need tointerpret null values [21] as well as waste of storage space.

Simple index design The clustered indexes in an anchor database are given for attribute, anchor and knottables. No analysis of what columns to index need be undertaken as there are unambiguous rules fordetermining what indexes are relevant.

Asynchronous arrival of data In an anchor database asynchronous arrival of data can be handled ina simple way. Late arriving data will lead to additions rather than updates, as data for a singleattribute is stored in a table of its own (compared to other approaches where a table may includeseveral attributes) [15, pp. 271–274].

25

pajo

Överstruket

pajo

Överstruket

8.3. High Performance DatabasesHigh run-time performance For many types of queries, an anchor database achieves much better per-

formance compared to databases that contain tables with many columns. The combination of fewercolumns per table, table elimination (Section 6.4), and minimal redundancy (Section 2.2) restricts thedata set scanned during a query, yielding less response time.

Efficient storage Anchor databases are often smaller sized than less normalized databases. The high degreeof normalization (Section 9) together with the knot construction (Section 2.2), absence of null values,and the fact that historization never unnecessarily duplicates data means that the total database sizewill be smaller than that of a corresponding less normalized model.

Parallelized physical media access When using views and functions (Section 6.3), the high degree ofdecomposition and table elimination makes it possible to parallelize physical media access by separatingthe underlying tables onto different media [17]. Tables that are queried more often than others canalso reside on speedier media for faster access.

Less index space needed Little extra index space is needed in an anchor database, since most indexesare clustered and only rearrange the actual data. The narrowness of the tables that store information,attribute and knot tables, means that an additional index only in rare occasions provide a performancegain. Most of the time the table can be scanned quickly enough anyway.

Relevant compression Attribute tables in an anchor database can be compressed. Database engines oftenprovide compression on table level, but not on column level. In an anchor database compression can bepinpointed to where the effect will be the largest, compared to a less normalized database.

Reduced deadlock risk In databases with select, insert, update and delete operations, there is a risk ofentering a deadlock if these are not carried out carefully. Due to the fact that in anchor databases onlyinserts and selects are done, which produce less locking, the risk of entering a deadlock is reduced.

Better concurrency In a table with many columns a row lock will keep other queries waiting if the samerow is accesses by all of them, unless column locking is possible. This is regardless of which columnsare accessed by the queries. In an anchor database the corresponding columns are spread over severaltables, and queries will not have to wait for rows to unlock as long as different attribute tables areaccessed by them.

The benefits of the Anchor Modeling approach are relevant for any database but especially valuable fordata warehouses. In particular, the support for iterative and incremental development, the ease of temporalquerying, and the management of asynchronous arrival of data help providing a stable and consistent interfaceto rapidly changing sources and demands of a data warehouse.

9. Related Research

Anchor Modeling is compared to other approaches in the following paragraphs.

9.1. Data Warehousing ApproachesOne well established approach for data warehouse design is the Dimensional Modeling approach proposed

by Kimball [15]. In dimensional modeling, a number of star-join schemas (stars for short) are used to capturethe modeled domain and each star focuses on a specific process, see Figure 26 It is composed of a fact table,for capturing process activites and important messages as well as a number of dimension tables for capturingentities, attributes and descriptions. In contrast to Anchor Modeling, Kimball advocates a high degree ofde-normalization of the dimensions. The rationale for this is to reduce the number of joins needed whenaccessing the data warehouse and in this way speed up the response time. Furthermore, also Inmon pointsout that “lots of little tables” leads to performance problems [12, p. 104], however, he does not to the same

26

pajo

Överstruket

pajo

Ersättningstext

are less normalized

pajo

Överstruket

pajo

Infogad text

have

pajo

Överstruket

pajo

Överstruket

pajo

Ersättningstext

database

pajo

Anteckning

store information, attribute and knot tables, Vad betyder det?

pajo

Överstruket

pajo

Infogad text

d

pajo

Infogad text

,

pajo

Överstruket

pajo

Överstruket

Date FKStage FKProgram FKRevenueAudience

Performance F

Date IDStage IDProgram IDActor ID

Play (PEAC) F

Date IDDate Attr...

Date D

Stage IDStage NameStage Location

Stage D

Program IDProgram Name

Program D

Actor IDActor Namn (1)Actor ProfessionalLevel (1)Actor GenderActorParent IDActorParent Name (1)ActorParent ProfessionalLevel (1)ActorParent Gender

Actor D

Program IDProgram Name

Program D

Stage IDStage NameStage Location

Stage D

Date IDDate Attr...

Date D

Figure 26: A part of the running example (section 3) modeled as a star schema.

extent as Kimball advocate complete de-normalization, but leaves this as an issue for the designers. However,though highly normalized, anchor databases have proven to offer fast retrieval.

Another approach for data warehousing is the Data Vault approach [27]. It contains three primitives:Hub, Satellite and Link. Hubs are used to capture business entities and add a data warehouse identifier tothem, Satellites are used to capture the attributes of the business entities, and Links, as the name indicates,capture the relationships between the business entities. The instances of these three concepts are physicallyrepresented as database tables, which in addition to the business/data warehouse attributes also contain metadata attributes such as Load DTS and Record Source [28] (see Figure 27). In terms of Anchor Modeling,Hubs translate approximately to anchors (and contain in addition business keys), Satelites translate toattributes or a set of attributes, and Links translate to historized ties. We interpret business keys used in theData Vault terminology as natural keys in traditional database design terminology.

PeIDDate (NK)Stage (NK)Program (NK)Load DTSRecord Source

PerformanceHub

AID (FK)Load DTSActor NameRecord Source

Actor Name Satellite

AIDActor Name (NK)Load DTSRecord Source

Actor Hub

PeID (FK)Load DTSRevenueDateAudienceRecord Source

PerformanceSatellite

AID (FK)Load DTSGenderParent IDParenhoodRecord Source

Actor Satellite

AID (FK)Load DTSPLV (Professional Level)Record Source

Actor Prof.Level Satellite

AID (FK)PeID (FK)Load DTSRecord Source

PEAC Link

PrIDProgram Name (NK)Load DTSRecord Source

Program Hub

PrID (FK)Load DTSNameRecord Source

Program NameSatellite

AID (FK)PrID (FK)Load DTSRecord Source

ACPR Link

AID (FK)PrID (FK)Load DTSRatingRating DateRecord Source

ACPR Satellite

PeID (FK)PrID (FK)Load DTSRecord Source

PEPR Link

Figure 27: A part of the running example (section 3) modeled in Data Vault.

27

At first glance, the Data Vault approach may appear similar to the Anchor Modeling approach, however,there are a few essential differences. First, while in Anchor Modeling business keys are treated just as normalattributes, in Data Vault the business keys of entities are integral parts of the Hubs. This leads to thequestion of which natural key to select as business key, when there is more than one natural key for an entityto choose from, for example due to data with different sources. Second, Satellites are not necessarily highlynormalized. In fact, “normalization to the n th degree is discouraged” [29] (see the Actor and PerformanceSatellites). This makes the somewhat risky assumption that attributes in the same Satellite have changedsynchronously and that they will continue to do so. It also makes it impossible to eliminate attributes notqueried for, see table elimination 6.4, if they reside in the same table. Third, while Anchor Modeling is ageneral approach with wide applicability and not limited to data warehousing applications, the Data Vaultapproach only addresses the area of data warehousing.

9.2. Conceptual Modeling ApproachesAnchor Modeling has several similarities to the ORM (Object Role Modeling) approach which was

established already during the 1990s [10]. ORM is a modeling notation widely used for conceptual modelingand data base design. In addition [10] also provides a modeling methodology for designing a domaindescription in an ORM model and translating it into a logical data base design (typically normalized upto the 3NF). An anchor model can be captured in an ORM model by representing the anchors as Objectstypes, attributes as Value types, (static) ties as Predicates, historized attributes and ties as Predicates withTime point as one of the predicate’s roles, etc. See Figure 28. However, there are some essential differencesbetween Anchor Modeling and ORM. ORM does not have any explicit notation for time, which AnchorModeling provides. Furthermore, the methodology provided with ORM [10] for constructing database modelssuggests that these are realized in 3NF, which is typical for relational database design.

Performace(id)

Program(id)

Actor(id)

Revenue

Date

Audience

plays

has

has

has

ProfessionalLevel(code)

Gender(code)

Date

... ... ...

has

ProgamNamehas

... ... ... ...

... ... ...

ActorName

Date

Rating(code)

play

Figure 28: A part of the running example (section 3) modeled in ORM.

Anchor Modeling is also similar to ER (Entity Relationship) modeling [5] and UML (Unified ModelingLanguage) [3], see Figure 29. Three constructs have correspondences in ER schemas: anchors correspondto entities, attributes correspond to attributes (anchors and attributes together hence correspond to aclass in UML), and a tie maps to a relationship or an association. While the knot-construct has noimmediate correspondence to any construct in an ER schema, it is similar to so called power-types [8], i.e.categories of, often intangible, concepts. Power types, and knots, are used to encode properties that areshared by many instances of other, often tangible, concepts. Anchor models offer no general mechanism forgeneralization/specialization as in EER (Enhanced Entity Relationship) models [7], instead anchor models

28

provide three predefined varieties of attributes and ties in order to represent either temporal properties orrelationships to categories.

IdActor NameGender {..}

Actor

IdName

Program

PEAC

DateCode {,,,}

Rating

IdDateRevenueAudience

Performance

0..*

1for

1

*

in

1

0..*

performed by

0..*

1in

10..*pepr

PLV {..}Date

ActorsProfessional

Level0..*

1

for

Figure 29: A part of the running example (section 3) modeled in UML.

9.3. Less Normalized DatabasesA key feature of anchor databases, is that they are highly normalized. This stems mainly from the fact

that every distinct fact (attribute) in an anchor model is translated into a relational table of its own, in theform of anchor-key, attribute-value, and optional historical information. In contrast, in an ordinary 3NFschema several attributes are included in the same table.

A table is in sixth normal form if it satisfies no non-trivial join dependencies, i.e. a 6NF table cannot bedecomposed further into relational schemes with fewer attributes [6]. All anchors, knots and attributes willgive rise to 6NF tables; the only constructs in an anchor model that could give rise to non-6NF tables areties. For an analysis of anchor models and normal forms refer to [19], which is based on the definition of 6NFaccording to [6].

9.4. Temporal DatabasesA temporal database is a database with built-in time aspects, e.g. a temporal data model and a temporal

version of a structured query language [25]. Database modeling approaches, such as the original ER model,do not specific language elements that explicitly support temporal concepts. Extensions [9] to ER schemasencompass temporal constructs such as valid time, the time (interval) in which a fact is true in the realworld, and transaction time, the time in which a fact is stored in a database [1, 6, 13]. Anchor modelsprovide syntax elements for representing the former, i.e. valid time, for both attributes (historized attributes)and ties (historized ties). In addition, if metadata is used transaction time can also be represented, a moredetailed analysis of time concepts in anchor models are discussed in the next section. Anchor Modeling doesnot provide a query language with operators dedicated to querying the temporal elements of the model,however, it does provide views and functions for simplifying and optimizing temporal queries.

9.4.1. Time in anchor modelsIn an anchor model there are three concepts of time, each of which is stored differently in the database. To

avoid confusion with common terminology, we will name them changing time, recording time, and happeningtime. The names try to capture what the times represent: ‘when a value is changed’, ‘when information wasrecorded’, and ‘when an event happened’.

29

9.4.2. Changing TimeThe changing time for a historized attribute or tie is the interval during which its value or relation is

valid in the domain of discourse being modeled, i.e. it corresponds to the concept of valid time [1, 6, 13] asdiscussed in the previous paragraph. In an anchor model, this interval is defined using a single time point.That time point is used as an explicit starting time for the interval in which an instance can be said to have acertain value or relationship. When a new instance is added, having the same identity but a later time point,it implicitly closes the interval for the previous instance. Since partial attributes are not present in an anchormodel, and hence no null values in an anchor database, we also have to explicitly invalidate instances, ratherthan removing or updating them, by modeling a knot holding the state of validity for the attribute or tie.

9.4.3. Recording TimeFor maintenance and analysis purposes, another type of time is often necessary, the recording time.

Recording time in an anchor model corresponds to the concept of transaction time [1, 6, 13] discussedabove. Loosely speaking it can be said to be the time when a certain piece of information is entered into thedomain of discourse, or “the time (period) during which a fact is stored in the database” [13]. Since thisis information about information, this is handled through the addition of metadata in an anchor database.In many scenarios a single recording time per piece of information is sufficient, corresponding to the timewhen the data was loaded into the database. However, it can in some cases be necessary to store an array ofrecording times if data has passed through a number of systems before reaching the model. In an anchordatabase, such metadata is represented through references to a metadata structure, which preferably alsoshould be anchor modeled.

9.4.4. Happening TimeThe happening time is used to represent the moment or interval at which an event took place in the

domain of discourse. In Anchor Modeling this type of time is regarded as being an attribute of the eventitself. It should therefore be modeled as one or two attributes depending on the event being momentaneous(“happened at”) or having duration (“happened between”). Happening times are attributes/properties ofthings in the domain of discourse that take on values of time types. Some examples of such things are: aperson, a coupon, and a purchase, having happening times such as: person birth date, person deceaseddate, coupon valid from, coupon valid to, purchase date, and purchase time. Being attributes, they may inaddition have both changing times and recording times. The reason for introducing “happening time” as aconcept of its own is to avoid it being confused with valid time or transaction time per se.

10. Conclusions and Further Research

Anchor Modeling is a technique for managing data warehouses that has been proven to work in practice.Several data warehouses have been built (by the consulting company Affecto since 2004) using AnchorModeling and are in daily use. These have been built for companies in the insurance, logistics and retailbusinesses, ranging in scope from departmental to enterprise wide. The one having the largest data volume iscurrently close to one terabyte in size, with a few hundred million rows in the largest tables. An interest forAnchor Modeling has also been shown from the health care domain, in which its ability to handle sparse andhistorized data is greatly appreciated. Furthermore, some software vendors are evaluating the use of an anchordatabase as a persistence layer, where its flexibility simplifies development and its non-destructive extensibilitymechanisms alleviates the need to manage legacy data when a system is upgraded. Implementations ofanchor models in other domains and other representations, as well as the deployment of Anchor Modeling forvery large databases, provides directions for further research.

Anchor Modeling is built on a small set of intuitive concepts complemented with a number of guidelinesfor building anchor models, which supports agile development of data warehouses. A key feature of AnchorModeling is that changes only require extensions, not modifications, to an anchor model. This feature isthe basis for a number of benefits provided by anchor models, including ease of temporal querying and highrun-time performance.

30

pajo

Överstruket

Anchor Modeling differs from main stream approaches in data warehousing that typically emphasizede-normalization, which is considered essential for fast retrieval. Anchor Modeling, on the other hand, resultsin highly normalized data models, often even in 6NF. Though highly normalized, these data models stilloffer fast retrieval. This is a consequence of table elimination, where narrow tables with few columns arescanned rather than wide tables with many columns. Performance tests have been carried out indicating thatanchor databases outperform less normalized databases in most data warehouse environments. Comparativeperformance tests as well as physical implementations of views and functions on other DBMS suggest adirection for future work.

Another line of research concerns the actual implementation of the anchor model. Most commercialDatabase Management Systems (DBMS) are mainly row-oriented, i.e. every attribute of one row is stored ina given sequence, followed by the next row and its attributes until the last row of the table. Since anchordatabases to a large degree consist of binary tables, column stores, i.e. column oriented DBMSs [22] that storetheir content by column rather than by row might offer a better solution. Moreover, for OLAP-workloads,which often involve a smaller number of queries involving aggregated columns over all data, column storescan be expected to be especially well suited.

Anchor Modeling dispenses with the requirement that an entire domain or enterprise has to be modeledin a single step. The possibility of an all-encompassing model is not a realistic option. At some point in time,a change will occur that could not have been foreseen. Anchor Modeling is built upon the assumption thatperfect predictions never can be made. A model is not built to last, it is built to change.

11. References

[1] A. Artale and E. Franconi, Reasoning with Enhanced Temporal Entity-Relationship Models, Proc. of the 10th Intl.Workshop on Database and Expert Systems Applications, 1999.

[2] B. Bebel, J. Eder, C. Koncilia, T. Morzy, and R. Wrembel, Creation and Management of Versions in MultiversionData Warehouses, ACM Symposium on Applied Computing, 2004.

[3] G. Booch, J. Rumbaugh, and J. Jacobson, The Unified Modelling Language User Guide, Addison Wesley, 1999.[4] A. Carver and T. Halpin, Atomicity and Normalization, Thirteenth International Workshop on Exploring Modeling

Methods in Systems Analysis and Design (EMMSAD), 2008.[5] P. Chen, The Entity Relationship Model - Toward a Unified View of Data, ACM Transactions on Database Systems, pp.

9–36, Vol. 1, Issue 1, March 1976.[6] C. E. Date, H. Darwen, and N. A. Lorentzos, Temporal Data and the Relational Model, Elsevier Science, 2003.[7] R. Elmasri and S. B. Navathe, Fundamentals of Database Systems, 5th Ed., Addison Wesley, 2006.[8] M. Fowler, Analysis Patterns: Reusable Object Models, Addison-Wesley, 1997.[9] H. Gregersen and J.S. Jensen, Temporal Entity-Relationship models a survey, IEEE Transactions on Knowledge and

Data Engineering, vol. 11, pp. 464–497, 1999.[10] T. Halpin, Information Modeling and Relational Databases: From conceptual analysis to logical design using ORM with

ER and UML, Morgan Kaufmann Publishers, 2001.[11] D.C. Hay, Data Model Patterns: Conventions of Thought, Dorset House Publishing, 1996.[12] W. H. Inmon, Building the Data Warehouse, 3rd Ed., John Wiley & Sons Inc., 2002.[13] C. S. Jensen and R. T. Snodgrass, Temporal Data Management, IEEE Transactions on Knowledge and Data Engineering,

vol. 11, pp. 36–44, 1999.[14] V. V. Khodorovskii, On Normalization of Relations in Relational Databases, Programming and Computer Software, pp.

41–52, Vol. 28, No. 1, 2002.[15] R. Kimball and M. Ross, The Data Warehouse Toolkit: The complete guide to Dimensional Modeling, 2nd Ed., Wiley

Computer Publishing, 2002.[16] X. Li, Building an Agile Data Warehouse: A Proactive Approach to Managing Changes, Proc. of the 4th IASTED Intl.

Conf., 2006.[17] M. Nicola and H. Rizvi, Storage Layout and I/O Performance in Data Warehouses, Proc. of the 5th Intl. Workshop on

Design and Management of Data Warehouses (DMDW’03), pp. 7.1–7.9, 2003.[18] G. N. Paulley, Exploiting Functional Dependence in Query Optimization, PhD thesis, Dept. of Computer Science,

University of Waterloo, Waterloo, Ontario, Canada, Sept. 2000.[19] O. Regardt, L. Ronnback, M. Bergholtz, P. Johannesson, and P. Wohed, Analysis of normal forms for anchor

models, http://www.anchormodeling.com/tiedostot/6nf.pdf, valid at May 13th 2009.[20] S. Rizzi and M. Golfarelli, What Time is it in the Data Warehouse?, LNCS, pp. 134–144, Vol. 4081, Springer

Berlin/Heidelberg, 2006.[21] J. F. Roddick, A Survey of Schema Versioning Issues for Database Systems, Information and Software Technology, Volume

37, Issue 7, pp. 383–393, 1995.[22] Stonebraker et al., C-Store: A column-oriented DBMS, Proc. of the 31st VLDB Conference, VLDB Endowment, pp.

553–564, 2005.

31

[23] D. Theodoratos and T. Sellis, Dynamic Data Warehouse Design, LNCS, pp. 802, Vol. 1676, Springer Berlin/Heidelberg,1999.

[24] H. J. Watson and T. Ariyachandra, Data Warehouse Architectures: Factors in the Selection Decision and the Success ofthe Architectures, Technical Report, Terry College of Business, University of Georgia, Athens, GA, July 2005.

[25] Wikipedia, http://en.wikipedia.org/wiki/Temporal_database, valid at April 11th 2009.[26] E. Zimanyi, Temporal Aggregates and Temporal Universal Quantification in Standard SQL, ACM SIGMOD Record, pp.

16–21, Vol. 35, Issue 2, June 2006.[27] D. Linstedt and K. Graziano and H. Hultgren, The Business of Data Vault Modeling, Daniel Linstedt, 2nd Ed.,

August 2009.[28] Data Vault Series 2 - Data Vault Components, http://www.tdan.com/view-articles/5155/, valid at Februray 4th

2010.[29] Abouth the Data Vault, http://www.danlinstedt.com/AboutDV.php, valid at Februray 4th 2010.[30] E. F. Codd, Further Normalization of the Data Base Relational Model, in Rustin [1972], pages 33-64, 1972.

32

Anchor Modeling27 Feb Paul

Documents

Transcript of Anchor Modeling27 Feb Paul