DWh Historization Libre

DWH Historization Course

  • R. Marti

    3-1 Data Warehouse Historization

    Data Warehousing

    Spring Semester 2012

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 2

    The Data Warehouse in the DWh Reference Architecture

    [Figure: reference architecture with Source Databases, Data Warehouse, Master Data, Data Marts, and front ends (Dashboards, Reports, Interactive Analysis)]

    Focus

    - Architectural options and variations in data warehouse projects

    - Design of the single integrated data warehouse, in particular

      . how to handle temporal aspects (historization)

      . how to ensure common dimensions (Master Data Management)

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 3

    Preliminaries: Notions of Time in Databases

    Valid Time (sometimes also effective time, as of time, or business time)

    is the time when a fact in the real world was, is, or will be true. (More precise wording: the time a fact was or is believed to be true or is believed to become true.)

    Note: Valid time must be entered by the user.

    Transaction Time (sometimes also system time)

    is the time when a fact in the real world was or is stored in the database

    (correctly or incorrectly).

    Note: Transaction time is automatically determined by the system

    (once the user decides to update the corresponding data, of course ... ) .

    Example of a fact stored in a DB on October 1 2010 (= transaction time):

    David Cole will be Chief Risk Officer as of March 1 2011 (= valid time).

    Note: We will mostly be looking at valid time!

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti Page 4

    (Valid) Time in Star Schema Designs (1)

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 5

    (Valid) Time in Star Schema Designs (2)

    Rows in fact tables are associated with a specific time, via the foreign key value referencing the time dimension, indicating when they were valid.

    However, rows in dimension tables are not associated with any time!

    - new rows (rows with unknown source system IDs) are simply added

    - usually, no rows are deleted from a dimension table, even if rows with known source system IDs are missing from a batch load:

      . existing (old) facts still refer to objects corresponding to these missing rows

      . if sources do not send explicit information on deletions, it is unclear whether the corresponding dimensional objects have effectively become invalid or not
        (Note: Sending this information might mean re-designing the source system!)

    - changes in values of dimension rows with known source system IDs are

      (1) either simply overwritten,

      (2) or recorded by adding a new row with a new surrogate (but the old source system ID); see the topic slowly changing dimensions

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti

    Motivating Example: Star Schema

    Analysis of yearly salaries grouped by year and by employee rank.

    Schema

    COMPENSATION (DATE_ID, EMP_ID, SALARY)
    EMPS (EMP_ID, EMP_NO, EMP_NAME, EMP_RANK, EMP_TITLE)
    DATES (DATE_ID, DATE_YEAR)

    DATE_ID, EMP_ID: warehouse-internal object identifiers (surrogates)

    EMP_NO: external source system identifier, must be stable across subsequent loads

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti

    Motivating Example: Query

    Analysis of yearly salaries grouped by year and by employee rank.

    select DATE_YEAR, EMP_RANK, EMP_TITLE,
           sum(SALARY) as SALARY
    from COMPENSATION c
         join DATES d on d.DATE_ID = c.DATE_ID
         join EMPS e on e.EMP_ID = c.EMP_ID
    group by DATE_YEAR, EMP_RANK, EMP_TITLE
    order by DATE_YEAR, EMP_RANK
    ;

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti

    Motivating Example: 2010 Data

    Load

    - Generate ID for new year

    - Generate IDs for new employees

    - Project contents of source into target tables EMPS, COMPENSATION

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 9

    Motivating Example: 2010 Compensation Report

    select DATE_YEAR, EMP_RANK, EMP_TITLE,
           sum(SALARY) as SALARY
    from COMPENSATION c
         join DATES d on d.DATE_ID = c.DATE_ID
         join EMPS e on e.EMP_ID = c.EMP_ID
    group by DATE_YEAR, EMP_RANK, EMP_TITLE
    order by DATE_YEAR, EMP_RANK
    ;

    Result

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 10

    Motivating Example: 2011 Data

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti

    Issue: 2010 + 2011 Compensation Report

    Old 2010 Result

    2010+2011 Result

    By destructively updating the rank/title of the employee with ID 2 from C to B, the 2010 report has been unintentionally altered.
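The effect can be reproduced with a small sketch. The data rows, names, and salaries are invented for illustration (the original slide tables are not part of this transcript); only the schema and the report query follow the motivating example. SQLite is used here merely as a convenient stand-in for the warehouse DBMS.

```python
import sqlite3

# Report query from the motivating example.
REPORT_SQL = """
select d.DATE_YEAR, e.EMP_RANK, e.EMP_TITLE, sum(c.SALARY) as SALARY
from COMPENSATION c
join DATES d on d.DATE_ID = c.DATE_ID
join EMPS e on e.EMP_ID = c.EMP_ID
group by d.DATE_YEAR, e.EMP_RANK, e.EMP_TITLE
order by d.DATE_YEAR, e.EMP_RANK
"""

con = sqlite3.connect(":memory:")
con.executescript("""
    create table DATES (DATE_ID integer primary key, DATE_YEAR integer);
    create table EMPS (EMP_ID integer primary key, EMP_NO text,
                       EMP_NAME text, EMP_RANK text, EMP_TITLE text);
    create table COMPENSATION (DATE_ID integer, EMP_ID integer, SALARY integer);
    -- hypothetical 2010 load
    insert into DATES values (1, 2010);
    insert into EMPS values (1, 'E1', 'Ann', 'B', 'Director'),
                            (2, 'E2', 'Bob', 'C', 'Manager');
    insert into COMPENSATION values (1, 1, 100), (1, 2, 80);
""")
report_2010_before = [r for r in con.execute(REPORT_SQL) if r[0] == 2010]

# Hypothetical 2011 load: employee 2 is promoted from C to B and the
# dimension row is overwritten in place (destructive update).
con.executescript("""
    insert into DATES values (2, 2011);
    update EMPS set EMP_RANK = 'B', EMP_TITLE = 'Director' where EMP_ID = 2;
    insert into COMPENSATION values (2, 1, 105), (2, 2, 95);
""")
report_2010_after = [r for r in con.execute(REPORT_SQL) if r[0] == 2010]

print(report_2010_before)  # 2010 salaries split over ranks B and C
print(report_2010_after)   # the same 2010 salaries now all appear under rank B
```

The 2010 facts themselves are unchanged; only the dimension row they point to was overwritten, which is enough to change the 2010 part of the report.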

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti Slide 12

    Kimball's Types of Slowly Changing Dimensions

    Ralph Kimball proposed 3 solutions regarding the historization of dimensions in the context of the star schema, called slowly changing dimensions (SCD):

    SCD Type 1: no history of the dimensional attribute is needed/kept

    - simply overwrite the value in the existing row

    - ok for e.g. the correction of mistakes in names, birthdays etc.

    SCD Type 2: versions of some dimensional attributes are needed

    - store new rows in the dimension table, with a new warehouse ID, the existing stable source system ID, and the new (changed) values

    - e.g. a change in the rank of an employee

    SCD Type 3: current and original (or previous) versions are needed

    - keep both a current and an original attribute in the dimension table

    - e.g. the current rank and the original rank of each employee

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti Slide 13

    Assessment of SCD Type 1 (see previous solution)

    Advantages

    Simple to understand for business users and simple to implement

    (especially when using MOLAP tools)

    Requires the least space and has the best response time

    Disadvantages

    Simplicity is deceiving

    A change in a dimensional attribute effectively changes the context

    for all facts captured prior to the change

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 14

    Motivating Example with SCD Type 2: 2011 Data

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti

    2010 + 2011 Compensation Report with SCD Type 2

    Old 2010 Result

    2010+2011 Result

    2010 salaries get linked to the old version of the employee, 2011 salaries get linked to the new version of the employee.
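A minimal sketch of an SCD Type 2 load in plain Python. The helper `apply_scd2`, the sample rows, and the rule "the highest surrogate ID is the latest version" are assumptions made for this illustration; Type 2 itself does not record which version is current.

```python
from itertools import count

def apply_scd2(emps, next_id, emp_no, **changed):
    """Add a new version row for emp_no and return its new surrogate EMP_ID.

    Assumption: the row with the highest EMP_ID is the latest version.
    """
    latest = max((e for e in emps if e["EMP_NO"] == emp_no),
                 key=lambda e: e["EMP_ID"])
    new_row = {**latest, **changed, "EMP_ID": next(next_id)}
    emps.append(new_row)          # the old row is kept untouched
    return new_row["EMP_ID"]

def report(facts, emps):
    """Sum salaries by (year, rank), like the compensation report."""
    rank_of = {e["EMP_ID"]: e["EMP_RANK"] for e in emps}
    totals = {}
    for year, emp_id, salary in facts:
        key = (year, rank_of[emp_id])
        totals[key] = totals.get(key, 0) + salary
    return dict(sorted(totals.items()))

# Hypothetical 2010 state: (year, EMP_ID, salary) facts and two employees.
emps = [
    {"EMP_ID": 1, "EMP_NO": "E1", "EMP_RANK": "B"},
    {"EMP_ID": 2, "EMP_NO": "E2", "EMP_RANK": "C"},
]
next_id = count(3)
facts = [(2010, 1, 100), (2010, 2, 80)]

# 2011 load: E2 changes rank, so a new dimension row is created and the
# 2011 facts reference the new surrogate.
new_id = apply_scd2(emps, next_id, "E2", EMP_RANK="B")
facts += [(2011, 1, 105), (2011, new_id, 95)]

result = report(facts, emps)
print(result)
```

Because the 2010 facts still reference surrogate 2 (rank C), the 2010 part of the report stays as it was.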

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti Slide 16

    Assessment of SCD Type 2

    Advantages

    Reasonably understandable and simple to implement

    (regardless of MOLAP / ROLAP tool)

    Captures parts of the history

    Disadvantages

    The time of a change in a dimension is not captured

    Requires more space since a single dimensional object is potentially

    represented in several rows (but this is usually not an issue)

    Can be confusing since changed dimensional data objects appear

    more than once, with identical source system IDs, but at least one

    changed attribute value

    Checking when it is ok to refer to which DWh IDs is not possible

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 17

    Motivating Example with SCD Type 3: 2011 Data

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti

    2010 + 2011 Compensation Report with SCD Type 3

    2010+2011 Result in Terms of Original Ranks

    2010+2011 Result in Terms of Current Ranks

    Both reports are incorrect (red attribute values)!

    Note: The query for the results in terms of original ranks is left as an exercise ...

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti Slide 19

    Assessment of SCD Type 3

    Advantages

    Reasonably simple to implement

    (regardless of MOLAP / ROLAP tool)

    Captures parts of the history

    Disadvantages

    Can only have 2 versions of any attribute (usually original and current)

    Each historized attribute A must be represented by 2 attributes

    (namely, A and A_Original)

    Requires more space since there are now 2 attributes instead of 1

    (but this is usually not an issue)

    Interpretation of results is confusing to most users

    Unclear when original and current versions are/were valid

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 20

    Temporal Database Systems and Languages in General

    Recap: For some types of analysis, dimensions should be historized,

    especially for comparisons of measures across different time periods.

    Example:

    How did buying habits of customers change over the last few years,

    grouped by where they live.

    History of addresses of customers should also be kept!

    Since 1980, a lot of research has been conducted in general temporal data

    models, temporal query languages, and temporal database systems.

    Generic support for temporal data is beginning to emerge in products:

    Teradata Database 13.10, IBM DB2 V10, Oracle Workspace Manager (see later)

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti

    Associating Time with Data: A Theoretical Model

    [Figure: temporal relation with the axes time, tuples, and attributes; one slice is marked as the snapshot at time t]

    Assumption: For each relation, a clock with a given temporal granularity is specified, e.g., a day, a second, or a millisecond.

    Conceptually, the extension of a temporal relation R can then be viewed as a sequence of snapshot relations

    R_t = τ_t(R)

    for every time point t of this clock.

    τ_t is called the snapshot operator (sometimes also the timeslice operator).

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 22

    Benefits and Pitfalls of Sequence of Snapshots Model

    Good for theoretical considerations, in particular

    - determining equivalence of different temporal representations

    - measuring the expressive power of temporal query languages

    Impractical as an implementation model, since it requires lots of space, especially when

    - the granularity of time is fine-grained (minutes, seconds, milliseconds, ...)

    - represented facts do not change often, i.e. stay the same over a longer period of time (usually because they describe states rather than events)

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 23

    From Sequence of Snapshots Model to Time Intervals

    Remedy:

    Don't store data that did not change since the previous clock tick

    Tuples (or even attributes) whose values are identical across different

    snapshots are associated with time intervals (also called periods)

    rather than time points

    Alternatives:

    (1) associate temporal intervals to each tuple

    (2) associate temporal intervals to each attribute value

    (but this approach requires complex attributes, violating 1NF)

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 24

    Valid Time Relations capturing State

    Conceptually, every tuple which captures a state is timestamped with a time interval [t_from, t_to] indicating the validity of the (non-temporal) data represented in the tuple.

    Remarks:

    - Transformation into 1NF by replacing V_INTERVAL by V_FROM (valid from) and V_TO (valid to)

    - The symbol ? means "unknown", "until now" or "until further notice". In standard SQL, it is usually represented by null or by the date 9999-12-31, neither of which is entirely satisfactory ...

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 25

    Side Issue: Representation of Time Intervals (Periods)

    Closed-closed time intervals [t_from, t_to] tend to be preferred by end-users:
    "A fact was true from date t_from up to and including date t_to."

    This choice also allows querying using the SQL between predicate:
    valid at time t in SQL: :t between V_FROM and V_TO

    Mathematically, closed-open time intervals [t_from, t_to), sometimes also depicted as [t_from, t_to[, are preferable (see e.g. Allen):
    "A fact was true from date t_from up to but excluding date t_to."
    valid at time t in SQL: :t >= V_FROM and :t < V_TO

    Note: Unless otherwise stated, I have used the closed-closed representation.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 26

    Typical Queries (1): Snapshot of Valid Time Relation

    Snapshots of the previous valid time relation:

    Remarks:

    We assume that ID is the primary key at every point in time (in every snapshot).

    Producing a snapshot from a valid time relation is a simple selection in rel. algebra:

    select ID, NAME, FNAME, ADDR, SAL

    from EMP

    where :t in V_INTERVAL -- actually: :t between V_FROM and V_TO
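The snapshot selection can be tried out with a small sketch. The rows for employee 676 are modeled on the running example (Baar, then Bern, then a raise), but the names and the exact interval boundaries are assumptions, as is the use of SQLite with ISO date strings (which compare correctly as text).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table EMP (ID integer, NAME text, FNAME text, ADDR text,
                      SAL integer, V_FROM text, V_TO text);
    -- closed-closed intervals; 9999-12-31 stands for "until further notice"
    insert into EMP values
      (676, 'Doe', 'Sue', 'Baar', 7000, '2006-08-01', '2008-02-29'),
      (676, 'Doe', 'Sue', 'Bern', 7000, '2008-03-01', '2009-12-31'),
      (676, 'Doe', 'Sue', 'Bern', 7500, '2010-01-01', '9999-12-31');
""")

def snapshot(con, t):
    """Return the snapshot of EMP at time t (an ISO date string)."""
    sql = """
        select ID, NAME, FNAME, ADDR, SAL
        from EMP
        where :t between V_FROM and V_TO
        order by ID
    """
    return con.execute(sql, {"t": t}).fetchall()

print(snapshot(con, "2009-06-15"))  # one row: lives in Bern, earns 7000
print(snapshot(con, "2006-01-01"))  # no row: not yet valid
```

With non-overlapping intervals per ID, each snapshot contains at most one row per employee, which matches the assumption that ID is the primary key in every snapshot.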

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 27

    Valid Time Relations capturing Recurring States

    A specific state of affairs can recur several times (→ several time periods)

    transformation to 1NF

    The first two tuples are called value equivalent since they have the same

    values in all attributes except the temporal attributes V_FROM and V_TO.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 28

    Options in the Representation of Time

    Canonical representation using maximal time intervals (as on previous slide):

    One (of many) possible alternative representations using two (non-maximal)

    contiguous intervals (assuming a temporal granularity of a day):

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 29

    Issues with Non-canonical Representations

    Non-canonical representations may lead to incorrect answers (for unsuspecting

    users).

    Example Query: Who left the company before 2008-01-01 and when?

    select ID, NAME, FNAME, V_TO

    from EMP

    where V_TO < date '2008-01-01'

    (Incorrect) Result:

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 30

    Constraint to Avoid Non-canonical Representations

    Ensure that intervals remain maximal when inserting or updating:

    Let

    - R be a valid time relation in canonical form (i.e., with maximal time intervals)

    - n be a new valid time tuple to be inserted into the relation R

    - x1, ... , xk (k ≥ 0) be all existing valid time tuples in relation R which are value equivalent to n (cf. p. 12)

    Then, for all i, 0 < i ≤ k, the following must hold (in pseudo-SQL notation):

    not exists (
      select *
      from R xi
      where xi = n                                          -- value equivalence
        and (n.V_FROM - 1 between xi.V_FROM and xi.V_TO     -- intervals do not touch or overlap
          or n.V_TO + 1 between xi.V_FROM and xi.V_TO)
    )

    (This could be specified as a declarative check constraint if your DBMS implementation supports it.)
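The same check can be sketched in Python for closed-closed intervals with a granularity of one day. The function names and sample tuples are hypothetical. Note that this sketch uses a symmetric test, so it also flags the case where the new interval completely contains an existing one, which is slightly stricter than the pseudo-SQL condition above.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def touches_or_overlaps(n_from, n_to, x_from, x_to):
    """True if the closed-closed day intervals are adjacent or overlap.

    Only start dates are decremented, so an open end of 9999-12-31
    never overflows.
    """
    return n_from - ONE_DAY <= x_to and x_from - ONE_DAY <= n_to

def violates_maximality(new, existing):
    """new and existing rows are (values, v_from, v_to) triples.

    A violation means the new tuple should be merged with a
    value-equivalent existing tuple instead of being inserted as is.
    """
    n_values, n_from, n_to = new
    return any(values == n_values and touches_or_overlaps(n_from, n_to, f, t)
               for values, f, t in existing)

existing = [((676, "Baar"), date(2006, 8, 1), date(2008, 2, 29))]

adjacent = ((676, "Baar"), date(2008, 3, 1), date(2008, 12, 31))
with_gap = ((676, "Baar"), date(2008, 3, 2), date(2008, 12, 31))
other_value = ((676, "Bern"), date(2008, 3, 1), date(9999, 12, 31))

print(violates_maximality(adjacent, existing))     # adjacent, same values
print(violates_maximality(with_gap, existing))     # one-day gap
print(violates_maximality(other_value, existing))  # not value equivalent
```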

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 31

    Typical Queries (2): Temporal Projection

    Unfortunately, (intermediate) query results may turn out to be non-canonical,

    even if applied to a canonical representation:

    Example: Where did employees live and when (irrespective of salary)?

    select ID, NAME, FNAME, ADDR, V_FROM, V_TO from EMP

    Result:

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 32

    Coalescing to Avoid Non-canonical Representations

    Non-canonical representations can be transformed into the canonical

    representation by an operator called temporal coalescing (TCOALESCE below)

    which maximizes the length of all intervals by coalescing adjacent and

    overlapping intervals of value-equivalent tuples.

    Coalesced form:

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 33

    Temporal Coalescing in (Pseudo-) SQL

    with recursive Rclos as (
      -- initial ("anchor") query
      select R.values, R.V_FROM, R.V_TO from R
      union
      -- recursive query: executed until no new data generated
      select R.values, R.V_FROM, Rclos.V_TO
      from R, Rclos
      where Rclos.values = R.values -- values of Rclos and R are equivalent
        and Rclos.V_FROM >= R.V_FROM
        and Rclos.V_FROM-1 Rclos.V_TO )
    )

    Note: A more efficient implementation uses window functions (see [Zhou et al 2006]).
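As an alternative illustration, temporal coalescing can be sketched as a sort-and-merge pass in Python. This is neither the recursive query from the slide nor the window-function approach of the cited paper; it assumes closed-closed intervals with day granularity, and the sample rows are invented.

```python
from datetime import date, timedelta
from itertools import groupby

ONE_DAY = timedelta(days=1)

def tcoalesce(rows):
    """Merge adjacent or overlapping intervals of value-equivalent tuples.

    rows: iterable of (values, v_from, v_to) with closed-closed intervals;
    values must be sortable (e.g. a tuple of attribute values).
    """
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for values, group in groupby(ordered, key=lambda r: r[0]):
        cur_from = cur_to = None
        for _, v_from, v_to in group:
            if cur_from is not None and v_from - ONE_DAY <= cur_to:
                # touches or overlaps the interval being built: extend it
                cur_to = max(cur_to, v_to)
            else:
                if cur_from is not None:
                    result.append((values, cur_from, cur_to))
                cur_from, cur_to = v_from, v_to
        result.append((values, cur_from, cur_to))
    return result

# Projection of a hypothetical EMP history onto (ID, ADDR): the two
# Bern rows differ only in the salary that was projected away.
projected = [
    ((676, "Baar"), date(2006, 8, 1), date(2008, 2, 29)),
    ((676, "Bern"), date(2008, 3, 1), date(2009, 12, 31)),
    ((676, "Bern"), date(2010, 1, 1), date(9999, 12, 31)),
]
coalesced = tcoalesce(projected)
for row in coalesced:
    print(row)
```

After coalescing, the two Bern rows become a single row valid from 2008-03-01 until further notice.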

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 34

    Typical Queries (3): Temporal Join

    Sometimes, the history of information stored in two relations is of interest:

    Example: Who worked on which projects and when?

    Result:

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 35

    Temporal Join in SQL (without temporal coalescing!)

    Construct time intervals of result by intersecting time intervals of operands

    (and keeping rows with non-empty intervals):

    select * from (
      select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME,
             case when e.V_FROM > w.V_FROM then e.V_FROM else w.V_FROM end as V_FROM,
             case when e.V_TO < w.V_TO then e.V_TO else w.V_TO end as V_TO
      from WORKS_ON w, EMP e
      where e.ID = w.EMP_ID
    ) where V_FROM <= V_TO
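A runnable sketch of this temporal join, using SQLite and invented sample rows. It assumes closed-closed intervals, so a non-empty result interval is one with V_FROM <= V_TO; dates are stored as ISO strings so that text comparison matches date order.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table EMP (ID integer, NAME text, FNAME text,
                      V_FROM text, V_TO text);
    create table WORKS_ON (PROJ_ID integer, EMP_ID integer,
                           V_FROM text, V_TO text);
    -- hypothetical data: one employee version, two project assignments
    insert into EMP values (676, 'Doe', 'Sue', '2006-08-01', '2009-12-31');
    insert into WORKS_ON values
      (1, 676, '2006-01-01', '2007-06-30'),
      (2, 676, '2010-02-01', '2010-12-31');
""")

TEMPORAL_JOIN = """
select * from (
  select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME,
         -- intersection of both intervals: later start, earlier end
         case when e.V_FROM > w.V_FROM then e.V_FROM else w.V_FROM end as V_FROM,
         case when e.V_TO < w.V_TO then e.V_TO else w.V_TO end as V_TO
  from WORKS_ON w join EMP e on e.ID = w.EMP_ID
) where V_FROM <= V_TO      -- keep only non-empty intersections
order by PROJ_ID
"""

joined = con.execute(TEMPORAL_JOIN).fetchall()
print(joined)
```

The assignment to project 1 is clipped to the employee's validity, while the assignment to project 2 does not intersect it and is dropped.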

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 36

    Transaction Time Relations

    Note that transaction time should be automatically determined by the

    system at insert/update/delete time (or, more precisely, commit time),

    not by the user; granularity is typically as fine as possible

    Transaction time can be represented exactly like valid time,

    by associating a time interval with tuples.

    Example: Transaction time history of employee 676 (also see slide 10)

    1. 2006-07-01: insert 676 lives in Baar and earns 7000.

    2. 2008-04-01: update 676 lives in Bern.

    3. 2009-11-01: update 676 earns 7500.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 37

    Using DBMS Logging to capture Transaction Time

    Since transaction time can be automatically determined by the system,

    the DBMS logging facilities can be used.

    This is/was done e.g. in Postgres/PostgreSQL/Illustra (and in Oracle).

    Example: Transaction time history of employee 676 (see slide 15)

    1. 2006-07-01: insert 676 lives in Baar and earns 7000.

    2. 2008-04-01: update 676 lives in Bern.

    3. 2009-11-01: update 676 earns 7500.

    Normal (snapshot) table containing current contents.

    Undo log table containing changes to produce previous contents of the associated snapshot table (before images).

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 38

    Implementing Logging Using Triggers

    create or replace trigger TR_AU_EMP
    after update on EMP
    for each row
    declare
      l_log EMP_UNDO_LOG%rowtype;
    begin
      -- store the before image of the updated row
      l_log.X_TIME       := current_timestamp;
      l_log.UNDO_OP_CODE := 'update';
      l_log.ID           := :old.ID;
      l_log.NAME         := :old.NAME;
      l_log.FNAME        := :old.FNAME;
      l_log.ADDR         := :old.ADDR;
      l_log.SAL          := :old.SAL;
      insert into EMP_UNDO_LOG values l_log;
    end TR_AU_EMP;
    /

    Remarks:

    - written in Oracle PL/SQL

    - similar triggers are required for inserts and deletes

    - should probably check that ID has not changed and raise an application error if this were the case
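The same idea can be tried without an Oracle installation. The sketch below ports the update trigger to SQLite syntax (an assumption made for illustration; the slide itself uses PL/SQL) and replays the two updates of the running example with invented names.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table EMP (ID integer primary key, NAME text, FNAME text,
                      ADDR text, SAL integer);
    create table EMP_UNDO_LOG (X_TIME text, UNDO_OP_CODE text, ID integer,
                               NAME text, FNAME text, ADDR text, SAL integer);

    -- after each update, store the before image of the row
    create trigger TR_AU_EMP after update on EMP for each row
    begin
      insert into EMP_UNDO_LOG
      values (strftime('%Y-%m-%d %H:%M:%f', 'now'), 'update',
              old.ID, old.NAME, old.FNAME, old.ADDR, old.SAL);
    end;

    insert into EMP values (676, 'Doe', 'Sue', 'Baar', 7000);
    update EMP set ADDR = 'Bern' where ID = 676;
    update EMP set SAL = 7500 where ID = 676;
""")

current = con.execute("select ADDR, SAL from EMP").fetchall()
undo_log = con.execute(
    "select UNDO_OP_CODE, ADDR, SAL from EMP_UNDO_LOG order by rowid"
).fetchall()
print(current)   # current contents of the snapshot table
print(undo_log)  # before images, oldest first
```

Here the transaction time is taken from the system clock at trigger execution, so it is set by the system rather than by the user.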

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 39

    Bitemporal Relations

    Valid time and transaction time can be combined to allow for a complete

    history of what information was/is believed to be true and when this was

    stored in the database.
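A minimal sketch of a bitemporal lookup, replaying the four recorded changes of the running example for employee 676. It uses a simplified model that is an assumption of this sketch, not the table layout of the slides: an append-only log of assertions (recorded on, attribute, value, valid from), where each assertion holds until the next assertion for the same attribute becomes valid, and a later recording with the same valid-from date acts as a correction.

```python
from datetime import date

# (recorded_on = transaction time, attribute, value, valid_from = valid time)
LOG = [
    (date(2006, 7, 1), "ADDR", "Baar", date(2006, 8, 1)),
    (date(2006, 7, 1), "SAL", 7000, date(2006, 8, 1)),
    (date(2008, 4, 1), "ADDR", "Bern", date(2008, 3, 1)),
    (date(2009, 11, 1), "SAL", 7500, date(2010, 1, 1)),
    (date(2009, 11, 11), "SAL", 7700, date(2010, 1, 1)),  # correction
]

def believed(log, attribute, valid_at, known_at):
    """Value of attribute at valid time valid_at, as known at known_at.

    Only assertions recorded on or before known_at are considered; among
    those already valid at valid_at, the latest valid_from wins and ties
    are broken by the latest recording.
    """
    candidates = [
        (valid_from, recorded_on, value)
        for recorded_on, attr, value, valid_from in log
        if attr == attribute and recorded_on <= known_at and valid_from <= valid_at
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c[0], c[1]))[2]

# Salary for mid-2010 as the database saw it before and after the correction.
print(believed(LOG, "SAL", date(2010, 6, 1), date(2009, 11, 5)))
print(believed(LOG, "SAL", date(2010, 6, 1), date(2009, 12, 1)))
# Address on 2008-03-15: the move to Bern was only recorded on 2008-04-01.
print(believed(LOG, "ADDR", date(2008, 3, 15), date(2008, 3, 20)))
print(believed(LOG, "ADDR", date(2008, 3, 15), date(2008, 4, 2)))
```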

    Example: Complete (bitemporal) history of employee 676

    1. 2006-07-01: insert 676 lives in Baar and earns 7000 as of 2006-08-01.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 40

    Bitemporal Relations (2)

    Example (continued): Complete (bitemporal) history of employee 676

    2. 2008-04-01: update 676 lives in Bern as of 2008-03-01.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 41

    Bitemporal Relations (3)

    Example (continued): Complete (bitemporal) history of employee 676

    3. 2009-11-01: update 676 earns 7500 as of 2010-01-01.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 42

    Bitemporal Relations (4)

    Example (continued): Complete (bitemporal) history of employee 676

    4. 2009-11-11: update correction: 676 earns 7700 as of 2010-01-01.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 43

    Design of Temporal Databases

    Basic idea

    Do non-temporal database design

    Annotate which tables / attributes need to be historized (especially valid time)

    and how (state-based vs. event-based)

    Generate temporal data structures ... but how?

    Questions:

    Entity integrity (implemented by primary keys)

    temporal entity integrity

    Referential integrity (implemented by foreign keys)

    temporal referential integrity

    Arbiter: sequence of snapshots model

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 44

    Temporal Entity Integrity (1)

    Temporal entity integrity = for every snapshot, entity integrity should hold.

    Pro memoria:

    - primary keys should consist of a minimal number of attributes which uniquely identify each tuple

    - these attributes should ideally not change over time

    Alternatives for the primary key of a valid time relation (e.g. for table EMP)

    (1) ID, V_FROM

    (2) ID, V_TO

    (3) ID, V_FROM, V_TO (non-minimal primary key!)

    (4) ID, SEQ_NO (where SEQ_NO is a sequence number or counter)

    Since all attributes except ID (and SEQ_NO) can change over the lifetime of

    the identified tuple

    - alternative (4) is probably the best,

    - followed by alternative (1), as V_FROM only changes in case of an error (and should not be referenced by foreign keys, as we'll see)

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 45

    Temporal Entity Integrity (2)

    In addition, it might be desirable to enforce other constraints, including

    Time intervals must not be empty

    Time intervals should be maximal (unless e.g. queries like "what was the case before or after a specific point in time" are not of importance)

    create table EMP (

    ID integer not null,

    SEQ_NO integer not null,

    NAME varchar(20) not null,

    ...

    V_FROM date not null,

    V_TO date default date '9999-12-31',

    primary key (ID, SEQ_NO),

    check ( V_FROM

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 46

    Referential Integrity between Snapshot Relations

    The foreign key (FK) attribute value(s) in the referencing relation must exist as

    primary key (PK) values in the referenced relation:

    Example: Works_On[Emp_Id] ⊆ Emp[Id]

    Note: In relational theory, this is sometimes also called an inclusion dependency.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 47

    Temporal Referential Integrity (1)

    Temporal referential integrity = for every snapshot, referential integrity must hold.

    Problem:

    - primary keys now have a temporal part (on top of the non-temporal part)

    - valid time periods in the foreign key (referencing) relation are not

    necessarily the same as those of the primary key (referenced) relation

    At every point in time when the FK value was valid, the referenced PK value must be valid.

    ∀t ( τ_t(Works_On[Emp_Id]) ⊆ τ_t(Emp[Id]) ), where τ_t denotes the snapshot at time t

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 48

    Temporal Referential Integrity (2)

    ∀t ( τ_t(Works_On[Emp_Id]) ⊆ τ_t(Emp[Id]) ), with τ_t denoting the snapshot at time t, holds for employee 676 because projection followed by temporal coalescing would result in:

    Of course, performing temporal coalescing for

    - adding tuples to and/or extending time intervals of the referencing relation

    - deleting tuples from and/or shrinking time intervals in the referenced relation

    would be an expensive proposition

    Recommendation: Track complete lifetimes of objects in a separate relation
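With complete lifetimes tracked per object, the temporal part of the check reduces to interval containment. The sketch below assumes closed-closed intervals; the function name, the dictionary standing in for the lifetime relation, and the sample rows are all hypothetical.

```python
from datetime import date

def fk_violations(works_on, emp_lifetimes):
    """Return referencing rows whose interval is not covered by the
    lifetime of the referenced object.

    works_on: list of (emp_id, proj_id, v_from, v_to)
    emp_lifetimes: dict emp_id -> (v_from, v_to), one maximal interval
                   per object
    """
    bad = []
    for emp_id, proj_id, v_from, v_to in works_on:
        life = emp_lifetimes.get(emp_id)
        if life is None or v_from < life[0] or v_to > life[1]:
            bad.append((emp_id, proj_id, v_from, v_to))
    return bad

emp_lifetimes = {676: (date(2006, 8, 1), date(9999, 12, 31))}
works_on = [
    (676, 1, date(2007, 1, 1), date(2007, 12, 31)),  # inside the lifetime
    (676, 2, date(2006, 1, 1), date(2006, 12, 31)),  # starts before the lifetime
    (999, 1, date(2008, 1, 1), date(2008, 12, 31)),  # unknown employee
]

violations = fk_violations(works_on, emp_lifetimes)
print(violations)
```

No coalescing is needed at check time, because each object has exactly one lifetime interval to compare against.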

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 49

    Temporal Referential Integrity (3)

    Split valid time relation on referenced (PK) side into

    (1) an object relation (suffix _OBJ) and (2) a property relation (suffix _PROP)

    Add a referential integrity constraint from property relation to object relation.

    Re-route non-temporal referential integrity constraints from other relations

    to the object relation.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 50

    Temporal Referential Integrity (4)

    In referencing relations, it might be desirable to enforce referential integrity

    non-temporal part: as usual

    temporal part: time interval contained in time interval of referenced object

    create table WORKS_ON (

    EMP_ID integer not null,

    PROJ_ID integer not null,

    SEQ_NO integer not null,

    V_FROM date not null,

    V_TO date default date '9999-12-31',

    primary key (EMP_ID, PROJ_ID, SEQ_NO),

    check ( V_FROM

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 51

    Temporal Normalization (1): Time-invariant Attributes

    Assume that attribute FName cannot change over the lifetime of an Emp

    (except to correct mistakes).

    In other words, the functional dependency (FD) Id → FName holds

    - relation Emp_Prop below is not in 2NF (attribute depends on part of PK)

    - relation Emp_Prop exhibits update anomalies when having to fix a mistake in Sue's first name (e.g. change to Susan)

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 52

    Temporal Normalization (2): Time-invariant Attributes

    Recommendation:

    Consider moving time-invariant attributes (e.g. FName) from the property

    relation (e.g. Emp_Prop) to the object relation (e.g. Emp_Obj).

    In Emp_Obj, the FD Id → FName still holds (and is enforced by the PK), so the relation does not exhibit update anomalies.

    In Emp_Prop, all attributes are now fully dependent on the PK but there is still an issue ...

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 53

    Temporal Normalization (3): Asynchronous Changes

    Example: After having inserted the salary raise of employee 676 as of the beginning of 2010, we learn that she actually moved to Aarau as of Dec 1, 2009.

    update anomaly: several tuples need to be changed (in addition to insert)

    Recommendation:

    Attributes whose values change independently of other attributes should be put

    into different relations

    (somewhat like achieving 4NF in the face of multi-valued dependencies).

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 54

    Temporal Normalization (4): Asynchronous Changes

    Example: Since address and salary of an employee may change independently (and asynchronously), these attributes should be put into different relations.

    no update anomaly: only one tuple needs change (in addition to insert)

    Employee salaries remain untouched:

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti

    Summary of Design Recommendations

    For kernel entity types (with objects whose existence is independent of other entities), consider the introduction of an object relation to capture the lifetime of these objects. Main benefits:

    - referential integrity checking over time

    - home for time-invariant attributes

    For relations representing object properties (or relationships between objects)

    and their history, consider choosing a temporal primary key consisting of the

    non-temporal primary key attributes plus a (meaningless) sequence number.

    For relations representing object properties (or relationships between objects),

    consider decomposing them into groups of attributes which

    - are either time-invariant

    this attribute group is moved to the object relation

    - or change independently of one another (i.e., potentially at different times)

    each such attribute group is moved into a separate relation keeping

    track of the history of the values

    Remember: Following them is no free lunch!

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 56

    Proposals for Temporal Support in SQL

    There are proposals to hide all this temporal complexity in SQL,

    e.g., the SQL/Temporal part of a future SQL3 standard.

    Originally, a temporal join (including temporal coalescing) was supposed to be

    specified as follows:

    validtime
    select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME
    from WORKS_ON w, EMP e
    where e.ID = w.EMP_ID

    see Richard T. Snodgrass: Developing Time-Oriented Database Applications. Morgan Kaufmann, 1999.

    Note: This publication is out of print, but available electronically as PDF at
    http://www.cs.arizona.edu/people/rts/publications.html

    Apparently, DB2 10 for z/OS (see following slides for some examples) and

    Teradata Database V13.10 support most of the SQL/Temporal proposal.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 57

    Example: Temporal Support in IBM DB2 10 (1)

    Non-temporal table POLICY capturing information about insurance policies for

    cars (vehicles):

    ID: unchanging IDentifier

    VIN: Vehicle Identification Number

    rental_car: is the car a rental car (legal values: Y and N)

    annual_mileage: approximate distance in miles per year

    coverage_amt: maximum amount paid by insurance company,

    presumably in US Dollars (Are there any other currencies on this planet? :-)

    Fig. 1: Sample POLICY table (without temporal support)

    ID    VIN    annual_mileage  rental_car  coverage_amt
    1111  A1111  10000           Y           500000

    Lets explore how DB2s temporal support can help you ma

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 58

    Example: Temporal Support in IBM DB2 10 (2)

    Declaring tables to capture system time (= transaction time) + history of changes

    -- Step 1: Create a table with a SYSTEM_TIME period.
    CREATE TABLE policy (
      id          INT PRIMARY KEY NOT NULL,
      ...
      sys_start   TIMESTAMP(12) GENERATED ALWAYS AS ROW BEGIN NOT NULL,
      sys_end     TIMESTAMP(12) GENERATED ALWAYS AS ROW END NOT NULL,
      trans_start TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID IMPLICITLY HIDDEN,
      PERIOD SYSTEM_TIME (sys_start, sys_end)
    );

    -- Step 2: Create an associated history table.
    CREATE TABLE policy_history LIKE policy;

    -- Step 3: Enable versioning.
    ALTER TABLE policy ADD VERSIONING USE HISTORY TABLE policy_history;

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 59

    Example: Temporal Support in IBM DB2 10 (3)

    Result of previous create table statements:

    Fig. 2: Sample tables for our system time scenario

    POLICY table (contains current data)
    ID  VIN  annual_mileage  rental_car  coverage_amt  sys_start  sys_end  trans_start

    POLICY_HISTORY table (contains historical data)
    ID  VIN  annual_mileage  rental_car  coverage_amt  sys_start  sys_end  trans_start

    You can also use the ALTER TABLE statement to modify existing tables to track system

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 60

    Example: Temporal Support in IBM DB2 10 (4)

    Insertions do not affect the history table:

    INSERT INTO policy(id,vin,annual_mileage,rental_car,coverage_amt)

    VALUES (1111, 'A1111', 10000, 'Y', 500000);

    INSERT INTO policy(id,vin,annual_mileage,rental_car,coverage_amt)

    VALUES (1414, 'B7777', 14000, 'N', 750000);

    -- both statements executed on November 15, 2010

    Fig. 3: Current and history table contents after INSERTs on Nov. 15, 2010

    POLICY
    ID    VIN    annual_mileage  rental_car  coverage_amt  sys_start   sys_end
    1111  A1111  10000           Y           500000        2010-11-15  9999-12-31
    1414  B7777  14000           N           750000        2010-11-15  9999-12-31

    POLICY_HISTORY (empty)
    ID    VIN    annual_mileage  rental_car  coverage_amt  sys_start   sys_end

    The sys_start values in the POLICY table reflect when the rows were inserted.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 61

    Example: Temporal Support in IBM DB2 10 (5)

    Updates do affect the history table (as do deletions ... see later)

    UPDATE policy

    SET coverage_amt = 750000

    WHERE id = 1111;

    -- statement executed on January 31, 2011

    POLICY

    ID VIN annual_mileage rental_car coverage_amt sys_start sys_end

    1111 A1111 10000 Y 750000 2011-01-31 9999-12-31

    1414 B7777 14000 N 750000 2010-11-15 9999-12-31

    POLICY_HISTORY

    ID VIN annual_mileage rental_car coverage_amt sys_start sys_end

    1111 A1111 10000 Y 500000 2010-11-15 2011-01-31

    As you might expect, any subsequent updates to policies are handled in a similar manner.

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 62

    Example: Temporal Support in IBM DB2 10 (6)

    Another update, 1 year later ...

    UPDATE policy

    SET annual_mileage = 5000, rental_car='N', coverage_amt = 250000

    WHERE id = 1111;

    -- statement executed on January 31, 2012

    POLICY

    ID VIN annual_mileage rental_car coverage_amt sys_start sys_end

    1111 A1111 5000 N 250000 2012-01-31 9999-12-31

    1414 B7777 14000 N 750000 2010-11-15 9999-12-31

    POLICY_HISTORY

    ID VIN annual_mileage rental_car coverage_amt sys_start sys_end

    1111 A1111 10000 Y 500000 2010-11-15 2011-01-31

    1111 A1111 10000 Y 750000 2011-01-31 2012-01-31

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 63

    Example: Temporal Support in IBM DB2 10 (7)

    And a deletion ...

    DELETE FROM policy

    WHERE id = 1414;

    -- statement executed on March 31, 2012

    POLICY

    ID VIN annual_mileage rental_car coverage_amt sys_start sys_end

    1111 A1111 5000 N 250000 2012-01-31 9999-12-31

    POLICY_HISTORY

    ID VIN annual_mileage rental_car coverage_amt sys_start sys_end

    1111 A1111 10000 Y 500000 2010-11-15 2011-01-31

    1111 A1111 10000 Y 750000 2011-01-31 2012-01-31

    1414 B7777 14000 N 750000 2010-11-15 2012-03-31

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 64

    Example: Temporal Support in IBM DB2 10 (8)

    Retrieving current data (from the current table shown on the previous slide):

    SELECT coverage_amt

    FROM policy

    WHERE id = 1111;

    -- returns 250000

    Retrieving historical data (from the current/historical tables shown on the previous slide):

    SELECT coverage_amt

    FROM policy FOR SYSTEM_TIME AS OF TIMESTAMP('2010-12-01')

    WHERE id = 1111;

    -- returns 500000
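The bookkeeping behind the last few slides can be imitated in a few lines of Python. This is only a toy model under stated assumptions, not how DB2 implements versioning: the class name `VersionedTable` and its methods are hypothetical, dates stand in for the TIMESTAMP(12) values, and periods are treated as inclusive-exclusive. It replays the INSERT, UPDATE and DELETE statements from the slides and then answers the AS OF query.

```python
from datetime import date


class VersionedTable:
    """Toy model of a system-period temporal table plus its history table."""

    def __init__(self):
        self.current = {}   # id -> (values, sys_start)
        self.history = []   # (id, values, sys_start, sys_end)

    def insert(self, key, values, today):
        # Insertions touch only the current table.
        self.current[key] = (dict(values), today)

    def update(self, key, changes, today):
        # The old version moves to history, closed at the update time.
        old_values, old_start = self.current[key]
        self.history.append((key, old_values, old_start, today))
        self.current[key] = ({**old_values, **changes}, today)

    def delete(self, key, today):
        # The deleted row survives in history with a closed period.
        old_values, old_start = self.current.pop(key)
        self.history.append((key, old_values, old_start, today))

    def as_of(self, key, when):
        # Search both tables for the version whose period
        # [sys_start, sys_end) contains `when`.
        if key in self.current:
            values, start = self.current[key]
            if start <= when:
                return values
        for k, values, start, end in self.history:
            if k == key and start <= when < end:
                return values
        return None


# Replay the statements from the slides.
policy = VersionedTable()
policy.insert(1111, {"vin": "A1111", "annual_mileage": 10000,
                     "rental_car": "Y", "coverage_amt": 500000}, date(2010, 11, 15))
policy.insert(1414, {"vin": "B7777", "annual_mileage": 14000,
                     "rental_car": "N", "coverage_amt": 750000}, date(2010, 11, 15))
policy.update(1111, {"coverage_amt": 750000}, date(2011, 1, 31))
policy.update(1111, {"annual_mileage": 5000, "rental_car": "N",
                     "coverage_amt": 250000}, date(2012, 1, 31))
policy.delete(1414, date(2012, 3, 31))

print(policy.current[1111][0]["coverage_amt"])                  # 250000
print(policy.as_of(1111, date(2010, 12, 1))["coverage_amt"])    # 500000
```

The point of the sketch is that an AS OF query has to look at the current table and the history table together, which is what the FOR SYSTEM_TIME clause spares the user from writing by hand.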

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 65

    Example: Temporal Support in IBM DB2 10 (9)

    Declaring a table to capture business time (= valid time)

    CREATE TABLE policy (

    id INT NOT NULL, ...

    bus_start DATE NOT NULL, bus_end DATE NOT NULL,

    PERIOD BUSINESS_TIME (bus_start, bus_end),

    PRIMARY KEY (id, BUSINESS_TIME WITHOUT OVERLAPS) );

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 66

    Example: Temporal Support in IBM DB2 10 (10)

    Insertions are straightforward and require appropriate values for business time start / end:

    INSERT INTO policy(id,vin, ... ,coverage_amt, bus_start, bus_end)

    VALUES (1111, 'A1111', 10000, 'Y', 500000, '2010-01-01', '2011-01-01');

    INSERT INTO policy(id,vin, ... ,coverage_amt, bus_start, bus_end)

    VALUES (1111, 'A1111', 10000, 'Y', 750000, '2011-01-01', '9999-12-31');

    INSERT INTO policy(id,vin, ... ,coverage_amt, bus_start, bus_end)

    VALUES (1414, 'B7777', 14000, 'N', 750000, '2008-05-01', '2010-03-01');

    INSERT INTO policy(id,vin, ... ,coverage_amt, bus_start, bus_end)

    VALUES (1414, 'B7777', 12000, 'N', 600000, '2010-03-01', '2011-01-01');

    Fig. 7: POLICY table after INSERT statements

    ID VIN annual_mileage rental_car coverage_amt bus_start bus_end

    1111 A1111 10000 Y 500000 2010-01-01 2011-01-01

    1111 A1111 10000 Y 750000 2011-01-01 9999-12-31

    1414 B7777 14000 N 750000 2008-05-01 2010-03-01

    1414 B7777 12000 N 600000 2010-03-01 2011-01-01

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 67

    Example: Temporal Support in IBM DB2 10 (11)

    An insertion with a business time period that overlaps with business time period(s) of

    existing rows raises an error:

    INSERT INTO policy(id,vin, ... ,coverage_amt, bus_start, bus_end)

    VALUES (1111, 'A1111', 10000, 'Y', 900000, '2010-06-01', '2011-09-01');

    -- overlap with 2 existing rows => rejected by system
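Why the system sees two conflicts can be checked with the usual overlap test for inclusive-exclusive periods. The sketch below is an illustration only; the helper `periods_overlap` is a hypothetical name, and it assumes that bus_end is exclusive, which is consistent with adjacent rows in Fig. 7 sharing a boundary date.

```python
from datetime import date


def periods_overlap(start_a, end_a, start_b, end_b):
    """True if the half-open periods [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


# Business time periods already stored for policy 1111 (Fig. 7).
existing_1111 = [
    (date(2010, 1, 1), date(2011, 1, 1)),
    (date(2011, 1, 1), date(9999, 12, 31)),
]

# Period of the row that the INSERT tries to add.
new_period = (date(2010, 6, 1), date(2011, 9, 1))

conflicts = [p for p in existing_1111 if periods_overlap(*new_period, *p)]
print(len(conflicts))  # 2
```

With an exclusive end date, the two existing rows do not overlap each other even though one ends on the day the other starts.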

    Use an update statement instead:

    UPDATE policy

    FOR PORTION OF BUSINESS_TIME

    FROM '2010-06-01'

    TO '2011-09-01'

    SET coverage_amt = 900000

    WHERE id = 1111;

    Fig. 8: Row splits caused by the UPDATE statement (policy 1111)

    Before the update (Fig. 7): 2 rows, 2010-01-01 to 2011-01-01 and 2011-01-01 to 9999-12-31

    UPDATE FROM 2010-06-01 TO 2011-09-01

    After the update (Fig. 9): 4 rows, 2010-01-01 to 2010-06-01, 2010-06-01 to 2011-01-01, 2011-01-01 to 2011-09-01 and 2011-09-01 to 9999-12-31

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 68

    Example: Temporal Support in IBM DB2 10 (12)

    Table resulting after execution of update statement shown on previous slide:

    Fig. 9. POLICY table after UPDATE of Policy 1111

    ID VIN annual_mileage rental_car coverage_amt bus_start bus_end

    1111 A1111 10000 Y 500000 2010-01-01 2010-06-01

    1111 A1111 10000 Y 900000 2010-06-01 2011-01-01

    1111 A1111 10000 Y 900000 2011-01-01 2011-09-01

    1111 A1111 10000 Y 750000 2011-09-01 9999-12-31

    1414 B7777 14000 N 750000 2008-05-01 2010-03-01

    1414 B7777 12000 N 600000 2010-03-01 2011-01-01
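The row splitting that turns Fig. 7 into Fig. 9 can be sketched in Python as follows. This is a minimal illustration under the assumption of inclusive-exclusive periods; the function `update_for_portion` is a hypothetical name and is not meant to describe DB2's internal algorithm.

```python
from datetime import date


def update_for_portion(rows, key, start, end, changes):
    """Apply `changes` to the rows of `key`, but only inside [start, end)."""
    result = []
    for row in rows:
        s, e = row["bus_start"], row["bus_end"]
        if row["id"] != key or e <= start or s >= end:
            result.append(row)                      # no overlap: row is untouched
            continue
        if s < start:                               # piece before the portion keeps old values
            result.append({**row, "bus_end": start})
        result.append({**row, **changes,            # overlapping piece gets the new values
                       "bus_start": max(s, start), "bus_end": min(e, end)})
        if e > end:                                 # piece after the portion keeps old values
            result.append({**row, "bus_start": end})
    return result


# Rows of policy 1111 from Fig. 7 (other columns omitted for brevity).
fig7_1111 = [
    {"id": 1111, "coverage_amt": 500000,
     "bus_start": date(2010, 1, 1), "bus_end": date(2011, 1, 1)},
    {"id": 1111, "coverage_amt": 750000,
     "bus_start": date(2011, 1, 1), "bus_end": date(9999, 12, 31)},
]

fig9_1111 = update_for_portion(fig7_1111, 1111, date(2010, 6, 1), date(2011, 9, 1),
                               {"coverage_amt": 900000})
for r in fig9_1111:
    print(r["coverage_amt"], r["bus_start"], r["bus_end"])
```

Each of the two original rows is only partly covered by the update period, so each is split in two, giving the four rows for policy 1111 in Fig. 9.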

    Deleting data from a table with business time

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 69

    Example: Temporal Support in IBM DB2 10 (13)

    Deletion from table shown on previous slide:

    DELETE FROM policy

    FOR PORTION OF BUSINESS_TIME

    FROM '2010-06-01' TO '2011-01-01'

    WHERE id = 1414;

    ID VIN annual_mileage rental_car coverage_amt bus_start bus_end

    1111 A1111 10000 Y 500000 2010-01-01 2010-06-01

    1111 A1111 10000 Y 900000 2010-06-01 2011-01-01

    1111 A1111 10000 Y 900000 2011-01-01 2011-09-01

    1111 A1111 10000 Y 750000 2011-09-01 9999-12-31

    1414 B7777 14000 N 750000 2008-05-01 2010-03-01

    1414 B7777 12000 N 600000 2010-03-01 2010-06-01
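A deletion for a portion of business time can be pictured the same way: whatever lies inside the deleted period disappears, and any remainder of a row before or after it is kept. The sketch below assumes inclusive-exclusive periods; `delete_for_portion` is a hypothetical helper, not a DB2 function.

```python
from datetime import date


def delete_for_portion(rows, key, start, end):
    """Remove the part of each row of `key` that falls inside [start, end)."""
    result = []
    for row_id, amount, s, e in rows:
        if row_id != key or e <= start or s >= end:
            result.append((row_id, amount, s, e))       # no overlap: keep as is
            continue
        if s < start:                                   # keep the piece before the portion
            result.append((row_id, amount, s, start))
        if e > end:                                     # keep the piece after the portion
            result.append((row_id, amount, end, e))
    return result


# Rows of policy 1414 before the DELETE: (id, coverage_amt, bus_start, bus_end).
before_1414 = [
    (1414, 750000, date(2008, 5, 1), date(2010, 3, 1)),
    (1414, 600000, date(2010, 3, 1), date(2011, 1, 1)),
]

after_1414 = delete_for_portion(before_1414, 1414, date(2010, 6, 1), date(2011, 1, 1))
for row in after_1414:
    print(row)
```

Only the second row overlaps the deleted period, so it is shortened to end on 2010-06-01, as in the table above.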

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 70

    Example: Temporal Support in IBM DB2 10 (14)

    Retrieving data across all business time periods (from the table in Fig. 7, repeated below):

    SELECT COUNT(*) FROM policy WHERE id = 1111;

    -- returns 2

    Retrieving data as of a specific business time (again from the table in Fig. 7):

    SELECT coverage_amt

    FROM policy FOR BUSINESS_TIME AS OF TIMESTAMP('2010-12-01')

    WHERE id = 1111;

    -- returns 500000

    Fig. 7: POLICY table after INSERT statements

    ID VIN annual_mileage rental_car coverage_amt bus_start bus_end

    1111 A1111 10000 Y 500000 2010-01-01 2011-01-01

    1111 A1111 10000 Y 750000 2011-01-01 9999-12-31

    1414 B7777 14000 N 750000 2008-05-01 2010-03-01

    1414 B7777 12000 N 600000 2010-03-01 2011-01-01

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 71

    Example: Temporal Support in IBM DB2 10 (15)

    Retrieving data for a range of business time (from the table in Fig. 7 on the previous slide):

    SELECT *

    FROM policy

    FOR BUSINESS_TIME FROM TIMESTAMP('2009-01-01')

    TO TIMESTAMP('2011-01-01')

    WHERE id = 1414;

    Fig. 12: Query result

    ID VIN annual_mileage rental_car coverage_amt bus_start bus_end

    1414 B7777 14000 N 750000 2008-05-01 2010-03-01

    1414 B7777 12000 N 600000 2010-03-01 2011-01-01

    Temporal queries against tables with business time are internally re-written to a query
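The source does not show what the rewritten queries look like. A plausible form, assuming inclusive-exclusive periods, uses ordinary predicates on bus_start and bus_end. The sketch below runs such predicates against the Fig. 7 data in SQLite via Python's `sqlite3` module; SQLite stands in for DB2 here, and the dates are stored as ISO strings, which compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE policy (id INT, vin TEXT, annual_mileage INT, rental_car TEXT, "
    "coverage_amt INT, bus_start TEXT, bus_end TEXT)"
)
# Contents of Fig. 7.
conn.executemany(
    "INSERT INTO policy VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1111, "A1111", 10000, "Y", 500000, "2010-01-01", "2011-01-01"),
        (1111, "A1111", 10000, "Y", 750000, "2011-01-01", "9999-12-31"),
        (1414, "B7777", 14000, "N", 750000, "2008-05-01", "2010-03-01"),
        (1414, "B7777", 12000, "N", 600000, "2010-03-01", "2011-01-01"),
    ],
)

# Assumed equivalent of: FOR BUSINESS_TIME AS OF '2010-12-01'
as_of = conn.execute(
    "SELECT coverage_amt FROM policy "
    "WHERE id = ? AND bus_start <= ? AND bus_end > ?",
    (1111, "2010-12-01", "2010-12-01"),
).fetchall()

# Assumed equivalent of: FOR BUSINESS_TIME FROM '2009-01-01' TO '2011-01-01'
in_range = conn.execute(
    "SELECT coverage_amt, bus_start, bus_end FROM policy "
    "WHERE id = ? AND bus_start < ? AND bus_end > ? ORDER BY bus_start",
    (1414, "2011-01-01", "2009-01-01"),
).fetchall()

print(as_of)      # [(500000,)]
print(in_range)   # both rows of policy 1414
```

The AS OF predicate selects the one row whose period contains the given date, while the FROM ... TO predicate selects every row whose period overlaps the requested range.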

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti 72

    Example: Temporal Support in IBM DB2 10 (16)

    Declaring a bitemporal table, capturing business time and system time:

    (1) Declare a table with business time and system time.

    CREATE TABLE policy (

    id INT NOT NULL, ...

    bus_start DATE NOT NULL, bus_end DATE NOT NULL,

    sys_start TIMESTAMP(12) GENERATED ALWAYS AS ROW BEGIN NOT NULL,

    sys_end TIMESTAMP(12) GENERATED ALWAYS AS ROW END NOT NULL,

    trans_start TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID IMPLICITLY HIDDEN,

    PERIOD BUSINESS_TIME (bus_start, bus_end),

    PERIOD SYSTEM_TIME (sys_start, sys_end),

    PRIMARY KEY (id, BUSINESS_TIME WITHOUT OVERLAPS) );

    (2) Then declare a history table like the previous table

    (3) Associate this history table with the table declared in step (1)

  • DWh 2012: 3-1 Data Warehouse - Historization R. Marti Slide 73

    Literature

    General Temporal Database Concepts

    [Snodgrass 1999] Richard T. Snodgrass: Developing Time-Oriented Database Applications. Morgan Kaufmann,

    1999. (see http://www.cs.arizona.edu/people/rts/publications.html)

    [Zhou et al 2006] Xin Zhou, Fusheng Wang, Carlo Zaniolo: Efficient Temporal Coalescing Query Support in

    Relational Database Systems. Proc. 17th International Conference on Database and Expert Systems

    Applications - DEXA '06, 2006.

    [Johnston & Weis 2010] Tom Johnston, Randall Weis: Managing Time in Relational Databases: How to Design,

    Update and Query Temporal Data. Morgan Kaufmann, 2010.

    [Saracco et al 2010] Cynthia M. Saracco, Matthias Nicola, Lenisha Gandhi: A Matter of Time: Temporal Data

    Management in DB2 for z/OS. IBM Silicon Valley Laboratory, 2010 (?).

    Data Warehouse Design

    [Kimball & Ross 2002] Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to

    Dimensional Modeling, 2nd Edition. John Wiley, 2002.

    [Imhoff et al 2003] Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger: Mastering Data Warehouse Design:

    Relational and Dimensional Techniques. John Wiley, 2003.

    [Golfarelli & Rizzi 2009] Matteo Golfarelli, Stefano Rizzi: Data Warehouse Design: Modern Principles and

    Methodologies. McGraw Hill, 2009.

    [Adamson 2010] Christopher Adamson: Star Schema: The Complete Reference. McGraw Hill, 2010.