oh3


Lecture 3

Themes in this session
- Basics of the multidimensional data model and star-join schemata
- The process of, and specific design issues in, multidimensional data modelling

Why use multidimensional modelling
- Simple, intuitive design basis
  - easy to create a multidimensional model
  - easy to communicate the meaning of the model
  - easy to gain an overview of the model
- A logical model which can be implemented in a variety of databases
  - relational databases
  - multidimensional databases
  - object-oriented databases
- Supports the reporting and analytical needs of business users
- Tried and tested

General structure for a multidimensional model
- A central fact table, referred to as a multidimensional data subject
- Surrounding dimension tables, referred to as single dimensional data subjects
- Joins connecting the fact table and its surrounding dimension tables. Only one join per dimension table
- A concatenated or multipart key in the fact table which is comprised of one key from each dimension table.
NOTE: the multidimensional model represents an n-dimensional matrix, with n being the number of dimensions

The fact table
- Measures or facts
  - reflect focal events or snapshots of states of being
  - vary continuously over time
- All the facts have a specific granularity
- Facts should ideally be additive but this is not always the case
- A set of foreign keys constituting a concatenated key
- Contains the major volume of data

The dimension tables
- Dimensions are often referred to as causal dimensions; they contain the causal factors responsible for the collected measures
- The time dimension is not a causal dimension; it is however one of the most important dimensions for structuring and analysing data
- The dimension tables contain dimensional attributes; these are usually textual and discrete and must have a relevant business meaning
- Good dimensional attributes are stable across time
- If the attributes are connected in one or more hierarchies then these are usually captured in the dimension table

Skinny fact tables
- As the fact table contains the vast volume of records it is important that it is memory space efficient.
- Foreign keys are usually represented in integer form and do not require much memory space
- Facts too are often numeric properties and can usually be represented as integers (in contrast to dimensional attributes, which are usually long text strings)
- This space efficiency is critical to the memory space consumption of the data warehouse

Aggregation
- Lowest level of aggregation is determined by the granularity of the fact table.
- Aggregations can be created on-the-fly or by the process of pre-aggregation
- Pre-aggregation demands more storage space but provides better query performance
- Aggregation is easier when facts are all additive

Sparsity
- The matrices represented by multidimensional models are often 99% sparse.
- Sparsity is dealt with by simply not creating records for the cells that are not filled in the matrix. If nothing has happened no record is created.
- Pre-aggregation and storage of aggregates can however lead to sparsity failure, which places large demands on data storage

De-normalisation
- In 3rd normal form all attributes are mutually independent and fully dependent on the primary key
- The fact table is by nature highly normalised in a star-join schema
- Dimensions are however usually not normalised

Snowflakes and normalised dimension tables
- "Any attempts to normalise dimension tables in order to save disk space are a waste of time"
- Normalisation affects the intuitive understandability of the diagram
- Normalised dimension tables destroy the ability to "browse"
- Normalised tables demand extra joins, and the querying of snowflakes takes longer than the querying of standard star-join schemata

The process of multidimensional data modelling

What to focus on in MDM
- Query optimised database
- Whole business entities
- Key business activities and influences
- Transaction history
- People, places and things
- Time
- Dimension and rollup

Basic steps in modelling
- Select a business subject area
- Identify which business process(es) is being modelled
- Identify the basic measures or facts
- Determine at what level of detail (granularity) active analysis is conducted
- Determine what the measures have in common (identify the dimensions)
- Identify the relevant attributes in the various dimensions
- Determine if the attributes are stable or variable over time and if their cardinality is bounded or unbounded

Identifying facts
- The purpose of the analyst must be supported by the facts in the fact table; there must be measures which have relevance to the business goals which the organisation seeks to fulfil
- Facts are by nature dynamic and variable over time
- They do not have a limited cardinality
- Facts have their origins in the working of the organisation and the activities it performs

Identifying dimensional attributes
- Dimensional attributes are those predictive variables business users believe are of significance to the measures in the fact table
- Dimensional attributes are often present in hierarchies in the causal dimensions
- Dimensional attributes usually have a limited cardinality and are non-variable across time.
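
The structure described in the slides so far (a central fact table whose concatenated key holds one foreign key per dimension, textual attributes kept in the dimension tables, no record for empty cells, and roll-up along a dimension hierarchy) can be sketched with Python's built-in sqlite3 module. This is only an illustrative sketch: the table names, column names and sample rows (dim_time, dim_product, dim_store, fact_sales, units_sold) are hypothetical and do not come from the lecture.

```python
import sqlite3


def build_star(conn):
    """Create a tiny star-join schema with three dimensions and one fact table."""
    conn.executescript("""
        CREATE TABLE dim_time    (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);
        -- The fact table is "skinny": integer foreign keys plus one additive measure.
        -- Its primary key is the concatenation of one key from each dimension.
        CREATE TABLE fact_sales (
            time_key    INTEGER REFERENCES dim_time(time_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            store_key   INTEGER REFERENCES dim_store(store_key),
            units_sold  INTEGER,
            PRIMARY KEY (time_key, product_key, store_key)
        );
    """)
    conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)", [
        (1, "2024-01-15", "2024-01"),
        (2, "2024-01-16", "2024-01"),
        (3, "2024-02-01", "2024-02"),
    ])
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
        (1, "Espresso", "Coffee"),
        (2, "Latte", "Coffee"),
        (3, "Green tea", "Tea"),
    ])
    conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)", [
        (1, "Aarhus", "North"),
        (2, "Odense", "South"),
    ])
    # Sparsity: the matrix has 3 x 3 x 2 = 18 cells, but only the cells where
    # something actually happened get a record.
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
        (1, 1, 1, 10),
        (1, 3, 1, 4),
        (2, 2, 2, 7),
        (3, 1, 2, 5),
        (3, 3, 1, 6),
    ])


def rollup_by_category_and_month(conn):
    """On-the-fly aggregation: one join per dimension used, then GROUP BY."""
    return conn.execute("""
        SELECT p.category, t.month, SUM(f.units_sold)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_time t    ON t.time_key = f.time_key
        GROUP BY p.category, t.month
        ORDER BY p.category, t.month
    """).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star(connection)
    for row in rollup_by_category_and_month(connection):
        print(row)
```

Because units_sold is additive, a plain SUM is enough to roll daily product sales up to category and month; the store dimension is simply left out of the query.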

Supporting multiple hierarchies in dimensions
- Dimensions should be able to support multiple independent hierarchies
- Alternate hierarchies are easily supported
- Cyclic paths in two hierarchies demand that the hierarchies be split into two entirely separate hierarchies or even two separate dimensions
- A dimension can also contain attributes that do not have any hierarchical relationships to the other attributes in the dimension
NOTE: Any of the attributes, whether in a hierarchy or not, can be used in the drill down process

Factless fact tables
- Some fact tables quite simply have no measured facts!
- Often used to represent many-to-many relationships
- The only thing they contain is a concatenated key; they do still however represent a focal event which is identified by the combination of conditions referenced in the dimension tables
- There are two main types of factless fact tables:
  - event tracking tables
  - coverage tables

Dealing with many-to-many relationships among dimensional attributes
- Many-to-many relationships are difficult to deal with in any database design situation.
- Great efforts should be taken to identify any such relationships in the data model
- When creating an MDM it is necessary to separate the two entities and capture their relationship in a factless fact table

Dealing with semi-additive and non-additive facts
- Semi-additive facts are those which are not additive across all dimensions
  - warn users
  - prohibit the addition of these facts across the relevant dimensions
- Non-additive facts are not additive across any dimensions
  - most ratios and all measures depicting snapshots of a state fall into this class
  - in some cases other calculatory methods can be used to aggregate these measures
    - average over the number of time periods
    - calculate the ratio of the sums, not the sum of the ratios

Degenerate dimensions
- A dimension which has been cannibalised by other dimensions

[...]
- demand for the product
- the supply side: the steps needed to manufacture the products from original ingredients or parts
- The chain consists of a sequence of inventory and flow star-join schemata
- Joining the different star-join schemata is only possible when two sequential schemata have a common, identical dimension
- Sometimes the represented chain can be extended beyond the bounds of the business itself

A family of stars
[Figure: star schemas 1 to 4 linked through common dimensions]

Creating mini-dimensions for really large dimensions
- Many dimension attributes are used very frequently as browsing constraints; in big dimensions these constraints can be hard to "find" among the lesser used ones
- Logical groups of often used constraints can be separated into small dimensions which are very well indexed and easily accessible for browsing
- All variables in these mini-dimensions must be presented as distinct bands or classes
- The key to the mini-dimension can be placed as a foreign key in both the fact table and the dimension table from which it has been broken off
- Mini-dimensions, as their name suggests, should be kept small and compact

Slowly changing dimensions
- Most dimensions are not constant over time
- Most dimensions are however almost constant over time
- Almost constant dimensions are referred to as slowly-changing dimensions
- There are three main methods of handling slowly changing dimensions:
  - Overwriting
  - Creating additional dimensional records
  - Creating new current fields within the original dimension
Note: One of the key functions of the data warehouse is to track events over extended periods of time. The validity of the data warehouse is thus dependent on how well changes in its dimensions are tracked.

Slowly changing dimensions (1)
- The dimensional attribute record is overwritten with the new value
- No changes are needed elsewhere in the dimension record
- No keys are affected anywhere in the database
- Very easy to implement but the historical data is now inconsistent
- Two basic questions need to be asked before overwriting a dimension attribute:
  - How important is the value to the end-user's analysis needs?
  - How important is the tracking of history?

Slowly changing dimensions (2)
- Introduce a new record for the same dimensional entity in order to reflect its changed state
- A new instance of the dimensional key is created which references the new record
- In order to [...] is best dealt with by using version digits at the end of the key. These allow up to 100 snapshots of a changing dimensional entity
- All these keys need to be created, maintained and managed by someone and tracked in the metadata
- The database maintains its consistency and the versions can be said to partition history

Slowly changing dimensions (3)
- Use a slightly different design of dimension table which has fields for:
  - original status of dimensional attribute
  - current status of dimensional attribute
  - an effective date of change field
- This allows the analyst to compare the as-is and as-was states against each other
- Only two states can be traced, the current and the original
- Some inconsistencies are created in the data as time is not properly partitioned

Special mention - Heterogeneous products
- Some products have many, many distinguishing attributes and many possible permutations (usually on the basis of some customised offer).
- This results in immense product dimensions and bad browsing performance
- In order to deal with this, fact tables with accompanying product dimensions can be created for each product type; these are known as custom fact tables
- Primary core facts on the product types are kept in a core fact table
- The core facts are copied in each of the custom fact tables

Pre-aggregated data in SJS (star-join schemata)
- How to deal with aggregates is one of the biggest issues in the design of a DW
- Choice of pre-aggregation/on-the-fly aggregation has great relevance to data storage and query performance
- Two main strategies exist for the creation of aggregates:
  - the creation of new fact tables for aggregates
  - the creation of new level fields for aggregates
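
The first of the two strategies, a new fact table for aggregates, can be sketched as follows, again with Python's built-in sqlite3 module. The sketch assumes a hypothetical day-level fact table (fact_sales with revenue and cost) and pre-aggregates it to month level in a separate table (agg_sales_month); all names and figures are made up for illustration. The aggregate table stores only the additive components, so that a non-additive measure such as margin can still be derived as the ratio of the sums rather than the sum of the ratios.

```python
import sqlite3


def build_base(conn):
    """Create a day-level fact table with two additive facts (hypothetical data)."""
    conn.executescript("""
        CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
        CREATE TABLE fact_sales (
            time_key    INTEGER REFERENCES dim_time(time_key),
            product_key INTEGER,
            revenue     INTEGER,
            cost        INTEGER,
            PRIMARY KEY (time_key, product_key)
        );
    """)
    conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)", [
        (1, "2024-01-15", "2024-01"),
        (2, "2024-01-16", "2024-01"),
        (3, "2024-02-01", "2024-02"),
    ])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
        (1, 1, 100, 60),    # daily margin 0.4
        (2, 1, 300, 240),   # daily margin 0.2
        (3, 1, 200, 100),   # daily margin 0.5
    ])


def build_monthly_aggregate(conn):
    """Pre-aggregation: materialise a separate fact table at month granularity."""
    conn.executescript("""
        CREATE TABLE agg_sales_month AS
        SELECT t.month        AS month,
               f.product_key  AS product_key,
               SUM(f.revenue) AS revenue,
               SUM(f.cost)    AS cost
        FROM fact_sales f
        JOIN dim_time t ON t.time_key = f.time_key
        GROUP BY t.month, f.product_key;
    """)


def monthly_margin(conn):
    """Derive the non-additive margin from the stored sums (ratio of the sums)."""
    rows = conn.execute(
        "SELECT month, revenue, cost FROM agg_sales_month ORDER BY month"
    ).fetchall()
    return [(month, (revenue - cost) / revenue) for month, revenue, cost in rows]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_base(connection)
    build_monthly_aggregate(connection)
    print(monthly_margin(connection))
```

Queries at month level now read the small aggregate table instead of scanning the base fact table, at the price of the extra storage the slides mention. For January the two daily margins are 0.4 and 0.2, yet the monthly margin computed from the sums is 100 / 400 = 0.25, which is why the ratio itself is not stored and summed.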