Dimensional Models Intro
-
Upload
manthriravi -
Category
Documents
-
view
222 -
download
0
description
Transcript of Dimensional Models Intro
-
BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 7
BRIDGING THE GAP
Dimensional modeling is a database design technique devel-
oped specifically for designing data warehouses. Its objectives
are to create database structures that end users can easily
understand and write queries against, and to optimize query
performance. It has become the predominant approach to
designing data warehouses in practice and has proven to be a
major breakthrough in developing databases that can be used
directly by end users. Dimensional modeling is not based on
any theory, but has clearly been very successful in practice.
This article, the first in a two-part series, examines the nature
of dimensional modeling and proposes a possible explanation
for why it has been so successful.
This article also explodes the popular myth that traditional ER
modeling and dimensional modeling are fundamentally different
and somehow incompatible. It shows that a dimensional model
is just a restricted form of an ER model, and that there is a
straightforward mapping between the two.
An ER model can be transformed into a set of dimensional
models by a process of selective subsetting, denormalization,
and (optional) summarization. Understanding the relationship
between the two types of models can help to bridge the gap
between operational system (OLTP) design and data ware-
house (OLAP) design. It can also help to resolve the difficult
problem of matching supply (operational data sources) and
demand (end-user information needs) in data warehouse
design. Finally, it results in a more complete dimensional
design, which is less dependent on the designers ability to
choose the right dimensions.
From ER Models toDimensional Models:Bridging the Gap between OLTP
and OLAP Design, Part I
Daniel L. Moody and Mark A. R. Kortink
Daniel L. Moody is a Visiting Professor in the Departmentof Software Engineering, Charles University, Prague
(visiting from Monash University, Melbourne, Australia)[email protected]
Mark A. R. Kortink is the Managing Director of Kortinkand Associates, an Australianbased IT management
consultancy company. [email protected]
-
IntroductionThe data warehousing or OLAP (online analyticalprocessing) environment is very different from theoperational or OLTP (online transaction processing)environment, and therefore requires a differentapproach to database design (Kimball, 1995, 1996,1997, 2002). There are two fundamental differencesthat suggest the need for different design approaches:
End-user access (visibility of database struc-ture): In a data warehousing environment, endusers access data directly using user-friendlyquery and analysis (OLAP) tools (Glassey, 1998).This means users need to understand how data isstructured in order to use the database effectively(though some of the details may be hidden bythe OLAP tool). In operational systems, thedatabase structure is effectively hidden by anapplication front endusers only access thedatabase via data entry and enquiry screens.Read-only access (query versus update per-formance): In a data warehousing environ-ment, users can retrieve data but cannot modifyit, whereas in an operational environment,users make online updates to the database inresponse to business events. For most practicalpurposes, therefore, the data warehouse can beconsidered as a read-only database, as it is onlyupdated via batch extract, transform, and load(ETL) processes.
Traditional database design techniques such as ERmodeling and normalization are primarily designedfor getting data efficiently and reliably into databases.However, structuring databases in this way also actsas a barrier to getting data out. The reason for this isthat such methods tend to result in database struc-tures which are too complex for end users to under-stand and write queries against. The averageoperational database consists of around a hundredtables, linked by a complex web of relationships,which most end users (not to mention most IT
specialists) would find overwhelming. Even simplequeries require multi-table joins, which are beyondthe capabilities of most end users.
The primary cause of complexity in operationaldatabases is the use of normalization. Normalizationtends to multiply the number of tables, as it involvessplitting out functionally dependent groups ofattributes into separate tables. This minimizes dataredundancy and maximizes update efficiency becauseeach change can be made in a single place. It alsomaximizes data integrity by avoiding inconsistencyand preventing insertion, deletion, and updateanomalies (Codd, 1970). However, on the downside,it tends to penalize retrieval because of the number ofjoins required (Kent, 1983). Thus, normalizationresults in databases that are optimized for updaterather than retrieval. Given that data warehouses areeffectively read-only databases, normalization is notwell suited for designing data warehouses.
Dimensional ModelingDimensional modeling, a database design techniquedeveloped specifically for designing data warehouses,addresses the limitations of traditional databasedesign methods for such purposes. The approach wasfirst defined by Kimball (1996), though he claimsnot to have invented it but to have merely docu-mented what people had been doing in practice formany years. In particular, it was based on his obser-vations of vendors in the business of providing datain a user-friendly form to their customers. Theobjectives of dimensional modeling are to:
Produce database structures that are easy forend users to understand and write queriesagainst. Optimize query performance (as opposed toupdate performance).
Dimensional modeling was a pivotal development indata warehousing, as it provided a viable means of
8 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003
BRIDGING THE GAP
-
representing data in a way that end users couldunderstand and write queries against. Effectively thistransformed the dream of end users being able tosatisfy their own information needs (the self servicemodel of decision support as expounded by earlierdata warehousing theorists) into reality. Dimensionalmodeling has become the predominant approach todesigning data warehouses in practice and has provento be a major breakthrough in designing databasestructures that can be understood and used directlyby end users.
Star Schemas
Like most good ideas, dimensional modeling is verysimple. It is based on a single, highly regular datastructure called a star schema. This consists of acentral table called the fact table, surrounded by anumber of dimension tables.
The fact table forms the center of the star, whilethe dimension tables form the points of the star.
Fact tables typically correspond to particular types ofbusiness eventsfor example, orders, shipments,payments, bank transactions, airline reservations, andhospital admissions. Such tables usually containnumerical values (e.g., dollar amounts), which maybe analyzed using statistical functions. Dimensiontables provide the basis for analyzing data in the facttableways of slicing and dicing the data.Dimensions typically answer who, what, when,where, how, and why questions about the busi-ness events stored in the fact table. A star schemamay have any number of dimensions.
The terminology of dimensional model derivesfrom the fact that a star schema may be visualized asa data cube where each dimension table representsa different spatial dimension (or more generally ahypercube, as a star schema may have any number ofdimensions). As shown in Figure 2, each cell in thecube represents the intersection of values from each
dimension. Of course, this metaphor loses its intu-itive appeal once there are more than three dimen-sions, as most people have difficulty thinking inhyper-dimensional space!
Dimensions are often highly denormalized tables,typically consisting of one or more embedded hierar-chies. For example, Customer, which forms a singledimension table in the star schema of Figure 1, con-sists of three independent hierarchies when they arenormalized out (Figure 3). The hierarchical structureof dimensions provides the basis for analyzing data atdifferent levels of detailthe ability to drill downand roll up in OLAP tools.
Snowflake Schemas
A snowflake schema is a star schema with fully nor-malized dimensionsit gets its name because itforms a shape similar to a snowflake. Rather thanhaving a regular structure like a star schema, thearms of the snowflake can grow to arbitrary lengthsin each direction. Figure 4 shows the equivalentsnowflake schema for the star schema of Figure 1.While these two structures are equivalent in datacontent and the queries they support, the snowflakeschema has a much more complex structure.
BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 9
BRIDGING THE GAP
Figure 1. Star Schema Example
Date
Product
WHAT Dimension
WHO Dimension
WHERE Dimension
WHEN Dimension
RetailOutletCustomer
Sales Summary(Fact Table)
-
However, the advantage of the snowflake structure isthat it explicitly shows the structure of each dimen-sion rather than appearing as an unstructured collec-tion of data items.
Kimball (1996, 2002) arguesthat snowflaking is undesir-able because it adds unneces-sary complexity, reduces queryperformance, and doesnt sub-stantially reduce storage space.In the example, the snowflakeschema has more than fivetimes as many tables as theequivalent star schema.However, empirical tests showthat the performance impact ofsnowflaking depends on theDBMS and/or OLAP toolused: some favor snowflakeswhile others favor star schemas(Spencer and Loukas, 1999).An advantage of the snowflakerepresentation is that it explic-itly shows the hierarchicalstructure of each dimension
in a star schema this requires tacit knowledge on thepart of the user. While this may be overwhelmingwhen presented all at once, it may be useful to allowend users to see the underlying structure of individualdimensions to understand how data can be sensibly
10 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003
BRIDGING THE GAP
Figure 2. Data "Cube"
Product
Customer
Date
Figure 3. Embedded Hierarchies in Dimension Tables
MarketSector
IndustryGroup
Country
IndustryHierarchy
MarketSegmentation
Hierarchy
LocationHierarchy
IndustryClass
Customer
Customer IDCustomer NameMarket SegmentMarket SectorIndustry ClassIndustry SectorIndustry GroupCityStateCountry
IndustrySector
MarketSegment
CityState
Figure 4. Snowflake Schema
Date
FinancialQuarter SeasonHoliday
FianacialYear
Product
PackageSize
Brand
ProcessType
ManufacturingGroup
ManufacturingSubgroup
ProductCategory
ProductSubcategory
Department
RetailOutlet Suburb City
MarketingRegion
MarketingDivision
StoreLayout
StoreType
State CountryCustomerIndustryClassIndustrySector
MarketSegment
MarketSector
IndustryGroup
CityStateCountry
Sales Summary(Fact Table)
ProductSource
-
analyzed. This suggests that whether a snowflake or astar schema is used at the physical level, views shouldbe constructed so that the user can see the structure ofeach dimension if required.
Strengths and Weaknesses of Dimensional ModelsThe major advantage of using dimensional models to represent data is that they drastically reduce thecomplexity of the database structure. This makes the database easier for people to understand andwrite queries against, by minimizing the number oftables and therefore the number of joins required.Dimensional models also optimize query performance.A star schema has a fixed structure that has no alterna-tive join paths, which greatly simplifies the evaluationand optimization of queries (Raisinghani, 2000).
However, its strength is also its weakness. The fixedstructure of the star schema limits the queries thatcan be written to the dimensions that have beendefined. This means the designer must have a goodidea in advance about the sort of questions users willwant to ask (Craig, 1998). Star schemas make themost common queries easy to write but also restrictthe way the data can be analyzed (Gallas, 1999).Another weakness of dimensional models is that notall information can be represented in dimensionalform. Dimensional models assume an underlyinghierarchical structure of data and exclude data that isnaturally non-hierarchical (e.g., network structureddata). Thus, dimensional modeling at best provides apartial solution (an 80/20 solution) to the problemof database query and analysis.
Theoretical Explanation: Why Dimensional Modeling WorksDimensional modeling is not based on any theory,but has clearly been very successful in practice. It is amajor contribution to the database design fieldasimportant in practical terms as Codds contributionof normalization (Codd, 1970) and Chens contribution
of ER modeling (Chen, 1976) and its influence onthe practice of data warehouse design cannot be overstated. For these reasons, it is important tounderstand why it works. We offer a theoreticalexplanation for dimensional modelings success indesigning databases end users can understand andwrite queries against.
Principle #1: Chunking
Psychological studies show that due to limits onshort-term memory, humans have a strictly limitedcapacity for processing informationthis is esti-mated to be the magical number seven, plus orminus two concepts concurrently (Miller, 1956;Newell and Simon, 1972; Baddeley, 1994). If theamount of information received exceeds these limits,information overload ensues and comprehensiondegrades rapidly (Lipowski, 1975).
In one of the most important papers in the history ofpsychology, Miller (1956) described the results of aseries of experiments using a wide range of differentexperimental stimuli (sounds, tastes, colors, points,numbers, and words). He observed a surprising simi-larity in human channel capacity across all theseexperiments, with a mean of 6.5 concepts, and onestandard deviation, including 4 to 10 concepts (Figure5). This has been described as the inelastic limit ofhuman capacity (Uhr et al., 1962) or cognitivebandwidth (Moody, 2002), and represents one of theenduring laws of human cognition (Baddeley, 1994).
The primary mechanism used by the human mind tocope with large amounts of information is to orga-nize it into chunks of manageable size (Weber,1996; Baddeley, 1999; Miller, 1956). The ability torecursively develop information-saturated chunks isthe key to peoples ability to deal with complexityevery day. The process of organizing data into a set ofseparate star schemas effectively provides a way oforganizing a large amount of data into cognitivelymanageable chunks. Although Kimball never refers
BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 11
BRIDGING THE GAP
-
to the seven plus or minus two principle, all theexamples he uses in his original book (Kimball,1996) implicitly obey this rule: an average of 6.6entities per star schema, with a maximum of 11 anda minimum of 4. This corresponds remarkably well(in fact, almost perfectly) to the limits of humanchannel capacity defined by Miller!
Principle #2: Hierarchical Structuring
Hierarchy is one of the most common ways of organiz-ing complexity for the purposes of human understand-ing, and is one of the basic principles of systems theory(Flood and Carson, 1993; Klir, 1985). Herbert Simon,the dual Nobel Prize winner, observed the use of hierar-chy to organize complexity in organizational systems,social systems, biological systems, physical systems, andsymbolic systems (Simon, 1996). Based on this, he pro-posed hierarchical organization as a general architecturefor complexity. Hierarchical structures act as a com-plexity management mechanism by reducing thenumber of items one has to deal with at each level ofthe hierarchy. For example, instead of having to dealwith a million customers, one can group them into ahundred market sectors and then into 10 market seg-ments. Hierarchical structures are a familiar and naturalway of organizing information, and are all around us in
everyday lifefor example, books are organized hierar-chically, Web sites are organized hierarchically, the orga-nizations we work in are organized hierarchicallyeventhis article is organized hierarchically.
As discussed earlier, each dimension in a star schematypically consists of one or more hierarchies. Theseprovide a way of classifying business events stored inthe fact table, thereby reducing complexity. It is thishierarchical structure that provides the ability toanalyze data at different levels of detail, and to rollup and drill down in OLAP tools. As we will seelater, the process of building dimensional models islargely one of extracting hierarchical structures fromenterprise data.
Implications for Practice
Together these principles provide a possible explana-tion for why dimensional modeling has been so suc-cessful in practice. Unlike traditional database designtechniques, dimensional modeling is not based onany theory but simply evolved as a result of practice.However, closer examination shows that it is basedon some general organizational principles (from psy-chology and systems theory) that have been used tomanage complexity in a wide range of domains.
This is hardly surprising, as one of the most impor-tant problems in data warehousing is complexitymanagement: how to present enormous volumes ofdata in a way that decision makers can understand.Current practice in dimensional modeling has moreof the characteristics of an art than a science, andrelies heavily on common-sense, experience, andrules of thumb. Understanding the underlyingprinciples on which dimensional models are basedmay help to develop principles for good dimen-sional modeling, which are largely lacking. Forexample, the seven plus or minus two principlecould be used to define a limit on the number ofdimensions in a star schema, to ensure that it repre-sents a conceptually manageable chunk of data.
12 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003
BRIDGING THE GAP
Figure 5. Human Channel Capacity or Cognitive Bandwidth (Miller, 1956)
Number of Variable Aspects
PointsColors
Tastes
Tones
Chan
nel C
apac
ity
00
2
4
6
8
10
1 2 3 4 5 6 7
-
From ER Models to Dimensional Models:Exploding the MythsThere is a common misconception that dimensionalmodeling is fundamentally different from, andincompatible with, ER modeling. For example, asKimball (1996) says:
Entity relation models are a disaster for queryingbecause they cannot be understood by users andcannot be navigated usefully by DBMS software.Entity relation models cannot be used as the basisfor enterprise data warehouses.
As a result, the starting point for most dimensionaldesign approaches is that you must forget everythingyou ever learned about database design (Kimball,1996, 1997, 1995, 2002; Oates, 2000). Sadly, this isa common theme in the introduction of any newtechnology, where proponents of the new technologyclaim that a whole new approach is needed, oftenwhen they do not fully understand the oldapproaches themselves. It is also part of good market-ing to claim that something is totally new. We thinkit is important to dispel this notion, as it holds backthe discipline of data warehouse design in twoimportant ways:
It acts as a barrier for people trained in tradi-tional database design techniques to learndimensional design. ER modeling and normal-ization have been used in practice to designdatabase schemas for over two decades, and aretaught as part of virtually every university levelIT curriculum. Clearly, it would be useful tobuild on this wealth of knowledge and exper-tise than to simply discard itfor one thing,people learn better by building on what theyalready know.
It acts as a barrier to establishing a cumulativetradition in the field, which is essential for sci-entific progress (Kuhn, 1970). Rather than
simply saying that data warehousing is some-thing totally new and different and trying toestablish a new design discipline from its foundations, dimensional modeling should be linked to the existing body of knowledge in database design.
On closer examination, we find that a star schema isjust a restricted form of an ER model (Figure 6):
There is a single entity called the fact table,which is in third normal form (3NF).Violations to second normal form (2NF)would result in double counting in queries.The fact table forms an n-ary intersectionentity (where n is the number of dimensions)between the dimension tables, and includes thekeys of all dimension tables.There are one or more entities called dimen-sion tables, each of which is related to the facttable via one or more one-to-many relation-ships. Dimension tables have simple keys, andare in at least 2NF. Transitive dependencies(3NF violations) are allowed, but partial depen-dencies (2NF violations) and repeating groups(1NF violations) are not.
As we will see, there is a straightforward transforma-tion between a normalized ER model and a dimen-sional model, involving selective subsetting,denormalization, and (optional) summarization.
Deriving Dimensional Models from ER ModelsCurrent approaches to dimensional design mostlytake a top-down approach, in which dimensionalmodels are developed directly from user query andanalysis requirements (demand side analysis). While this may seem like a sound, customer-drivenapproach, it ignores one of the fundamental realitiesof the data warehouse environment. To a large extent, a data warehouse is simply a repackaging of
BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 13
BRIDGING THE GAP
-
operational data in a more accessible form. Datawarehouse design is therefore a highly constraineddesign problem limited by what data is available from operational systems (supply issues).
In this article, we describe how to derive dimensionalmodels from (normalized) ER models. This provides away of repackaging operational data (usually stored innormalized tables) into a form that end users canunderstand and write queries against. This helps toresolve the difficult problem of matching supply(operational data sources) and demand (managementinformation needs) in data warehouse design (Winterand Strauch, 2003). Deriving dimensional modelsfrom ER models also provides a more structuredapproach to dimensional design than starting fromfirst principles. There is a large conceptual leap ingetting from end-user analysis requirements to dimen-sional models, with which all but the most gifteddesigners have difficulty. Experience shows that there isa very steep learning curve involved and a great deal ofexperience required before people can successfullydevelop dimensional models on their own. Theapproach described in this paper can reduce this learn-ing curve by providing a stepwise approach to develop-ing dimensional models starting with the source data.
To illustrate the approach, we use an ER model foran order processing system (Figure 7). This con-sists of 42 entities, which is less than half the sizeof the average application data model, but largeenough to illustrate the main principles of theapproach. We will refer to this example through-out this article and the next (on advanced designissues, to appear in the Fall issue of the BusinessIntelligence Journal).
Even in a model of this size, the problem of complex-ity in normalized ER models is clearly evident. Mostend users would find this schema incomprehensible,as it exceeds the limits of human cognitive capacitymany times over. Even quite simple queries wouldrequire multi-table joins and/or subqueries, and as aresult, end users would be heavily reliant on technicalspecialists to write queries for them.
The transformation of an ER Model to dimensionalform takes place in four steps:
Step 1: Classify entitiesStep 2: Design high-level star schema Step 3: Design detailed fact table Step 4: Design detailed dimension table
Step 1. Classify Entities
The first step in the transformation approach is toclassify entities into three distinct categories:
Transaction Entities: These entities recorddetails of business events (e.g., orders, ship-ments, payments, insurance claims, bank trans-actions, hotel bookings, airline reservations,and hospital admissions). Most decisionsupport applications focus on these events toidentify patterns, trends, and potential problemsand opportunities. The exception to this rule is the case of snapshot entities: entitiesrecording a static level of some commodity at a point in time (e.g., account balances and
14 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003
BRIDGING THE GAP
Figure 6. ER Representation of a Star Schema
TIMEDimension
PRODUCTDimension
WHATWHEN
DateDay of weekMonthQuarterYearHoliday flag
Product IDProduct nameBrandProduct category
SALESFact
STOREDimension
WHERE
DateProduct IDStore IDDollars soldQty soldDollars cost
Store IDStore nameAddressFloor plan type
-
inventory levels). These do not record businessevents as such, but the effect of events on thestate of an entity.Component Entities: These entities are directlyrelated to a transaction entity by a one-to-manyrelationship. They are involved in the businessevent, and answer who, what, where,how, and why questions about the event. Classification Entities: These entities are relatedto a component entity by a chain of one-to-many relationships. These define embeddedhierarchies in the data model and are used toclassify component entities.
The classification of entities for the example datamodel is shown in Figure 8. There are three trans-action entities in the model: Order, Order Item,and Stock Level. Order and Order Item correspondto business events, while Stock Level records astatic level (a snapshot entity). Order and OrderItem represent different levels of detail of the samebusiness event.
There are six component entities:Customer, Employee, Retail Outlet,Product, Warehouse and DeliveryMethod. Product is a component oftwo different transactions (StockLevel and Order Item).
Finally, there are 31 classificationentities defining separate but par-tially overlapping hierarchies inthe model.
Note that some of the entities inthe underlying data model do notfit into any of these categories.Such entities do not fit the hierar-chical structure of a dimensionalmodel and therefore cannot be rep-resented in the form of a starschema (with some exceptions, to
be discussed in the next issue of the BusinessIntelligence Journal). Thus, the process of dimension-alizing an ER model effectively weeds out non-hierarchical data.
Step 2: High-Level Star Schema Design
In this step, relevant star schemas are identified andtheir high-level structures defined (table leveldesign). At a high level, transaction entities corre-spond to fact tables and component entities corre-spond to dimension tables, but the mapping is notalways one to one.
2.1 Identify Star Schemas Required
Each transaction entity is a candidate for a starschema. The process of creating star schemas is oneof subsettingdividing a large and complex modelinto manageable-sized chunks. Each star schema iscentered on a single business event, and thereforerepresents a manageable-sized chunk of data.However, there is not always a one-to-one correspon-dence between transaction entities and star schemas:
BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 15
BRIDGING THE GAP
Figure 7. ER Model for Retail Sales System
Product
Order CustomerMarket Segment
Sales District Store Type City
Retail Outlet
Order Item
Stock Level Warehouse
Customer Contact Number
Product AvailabilitySupplier
Market Sector
Brand
Package Type
Postal Area
Sales Region
Industry Type Industry ClassIndustry GroupProduct
Substitution
Employee
Reporting Relationship
Position Type
Contact Number
Type
Delivery Method
Carrier
State Country
Priority
Credit Terms
Product Subcategory
Product Category
Tax Status
Delivery Charge
Employment Status
Union Status
Education Level
Inventory Status
Import/ Export Status
-
Not all transactions will be important for deci-sion-making purposes: user input will be requiredto choose which transactions are relevant. Multiple star schemas at different levels of detailmay be required for a particular transaction. When transaction entities are connected in amaster-detail structure, they should be com-bined into a single star schema, as they repre-sent different views of the same event. The splitinto master and detail is simply a require-ment of normalization (1NF).
2.2 Define Level of Summarization
One of the most critical decisions in star schemadesign is to choose the appropriate level of granular-itythe level of detail at which data is stored. In tech-nical terms, granularity is defined as what each row inthe fact table represents. At the top level, there are twomain options in choosing the level of granularity:
Unsummarized (transaction level granularity):this is the highest level of granularity where
each fact table row corresponds to a singletransaction or line itemSummarized: transactions may be summarizedby a subset of dimensions or dimensionalattributes. In this case, each row in the facttable corresponds to multiple transactions
The lower the level of granularity (or conversely, thehigher the level of summarization), the less storagespace required and the faster queries will be exe-cuted. However, the downside is that summarizationalways loses information and therefore limits thetypes of analyses that can be carried out.Transaction-level granularity provides maximumflexibility for analysis, as no information is lost fromthe original normalized model.
A common misconception is that star schemas alwayscontain summarized datathis is not necessarily thecase and is not always desirable. However, for perfor-mance reasonswhen dealing with large datavolumesor to simplify the view of data for a partic-
ular set of users, somelevel of summarizationmay be necessary. In prac-tice, a range of starschemas at different levelsof granularity are gener-ally created for each trans-action entity to suit theneeds of different users.An important design con-sideration is that all ofthese star schemas shoulduse exactly the samedimension tables or viewsdefined on these tablesthese are called conform-ing dimensions (Kimball,1996, 2002). This ensuresthat users can drillacross from one star
16 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003
BRIDGING THE GAP
Figure 8. Classification of Entities
Diagram Key
Product
Order CustomerMarket Segment
Sales District Store Type City
Retail Outlet
Order Item
Stock Level Warehouse
ComponentEntity
Customer Contact Number
Product AvailabilitySupplier
Market Sector
Brand
Package Type
Postal Area
Sales Region
Industry Type Industry ClassIndustry GroupProduct
Substitution
Employee
Reporting Relationship
Position Type
Contact Number
Type
Delivery Method Carrier
State Country
Priority
Credit Terms
Product Subcategory
Product Category
Tax Status
Delivery Charge
Employment Status
Union Status
Education Level
Inventory Status
Import/ Export Status
ClassificationEntity
TransactionEntity
-
schema to another. Facts should also be defined con-sistently between different star schemas, to avoidinconsistent results from queries.
There is an almost infinite range of possibilities for cre-ating summary level star schemas for a given transactionentity. In general, any combination of dimensions ordimensional attributes may be used to summarize trans-actions. For example, orders could be summarized by:
Month (an attribute of the Date dimension)and Retail Outlet: this gives monthly salestotals for each retail outlet.Date, Product, and City (an attribute of theRetail Outlet dimension): this gives dailyproduct sales for each city.
These specify the group by conditions for starschema load processes.
2.3 Identify Relevant Dimensions
The component entities associated with each transac-tion entity represent candidate dimensions for the starschema. However, the correspondence between compo-nent entities and dimensions is not always one to one:
Not all component entities may be relevant forpurposes of analysis or for the level of granular-ity chosen. If a component entity has no dependentattributes, its key will be included in the facttable but it will not have its own dimensiontable. This is called a degenerate dimension.Explicit dimensions are required to representdates and times. Date and/or Time appear asexplicit dimensions in most star schemas tosupport different types of historical analyses.These are not normally represented as entitiesin operational systems, as they are handled bybuilt-in DBMS functions. These correspondto data types rather than entities at the opera-tional level.
Figure 9 shows the transaction level star schema forthe Order/Order Item transaction. Each row in thefact table corresponds to an individual Order Item(line item granularity).
The star schema has six dimension tables correspond-ing to the five component entities associated with thetransaction, plus an additional Date dimension.There are multiple relationships to the Date dimen-sion corresponding to the date the order was madeand when it was posted to the general ledger. Orderforms a degenerate dimension. If the time of orders isimportant for analysis, an additional Time dimensionwill be required. Date and Time are usually repre-sented as separate dimensions to reduce the size ofthe dimension tables (Kimball, 1996).
As discussed earlier, transaction entities connected ina master-detail structure should be combined into asingle star schema. Component entities associatedwith either of the entities become dimensions in theresulting star schema. In most cases, the masterentity will become a degenerate dimension in thestar schema: it forms part of the key of the fact tablebut has no corresponding dimension table (Figure10). The reason for this is that all attributes of the
BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 17
BRIDGING THE GAP
Figure 9. Transaction Level Star Schema for Order/Order ItemTransaction Entities
Order Item Fact
Delivery Type
Customer
Date
ProductWHO Dimension
HOW Dimension
WHAT Dimension
WHEN Dimension
Employee
WHO Dimension
order posted
Retail Outlet
WHERE Dimension
Order
Degenerate Dimension
-
master entity should be allocateddown to the item level and storedin the fact table.
Figure 11 shows the transaction-level star schema for the StockLevel transaction entity, which isan example of a snapshot or levelentity. Each row in the fact tablerecords the inventory level of aparticular product in a particularwarehouse on a particular date.The star schema has three dimen-sion tables, corresponding to thetwo component entities associated with the transac-tion entity plus an explicit Date dimension.
When non-transaction level granularity is chosen,the dimensions required will be determined by howthe transactions are summarized. This will generallybe a subset of the dimensions used in the transactionlevel star schema, though dimensions may them-selves be subsets of the transaction level dimensiontables (e.g., Month instead of Date). Figure 12shows a summary level star schema in which dailyorders are summarized by retail outlet and product.This requires three dimension tables: Date, RetailOutlet, and Product.
Step 3: Detailed Fact Table Design
In this step, we complete the detailed design for facttables in each star schema (attribute level design).
3.1 Define Key
The key of each fact table is a composite key con-sisting of the keys of all dimension tables plus anydegenerate dimensions. This is different from howthe key is defined for a normal relational table. Ina relational database, the key should be minimal:no subset of the attributes in the key should alsobe unique. However, the key of a fact tableincludes all dimensional keys, and will therefore
not be minimal in most cases. For example, inFigure 13 (the fact table for the Order transac-tion), the key consists of eight attributes, but onlytwo (Order Number and Product ID) are necessaryfor uniqueness.
3.2 Define Facts
The non-key attributes of the fact table are mea-sures (facts) that can be analyzed using numericalfunctions. What facts are defined depends on whatevent information is collected by operationalsystemsthat is, what attributes are stored in trans-action entities. However, a key concept in defining
18 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003
BRIDGING THE GAP
Figure 10. Merging of "Master-Detail" Entities
Time Dimension(not required at
operational level)
DeliveryMethod
Dimension
CustomerDimension
EmployeeDimension
Retail OutletDimension
ProductDimension
OrderItemFact
DateDimension
Order(degeneratedimension)
DeliveryMethod
Customer
Product
Order
OrderItem
Employee
RetailOutlet
Figure 11. Transaction Level Star Schema for Stock Level Transaction Entity
WHENDimension
WHEREDimension
WHATDimension
Warehouse ProductInventory
LevelFact
Date
-
facts is that of additivity. Kimball (1996) defines threelevels of additivity:
Fully additive facts: these are facts that can bemeaningfully added across all dimensions. Forexample, Quantity Ordered in the Order Itementity can be added across dates, products, orcustomers to get total sales volumes for a par-ticular day, product, or customer.Semi-additive facts: these are facts that can bemeaningfully added across some dimensions butnot others (usually time). For example, QuantityOn Hand from the Stock Level entity can beadded across products and warehouses (to get thetotal quantity on hand for a particular product orwarehouse) but not across time (as this wouldlead to double counting of stock). Quantity OnHand can be averaged over time (to get averagedaily inventory levels) but not added.Non-additive facts: these are facts that cannotbe meaningfully added across any dimensions.For example, Unit Price from the Order Itementity cannot be added across any dimensions.Unit Price cannot even be sensibly averagedover time, because to calculate the averageselling price of a product you need to take intoaccount the quantity sold.
Wherever possible, additive facts should be used toprevent errors in queries. This means converting semi-and non-additive facts to additive facts. In theexample, Unit Price (a non-additive fact) can be con-verted to an additive fact by multiplying it byQuantity Ordered to form the Extended Price (thetotal price for an individual order item). Note that thistransformation does not lose information: the UnitPrice for a particular order item can be derived bydividing the Extended Price by the Quantity Ordered.
Figure 13 shows the detailed fact table design for thetransaction level (non-summarized) Order Item starschema. Each row in the table corresponds to anindividual order item.
The key of the fact table consists of the keys ofall the dimension tables, including the degener-ate dimension. The non-key attributes are the union of theattributes in the two original transaction enti-ties (Order and Order Item). However, a non-additive fact (Unit Price) has been converted toan additive fact (Extended or Item Price =Order Quantity x Unit Price).
In this case, no data is lost from the original (normal-ized) data modelwe can reconstruct details of indi-vidual orders down to the line item level.
For transaction entities connected in a master-detailstructure, all attributes of the master record should beallocated down to the item level if possible (Kimball,1995). For example, if discount is defined at themaster (Order) level, the total discount amountshould be allocated at the item level (e.g., in propor-tion to the price for each item). Otherwise, querieswill result in multiple counting of discounts. Thesame should be done to delivery charges, order leveltaxes, fees, etc. If attributes of the master entitycannot be allocated down to the item level, separatestar schemas may be required for each.
BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 19
BRIDGING THE GAP
Figure 12. Summary Star Schema for Order Transaction
WHENDimension
WHEREDimension
WHATDimension
RetailOutlet
ProductDailySales
Summary
Date
-
Figure 14 shows the detailed fact table design for thesummary level star schema of Figure 12. Each row inthis fact table summarizes sales by date, product, andretail outlet.
The key of the fact table consists of the keys ofall the dimension tables. The non-key attributes of the fact table are allsummarized facts. Total Sales Quantity is thetotal daily quantity sold for each product in eachretail outlet (calculated as the sum of OrderQuantity grouped by Date, Product ID, andRetail Outlet ID). Total Sales Amount is the totaldaily revenue for each product in each retailoutlet (calculated as the sum of Order Quantity xUnit Price). Total Discount Amount is the totaldaily discount by product and retail outlet (thesum of Discount Amount). An additional fact,Number of Orders, counts the number of sepa-rate orders by day, product, and retail outlet,which is a semi-additive fact. This can be addedacross Retail Outlets and Dates but not acrossProducts, as this would result in double counting.Including such a fact is questionable, as it couldlead to erroneous conclusions by end users.
Such summarization loses information compared to the original (normalized) model: we cannot
reconstruct details of individual order items fromthis star schema. This means that some queries arenot possible: for example, it is possible to deter-mine the total sales of a product on any given day,but not the characteristics of the customers whobought them.
In some cases, fact tables may contain no numericalfacts at all. For example, in a star schema thatrecords the incidence of crimes, the fact table mayrecord when and where a crime took place, the typeof offense committed, and who committed it.However, there are no measures involvedtheimportant information is simply that the crime took
place. This is called a factless fact table. Other examplesof factless fact tables include traffic accident, studentenrollment, and disease statistics. The only relevantaggregation operation for such types of events isCOUNT, which can be used to analyze the frequencyof events over time.
Step 4: Detailed Dimension Table Design
In this step, we complete the detailed design for dimen-sion tables in each star schema (attribute level design).This completes the detailed design of the star schema.
4.1 Define Dimensional Key
The key of each dimension table should be a simple(single attribute) numeric key. In most cases, this willjust be the key of the underlying component entity.However, the operational key may need to be general-
20 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003
BRIDGING THE GAP
Figure 14. Daily Sales Summary Fact Table
WHENDimension
WHEREDimension
WHATDimension
RetailOutlet
ProductDailySales
Summary
DateFacts: Total Sales Quantity Total Sales Amount ($) Total Discount Amount ($) Number of Orders
Daily Sales Summary
Date
Product_ID
Retail_Outlet_ID
Total Sales Quantity
Total Sales Amount
Total Discount Amount
Number of Orders
Figure 13. Transaction Level Order Item Fact Table
Order Item Fact
Delivery Type
Customer
Date
Product
WHO Dimension
HOW Dimension
WHAT Dimension
WHEN Dimension
Employee
WHO Dimension
order posted
Retail Outlet
WHERE Dimension
Order
Degenerate Dimension
Order Item Fact
Order_Number
Product_ID
Customer_ID
Retail_Outlet_ID
Employee_ID
Delivery_Method
Order_Date
Posted_Date
Order_Quantity
Extended_Price
Discount_AmountFacts: Order Quantity Extended Price ($) Discount Amount ($)
-
ized to ensure that it remains unique over time.Because operational systems often only require keyuniqueness at a point in time (i.e., keys may be reusedover time), this may cause problems when performinghistorical analysis. Another situation where thedimensional key needs to be generalized is in the caseof slowly changing dimensions.
4.2 Collapse Hierarchies
Dimension tables are formed by collapsing ordenormalizing hierarchies (defined by classificationentities) into component entities. Figure 15 showshow the hierarchies associated with the Customercomponent are collapsed to form the Customerdimension table. The resulting dimension table con-sists of the union of all attributes in the original enti-tiesas many descriptive attributes should beincluded as possible to support slicing and dicing ofdata in different ways. In practice, it is common fordimension tables to contain more than a hundredattributes. This process introduces redundancy in theform of transitive dependencies, which are violationsto third normal form (3NF) (Codd, 1970). Thismeans that the resulting dimension table is in secondnormal form (2NF).
4.3 Replace Codes and Abbreviations by Descriptive Text
To make the star schema as understandable as possible,codes and abbreviations in source data should beremoved and replaced by descriptive text (Kimball,1996). While such codes and abbreviations have animportant place in efficient processing of transactionsat the operational level, they simply obfuscate thingsin the data warehousing environment. Use of suchcodes and abbreviations places a cognitive load onthe end user to remember what they mean, some-thing users can well do without. Codes and abbrevia-tions may be optionally retained in the star schema ifthey add to understanding.
Figure 16 shows the detailed star schema design for the Order transaction (line item granularity).
This shows attributes and keys for the fact table andall dimension tables:
All of the tables in the star schema are denor-malized to at least some extent: each table cor-responds to multiple entities in the originalnormalized ER model. The dimension tablesare the result of collapsing classification entitiesinto component entities, while the fact table isthe result of merging the master and detailtransaction entities.All of the dimension tables are in 2NF: theyhave simple keys and no repeating attributes.The fact table is in 3NF as Discount Amount(an attribute of the Order entity) has been allo-cated down to the line item level. This meansthat Order becomes a degenerate dimension asit has no other dependent attributes.The Date Dimension is a new table which doesnot correspond to any entity in the original ERmodel. This is because dates must be explicitlymodeled in a star schema, whereas at the opera-tional level they are represented as data types.
SummaryThis article has examined the nature of dimensionalmodeling and proposed a possible theoretical
BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 21
BRIDGING THE GAP
Figure 15. Collapsing Hierarchies
MarketSector
IndustryGroup
State
IndustryHierarchy
MarketHierarchy
LocationHierarchy Country
IndustryClass
CustomerIndustrySector
MarketSegment
PostalAreaCity
Customer Dimension
Customer_ID
Customer Name
Annual Revenue
Date of First Order
Market Segment Code
Market Segment Name
Market Sector Code
Market Sector Name
Industry Class Code
Industry Class Name
Industry Sector Code
Industry Sector Name
Industry Group Code
Industry Group Name
Postal Code
Postal Area Name
City Code
City Name
Population
State Code
State Name
Country Code
Country Name
-
explanation for why it has been so successful in prac-tice. It has also described an approach for derivingdimensional models directly from ER models. Thisprovides a bridge between operational system(OLTP) design and data warehouse (OLAP) design,which makes it easier for people trained in traditionaldatabase design techniques to learn dimensional design.
Deriving dimensional models from an ER modelprovides a more structured approach to developingdimensional models than starting from first principles, which can help avoid many of the pit-falls faced by inexperienced designers. It also helpsto resolve the difficult problem of matchingsupply (operational data sources) and demand(end-user information needs) in data warehousedesign. Finally, it results in a more completedimensional design, as it involves selecting from all possible ways of analyzing the data in the under-lying operational systems, and is therefore lessdependent on the designers ability to choose theright dimensions.
Relationship between ER Modeling
and Dimensional Modeling
In this paper, we have challenged the widelyaccepted view that dimensional modeling is fundamentally different from, and incompatiblewith, ER modeling. We have shown that a dimen-sional model is just a restricted form of an ERmodel and there is a straightforward transformationbetween the two. An ER model can be transformedinto a set of dimensional models by a process ofselective subsetting, denormalization, and(optional) summarization:
Subsetting: The ER model is partitioned intoa set of separate star schemas, each centeredon a single business event (transactionentity). This reduces complexity through aprocess of chunking.Denormalization: Hierarchies in the ER model
are collapsed to form dimension tables. Thisfurther reduces complexity by reducing thenumber of separate tables. Master-detail trans-action entities are denormalized into a singlefact table, as these represent different views ofthe same underlying event.Summarization: The most flexible dimensionalstructure is one in which each fact represents asingle transaction or line item. However, sum-marization may be required for performancereasons, or to suit the needs of a particulargroup of users.
At the detailed (attribute) design level, a range oftransformations are required:
Generalization of operational keys to ensureuniqueness of keys over time and/or to supportslowly changing dimensions.Conversion of non-additive or semi-additivefacts to additive facts.Substitution of codes and abbreviations withdescriptive text.
Advanced Design Issues
The transformation approach described in this articleresults in a fully detailed star schema design. However,in practice, dimensional modeling tends to be an iter-ative process. The first-cut dimensional model willalmost certainly need refinementas in most designproblems, there are always exceptions and special casesto deal with. In Part II of this discussion, (publishingin the Fall issue of the Business Intelligence Journal), wewill discuss advanced design issues that need to beconsidered in dimensional modeling:
Alternative dimensional structures: stars,snowflakes, and starflakesSlowly changing dimensionsMinidimensionsHeterogeneous star schemasDealing with nonhierarchical data
22 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003
BRIDGING THE GAP
-
REFERENCESBaddeley, A. D. The Magical Number Seven:
Still Magic After All These Years? PsychologicalReview, 101, 1994.
Baddeley, A. D., Essentials of Human Memory,Psychology Press, Hove [England], 1999.
Chen, P. P. The Entity Relationship Model:
Towards An Integrated View Of Data, ACMTransactions On Database Systems, 1, 9-36, 1976.
Codd, E. F. A Relational Database Model For LargeShared Data Banks, Communications of the ACM,13, 377-387, 1970.
Craig, R. The Schema Wars, ENT, 3, 30-32,1998.
BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 23
BRIDGING THE GAP
Figure 16. Order Item Star Schema (fully attributed)
Customer Dimension
Customer_ID
Customer Name
Number of Employees
Annual Revenue
Date of First Order
Market Segment Name
Market Sector Name
Industry Class Name
Industry Sector Name
Industry Group Name
Flat/Apartment Number
Street Number
Street Name
Street Type
Postcode
Postal Area Name
City Name
Population
State Name
Country Name
Account Manager
Source Entities:CustomerPostal AreaStateCityCountryMarket SegmentMarket SectorIndustry SectorIndustry SegmentIndustry Class
Source Entities:NONE(corresponds to a data type at the operational level)
Source Entities:Delivery MethodCarrierPriority
Source Entities:EmployeePosition TypeEmployment StatusEducation LevelUnion Status
Source Entities:Retail OutletSales DistrictSales RegionPostal AreaCityStateCountryStore Type
Delivery MethodDimension
Delivery_Method_ID
Delivery Method
Priority
Carrier Name
Date Dimension
Date_Key
Day Name
Day Number
Day Number in Year
Week Number in Year
Month Name
Month Number
Quarter
Fiscal Period
Year
Holiday
Weekend
Season
Event
Country Name
Employee Dimension
Employee_ID
Employee Name
Age
Sex
Employment Status
Position Type
Commencement Date
Departure Date
Salary Level
Union Status
Last Performance Rating
Education Level
Source Entities:ProductPackage TypeTax StatusBrandProduct CategoryProduct SubcategoryUnit TypeImport StatusExport StatusInventory Status
Product Dimension
Product_ID
Product Name
Brand
Product Subcategory
Product Category
Package Type
Package Size
Unit Type
Product Manager
Import Status
Export Status
Inventory Status
Order Item Fact
Order_Number
Product_ID
Customer_ID
Retail_Outlet_ID
Employee_Number
Delivery_Method
Order_Date
Posted_Date
Order_Quantity
Total_Price
Discount_Amount
Retail Outlet Dimension
Retail_Outlet_ID
Retail Outlet Name
Floor Space
Store Type
Number of Employees
Sales Region
Sales District
Flat/Apartment Number
Street Number
Street Name
Street Type
Postcode
Postal Area Name
City Name
Population
State Name
Country Name
Parking Space
Store Manager
Source Entities:OrderOrder Item
-
Flood, R. L. and Carson, E. R. Dealing WithComplexity: An Introduction To The Theory AndApplication Of Systems Science, Plenum Press, 1993.
Gallas, S. Put Data in its Place! DM DirectBusiness Intelligence Newsletter, 1999.
Glassey, K. Seducing the End User,Communications of the ACM, 1998.
Kent, W. A Simple Guide To The Five NormalForms Of Relational Database Theory,Communications Of The ACM, 1983.
Kimball, R. Is ER Modeling Hazardous to DSS?DBMS Magazine, 1995.
Kimball, R. The Data Warehouse Toolkit, JohnWiley and Sons, New York, 1996.
Kimball, R. A Dimensional Manifesto, DBMSOnline, 1997.
Kimball, R. The Data Warehouse Toolkit: TheComplete Guide to Dimensional Modeling, JohnWiley and Sons, New York, 2002.
Klir, G. J. Architecture of Systems ProblemSolving, Plenum Press, New York, 1985.
Kuhn, T. S. The Structure Of ScientificRevolutions, University Of Chicago Press, 1970.
Lipowski, Z. J. Sensory And Information InputsOverload: Behavioural Effects, ComprehensivePsychiatry, 16, 1975.
Maier, R. (1996) Benefits And Quality Of DataModeling-Results Of An Empirical Analysis, InProceedings Of The Fifteenth InternationalConference On The Entity Relationship Approach(Ed, Thalheim, B.), Elsevier, Cottbus, Germany.
Miller, G. A. The Magical Number Seven, PlusOr Minus Two: Some Limits On Our Capacity ForProcessing Information, The PsychologicalReview, 1956.
Moody, D. L. (2002) Complexity Effects On EndUser Understanding Of Data Models: AnExperimental Comparison Of Large Data ModelRepresentation Methods, In Proceedings of theTenth European Conference on InformationSystems (ECIS, 2002) Gdansk, Poland.
Newell, A. A. and Simon, H. A. Human ProblemSolving, Prentice-Hall, 1972.
Oates, J., The Great Data Warehouse Debate,Sybase White Paper, 2000.
Raisinghani, M. S., Adapting Data ModelingTechniques for Data Warehouse Design, Journal ofComputer Information Systems, 40, 2000.
Simon, H. A. Sciences Of The Artificial (3rdedition), MIT Press, 1996.
Spencer, T. and Loukas, T. From Star to Snowflaketo ERD, Enterprise Systems Journal, 1999.
Uhr, L., Vossier, C. and Weman, J. PatternRecognition over Distortions by Human Subjectsand a Computer Model of Human FormPerception, Journal of Experimental Psychology:Learning, 63, 1962.
Weber, R. A. Are Attributes Entities? A Study OfDatabase Designers Memory Structures,Information Systems Research, 1996.
Winter, R. and Strauch, B. Demand-DrivenInformation Requirements Analysis in DataWarehousing, Journal of Data Warehousing,V8N1, 2003.
24 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003
BRIDGING THE GAP