Dimensional Models Intro

BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 7

BRIDGING THE GAP

Dimensional modeling is a database design technique devel-

oped specifically for designing data warehouses. Its objectives

are to create database structures that end users can easily

understand and write queries against, and to optimize query

performance. It has become the predominant approach to

designing data warehouses in practice and has proven to be a

major breakthrough in developing databases that can be used

directly by end users. Dimensional modeling is not based on

any theory, but has clearly been very successful in practice.

This article, the first in a two-part series, examines the nature

of dimensional modeling and proposes a possible explanation

for why it has been so successful.

This article also explodes the popular myth that traditional ER

modeling and dimensional modeling are fundamentally different

and somehow incompatible. It shows that a dimensional model

is just a restricted form of an ER model, and that there is a

straightforward mapping between the two.

An ER model can be transformed into a set of dimensional

models by a process of selective subsetting, denormalization,

and (optional) summarization. Understanding the relationship

between the two types of models can help to bridge the gap

between operational system (OLTP) design and data ware-

house (OLAP) design. It can also help to resolve the difficult

problem of matching supply (operational data sources) and

demand (end-user information needs) in data warehouse

design. Finally, it results in a more complete dimensional

design, which is less dependent on the designers ability to

choose the right dimensions.

From ER Models toDimensional Models:Bridging the Gap between OLTP

and OLAP Design, Part I

Daniel L. Moody and Mark A. R. Kortink

Daniel L. Moody is a Visiting Professor in the Departmentof Software Engineering, Charles University, Prague

(visiting from Monash University, Melbourne, Australia)[email protected]

Mark A. R. Kortink is the Managing Director of Kortinkand Associates, an Australianbased IT management

consultancy company. [email protected]

IntroductionThe data warehousing or OLAP (online analyticalprocessing) environment is very different from theoperational or OLTP (online transaction processing)environment, and therefore requires a differentapproach to database design (Kimball, 1995, 1996,1997, 2002). There are two fundamental differencesthat suggest the need for different design approaches:

End-user access (visibility of database struc-ture): In a data warehousing environment, endusers access data directly using user-friendlyquery and analysis (OLAP) tools (Glassey, 1998).This means users need to understand how data isstructured in order to use the database effectively(though some of the details may be hidden bythe OLAP tool). In operational systems, thedatabase structure is effectively hidden by anapplication front endusers only access thedatabase via data entry and enquiry screens.Read-only access (query versus update per-formance): In a data warehousing environ-ment, users can retrieve data but cannot modifyit, whereas in an operational environment,users make online updates to the database inresponse to business events. For most practicalpurposes, therefore, the data warehouse can beconsidered as a read-only database, as it is onlyupdated via batch extract, transform, and load(ETL) processes.

Traditional database design techniques such as ERmodeling and normalization are primarily designedfor getting data efficiently and reliably into databases.However, structuring databases in this way also actsas a barrier to getting data out. The reason for this isthat such methods tend to result in database struc-tures which are too complex for end users to under-stand and write queries against. The averageoperational database consists of around a hundredtables, linked by a complex web of relationships,which most end users (not to mention most IT

specialists) would find overwhelming. Even simplequeries require multi-table joins, which are beyondthe capabilities of most end users.

The primary cause of complexity in operationaldatabases is the use of normalization. Normalizationtends to multiply the number of tables, as it involvessplitting out functionally dependent groups ofattributes into separate tables. This minimizes dataredundancy and maximizes update efficiency becauseeach change can be made in a single place. It alsomaximizes data integrity by avoiding inconsistencyand preventing insertion, deletion, and updateanomalies (Codd, 1970). However, on the downside,it tends to penalize retrieval because of the number ofjoins required (Kent, 1983). Thus, normalizationresults in databases that are optimized for updaterather than retrieval. Given that data warehouses areeffectively read-only databases, normalization is notwell suited for designing data warehouses.

Dimensional ModelingDimensional modeling, a database design techniquedeveloped specifically for designing data warehouses,addresses the limitations of traditional databasedesign methods for such purposes. The approach wasfirst defined by Kimball (1996), though he claimsnot to have invented it but to have merely docu-mented what people had been doing in practice formany years. In particular, it was based on his obser-vations of vendors in the business of providing datain a user-friendly form to their customers. Theobjectives of dimensional modeling are to:

Produce database structures that are easy forend users to understand and write queriesagainst. Optimize query performance (as opposed toupdate performance).

Dimensional modeling was a pivotal development indata warehousing, as it provided a viable means of

8 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003

BRIDGING THE GAP

representing data in a way that end users couldunderstand and write queries against. Effectively thistransformed the dream of end users being able tosatisfy their own information needs (the self servicemodel of decision support as expounded by earlierdata warehousing theorists) into reality. Dimensionalmodeling has become the predominant approach todesigning data warehouses in practice and has provento be a major breakthrough in designing databasestructures that can be understood and used directlyby end users.

Star Schemas

Like most good ideas, dimensional modeling is verysimple. It is based on a single, highly regular datastructure called a star schema. This consists of acentral table called the fact table, surrounded by anumber of dimension tables.

The fact table forms the center of the star, whilethe dimension tables form the points of the star.

Fact tables typically correspond to particular types ofbusiness eventsfor example, orders, shipments,payments, bank transactions, airline reservations, andhospital admissions. Such tables usually containnumerical values (e.g., dollar amounts), which maybe analyzed using statistical functions. Dimensiontables provide the basis for analyzing data in the facttableways of slicing and dicing the data.Dimensions typically answer who, what, when,where, how, and why questions about the busi-ness events stored in the fact table. A star schemamay have any number of dimensions.

The terminology of dimensional model derivesfrom the fact that a star schema may be visualized asa data cube where each dimension table representsa different spatial dimension (or more generally ahypercube, as a star schema may have any number ofdimensions). As shown in Figure 2, each cell in thecube represents the intersection of values from each

dimension. Of course, this metaphor loses its intu-itive appeal once there are more than three dimen-sions, as most people have difficulty thinking inhyper-dimensional space!

Dimensions are often highly denormalized tables,typically consisting of one or more embedded hierar-chies. For example, Customer, which forms a singledimension table in the star schema of Figure 1, con-sists of three independent hierarchies when they arenormalized out (Figure 3). The hierarchical structureof dimensions provides the basis for analyzing data atdifferent levels of detailthe ability to drill downand roll up in OLAP tools.

Snowflake Schemas

A snowflake schema is a star schema with fully nor-malized dimensionsit gets its name because itforms a shape similar to a snowflake. Rather thanhaving a regular structure like a star schema, thearms of the snowflake can grow to arbitrary lengthsin each direction. Figure 4 shows the equivalentsnowflake schema for the star schema of Figure 1.While these two structures are equivalent in datacontent and the queries they support, the snowflakeschema has a much more complex structure.


BRIDGING THE GAP

Figure 1. Star Schema Example

Date

Product

WHAT Dimension

WHO Dimension

WHERE Dimension

WHEN Dimension

RetailOutletCustomer

Sales Summary(Fact Table)

However, the advantage of the snowflake structure isthat it explicitly shows the structure of each dimen-sion rather than appearing as an unstructured collec-tion of data items.

Kimball (1996, 2002) arguesthat snowflaking is undesir-able because it adds unneces-sary complexity, reduces queryperformance, and doesnt sub-stantially reduce storage space.In the example, the snowflakeschema has more than fivetimes as many tables as theequivalent star schema.However, empirical tests showthat the performance impact ofsnowflaking depends on theDBMS and/or OLAP toolused: some favor snowflakeswhile others favor star schemas(Spencer and Loukas, 1999).An advantage of the snowflakerepresentation is that it explic-itly shows the hierarchicalstructure of each dimension

in a star schema this requires tacit knowledge on thepart of the user. While this may be overwhelmingwhen presented all at once, it may be useful to allowend users to see the underlying structure of individualdimensions to understand how data can be sensibly


BRIDGING THE GAP

Figure 2. Data "Cube"

Product

Customer

Date

Figure 3. Embedded Hierarchies in Dimension Tables

MarketSector

IndustryGroup

Country

IndustryHierarchy

MarketSegmentation

Hierarchy

LocationHierarchy

IndustryClass

Customer

Customer IDCustomer NameMarket SegmentMarket SectorIndustry ClassIndustry SectorIndustry GroupCityStateCountry

IndustrySector

MarketSegment

CityState

Figure 4. Snowflake Schema

Date

FinancialQuarter SeasonHoliday

FianacialYear

Product

PackageSize

Brand

ProcessType

ManufacturingGroup

ManufacturingSubgroup

ProductCategory

ProductSubcategory

Department

RetailOutlet Suburb City

MarketingRegion

MarketingDivision

StoreLayout

StoreType

State CountryCustomerIndustryClassIndustrySector

MarketSegment

MarketSector

IndustryGroup

CityStateCountry

Sales Summary(Fact Table)

ProductSource

analyzed. This suggests that whether a snowflake or astar schema is used at the physical level, views shouldbe constructed so that the user can see the structure ofeach dimension if required.

Strengths and Weaknesses of Dimensional ModelsThe major advantage of using dimensional models to represent data is that they drastically reduce thecomplexity of the database structure. This makes the database easier for people to understand andwrite queries against, by minimizing the number oftables and therefore the number of joins required.Dimensional models also optimize query performance.A star schema has a fixed structure that has no alterna-tive join paths, which greatly simplifies the evaluationand optimization of queries (Raisinghani, 2000).

However, its strength is also its weakness. The fixedstructure of the star schema limits the queries thatcan be written to the dimensions that have beendefined. This means the designer must have a goodidea in advance about the sort of questions users willwant to ask (Craig, 1998). Star schemas make themost common queries easy to write but also restrictthe way the data can be analyzed (Gallas, 1999).Another weakness of dimensional models is that notall information can be represented in dimensionalform. Dimensional models assume an underlyinghierarchical structure of data and exclude data that isnaturally non-hierarchical (e.g., network structureddata). Thus, dimensional modeling at best provides apartial solution (an 80/20 solution) to the problemof database query and analysis.

Theoretical Explanation: Why Dimensional Modeling WorksDimensional modeling is not based on any theory,but has clearly been very successful in practice. It is amajor contribution to the database design fieldasimportant in practical terms as Codds contributionof normalization (Codd, 1970) and Chens contribution

of ER modeling (Chen, 1976) and its influence onthe practice of data warehouse design cannot be overstated. For these reasons, it is important tounderstand why it works. We offer a theoreticalexplanation for dimensional modelings success indesigning databases end users can understand andwrite queries against.

Principle #1: Chunking

Psychological studies show that due to limits onshort-term memory, humans have a strictly limitedcapacity for processing informationthis is esti-mated to be the magical number seven, plus orminus two concepts concurrently (Miller, 1956;Newell and Simon, 1972; Baddeley, 1994). If theamount of information received exceeds these limits,information overload ensues and comprehensiondegrades rapidly (Lipowski, 1975).

In one of the most important papers in the history ofpsychology, Miller (1956) described the results of aseries of experiments using a wide range of differentexperimental stimuli (sounds, tastes, colors, points,numbers, and words). He observed a surprising simi-larity in human channel capacity across all theseexperiments, with a mean of 6.5 concepts, and onestandard deviation, including 4 to 10 concepts (Figure5). This has been described as the inelastic limit ofhuman capacity (Uhr et al., 1962) or cognitivebandwidth (Moody, 2002), and represents one of theenduring laws of human cognition (Baddeley, 1994).

The primary mechanism used by the human mind tocope with large amounts of information is to orga-nize it into chunks of manageable size (Weber,1996; Baddeley, 1999; Miller, 1956). The ability torecursively develop information-saturated chunks isthe key to peoples ability to deal with complexityevery day. The process of organizing data into a set ofseparate star schemas effectively provides a way oforganizing a large amount of data into cognitivelymanageable chunks. Although Kimball never refers


BRIDGING THE GAP

to the seven plus or minus two principle, all theexamples he uses in his original book (Kimball,1996) implicitly obey this rule: an average of 6.6entities per star schema, with a maximum of 11 anda minimum of 4. This corresponds remarkably well(in fact, almost perfectly) to the limits of humanchannel capacity defined by Miller!

Principle #2: Hierarchical Structuring

Hierarchy is one of the most common ways of organiz-ing complexity for the purposes of human understand-ing, and is one of the basic principles of systems theory(Flood and Carson, 1993; Klir, 1985). Herbert Simon,the dual Nobel Prize winner, observed the use of hierar-chy to organize complexity in organizational systems,social systems, biological systems, physical systems, andsymbolic systems (Simon, 1996). Based on this, he pro-posed hierarchical organization as a general architecturefor complexity. Hierarchical structures act as a com-plexity management mechanism by reducing thenumber of items one has to deal with at each level ofthe hierarchy. For example, instead of having to dealwith a million customers, one can group them into ahundred market sectors and then into 10 market seg-ments. Hierarchical structures are a familiar and naturalway of organizing information, and are all around us in

everyday lifefor example, books are organized hierar-chically, Web sites are organized hierarchically, the orga-nizations we work in are organized hierarchicallyeventhis article is organized hierarchically.

As discussed earlier, each dimension in a star schematypically consists of one or more hierarchies. Theseprovide a way of classifying business events stored inthe fact table, thereby reducing complexity. It is thishierarchical structure that provides the ability toanalyze data at different levels of detail, and to rollup and drill down in OLAP tools. As we will seelater, the process of building dimensional models islargely one of extracting hierarchical structures fromenterprise data.

Implications for Practice

Together these principles provide a possible explana-tion for why dimensional modeling has been so suc-cessful in practice. Unlike traditional database designtechniques, dimensional modeling is not based onany theory but simply evolved as a result of practice.However, closer examination shows that it is basedon some general organizational principles (from psy-chology and systems theory) that have been used tomanage complexity in a wide range of domains.

This is hardly surprising, as one of the most impor-tant problems in data warehousing is complexitymanagement: how to present enormous volumes ofdata in a way that decision makers can understand.Current practice in dimensional modeling has moreof the characteristics of an art than a science, andrelies heavily on common-sense, experience, andrules of thumb. Understanding the underlyingprinciples on which dimensional models are basedmay help to develop principles for good dimen-sional modeling, which are largely lacking. Forexample, the seven plus or minus two principlecould be used to define a limit on the number ofdimensions in a star schema, to ensure that it repre-sents a conceptually manageable chunk of data.


BRIDGING THE GAP

Figure 5. Human Channel Capacity or Cognitive Bandwidth (Miller, 1956)

Number of Variable Aspects

PointsColors

Tastes

Tones

Chan

nel C

apac

ity

00

2

4

6

8

10

1 2 3 4 5 6 7

From ER Models to Dimensional Models:Exploding the MythsThere is a common misconception that dimensionalmodeling is fundamentally different from, andincompatible with, ER modeling. For example, asKimball (1996) says:

Entity relation models are a disaster for queryingbecause they cannot be understood by users andcannot be navigated usefully by DBMS software.Entity relation models cannot be used as the basisfor enterprise data warehouses.

As a result, the starting point for most dimensionaldesign approaches is that you must forget everythingyou ever learned about database design (Kimball,1996, 1997, 1995, 2002; Oates, 2000). Sadly, this isa common theme in the introduction of any newtechnology, where proponents of the new technologyclaim that a whole new approach is needed, oftenwhen they do not fully understand the oldapproaches themselves. It is also part of good market-ing to claim that something is totally new. We thinkit is important to dispel this notion, as it holds backthe discipline of data warehouse design in twoimportant ways:

It acts as a barrier for people trained in tradi-tional database design techniques to learndimensional design. ER modeling and normal-ization have been used in practice to designdatabase schemas for over two decades, and aretaught as part of virtually every university levelIT curriculum. Clearly, it would be useful tobuild on this wealth of knowledge and exper-tise than to simply discard itfor one thing,people learn better by building on what theyalready know.

It acts as a barrier to establishing a cumulativetradition in the field, which is essential for sci-entific progress (Kuhn, 1970). Rather than

simply saying that data warehousing is some-thing totally new and different and trying toestablish a new design discipline from its foundations, dimensional modeling should be linked to the existing body of knowledge in database design.

On closer examination, we find that a star schema isjust a restricted form of an ER model (Figure 6):

There is a single entity called the fact table,which is in third normal form (3NF).Violations to second normal form (2NF)would result in double counting in queries.The fact table forms an n-ary intersectionentity (where n is the number of dimensions)between the dimension tables, and includes thekeys of all dimension tables.There are one or more entities called dimen-sion tables, each of which is related to the facttable via one or more one-to-many relation-ships. Dimension tables have simple keys, andare in at least 2NF. Transitive dependencies(3NF violations) are allowed, but partial depen-dencies (2NF violations) and repeating groups(1NF violations) are not.

As we will see, there is a straightforward transforma-tion between a normalized ER model and a dimen-sional model, involving selective subsetting,denormalization, and (optional) summarization.

Deriving Dimensional Models from ER ModelsCurrent approaches to dimensional design mostlytake a top-down approach, in which dimensionalmodels are developed directly from user query andanalysis requirements (demand side analysis). While this may seem like a sound, customer-drivenapproach, it ignores one of the fundamental realitiesof the data warehouse environment. To a large extent, a data warehouse is simply a repackaging of


BRIDGING THE GAP

operational data in a more accessible form. Datawarehouse design is therefore a highly constraineddesign problem limited by what data is available from operational systems (supply issues).

In this article, we describe how to derive dimensionalmodels from (normalized) ER models. This provides away of repackaging operational data (usually stored innormalized tables) into a form that end users canunderstand and write queries against. This helps toresolve the difficult problem of matching supply(operational data sources) and demand (managementinformation needs) in data warehouse design (Winterand Strauch, 2003). Deriving dimensional modelsfrom ER models also provides a more structuredapproach to dimensional design than starting fromfirst principles. There is a large conceptual leap ingetting from end-user analysis requirements to dimen-sional models, with which all but the most gifteddesigners have difficulty. Experience shows that there isa very steep learning curve involved and a great deal ofexperience required before people can successfullydevelop dimensional models on their own. Theapproach described in this paper can reduce this learn-ing curve by providing a stepwise approach to develop-ing dimensional models starting with the source data.

To illustrate the approach, we use an ER model foran order processing system (Figure 7). This con-sists of 42 entities, which is less than half the sizeof the average application data model, but largeenough to illustrate the main principles of theapproach. We will refer to this example through-out this article and the next (on advanced designissues, to appear in the Fall issue of the BusinessIntelligence Journal).

Even in a model of this size, the problem of complex-ity in normalized ER models is clearly evident. Mostend users would find this schema incomprehensible,as it exceeds the limits of human cognitive capacitymany times over. Even quite simple queries wouldrequire multi-table joins and/or subqueries, and as aresult, end users would be heavily reliant on technicalspecialists to write queries for them.

The transformation of an ER Model to dimensionalform takes place in four steps:

Step 1: Classify entitiesStep 2: Design high-level star schema Step 3: Design detailed fact table Step 4: Design detailed dimension table

Step 1. Classify Entities

The first step in the transformation approach is toclassify entities into three distinct categories:

Transaction Entities: These entities recorddetails of business events (e.g., orders, ship-ments, payments, insurance claims, bank trans-actions, hotel bookings, airline reservations,and hospital admissions). Most decisionsupport applications focus on these events toidentify patterns, trends, and potential problemsand opportunities. The exception to this rule is the case of snapshot entities: entitiesrecording a static level of some commodity at a point in time (e.g., account balances and


BRIDGING THE GAP

Figure 6. ER Representation of a Star Schema

TIMEDimension

PRODUCTDimension

WHATWHEN

DateDay of weekMonthQuarterYearHoliday flag

Product IDProduct nameBrandProduct category

SALESFact

STOREDimension

WHERE

DateProduct IDStore IDDollars soldQty soldDollars cost

Store IDStore nameAddressFloor plan type

inventory levels). These do not record businessevents as such, but the effect of events on thestate of an entity.Component Entities: These entities are directlyrelated to a transaction entity by a one-to-manyrelationship. They are involved in the businessevent, and answer who, what, where,how, and why questions about the event. Classification Entities: These entities are relatedto a component entity by a chain of one-to-many relationships. These define embeddedhierarchies in the data model and are used toclassify component entities.

The classification of entities for the example datamodel is shown in Figure 8. There are three trans-action entities in the model: Order, Order Item,and Stock Level. Order and Order Item correspondto business events, while Stock Level records astatic level (a snapshot entity). Order and OrderItem represent different levels of detail of the samebusiness event.

There are six component entities:Customer, Employee, Retail Outlet,Product, Warehouse and DeliveryMethod. Product is a component oftwo different transactions (StockLevel and Order Item).

Finally, there are 31 classificationentities defining separate but par-tially overlapping hierarchies inthe model.

Note that some of the entities inthe underlying data model do notfit into any of these categories.Such entities do not fit the hierar-chical structure of a dimensionalmodel and therefore cannot be rep-resented in the form of a starschema (with some exceptions, to

be discussed in the next issue of the BusinessIntelligence Journal). Thus, the process of dimension-alizing an ER model effectively weeds out non-hierarchical data.

Step 2: High-Level Star Schema Design

In this step, relevant star schemas are identified andtheir high-level structures defined (table leveldesign). At a high level, transaction entities corre-spond to fact tables and component entities corre-spond to dimension tables, but the mapping is notalways one to one.

2.1 Identify Star Schemas Required

Each transaction entity is a candidate for a starschema. The process of creating star schemas is oneof subsettingdividing a large and complex modelinto manageable-sized chunks. Each star schema iscentered on a single business event, and thereforerepresents a manageable-sized chunk of data.However, there is not always a one-to-one correspon-dence between transaction entities and star schemas:


BRIDGING THE GAP

Figure 7. ER Model for Retail Sales System

Product

Order CustomerMarket Segment

Sales District Store Type City

Retail Outlet

Order Item

Stock Level Warehouse

Customer Contact Number

Product AvailabilitySupplier

Market Sector

Brand

Package Type

Postal Area

Sales Region

Industry Type Industry ClassIndustry GroupProduct

Substitution

Employee

Reporting Relationship

Position Type

Contact Number

Type

Delivery Method

Carrier

State Country

Priority

Credit Terms

Product Subcategory

Product Category

Tax Status

Delivery Charge

Employment Status

Union Status

Education Level

Inventory Status

Import/ Export Status

Not all transactions will be important for deci-sion-making purposes: user input will be requiredto choose which transactions are relevant. Multiple star schemas at different levels of detailmay be required for a particular transaction. When transaction entities are connected in amaster-detail structure, they should be com-bined into a single star schema, as they repre-sent different views of the same event. The splitinto master and detail is simply a require-ment of normalization (1NF).

2.2 Define Level of Summarization

One of the most critical decisions in star schemadesign is to choose the appropriate level of granular-itythe level of detail at which data is stored. In tech-nical terms, granularity is defined as what each row inthe fact table represents. At the top level, there are twomain options in choosing the level of granularity:

Unsummarized (transaction level granularity):this is the highest level of granularity where

each fact table row corresponds to a singletransaction or line itemSummarized: transactions may be summarizedby a subset of dimensions or dimensionalattributes. In this case, each row in the facttable corresponds to multiple transactions

The lower the level of granularity (or conversely, thehigher the level of summarization), the less storagespace required and the faster queries will be exe-cuted. However, the downside is that summarizationalways loses information and therefore limits thetypes of analyses that can be carried out.Transaction-level granularity provides maximumflexibility for analysis, as no information is lost fromthe original normalized model.

A common misconception is that star schemas alwayscontain summarized datathis is not necessarily thecase and is not always desirable. However, for perfor-mance reasonswhen dealing with large datavolumesor to simplify the view of data for a partic-

ular set of users, somelevel of summarizationmay be necessary. In prac-tice, a range of starschemas at different levelsof granularity are gener-ally created for each trans-action entity to suit theneeds of different users.An important design con-sideration is that all ofthese star schemas shoulduse exactly the samedimension tables or viewsdefined on these tablesthese are called conform-ing dimensions (Kimball,1996, 2002). This ensuresthat users can drillacross from one star


BRIDGING THE GAP

Figure 8. Classification of Entities

Diagram Key

Product

Order CustomerMarket Segment

Sales District Store Type City

Retail Outlet

Order Item

Stock Level Warehouse

ComponentEntity

Customer Contact Number

Product AvailabilitySupplier

Market Sector

Brand

Package Type

Postal Area

Sales Region

Industry Type Industry ClassIndustry GroupProduct

Substitution

Employee

Reporting Relationship

Position Type

Contact Number

Type

Delivery Method Carrier

State Country

Priority

Credit Terms

Product Subcategory

Product Category

Tax Status

Delivery Charge

Employment Status

Union Status

Education Level

Inventory Status

Import/ Export Status

ClassificationEntity

TransactionEntity

schema to another. Facts should also be defined con-sistently between different star schemas, to avoidinconsistent results from queries.

There is an almost infinite range of possibilities for cre-ating summary level star schemas for a given transactionentity. In general, any combination of dimensions ordimensional attributes may be used to summarize trans-actions. For example, orders could be summarized by:

Month (an attribute of the Date dimension)and Retail Outlet: this gives monthly salestotals for each retail outlet.Date, Product, and City (an attribute of theRetail Outlet dimension): this gives dailyproduct sales for each city.

These specify the group by conditions for starschema load processes.

2.3 Identify Relevant Dimensions

The component entities associated with each transac-tion entity represent candidate dimensions for the starschema. However, the correspondence between compo-nent entities and dimensions is not always one to one:

Not all component entities may be relevant forpurposes of analysis or for the level of granular-ity chosen. If a component entity has no dependentattributes, its key will be included in the facttable but it will not have its own dimensiontable. This is called a degenerate dimension.Explicit dimensions are required to representdates and times. Date and/or Time appear asexplicit dimensions in most star schemas tosupport different types of historical analyses.These are not normally represented as entitiesin operational systems, as they are handled bybuilt-in DBMS functions. These correspondto data types rather than entities at the opera-tional level.

Figure 9 shows the transaction level star schema forthe Order/Order Item transaction. Each row in thefact table corresponds to an individual Order Item(line item granularity).

The star schema has six dimension tables correspond-ing to the five component entities associated with thetransaction, plus an additional Date dimension.There are multiple relationships to the Date dimen-sion corresponding to the date the order was madeand when it was posted to the general ledger. Orderforms a degenerate dimension. If the time of orders isimportant for analysis, an additional Time dimensionwill be required. Date and Time are usually repre-sented as separate dimensions to reduce the size ofthe dimension tables (Kimball, 1996).

As discussed earlier, transaction entities connected ina master-detail structure should be combined into asingle star schema. Component entities associatedwith either of the entities become dimensions in theresulting star schema. In most cases, the masterentity will become a degenerate dimension in thestar schema: it forms part of the key of the fact tablebut has no corresponding dimension table (Figure10). The reason for this is that all attributes of the


BRIDGING THE GAP

Figure 9. Transaction Level Star Schema for Order/Order ItemTransaction Entities

Order Item Fact

Delivery Type

Customer

Date

ProductWHO Dimension

HOW Dimension

WHAT Dimension

WHEN Dimension

Employee

WHO Dimension

order posted

Retail Outlet

WHERE Dimension

Order

Degenerate Dimension

master entity should be allocateddown to the item level and storedin the fact table.

Figure 11 shows the transaction-level star schema for the StockLevel transaction entity, which isan example of a snapshot or levelentity. Each row in the fact tablerecords the inventory level of aparticular product in a particularwarehouse on a particular date.The star schema has three dimen-sion tables, corresponding to thetwo component entities associated with the transac-tion entity plus an explicit Date dimension.

When non-transaction level granularity is chosen,the dimensions required will be determined by howthe transactions are summarized. This will generallybe a subset of the dimensions used in the transactionlevel star schema, though dimensions may them-selves be subsets of the transaction level dimensiontables (e.g., Month instead of Date). Figure 12shows a summary level star schema in which dailyorders are summarized by retail outlet and product.This requires three dimension tables: Date, RetailOutlet, and Product.

Step 3: Detailed Fact Table Design

In this step, we complete the detailed design for facttables in each star schema (attribute level design).

3.1 Define Key

The key of each fact table is a composite key con-sisting of the keys of all dimension tables plus anydegenerate dimensions. This is different from howthe key is defined for a normal relational table. Ina relational database, the key should be minimal:no subset of the attributes in the key should alsobe unique. However, the key of a fact tableincludes all dimensional keys, and will therefore

not be minimal in most cases. For example, inFigure 13 (the fact table for the Order transac-tion), the key consists of eight attributes, but onlytwo (Order Number and Product ID) are necessaryfor uniqueness.

3.2 Define Facts

The non-key attributes of the fact table are mea-sures (facts) that can be analyzed using numericalfunctions. What facts are defined depends on whatevent information is collected by operationalsystemsthat is, what attributes are stored in trans-action entities. However, a key concept in defining


BRIDGING THE GAP

Figure 10. Merging of "Master-Detail" Entities

Time Dimension(not required at

operational level)

DeliveryMethod

Dimension

CustomerDimension

EmployeeDimension

Retail OutletDimension

ProductDimension

OrderItemFact

DateDimension

Order(degeneratedimension)

DeliveryMethod

Customer

Product

Order

OrderItem

Employee

RetailOutlet

Figure 11. Transaction Level Star Schema for Stock Level Transaction Entity

WHENDimension

WHEREDimension

WHATDimension

Warehouse ProductInventory

LevelFact

Date

facts is that of additivity. Kimball (1996) defines threelevels of additivity:

Fully additive facts: these are facts that can bemeaningfully added across all dimensions. Forexample, Quantity Ordered in the Order Itementity can be added across dates, products, orcustomers to get total sales volumes for a par-ticular day, product, or customer.Semi-additive facts: these are facts that can bemeaningfully added across some dimensions butnot others (usually time). For example, QuantityOn Hand from the Stock Level entity can beadded across products and warehouses (to get thetotal quantity on hand for a particular product orwarehouse) but not across time (as this wouldlead to double counting of stock). Quantity OnHand can be averaged over time (to get averagedaily inventory levels) but not added.Non-additive facts: these are facts that cannotbe meaningfully added across any dimensions.For example, Unit Price from the Order Itementity cannot be added across any dimensions.Unit Price cannot even be sensibly averagedover time, because to calculate the averageselling price of a product you need to take intoaccount the quantity sold.

Wherever possible, additive facts should be used toprevent errors in queries. This means converting semi-and non-additive facts to additive facts. In theexample, Unit Price (a non-additive fact) can be con-verted to an additive fact by multiplying it byQuantity Ordered to form the Extended Price (thetotal price for an individual order item). Note that thistransformation does not lose information: the UnitPrice for a particular order item can be derived bydividing the Extended Price by the Quantity Ordered.

Figure 13 shows the detailed fact table design for thetransaction level (non-summarized) Order Item starschema. Each row in the table corresponds to anindividual order item.

The key of the fact table consists of the keys ofall the dimension tables, including the degener-ate dimension. The non-key attributes are the union of theattributes in the two original transaction enti-ties (Order and Order Item). However, a non-additive fact (Unit Price) has been converted toan additive fact (Extended or Item Price =Order Quantity x Unit Price).

In this case, no data is lost from the original (normal-ized) data modelwe can reconstruct details of indi-vidual orders down to the line item level.

For transaction entities connected in a master-detailstructure, all attributes of the master record should beallocated down to the item level if possible (Kimball,1995). For example, if discount is defined at themaster (Order) level, the total discount amountshould be allocated at the item level (e.g., in propor-tion to the price for each item). Otherwise, querieswill result in multiple counting of discounts. Thesame should be done to delivery charges, order leveltaxes, fees, etc. If attributes of the master entitycannot be allocated down to the item level, separatestar schemas may be required for each.


BRIDGING THE GAP

Figure 12. Summary Star Schema for Order Transaction

WHENDimension

WHEREDimension

WHATDimension

RetailOutlet

ProductDailySales

Summary

Date

Figure 14 shows the detailed fact table design for thesummary level star schema of Figure 12. Each row inthis fact table summarizes sales by date, product, andretail outlet.

The key of the fact table consists of the keys ofall the dimension tables. The non-key attributes of the fact table are allsummarized facts. Total Sales Quantity is thetotal daily quantity sold for each product in eachretail outlet (calculated as the sum of OrderQuantity grouped by Date, Product ID, andRetail Outlet ID). Total Sales Amount is the totaldaily revenue for each product in each retailoutlet (calculated as the sum of Order Quantity xUnit Price). Total Discount Amount is the totaldaily discount by product and retail outlet (thesum of Discount Amount). An additional fact,Number of Orders, counts the number of sepa-rate orders by day, product, and retail outlet,which is a semi-additive fact. This can be addedacross Retail Outlets and Dates but not acrossProducts, as this would result in double counting.Including such a fact is questionable, as it couldlead to erroneous conclusions by end users.

Such summarization loses information compared to the original (normalized) model: we cannot

reconstruct details of individual order items fromthis star schema. This means that some queries arenot possible: for example, it is possible to deter-mine the total sales of a product on any given day,but not the characteristics of the customers whobought them.

In some cases, fact tables may contain no numericalfacts at all. For example, in a star schema thatrecords the incidence of crimes, the fact table mayrecord when and where a crime took place, the typeof offense committed, and who committed it.However, there are no measures involvedtheimportant information is simply that the crime took

place. This is called a factless fact table. Other examplesof factless fact tables include traffic accident, studentenrollment, and disease statistics. The only relevantaggregation operation for such types of events isCOUNT, which can be used to analyze the frequencyof events over time.

Step 4: Detailed Dimension Table Design

In this step, we complete the detailed design for dimen-sion tables in each star schema (attribute level design).This completes the detailed design of the star schema.

4.1 Define Dimensional Key

The key of each dimension table should be a simple(single attribute) numeric key. In most cases, this willjust be the key of the underlying component entity.However, the operational key may need to be general-


BRIDGING THE GAP

Figure 14. Daily Sales Summary Fact Table

WHENDimension

WHEREDimension

WHATDimension

RetailOutlet

ProductDailySales

Summary

DateFacts: Total Sales Quantity Total Sales Amount ($) Total Discount Amount ($) Number of Orders

Daily Sales Summary

Date

Product_ID

Retail_Outlet_ID

Total Sales Quantity

Total Sales Amount

Total Discount Amount

Number of Orders

Figure 13. Transaction Level Order Item Fact Table

Order Item Fact

Delivery Type

Customer

Date

Product

WHO Dimension

HOW Dimension

WHAT Dimension

WHEN Dimension

Employee

WHO Dimension

order posted

Retail Outlet

WHERE Dimension

Order

Degenerate Dimension

Order Item Fact

Order_Number

Product_ID

Customer_ID

Retail_Outlet_ID

Employee_ID

Delivery_Method

Order_Date

Posted_Date

Order_Quantity

Extended_Price

Discount_AmountFacts: Order Quantity Extended Price ($) Discount Amount ($)

ized to ensure that it remains unique over time.Because operational systems often only require keyuniqueness at a point in time (i.e., keys may be reusedover time), this may cause problems when performinghistorical analysis. Another situation where thedimensional key needs to be generalized is in the caseof slowly changing dimensions.

4.2 Collapse Hierarchies

Dimension tables are formed by collapsing ordenormalizing hierarchies (defined by classificationentities) into component entities. Figure 15 showshow the hierarchies associated with the Customercomponent are collapsed to form the Customerdimension table. The resulting dimension table con-sists of the union of all attributes in the original enti-tiesas many descriptive attributes should beincluded as possible to support slicing and dicing ofdata in different ways. In practice, it is common fordimension tables to contain more than a hundredattributes. This process introduces redundancy in theform of transitive dependencies, which are violationsto third normal form (3NF) (Codd, 1970). Thismeans that the resulting dimension table is in secondnormal form (2NF).

4.3 Replace Codes and Abbreviations by Descriptive Text

To make the star schema as understandable as possible,codes and abbreviations in source data should beremoved and replaced by descriptive text (Kimball,1996). While such codes and abbreviations have animportant place in efficient processing of transactionsat the operational level, they simply obfuscate thingsin the data warehousing environment. Use of suchcodes and abbreviations places a cognitive load onthe end user to remember what they mean, some-thing users can well do without. Codes and abbrevia-tions may be optionally retained in the star schema ifthey add to understanding.

Figure 16 shows the detailed star schema design for the Order transaction (line item granularity).

This shows attributes and keys for the fact table andall dimension tables:

All of the tables in the star schema are denor-malized to at least some extent: each table cor-responds to multiple entities in the originalnormalized ER model. The dimension tablesare the result of collapsing classification entitiesinto component entities, while the fact table isthe result of merging the master and detailtransaction entities.All of the dimension tables are in 2NF: theyhave simple keys and no repeating attributes.The fact table is in 3NF as Discount Amount(an attribute of the Order entity) has been allo-cated down to the line item level. This meansthat Order becomes a degenerate dimension asit has no other dependent attributes.The Date Dimension is a new table which doesnot correspond to any entity in the original ERmodel. This is because dates must be explicitlymodeled in a star schema, whereas at the opera-tional level they are represented as data types.

SummaryThis article has examined the nature of dimensionalmodeling and proposed a possible theoretical


BRIDGING THE GAP

Figure 15. Collapsing Hierarchies

MarketSector

IndustryGroup

State

IndustryHierarchy

MarketHierarchy

LocationHierarchy Country

IndustryClass

CustomerIndustrySector

MarketSegment

PostalAreaCity

Customer Dimension

Customer_ID

Customer Name

Annual Revenue

Date of First Order

Market Segment Code

Market Segment Name

Market Sector Code

Market Sector Name

Industry Class Code

Industry Class Name

Industry Sector Code

Industry Sector Name

Industry Group Code

Industry Group Name

Postal Code

Postal Area Name

City Code

City Name

Population

State Code

State Name

Country Code

Country Name

explanation for why it has been so successful in prac-tice. It has also described an approach for derivingdimensional models directly from ER models. Thisprovides a bridge between operational system(OLTP) design and data warehouse (OLAP) design,which makes it easier for people trained in traditionaldatabase design techniques to learn dimensional design.

Deriving dimensional models from an ER modelprovides a more structured approach to developingdimensional models than starting from first principles, which can help avoid many of the pit-falls faced by inexperienced designers. It also helpsto resolve the difficult problem of matchingsupply (operational data sources) and demand(end-user information needs) in data warehousedesign. Finally, it results in a more completedimensional design, as it involves selecting from all possible ways of analyzing the data in the under-lying operational systems, and is therefore lessdependent on the designers ability to choose theright dimensions.

Relationship between ER Modeling

and Dimensional Modeling

In this paper, we have challenged the widelyaccepted view that dimensional modeling is fundamentally different from, and incompatiblewith, ER modeling. We have shown that a dimen-sional model is just a restricted form of an ERmodel and there is a straightforward transformationbetween the two. An ER model can be transformedinto a set of dimensional models by a process ofselective subsetting, denormalization, and(optional) summarization:

Subsetting: The ER model is partitioned intoa set of separate star schemas, each centeredon a single business event (transactionentity). This reduces complexity through aprocess of chunking.Denormalization: Hierarchies in the ER model

are collapsed to form dimension tables. Thisfurther reduces complexity by reducing thenumber of separate tables. Master-detail trans-action entities are denormalized into a singlefact table, as these represent different views ofthe same underlying event.Summarization: The most flexible dimensionalstructure is one in which each fact represents asingle transaction or line item. However, sum-marization may be required for performancereasons, or to suit the needs of a particulargroup of users.

At the detailed (attribute) design level, a range oftransformations are required:

Generalization of operational keys to ensureuniqueness of keys over time and/or to supportslowly changing dimensions.Conversion of non-additive or semi-additivefacts to additive facts.Substitution of codes and abbreviations withdescriptive text.

Advanced Design Issues

The transformation approach described in this articleresults in a fully detailed star schema design. However,in practice, dimensional modeling tends to be an iter-ative process. The first-cut dimensional model willalmost certainly need refinementas in most designproblems, there are always exceptions and special casesto deal with. In Part II of this discussion, (publishingin the Fall issue of the Business Intelligence Journal), wewill discuss advanced design issues that need to beconsidered in dimensional modeling:

Alternative dimensional structures: stars,snowflakes, and starflakesSlowly changing dimensionsMinidimensionsHeterogeneous star schemasDealing with nonhierarchical data


BRIDGING THE GAP

REFERENCESBaddeley, A. D. The Magical Number Seven:

Still Magic After All These Years? PsychologicalReview, 101, 1994.

Baddeley, A. D., Essentials of Human Memory,Psychology Press, Hove [England], 1999.

Chen, P. P. The Entity Relationship Model:

Towards An Integrated View Of Data, ACMTransactions On Database Systems, 1, 9-36, 1976.

Codd, E. F. A Relational Database Model For LargeShared Data Banks, Communications of the ACM,13, 377-387, 1970.

Craig, R. The Schema Wars, ENT, 3, 30-32,1998.


BRIDGING THE GAP

Figure 16. Order Item Star Schema (fully attributed)

Customer Dimension

Customer_ID

Customer Name

Number of Employees

Annual Revenue

Date of First Order

Market Segment Name

Market Sector Name

Industry Class Name

Industry Sector Name

Industry Group Name

Flat/Apartment Number

Street Number

Street Name

Street Type

Postcode

Postal Area Name

City Name

Population

State Name

Country Name

Account Manager

Source Entities:CustomerPostal AreaStateCityCountryMarket SegmentMarket SectorIndustry SectorIndustry SegmentIndustry Class

Source Entities:NONE(corresponds to a data type at the operational level)

Source Entities:Delivery MethodCarrierPriority

Source Entities:EmployeePosition TypeEmployment StatusEducation LevelUnion Status

Source Entities:Retail OutletSales DistrictSales RegionPostal AreaCityStateCountryStore Type

Delivery MethodDimension

Delivery_Method_ID

Delivery Method

Priority

Carrier Name

Date Dimension

Date_Key

Day Name

Day Number

Day Number in Year

Week Number in Year

Month Name

Month Number

Quarter

Fiscal Period

Year

Holiday

Weekend

Season

Event

Country Name

Employee Dimension

Employee_ID

Employee Name

Age

Sex

Employment Status

Position Type

Commencement Date

Departure Date

Salary Level

Union Status

Last Performance Rating

Education Level

Source Entities:ProductPackage TypeTax StatusBrandProduct CategoryProduct SubcategoryUnit TypeImport StatusExport StatusInventory Status

Product Dimension

Product_ID

Product Name

Brand

Product Subcategory

Product Category

Package Type

Package Size

Unit Type

Product Manager

Import Status

Export Status

Inventory Status

Order Item Fact

Order_Number

Product_ID

Customer_ID

Retail_Outlet_ID

Employee_Number

Delivery_Method

Order_Date

Posted_Date

Order_Quantity

Total_Price

Discount_Amount

Retail Outlet Dimension

Retail_Outlet_ID

Retail Outlet Name

Floor Space

Store Type

Number of Employees

Sales Region

Sales District

Flat/Apartment Number

Street Number

Street Name

Street Type

Postcode

Postal Area Name

City Name

Population

State Name

Country Name

Parking Space

Store Manager

Source Entities:OrderOrder Item

Flood, R. L. and Carson, E. R. Dealing WithComplexity: An Introduction To The Theory AndApplication Of Systems Science, Plenum Press, 1993.

Gallas, S. Put Data in its Place! DM DirectBusiness Intelligence Newsletter, 1999.

Glassey, K. Seducing the End User,Communications of the ACM, 1998.

Kent, W. A Simple Guide To The Five NormalForms Of Relational Database Theory,Communications Of The ACM, 1983.

Kimball, R. Is ER Modeling Hazardous to DSS?DBMS Magazine, 1995.

Kimball, R. The Data Warehouse Toolkit, JohnWiley and Sons, New York, 1996.

Kimball, R. A Dimensional Manifesto, DBMSOnline, 1997.

Kimball, R. The Data Warehouse Toolkit: TheComplete Guide to Dimensional Modeling, JohnWiley and Sons, New York, 2002.

Klir, G. J. Architecture of Systems ProblemSolving, Plenum Press, New York, 1985.

Kuhn, T. S. The Structure Of ScientificRevolutions, University Of Chicago Press, 1970.

Lipowski, Z. J. Sensory And Information InputsOverload: Behavioural Effects, ComprehensivePsychiatry, 16, 1975.

Maier, R. (1996) Benefits And Quality Of DataModeling-Results Of An Empirical Analysis, InProceedings Of The Fifteenth InternationalConference On The Entity Relationship Approach(Ed, Thalheim, B.), Elsevier, Cottbus, Germany.

Miller, G. A. The Magical Number Seven, PlusOr Minus Two: Some Limits On Our Capacity ForProcessing Information, The PsychologicalReview, 1956.

Moody, D. L. (2002) Complexity Effects On EndUser Understanding Of Data Models: AnExperimental Comparison Of Large Data ModelRepresentation Methods, In Proceedings of theTenth European Conference on InformationSystems (ECIS, 2002) Gdansk, Poland.

Newell, A. A. and Simon, H. A. Human ProblemSolving, Prentice-Hall, 1972.

Oates, J., The Great Data Warehouse Debate,Sybase White Paper, 2000.

Raisinghani, M. S., Adapting Data ModelingTechniques for Data Warehouse Design, Journal ofComputer Information Systems, 40, 2000.

Simon, H. A. Sciences Of The Artificial (3rdedition), MIT Press, 1996.

Spencer, T. and Loukas, T. From Star to Snowflaketo ERD, Enterprise Systems Journal, 1999.

Uhr, L., Vossier, C. and Weman, J. PatternRecognition over Distortions by Human Subjectsand a Computer Model of Human FormPerception, Journal of Experimental Psychology:Learning, 63, 1962.

Weber, R. A. Are Attributes Entities? A Study OfDatabase Designers Memory Structures,Information Systems Research, 1996.

Winter, R. and Strauch, B. Demand-DrivenInformation Requirements Analysis in DataWarehousing, Journal of Data Warehousing,V8N1, 2003.


BRIDGING THE GAP

Dimensional Models Intro

Documents

Transcript of Dimensional Models Intro