Dimensional Models Intro

download Dimensional Models Intro

of 18

description

Whitepaper from Daniel L. Moody and Mark A. R. Kortink

Transcript of Dimensional Models Intro

  • BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 7

    BRIDGING THE GAP

    Dimensional modeling is a database design technique devel-

    oped specifically for designing data warehouses. Its objectives

    are to create database structures that end users can easily

    understand and write queries against, and to optimize query

    performance. It has become the predominant approach to

    designing data warehouses in practice and has proven to be a

    major breakthrough in developing databases that can be used

    directly by end users. Dimensional modeling is not based on

    any theory, but has clearly been very successful in practice.

    This article, the first in a two-part series, examines the nature

    of dimensional modeling and proposes a possible explanation

    for why it has been so successful.

    This article also explodes the popular myth that traditional ER

    modeling and dimensional modeling are fundamentally different

    and somehow incompatible. It shows that a dimensional model

    is just a restricted form of an ER model, and that there is a

    straightforward mapping between the two.

    An ER model can be transformed into a set of dimensional

    models by a process of selective subsetting, denormalization,

    and (optional) summarization. Understanding the relationship

    between the two types of models can help to bridge the gap

    between operational system (OLTP) design and data ware-

    house (OLAP) design. It can also help to resolve the difficult

    problem of matching supply (operational data sources) and

    demand (end-user information needs) in data warehouse

    design. Finally, it results in a more complete dimensional

    design, which is less dependent on the designers ability to

    choose the right dimensions.

    From ER Models toDimensional Models:Bridging the Gap between OLTP

    and OLAP Design, Part I

    Daniel L. Moody and Mark A. R. Kortink

    Daniel L. Moody is a Visiting Professor in the Departmentof Software Engineering, Charles University, Prague

    (visiting from Monash University, Melbourne, Australia)[email protected]

    Mark A. R. Kortink is the Managing Director of Kortinkand Associates, an Australianbased IT management

    consultancy company. [email protected]

  • IntroductionThe data warehousing or OLAP (online analyticalprocessing) environment is very different from theoperational or OLTP (online transaction processing)environment, and therefore requires a differentapproach to database design (Kimball, 1995, 1996,1997, 2002). There are two fundamental differencesthat suggest the need for different design approaches:

    End-user access (visibility of database struc-ture): In a data warehousing environment, endusers access data directly using user-friendlyquery and analysis (OLAP) tools (Glassey, 1998).This means users need to understand how data isstructured in order to use the database effectively(though some of the details may be hidden bythe OLAP tool). In operational systems, thedatabase structure is effectively hidden by anapplication front endusers only access thedatabase via data entry and enquiry screens.Read-only access (query versus update per-formance): In a data warehousing environ-ment, users can retrieve data but cannot modifyit, whereas in an operational environment,users make online updates to the database inresponse to business events. For most practicalpurposes, therefore, the data warehouse can beconsidered as a read-only database, as it is onlyupdated via batch extract, transform, and load(ETL) processes.

    Traditional database design techniques such as ERmodeling and normalization are primarily designedfor getting data efficiently and reliably into databases.However, structuring databases in this way also actsas a barrier to getting data out. The reason for this isthat such methods tend to result in database struc-tures which are too complex for end users to under-stand and write queries against. The averageoperational database consists of around a hundredtables, linked by a complex web of relationships,which most end users (not to mention most IT

    specialists) would find overwhelming. Even simplequeries require multi-table joins, which are beyondthe capabilities of most end users.

    The primary cause of complexity in operationaldatabases is the use of normalization. Normalizationtends to multiply the number of tables, as it involvessplitting out functionally dependent groups ofattributes into separate tables. This minimizes dataredundancy and maximizes update efficiency becauseeach change can be made in a single place. It alsomaximizes data integrity by avoiding inconsistencyand preventing insertion, deletion, and updateanomalies (Codd, 1970). However, on the downside,it tends to penalize retrieval because of the number ofjoins required (Kent, 1983). Thus, normalizationresults in databases that are optimized for updaterather than retrieval. Given that data warehouses areeffectively read-only databases, normalization is notwell suited for designing data warehouses.

    Dimensional ModelingDimensional modeling, a database design techniquedeveloped specifically for designing data warehouses,addresses the limitations of traditional databasedesign methods for such purposes. The approach wasfirst defined by Kimball (1996), though he claimsnot to have invented it but to have merely docu-mented what people had been doing in practice formany years. In particular, it was based on his obser-vations of vendors in the business of providing datain a user-friendly form to their customers. Theobjectives of dimensional modeling are to:

    Produce database structures that are easy forend users to understand and write queriesagainst. Optimize query performance (as opposed toupdate performance).

    Dimensional modeling was a pivotal development indata warehousing, as it provided a viable means of

    8 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003

    BRIDGING THE GAP

  • representing data in a way that end users couldunderstand and write queries against. Effectively thistransformed the dream of end users being able tosatisfy their own information needs (the self servicemodel of decision support as expounded by earlierdata warehousing theorists) into reality. Dimensionalmodeling has become the predominant approach todesigning data warehouses in practice and has provento be a major breakthrough in designing databasestructures that can be understood and used directlyby end users.

    Star Schemas

    Like most good ideas, dimensional modeling is verysimple. It is based on a single, highly regular datastructure called a star schema. This consists of acentral table called the fact table, surrounded by anumber of dimension tables.

    The fact table forms the center of the star, whilethe dimension tables form the points of the star.

    Fact tables typically correspond to particular types ofbusiness eventsfor example, orders, shipments,payments, bank transactions, airline reservations, andhospital admissions. Such tables usually containnumerical values (e.g., dollar amounts), which maybe analyzed using statistical functions. Dimensiontables provide the basis for analyzing data in the facttableways of slicing and dicing the data.Dimensions typically answer who, what, when,where, how, and why questions about the busi-ness events stored in the fact table. A star schemamay have any number of dimensions.

    The terminology of dimensional model derivesfrom the fact that a star schema may be visualized asa data cube where each dimension table representsa different spatial dimension (or more generally ahypercube, as a star schema may have any number ofdimensions). As shown in Figure 2, each cell in thecube represents the intersection of values from each

    dimension. Of course, this metaphor loses its intu-itive appeal once there are more than three dimen-sions, as most people have difficulty thinking inhyper-dimensional space!

    Dimensions are often highly denormalized tables,typically consisting of one or more embedded hierar-chies. For example, Customer, which forms a singledimension table in the star schema of Figure 1, con-sists of three independent hierarchies when they arenormalized out (Figure 3). The hierarchical structureof dimensions provides the basis for analyzing data atdifferent levels of detailthe ability to drill downand roll up in OLAP tools.

    Snowflake Schemas

    A snowflake schema is a star schema with fully nor-malized dimensionsit gets its name because itforms a shape similar to a snowflake. Rather thanhaving a regular structure like a star schema, thearms of the snowflake can grow to arbitrary lengthsin each direction. Figure 4 shows the equivalentsnowflake schema for the star schema of Figure 1.While these two structures are equivalent in datacontent and the queries they support, the snowflakeschema has a much more complex structure.

    BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 9

    BRIDGING THE GAP

    Figure 1. Star Schema Example

    Date

    Product

    WHAT Dimension

    WHO Dimension

    WHERE Dimension

    WHEN Dimension

    RetailOutletCustomer

    Sales Summary(Fact Table)

  • However, the advantage of the snowflake structure isthat it explicitly shows the structure of each dimen-sion rather than appearing as an unstructured collec-tion of data items.

    Kimball (1996, 2002) arguesthat snowflaking is undesir-able because it adds unneces-sary complexity, reduces queryperformance, and doesnt sub-stantially reduce storage space.In the example, the snowflakeschema has more than fivetimes as many tables as theequivalent star schema.However, empirical tests showthat the performance impact ofsnowflaking depends on theDBMS and/or OLAP toolused: some favor snowflakeswhile others favor star schemas(Spencer and Loukas, 1999).An advantage of the snowflakerepresentation is that it explic-itly shows the hierarchicalstructure of each dimension

    in a star schema this requires tacit knowledge on thepart of the user. While this may be overwhelmingwhen presented all at once, it may be useful to allowend users to see the underlying structure of individualdimensions to understand how data can be sensibly

    10 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003

    BRIDGING THE GAP

    Figure 2. Data "Cube"

    Product

    Customer

    Date

    Figure 3. Embedded Hierarchies in Dimension Tables

    MarketSector

    IndustryGroup

    Country

    IndustryHierarchy

    MarketSegmentation

    Hierarchy

    LocationHierarchy

    IndustryClass

    Customer

    Customer IDCustomer NameMarket SegmentMarket SectorIndustry ClassIndustry SectorIndustry GroupCityStateCountry

    IndustrySector

    MarketSegment

    CityState

    Figure 4. Snowflake Schema

    Date

    FinancialQuarter SeasonHoliday

    FianacialYear

    Product

    PackageSize

    Brand

    ProcessType

    ManufacturingGroup

    ManufacturingSubgroup

    ProductCategory

    ProductSubcategory

    Department

    RetailOutlet Suburb City

    MarketingRegion

    MarketingDivision

    StoreLayout

    StoreType

    State CountryCustomerIndustryClassIndustrySector

    MarketSegment

    MarketSector

    IndustryGroup

    CityStateCountry

    Sales Summary(Fact Table)

    ProductSource

  • analyzed. This suggests that whether a snowflake or astar schema is used at the physical level, views shouldbe constructed so that the user can see the structure ofeach dimension if required.

    Strengths and Weaknesses of Dimensional ModelsThe major advantage of using dimensional models to represent data is that they drastically reduce thecomplexity of the database structure. This makes the database easier for people to understand andwrite queries against, by minimizing the number oftables and therefore the number of joins required.Dimensional models also optimize query performance.A star schema has a fixed structure that has no alterna-tive join paths, which greatly simplifies the evaluationand optimization of queries (Raisinghani, 2000).

    However, its strength is also its weakness. The fixedstructure of the star schema limits the queries thatcan be written to the dimensions that have beendefined. This means the designer must have a goodidea in advance about the sort of questions users willwant to ask (Craig, 1998). Star schemas make themost common queries easy to write but also restrictthe way the data can be analyzed (Gallas, 1999).Another weakness of dimensional models is that notall information can be represented in dimensionalform. Dimensional models assume an underlyinghierarchical structure of data and exclude data that isnaturally non-hierarchical (e.g., network structureddata). Thus, dimensional modeling at best provides apartial solution (an 80/20 solution) to the problemof database query and analysis.

    Theoretical Explanation: Why Dimensional Modeling WorksDimensional modeling is not based on any theory,but has clearly been very successful in practice. It is amajor contribution to the database design fieldasimportant in practical terms as Codds contributionof normalization (Codd, 1970) and Chens contribution

    of ER modeling (Chen, 1976) and its influence onthe practice of data warehouse design cannot be overstated. For these reasons, it is important tounderstand why it works. We offer a theoreticalexplanation for dimensional modelings success indesigning databases end users can understand andwrite queries against.

    Principle #1: Chunking

    Psychological studies show that due to limits onshort-term memory, humans have a strictly limitedcapacity for processing informationthis is esti-mated to be the magical number seven, plus orminus two concepts concurrently (Miller, 1956;Newell and Simon, 1972; Baddeley, 1994). If theamount of information received exceeds these limits,information overload ensues and comprehensiondegrades rapidly (Lipowski, 1975).

    In one of the most important papers in the history ofpsychology, Miller (1956) described the results of aseries of experiments using a wide range of differentexperimental stimuli (sounds, tastes, colors, points,numbers, and words). He observed a surprising simi-larity in human channel capacity across all theseexperiments, with a mean of 6.5 concepts, and onestandard deviation, including 4 to 10 concepts (Figure5). This has been described as the inelastic limit ofhuman capacity (Uhr et al., 1962) or cognitivebandwidth (Moody, 2002), and represents one of theenduring laws of human cognition (Baddeley, 1994).

    The primary mechanism used by the human mind tocope with large amounts of information is to orga-nize it into chunks of manageable size (Weber,1996; Baddeley, 1999; Miller, 1956). The ability torecursively develop information-saturated chunks isthe key to peoples ability to deal with complexityevery day. The process of organizing data into a set ofseparate star schemas effectively provides a way oforganizing a large amount of data into cognitivelymanageable chunks. Although Kimball never refers

    BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 11

    BRIDGING THE GAP

  • to the seven plus or minus two principle, all theexamples he uses in his original book (Kimball,1996) implicitly obey this rule: an average of 6.6entities per star schema, with a maximum of 11 anda minimum of 4. This corresponds remarkably well(in fact, almost perfectly) to the limits of humanchannel capacity defined by Miller!

    Principle #2: Hierarchical Structuring

    Hierarchy is one of the most common ways of organiz-ing complexity for the purposes of human understand-ing, and is one of the basic principles of systems theory(Flood and Carson, 1993; Klir, 1985). Herbert Simon,the dual Nobel Prize winner, observed the use of hierar-chy to organize complexity in organizational systems,social systems, biological systems, physical systems, andsymbolic systems (Simon, 1996). Based on this, he pro-posed hierarchical organization as a general architecturefor complexity. Hierarchical structures act as a com-plexity management mechanism by reducing thenumber of items one has to deal with at each level ofthe hierarchy. For example, instead of having to dealwith a million customers, one can group them into ahundred market sectors and then into 10 market seg-ments. Hierarchical structures are a familiar and naturalway of organizing information, and are all around us in

    everyday lifefor example, books are organized hierar-chically, Web sites are organized hierarchically, the orga-nizations we work in are organized hierarchicallyeventhis article is organized hierarchically.

    As discussed earlier, each dimension in a star schematypically consists of one or more hierarchies. Theseprovide a way of classifying business events stored inthe fact table, thereby reducing complexity. It is thishierarchical structure that provides the ability toanalyze data at different levels of detail, and to rollup and drill down in OLAP tools. As we will seelater, the process of building dimensional models islargely one of extracting hierarchical structures fromenterprise data.

    Implications for Practice

    Together these principles provide a possible explana-tion for why dimensional modeling has been so suc-cessful in practice. Unlike traditional database designtechniques, dimensional modeling is not based onany theory but simply evolved as a result of practice.However, closer examination shows that it is basedon some general organizational principles (from psy-chology and systems theory) that have been used tomanage complexity in a wide range of domains.

    This is hardly surprising, as one of the most impor-tant problems in data warehousing is complexitymanagement: how to present enormous volumes ofdata in a way that decision makers can understand.Current practice in dimensional modeling has moreof the characteristics of an art than a science, andrelies heavily on common-sense, experience, andrules of thumb. Understanding the underlyingprinciples on which dimensional models are basedmay help to develop principles for good dimen-sional modeling, which are largely lacking. Forexample, the seven plus or minus two principlecould be used to define a limit on the number ofdimensions in a star schema, to ensure that it repre-sents a conceptually manageable chunk of data.

    12 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003

    BRIDGING THE GAP

    Figure 5. Human Channel Capacity or Cognitive Bandwidth (Miller, 1956)

    Number of Variable Aspects

    PointsColors

    Tastes

    Tones

    Chan

    nel C

    apac

    ity

    00

    2

    4

    6

    8

    10

    1 2 3 4 5 6 7

  • From ER Models to Dimensional Models:Exploding the MythsThere is a common misconception that dimensionalmodeling is fundamentally different from, andincompatible with, ER modeling. For example, asKimball (1996) says:

    Entity relation models are a disaster for queryingbecause they cannot be understood by users andcannot be navigated usefully by DBMS software.Entity relation models cannot be used as the basisfor enterprise data warehouses.

    As a result, the starting point for most dimensionaldesign approaches is that you must forget everythingyou ever learned about database design (Kimball,1996, 1997, 1995, 2002; Oates, 2000). Sadly, this isa common theme in the introduction of any newtechnology, where proponents of the new technologyclaim that a whole new approach is needed, oftenwhen they do not fully understand the oldapproaches themselves. It is also part of good market-ing to claim that something is totally new. We thinkit is important to dispel this notion, as it holds backthe discipline of data warehouse design in twoimportant ways:

    It acts as a barrier for people trained in tradi-tional database design techniques to learndimensional design. ER modeling and normal-ization have been used in practice to designdatabase schemas for over two decades, and aretaught as part of virtually every university levelIT curriculum. Clearly, it would be useful tobuild on this wealth of knowledge and exper-tise than to simply discard itfor one thing,people learn better by building on what theyalready know.

    It acts as a barrier to establishing a cumulativetradition in the field, which is essential for sci-entific progress (Kuhn, 1970). Rather than

    simply saying that data warehousing is some-thing totally new and different and trying toestablish a new design discipline from its foundations, dimensional modeling should be linked to the existing body of knowledge in database design.

    On closer examination, we find that a star schema isjust a restricted form of an ER model (Figure 6):

    There is a single entity called the fact table,which is in third normal form (3NF).Violations to second normal form (2NF)would result in double counting in queries.The fact table forms an n-ary intersectionentity (where n is the number of dimensions)between the dimension tables, and includes thekeys of all dimension tables.There are one or more entities called dimen-sion tables, each of which is related to the facttable via one or more one-to-many relation-ships. Dimension tables have simple keys, andare in at least 2NF. Transitive dependencies(3NF violations) are allowed, but partial depen-dencies (2NF violations) and repeating groups(1NF violations) are not.

    As we will see, there is a straightforward transforma-tion between a normalized ER model and a dimen-sional model, involving selective subsetting,denormalization, and (optional) summarization.

    Deriving Dimensional Models from ER ModelsCurrent approaches to dimensional design mostlytake a top-down approach, in which dimensionalmodels are developed directly from user query andanalysis requirements (demand side analysis). While this may seem like a sound, customer-drivenapproach, it ignores one of the fundamental realitiesof the data warehouse environment. To a large extent, a data warehouse is simply a repackaging of

    BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 13

    BRIDGING THE GAP

  • operational data in a more accessible form. Datawarehouse design is therefore a highly constraineddesign problem limited by what data is available from operational systems (supply issues).

    In this article, we describe how to derive dimensionalmodels from (normalized) ER models. This provides away of repackaging operational data (usually stored innormalized tables) into a form that end users canunderstand and write queries against. This helps toresolve the difficult problem of matching supply(operational data sources) and demand (managementinformation needs) in data warehouse design (Winterand Strauch, 2003). Deriving dimensional modelsfrom ER models also provides a more structuredapproach to dimensional design than starting fromfirst principles. There is a large conceptual leap ingetting from end-user analysis requirements to dimen-sional models, with which all but the most gifteddesigners have difficulty. Experience shows that there isa very steep learning curve involved and a great deal ofexperience required before people can successfullydevelop dimensional models on their own. Theapproach described in this paper can reduce this learn-ing curve by providing a stepwise approach to develop-ing dimensional models starting with the source data.

    To illustrate the approach, we use an ER model foran order processing system (Figure 7). This con-sists of 42 entities, which is less than half the sizeof the average application data model, but largeenough to illustrate the main principles of theapproach. We will refer to this example through-out this article and the next (on advanced designissues, to appear in the Fall issue of the BusinessIntelligence Journal).

    Even in a model of this size, the problem of complex-ity in normalized ER models is clearly evident. Mostend users would find this schema incomprehensible,as it exceeds the limits of human cognitive capacitymany times over. Even quite simple queries wouldrequire multi-table joins and/or subqueries, and as aresult, end users would be heavily reliant on technicalspecialists to write queries for them.

    The transformation of an ER Model to dimensionalform takes place in four steps:

    Step 1: Classify entitiesStep 2: Design high-level star schema Step 3: Design detailed fact table Step 4: Design detailed dimension table

    Step 1. Classify Entities

    The first step in the transformation approach is toclassify entities into three distinct categories:

    Transaction Entities: These entities recorddetails of business events (e.g., orders, ship-ments, payments, insurance claims, bank trans-actions, hotel bookings, airline reservations,and hospital admissions). Most decisionsupport applications focus on these events toidentify patterns, trends, and potential problemsand opportunities. The exception to this rule is the case of snapshot entities: entitiesrecording a static level of some commodity at a point in time (e.g., account balances and

    14 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003

    BRIDGING THE GAP

    Figure 6. ER Representation of a Star Schema

    TIMEDimension

    PRODUCTDimension

    WHATWHEN

    DateDay of weekMonthQuarterYearHoliday flag

    Product IDProduct nameBrandProduct category

    SALESFact

    STOREDimension

    WHERE

    DateProduct IDStore IDDollars soldQty soldDollars cost

    Store IDStore nameAddressFloor plan type

  • inventory levels). These do not record businessevents as such, but the effect of events on thestate of an entity.Component Entities: These entities are directlyrelated to a transaction entity by a one-to-manyrelationship. They are involved in the businessevent, and answer who, what, where,how, and why questions about the event. Classification Entities: These entities are relatedto a component entity by a chain of one-to-many relationships. These define embeddedhierarchies in the data model and are used toclassify component entities.

    The classification of entities for the example datamodel is shown in Figure 8. There are three trans-action entities in the model: Order, Order Item,and Stock Level. Order and Order Item correspondto business events, while Stock Level records astatic level (a snapshot entity). Order and OrderItem represent different levels of detail of the samebusiness event.

    There are six component entities:Customer, Employee, Retail Outlet,Product, Warehouse and DeliveryMethod. Product is a component oftwo different transactions (StockLevel and Order Item).

    Finally, there are 31 classificationentities defining separate but par-tially overlapping hierarchies inthe model.

    Note that some of the entities inthe underlying data model do notfit into any of these categories.Such entities do not fit the hierar-chical structure of a dimensionalmodel and therefore cannot be rep-resented in the form of a starschema (with some exceptions, to

    be discussed in the next issue of the BusinessIntelligence Journal). Thus, the process of dimension-alizing an ER model effectively weeds out non-hierarchical data.

    Step 2: High-Level Star Schema Design

    In this step, relevant star schemas are identified andtheir high-level structures defined (table leveldesign). At a high level, transaction entities corre-spond to fact tables and component entities corre-spond to dimension tables, but the mapping is notalways one to one.

    2.1 Identify Star Schemas Required

    Each transaction entity is a candidate for a starschema. The process of creating star schemas is oneof subsettingdividing a large and complex modelinto manageable-sized chunks. Each star schema iscentered on a single business event, and thereforerepresents a manageable-sized chunk of data.However, there is not always a one-to-one correspon-dence between transaction entities and star schemas:

    BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 15

    BRIDGING THE GAP

    Figure 7. ER Model for Retail Sales System

    Product

    Order CustomerMarket Segment

    Sales District Store Type City

    Retail Outlet

    Order Item

    Stock Level Warehouse

    Customer Contact Number

    Product AvailabilitySupplier

    Market Sector

    Brand

    Package Type

    Postal Area

    Sales Region

    Industry Type Industry ClassIndustry GroupProduct

    Substitution

    Employee

    Reporting Relationship

    Position Type

    Contact Number

    Type

    Delivery Method

    Carrier

    State Country

    Priority

    Credit Terms

    Product Subcategory

    Product Category

    Tax Status

    Delivery Charge

    Employment Status

    Union Status

    Education Level

    Inventory Status

    Import/ Export Status

  • Not all transactions will be important for deci-sion-making purposes: user input will be requiredto choose which transactions are relevant. Multiple star schemas at different levels of detailmay be required for a particular transaction. When transaction entities are connected in amaster-detail structure, they should be com-bined into a single star schema, as they repre-sent different views of the same event. The splitinto master and detail is simply a require-ment of normalization (1NF).

    2.2 Define Level of Summarization

    One of the most critical decisions in star schemadesign is to choose the appropriate level of granular-itythe level of detail at which data is stored. In tech-nical terms, granularity is defined as what each row inthe fact table represents. At the top level, there are twomain options in choosing the level of granularity:

    Unsummarized (transaction level granularity):this is the highest level of granularity where

    each fact table row corresponds to a singletransaction or line itemSummarized: transactions may be summarizedby a subset of dimensions or dimensionalattributes. In this case, each row in the facttable corresponds to multiple transactions

    The lower the level of granularity (or conversely, thehigher the level of summarization), the less storagespace required and the faster queries will be exe-cuted. However, the downside is that summarizationalways loses information and therefore limits thetypes of analyses that can be carried out.Transaction-level granularity provides maximumflexibility for analysis, as no information is lost fromthe original normalized model.

    A common misconception is that star schemas alwayscontain summarized datathis is not necessarily thecase and is not always desirable. However, for perfor-mance reasonswhen dealing with large datavolumesor to simplify the view of data for a partic-

    ular set of users, somelevel of summarizationmay be necessary. In prac-tice, a range of starschemas at different levelsof granularity are gener-ally created for each trans-action entity to suit theneeds of different users.An important design con-sideration is that all ofthese star schemas shoulduse exactly the samedimension tables or viewsdefined on these tablesthese are called conform-ing dimensions (Kimball,1996, 2002). This ensuresthat users can drillacross from one star

    16 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003

    BRIDGING THE GAP

    Figure 8. Classification of Entities

    Diagram Key

    Product

    Order CustomerMarket Segment

    Sales District Store Type City

    Retail Outlet

    Order Item

    Stock Level Warehouse

    ComponentEntity

    Customer Contact Number

    Product AvailabilitySupplier

    Market Sector

    Brand

    Package Type

    Postal Area

    Sales Region

    Industry Type Industry ClassIndustry GroupProduct

    Substitution

    Employee

    Reporting Relationship

    Position Type

    Contact Number

    Type

    Delivery Method Carrier

    State Country

    Priority

    Credit Terms

    Product Subcategory

    Product Category

    Tax Status

    Delivery Charge

    Employment Status

    Union Status

    Education Level

    Inventory Status

    Import/ Export Status

    ClassificationEntity

    TransactionEntity

  • schema to another. Facts should also be defined con-sistently between different star schemas, to avoidinconsistent results from queries.

    There is an almost infinite range of possibilities for cre-ating summary level star schemas for a given transactionentity. In general, any combination of dimensions ordimensional attributes may be used to summarize trans-actions. For example, orders could be summarized by:

    Month (an attribute of the Date dimension)and Retail Outlet: this gives monthly salestotals for each retail outlet.Date, Product, and City (an attribute of theRetail Outlet dimension): this gives dailyproduct sales for each city.

    These specify the group by conditions for starschema load processes.

    2.3 Identify Relevant Dimensions

    The component entities associated with each transac-tion entity represent candidate dimensions for the starschema. However, the correspondence between compo-nent entities and dimensions is not always one to one:

    Not all component entities may be relevant forpurposes of analysis or for the level of granular-ity chosen. If a component entity has no dependentattributes, its key will be included in the facttable but it will not have its own dimensiontable. This is called a degenerate dimension.Explicit dimensions are required to representdates and times. Date and/or Time appear asexplicit dimensions in most star schemas tosupport different types of historical analyses.These are not normally represented as entitiesin operational systems, as they are handled bybuilt-in DBMS functions. These correspondto data types rather than entities at the opera-tional level.

    Figure 9 shows the transaction level star schema forthe Order/Order Item transaction. Each row in thefact table corresponds to an individual Order Item(line item granularity).

    The star schema has six dimension tables correspond-ing to the five component entities associated with thetransaction, plus an additional Date dimension.There are multiple relationships to the Date dimen-sion corresponding to the date the order was madeand when it was posted to the general ledger. Orderforms a degenerate dimension. If the time of orders isimportant for analysis, an additional Time dimensionwill be required. Date and Time are usually repre-sented as separate dimensions to reduce the size ofthe dimension tables (Kimball, 1996).

    As discussed earlier, transaction entities connected ina master-detail structure should be combined into asingle star schema. Component entities associatedwith either of the entities become dimensions in theresulting star schema. In most cases, the masterentity will become a degenerate dimension in thestar schema: it forms part of the key of the fact tablebut has no corresponding dimension table (Figure10). The reason for this is that all attributes of the

    BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 17

    BRIDGING THE GAP

    Figure 9. Transaction Level Star Schema for Order/Order ItemTransaction Entities

    Order Item Fact

    Delivery Type

    Customer

    Date

    ProductWHO Dimension

    HOW Dimension

    WHAT Dimension

    WHEN Dimension

    Employee

    WHO Dimension

    order posted

    Retail Outlet

    WHERE Dimension

    Order

    Degenerate Dimension

  • master entity should be allocateddown to the item level and storedin the fact table.

    Figure 11 shows the transaction-level star schema for the StockLevel transaction entity, which isan example of a snapshot or levelentity. Each row in the fact tablerecords the inventory level of aparticular product in a particularwarehouse on a particular date.The star schema has three dimen-sion tables, corresponding to thetwo component entities associated with the transac-tion entity plus an explicit Date dimension.

    When non-transaction level granularity is chosen,the dimensions required will be determined by howthe transactions are summarized. This will generallybe a subset of the dimensions used in the transactionlevel star schema, though dimensions may them-selves be subsets of the transaction level dimensiontables (e.g., Month instead of Date). Figure 12shows a summary level star schema in which dailyorders are summarized by retail outlet and product.This requires three dimension tables: Date, RetailOutlet, and Product.

    Step 3: Detailed Fact Table Design

    In this step, we complete the detailed design for facttables in each star schema (attribute level design).

    3.1 Define Key

    The key of each fact table is a composite key con-sisting of the keys of all dimension tables plus anydegenerate dimensions. This is different from howthe key is defined for a normal relational table. Ina relational database, the key should be minimal:no subset of the attributes in the key should alsobe unique. However, the key of a fact tableincludes all dimensional keys, and will therefore

    not be minimal in most cases. For example, inFigure 13 (the fact table for the Order transac-tion), the key consists of eight attributes, but onlytwo (Order Number and Product ID) are necessaryfor uniqueness.

    3.2 Define Facts

    The non-key attributes of the fact table are mea-sures (facts) that can be analyzed using numericalfunctions. What facts are defined depends on whatevent information is collected by operationalsystemsthat is, what attributes are stored in trans-action entities. However, a key concept in defining

    18 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003

    BRIDGING THE GAP

    Figure 10. Merging of "Master-Detail" Entities

    Time Dimension(not required at

    operational level)

    DeliveryMethod

    Dimension

    CustomerDimension

    EmployeeDimension

    Retail OutletDimension

    ProductDimension

    OrderItemFact

    DateDimension

    Order(degeneratedimension)

    DeliveryMethod

    Customer

    Product

    Order

    OrderItem

    Employee

    RetailOutlet

    Figure 11. Transaction Level Star Schema for Stock Level Transaction Entity

    WHENDimension

    WHEREDimension

    WHATDimension

    Warehouse ProductInventory

    LevelFact

    Date

  • facts is that of additivity. Kimball (1996) defines threelevels of additivity:

    Fully additive facts: these are facts that can bemeaningfully added across all dimensions. Forexample, Quantity Ordered in the Order Itementity can be added across dates, products, orcustomers to get total sales volumes for a par-ticular day, product, or customer.Semi-additive facts: these are facts that can bemeaningfully added across some dimensions butnot others (usually time). For example, QuantityOn Hand from the Stock Level entity can beadded across products and warehouses (to get thetotal quantity on hand for a particular product orwarehouse) but not across time (as this wouldlead to double counting of stock). Quantity OnHand can be averaged over time (to get averagedaily inventory levels) but not added.Non-additive facts: these are facts that cannotbe meaningfully added across any dimensions.For example, Unit Price from the Order Itementity cannot be added across any dimensions.Unit Price cannot even be sensibly averagedover time, because to calculate the averageselling price of a product you need to take intoaccount the quantity sold.

    Wherever possible, additive facts should be used toprevent errors in queries. This means converting semi-and non-additive facts to additive facts. In theexample, Unit Price (a non-additive fact) can be con-verted to an additive fact by multiplying it byQuantity Ordered to form the Extended Price (thetotal price for an individual order item). Note that thistransformation does not lose information: the UnitPrice for a particular order item can be derived bydividing the Extended Price by the Quantity Ordered.

    Figure 13 shows the detailed fact table design for thetransaction level (non-summarized) Order Item starschema. Each row in the table corresponds to anindividual order item.

    The key of the fact table consists of the keys ofall the dimension tables, including the degener-ate dimension. The non-key attributes are the union of theattributes in the two original transaction enti-ties (Order and Order Item). However, a non-additive fact (Unit Price) has been converted toan additive fact (Extended or Item Price =Order Quantity x Unit Price).

    In this case, no data is lost from the original (normal-ized) data modelwe can reconstruct details of indi-vidual orders down to the line item level.

    For transaction entities connected in a master-detailstructure, all attributes of the master record should beallocated down to the item level if possible (Kimball,1995). For example, if discount is defined at themaster (Order) level, the total discount amountshould be allocated at the item level (e.g., in propor-tion to the price for each item). Otherwise, querieswill result in multiple counting of discounts. Thesame should be done to delivery charges, order leveltaxes, fees, etc. If attributes of the master entitycannot be allocated down to the item level, separatestar schemas may be required for each.

    BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 19

    BRIDGING THE GAP

    Figure 12. Summary Star Schema for Order Transaction

    WHENDimension

    WHEREDimension

    WHATDimension

    RetailOutlet

    ProductDailySales

    Summary

    Date

  • Figure 14 shows the detailed fact table design for thesummary level star schema of Figure 12. Each row inthis fact table summarizes sales by date, product, andretail outlet.

    The key of the fact table consists of the keys ofall the dimension tables. The non-key attributes of the fact table are allsummarized facts. Total Sales Quantity is thetotal daily quantity sold for each product in eachretail outlet (calculated as the sum of OrderQuantity grouped by Date, Product ID, andRetail Outlet ID). Total Sales Amount is the totaldaily revenue for each product in each retailoutlet (calculated as the sum of Order Quantity xUnit Price). Total Discount Amount is the totaldaily discount by product and retail outlet (thesum of Discount Amount). An additional fact,Number of Orders, counts the number of sepa-rate orders by day, product, and retail outlet,which is a semi-additive fact. This can be addedacross Retail Outlets and Dates but not acrossProducts, as this would result in double counting.Including such a fact is questionable, as it couldlead to erroneous conclusions by end users.

    Such summarization loses information compared to the original (normalized) model: we cannot

    reconstruct details of individual order items fromthis star schema. This means that some queries arenot possible: for example, it is possible to deter-mine the total sales of a product on any given day,but not the characteristics of the customers whobought them.

    In some cases, fact tables may contain no numericalfacts at all. For example, in a star schema thatrecords the incidence of crimes, the fact table mayrecord when and where a crime took place, the typeof offense committed, and who committed it.However, there are no measures involvedtheimportant information is simply that the crime took

    place. This is called a factless fact table. Other examplesof factless fact tables include traffic accident, studentenrollment, and disease statistics. The only relevantaggregation operation for such types of events isCOUNT, which can be used to analyze the frequencyof events over time.

    Step 4: Detailed Dimension Table Design

    In this step, we complete the detailed design for dimen-sion tables in each star schema (attribute level design).This completes the detailed design of the star schema.

    4.1 Define Dimensional Key

    The key of each dimension table should be a simple(single attribute) numeric key. In most cases, this willjust be the key of the underlying component entity.However, the operational key may need to be general-

    20 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003

    BRIDGING THE GAP

    Figure 14. Daily Sales Summary Fact Table

    WHENDimension

    WHEREDimension

    WHATDimension

    RetailOutlet

    ProductDailySales

    Summary

    DateFacts: Total Sales Quantity Total Sales Amount ($) Total Discount Amount ($) Number of Orders

    Daily Sales Summary

    Date

    Product_ID

    Retail_Outlet_ID

    Total Sales Quantity

    Total Sales Amount

    Total Discount Amount

    Number of Orders

    Figure 13. Transaction Level Order Item Fact Table

    Order Item Fact

    Delivery Type

    Customer

    Date

    Product

    WHO Dimension

    HOW Dimension

    WHAT Dimension

    WHEN Dimension

    Employee

    WHO Dimension

    order posted

    Retail Outlet

    WHERE Dimension

    Order

    Degenerate Dimension

    Order Item Fact

    Order_Number

    Product_ID

    Customer_ID

    Retail_Outlet_ID

    Employee_ID

    Delivery_Method

    Order_Date

    Posted_Date

    Order_Quantity

    Extended_Price

    Discount_AmountFacts: Order Quantity Extended Price ($) Discount Amount ($)

  • ized to ensure that it remains unique over time.Because operational systems often only require keyuniqueness at a point in time (i.e., keys may be reusedover time), this may cause problems when performinghistorical analysis. Another situation where thedimensional key needs to be generalized is in the caseof slowly changing dimensions.

    4.2 Collapse Hierarchies

    Dimension tables are formed by collapsing ordenormalizing hierarchies (defined by classificationentities) into component entities. Figure 15 showshow the hierarchies associated with the Customercomponent are collapsed to form the Customerdimension table. The resulting dimension table con-sists of the union of all attributes in the original enti-tiesas many descriptive attributes should beincluded as possible to support slicing and dicing ofdata in different ways. In practice, it is common fordimension tables to contain more than a hundredattributes. This process introduces redundancy in theform of transitive dependencies, which are violationsto third normal form (3NF) (Codd, 1970). Thismeans that the resulting dimension table is in secondnormal form (2NF).

    4.3 Replace Codes and Abbreviations by Descriptive Text

    To make the star schema as understandable as possible,codes and abbreviations in source data should beremoved and replaced by descriptive text (Kimball,1996). While such codes and abbreviations have animportant place in efficient processing of transactionsat the operational level, they simply obfuscate thingsin the data warehousing environment. Use of suchcodes and abbreviations places a cognitive load onthe end user to remember what they mean, some-thing users can well do without. Codes and abbrevia-tions may be optionally retained in the star schema ifthey add to understanding.

    Figure 16 shows the detailed star schema design for the Order transaction (line item granularity).

    This shows attributes and keys for the fact table andall dimension tables:

    All of the tables in the star schema are denor-malized to at least some extent: each table cor-responds to multiple entities in the originalnormalized ER model. The dimension tablesare the result of collapsing classification entitiesinto component entities, while the fact table isthe result of merging the master and detailtransaction entities.All of the dimension tables are in 2NF: theyhave simple keys and no repeating attributes.The fact table is in 3NF as Discount Amount(an attribute of the Order entity) has been allo-cated down to the line item level. This meansthat Order becomes a degenerate dimension asit has no other dependent attributes.The Date Dimension is a new table which doesnot correspond to any entity in the original ERmodel. This is because dates must be explicitlymodeled in a star schema, whereas at the opera-tional level they are represented as data types.

    SummaryThis article has examined the nature of dimensionalmodeling and proposed a possible theoretical

    BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 21

    BRIDGING THE GAP

    Figure 15. Collapsing Hierarchies

    MarketSector

    IndustryGroup

    State

    IndustryHierarchy

    MarketHierarchy

    LocationHierarchy Country

    IndustryClass

    CustomerIndustrySector

    MarketSegment

    PostalAreaCity

    Customer Dimension

    Customer_ID

    Customer Name

    Annual Revenue

    Date of First Order

    Market Segment Code

    Market Segment Name

    Market Sector Code

    Market Sector Name

    Industry Class Code

    Industry Class Name

    Industry Sector Code

    Industry Sector Name

    Industry Group Code

    Industry Group Name

    Postal Code

    Postal Area Name

    City Code

    City Name

    Population

    State Code

    State Name

    Country Code

    Country Name

  • explanation for why it has been so successful in prac-tice. It has also described an approach for derivingdimensional models directly from ER models. Thisprovides a bridge between operational system(OLTP) design and data warehouse (OLAP) design,which makes it easier for people trained in traditionaldatabase design techniques to learn dimensional design.

    Deriving dimensional models from an ER modelprovides a more structured approach to developingdimensional models than starting from first principles, which can help avoid many of the pit-falls faced by inexperienced designers. It also helpsto resolve the difficult problem of matchingsupply (operational data sources) and demand(end-user information needs) in data warehousedesign. Finally, it results in a more completedimensional design, as it involves selecting from all possible ways of analyzing the data in the under-lying operational systems, and is therefore lessdependent on the designers ability to choose theright dimensions.

    Relationship between ER Modeling

    and Dimensional Modeling

    In this paper, we have challenged the widelyaccepted view that dimensional modeling is fundamentally different from, and incompatiblewith, ER modeling. We have shown that a dimen-sional model is just a restricted form of an ERmodel and there is a straightforward transformationbetween the two. An ER model can be transformedinto a set of dimensional models by a process ofselective subsetting, denormalization, and(optional) summarization:

    Subsetting: The ER model is partitioned intoa set of separate star schemas, each centeredon a single business event (transactionentity). This reduces complexity through aprocess of chunking.Denormalization: Hierarchies in the ER model

    are collapsed to form dimension tables. Thisfurther reduces complexity by reducing thenumber of separate tables. Master-detail trans-action entities are denormalized into a singlefact table, as these represent different views ofthe same underlying event.Summarization: The most flexible dimensionalstructure is one in which each fact represents asingle transaction or line item. However, sum-marization may be required for performancereasons, or to suit the needs of a particulargroup of users.

    At the detailed (attribute) design level, a range oftransformations are required:

    Generalization of operational keys to ensureuniqueness of keys over time and/or to supportslowly changing dimensions.Conversion of non-additive or semi-additivefacts to additive facts.Substitution of codes and abbreviations withdescriptive text.

    Advanced Design Issues

    The transformation approach described in this articleresults in a fully detailed star schema design. However,in practice, dimensional modeling tends to be an iter-ative process. The first-cut dimensional model willalmost certainly need refinementas in most designproblems, there are always exceptions and special casesto deal with. In Part II of this discussion, (publishingin the Fall issue of the Business Intelligence Journal), wewill discuss advanced design issues that need to beconsidered in dimensional modeling:

    Alternative dimensional structures: stars,snowflakes, and starflakesSlowly changing dimensionsMinidimensionsHeterogeneous star schemasDealing with nonhierarchical data

    22 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003

    BRIDGING THE GAP

  • REFERENCESBaddeley, A. D. The Magical Number Seven:

    Still Magic After All These Years? PsychologicalReview, 101, 1994.

    Baddeley, A. D., Essentials of Human Memory,Psychology Press, Hove [England], 1999.

    Chen, P. P. The Entity Relationship Model:

    Towards An Integrated View Of Data, ACMTransactions On Database Systems, 1, 9-36, 1976.

    Codd, E. F. A Relational Database Model For LargeShared Data Banks, Communications of the ACM,13, 377-387, 1970.

    Craig, R. The Schema Wars, ENT, 3, 30-32,1998.

    BUSINESS INTELLIGENCE JOURNAL SUMMER 2003 23

    BRIDGING THE GAP

    Figure 16. Order Item Star Schema (fully attributed)

    Customer Dimension

    Customer_ID

    Customer Name

    Number of Employees

    Annual Revenue

    Date of First Order

    Market Segment Name

    Market Sector Name

    Industry Class Name

    Industry Sector Name

    Industry Group Name

    Flat/Apartment Number

    Street Number

    Street Name

    Street Type

    Postcode

    Postal Area Name

    City Name

    Population

    State Name

    Country Name

    Account Manager

    Source Entities:CustomerPostal AreaStateCityCountryMarket SegmentMarket SectorIndustry SectorIndustry SegmentIndustry Class

    Source Entities:NONE(corresponds to a data type at the operational level)

    Source Entities:Delivery MethodCarrierPriority

    Source Entities:EmployeePosition TypeEmployment StatusEducation LevelUnion Status

    Source Entities:Retail OutletSales DistrictSales RegionPostal AreaCityStateCountryStore Type

    Delivery MethodDimension

    Delivery_Method_ID

    Delivery Method

    Priority

    Carrier Name

    Date Dimension

    Date_Key

    Day Name

    Day Number

    Day Number in Year

    Week Number in Year

    Month Name

    Month Number

    Quarter

    Fiscal Period

    Year

    Holiday

    Weekend

    Season

    Event

    Country Name

    Employee Dimension

    Employee_ID

    Employee Name

    Age

    Sex

    Employment Status

    Position Type

    Commencement Date

    Departure Date

    Salary Level

    Union Status

    Last Performance Rating

    Education Level

    Source Entities:ProductPackage TypeTax StatusBrandProduct CategoryProduct SubcategoryUnit TypeImport StatusExport StatusInventory Status

    Product Dimension

    Product_ID

    Product Name

    Brand

    Product Subcategory

    Product Category

    Package Type

    Package Size

    Unit Type

    Product Manager

    Import Status

    Export Status

    Inventory Status

    Order Item Fact

    Order_Number

    Product_ID

    Customer_ID

    Retail_Outlet_ID

    Employee_Number

    Delivery_Method

    Order_Date

    Posted_Date

    Order_Quantity

    Total_Price

    Discount_Amount

    Retail Outlet Dimension

    Retail_Outlet_ID

    Retail Outlet Name

    Floor Space

    Store Type

    Number of Employees

    Sales Region

    Sales District

    Flat/Apartment Number

    Street Number

    Street Name

    Street Type

    Postcode

    Postal Area Name

    City Name

    Population

    State Name

    Country Name

    Parking Space

    Store Manager

    Source Entities:OrderOrder Item

  • Flood, R. L. and Carson, E. R. Dealing WithComplexity: An Introduction To The Theory AndApplication Of Systems Science, Plenum Press, 1993.

    Gallas, S. Put Data in its Place! DM DirectBusiness Intelligence Newsletter, 1999.

    Glassey, K. Seducing the End User,Communications of the ACM, 1998.

    Kent, W. A Simple Guide To The Five NormalForms Of Relational Database Theory,Communications Of The ACM, 1983.

    Kimball, R. Is ER Modeling Hazardous to DSS?DBMS Magazine, 1995.

    Kimball, R. The Data Warehouse Toolkit, JohnWiley and Sons, New York, 1996.

    Kimball, R. A Dimensional Manifesto, DBMSOnline, 1997.

    Kimball, R. The Data Warehouse Toolkit: TheComplete Guide to Dimensional Modeling, JohnWiley and Sons, New York, 2002.

    Klir, G. J. Architecture of Systems ProblemSolving, Plenum Press, New York, 1985.

    Kuhn, T. S. The Structure Of ScientificRevolutions, University Of Chicago Press, 1970.

    Lipowski, Z. J. Sensory And Information InputsOverload: Behavioural Effects, ComprehensivePsychiatry, 16, 1975.

    Maier, R. (1996) Benefits And Quality Of DataModeling-Results Of An Empirical Analysis, InProceedings Of The Fifteenth InternationalConference On The Entity Relationship Approach(Ed, Thalheim, B.), Elsevier, Cottbus, Germany.

    Miller, G. A. The Magical Number Seven, PlusOr Minus Two: Some Limits On Our Capacity ForProcessing Information, The PsychologicalReview, 1956.

    Moody, D. L. (2002) Complexity Effects On EndUser Understanding Of Data Models: AnExperimental Comparison Of Large Data ModelRepresentation Methods, In Proceedings of theTenth European Conference on InformationSystems (ECIS, 2002) Gdansk, Poland.

    Newell, A. A. and Simon, H. A. Human ProblemSolving, Prentice-Hall, 1972.

    Oates, J., The Great Data Warehouse Debate,Sybase White Paper, 2000.

    Raisinghani, M. S., Adapting Data ModelingTechniques for Data Warehouse Design, Journal ofComputer Information Systems, 40, 2000.

    Simon, H. A. Sciences Of The Artificial (3rdedition), MIT Press, 1996.

    Spencer, T. and Loukas, T. From Star to Snowflaketo ERD, Enterprise Systems Journal, 1999.

    Uhr, L., Vossier, C. and Weman, J. PatternRecognition over Distortions by Human Subjectsand a Computer Model of Human FormPerception, Journal of Experimental Psychology:Learning, 63, 1962.

    Weber, R. A. Are Attributes Entities? A Study OfDatabase Designers Memory Structures,Information Systems Research, 1996.

    Winter, R. and Strauch, B. Demand-DrivenInformation Requirements Analysis in DataWarehousing, Journal of Data Warehousing,V8N1, 2003.

    24 BUSINESS INTELLIGENCE JOURNAL SUMMER 2003

    BRIDGING THE GAP