White Paper

An Architect's Evaluation of Form and Function – the Dimensional Data Model

Donavon Gooldy, Senior Principal

Tuesday, May 27, 2014

Proprietary and Confidential - ©2014 Clarity Solution Group, Inc. 2

Table of contents

1 Introduction
2 Model Characteristics
3 Dimensional Model Architectural Origins
3.1 The Entity Relationship Model Form
3.2 An Organized Performance Architecture Response
4 The Dimensional Model Form
5 The Dimensional Model Function
6 The Limits of Single Form Design
6.1 Function Limiting Characteristics of the Dimensional Form
6.1.1 The Dimensional Form Does Not Extend Well
6.1.2 The Dimensional Form Is Not Flexible
6.1.3 The Form Does Not Describe the Business
7 Applying the Dimensional Form without Requirements
7.1 Client A
7.2 Client B
7.3 Common Characteristics
7.4 Bottom-Up Warehouse Design
8 System Architecture Form to Fulfill Multiple Functions
8.1 Combining Model Forms
8.2 Integrating Model Form with Technology Form
9 Conclusion

1 Introduction

"It is the pervading law of all things organic and inorganic, of all things physical and

metaphysical, of all things human and all things superhuman, of all true manifestations of the

head, of the heart, of the soul, that the life is recognizable in its expression, that form ever

follows function. This is the law."

Louis Sullivan

“Form follows function - that has been misunderstood. Form and function should be one,

joined in a spiritual union.”

Frank Lloyd Wright

To be an architect of information solutions is to understand the concept of form following function intuitively, as a matter of nature, because design (the creation of form) is about enabling informational function. Taking the title "architect" affirms a conscious, methodical, design-based decision process for aligning form with functional needs.

As one examines form's relationship to function within the dimensional model, the evaluation of the model form must not be based solely on Sullivan's statement, but on Wright's; form not only follows function, but function follows form.

The concept of form and function unity highlights that form is not only based on function, but

also limits it, many times strictly. Form and function are bound together in a cause and effect

relationship; function is the cause of the form, while form both facilitates function and limits it.

When considering the data warehouse function, one considers the overall goal: to deliver information, allowing the business to measure its activity and understand the impacts of its actions in the marketplace. This high-level statement of function, though, is far too general for the evaluation of model form. As will be demonstrated, a more detailed understanding of system functionality is needed before determining where a model form applies.

The function-limiting impact of form is often overlooked in design, particularly data model

design. By implementing a specific design form, are the broader limits on function considered?

What system design steps are needed to mitigate those limitations?

Too often data practitioners apply the form they know best, the latest form they've come to appreciate or a form that is deemed a "best practice" in their circles.

True architects are not practitioners of "best practices". They practice the application of forms to function based on principles derived from cause and effect analysis.

The architect studies the relationship of form and function, of cause and effect, and then applies forms specific to the required functions. The architect deals with the complexity of the client's multi-functional needs and devises multi-component solution forms to deliver functionality that single-form solutions cannot deliver.

2 Model Characteristics

One generally thinks about a model form in terms of certain characteristics. Through the evaluation of these characteristics and examination of model form, it becomes evident how they align with, support and limit function in relation to data and information delivery.

o The model's ability to extend
    - to extend a data model for new content/capability without disruption and redesign of processes

o The model's ability to be flexible
    - to support multiple purposes or functions

o The model's ability to describe the business and subjects within the corporate structure
    - to document the business using data

o The model's ability to support any valid business question
    - to answer business questions without specific design structuring
    - not a matter of ease or performance but a matter of ability

o The model's ability to efficiently and quickly answer business questions (report query performance)
    - to provide acceptable query performance for corporate decision support and analysis

o The model's ability to demonstrate business performance
    - to measure business performance

The critical examination of the limiting aspects of the dimensional model gives the architect the foundational principles necessary to understand the application of the dimensional form in Information Architecture solutions.

3 Dimensional Model Architectural Origins

The dimensional model form is designed to greatly simplify database optimization for queries

that would otherwise be applied against an Entity Relationship (ER) model. Because the

dimensional model is a design response used to overcome ER form limits, there must first be

examination of the ER form and its characteristics as a comparison basis.

3.1 The Entity Relationship Model Form

1. To free the collection of relations from undesirable insertion, update and deletion

dependencies;

2. To reduce the need for restructuring the collection of relations, as new types of data are

introduced, and thus increase the life span of application programs;

3. To make the relational model more informative to users;

4. To make the collection of relations neutral to the query statistics, where these statistics are

liable to change as time goes by.

— E.F. Codd, "Further Normalization of the Data Base Relational Model"

Each of Codd's goals not only provides insight into ER model function, but is also instructive as to the reasons for the dimensional model form.

The Data Architect produces an ER model that describes the business through "Entities" representing each of the objects, actors, organizational fictions, contracts, business activities and others in the business landscape. If it can be named as a subject, it must be represented as an entity within the model. Each entity is given an identifier known as the primary key. Additional attributes are added to describe only the primary key.

Foreign key relationships document each business relationship existing between entities. These relationships are instilled in the model logically rather than by direct data association. This distinction is fundamental to the examination of the ER and dimensional model forms' characteristics and their ability to deliver specific functionality.

This examination won't delve into the application of normalization rules, except to state that many modelers deal with normalization intuitively, as a matter of entity definition and evaluation of attributes when creating the ER model. Normalization rules represent a method of thinking regarding the evaluation of data content in model development. Normalization ensures all entities are defined purely and that all business relationships within the model are defined logically rather than by physical association.

As one examines Codd's goals, it is clear that they align with some of the model characteristics previously discussed. Those characteristics are:

o extensibility
o flexibility
o ability to describe the subject
o ability to support any valid business question

Codd's fourth goal may appear somewhat cryptic, but it is central to an architect's understanding of both model forms and to the support of Codd's preceding goals.

In a fully normalized model there is no statistical data relationship bias that emphasizes one relationship or eliminates another, because relationships are implemented logically. Data that is not normalized is associated physically on the same row, creating a bias. When data is organized this way, certain questions can be answered, while others cannot.

Applying rules of normalization ensures no bias exists for one type of business question or

another.
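
The bias described above can be sketched with Python's standard-library sqlite3 module. The product, sale and sale_flat tables are hypothetical and exist only for illustration. The flattened table physically ties each product name to a sale row, so a question about products that never sold can be answered from the normalized tables but not from the flattened one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalized tables: the product-to-sale relationship is a foreign key.
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sale (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES product(product_id),
    amount REAL
);
INSERT INTO product VALUES (1, 'Phone'), (2, 'Tablet'), (3, 'Modem');
INSERT INTO sale VALUES (1, 1, 200.0), (2, 1, 150.0), (3, 2, 300.0);

-- Hypothetical denormalized table: product data lives on the sale row itself.
CREATE TABLE sale_flat AS
SELECT s.sale_id, p.product_name, s.amount
FROM sale s JOIN product p ON p.product_id = s.product_id;
""")

# The normalized form can answer "which products never sold?"
unsold = [row[0] for row in conn.execute("""
    SELECT p.product_name
    FROM product p LEFT JOIN sale s ON s.product_id = p.product_id
    WHERE s.sale_id IS NULL
""")]

# The flattened form only knows products that appear on a sale row.
flat_products = {row[0] for row in conn.execute(
    "SELECT DISTINCT product_name FROM sale_flat")}

print(unsold)          # products with no sales
print(flat_products)   # products visible in the flattened table
```

In this sketch the unsold product is simply absent from sale_flat, which is the kind of eliminated relationship the text refers to.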

One can ask any valid business question of a normalized model. Based on the model's logically implemented relationships (foreign keys), one will always get the answer. There is no need to know future questions. It will always work if each entity germane to the question is represented within the model and each relationship between those entities is documented logically. As long as one is willing to write the necessary queries and wait, the model will answer.

Therefore, the normalized entity relationship model form is designed for flexibility, to answer any business question. It eliminates relationship bias by describing each entity purely and documenting all business relationships logically, providing data relationship neutrality.

Extensibility is another outcome of eliminating relational bias, as will be seen later.

The normalized form that gives us this functionality also limits function. To answer more than simple business questions, complex queries need to be written with many joins that follow relational paths and with correlated sub-queries that identify specific content within data sets. The query may need to do mixed aggregation to common GROUP BY levels as well as use outer joins, complicating query optimization. Temp tables and multiple query steps may need to be used in some cases. In data warehousing, all of this query complexity results in issues of access and join serialization, along with heavy I/O from reading, buffering and sorting large volumes of data.

No one wants to wait hours for BI report results. In the early days of data warehousing, on at least one RDBMS, the longer the query ran, the more likely it would end in error due to the database's concurrency architecture.

3.2 An Organized Performance Architecture Response

At the time of the first edition of Ralph Kimball's The Data Warehouse Toolkit, most data warehouses were hosted on SMP database servers. These types of servers do not scale parallel processing linearly as MPP clusters do, and this often led to a variety of very limiting data forms that were intended to improve query performance.

The introduction of the dimensional model provided an organized, systematic design basis for

a performance architecture form leading to predictable query optimization.

It also addressed another issue at the time: it is much simpler to write queries against. Hand coding queries against an ER model for any sort of complicated reporting requires a good deal of skill, experience and time. While users still need to write some queries manually, Business Intelligence software has diminished that need by supporting metadata-driven abstraction that interprets the physical data model for the user.

When dimensional models are designed properly for reporting, queries require only selection of the attributes and measures required, direct joins to the dimensions needed, application of WHERE or JOIN filters, appropriate aggregate functions and GROUP BY clauses (and perhaps a HAVING clause).
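
As a minimal sketch of that query shape, the following Python script uses the standard-library sqlite3 module to build a tiny star schema and run one report query. The table and column names (sales_fact, date_dim, channel_dim) and the data are illustrative assumptions, not taken from any model discussed in this paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical star schema: one fact table, two directly joined dimensions.
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE channel_dim (channel_key INTEGER PRIMARY KEY, channel_name TEXT);
CREATE TABLE sales_fact (
    date_key INTEGER REFERENCES date_dim(date_key),
    channel_key INTEGER REFERENCES channel_dim(channel_key),
    revenue REAL
);
INSERT INTO date_dim VALUES (1, '2014-01'), (2, '2014-02');
INSERT INTO channel_dim VALUES (1, 'Retail'), (2, 'Online');
INSERT INTO sales_fact VALUES
    (1, 1, 100.0), (1, 2, 40.0), (2, 1, 60.0), (2, 2, 10.0);
""")

# Select attributes and a measure, join directly to the dimensions,
# filter, aggregate, group, and optionally apply HAVING.
report = conn.execute("""
    SELECT c.channel_name, SUM(f.revenue) AS total_revenue
    FROM sales_fact AS f
    JOIN date_dim AS d ON d.date_key = f.date_key
    JOIN channel_dim AS c ON c.channel_key = f.channel_key
    WHERE d.month IN ('2014-01', '2014-02')
    GROUP BY c.channel_name
    HAVING SUM(f.revenue) > 50
    ORDER BY c.channel_name
""").fetchall()

print(report)
```

Every join in the query goes straight from the fact table to a dimension; there is no entity-to-entity path to follow and no sub-query.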

4 The Dimensional Model Form

Dimensional modeling achieves its performance advantage by designing denormalizations into data organizations specific to answering a limited range of business questions. These denormalizations take the form of placing data in physical relationships and eliminating the logical business-based relationships that follow an entity-to-entity-to-entity form, in favor of more direct relationships between report grouping references and business metrics.

In other words, the dimensional model form creates explicit relationship biases to simplify

queries, reduce I/O and eliminate query optimization complexity, which delivers answers to

business questions efficiently and quickly.

The pattern of denormalization follows the form of a central table called a fact table

containing one or more business measurements called facts. The facts may be sourced from a

variety of transactional and reference sources, all of which may be used in combination to

answer certain classes of business questions.

The fact table row always has the context of a time period, either a date or a date and time together. The time period may be either a date or a higher-level time period, such as week, month, quarter or year. Facts may be transactional, a point-in-time snapshot state of metrics or a period-based aggregate.

The fact table also has foreign key relationship attributes relating the fact rows to reference

tables called dimensions. Dimensions may represent a single entity identity of data, but

typically contain attributes from, or derived from, multiple entities describing a subject.

Typically there is at least one dimension associated with the fact table that has its basis in an entity with a natural business-based relationship to the business activity represented in the facts of the fact table. There are usually other dimension relationships that are one or two entities removed from the business activity documented in the fact table. There may also be additional dimensions related to the facts that must be derived by processing other business activity.

Keep in mind that if a source does not actually document all of the data relationships, for

example the customer's origination sales channels, then these relationships must be derived

from processing business activity records, such as sales or service orders.

One must also build into the process and structure of the star schema all of the complex processing that would otherwise be needed against the entity relationship model to bring data up to a common simplified form, fit for answering functionally similar business questions.

The philosophy of the dimensional model is to do all of the processing once to form a common basis for a class of business questions or analysis, storing the results of that process in the star schema so that BI queries avoid that complex process at report runtime. It is a 'process once, use it many times' approach.

The end result should be a star schema capable of delivering measurements based on simple

SELECT, JOIN, WHERE and GROUP BY statements.
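
The 'process once, use it many times' idea can be sketched as follows, again with Python's standard-library sqlite3 module. All table names and data are hypothetical. The entity-to-entity-to-entity join path (order line to order to customer to region) is walked a single time at load, and the report then touches only the star schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalized source: relationships exist only as foreign keys.
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    region_id INTEGER REFERENCES region(region_id)
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    order_date TEXT
);
CREATE TABLE order_line (order_id INTEGER REFERENCES orders(order_id), amount REAL);

INSERT INTO region VALUES (1, 'East'), (2, 'West');
INSERT INTO customer VALUES (10, 1), (11, 2);
INSERT INTO orders VALUES
    (100, 10, '2014-01-15'), (101, 11, '2014-01-20'), (102, 10, '2014-02-03');
INSERT INTO order_line VALUES (100, 25.0), (100, 75.0), (101, 40.0), (102, 60.0);

-- Hypothetical star schema target with a monthly time context.
CREATE TABLE region_dim (region_key INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE monthly_sales_fact (
    month TEXT,
    region_key INTEGER REFERENCES region_dim(region_key),
    revenue REAL
);
""")

# Process once: the complex join path and aggregation run at load time.
conn.executescript("""
INSERT INTO region_dim SELECT region_id, region_name FROM region;
INSERT INTO monthly_sales_fact
SELECT substr(o.order_date, 1, 7), c.region_id, SUM(l.amount)
FROM order_line AS l
JOIN orders AS o ON o.order_id = l.order_id
JOIN customer AS c ON c.customer_id = o.customer_id
GROUP BY substr(o.order_date, 1, 7), c.region_id;
""")

# Use many times: report queries need only a direct join and GROUP BY.
revenue_by_region = conn.execute("""
    SELECT d.region_name, SUM(f.revenue)
    FROM monthly_sales_fact AS f
    JOIN region_dim AS d ON d.region_key = f.region_key
    GROUP BY d.region_name
    ORDER BY d.region_name
""").fetchall()

print(revenue_by_region)
```

The load step also fixes the grain (month by region), which is exactly the relationship bias that later sections examine.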

5 The Dimensional Model Function

One concludes that the dimensional form is a performance architecture intended to improve report query performance. So far, however, a full understanding of why dimensional models perform so well and what limits them has yet to be developed.

The star schema design is created to measure business. It is created with a business function

orientation, as opposed to the subject area orientation of the ER model.

The form is one of centralization: a series of measures (facts) surrounded by attributes that give business context to those measurements.

While some consumers may refer to the content as subjects, the real orientation is focused on business reporting and analysis. It may be Sales Analysis or Risk Analysis, but these are organized to support specific business functions and not to provide general data as a subject. Instead of presenting data as it exists in an ER model, or in the source, data is organized to support decision making.

Some of Webster's definitions of the word "Information" are:

1. "knowledge obtained from investigation, study, or instruction"
2. "INTELLIGENCE, NEWS"
3. "FACTS, DATA"

Architects do not design dimensional models that deliver measurements (facts) randomly as

data. The purpose is to deliver organized information to the business clients that supports the

client's business decision making function.

To be "information," measures have to be organized and presented with functional context; without that, they are simply data. Providing data is what an ER model does. It delivers data without bias. It is up to the consumer to discern how to make it provide information. In a dimensional model, much of that work of organizing data as information is performed in advance of the report execution.

Therefore, a primary function for which the dimensional form is employed is that of a

performance architecture built upon the direct structuring of information for specific business

function.

It is important to make this distinction because there are other means of implementing

performance architectures for delivering information that do not rely on data denormalizations

in a database.

This is not to say that dimensional model content is the final state of the information organization. In systems that employ the dimensional form, it represents the foundational state of information that is further organized into reporting to deliver KPIs, comparisons, trends, graphics and other business-oriented presentations of information.

6 The Limits of Single Form Design

All that has been examined to this point represents the foundation for the remaining

examination.

Architects realize that there are limits to form. An automobile maker creates a variety of forms for different functional needs. Each of those forms has recognizable limits. A Freightliner semi-truck with a raised roof sleeper, Hendrickson AIRTEK axles and front suspensions is designed for long distance freight hauling in comfort, but it is not functional for the morning commute. One might drive it downtown, but the fuel consumption empties the wallet and, guaranteed, it won't fit in the parking garage.

Clearly design form has limits. The architect's role is to understand those design form limits and produce system designs using integrated design forms to fulfill functional requirements.

Form here includes not only model forms but also a wide variety of technology-based design forms.

6.1 Function Limiting Characteristics of the Dimensional Form

The dimensional model is a powerful performance architecture form for the delivery of

information to businesses when properly applied. Like the ER form, the dimensional form has

limitations in its recognized function.

6.1.1 The Dimensional Form Does Not Extend Well

Ability to extend is a relative evaluation comparing one form to another. The evaluation is

really about how much disruption to process, existing data and retesting is involved in existing

implementations.

Purveyors of the dimensional model sometimes state that extending the dimensional form is as

easy as adding new attributes to dimensions, or new dimensions and dimensional keys to an

existing fact table from a specific point in time forward, and backfilling attributes and foreign

keys with the standard defaults for NULL or Not Applicable definition.

The reality of dimensional model extension is rather different.

1. Changes in Processing

Even when this approach can be taken, the addition of new content means there is a change

in existing processing. Aside from additional sourcing, the processing typically involves

integration with content sourced from multiple entity sources. If the target is an existing fact

with new dimensionality, the amount of disruption will depend on whether the new dimension

needs to be set in the primary key or not.

2. Effects of Historic Context Changes

This also assumes that the historic data context does not need to be updated to reflect the new content. If it does, then Type-2 dimension keys will need to be restated to take into account a new source of temporal attribute change. This in turn means that dimension key references on the facts need to be restated, which results in complete rebuilds of fact tables.
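
The cascade can be illustrated with a small Python sketch. The dimension rows, fact rows, attribute names and the helper functions split_version and rekey_facts are all hypothetical; the sketch assumes a Type-2 dimension in which every version row has its own surrogate key and facts reference the version in effect on the fact date.

```python
from datetime import date

END = date(9999, 12, 31)

# Hypothetical Type-2 dimension: one version row per period of stable attributes.
customer_dim = [
    {"key": 1, "customer_id": 7, "valid_from": date(2014, 1, 1),
     "valid_to": END, "plan": "Basic"},
]
# Hypothetical fact rows carrying the surrogate key of the version in effect.
usage_fact = [
    {"day": date(2014, 1, 10), "customer_key": 1, "minutes": 30},
    {"day": date(2014, 2, 10), "customer_key": 1, "minutes": 45},
    {"day": date(2014, 3, 10), "customer_key": 1, "minutes": 20},
]

def split_version(dim, customer_id, change_day, attr, old_value, new_value):
    """Backfill a new historic attribute that changed on change_day."""
    next_key = max(row["key"] for row in dim) + 1
    restated = []
    for row in dim:
        spans_change = row["valid_from"] < change_day < row["valid_to"]
        if row["customer_id"] == customer_id and spans_change:
            # The old version is closed and a new version key is issued.
            restated.append({**row, "valid_to": change_day, attr: old_value})
            restated.append({**row, "key": next_key,
                             "valid_from": change_day, attr: new_value})
            next_key += 1
        else:
            restated.append({**row, attr: row.get(attr)})
    return restated

def rekey_facts(facts, dim, natural_key_of):
    """Re-point every fact at the version row that covers its date."""
    rekeyed = []
    for fact in facts:
        customer_id = natural_key_of[fact["customer_key"]]
        version = next(r for r in dim
                       if r["customer_id"] == customer_id
                       and r["valid_from"] <= fact["day"] < r["valid_to"])
        rekeyed.append({**fact, "customer_key": version["key"]})
    return rekeyed

natural_key_of = {row["key"]: row["customer_id"] for row in customer_dim}
new_dim = split_version(customer_dim, 7, date(2014, 2, 1),
                        "segment", "Consumer", "Business")
new_fact = rekey_facts(usage_fact, new_dim, natural_key_of)
changed = sum(1 for old, new in zip(usage_fact, new_fact)
              if old["customer_key"] != new["customer_key"])
print(len(new_dim), changed)
```

One backfilled attribute change splits a single version row in two, and every fact dated on or after the change has to be rewritten with the new key. At warehouse volumes that rewrite amounts to a fact table rebuild.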

3. Cost of Disruption Often Avoided

These are often the primary reasons why star schemas in some implementations don't get updated, or are not updated for long periods of time, even when business needs change. The development and testing time needed to implement these changes is painful for some clients.

4. Comparison to the Entity Relationship Form

The ER model form is one that was designed explicitly for extensibility. First, by modeling based

on pure entities and identifying keys and attributes that describe those entities, there is a very

good chance of ensuring that entity definitions are complete and less likely to need new

attributes in the future.

ER modeling is based on subject area organization. The subject area is typically associated

with a single prime entity that serves as a parent and ancestor to all other content in the

subject area. Modeling should always be based on parent dependencies. Because of this,

entities left for later are child entities. As new entities are added to the model using foreign

keys from existing parent entities rather than adding a parent, the need to add new foreign

keys to existing entities is eliminated.

Because there is no disruption to parent entities in the addition of a new one, only new load

processes are added instead of changing existing processes of the surrounding entities.

If new historic attributes are added to existing entities of an ER-modeled EDW, there is no disruption to the key structures cascading to other referencing entities. In an ER-modeled EDW the entity's primary key (a surrogate key) substitutes only for the natural key and not for the temporal key of the attribute's historic context. The temporal context of the data is not transferred through the foreign key reference, and therefore a change in historic context never forces changes to foreign key values in any related entity.
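
A minimal sketch of this behavior, using Python's standard-library sqlite3 module. The tables are hypothetical, and keeping the attribute history in a separate table keyed by surrogate key plus effective date is an assumption made for illustration; it is only one possible way to hold temporal context outside the foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical ER-modeled entities: the surrogate key stands in for the
-- natural key (customer_number) only, with no temporal component.
CREATE TABLE customer (
    customer_key INTEGER PRIMARY KEY,
    customer_number TEXT UNIQUE
);
CREATE TABLE subscription (
    subscription_key INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES customer(customer_key)
);
INSERT INTO customer VALUES (1, 'C-0007');
INSERT INTO subscription VALUES (500, 1), (501, 1);
""")

references_sql = (
    "SELECT subscription_key, customer_key FROM subscription "
    "ORDER BY subscription_key"
)
before = conn.execute(references_sql).fetchall()

# Add a new historic attribute, including backfilled history.
conn.executescript("""
CREATE TABLE customer_segment_history (
    customer_key INTEGER REFERENCES customer(customer_key),
    valid_from TEXT,
    segment TEXT,
    PRIMARY KEY (customer_key, valid_from)
);
INSERT INTO customer_segment_history VALUES
    (1, '2014-01-01', 'Consumer'), (1, '2014-02-01', 'Business');
""")

after = conn.execute(references_sql).fetchall()
print(before == after)
```

The referencing subscription rows are untouched because their foreign key never carried the history in the first place.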

6.1.2 The Dimensional Form Is Not Flexible

The star schema is built to answer the business questions for which it was designed. It is a

performance architecture based on the creation of the relational biases and inclusion of

processing that supports its performance relative to the range of business analysis and reports

intended.

An individual star schema has a limited range of business questions that can be asked of it. This is due not only to the fact that the measures or facts represented on the fact table relate to limited business activity, but also to limits imposed by the reporting role context of the dimensional relationships.

For example, in the cellular industry, reporting revenue and network usage together is often

needed, since network usage partially drives revenue. Revenue is realized by monthly bill

cycle, which overlaps two months.

Since reporting covers many millions of customers or subscriptions, detail billing and usage data must be aggregated to the bill cycle level.

In typical reporting, a bill cycle context is often used. If, however, reporting needs a monthly context, a separate fact table is required.
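
A short Python sketch shows why the second fact table is needed. The detail rows and the mid-month cycle boundary (the 15th) are assumptions chosen for illustration; the helper names bill_cycle and aggregate are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical detail rows: (usage day, billed amount). The assumed bill cycle
# runs from the 15th of one month through the 14th of the next, so each cycle
# overlaps two calendar months.
detail = [
    (date(2014, 1, 20), 10.0),
    (date(2014, 2, 5), 15.0),
    (date(2014, 2, 20), 5.0),
    (date(2014, 3, 10), 20.0),
]

def bill_cycle(day):
    """Label a day with the (year, month) in which its bill cycle started."""
    if day.day >= 15:
        return (day.year, day.month)
    # Days 1-14 belong to the cycle that began on the 15th of the prior month.
    return (day.year - 1, 12) if day.month == 1 else (day.year, day.month - 1)

def aggregate(rows, key):
    """Sum amounts by the time context produced by key(day)."""
    totals = defaultdict(float)
    for day, amount in rows:
        totals[key(day)] += amount
    return dict(totals)

# Two different time contexts require two different aggregate fact tables.
bill_cycle_fact = aggregate(detail, bill_cycle)
monthly_fact = aggregate(detail, lambda day: (day.year, day.month))

print(bill_cycle_fact)
print(monthly_fact)
```

Each bill cycle row blends amounts from two calendar months, so the February total in monthly_fact cannot be reconstructed from bill_cycle_fact alone; it has to be built from the detail.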

Additionally, one can only measure facts by the contextual role-based relationships that have been foreseen and included by way of dimensional relationship to the fact. If one wants to measure the success of a sales channel by revenue generation, then a sales channel dimension has to be associated to the fact based on the channel the subscription was sold through. The same would be true of measuring program success and measuring promotion effectiveness.

To ask additional questions or even broaden the question, additional star schemas capable of answering those questions need to be designed or modifications to existing designs need to be made. It must be recognized that new questions in the future, if not just variations of old ones, will likely require further star schema development.

6.1.3 The Form Does Not Describe the Business

The fact that the dimensional form does not describe subject content is at the heart of the form's inflexibility. Yes, the form has attribution describing certain dimensions, but one cannot look at a dimensional model and derive an understanding of how the business works in relation to the entities that make up the business. Nor can one discern the business relationships that exist between business entities.

This is the major limiting factor to its broader use when an Enterprise Data Warehouse needs to function as a central data repository solution, able to deliver any form of business data required for any use.

Once data is dimensionally cast for business function, as opposed to modeled for business description, the denormalization of data and relationships eliminates the ability to understand the basis of the denormalizations and how they were applied in the first place. Only a reporting relationship can be determined, rather than a business data relationship.

There may be a model in the architect's head as to how the entities represented in the dimension are actually related, but there is nothing in the dimensional model that describes the business.

The guidance of an ER model, either documented or undocumented (mental understanding)

is required to know how the entities and relationships describe the business model in order to

build other star schemas.

The star schema is incapable of supporting functions broader than its purpose. While it is possible to join facts from two or more star schemas, the business context of the questions asked of the combined fact tables is limited to the fact tables' common dimensional context. Because of this, it is possible that the context of the questions asked of the combined fact tables is more limited than that of a single fact table.
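
The narrowing of context can be sketched with Python's standard-library sqlite3 module. The two fact tables and their columns are hypothetical; they share only the month context, so the combined result can be reported by month and nothing else.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical fact tables: each has one dimension the other lacks.
CREATE TABLE revenue_fact (month TEXT, channel TEXT, revenue REAL);
CREATE TABLE usage_fact (month TEXT, network_type TEXT, minutes INTEGER);
INSERT INTO revenue_fact VALUES
    ('2014-01', 'Retail', 100.0), ('2014-01', 'Online', 50.0),
    ('2014-02', 'Retail', 80.0);
INSERT INTO usage_fact VALUES
    ('2014-01', '3G', 300), ('2014-01', '4G', 700), ('2014-02', '4G', 900);
""")

# Each fact is first aggregated to the shared context (month), then joined.
# Channel and network_type drop out because neither exists on both facts.
combined = conn.execute("""
    SELECT r.month, r.revenue, u.minutes
    FROM (SELECT month, SUM(revenue) AS revenue
          FROM revenue_fact GROUP BY month) AS r
    JOIN (SELECT month, SUM(minutes) AS minutes
          FROM usage_fact GROUP BY month) AS u
      ON u.month = r.month
    ORDER BY r.month
""").fetchall()

print(combined)
```

A question such as "revenue by channel against minutes by network type" has no answer here, even though each fact table on its own supports one half of it.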

7 Applying the Dimensional Form without Requirements

It is evident from the examination of the dimensional form that its design needs to be guided

by detailed use requirements. Without those requirements, one cannot properly identify

needed business measurements and align those measurements with business function. This

alignment of measures to business function is crucial to not only attaining the performance

goals of the dimensional model, but also producing a usable model.

Yet practitioners still try to deliver data warehouses based solely on "data requirements" because it is believed that an EDW should be dimensional. These practitioners don't realize that no rules dictate a specific model organization for an EDW except those of form and function.

In his career, the author has reviewed a number of EDW implementations, some of which

completely failed and others breathing on political life support. The following two examples

outline the consequences of delivering dimensional form without the instruction of use

requirements.

7.1 Client A

Client A had a data warehouse built by a consulting firm. The "architect" who drafted the solution document explained that the dimensional form was chosen because "everyone knows that an EDW has to be dimensional." Reporting and use requirements were never documented, but a dimensional model was implemented nonetheless. Most of the common characteristics, documented in the next section, were present in the implementation.

Client A experienced a user revolt. IT delivered an EDW at significant expense that made no sense to the business users. The EDW could not answer business questions the users posed because it did not conform to any real business use requirements.

Instead, the users insisted on using the EDW source staging area because it was somewhat

normalized and left the data in a state still capable of delivering on their use requirements.

7.2 Client B

Client B also hired a consulting firm to build its EDW system. Client B recognized the system had

great difficulty delivering reports, but, due to the client's large investment, the system was not

deemed a failure.

A more detailed evaluation revealed several errors related to the data model. The system's "architect" did not first design an overall solution framework based on the client's needs, but instead took a piecemeal approach and built on a component-by-component basis.

The "architect" first delivered an ODS that was Entity Relationship modeled. In this case the ODS had no other purpose than to serve as a data integration area. It fed no other system and provided no other functionality.

In an attempt to create value for the ODS, the architect decided that the EDW had to be

dimensionally modeled, because the optics of an identically modeled ODS and EDW would

point to the fact that the ODS was nothing more than a very expensive staging area.

Because the implementation was based on a bottom-up approach and IT was well insulated

from the business, the client could not gather use requirements.

A dimensional model with all of the Common Characteristics of a "dimensional" form without a purpose (detailed in the next section) was developed and implemented for the EDW.

An application that presented EDW-sourced data to the client's customers was then developed. The application needed the historic perspective of the EDW; however, the application's data requirements dictated normalized data. Therefore additional processing had to be developed to properly re-normalize the data from the "dimensionally" denormalized data of the EDW.

The application's model was remarkably similar to that of the ODS.

The EDW system was deployed on a very robust MPP server cluster. It had the horsepower to deliver, from an ER model, the answers the business was starting to ask for. However, because the data was organized in a form not guided by requirements, but by making the data "look" dimensional, the architect and modeler had baked in relational biases that were difficult or impossible to resolve by query alone.

After reviewing the model and the use requirements the client had started to see, the author

told Client B that the data marts could NOT be virtualized but needed to be physically

implemented. The reasoning was delivered with a certain amount of political sensitivity

because the IT client could not afford to have the project viewed as a failure.

In the author's opinion, the system was an architectural failure. Client B had implemented the

EDW on a database server capable of virtualizing much of the information delivery through

SQL processing for business reporting from an ER model. The architect and modeler had

created such disorder that Client B now had to spend a lot more money to write processing

that would undo the denormalization as part of creating the real reporting model that was

needed.

7.3 Common Characteristics

Both of these clients' implementations applied the same artificial dimensional form to the warehouse, guided not by reporting or analytics requirements, but by a modeler's imagination of how data could be "made dimensional" or how the client "might" use the data. Since it has been established that the dimensional form needs to be aligned for business function, the use of the term "artificial dimensional form" is warranted anytime dimensional model design is not guided by use requirements, which define the function.

In both of these examples the model practitioners did not understand that the dimensional

form only gains its performance architecture status when it is functionally aligned to deliver

actual reporting and analysis capabilities (usable information).

Most modelers who attempt to model dimensionally without requirements believe that by shaping the data in a dimensional form they are somehow making it easier to use, and that the users will be able to make their reporting work with the new model.

Typical patterns that are repeated by modelers in this approach are evident in the following

features:

1. All facts containing measurement data are single transaction based facts

Transactions are automatically converted to facts because these are a source basis for

many business measures. Typically though, a good deal of business reporting does not look

at a single type of business activity in isolation.

For instance, in the cellular phone industry, reports combine subscription base, additions

and defections as well as contract renewals to provide context.

Combining multiple fact tables requires preplanning to ensure that proper common dimensional context supports the reporting requirement. Additionally, joining multiple fact tables, unless planned for, may not support performance SLAs.

Because no use requirements exist, there is nothing to guide the modeler to produce facts that would facilitate a specific business function or SLA.

Typically fact tables created in this manner have a dimensional context based on

relationships that are immediately associated with the transaction in source. They may also

associate the fact to additional dimensions based on entities that have a parent

relationship to those entities forming the basis of dimensions immediately associated with

the fact or transaction.

Finally, the modeler might create any derived relationships to the fact if they learn the

client finds that reference useful.

The point is that when real reporting requirements arise that demand additional dimensional relationships and relationships based on additional role types, the fact table needs to be rebuilt anyway.

2. A one-to-one relationship exists between a very large fact and a very large dimension

The condition arises when the modeler directly casts a transaction into a fact and the fact has many non-measurement attributes with which to contend. The modeler believes these cannot be left as degenerate dimensions on the fact, and the attributes have to go somewhere.

Based on requirements one would normally identify which of these are actually needed for

business consumption. Further, it would likely be found that some actually describe another

dimension or that they can be logically grouped into multiple junk dimensions.

The use of a one-to-one dimension is counterproductive to the function of the dimensional

model. I/O is the major factor in database performance. Additional joins in the star schema

when joining to small dimensions have minimal impact on performance when compared

to significant increases in I/O for multi-table joins based on large fact cardinality.

Measures used together in reporting are organized on the same fact table to eliminate

fact-to-fact joins, which represent additional I/O. A join of a one-to-one fact-to-dimension

relationship is no different than a fact-to-fact join. And that I/O will occur every time a user

needs as little as a single attribute from that dimension.

Joins do not by themselves significantly decrease performance in a star schema, I/O does.

The smaller the dimension the more likely the dimension is to be cached in memory.

Unless the implementation is on a Columnar Database, this practice is counter to the

performance architecture function of the dimensional form.

3. Presence of "Factless" Facts

What does one do with reference data not directly related to a fact table if one thinks the

reference data belongs in the data warehouse, yet has no supporting reporting

requirements? One creates a Factless Fact!

Turn all those entities into dimensions, eliminate the business rule based relationships between them, create a "fact" table, and associate all the entity based dimensions with one another by way of the fact, thus obscuring an understanding of the natural relationship of the entities to one another. By creating a data bias not informed by requirements, the modeler has no idea whether the bias will actually be useful to the function of the star.

The legitimate purpose of a factless fact is to produce row counts based on a requirement for reporting. Factless Facts are typically rare in a model based on real use requirements. They are common in dimensional models driven by the desire to cast data in the dimensional "form" where no requirements exist.

4. Many-to-Many relationships are left unresolved

Data architects work with business users to eliminate many-to-many relationships in a

dimensional model because it is understood that if improperly used, the many-to-many

causes a duplication of measurement rows in the output.

There are several techniques to eliminate the many-to-many relationship. The first is to create specific role based relationships of single members of the many-to-many dimension. Another is to "flatten out" the many to a single row that represents the mix of the combinations that the many represent. This has to be done based on requirements and by working with the business to produce a representation that provides them the information they need in the report.

A typed many-to-many that can identify a single typed row is the last resort for inclusion of

a many-to-many in a dimensional model.

To leave many-to-many dimensional relationships unresolved is to leave a reporting

accident waiting to happen and the dimension useless for reporting purposes.
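The row duplication described under characteristic 4 can be illustrated with a minimal sketch. The table and column names below (fact_balance, account_customer, role) are hypothetical and are not taken from either client's model, and SQLite is used only because it ships with Python. The sketch compares a total computed through an unresolved many-to-many bridge with the same total computed through a role-based relationship.

```python
import sqlite3

# Hypothetical schema for illustration only: an account-level fact and a
# bridge that relates accounts to customers many-to-many.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_balance (account_id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE account_customer (
        account_id INTEGER, customer_id INTEGER, role TEXT
    );
    INSERT INTO fact_balance VALUES (1, 100.0), (2, 50.0);
    -- Account 1 is a joint account held by two customers.
    INSERT INTO account_customer VALUES
        (1, 10, 'PRIMARY'), (1, 11, 'JOINT'), (2, 12, 'PRIMARY');
""")

# Joining through the unresolved bridge repeats the measure for account 1.
naive_total = conn.execute("""
    SELECT SUM(f.balance)
    FROM fact_balance f
    JOIN account_customer ac ON ac.account_id = f.account_id
""").fetchone()[0]

# Role-based resolution: only one typed member per account takes part.
role_total = conn.execute("""
    SELECT SUM(f.balance)
    FROM fact_balance f
    JOIN account_customer ac
      ON ac.account_id = f.account_id AND ac.role = 'PRIMARY'
""").fetchone()[0]

true_total = conn.execute("SELECT SUM(balance) FROM fact_balance").fetchone()[0]
print(naive_total, role_total, true_total)  # prints: 250.0 150.0 150.0
```

The naive join overstates the total because the joint account is counted once per related customer. Which role should carry the measure is a business decision, which is why the resolution has to be driven by requirements.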

The consequence of misapplying the dimensional form is the creation of a good deal of unnecessary dysfunction for the client later when they actually need to use the warehouse for business. These implementations occur whenever the modeler has to provision data in the warehouse based on a bottom-up approach (data requirements first, reporting requirements sometime later) and believes a dimensional model must be delivered.

7.4 Bottom-Up Warehouse Design

It is always desirable to drive any data warehouse implementation on actual use

requirements, even when this means extensive business consulting to help the organization

understand how best to measure their business.

However, there are a number of clients who cannot commit to the process of requirements

analysis that informs data warehouse design.

It should be recognized that many companies implementing a Data Warehouse or enabling Business Intelligence for the first time are, in fact, just starting a process of transitioning from an operational focus to one that is more strategic. They are determining how they should measure or gain business insights, and how to focus attention on customers.

The clients can't always determine their requirements at the early stages of the process but

they do benefit from business data availability.

While the process of this focus shift progresses, reporting requirements change rapidly and the

data warehouse must be able to respond as the business identifies useful information.

Fulfilling a data requirement still needs to be guided by analysis that determines the subject areas and sources necessary to enable specific business analysis capabilities. Doing so not only ensures that data delivered is useful, but also allows delivery of priority capabilities first.

The data warehouse in these cases needs to be easily extendable to assimilate new subject

content when missing data is identified. As implied, it has to be flexible to answer questions not

yet thought of.

Flexibility, to answer any question, means that every subject area in the EDW can be cast and

recast in structure and content organization to meet any function or functions the business

has.

The functionality described here is that of a central business data repository. It is fulfilled by a normalized Entity Relationship Model (most likely with modest denormalizations such as allowing some repeating groups).

Data from the central repository, called the data warehouse or "Enterprise Data Warehouse", will feed the organized information components of the solution architecture. Some call these components data marts, some call them information delivery. They are typically in the dimensional form to efficiently deliver functionally aligned measurements to the business for reporting and analysis.

In a top-down approach where reporting and use requirements guide the design of the EDW, Information Delivery structures are directly defined. However, in some cases there may still be functional reasons that a central repository is needed.

Additionally, there are a number of ways to deliver dimensional capabilities with BI tools,

including BI cube functionality. It is a waste of company resources to deliver a star schema

with the sole purpose of creating a cube if the cube can be created by querying an ER source model.

When a client can articulate reporting requirements, and there are no other functions that require an ER business based model as the foundation of the warehouse, there is no reason for the client to pay for more functionality than is needed.

8 System Architecture Form to Fulfill Multiple Functions

It is common for clients to have system requirements calling for the support of multiple functions as implied in prior sections. The distinction of functional difference between a central business data repository and organized information structures exists largely because form does determine function, which was Wright's point.

These two functions are in conflict with one another because the model form required for each does not support the other's function on the most common technology used for data warehouses today, SMP servers.

The application of multiple forms to support the multiple functions in a system solution is the means by which to deal with the limits of a single form.

8.1 Combining Model Forms

Bill Inmon's Corporate Information Factory is a concept that was developed to deal with such requirements. The central idea is to build a central historic data warehouse repository that is entity relationship modeled for flexibility, extensibility and business descriptiveness. This is the foundation form from which various information delivery structures can be created in response to use requirements.

The advantage that this system architecture form has is that it can accomplish what two

individual model forms cannot by themselves.

With a well-documented ER business modeled foundation of the data warehouse in place,

any information delivery form can be cast and recast as requirements change, always more

easily than from an operational system source. This is because the data warehouse model is

based on business organization and presentation of the data, rather than how the operational

system stores and treats data. The differences between these perspectives can be significant.

Additionally the data warehouse can also contain a number of architectural features and

structures put in place to ease the delivery of common dimensional patterns used across

many fact tables.

Many times dimensional components such as dimensions and even simpler facts can be the

result of database or materialized views, eliminating the expense of ETL processing.
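As a minimal sketch of a dimension delivered as a view rather than by ETL, the following example builds two normalized tables and exposes a flattened customer dimension over them. All table, column and view names are hypothetical. SQLite is used for convenience and has no materialized views, so only a plain view is shown; on a platform that supports them, the same SELECT could back a materialized view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (ER) tables; all names are hypothetical.
    CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        customer_name TEXT,
        region_id INTEGER REFERENCES region(region_id)
    );
    INSERT INTO region VALUES (1, 'East'), (2, 'West');
    INSERT INTO customer VALUES (10, 'Acme', 1), (11, 'Globex', 2);

    -- The dimension is a view over the ER tables, so no ETL job loads it.
    CREATE VIEW dim_customer AS
    SELECT c.customer_id, c.customer_name, r.region_name
    FROM customer c
    JOIN region r ON r.region_id = c.region_id;
""")

dim_rows = conn.execute(
    "SELECT customer_id, customer_name, region_name "
    "FROM dim_customer ORDER BY customer_id"
).fetchall()
print(dim_rows)  # prints: [(10, 'Acme', 'East'), (11, 'Globex', 'West')]
```

Because the view is only a stored query, a change to the normalized tables is visible in the dimension immediately, and recasting the dimension means redefining the view rather than rebuilding a load process.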

The limits of this form are cost and time. This form is typically viewed as the most expensive way to deliver a data warehouse and also as the slowest. While the observation regarding cost is valid in the short run, the form can deliver more efficiency and value for many clients facing requirement changes as part of their typical business cycle.

As for time, the way the delivery is organized can greatly influence implementation. The ER model form is built to absorb new content easily. This means ER models are not built monolithically, but by subject area. Functional information delivery can drive the subject content order of the EDW. Don't wait for the entire EDW to be delivered to deliver reporting, but rather prioritize subject content delivery to the EDW based on reporting requirements.

The key here is to define the architecture that fits the customer's needs. The client should not pay for an architecture they don't need, nor should they rely on an architecture that doesn't fulfill their information management system requirements.

8.2 Integrating Model Form with Technology Form

Combining the ER form with MPP database server architecture creates a high degree of data

warehouse flexibility.

The MPP form brings the ability to apply linearly scalable parallel processing to the data

warehouse. The technology form of the MPP architecture is its shared-nothing, multi-node

distributed computing environment. Once data is properly distributed, parallel query

optimization is far more straightforward than that of the SMP servers.
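The effect of proper distribution can be sketched in a few lines. The example below is a toy model only, not how any particular MPP product is implemented: the node count, the choice of CRC32 as the hash and the row layout are all assumptions made for illustration. Rows are placed on a node by hashing a distribution key, each node aggregates only its own rows, and the partial results are merged.

```python
import zlib
from collections import defaultdict

NODE_COUNT = 4  # hypothetical number of shared-nothing nodes


def node_for(key: str) -> int:
    """Pick the node that owns a row by hashing its distribution key."""
    return zlib.crc32(key.encode("utf-8")) % NODE_COUNT


def distribute(rows):
    """Place each (customer_id, amount) row on exactly one node."""
    nodes = [[] for _ in range(NODE_COUNT)]
    for customer_id, amount in rows:
        nodes[node_for(customer_id)].append((customer_id, amount))
    return nodes


def local_totals(node_rows):
    """Aggregate per customer using only the rows stored on one node."""
    totals = defaultdict(float)
    for customer_id, amount in node_rows:
        totals[customer_id] += amount
    return dict(totals)


rows = [("c1", 10.0), ("c2", 5.0), ("c1", 2.5), ("c3", 7.0), ("c2", 1.0)]
nodes = distribute(rows)

# All rows for a customer live on one node, so the per-node results never
# overlap and can be merged without a second aggregation step.
merged = {}
for node_rows in nodes:  # a real MPP engine would run these in parallel
    merged.update(local_totals(node_rows))
print(merged)
```

When the grouping or join key matches the distribution key, each node can work independently. When it does not, rows must first be moved between nodes, which is why the choice of distribution key matters.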

Teradata's consulting organization has a bottom-up approach that emphasizes delivering industry-based ER models. They use the power of the database engine to sidestep much of the need for delivering an information organized performance architecture that the dimensional model form represents.

Instead, the implementations virtualize as much of the physical information organization as possible with database views and create lightly processed materializations of information organization structures where additional performance is needed to meet SLAs. This provides the flexibility of the Corporate Information Factory with less development cost than the approach of the prior section, which requires a full suite of physical dimensional schemas supported by ETL processes.

ETL process development represents the most significant labor cost of warehouse system

development.

Clients pay a premium for MPP server technology. The technology exists to provide the ability

to perform significant parallel data processing. It makes some sense to use the technology to

allow for greater flexibility in the data warehouse. The MPP platform itself is a performance

architecture based on technology.

The architect needs to carefully consider the decision of forgoing extensibility and flexibility

when deploying a model based performance architecture designed for the SMP servers on

technology that is a performance architecture itself.

This is not to say that such an implementation is wrong, but implementing dimensional models on such a platform as a default practice is not the practice of an architect.

9 Conclusion

The tool the architect works with is knowledge of form's function-enabling and function-limiting characteristics. While this discussion centers on the relationship that model form and function have with one another, the same principle of form's support for function and form's limiting effect on function applies broadly across all architectural applications and disciplines, whether it is the evaluation of model form, technology form or even methodology.

To be an architect is to be a student of form and function and to apply form based on these principles of form's effects.

The architect's role is to recognize the client's needs and apply form based on those needs. There is always a balancing of the application of form, usually constrained by the client's tolerance for cost. The architect has to ensure the client fully understands the impact of compromises in form's application.

The architect who can engage the client in a fact-based discussion in terms of cause and

effect related to their needs has a much better opportunity to deliver what is appropriate and

ensure client satisfaction.

Architectural leadership is grounded in the knowledge that Form and Function are in fact united in a co-dependent relationship of cause and effect. For the architect, in the application of form, the question of "why" is of far more importance than the statement of "what".