Data Vault Consortium A Mathematical Perspective of Data Vault.

Data Vault Consortium Presentation by @dougneedham

description

Doug Needham's Data Vault presentation discussing a mathematical interpretation of links and its application to business modeling, along with volumetrics and some general principles for Data Vault best practices.

Transcript of Data Vault Consortium A Mathematical Perspective of Data Vault.

Page 1: Data Vault Consortium A Mathematical Perspective of Data Vault.

Data Vault Consortium Presentation by @dougneedham

Page 2: Data Vault Consortium A Mathematical Perspective of Data Vault.

Who are we?

CLEAR MEASURES offers a range of services and solutions designed to satisfy needs shared by firms large and small; and the skills required to make your customized goals a reality. If your goals aren’t yet defined, CLEAR MEASURES can help you define a strategy for managing, analyzing, or visualizing your data in ways that make your path easier to identify.

• Analytics and Intelligence

• Data Integration

• Enterprise Architecture

• Strategic & Project Management

• Cloud Infrastructure

• Database Administration

• System Administration

• Technology Services

Page 3: Data Vault Consortium A Mathematical Perspective of Data Vault.

Who are we?

All our customers have access to:

Capacity: Pay on demand, with 15-minute increments, not the half-day or full-day you pay for a contractor.

Coverage: True 24 x 7 coverage, with in-facility staff directed from our Global Operations Center in Covington, Kentucky.

Cost: CLEAR MEASURES can help your team with effective costs from Rural Sourcing and Global Sourcing locations. The CLEAR MEASURES proprietary ONguard system allows for complete direction of a global workforce with U.S. oversight, focused on efficiency and repeatability.

Page 4: Data Vault Consortium A Mathematical Perspective of Data Vault.

Who am I?

• The Data Guy

• 1st job was Marine Corps DBA supporting the entire Marine Corps at the main site for Systems Software Evaluation.

• First 10 years of my career DBA.

• 20 years of data management.

• Most recent decade building analytical systems.

• Pentaho, Informatica, Business Objects, Cognos, Oracle, SQL Server, MySQL.

• Cloud based Analytics with a large healthcare information company on Cassandra.

• Trying to figure out where Data Science and Big Data fit together with the Data Warehouse.

Page 5: Data Vault Consortium A Mathematical Perspective of Data Vault.

This is the wrong time for Data Science

• It is also the wrong time for a Data Warehouse, Business Intelligence Platform, Data Vault, Data Mining, Big Data, or any other predictive, machine learning, analytics platform.

• Do these projects when things are going well. Anticipate what could happen to prevent things from going poorly.

Page 6: Data Vault Consortium A Mathematical Perspective of Data Vault.

When is the right time?

• If you have multiple systems you need to integrate.

• As you lay the foundation for Self Service Business Intelligence.

• To lay the foundation of a Data as a Service application.

• If you are combining data from many applications, systems, or business units, or you are providing data to many applications, systems, or business units that want data provided to them in slightly different standard feeds.

Page 7: Data Vault Consortium A Mathematical Perspective of Data Vault.

Data Science and The Data Warehouse

• “Data Science is the application of statistical and mathematical rigor to business data.” – Doug

• I have heard it said 80% of data science is data munging.

• Data Vault is: “100% of the data 100% of the time” – Dan L.

• What does this mean?

• What does the data say? Where did the data come from? What happened to the data from the time it was captured until the time it was presented?

• Models, Statistical Models specifically, are the core of Data Science.

• Looking forward to hearing more about DV 2.0 and how it supports Polyglot persistence.

Page 8: Data Vault Consortium A Mathematical Perspective of Data Vault.

Data Science and The Data Warehouse

• By the way, we have been doing this for a while.

• Some data is predictive, all data is instructive.

• Being able to create a statistical model, quickly run lots of data through that statistical model, observe the actual results and compare these with predicted results allows us to refine the statistical model.

• Are Business analysts Data Scientists? What is the main differential between the two?

• Which one “needs” more data? Which one can actually use more data?

Page 9: Data Vault Consortium A Mathematical Perspective of Data Vault.

Quick Trivia

• Who was one of the first Data Scientists?

• Now let’s talk about storing all of this data we collect, and see if there is anything new with our understanding of the structures we are all familiar with.

Page 10: Data Vault Consortium A Mathematical Perspective of Data Vault.

Data Vault

• The integration layer of an overall data warehouse strategy.

• There are other areas of data warehousing:

• Presentation

• Near-Line

• Archive

• Applications within the enterprise are the data capture mechanisms.

• I think everyone is trying to find the best way to leverage a “Big Data” platform into the world of the Data Warehouse.

• Data vault is the mechanism that allows a data warehouse to evolve over time.

• Simple, straightforward, repeatable, auditable, resilient.

Page 11: Data Vault Consortium A Mathematical Perspective of Data Vault.

Modeling

• HUBs – Business Keys

• LNKs – Relationships

• SATs – Contextual data.

• There are other entities in the Data Vault method; however, these are the primary entities. Everything else is functionally dependent on some combination of the above.

• Notice the colors, Hubs one color, Links another, Sats a third. Anything else should be a separate color.

Page 12: Data Vault Consortium A Mathematical Perspective of Data Vault.

HUB

• Business Keys.

• Isolated entities that can stand alone representing a list of unique business keys.

• The collection of business keys for an organization is the answer to the question, “What do we do?”

• Which business key is most important?

• How many edges does it have?
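The "how many edges" question can be explored with a few lines of code. The following is a minimal sketch, assuming a hypothetical model in which each link is listed with the hubs it connects. The table names are invented for illustration, and counting only link edges (ignoring satellites) is an assumption rather than something the slide specifies.

```python
from collections import Counter

# Hypothetical model: each link name maps to the hubs it connects.
LINKS = {
    "LNK_ORDER_CUSTOMER": ["HUB_ORDER", "HUB_CUSTOMER"],
    "LNK_ORDER_PRODUCT": ["HUB_ORDER", "HUB_PRODUCT"],
    "LNK_CUSTOMER_ADDRESS": ["HUB_CUSTOMER", "HUB_ADDRESS"],
    "LNK_ORDER_SHIPMENT": ["HUB_ORDER", "HUB_SHIPMENT"],
}


def hub_edge_counts(links):
    """Count how many links touch each hub (the hub's degree in the model graph)."""
    counts = Counter()
    for hubs in links.values():
        # Assumption: a link that references the same hub twice counts once.
        counts.update(set(hubs))
    return counts


edge_counts = hub_edge_counts(LINKS)
for hub, n in edge_counts.most_common():
    print(hub, n)
```

In this made-up model the order hub has the most edges, which is one possible reading of "most important" business key.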

Page 13: Data Vault Consortium A Mathematical Perspective of Data Vault.

LNK

• Relationships.

• Isolated entities that can stand alone representing a list of unique relationships between business keys.

• The collection of relationships for an organization is the answer to the question, “At what time does whom do what to whom or what?”

• Links are actually very interesting in their own right. We will be speaking further about links specifically a little later in this session.

Page 14: Data Vault Consortium A Mathematical Perspective of Data Vault.

LNK

• How many edges does a link have? The number of incoming edges a Link table has is the number of HUB_SQNs the link is connecting (This includes weak hubs).

• Outgoing Edges are the number of Satellites connected to this Link table.

• What is the ratio of OE/IE?
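The edge counts described above can be computed directly from a description of the link. This is a small sketch under stated assumptions: the `LinkTable` class, the table names, and the column names are hypothetical and exist only to illustrate the OE/IE ratio.

```python
from dataclasses import dataclass, field


@dataclass
class LinkTable:
    name: str
    hub_sqns: list  # HUB_SQN columns in the link, weak hubs included (incoming edges)
    satellites: list = field(default_factory=list)  # satellites on the link (outgoing edges)

    @property
    def incoming_edges(self):
        return len(self.hub_sqns)

    @property
    def outgoing_edges(self):
        return len(self.satellites)

    @property
    def oe_ie_ratio(self):
        # A link connects at least two hubs, so incoming_edges is never zero here.
        return self.outgoing_edges / self.incoming_edges


# Hypothetical link: four hub sequence columns and two satellites.
lnk = LinkTable(
    name="LNK_ORDER_LINE",
    hub_sqns=["HUB_ORDER_SQN", "HUB_CUSTOMER_SQN", "HUB_PRODUCT_SQN", "HUB_DATE_SQN"],
    satellites=["SAT_ORDER_LINE_DETAIL", "SAT_ORDER_LINE_STATUS"],
)
print(lnk.incoming_edges, lnk.outgoing_edges, lnk.oe_ie_ratio)
```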

Page 15: Data Vault Consortium A Mathematical Perspective of Data Vault.

Research in progress

• So what?

• Did you ever wonder if you have your driving keys correct for your link tables?

• How does this affect performance? In some databases the sequence of columns has a direct impact on performance.

• Which column should go first?

• This is the basic math behind simple recommender systems.

• This started with the following question: How many records should a link have?

• The starting point is min(h1,h2) < ROWSINLINK < h1*h2
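The starting-point inequality can be checked mechanically. In the sketch below, h1 and h2 are interpreted as the row counts of the two hubs a link connects; that interpretation, the function names, and the sample numbers are assumptions made for illustration.

```python
def link_row_bounds(h1, h2):
    """Return the (lower, upper) starting-point bounds for a two-hub link."""
    return min(h1, h2), h1 * h2


def within_starting_bounds(h1, h2, rows_in_link):
    """True when min(h1, h2) < rows_in_link < h1 * h2, as stated on the slide."""
    lower, upper = link_row_bounds(h1, h2)
    return lower < rows_in_link < upper


# Hypothetical volumes: 1,000 customers and 50 products.
bounds = link_row_bounds(1_000, 50)
plausible = within_starting_bounds(1_000, 50, 12_000)
suspicious = within_starting_bounds(1_000, 50, 60_000)
print(bounds, plausible, suspicious)
```

A row count outside these bounds does not prove the link is wrong, but it is a cheap signal that the link or its load process deserves a closer look.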

Page 16: Data Vault Consortium A Mathematical Perspective of Data Vault.

Details

• What are UV and RV?

• UV is the Unit Vector.

• RV is a Reference Vector.

• This is how you get these numbers:

    V1 = SELECT rv, COUNT(*) AS uv
         FROM (SELECT HUB_1_SQN, COUNT(*) AS rv
               FROM LNK
               GROUP BY HUB_1_SQN) t
         GROUP BY rv

• Repeat for each HUB identified in the link.
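To see what the query returns, it can be run against a tiny in-memory table. This is a sketch only: the two-hub `LNK` table, its sample rows, and the `rv_uv` helper are hypothetical, and SQLite is used purely because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LNK (HUB_1_SQN INTEGER, HUB_2_SQN INTEGER)")
# Hypothetical link rows: (HUB_1_SQN, HUB_2_SQN)
rows = [(1, 10), (1, 11), (1, 12), (2, 10), (2, 11), (3, 10), (4, 12)]
conn.executemany("INSERT INTO LNK VALUES (?, ?)", rows)


def rv_uv(conn, hub_column):
    """Return [(rv, uv), ...]: uv hub keys each appear in rv link rows.

    hub_column is interpolated into the SQL text, so only pass trusted
    column names.
    """
    query = f"""
        SELECT rv, COUNT(*) AS uv
        FROM (SELECT {hub_column}, COUNT(*) AS rv
              FROM LNK
              GROUP BY {hub_column}) AS t
        GROUP BY rv
        ORDER BY rv
    """
    return conn.execute(query).fetchall()


v1 = rv_uv(conn, "HUB_1_SQN")
v2 = rv_uv(conn, "HUB_2_SQN")
print(v1)
print(v2)
```

Each result is a small histogram: for every distinct number of link rows per hub key (rv), how many hub keys have that count (uv).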

Page 17: Data Vault Consortium A Mathematical Perspective of Data Vault.

Now What?

• Now that I have these numbers, what do I do with them?

• This is one way to confirm the accuracy of the sequencing of your business keys in a link, in order to separate out the driver business key from the dependent keys.

• Are there any other links in the Data Vault that have a similar Cosine?

Page 18: Data Vault Consortium A Mathematical Perspective of Data Vault.

Now What?

• If you have cosine similarity between links does this mean something?

• What is going on in the business? Is it obvious the links are related?

• More importantly, is it not obvious why two links are similar within a margin of error?
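The transcript does not show how the cosine is computed, so the following is only one plausible sketch. It assumes each link is summarized as a histogram mapping rv (link rows per hub key) to uv (number of hub keys with that count), and it treats those histograms as sparse vectors. The function name and the sample histograms are hypothetical.

```python
import math


def cosine_similarity(hist_a, hist_b):
    """Cosine similarity of two sparse vectors given as {rv: uv} dicts."""
    keys = set(hist_a) | set(hist_b)
    dot = sum(hist_a.get(k, 0) * hist_b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in hist_a.values()))
    norm_b = math.sqrt(sum(v * v for v in hist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty histogram is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical histograms for three links.
link_a = {1: 2, 2: 1, 3: 1}
link_b = {1: 4, 2: 2, 3: 2}  # same shape as link_a at twice the volume
link_c = {5: 3, 6: 1}        # no rv values in common with link_a

sim_ab = cosine_similarity(link_a, link_b)
sim_ac = cosine_similarity(link_a, link_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Because cosine similarity ignores magnitude, two links with the same distribution shape score close to 1 even when their row counts differ, which is what makes unexpected matches worth a conversation with the business.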

Page 19: Data Vault Consortium A Mathematical Perspective of Data Vault.

SAT

• Contextual data.

• Detail data. Most pertinent for loading use in downstream systems.

• The “Payload” of the satellite is the data you want to capture.

• The collection of business keys for an organization is the answer to the question “What do we do?”

• Has one edge.

Page 20: Data Vault Consortium A Mathematical Perspective of Data Vault.

Satellite Clustering

• Using some simple k-means clustering with Euclidean distance calculations you can identify divergent rates of change within a satellite.

• This is one way to divaricate satellites coming from a single source table.

• If you are interested in knowing more about this, let me know.
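As a rough illustration of the clustering idea, the sketch below runs plain k-means with Euclidean distance over per-column change rates. Everything specific here is an assumption: the column names, the change rates (fraction of loads in which a column's value changed), the choice of k = 2, and seeding the centroids with the minimum and maximum rates.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain k-means with Euclidean distance; returns (centroids, assignments)."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignments


# Hypothetical change rates for columns of a single source table.
CHANGE_RATES = {
    "customer_name": 0.01,
    "date_of_birth": 0.00,
    "billing_address": 0.03,
    "account_balance": 0.85,
    "last_login_ts": 0.95,
}
columns = list(CHANGE_RATES)
points = [(CHANGE_RATES[c],) for c in columns]
initial = [(min(CHANGE_RATES.values()),), (max(CHANGE_RATES.values()),)]

centroids, assignments = kmeans(points, initial)
groups = {}
for col, cluster in zip(columns, assignments):
    groups.setdefault(cluster, []).append(col)
print(groups)
```

Columns that land in different clusters change at clearly different rates, which makes them candidates for separate satellites.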

Page 21: Data Vault Consortium A Mathematical Perspective of Data Vault.

Philosophies

• From Dan: “100% of the data 100% of the time”

• From Doug: “A model is not valid, until 100% of the model is populated from source systems.”

• Notice I did not say 100% of the data as Dan did.

• During development, the assumptions built into the model have to be validated.

• Designing a proper data vault model does not take very long for those versed in its abilities. Loading the model to validate the assumptions built into the model is paramount to success.

Page 22: Data Vault Consortium A Mathematical Perspective of Data Vault.

Philosophies

• The second portion of this philosophy is to extract data from the Vault to an alternative system, be that star schema, statistical research, data science, Excel, etc. Something downstream needs to be populated FROM the vault.

• In order to know you have a valid model, data must both go in and come out accurately according to business rules.

• This must be done in order to say a particular phase of the development cycle is complete.

• What does complete mean? It means this is the end of the beginning. Welcome to the world of Data Warehouse support, maintenance and evolution.

Page 23: Data Vault Consortium A Mathematical Perspective of Data Vault.

Aesthetics

• One of the most fascinating things about a data vault model - to me - is that it flows quite aesthetically in accordance with the particular business processes the data vault is attempting to model.

• It just makes sense to a variety of users, from technical to executive.

• The following slide is an example of this, where we are modeling a process and something surprising came out of the modelling exercise.

Page 24: Data Vault Consortium A Mathematical Perspective of Data Vault.

What do I mean by Aesthetics?

• Can you do this with another data modeling technique?

Page 25: Data Vault Consortium A Mathematical Perspective of Data Vault.

Architecture

• A data architect understands applications are only the entry point of data into the Enterprise. Data Science makes data forever useful.

Page 26: Data Vault Consortium A Mathematical Perspective of Data Vault.

Volumetrics

• In essence this is the formula for sizing a data vault.

• As discussed previously there are some other interesting mathematics that can be applied to the Data Vault, but this formula is the one DBAs would be most interested in.

• How many rows are inserted per some time period in each of the Hubs, Links and Satellites? This is the core formula.

• Other tables such as Bridges, PIT or staging tables may or may not be volatile, but those should be taken into consideration as well.

• Recently I did a webinar with Gazzang on Sizing the Data systems of an Enterprise Architecture. This is not a simple matter, yet it is one of the first things that will hit the bottom line of an organization doing a new project.
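The formula itself is not reproduced in this transcript, so the following is a back-of-the-envelope sketch built only from the description above: rows inserted per time period for each table, multiplied by an average row width and the number of periods. The table names, row counts, and row widths are hypothetical, and index, compression, and storage overhead are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class TableGrowth:
    name: str
    rows_per_day: int   # rows inserted per time period (here: per day)
    avg_row_bytes: int  # assumed average row width


def projected_bytes(tables, days):
    """Raw data volume after `days`, summed across all tables."""
    return sum(t.rows_per_day * t.avg_row_bytes * days for t in tables)


# Hypothetical growth figures for one hub, one link and one satellite.
TABLES = [
    TableGrowth("HUB_CUSTOMER", 1_000, 40),
    TableGrowth("LNK_ORDER_CUSTOMER", 20_000, 48),
    TableGrowth("SAT_CUSTOMER_DETAIL", 5_000, 200),
]

one_year = projected_bytes(TABLES, 365)
print(f"{one_year / 1e9:.2f} GB of raw data after one year")
```

Bridges, PIT tables, and staging tables can be added to the same list once their insert rates are known.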

Page 27: Data Vault Consortium A Mathematical Perspective of Data Vault.

Summary

• One of the main reasons Architects are constantly studying designs is they are continuously looking for ways not just to create something new, but to reduce new problems to ones already solved. The same thing can be said for Mathematicians, Engineers, Physicists, even managers and executives.

• The Data Vault is a repeatable pattern for database design when that database is to be used for integration of multiple systems. There are many other uses for Data Vault, of course, but this is the first principle of why the data vault exists.

• As we learn from prior implementations, be they our own, or from someone else, let us continuously strive to not only reduce problems to those already solved but look for, and discuss these repeatable patterns of Data Vault design.

Page 28: Data Vault Consortium A Mathematical Perspective of Data Vault.

Final thoughts

• With the Data Vault, the structure itself has meaning.

• This is a feature that I believe is unique to Data Vault modeling.

• Our email contact information:

• [email protected]

• [email protected]