Agile Data Engineering - Intro to Data Vault Modeling (2016)
KENT GRAZIANO
AGILE DATA ENGINEERING: INTRODUCTION TO DATA VAULT DATA
MODELING
@KentGraziano kentgraziano.com
2
Agenda
Bio
What do we mean by Agile?
What is a Data Vault?
Where does it fit in a DW/BI architecture?
How to design a Data Vault model
Being “agile” with Data Vault
What’s new in DV 2.0
3
My Bio
› Senior Technical Evangelist, Snowflake Computing
› Oracle ACE Director (BI/DW)
› Certified Data Vault Master and DV 2.0 Practitioner
› Data Modeling, Data Architecture and Data Warehouse Specialist
• 30+ years in IT
• 25+ years of Oracle-related work
• 20+ years of data warehousing experience
› Member – DAMA Houston
› Former Member: Boulder BI Brain Trust (http://www.boulderbibraintrust.org/)
› Author & Co-Author of a bunch of books
• The Business of Data Vault Modeling
• The Data Model Resource Book (1st Edition)
› Blogger: The Data Warrior
› Past-President of Oracle Development Tools User Group and Rocky Mountain Oracle User Group
4
Manifesto for Agile Software Development
“We are uncovering better ways of developing software by doing it and helping others do it.
Through this work we have come to value:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more.”
http://agilemanifesto.org/
5
Applying the Agile Manifesto to DW
(C) Kent Graziano
User Stories instead of requirements documents
Time-boxed iterations
› Iteration has a standard length
› Choose one or more user stories to fit in that iteration
Rework is part of the game
› There are no “missed requirements”... only those that haven’t been delivered or discovered yet.
6
Data Vault Definition
TDAN.com Article
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of
business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable,
consistent and adaptable to the needs of the enterprise.
Architected specifically to meet the needs of today’s enterprise data warehouses
DAN LINSTEDT: Defining the Data Vault
7
What is Data Vault Trying to Solve?
(C) Kent Graziano
What are our other Enterprise Data Warehouse options?
› Third-Normal Form (3NF): Complex primary keys (PKs) with cascading snapshot dates
› Star Schema (Dimensional): Difficult to reengineer fact tables for granularity changes
Difficult to get it right the first time
Not adaptable to rapid business change
NOT AGILE!
8
Data Vault Time Line
© LearnDataVault.com
Timeline, 1960 to 2000:
E.F. Codd invented relational modeling
Chris Date and Hugh Darwen maintained and refined modeling
1976 Dr Peter Chen created E-R Diagramming
Early 70’s Bill Inmon began discussing Data Warehousing
Mid 60’s Dimension & Fact Modeling presented by General Mills and Dartmouth University
Mid 70’s AC Nielsen popularized Dimension & Fact terms
Mid – Late 80’s Dr Kimball popularizes Star Schema
Mid 80’s Bill Inmon popularizes Data Warehousing
Late 80’s – Barry Devlin and Dr Kimball release “Business Data Warehouse”
1990 – Dan Linstedt begins R&D on Data Vault Modeling
2000 – Dan Linstedt releases first 5 articles on Data Vault Modeling
9
Data Vault Evolution
(C) Kent Graziano
The work on the Data Vault approach began in the early 1990s and was completed around 1999.
Throughout 1999, 2000, and 2001, the Data Vault design was tested, refined, and deployed
into specific customer sites.
In 2002, the industry thought leaders were asked to review the architecture.
This is when I attended my first DV seminar in Denver and met Dan!
In 2003, Dan began teaching the modeling techniques to the general public.
In 2014, Dan introduced DV 2.0!
10
Where does a Data Vault Fit?
© LearnDataVault.com
Diagram: STAGING feeds the EDW (DATA VAULT), which feeds multiple DATA MARTS (STAR SCHEMAS)
11
Where does Data Vault fit?
©Oracle Corp
Data Vault goes here
12
Data Vault: 3 Simple Structures
© LearnDataVault.com
EDW (DATA VAULT)
01 HUB
02 LINK
03 SATELLITE
13
Data Vault Core Architecture
© LearnDataVault.com
HUBS
Unique List of Business Keys
LINKS
Unique List of Relationships across keys
SATELLITES
Descriptive Data
› Satellites have one and only one parent table
› Satellites cannot be “Parents” to other tables
› Hubs cannot be child tables
14
Common Attributes
© LearnDataVault.com
Required – all structures
› Primary key – PK
› Load date time stamp – DTS
› Record source – REC_SRC
Required – Satellites only
› Load end date time stamp – LEDTS
› Optional in DV 2.0
Optional – Hubs & Links only
› Last seen dates – LSDTs
› MD5KEY (REQUIRED IN DV 2.0)
Optional – Satellites only
› Load sequence ID – LDSEQ_ID
› Update user – UPDT_USER
› Update DTS – UPDT_DTS
› MD5DIFF
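A minimal sketch of how the required attributes above could be carried on each structure. The class and field names (HubRow, LinkRow, SatelliteRow, hub_key, parent_key and so on) are illustrative assumptions, not names from the Data Vault standard; only PK, load DTS, REC_SRC and LEDTS come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class HubRow:
    hub_key: str            # primary key (PK)
    business_key: str       # the business key the hub tracks
    load_dts: datetime      # load date time stamp (DTS)
    rec_src: str            # record source (REC_SRC)


@dataclass(frozen=True)
class LinkRow:
    link_key: str               # primary key (PK)
    hub_keys: Tuple[str, ...]   # keys of the hubs this link relates
    load_dts: datetime
    rec_src: str


@dataclass(frozen=True)
class SatelliteRow:
    parent_key: str         # key of the one and only parent (hub or link)
    load_dts: datetime      # assumed here to complete the PK with parent_key
    rec_src: str
    attributes: dict        # descriptive data
    load_end_dts: Optional[datetime] = None   # LEDTS; None means "current"


hub = HubRow("k1", "C-1001", datetime(2016, 1, 1), "CRM")
sat = SatelliteRow(hub.hub_key, datetime(2016, 1, 1), "CRM", {"name": "Acme"})
print(sat.parent_key == hub.hub_key, sat.load_end_dts)
```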
15
1. Hub = Business Keys
(C) Kent Graziano
Hubs = Unique Lists of Business Keys
Business Keys are used to TRACK and IDENTIFY key information
New: DV 2.0 uses MD5 Hash of the BK for the PK
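As a rough illustration of the DV 2.0 point above, the hub PK can be derived from the business key with an MD5 hash. The function name hub_hash_key and the trim-and-upper-case normalization are assumptions made for this sketch; the slide does not prescribe a normalization rule.

```python
import hashlib


def hub_hash_key(business_key: str) -> str:
    """Return the MD5 hex digest of a normalized business key."""
    # Assumption: trim and upper-case so that formatting differences
    # between source systems do not produce different hub keys.
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


key_a = hub_hash_key("C-1001")
key_b = hub_hash_key("  c-1001 ")
print(key_a == key_b)  # the same business key yields the same PK
print(len(key_a))      # MD5 hex digests are 32 characters long
```

Because the key is computed from the business key alone, any loader can derive it without looking up a sequence in the hub first.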
16
2: Links = Associations
(C) Kent Graziano
Links = Transactions and Associations
They are used to hook together multiple sets of information
In DV 2.0 the BK attributes may migrate to the Links for faster queries
17
Modeling Links - 1:1 or 1:M?
(C) Kent Graziano
Diagram panels: Today; Tomorrow; With a Link in the Data Vault
Relationship is a 1:1 so why model a Link?
The business rule can change to a 1:M.
You discover new data later.
No need to change the EDW structure.
Existing data is fine.
New data is added.
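The argument above can be sketched with a link held as a unique set of key pairs. The customer and account keys and the load_link helper are hypothetical; the point is only that a 1:1 relationship becoming 1:M adds rows and changes no structure.

```python
# Hypothetical link between a customer hub and an account hub,
# stored as a unique set of (customer key, account key) pairs.
link_customer_account = set()


def load_link(link, customer_key, account_key):
    """Record a relationship; the set keeps each pair only once."""
    link.add((customer_key, account_key))


# Today: the business rule is 1:1.
load_link(link_customer_account, "C-1", "A-10")
load_link(link_customer_account, "C-2", "A-20")

# Tomorrow: the rule changes to 1:M. Existing rows stay as they are
# and the new relationship is simply one more row.
load_link(link_customer_account, "C-1", "A-11")

accounts_for_c1 = sorted(a for c, a in link_customer_account if c == "C-1")
print(accounts_for_c1)
```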
18
3. Satellites = Descriptors
(C) Kent Graziano
Satellites provide context for the Hubs and the Links
They track changes over time, like SCD 2
In DV 2.0 use HASH_DIFF to detect changes
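A minimal sketch of HASH_DIFF-style change detection, assuming the hash is an MD5 over the descriptive attributes concatenated in sorted column order with a "|" delimiter. The delimiter, ordering, NULL handling and function names are assumptions for illustration, not rules stated in the deck.

```python
import hashlib


def hash_diff(attributes: dict) -> str:
    """MD5 over the descriptive attributes of one satellite row."""
    # Assumption: sorted column order makes the hash independent of
    # the order in which the source delivers the columns.
    parts = []
    for name in sorted(attributes):
        value = attributes[name]
        parts.append("" if value is None else str(value).strip())
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


def has_changed(stored_hash: str, incoming: dict) -> bool:
    """True when the incoming row differs from the stored one."""
    return stored_hash != hash_diff(incoming)


stored = hash_diff({"name": "Acme Corp", "city": "Denver"})
same = has_changed(stored, {"city": "Denver", "name": "Acme Corp"})
changed = has_changed(stored, {"city": "Houston", "name": "Acme Corp"})
print(same, changed)
```

Comparing one hash per row avoids comparing every descriptive column when deciding whether to insert a new satellite row.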
19
Data Vault Model Flexibility (Agility)
(C) Kent Graziano
Goes beyond standard 3NF
Hyper normalized
› Hubs and Links only hold keys and meta data
› Satellites split by rate of change and/or source
Enables Agile data modeling
› Easy to add to model without having to change existing structures and load routines
• Relationships (links) can be dropped and created on-demand.
› No more reloading history because of a missed requirement
Based on natural business keys
Not system surrogate keys
Allows for integrating data across functions and source systems more easily
› All data relationships are key driven
20
Data Vault Extensibility
(C) LearnDataVault.com
Adding new components to the EDW has NEAR ZERO impact to:
› Existing Loading Processes
› Existing Data Model
› Existing Reporting & BI Functions
› Existing Source Systems
› Existing Star Schemas and Data Marts
21
Data Vault Productivity
(C) Kent Graziano
› Standardized modeling rules
• Highly repeatable and learnable modeling technique
• Can standardize load routines
o Delta Driven process
o Re-startable, consistent loading patterns.
• Can standardize extract routines
o Rapid build of new or revised Data Marts
• Can be automated
• Can use a BI-meta layer to virtualize the reporting structures
o Example: OBIEE Business Model and Mapping tool
o Example: BOBJ Universe Business Layer
• Can put views on the DV structures as well
o Simulate ODS/3NF or Star Schemas
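The "delta driven" and "re-startable" load pattern listed above might look like the following sketch for a hub. The in-memory dict standing in for the hub table and the load_hub signature are assumptions for illustration; a real loader would run against the database.

```python
from datetime import datetime, timezone


def load_hub(hub: dict, staged_keys, rec_src: str, load_dts=None) -> int:
    """Insert only business keys the hub has not seen; return the count."""
    load_dts = load_dts or datetime.now(timezone.utc)
    inserted = 0
    for business_key in staged_keys:
        if business_key not in hub:  # delta: skip keys already loaded
            hub[business_key] = {"load_dts": load_dts, "rec_src": rec_src}
            inserted += 1
    return inserted


hub_customer = {}
first = load_hub(hub_customer, ["C-1", "C-2", "C-1"], "CRM")
rerun = load_hub(hub_customer, ["C-1", "C-2"], "CRM")  # restart after a failure
delta = load_hub(hub_customer, ["C-2", "C-3"], "ERP")
print(first, rerun, delta)
```

Rerunning the same batch inserts nothing, which is what makes the routine safe to restart, and the same function shape can be reused for every hub.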
22
Data Vault Adaptability
(C) Kent Graziano
› The Data Vault holds granular historical relationships.
• Holds all history for all time, allowing any source system feeds to be reconstructed on-demand
o Easy generation of Audit Trails for data lineage and compliance.
o Data Mining can discover new relationships between elements
o Patterns of change emerge from the historical pictures and linkages.
› The Data Vault can be accessed by power-users
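Because satellites keep every version of a row, the state at any past date can be rebuilt. A small sketch, assuming a satellite held as (load date, name) tuples; the sample rows and the name_as_of helper are made up for illustration.

```python
from datetime import date

# Hypothetical satellite history for one customer: (load date, name).
sat_customer_name = [
    (date(2016, 1, 1), "Acme"),
    (date(2016, 3, 1), "Acme Corp"),
    (date(2016, 9, 1), "Acme Corporation"),
]


def name_as_of(rows, as_of):
    """Return the name that was current on the given date, or None."""
    candidates = [row for row in rows if row[0] <= as_of]
    if not candidates:
        return None  # nothing had been loaded yet on that date
    return max(candidates, key=lambda row: row[0])[1]


mid_may = name_as_of(sat_customer_name, date(2016, 5, 15))
before_load = name_as_of(sat_customer_name, date(2015, 12, 31))
print(mid_may, before_load)
```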
23
Other Benefits of a Data Vault
(C) Kent Graziano
› Modeling it as a DV forces integration of the Business Keys upfront
• Good for organizational alignment
› An integrated data set with raw data extends its value beyond BI:
• Source for data quality projects
• Source for master data
• Source for data mining
• Source for Data as a Service (DaaS) in an SOA (Service Oriented Architecture).
24
Other Benefits of a Data Vault
(C) Kent Graziano
› Upfront Hub integration simplifies the data integration routines required to load data marts.
• Helps divide the work a bit.
› It is much easier to implement security on these granular pieces.
› Granular, re-startable processes enable pin-point failure correction.
› It is designed and optimized for real-time loading in its core architecture (without any tweaks or mods).
25
How to be Agile using DV
(C) Kent Graziano
Model iteratively
› Use Data Vault data modeling technique
› Create basic components, then add over time
Virtualize the Access Layer
› Don’t waste time building facts and dimensions up front
› ETL and testing take too long
› “Project” objects using pattern-based DV model with database views (or BI meta layer)
Users see real reports with real data
› Can always build out for performance in another iteration
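One way to picture the virtualized access layer is a database view that projects a dimension from a hub and its satellite. The sketch below uses SQLite through Python's sqlite3 module; the table, column and view names (hub_customer, sat_customer_name, dim_customer) and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    hub_cust_key TEXT PRIMARY KEY,
    cust_num     TEXT NOT NULL,
    load_dts     TEXT NOT NULL,
    rec_src      TEXT NOT NULL
);
CREATE TABLE sat_customer_name (
    hub_cust_key TEXT NOT NULL,
    load_dts     TEXT NOT NULL,
    load_end_dts TEXT,
    cust_name    TEXT,
    rec_src      TEXT NOT NULL,
    PRIMARY KEY (hub_cust_key, load_dts)
);
-- The "dimension" is only a view: no extra ETL, no extra table.
CREATE VIEW dim_customer AS
SELECT h.cust_num AS customer_number, s.cust_name AS customer_name
FROM hub_customer h
JOIN sat_customer_name s ON s.hub_cust_key = h.hub_cust_key
WHERE s.load_end_dts IS NULL;

INSERT INTO hub_customer VALUES ('k1', 'C-1001', '2016-01-01', 'CRM');
INSERT INTO sat_customer_name
    VALUES ('k1', '2016-01-01', '2016-03-01', 'Acme', 'CRM');
INSERT INTO sat_customer_name
    VALUES ('k1', '2016-03-01', NULL, 'Acme Corp', 'CRM');
""")

dim_rows = conn.execute(
    "SELECT customer_number, customer_name FROM dim_customer"
).fetchall()
print(dim_rows)
```

If the view proves too slow, it can be replaced by a physical table in a later iteration without changing what report writers query.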
26
WHAT IS
THE WORLD'S SMALLEST DATA VAULT?
27
World’s Smallest Data Vault
© LearnDataVault.com
Hub Customer: Hub_Cust_Seq_ID, Hub_Cust_Num, Hub_Cust_Load_DTS, Hub_Cust_Rec_Src
Satellite Customer Name: Hub_Cust_Seq_ID, Sat_Cust_Load_DTS, Sat_Cust_Load_End_DTS, Sat_Cust_Name, Sat_Cust_Rec_Src
› The Data Vault doesn’t have to be “BIG”.
› A Data Vault can be built incrementally.
› Reverse engineering one component of the existing models is not uncommon.
› Building one part of the Data Vault, then changing the marts to feed from that vault is a best practice.
› The smallest Enterprise Data Warehouse consists of two tables:
• One Hub
• One Satellite
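The two-table vault on this slide can be tried out in a few lines. The sketch below uses the column names from the slide with SQLite through Python's sqlite3 module; the data types, constraints and sample rows are assumptions, since the slide shows only the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Hub_Customer (
    Hub_Cust_Seq_ID   INTEGER PRIMARY KEY,
    Hub_Cust_Num      TEXT NOT NULL UNIQUE,   -- the business key
    Hub_Cust_Load_DTS TEXT NOT NULL,
    Hub_Cust_Rec_Src  TEXT NOT NULL
);
CREATE TABLE Sat_Customer_Name (
    Hub_Cust_Seq_ID       INTEGER NOT NULL
        REFERENCES Hub_Customer (Hub_Cust_Seq_ID),
    Sat_Cust_Load_DTS     TEXT NOT NULL,
    Sat_Cust_Load_End_DTS TEXT,
    Sat_Cust_Name         TEXT,
    Sat_Cust_Rec_Src      TEXT NOT NULL,
    PRIMARY KEY (Hub_Cust_Seq_ID, Sat_Cust_Load_DTS)
);
""")

# Made-up sample data: one customer whose name changed once.
conn.execute(
    "INSERT INTO Hub_Customer VALUES (?, ?, ?, ?)",
    (1, "C-1001", "2016-01-01", "CRM"),
)
conn.executemany(
    "INSERT INTO Sat_Customer_Name VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2016-01-01", "2016-03-01", "Acme", "CRM"),
        (1, "2016-03-01", None, "Acme Corp", "CRM"),
    ],
)

history = conn.execute(
    "SELECT Sat_Cust_Name FROM Sat_Customer_Name "
    "WHERE Hub_Cust_Seq_ID = 1 ORDER BY Sat_Cust_Load_DTS"
).fetchall()
print(history)
```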
28
Notably…
› In 2008 Bill Inmon stated that the “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0)
› The number of Data Vault users in the US surpassed 500 in 2010 and grows rapidly (http://danlinstedt.com/about/dv-customers/)
29
Organizations using Data Vault
› WebMD Health Services
› Anthem Blue-Cross Blue Shield
› MD Anderson Cancer Center
› Denver Public Schools
› Independent Purchasing Cooperative (IPC, Miami)
• Purchasing cooperative owned by Subway franchisees
› Kaplan
› US Defense Department
› Colorado Springs Utilities
› State Court of Wyoming
› Federal Express
› US Dept. Of Agriculture
30
What’s New in DV2.0?
© LearnDataVault.com
Modeling Structure Includes…
› NoSQL, and Non-Relational DB systems, Hybrid Systems
› Minor Structure Changes to support NoSQL
New ETL Implementation Standards
› For true real-time support
› For NoSQL support
New Architecture Standards
› To include support for NoSQL data management systems
New Methodology Components
› Including CMMI, Six Sigma, and TQM
› Including Project Planning, Tracking, and Oversight
› Agile Delivery Mechanisms
› Standards, and templates for Projects
31
What’s New in DV2.0?
This model is fully compliant with Hadoop and needs NO changes to work properly
Note: Business Keys replicated to the Link structure for “join” capabilities on the way out to Data Marts.
© LearnDataVault.com
32
Summary
Data Vault provides a data modeling technique that allows:
01 Model Agility
› Enabling rapid changes and additions
02 Productivity
› Enabling low complexity systems with high value output at a rapid pace
03 So? Agile Data Warehousing?
› Easy projections of dimensional models
33
Shameless Plug:
› Available on Amazon:
http://www.amazon.com/Better-Data-Modeling-Introduction-Engineering-ebook/dp/B018BREV1C/
34
› Available on Amazon.com
› Soft Cover or Kindle Format
› Now also available in PDF at LearnDataVault.com
› Hint: Kent is the Technical Editor
Super Charge Your Data Warehouse
35
› Available on Amazon:
http://www.amazon.com/Building-Scalable-Data-Warehouse-Vault/dp/0128025107/
New DV 2.0 Book
36
Register at wwdvc.com
37
Data Vault References
www.youtube.com/LearnDataVault
www.facebook.com/learndatavault
www.learndatavault.com
www.danlinstedt.com
38
QUESTIONS?
39
Contact Information
KENT GRAZIANO
Snowflake Computing
www.snowflake.net
@KentGraziano
http://kentgraziano.com