Data Dictionary based Testing (DDbT) for Business Intelligence Applications
Author(s):
Narendra Parihar ([email protected]), BIE US, MSIT
Anandam Sarcar ([email protected]), BIE India, MSIT
Disclaimer: The authors have used pseudo code, examples, and tools for illustration purposes only.
Abstract
This paper presents a new test methodology that can help us test Business Intelligence (BI) or Data Warehouse (DW) applications with more coverage, a more data-centric approach, and less time. The horizons to explore with this idea are endless.
Assume that we want to test a Business Intelligence (BI) or Data Warehouse (DW) application. We normally start by thinking it is going to be black-box testing, data quality testing, or both.
Our test plans focus on making sure the data is correct, the functionality is working, jobs are completing, and so on. How do we achieve that? Normally using black-box techniques. This is where the twist lies.
Do we really test a BI/DW using black-box techniques? Theoretically the answer may be yes, but practically no. What we need is essentially a data verification testing methodology for BI/DW applications. This is what we call, in more technical terms, the 'Data Dictionary based Testing framework': "A test framework wherein the data dictionary is leveraged to describe each and every element of the data residing in database server objects, the relationships within the data, and the data lineage between the elements, which in turn, when rationalized, helps us test the data quality and functionality of the application."
Problems with conventional BI/DW test strategies
o Black-box, manually executed test methods are time consuming, and test coverage of the application under test is difficult to determine.
o Repeated Data Quality (DQ) checks in BI/DW testing.
o Automation at the database-object level and for transformations is difficult.
o Most tests do not cover all objects and apply DQ tests upon them.
o Most tests focus on functional scenarios and pay less attention to the various referential checks, which can eventually break the system down the road.
o A large portion of BI/DW testing happens after the required ETL jobs are complete: aren't we losing those long hours waiting for jobs to finish?
o Limited or no data-growth checks in test.
Why DDbT?
Details on DDbT
Fundamental thought process behind this approach:
Understand the data, its types, and its relationships first, before preparing test designs, automation, or DQ checks.
Map the data, data elements, relationships, and lineage in metadata tables.
Use the derived metadata for testing the data inside the application.
Let's assume the following simple BI/DW application, to make things concrete:
Source → Staging → Data Mart → Cube
What is DDbT?
Let us say there are only two tables in the source (to keep things understandable), with the sample data below.
Note that the Order table has a foreign key referencing the ProductID column of the Product table.
Product:
ProductID  Name             UnitPrice
1          Mango            100
2          Orange           0
3          Apple            100
4          Grape            10
5          Junkproductname  -100

Product column types:
Column_name  Type
ProductID    int
Name         char
UnitPrice    int
Order:
OrderID  ProductID  OrderQty
1        1          5
2        1          5
3        2          -3
4        3          5
5        4          100
6        4          100
7        5          0

Order column types:
Column_name  Type
OrderID      int
ProductID    int
OrderQty     int
In the target Staging server, we simply pull tables from the Source, append surrogate keys such as ProductKeyID and OrderKeyID, add information needed for BI/DW such as RecordInsertTime and RecordModifiedDate, and implement delta logic based on ProductID.
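As an illustrative sketch of this staging step (in Python rather than the T-SQL an ETL would use; here we assume, for illustration only, that "delta" means ProductIDs not yet present in Staging, and the function and column names are hypothetical):

```python
from datetime import datetime, timezone

def stage_delta(source_rows, staged, next_key):
    """Append new source rows to staging, adding a surrogate key
    (ProductKeyID) and audit columns (RecordInsertTime/RecordModifiedDate).
    Delta logic (assumed): a row is 'new' if its ProductID is not staged yet."""
    staged_ids = {r["ProductID"] for r in staged}
    now = datetime.now(timezone.utc)
    for row in source_rows:
        if row["ProductID"] in staged_ids:
            continue  # already staged; the delta logic skips it
        staged.append({
            "ProductKeyID": next_key,        # surrogate key assigned in staging
            "ProductID": row["ProductID"],   # natural key carried from source
            "Name": row["Name"],
            "UnitPrice": row["UnitPrice"],
            "RecordInsertTime": now,
            "RecordModifiedDate": now,
        })
        next_key += 1
    return staged
```

Running it twice with an overlapping source shows the delta behavior: already-staged ProductIDs are skipped, and only genuinely new rows receive a surrogate key.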
Now, once data is present in the Staging server, we have a SalesFact table, which is the fact table in our Data Mart, as below:
Salesfact
ProductKey OrderQuantity UnitPrice SalesAmount
11 10 100 1000
22 -3 0 0
33 5 100 500
44 200 10 2000
55 0 -100 0
Let's get on with DDbT on the sample app. So we have a simple enough BI/DW app; now let's assume some reports are built once data moves from the data mart into Analysis Services cubes, and from the cubes into reports.
Now think of the scenario where this is a V1 project: you would then have to create your data dictionary on your own, as you cannot reverse-engineer it at that point in time.
So, let's get to the meat of the topic, the exact steps to perform DDbT:
1. Create a data dictionary for the different data sources; the idea is to organize the dictionaries.
2. Create a generic table whose columns drive your ETL logic, e.g. last-modified date, product IDs, etc. in the case of delta ETLs; for a full ETL, table/column names alone may be sufficient.
3. Create a stored procedure which performs an object-existence BVT against the data dictionary in one shot. This SP can always be called post-installation, and your object BVT is automated.
4. Create a stored procedure which takes parameters from the table created in Step 2 and compares them with the pulled data; this is the Dev logic of pulling data vs. the Test logic. We can run this stored procedure after one ETL run in test, to check automatically in every release.
5. Create a stored procedure which performs data-growth monitoring once one round of ETL is complete in test, to check whether data in some table has suddenly dropped or grown due to release changes. It runs in one shot and shows all tables where data has changed, at the table level, by more than X percent against the counts before the ETL run.
6. Create a stored procedure which performs DQ checks (and also identifies data patterns for certain columns, as configured by the user) automatically for you, by looping through each parameter in the Step 2 table and putting the results into a different table.
7. Create a metadata lineage framework to test dimensions and facts (explained in more detail, with sub-steps, in a subsequent section).
How to do DDbT?
New topology of the sample app with DDbT:
Source → Staging → Data Mart → Cube, plus a DDbT database.
The DDbT database holds the data dictionaries, test logic, stored procedures, data lineage information, and test results.
Step 1: Create a data dictionary for the different data sources; the idea is to organize the dictionaries.

Schema  Table      Description
dbo     Order
dbo     Product
dbo     SalesFact
Schema  Table      Number  Column         Datatype  Size  Nullable  InPrimaryKey  IsForeignKey  Description
dbo     Order      1       OrderID        Int       4     N         Y             N
dbo     Order      2       ProductID     Int       4     N         N             Y
dbo     Order      3       OrderQty       Int       4     N         N             N
dbo     Product    1       ProductID      Int       4     N         Y             N
dbo     Product    2       Name           Char(20)  20    N         N             N
dbo     Product    3       UnitPrice      Int       4     N         N             N
dbo     SalesFact  1       ProductKey     Int       4     N         N             N
dbo     SalesFact  2       OrderQuantity  SmallInt  2     Y         N             N
dbo     SalesFact  3       UnitPrice      Money     8     Y         N             N
dbo     SalesFact  4       SalesAmount    Money     8     Y         N             N
Above is a simple data dictionary created using the DCT tool at www.CodePlex.com. Notice that it does not contain descriptions for the columns/tables; the ideal assumption is that they would be present for a better understanding of the objects. In our sample app they are self-explanatory.
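Once the dictionary exists, it can itself be sanity-checked programmatically before driving any tests. Below is a minimal illustrative sketch (Python, with the dictionary rows hard-coded from the table above; in practice a stored procedure would read them from the DDbT database) that verifies every foreign-key column has a same-named primary-key column in some other table:

```python
# Data dictionary rows from the table above:
# (schema, table, column, datatype, in_primary_key, is_foreign_key)
DICTIONARY = [
    ("dbo", "Order",     "OrderID",       "Int",      True,  False),
    ("dbo", "Order",     "ProductID",     "Int",      False, True),
    ("dbo", "Order",     "OrderQty",      "Int",      False, False),
    ("dbo", "Product",   "ProductID",     "Int",      True,  False),
    ("dbo", "Product",   "Name",          "Char(20)", False, False),
    ("dbo", "Product",   "UnitPrice",     "Int",      False, False),
    ("dbo", "SalesFact", "ProductKey",    "Int",      False, False),
    ("dbo", "SalesFact", "OrderQuantity", "SmallInt", False, False),
    ("dbo", "SalesFact", "UnitPrice",     "Money",    False, False),
    ("dbo", "SalesFact", "SalesAmount",   "Money",    False, False),
]

def dangling_foreign_keys(rows):
    """Return FK columns that have no same-named PK column in another table."""
    pk_cols = {(t, c) for (_, t, c, _, pk, _) in rows if pk}
    return [
        (table, col)
        for (_, table, col, _, _, fk) in rows
        if fk and not any(c == col and t != table for (t, c) in pk_cols)
    ]
```

In the sample dictionary, Order.ProductID resolves to the Product.ProductID primary key, so no dangling references are reported.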
Step 2: Create a generic table whose columns drive your ETL logic, e.g. last-modified date, product IDs, etc. in the case of delta ETLs; for a full ETL, table/column names alone may be sufficient.

-- Creating the generic table below for our sample application
CREATE TABLE ddbt_tw11 (
    sourceserver        CHAR(40) NOT NULL,
    sourcedatabase      CHAR(40) NOT NULL,
    sourcetable         CHAR(40) NOT NULL,
    destinationserver   CHAR(40) NOT NULL,
    destinationdatabase CHAR(40) NOT NULL,
    destinationtable    CHAR(40) NOT NULL,
    deltacolumn         CHAR(40) NULL,
    rulestring          VARCHAR(MAX) -- If other transformations exist, input them here so that you can form dynamic SQL later to run the test
)

INSERT INTO ddbt_tw11
VALUES ('SalesOLTP', 'SalesOLTPDB', 'Product', 'TW11Server', 'TW11Staging', 'Product', 'Productid', NULL)

SELECT * FROM ddbt_tw11
SourceServer  SourceDatabase  SourceTable  DestinationServer  DestinationDatabase  DestinationTable  DeltaColumn  RuleString
SalesOLTP     SalesOLTPDB     Product      TW11Server         TW11Staging          Product           Productid    NULL
Step 3: Create a stored procedure which performs an object-existence BVT against the data dictionary in one shot. This SP can always be called post-installation, and your object BVT is automated.
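The object-existence BVT can be driven directly off the dictionary. A minimal illustrative sketch (Python generating the T-SQL batch the stored procedure would execute; OBJECT_ID is SQL Server's standard function for resolving an object name, and the message text is hypothetical):

```python
def existence_bvt_sql(dictionary_tables):
    """Build one T-SQL batch that reports every dictionary table missing
    from the database. dictionary_tables: iterable of (schema, table)."""
    checks = [
        f"IF OBJECT_ID('{schema}.{table}', 'U') IS NULL "
        f"PRINT 'BVT FAIL: missing table {schema}.{table}'"
        for schema, table in dictionary_tables
    ]
    return "\n".join(checks)

# One check line per dictionary entry, runnable in one shot post-install:
batch = existence_bvt_sql([("dbo", "Order"), ("dbo", "Product"), ("dbo", "SalesFact")])
```

Because the check list comes straight from the Step 1 dictionary, adding a table to the dictionary automatically extends the BVT.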
Step 4: Create a stored procedure which takes parameters from the table created in Step 2 and compares them with the pulled data; this is the Dev logic of pulling data vs. the Test logic. We can run this stored procedure after one ETL run in test, to check automatically in every release. Think of it as a wrapper SP which holds all your test logic and runs the counter-Dev-logic tests.

Step 5: Create a stored procedure which performs data-growth monitoring once one round of ETL is complete in test, to check whether data in some table has suddenly dropped or grown due to release changes. It runs in one shot and shows all tables where data has changed, at the table level, by more than X percent against the counts before the ETL run. The data dictionary from Step 1 makes this easy to implement.
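The Step 5 growth monitor reduces to comparing per-table row counts captured before and after the ETL run. A minimal illustrative sketch (Python; in practice the counts would be captured by the dictionary-driven stored procedure, and the function name is hypothetical):

```python
def tables_with_abnormal_growth(counts_before, counts_after, threshold_pct):
    """Return (table, pct_change) for every table whose row count moved
    by more than threshold_pct between the two snapshots."""
    flagged = []
    for table, before in counts_before.items():
        after = counts_after.get(table, 0)
        if before == 0:
            # Rows appearing in a previously empty table count as a change
            pct = float("inf") if after > 0 else 0.0
        else:
            pct = abs(after - before) * 100.0 / before
        if pct > threshold_pct:
            flagged.append((table, pct))
    return flagged
```

For example, if Product jumps from 100 to 250 rows while Order only drifts from 200 to 210, a 20% threshold flags Product alone, surfacing the suspicious table in one shot.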
T-SQL is essential.
Step 6: Create a stored procedure which performs DQ checks (and also identifies data patterns for certain columns, as configured by the user) automatically for you, by looping through each parameter in the Step 2 table and putting the results into a different table.
An example of simple DQ checks:

DECLARE @srcTable   SYSNAME = 'tblCustomer'
DECLARE @tgtTable   SYSNAME = 'tblCustomerTgt'
DECLARE @pksrcTable SYSNAME = 'intID'
DECLARE @pktgtTable SYSNAME = 'intID'
DECLARE @SQL NVARCHAR(MAX)

-- To find the differences: missing records
SET @SQL = 'SELECT * FROM ' + @srcTable + ' Src' +
           ' FULL OUTER JOIN ' + @tgtTable + ' Tgt' +
           ' ON Src.' + @pksrcTable + ' = Tgt.' + @pktgtTable +
           ' WHERE Src.' + @pksrcTable + ' IS NULL OR Tgt.' + @pktgtTable + ' IS NULL'
SELECT @SQL
EXECUTE sp_executesql @SQL

-- To find the differences: mismatched records
DECLARE @DIFFSQL NVARCHAR(MAX)
SET @DIFFSQL = 'SELECT * FROM tblCustomer Src
                FULL OUTER JOIN tblCustomerTgt Tgt
                ON Src.intID = Tgt.intID
                WHERE Src.txtName <> Tgt.txtName'
EXECUTE sp_executesql @DIFFSQL
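The same dynamic SQL can also be assembled outside T-SQL, which is handy for unit-testing the generator itself. An illustrative Python sketch producing the same missing-record query that the @SQL variable above holds (function name hypothetical):

```python
def missing_records_sql(src_table, tgt_table, src_pk, tgt_pk):
    """Build the full-outer-join query that surfaces rows present on only
    one side: a NULL key on either side means a missing record."""
    return (
        f"SELECT * FROM {src_table} Src "
        f"FULL OUTER JOIN {tgt_table} Tgt "
        f"ON Src.{src_pk} = Tgt.{tgt_pk} "
        f"WHERE Src.{src_pk} IS NULL OR Tgt.{tgt_pk} IS NULL"
    )
```

Looping this generator over the Step 2 table yields one DQ query per source/target pair, exactly the pattern the Step 6 stored procedure automates.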
Step 7: Create Metadata Lineage Framework to Test Dimension and Facts (also Intermediate Tables)
Typical OLTP data in Staging:

Product:
ProductID  Name    UnitPrice
1          Mango   100
2          Orange  200

Customer:
CustomerID  CustomerName
1           Anandam
2           Narendra

Order:
OrderID  ProductID  CustomerID  OrderQty
100      1          1           5
200      2          2           6

Data in the dimensional model:

Dimensions:

DimProduct:
ProductKeyID  Name    UnitPrice
11            Mango   100
22            Orange  200

DimCustomer:
CustomerKeyID  CustomerName
55             Anandam
66             Narendra

SalesFact:
ProductKeyID  CustomerKeyID  OrderQty  UnitPrice  SalesAmount
11            55             5         100        500
22            66             6         200        1200
Step 7-a: For dimensions, map each and every column in the table (except the surrogate key) and create the data lineage. If there are intermediate tables, create the data lineage for them as well. Populate tblDataLineageDimensions for the example above:

tblStagingName | StagingColumnName | tblDimensionName | DimensionColumnName | Free-text SQL (for filters, concatenation, or any logic) at Staging | Criteria for uniquely identifying the row in Staging (normally the source natural key) | Criteria for uniquely identifying the row in the Dimension (normally the source natural key) | Free-text SQL (for filters, concatenation, or any logic) at the Dimension
Product | Name | DimProduct | Name | .. | Name | Name | ..
Product | Unit Price | DimProduct | Unit Price | .. | Name | Name | ..
Step 7-b: Run a stored procedure that compares values against tblDataLineageDimensions for a set of rows, by forming dynamic SQL based on the Staging and Dimension columns. If the values match, DQ is good; otherwise, we flag the suspect record.
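In-memory, this comparison amounts to joining staging and dimension rows on their respective unique-row criteria and flagging value mismatches. An illustrative Python sketch of the logic the dynamic SQL would implement, using the Product → DimProduct lineage from the example (function name hypothetical):

```python
def suspect_records(staging_rows, dim_rows, staging_col, dim_col,
                    staging_key, dim_key):
    """Flag rows whose staging value differs from the dimension value.
    Rows are matched on their unique-row criteria (the natural key)."""
    dim_by_key = {row[dim_key]: row for row in dim_rows}
    suspects = []
    for row in staging_rows:
        match = dim_by_key.get(row[staging_key])
        if match is None or row[staging_col] != match[dim_col]:
            suspects.append(row[staging_key])
    return suspects
```

For the UnitPrice lineage, matching on Name (the natural key), any dimension row whose UnitPrice drifted from staging is returned as a suspect.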
Step 7-c: Populate tblDataLineageFacts:

tblStagingName | StagingColumnName | tblFactName | FactColumnName | Free-text SQL (for filters, concatenation, or any logic) at Staging | Criteria for uniquely identifying the row in Staging (normally the natural key) | Criteria for uniquely identifying the row in the Fact (normally the composite PK) | Free-text SQL (for filters, concatenation, or any logic) at the Fact
Order | OrderQty | SalesFact | OrderQty | JOIN tblProduct -> ProductID, JOIN tblCustomer -> CustomerID | ProductID, CustomerID | ProductKeyID, CustomerKeyID | JOIN tblDimProduct -> ProductKeyID, JOIN tblDimCustomer -> CustomerKeyID
Product | Unit Price | SalesFact | Unit Price | .. | .. | .. | ..
Step 7-d: For computed columns, which do not exist as such in the source DB, here is the process:

tblStagingName (list all tables which help derive the computed fact) | StagingColumnName | tblFactName | FactColumnName | Free-text SQL (for filters, concatenation, or any logic) at Staging | Criteria for uniquely identifying the row in Staging (normally the natural key) | Criteria for uniquely identifying the row in the Fact (normally the composite PK) | Free-text SQL (for filters, concatenation, or any logic) at the Fact
Order, Product | OrderQty, UnitPrice | SalesFact | SalesAmount | JOIN tblProduct -> ProductID, JOIN tblCustomer -> CustomerID | ProductID, CustomerID | ProductKeyID, CustomerKeyID | JOIN tblDimProduct -> ProductKeyID, JOIN tblDimCustomer -> CustomerKeyID; SalesAmount = (Order.OrderQty * Product.UnitPrice)
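The computed-column rule SalesAmount = Order.OrderQty * Product.UnitPrice can be replayed row by row against the fact table. An illustrative Python sketch, using the SalesFact rows from the dimensional-model example (function name hypothetical):

```python
def salesamount_violations(fact_rows):
    """Return keys of fact rows where SalesAmount != OrderQty * UnitPrice,
    i.e. rows where the computed-column rule was not applied correctly."""
    return [
        (r["ProductKeyID"], r["CustomerKeyID"])
        for r in fact_rows
        if r["SalesAmount"] != r["OrderQty"] * r["UnitPrice"]
    ]

# SalesFact rows from the example: 5*100=500 and 6*200=1200, both correct.
FACT = [
    {"ProductKeyID": 11, "CustomerKeyID": 55, "OrderQty": 5,
     "UnitPrice": 100, "SalesAmount": 500},
    {"ProductKeyID": 22, "CustomerKeyID": 66, "OrderQty": 6,
     "UnitPrice": 200, "SalesAmount": 1200},
]
```

A stored procedure built from tblDataLineageFacts would perform the same check with dynamic SQL, flagging any (ProductKeyID, CustomerKeyID) whose stored amount disagrees with the recomputed one.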
Step 7-e: Run a stored procedure that compares values against tblDataLineageFacts for a set of rows, by forming dynamic SQL based on the Staging and Fact columns. If the values match, DQ is good; otherwise, we flag the suspect record.
Data Dictionary Lifecycle: when does it start and end?

Business
• Creates the data dictionary for the business process, e.g. data in flat files and Excel sheets, and a basic mapping of subject areas, entities, and rules.

SD/PM
• Maps the data dictionary of the processes to subject areas.
• Creates logical data models, maintaining the relationships and data definitions.

Dev
• Creates the physical data model and data flow diagram.
• Documents the metadata for each data dictionary element's flow.

QA
• Verifies, utilizes, and builds upon the metadata created by the Dev team to map it to the currently proposed metadata framework, and enhances it further for DDbT.
QA duties at each phase:
• Review for completeness, relationships, and business rules.
• Review for logical data entities and relationships, exceptions, rules, definitions, and hierarchy.
• Review for metadata, rules, data flow, and the order of data flow rules.
• Test design using DDbT, DQ checks, execution using the DDbT steps, and refinement of the dictionary.
One Version of Truth
DDbT in Scenario Focused Engineering
One of the challenges in adopting Scenario Focused Test Design (SFTD) for BI/DW applications is how to design data flow / data quality test cases using the SFTD-recommended steps. Many times, use cases for data flows are generic and very high level, which does not really help a tester design SFTD for data flows and DQ checks at the object level.
DDbT can be made part of a QA strategy that uses SFTD, and can help drive SFTD adoption. Below is the proposed way to make DDbT part of SFTD:
Understand Use cases
Understand Data Dictionaries provided by Business / SD / Dev
Validate Mappings and rules in Test Data Dictionary
Design data flows from the data dictionary (already covered in Steps 1-7 of DDbT) [How about a GUI to do so?]
Design data flows in SFTD vision using pre-defined 6 controls
Color code Positive and Negative data test cases
Number the flows
Generate pseudo code for automation
Finally, test more DQ scenarios with the SFTD/DDbT combination. The value is that QA teams can understand the entire data flow while doing DDbT along with SFTD. This can even prompt program managers and developers to maintain their own data dictionaries and move more quality upstream.
Big Wins with Data Dictionary Testing
o With an increasing number of sources in the DW, complexity is abstracted to a minimum from the DW tester's perspective.
o Proactively identify DQ/performance issues as soon as data patterns become abnormal.
o More test coverage; rather, the test team can be sure of what they have tested and what they have not!
o BVTs can be fully automated.
o The regression suite is a good candidate for automation with DDbT.
o After the one-time metadata setup, a drastic reduction in testing effort and time for subsequent releases of the project.
o Know the data well, more specifically the application's domain data!
Limitations
DDbT requires a one-time setup effort.
Unless the test team is well versed with the application, implementing DDbT may be a challenge; however, following this methodology helps in knowing the application and the data better.
Rework is anticipated if the dictionary is not correct!
There is an ongoing maintenance effort.
We haven't tried this on a project yet!
Closing thoughts: are we ready to implement the DDbT process for BI/DW testing?
One common definition of the data dictionary will solve many problems.
Contact [email protected] and [email protected]