Data Validation Framework for challenging business environment...
Transcript of Data Validation Framework for challenging business environment...
Prasenjit Ghosh. DirectorBalram Mishra. Project Manager
Abhisek Mohanty. Project ManagerDevipriya Selvaraju. Technology Lead
Infosys Limited
Logo of your organization
1
Data Validation Framework for challenging business environment in modern era
2
Abstract
In today's IT industry, Data Analytics plays a crucial role in solving various business challenges. The insights are generated from diverse and massive data, accumulated from variety of sources having varied origins. Predominantly structured and unstructured data are collaborated from the source to a data lake. From the data lake, the required insights are generated using various modern day User Interface technologies.
Fundamentally the challenges reside profoundly in the core i.e. quality of the base data post migrations and transformations, and essentially has “huge data volume” and “diverse data sources” at the root. Therefore, it is of utmost importance to have a comprehensive data validation framework which can address the above challenges and also should be flexible to be plugged in for various functions like developer self-test, independent QA, load testing etc.
Our attempt is to present a practitioner's view based on a real time project challenge and the solution framework implemented using an open source readily available framework. The early benefits from the usage of this solution are encouraging. Also the solution has the potential to be enhanced / leveraged further depending on the context specific needs.
3
Challenges in evaluating data quality post migration/transformation
Developer Tester
QA
Variety in Data Sources –
Oracle, Mongo DB, CSV
etc.
Type of Data
Transformation – One
Time Migrations as well as
Incremental Updates.
Inability to identify
missing records.
Inability to validate data at
attribute level.
Challenges in tracking
records over incremental
migration.
Manual Testing - effort
consuming, error prone.
Need of automated reporting -
Summary, Details, and Trend
Analysis.
AutomationProcess
Easy to useTool
4
Addressing the Challenges
5
Practitioner’s View: Case Study : Real Time Problem Statement
# Problem Statement Description
1 Mismatch Count Mismatch in the record count between Source and Target
2 Missing Record Set Record drops in migration/transformation
3 Attribute Mismatch Mismatch in the attributes in the records
4 Incremental Validation Issues with Incremental data migration from Source to Target.
6
Practitioner’s View: Case Study : Solution Approach
7
Practitioner’s View: Case Study : Result (1/3)
Current Pain Points SolutionTedious way to validate the record countbetween source and target.
Precise count difference between source and Mongo.
Inability to identify missing records. Identification of missing records
Inability to validate data at attribute level. Identification of attribute level mismatch
Challenges in tracking records over incremental migration.
Incremental validation for count/data missing/attribute mismatch.
Late evaluation of final result.(Shift- Left)Validation by the developer himself.
Apart from addressing the above pain points, this solution has capability for
• CSV and Mongo data Comparison• Mongo and Elastic Comparison• Multiple data sources (ex. Oracle, Mongo and Elastic) comparison in one go.
8
Developers use this tool for validating the data transfer/transformation accuracy between Oracle and Mongo. The benefits realized are-
• Quick visibility of mismatch on huge volume of data• Visibility on data latency via comparison of source and target• Helps performance tune the data stream flow• Decouple need of data quality check as part of migration. There by enabling focus and
faster turnaround time to deliver large migration
Data Entity Use Case Record CountQuery
Execution time
Accuracy % Before using Apache Drill
After using Apache Drill
User RoleIdentification of Missing Records
6,387,608 18 mins
85% 98%Products Master 1,338,572 8 mins
Item Master 4,546,279 10 mins
Practitioner’s View: Case Study : Result (2/3)
9
With the capability of Apache drill to integrate with reporting tool (tableau in our case) we are able to get ready dashboard on required dimensions like overall Summary, Trends over time etc.With Trend graphs we get the insights like
• Increase in data mismatch with increase in data inflow.• Increase in latency with increase in data inflow.• Increase in data drops with increase in data inflow.• Data inflow spikes during month end, quarter end and year end.
Practitioner’s View: Case Study : Result (3/3)
10
Practitioner’s View: Case Study : Additional Points
• Distributed query optimization and execution
• Columnar execution
• Runtime compilation and code generation:
• Vectorization
• Optimistic/pipelined execution
Achieving Performance
Achieving Security
• Authentication
• Encryption
• Impersonation
• Authorization
11
References & Appendix
• https://en.wikipedia.org/wiki/Apache_Drill• www.ijmer.com/papers/Vol3_Issue1/DX31599603.pdf• https://www.infosys.com/it-services/validation...papers/.../data-quality-
migration.pdf
12
Question & Answers
13
Thank You!!!