Data Validation Framework for challenging business environment...

Prasenjit Ghosh, Director
Balram Mishra, Project Manager
Abhisek Mohanty, Project Manager
Devipriya Selvaraju, Technology Lead

Infosys Limited



Data Validation Framework for challenging business environment in modern era


Abstract

In today's IT industry, data analytics plays a crucial role in solving business challenges. Insights are generated from massive, diverse data accumulated from a variety of sources. Typically, structured and unstructured data are consolidated from these sources into a data lake, and the required insights are then generated from the data lake using modern user interface technologies.

The core challenge lies in the quality of the base data after migrations and transformations, and its root causes are huge data volume and diverse data sources. It is therefore essential to have a comprehensive data validation framework that addresses these challenges and is flexible enough to be plugged into various functions such as developer self-test, independent QA, and load testing.

We present a practitioner's view based on a challenge from a real project and the solution framework implemented using a readily available open source framework. The early benefits from using this solution are encouraging, and the solution can be enhanced or leveraged further depending on context-specific needs.


Challenges in evaluating data quality post migration/transformation

Developer, Tester, QA

• Variety in data sources: Oracle, MongoDB, CSV, etc.
• Type of data transformation: one-time migrations as well as incremental updates.
• Inability to identify missing records.
• Inability to validate data at attribute level.
• Challenges in tracking records over incremental migration.
• Manual testing: effort consuming, error prone.
• Need for automated reporting: summary, details, and trend analysis.

Automation Process, Easy to use Tool


Addressing the Challenges


Practitioner’s View: Case Study : Real Time Problem Statement

# | Problem Statement | Description
1 | Mismatch Count | Mismatch in the record count between Source and Target
2 | Missing Record Set | Record drops in migration/transformation
3 | Attribute Mismatch | Mismatch in the attributes in the records
4 | Incremental Validation | Issues with incremental data migration from Source to Target
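
The first three problem statements above can be illustrated with a minimal sketch. It assumes both sides have already been read into lists of dictionaries that share a primary-key field; the function name `validate` and the field name `id` are hypothetical and are not taken from the project.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def validate(source: List[Record], target: List[Record], key: str = "id") -> Dict[str, Any]:
    """Compare source and target records (hypothetical helper, illustration only)."""
    # Index both sides by the assumed primary-key field.
    src = {rec[key]: rec for rec in source}
    tgt = {rec[key]: rec for rec in target}

    # 1. Mismatch count: difference in record counts.
    count_diff = len(source) - len(target)

    # 2. Missing record set: keys present in source but dropped in target.
    missing = sorted(k for k in src if k not in tgt)

    # 3. Attribute mismatch: same key on both sides, different field values.
    mismatches = []
    for k in sorted(src.keys() & tgt.keys()):
        for field, value in src[k].items():
            if tgt[k].get(field) != value:
                mismatches.append((k, field, value, tgt[k].get(field)))

    return {"count_diff": count_diff, "missing": missing, "mismatches": mismatches}
```

At the volumes discussed in this deck, the comparison would more likely run inside a query engine than in application memory; the sketch only shows what each check computes.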


Practitioner’s View: Case Study : Solution Approach


Practitioner’s View: Case Study : Result (1/3)

Current Pain Points | Solution
Tedious way to validate the record count between source and target. | Precise count difference between source and Mongo.
Inability to identify missing records. | Identification of missing records.
Inability to validate data at attribute level. | Identification of attribute level mismatch.
Challenges in tracking records over incremental migration. | Incremental validation for count/data missing/attribute mismatch.
Late evaluation of final result. | (Shift-Left) Validation by the developer himself.

Apart from addressing the above pain points, this solution has the capability for:

• CSV and Mongo data comparison
• Mongo and Elastic comparison
• Comparison of multiple data sources (e.g. Oracle, Mongo and Elastic) in one go
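
The deck does not describe how incremental validation is implemented. One common approach, shown here purely as an assumption, is a watermark: only source records changed since the last validated timestamp are checked against the target. The function name `incremental_check` and the fields `id` and `updated_at` are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List

Record = Dict[str, Any]


def incremental_check(source: List[Record], target: List[Record], since: datetime,
                      key: str = "id", ts_field: str = "updated_at") -> Dict[str, Any]:
    """Validate only records changed after `since` (assumed watermark approach)."""
    tgt = {rec[key]: rec for rec in target}

    # Restrict the check to the incremental slice of the source.
    changed = [rec for rec in source if rec[ts_field] > since]

    # Changed records that never reached the target.
    missing = [rec[key] for rec in changed if rec[key] not in tgt]

    # Changed records whose target copy is older than the source copy.
    stale = [rec[key] for rec in changed
             if rec[key] in tgt and tgt[rec[key]][ts_field] < rec[ts_field]]

    # The next run starts from the newest timestamp seen in this run.
    watermark = max((rec[ts_field] for rec in changed), default=since)

    return {"checked": len(changed), "missing": missing, "stale": stale, "watermark": watermark}
```

Storing the returned `watermark` between runs keeps each validation pass proportional to the size of the increment rather than the full data set.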


Developers use this tool to validate the accuracy of data transfer and transformation between Oracle and Mongo. The benefits realized are:

• Quick visibility of mismatches on huge volumes of data
• Visibility into data latency via comparison of source and target
• Helps performance-tune the data stream flow
• Decouples the data quality check from the migration itself, enabling focus and faster turnaround when delivering large migrations

Use Case: Identification of Missing Records
Accuracy %: 85% before using Apache Drill, 98% after using Apache Drill

Data Entity | Record Count | Query Execution Time
User Role | 6,387,608 | 18 mins
Products Master | 1,338,572 | 8 mins
Item Master | 4,546,279 | 10 mins
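
Apache Drill can query different storage plugins in a single SQL statement, so the missing-record check in the table above can be expressed as a cross-source anti-join. The sketch below only builds the query text; the plugin, schema, table and column names in the usage note are hypothetical and depend on how the storage plugins are configured. The actual queries used in the project are not shown in the deck.

```python
def count_query(table: str) -> str:
    """Build a record-count query for one side of the comparison."""
    return f"SELECT COUNT(*) AS cnt FROM {table}"


def missing_records_query(source_table: str, target_table: str, key: str) -> str:
    """Build an anti-join that lists source keys with no match in the target."""
    return (
        f"SELECT s.{key} "
        f"FROM {source_table} s "
        f"LEFT JOIN {target_table} t ON s.{key} = t.{key} "
        f"WHERE t.{key} IS NULL"
    )
```

For example, `missing_records_query("oracle.HR.USER_ROLE", "mongo.appdb.user_role", "role_id")` yields a statement that could be submitted to Drill through whichever client is available. Key columns may need explicit casts when the two sources store them with different types.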



With Apache Drill's ability to integrate with a reporting tool (Tableau in our case), we get ready-made dashboards on required dimensions such as overall summary and trends over time. The trend graphs give insights such as:

• Increase in data mismatch with increase in data inflow.
• Increase in latency with increase in data inflow.
• Increase in data drops with increase in data inflow.
• Data inflow spikes during month end, quarter end and year end.
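
The trend insights above come from dashboards, but the underlying measures are simple ratios per validation run. As a rough sketch, assuming each run is summarized as a dictionary with hypothetical fields `period`, `inflow`, `mismatches` and `drops`, the rates and inflow spikes could be derived as follows. The spike threshold of 1.5 times the median is an arbitrary illustrative choice.

```python
from statistics import median
from typing import Any, Dict, List


def trend_rows(runs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn per-run counts into rates that can be plotted over time."""
    rows = []
    for run in runs:
        inflow = run["inflow"]
        rows.append({
            "period": run["period"],
            "inflow": inflow,
            # Guard against empty runs to avoid division by zero.
            "mismatch_rate": run["mismatches"] / inflow if inflow else 0.0,
            "drop_rate": run["drops"] / inflow if inflow else 0.0,
        })
    return rows


def inflow_spikes(runs: List[Dict[str, Any]], factor: float = 1.5) -> List[Any]:
    """Return the periods whose inflow exceeds `factor` times the median inflow."""
    baseline = median(run["inflow"] for run in runs)
    return [run["period"] for run in runs if run["inflow"] > factor * baseline]
```

Plotting `mismatch_rate` and `drop_rate` against `inflow` is one way to check whether mismatches and drops grow with volume, as the trend graphs suggest.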



Practitioner’s View: Case Study : Additional Points

Achieving Performance

• Distributed query optimization and execution
• Columnar execution
• Runtime compilation and code generation
• Vectorization
• Optimistic/pipelined execution

Achieving Security

• Authentication
• Encryption
• Impersonation
• Authorization


References & Appendix

• https://en.wikipedia.org/wiki/Apache_Drill
• http://www.ijmer.com/papers/Vol3_Issue1/DX31599603.pdf
• https://www.infosys.com/it-services/validation...papers/.../data-quality-migration.pdf


Question & Answers


Thank You!!!
