Big Data Testing: Automate the Testing of Hadoop, NoSQL & DWH without Writing Code


Transcript of Big Data Testing: Automate the Testing of Hadoop, NoSQL & DWH without Writing Code

Page 1: Big Data Testing: Automate the Testing of Hadoop, NoSQL & DWH without Writing Code

built by

QuerySurge ™

Automated Big Data Testing without Writing Code

Testing of Hadoop and Data Warehouses Visually

Bill Hayduk, CEO/President

RTTS

Jeff Bocarsly, PhD, Chief Architect

QuerySurge /RTTS

Page 2:

Presentation Topics

• Testing a Data Warehouse

• Testing Big Data

• Current Data Testing Strategies

• About QuerySurge

• Demo

Page 3:

About FACTS

Founded: 1996

Headquarters: New York

Customer profile:
• Fortune 1000
• 600+ customers

Strategic Partners: IBM, Microsoft, HP, Oracle, Teradata, Hortonworks, Cloudera, Amazon Web Services

Software: QuerySurge

RTTS is the leading provider of software & data quality for critical business systems

Page 4:

“70% of enterprises have either deployed or are planning to deploy big data projects and programs this year”

– analyst firm IDG

“46% of companies cite data quality as a barrier for adopting Business Intelligence products.”

- InformationWeek

“Poor data quality is a primary reason for 40% of all business initiatives failing to achieve their targeted benefits.”

- analyst firm Gartner

Data Quality Issues


Page 5:

Business Intelligence (BI) software

CxOs are using Business Intelligence & Analytics to make critical business decisions – with the assumption that the underlying data is fine.

“The average organization loses $14.2 million annually through poor Data Quality.”

- Gartner

The Executive Office & Critical Data

[Diagram: Data Architecture, showing Source Data (including Flat Files), the ETL Process, and the Data Warehouse / Big Data store, with potential problem areas marked]

Page 6:

Data Warehouse Testing

Page 7:

Data Warehouse: the Marketplace

“The data warehousing market will see a compound annual growth rate of 11.5% …to reach a total of $13.2 billion in revenue.”

- consulting specialist The 451 Group

Data Warehouse software vendors

- Analyst firm Gartner’s Magic Quadrant for Data Warehouse Database Management Systems

Leaders

Challengers


Page 8:

Testing the Data Warehouse: the ETL process

[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target Data Warehouse]

Page 9:

Testing the Data Warehouse: Test Entry Points

Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).

The goal: provide rapid localization of data issues between points

[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DW → ETL Process → Data Mart → Business Intelligence software, with test entry points marked along the flow]

Page 10:

Big Data Testing

Page 11:

Big Data Vendors

“Big Data technology & services market will grow at a 26.4% CAGR to $41.5 billion through 2018, or about 6x the growth rate of the overall IT market.”

- Analyst firm IDC

Page 12:

Basic Hadoop Architecture

MapReduce – the processing part that manages the programming jobs (a.k.a. Task Tracker).

HDFS (Hadoop Distributed File System) – stores data on the machines (a.k.a. Data Node).

Cluster – add more machines for scaling, from 1 to 100 to 1,000. Each machine in the diagram runs a Task Tracker and a Data Node.

Name Node – coordination for HDFS. Inserts and extractions are communicated through the Name Node.

Job Tracker – accepts jobs, assigns tasks, identifies failed machines.

Page 13:

Hive

Hive – a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files:

• create
• insert
• update
• delete
• select

[Diagram: HiveQL queries running on top of MapReduce (Task Tracker) and HDFS (Data Node)]

Page 14:

2 Use Cases:

1) Hadoop, Data Warehouse

2) NoSQL, Hadoop, Data Warehouse

Page 15:

Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).

The goal: provide rapid localization of data issues between points

Use Case #1: Data Warehouse & Hadoop

[Diagram: Source Data → Source Hadoop → ETL Process → Target DWH → Business Intelligence software, with test entry points marked along the flow]

Page 16:

Use Case #2: MongoDB, Hadoop, Data Warehouse

[Diagram: stages labeled Source Data, Ingestion, Relational DB & Data Warehousing, and BI, Analytics & Reporting, with five test entry points marked]

Page 17:

2 Prevalent Data Testing Strategies

1) Stare & Compare (also known as sampling)

2) Minus Queries

Page 18:

Strategy #1: Stare & Compare

• Review Mapping Document (business rules, data flow mapping, data movement requirements)

• Write Tests in SQL editor

• Execute 2 Tests: 1 at Source & 1 at Target

• Dump results to 2 Excel files

• Compare results by eye (‘Stare & Compare’ or ‘sampling’)

Issue with Stare & Compare: Impossible to visually compare billions of data sets.

Result: usually less than 1% of data is compared

Example: Current QuerySurge customer has:

• a single test with 100 million rows & 200 columns

• = 20 billion data sets

• the client has > 7,000 total tests
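The arithmetic behind this example can be checked in a few lines. This is an illustrative sketch only: the row and column counts come from the slide, and the 1% coverage figure is an assumption that treats the "less than 1%" result above as an upper bound.

```python
# Figures taken from the slide's customer example.
rows = 100_000_000   # 100 million rows in a single test
columns = 200        # 200 columns per row

# Each row/column intersection is one value to verify.
data_values = rows * columns
print(f"{data_values:,} data values in one test")

# Assumption: manual "stare & compare" covers at most 1% of the values.
sampled_percent = 1
checked = data_values * sampled_percent // 100
unchecked = data_values - checked
print(f"{unchecked:,} values left unchecked at {sampled_percent}% coverage")
```

Even at the assumed upper bound of 1% coverage, the great majority of values in this single test would never be looked at.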

Page 19:

Data Testing Strategy #2: Minus Queries

MINUS QUERIES subtract one result set from another result set to show the difference.

• Write 2 MINUS queries in SQL editor

• Execute MINUS queries 2x

Minus Query #1: Table_1 MINUS Table_2 (Result Set #1)

Minus Query #2: Table_2 MINUS Table_1 (Result Set #2)

Comment: MINUS QUERIES need to be executed 2x (Source MINUS Target; Target MINUS Source)

ISSUES with MINUS QUERIES

• Result sets may not be accurate when dealing with duplicate rows of data

• No historical data from past testing – audit and regulatory issues

• Processing of minus queries puts pressure on the servers

• Double execution means 2x testing time and resource utilization

• Potential for false positives (bad data could exist on both sides of an ETL leg)
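The two minus queries on this slide, and the duplicate-row issue, can be illustrated with a small runnable sketch. It assumes SQLite, whose `EXCEPT` operator stands in for the Oracle-style `MINUS` keyword; the table contents are invented for illustration.

```python
import sqlite3

# Assumption: SQLite's EXCEPT plays the role of MINUS in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER, name TEXT);
    CREATE TABLE table_2 (id INTEGER, name TEXT);
    INSERT INTO table_1 VALUES (1, 'Ann'), (2, 'Bob'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO table_2 VALUES (1, 'Ann'), (2, 'Bob'), (4, 'Dee');
""")

# Minus Query #1: rows in Table_1 (source) that are absent from Table_2.
source_minus_target = conn.execute(
    "SELECT id, name FROM table_1 EXCEPT SELECT id, name FROM table_2"
).fetchall()

# Minus Query #2: rows in Table_2 (target) that are absent from Table_1.
target_minus_source = conn.execute(
    "SELECT id, name FROM table_2 EXCEPT SELECT id, name FROM table_1"
).fetchall()

print(source_minus_target)  # [(3, 'Cy')]
print(target_minus_source)  # [(4, 'Dee')]
```

Table_1 holds the row (2, 'Bob') twice and Table_2 holds it once, yet neither result set reports the difference, because the operator works on distinct rows. That is the duplicate-row inaccuracy listed among the issues.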

Page 20:

Data Testing Strategies

a fundamental issue with both current strategies:

Assumption that all team members understand and can write SQL or HQL code

Page 21:

About QuerySurge ™

Page 22:

What is QuerySurge ™?

the collaborative Big Data Testing solution that finds bad data & provides a holistic view of your data’s health

Data Testing

Page 23:

the QuerySurge advantage

Automate the entire testing cycle: automate kickoff, tests, comparison, auto-emailed results

Create Tests easily with no programming: ensures minimal time & effort to create tests / obtain results

Test across different platforms: data warehouse, Hadoop, NoSQL, database, flat file, XML

Collaborate with team: Data Health dashboard, shared tests & auto-emailed reports

Verify more data & do it quickly: verifies up to 100% of all data up to 1,000x faster

Integrate for Continuous Delivery: integrates with most Build, ETL & QA management software

Page 24:

Collaboration

Testers: functional testing, regression testing, result analysis

Developers / DBAs: unit testing, result analysis

Data Analysts: review, analyze data, verify mapping failures

Operations teams: monitoring, result analysis

Managers: oversight, result analysis

Share information on the health of your data

Page 25:

QuerySurge™ Architecture

Web-based…

Installs on...

Linux

Connects to…

…or any other JDBC compliant data source

QuerySurge Controller

QuerySurge Server

QuerySurge Agents

Flat Files

Page 26:

How QuerySurge Works

1) QS pulls data from data sources

2) QS pulls data from target data store

3) QS compares data quickly

4) QS generates reports, audit trails

Reports, Data Health Dashboard, auto emails

[Diagram: SQL and HQL queries run against the Source Data and Target Data stores listed below]

Data Stores
• Databases
• Data Warehouses
• Data Marts

Flat Files
• Fixed Width
• Delimited
• Excel

Big Data stores
• Hadoop
• NoSQL

Data Warehouses

XML
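The pull-and-compare steps on this slide can be sketched generically as follows. This is not QuerySurge's implementation: the function `compare_query_pair`, the table names, and the use of two in-memory SQLite databases as source and target are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter

def compare_query_pair(source_conn, source_sql, target_conn, target_sql):
    """Run one query on each side and report the rows that do not match."""
    # Counter keeps a count per distinct row, so duplicates are not lost.
    source_rows = Counter(source_conn.execute(source_sql).fetchall())
    target_rows = Counter(target_conn.execute(target_sql).fetchall())
    return {
        "missing_in_target": sorted((source_rows - target_rows).elements()),
        "unexpected_in_target": sorted((target_rows - source_rows).elements()),
    }

# Hypothetical source and target stores.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (2, 'Bob');
""")
target.executescript("""
    CREATE TABLE dim_customer (id INTEGER, name TEXT);
    INSERT INTO dim_customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
""")

report = compare_query_pair(
    source, "SELECT id, name FROM customers",
    target, "SELECT id, name FROM dim_customer",
)
print(report)
```

Comparing row counts per distinct row (a multiset comparison) is one way to catch the duplicate-row differences that a set-based minus query would hide.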

Page 27:

all QuerySurge™ Modules

Design Library

Scheduling

Deep-Dive Reporting

Run Dashboard

Query Wizards

Data Health Dashboard

Page 28:

QuerySurge™ Modules

Design Library
• Create Query Pairs (source & target SQLs)
• Great for team members skilled with SQL

Scheduling
• Build groups of Query Pairs
• Schedule Test Runs

Page 29:

QuerySurge™ Modules

Deep-Dive Reporting
• Examine and automatically email test results

Run Dashboard
• View real-time execution
• Analyze real-time results

Page 30:

QuerySurge Test Management Connectors

Drive QuerySurge execution from your Test Management Solution

Outcome results (Pass/Fail/etc.) are returned from QuerySurge to your Test Management Solution

Results are linked in your Test Management Solution so that you can click directly into detailed QuerySurge results

Integration with leading Test Management Solutions:

• HP ALM (Quality Center)

• Microsoft Team Foundation Server

• IBM Rational Quality Manager

Page 31:

QuerySurge & DevOps: Continuous Delivery & Integration

[Diagram: Data Integration/ETL solutions, Test Management solutions, and Automated Build solutions (and many others…) each connect to QuerySurge ™ for Automated Launch, Automated Testing, and Automated Reporting, ending in an email report]

Page 32:

Introducing the new Query Wizards

We just made data testing REALLY EASY! 

No programming needed

Testing Big Data Visually

Page 33:

Recent Survey: Data Experts

From a recent poll¹ of:
• Big Data Experts
• Data Warehouse Architects
• Solution Architects
• ETL Architects

Our Question: What % of columns in your projects have no transformations at all?

Consensus Answer: 80% of data columns have no transformation at all

Why is this important?

¹ Poll conducted by RTTS on targeted LinkedIn groups

Page 34:

Fast and Easy. No programming needed.

QuerySurge™ Modules

Compare by Table, Column & Row

• Perform 80% of all data tests

• Automatically generates SQL & HQL code

• Opens up testing to novice & non-technical team members

• Speeds up testing for skilled SQL coders

• provides a huge Return-On-Investment

Page 35:

QuerySurge™ Modules

3 Types of Data Comparison Wizards:

Column-Level Comparison: great for Big Data stores and Data Warehouses.

Table-Level Comparison: great for Data Migrations and Database Upgrades.

Row Count Comparison: great for all of these: Big Data stores, Data Warehouses, Data Migrations and Database Upgrades.

The Query Wizards also provide you with automated features for:
• filtering (‘Where’ clause) and
• sorting (‘Order By’ clause)

Page 36:

Column-Level Comparison

Uses: Tests the columns that have no transformations, which means it tests approximately 80% of your data store without you writing any SQL code

Tests: Big Data, Data Warehouses

Value added:
• novice or non-technical: no coding needed, productive immediately
• experienced user: saves time

Page 37:

• pick Source & Target

• pick Comparison Type (we picked Column-Level Comparison)

• Select Tables & Columns

• Auto-generated SQL (one query each for Source & Target)

• Filter (‘Where’ clause)
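To make the idea of auto-generated SQL concrete, here is a minimal sketch of how a pair of column-level queries could be assembled from the selections on this slide. The function `build_column_query`, the table names, and the column names are hypothetical; the SQL that QuerySurge actually generates is not shown in this transcript.

```python
def build_column_query(table, columns, where=None, order_by=None):
    """Build a SELECT for the chosen columns with optional filter and sort."""
    # Identifiers are interpolated as-is; pass trusted names only.
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    if order_by:
        sql += f" ORDER BY {', '.join(order_by)}"
    return sql

# One query per side, using the same filter and a matching sort order.
source_sql = build_column_query(
    "customers", ["id", "name"], where="region = 'EAST'", order_by=["id"]
)
target_sql = build_column_query(
    "dim_customer", ["customer_id", "customer_name"],
    where="region = 'EAST'", order_by=["customer_id"],
)
print(source_sql)
print(target_sql)
```

Sorting both sides on a matching key is what lets the two result sets be compared row by row.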

Page 38:

Table-Level Comparison

Uses: Verifies data loads when no transformation occurs

Tests: data migrations, upgrades

Value added:
• novice or non-technical: no coding needed
• experienced user: saves time

Page 39:

Row Count Comparison

Use: Verify that the number of rows in the source matches the number of rows in the target

Tests: Big data, data warehouse, data migration, database upgrades, data interfaces

Value added:
• novice: no coding needed
• experienced user: saves time
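A row count comparison reduces to two `COUNT(*)` queries and an equality check. The sketch below assumes SQLite and invented table names (`stage_orders`, `dw_orders`); it illustrates the check itself, not the wizard's internals.

```python
import sqlite3

def compare_row_counts(conn, source_table, target_table):
    """Return (source_count, target_count, counts_match) for two tables."""
    # Table names are interpolated directly; use trusted names only.
    source_count = conn.execute(
        f"SELECT COUNT(*) FROM {source_table}"
    ).fetchone()[0]
    target_count = conn.execute(
        f"SELECT COUNT(*) FROM {target_table}"
    ).fetchone()[0]
    return source_count, target_count, source_count == target_count

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stage_orders (id INTEGER);
    CREATE TABLE dw_orders (id INTEGER);
    INSERT INTO stage_orders VALUES (1), (2), (3);
    INSERT INTO dw_orders VALUES (1), (2);
""")

result = compare_row_counts(conn, "stage_orders", "dw_orders")
print(result)  # (3, 2, False)
```

Matching counts do not prove the rows themselves are equal, so this check is usually a quick first pass before a column-level or table-level comparison.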


Page 40:

Training Courses

Data Warehouse Testing
• Data Warehouse & ETL Testing Fundamentals (1 day)
• Fundamentals of QuerySurge (1 day)
• Introduction to SQL for QuerySurge (1 day)
• Advanced SQL techniques for QuerySurge (1 day)

Big Data Testing
• Big Data and ETL Testing Fundamentals
• Introduction to Big Data Testing Using Hive and HQL

Consulting

RTTS, the software quality experts (and developer of QuerySurge), provides consulting solutions to the challenges of Big Data & Data Warehouse / ETL Testing

• Jumpstart 2-week program – combines training courses, mentoring, consulting

• Staff Augmentation – add additional RTTS resources to your team

• Outsourcing - RTTS can perform all testing, including planning, design, execution

Page 41:

QuerySurge™ Free Trials

(1) Trial in the Cloud of QuerySurge™, including a self-learning tutorial that works with sample data, for 3 days

(2) Downloaded Trial of QuerySurge™, including a self-learning tutorial with sample data or your data, for 15 days

(3) Proof of Concept of QuerySurge™, which includes our team of experts assisting you, for 30 days

For more information on (1), (2) and (3), go to http://www.querysurge.com/compare-trial-options

Page 42:

QuerySurge Demo