Using Hadoop to build a Data Quality Service for both real-time and batch data
![Page 1: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/1.jpg)
Using Hadoop to build a Data Quality Service for both real-time and batch data
Griffin – https://github.com/ebay/griffin
![Page 2: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/2.jpg)
About Us:
• Alex Lv ([email protected])
  Senior Staff Software Engineer – Data Products
  Platform & Engineering at eBay
• Amber Vaidya ([email protected])
  Lead Product Manager – Data Products
  Platform & Engineering at eBay
![Page 3: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/3.jpg)
Agenda
• Background
• Introduction to Griffin
• Demo
![Page 4: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/4.jpg)
Background
![Page 5: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/5.jpg)
eBay Marketplace at a Glance
Q2 2016 data
• $19.8B GMV in Q2 2016
• 10M new listings added via mobile per week
• 300M searches each day
• 65% of transactions ship for free (in US, UK, DE)
• 80% of items sold as new
• 1B live listings
One of the world’s largest and most vibrant marketplaces
![Page 6: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/6.jpg)
Velocity Stats
US
• 3 car parts or accessories are sold every 1 sec
• A smartphone is sold every 4 sec
• A dress is sold every 6 sec
UK
• A necklace is sold every 10 sec
• A make-up product is sold every 3 sec
• A Lego product is sold every 19 sec
GERMANY
• A truck or car is sold every 5 min
• A pair of women’s jeans is sold every 4 sec
• A video game is sold every 11 sec
AUSTRALIA
• A pair of men’s sunglasses is sold every 1 min
• A home décor item is sold every 12 sec
• A car or truck part is sold every 4 sec
![Page 7: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/7.jpg)
Mobile Velocity Stats
US
• A woman’s handbag is sold every 10 sec
• A car or truck is sold every 5 min
• An action figure is sold every 10 sec
UK
• A tablet is sold every 1 min
• A cookware item is sold every 6 sec
• A car is sold every 2 min
GERMANY
• A pair of women’s shoes is sold every 20 sec
• A watch is sold every 48 sec
• A tire or car part is sold every 35 sec
AUSTRALIA
• A piece of jewelry is sold every 12 sec
• A baby clothing item is sold every 46 sec
• A motorcycle part is sold every 51 sec
![Page 8: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/8.jpg)
Big Data @ eBay
We manage one of the largest data platforms in the world
We utilize one of the largest data platforms in the world
![Page 9: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/9.jpg)
Ensuring data quality at this scale is challenging!
Challenges at eBay:
• No unified view of data quality across multiple systems and teams
• No shared platform to manage data quality
• No system to measure near real-time data quality
![Page 10: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/10.jpg)
What is Data Quality?
Definition
• How well it meets the expectations of data consumers
• How well it represents the objects, events, and concepts it is created to represent
Dimensions
Completeness
Uniqueness
Timeliness
Validity
Accuracy
Consistency
Core Dimensions
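To make a few of these dimensions concrete, here is a minimal Python sketch of how completeness, uniqueness, and validity could each be expressed as a ratio between 0 and 1. It is an illustration only, not Griffin's implementation: the function names, the sample `events` records, and the allowed site list are all assumptions made for the example.

```python
from typing import Callable, Mapping, Optional, Sequence

Record = Mapping[str, Optional[str]]


def completeness(records: Sequence[Record], field: str) -> float:
    """Share of records whose `field` is present and non-empty."""
    if not records:
        return 1.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def uniqueness(records: Sequence[Record], key: str) -> float:
    """Share of distinct values among the records that carry `key`."""
    values = [r[key] for r in records if r.get(key) not in (None, "")]
    if not values:
        return 1.0
    return len(set(values)) / len(values)


def validity(records: Sequence[Record], field: str,
             is_valid: Callable[[Optional[str]], bool]) -> float:
    """Share of records whose `field` passes the `is_valid` rule."""
    if not records:
        return 1.0
    return sum(1 for r in records if is_valid(r.get(field))) / len(records)


# Hypothetical sample data, invented for this sketch.
events = [
    {"user_id": "u1", "site_id": "US", "title": "Lego set"},
    {"user_id": "u2", "site_id": "UK", "title": ""},
    {"user_id": "u2", "site_id": "XX", "title": "Necklace"},
    {"user_id": None, "site_id": "DE", "title": "Video game"},
]
ALLOWED_SITES = {"US", "UK", "DE", "AU"}  # assumed rule for the example

title_completeness = completeness(events, "title")
user_uniqueness = uniqueness(events, "user_id")
site_validity = validity(events, "site_id", lambda v: v in ALLOWED_SITES)
```

Expressing every dimension as a ratio keeps the values comparable, so one threshold mechanism can apply to all of them.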
![Page 11: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/11.jpg)
Virtuous Cycle of Data Quality
[Figure: cycle of Define → Measure → Analyze → Improve]
• Define the scope, dimensions, goals, thresholds, etc.
• Measure data quality values
• Analyze data quality results
• Improve data quality
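The four steps above can be sketched as a small loop. This is a hedged illustration rather than Griffin's API: `MetricDefinition`, `measure_and_analyze`, the 0.9 threshold, and the sample records are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]


@dataclass(frozen=True)
class MetricDefinition:
    """Define: a named metric, how to measure it, and its pass threshold."""
    name: str
    measure: Callable[[Sequence[Record]], float]
    threshold: float


def title_completeness(records: Sequence[Record]) -> float:
    """Share of records with a non-empty title."""
    if not records:
        return 1.0
    return sum(1 for r in records if r.get("title")) / len(records)


def measure_and_analyze(definitions: Sequence[MetricDefinition],
                        records: Sequence[Record]) -> List[dict]:
    """Measure each metric, then analyze it against its threshold."""
    results = []
    for d in definitions:
        value = d.measure(records)
        results.append({
            "metric": d.name,
            "value": value,
            "threshold": d.threshold,
            "passed": value >= d.threshold,
        })
    return results


# Define (assumed goal: at least 90% of titles filled in).
definitions = [MetricDefinition("title_completeness", title_completeness, 0.9)]

# Measure and analyze on hypothetical data with one missing title.
records = [{"title": "Lego set"}, {"title": ""},
           {"title": "Necklace"}, {"title": "Dress"}]
before = measure_and_analyze(definitions, records)

# Improve: simulate the upstream fix, then measure again.
fixed = [r if r["title"] else {"title": "Make-up kit"} for r in records]
after = measure_and_analyze(definitions, fixed)
```

Re-running the same definitions after the fix is what closes the cycle: the measurement itself does not change, only the data does.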
![Page 12: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/12.jpg)
Our Goal
A solution with all the below capabilities

| Capability | Commercial DQ software | Open source DQ software |
|---|---|---|
| Support eBay’s scale | x | x |
| Data Quality measurement | √ | x |
| Support real-time data | x | x |
| Support unstructured data | x | x |
| Service based API | √ | x |
| Data Profiling | √ | √ |
| Pluggable measurement types | x | x |
![Page 13: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/13.jpg)
Griffin
![Page 14: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/14.jpg)
What is Griffin?
• Data Quality Platform built on Hadoop and Spark
  • Batch data
  • Real-time data
  • Unstructured data
• A unified process to detect DQ issues
  • Incomplete
  • Inaccurate
  • Invalid
  • ……
• An open source solution: https://github.com/ebay/griffin
![Page 15: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/15.jpg)
Data Quality Framework in Griffin
Phases: Define, Measure, Analyze
• Define Data Quality Dimensions
• Define Metrics, Goals, Thresholds
[Figure labels: Calculators running on Source; RDBMS; Accuracy, Completeness, Uniqueness, Timeliness, Validity, Consistency; Metrics; Measure Repository; Metrics Repository; Scorecards]
• Scorecard Reports generated and displayed
• Measurement values and quality scores calculated and stored
• Data quality trending graphs generated
![Page 16: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/16.jpg)
Technical Highlights
Real-time, Fast, Massive, Pluggable
![Page 17: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/17.jpg)
Component Design
![Page 18: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/18.jpg)
Measure Calculator Example – Accuracy of Viewitem
~300M customer view events per day
[Figure labels: Source, Target, Accuracy Calculator, Metric]
Item_view
• User Id
• Page Id
• Site Id
• Title
• Date
• ……
[Formula: … × 100%]
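The accuracy formula on this slide survives only as "× 100%". The sketch below assumes a common reading of it: accuracy is the share of source records that have a matching record in the target, expressed as a percentage. The function name `accuracy_percent`, the snake_case field names, and the sample records are assumptions for illustration, not Griffin's actual code.

```python
from collections import Counter
from typing import Mapping, Sequence, Tuple

Record = Mapping[str, str]


def accuracy_percent(source: Sequence[Record], target: Sequence[Record],
                     fields: Sequence[str]) -> float:
    """Percentage of source records matched by a target record on `fields`.

    Each target record can match at most one source record, so
    duplicates in the source are not over-counted.
    """
    def key(record: Record) -> Tuple:
        return tuple(record.get(f) for f in fields)

    remaining = Counter(key(r) for r in target)
    matched = 0
    for record in source:
        k = key(record)
        if remaining[k] > 0:
            remaining[k] -= 1
            matched += 1
    if not source:
        return 100.0
    return matched / len(source) * 100.0


# Hypothetical item-view events; field names mirror the slide's list.
FIELDS = ["user_id", "page_id", "site_id"]
source_events = [
    {"user_id": "u1", "page_id": "p1", "site_id": "US"},
    {"user_id": "u2", "page_id": "p7", "site_id": "UK"},
    {"user_id": "u3", "page_id": "p2", "site_id": "DE"},
    {"user_id": "u4", "page_id": "p9", "site_id": "AU"},
]
target_events = [
    {"user_id": "u1", "page_id": "p1", "site_id": "US"},
    {"user_id": "u2", "page_id": "p7", "site_id": "UK"},
    {"user_id": "u3", "page_id": "p2", "site_id": "FR"},  # site differs
    {"user_id": "u4", "page_id": "p9", "site_id": "AU"},
]

viewitem_accuracy = accuracy_percent(source_events, target_events, FIELDS)
```

At roughly 300M events per day the same comparison would run as a distributed join (for example on Spark) rather than in memory, but the ratio being computed is the same.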
![Page 19: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/19.jpg)
Use Cases
• Griffin has been deployed in production at eBay and provides a centralized data quality service for several eBay systems.
• ~1.2PB data
• 800+M daily records
• 100+ metrics
![Page 20: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/20.jpg)
Now life is easier……
![Page 21: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/21.jpg)
Demo
![Page 22: Using Hadoop to build a Data Quality Service for both real-time and batch data](https://reader036.fdocuments.in/reader036/viewer/2022070603/586fdde11a28ab18428b69df/html5/thumbnails/22.jpg)
We are open source and welcome contributions
GitHub: https://github.com/eBay/griffin
Blog: http://www.ebaytechblog.com/?p=5877/
Contact: [email protected]