2015 - Extract SF - Data Quality
It's Time to Start Caring About Data Quality
Data Quality at Scale
Ignacio Elola
Everyone is talking about how useful data is
data can save your business
data can save your life
but...
all that is only true if you have the right data
data tend to be dirty and unstructured
especially web data!
Let’s start simple
I’ve created an extractor
I’ve passed it a bunch of queries (bulk)
and got a dataset
How can you QA this data?
eyeballing
by eyeballing we can find anomalies without having domain expertise
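Eyeballing gets easier with a quick per-column profile next to the raw rows. The sketch below is an illustration only: the `profile` helper and the sample rows are hypothetical, not part of any import.io tooling. It counts empty cells, distinct values and the most frequent values for each column, so odd entries stand out at a glance.

```python
from collections import Counter


def profile(rows):
    """Summarise each column: empty cells, distinct values, most common values."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        filled = [v for v in values if v not in (None, "")]
        report[col] = {
            "empty": len(values) - len(filled),
            "distinct": len(set(filled)),
            "top": Counter(filled).most_common(3),
        }
    return report


# Made-up sample dataset, standing in for the output of a bulk extraction.
rows = [
    {"name": "Widget", "price": "$9.99"},
    {"name": "Gadget", "price": "$12.00"},
    {"name": "Gizmo", "price": ""},
    {"name": "Widget", "price": "$9.99"},
]

report = profile(rows)
for col, stats in report.items():
    print(col, stats)
```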
Quick summary:
- create extractors
- combine extractors
- schedule data extraction
What if we need to scale up?
if you have:
- more than ~3 data sources
- more than ~2 extractors per data source
- big volume of queries
- pre or post processing
you will need:
- people to create and maintain extractors
- process to clean and validate data
Data Quality
think about it pre and post data extraction!
tips and tricks to increase data quality
XPaths
//div[@id="priceBlock"]/table/tbody/tr/td[b/@class="priceLarge"]/b
better than
//*[@id="priceBlock"]/table/tbody/tr[2]/td[2]/b[1]
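Why the first form holds up better can be shown with a small experiment. This is a sketch under assumptions: both pages are invented, and because Python's standard-library ElementTree supports only a subset of XPath, the anchored expression here is an approximation of the one on the slide (it selects the `b` element by its class directly). A row is inserted into the table to simulate a site redesign.

```python
import xml.etree.ElementTree as ET

# Invented page, before and after a redesign that adds a promo row on top.
ORIGINAL = """
<html><body><div id="priceBlock"><table><tbody>
<tr><td>List price</td><td><b class="priceSmall">$12.00</b></td></tr>
<tr><td>Price</td><td><b class="priceLarge">$9.99</b></td></tr>
</tbody></table></div></body></html>
""".strip()

REDESIGNED = """
<html><body><div id="priceBlock"><table><tbody>
<tr><td>Promo</td><td><b class="promo">Free shipping</b></td></tr>
<tr><td>List price</td><td><b class="priceSmall">$12.00</b></td></tr>
<tr><td>Price</td><td><b class="priceLarge">$9.99</b></td></tr>
</tbody></table></div></body></html>
""".strip()

# Position-based path: depends on the exact row and cell order.
POSITIONAL = ".//*[@id='priceBlock']/table/tbody/tr[2]/td[2]/b[1]"
# Attribute-anchored path: depends only on the id and the class name.
ANCHORED = ".//div[@id='priceBlock']/table/tbody/tr/td/b[@class='priceLarge']"


def extract(page, path):
    """Return the text of the first node matching path, or None."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None


for label, page in (("original", ORIGINAL), ("redesigned", REDESIGNED)):
    print(label, extract(page, POSITIONAL), extract(page, ANCHORED))
```

After the redesign the positional path silently returns the list price instead of the sale price, while the anchored path still finds the right value. A silent wrong value is harder to catch than an empty one.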
Regex
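A regex can strip the noise around a value once the XPath has found the right node. The following is a minimal sketch using Python's `re` module; the `parse_price` helper, the pattern and the sample strings are assumptions made for illustration and only handle prices written with `,` as thousands separator and `.` as decimal point.

```python
import re

# One digit, then digits or commas, then an optional decimal part.
PRICE_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Return the first price-like number in text as a float, or None."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_price("Price: $1,299.00 (incl. tax)"))
print(parse_price("Now only 9.99 USD"))
print(parse_price("Sold out"))
```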
More at:
http://support.import.io/knowledgebase/articles/341182-xpaths-regex
http://www.w3schools.com/xsl/xpath_intro.asp
Required column
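One way to read the required-column idea: a row that lacks a value in a column you cannot live without is set aside instead of entering the dataset. The sketch below assumes that reading; the `split_by_required` helper and the sample rows are hypothetical.

```python
def split_by_required(rows, required):
    """Split rows into (valid, rejected) by presence of every required column."""
    valid, rejected = [], []
    for row in rows:
        missing = [
            col for col in required
            if row.get(col) is None or str(row.get(col)).strip() == ""
        ]
        # Keep rejected rows around so they can be inspected, not just dropped.
        (rejected if missing else valid).append(row)
    return valid, rejected


rows = [
    {"name": "Widget", "price": "$9.99"},
    {"name": "Gadget", "price": "   "},
    {"name": "Gizmo"},
]

valid, rejected = split_by_required(rows, required=["price"])
print(len(valid), "valid,", len(rejected), "rejected")
```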
measuring data quality
completeness
coverage
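The slides do not define these two measures, so the sketch below uses working definitions that are assumptions: completeness is the share of non-empty cells among the expected columns, and coverage is the share of input queries that returned at least one row. The helper names, the `query` column and the sample data are hypothetical.

```python
def completeness(rows, columns):
    """Share of non-empty cells across the expected columns (assumed definition)."""
    total = len(rows) * len(columns)
    if total == 0:
        return 0.0
    filled = sum(
        1 for row in rows for col in columns
        if str(row.get(col) or "").strip()
    )
    return filled / total


def coverage(queries, rows, key="query"):
    """Share of queries that produced at least one row (assumed definition)."""
    wanted = set(queries)
    if not wanted:
        return 0.0
    answered = {row[key] for row in rows if key in row}
    return len(answered & wanted) / len(wanted)


queries = ["q1", "q2", "q3", "q4"]
rows = [
    {"query": "q1", "name": "Widget", "price": "$9.99"},
    {"query": "q1", "name": "Gadget", "price": "$12.00"},
    {"query": "q2", "name": "Gizmo", "price": ""},
    {"query": "q3", "name": "Doohickey", "price": "$3.50"},
]

print("completeness:", completeness(rows, ["name", "price"]))
print("coverage:", coverage(queries, rows))
```

Tracking both numbers per extractor over time gives an early signal when a site changes and an extractor starts to degrade.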
post extraction data quality improvements?
how we do it
Smart automation
anomaly detection
variance, variability, noise
normalization
confidence score
Human input
Transparency
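As an illustration of normalization followed by anomaly detection, here is a minimal sketch. It is not the method behind these slides: it swaps in a generic technique, the modified z-score based on the median absolute deviation, with the conventional 0.6745 constant and 3.5 cut-off. The helper names and the sample prices are made up.

```python
import re
import statistics


def to_number(text):
    """Normalise a price string such as '$1,299.00' to a float, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def flag_anomalies(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # No spread at all: nothing can be ranked as unusual.
        return [False] * len(values)
    return [0.6745 * abs(v - median) / mad > threshold for v in values]


raw = ["$9.99", "$10.49", "$9.79", "$10.10", "$999.00", "$10.25"]
values = [to_number(text) for text in raw]
flags = flag_anomalies(values)

for text, is_anomaly in zip(raw, flags):
    print(text, "<- check this" if is_anomaly else "")
```

Flagged rows are candidates for human review, not automatic deletion: an outlier can be an extraction error or a real but unusual value.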
summary
Data Quality is essential
think about it from the very beginning
develop a process to measure data quality before scaling up
if you don’t want to reinvent the wheel - contact us!
Thank you!
[email protected]