
Page 1: Finding and Fixing Data Quality Problems NEARC Fall, 2010 Newport, RI Brian Hebert, Solutions Architect .

Finding and Fixing Data Quality Problems

NEARC Fall, 2010 Newport, RI

Brian Hebert, Solutions Architect

www.scribekey.com

Page 2:

Goal: Help You Improve Your Data

• Provide a definition for Data Quality
• Consider Data Quality within the context of several data integration scenarios
• Suggest a framework and workflow for improving Data Quality
• Review tools and techniques, independent of specific products and platforms
• Help you plan and execute a Data Quality improvement project or program
• Review take-aways and Q&A

Page 3:

Essential Data Quality Components

Meaning, Structure, Contents

Note: These fundamental elements of data quality overlap.

Data is well understood, well structured, and fully populated with the right values FOR END USE.

Page 4:

Data Quality (DQ) Defined

Meaning: Names and definitions of all layers and attributes are fully understood and clear for the end-user community (a.k.a. semantics).

Structure: The appropriate database design is used, including attribute data types, lengths, formats, domains (lookup tables), and relationships.

Contents: The actual data contents are fully populated with valid values and match meaning and structure.

Metadata: Meaning, Structure, and Contents are described in a Data Dictionary or a similar metadata artifact.

Page 5:

Scenarios: DQ Improvement as Data Integration

1) You want to improve the data quality in a stand-alone, independent dataset. Some aspect of meaning, structure, or contents can be improved. (Source -> Target)

2) You want to combine multiple disparate datasets into a single representation. Departments, organizations, systems, or functions are merging or need to share information. (Source1 + Source2 -> Target)

For both cases, many of the same tools and techniques can be used. In fact, in a divide-and-conquer approach, it's often beneficial to always start with 1.

Page 6:

Typical Data Quality/Integration Situations

Data is in different formats, schemas, and versions, but provides some of the same information. Examples:

• You need to clean up a single existing dataset
• 2 departments in a utility company: customer billing and outage management, central DB and field operations
• Merging 2 separate databases/systems: getting town CAMA data into GIS
• Consolidating N datasets: MassGIS Parcel Database, CDC disease records from individual states
• 2 city/state/federal organizations: Transportation and Emergency Management need a common view
• Preparing for Enterprise Application Integration: wrapping legacy systems in XML web services

Page 7:

Scenario 1 Case Study: Cleaning Up Facility Data

• Organization maintains information on facility assets.
• The information includes data describing basic location, facility type, size, function, and contact information.
• Organization needs a decision support database.
• Data has some quality issues.
• Case is somewhat generic; it could apply to buildings, complexes, sub-stations, exchange centers, industrial plants, etc.
• Idea: you will likely recognize some of your own data in this case.

Page 8:

Solution: Workflow Framework and Foundation - Data Integration Support Database and Ordered Tasks

[Diagram: sources A, B, C flow through ordered, iterative operations - INVENTORY, COLLECTION, DATA PROFILING, STANDARDIZE & MAP/GAP, SCHEMA GENERATION, ETL, VALIDATION, APPLICATIONS - built around an INTEGRATION SUPPORT DB feeding a CENTRAL RDB.]

Page 9:

Solution Support: Ordered Workflow Steps

• Inventory: What do we have, where, who, etc.?
• Collection: Get some samples.
• Data Profiling and Assessment: Capture and analyze the meaning, structure, and content of the source(s).
• Schema Generation: One or several steps to determine what the target data should look like.
• Schema Gap and Map: What are the differences? A description of how we get from A to B.
• ETL: Physical implementation of getting from A to B - code, SQL, script, etc.
• Validation: Do results match goals?
• Applications: Test data through applications and aggregations.
• Repeat Processing for Updates: Swap in a newer version of data source A.

Page 10:

Inventory: The Dublin Core (+)

1. Contributor - An entity responsible for making contributions to the resource.
2. Coverage - The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.
3. Creator - An entity primarily responsible for making the resource.
4. Date - A point or period of time associated with an event in the lifecycle of the resource.
5. Description - An account of the resource.
6. Format - The file format, physical medium, or dimensions of the resource.
7. Identifier - An unambiguous reference to the resource within a given context.
8. Language - A language of the resource.
9. Publisher - An entity responsible for making the resource available.
10. Relation - A related resource.
11. Rights - Information about rights held in and over the resource.
12. Source - The resource from which the described resource is derived.
13. Subject - The topic of the resource.
14. Title - A name given to the resource.
15. Type - The nature or genre of the resource.

http://dublincore.org/documents/dces

Question: How do you capture information on existing data?

Page 11:

Multiple Data Description Sources for Inventory

[Diagram: Website, Documentation, Metadata, People/SMEs, Email, and the Data Itself all feed the INVENTORY.]

Gather info about data from a variety of sources.

Page 12:

The Data Profile: Meaning, Structure, Contents

Column Profile elements:
1. DatasetId - A unique identifier for the dataset
2. DatabaseName - The name of the database
3. TableName - The name of the database table
4. ColumnName - The name of the data column
5. DataType - The data type of the column
6. MaxLength - The max length of the column
7. DistinctValues - The number of distinct values used in the column
8. PercentDistinct - The percentage of distinct values used in the column
9. SampleValues - A sampling of data values used in the column
10. MinLengthValue - The minimum length data value
11. MaxLengthValue - The maximum length data value
12. MinValue - The minimum value
13. MaxValue - The maximum value

Table Profile elements:
1. DatasetId - A unique identifier for the dataset
2. DatabaseName - The name of the source database
3. TableName - The name of the source database table
4. RecordCount - The number of records in the table
5. ColumnCount - The number of columns in the table
6. NumberOfNulls - The number of null values in the table

The Table Profile is helpful for getting a good overall idea of what's in a database. The Column Profile is helpful for getting a detailed understanding of database structure and contents.
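The column-level profile elements above can be sketched in a few lines of code. This is a minimal illustration, not a full profiler; the record layout (a list of dicts) and the sample rows are assumptions for the example.

```python
def column_profile(rows, column):
    """Compute basic Column Profile elements for one column.

    rows: list of dicts (one per record); column: column name.
    Returns distinct count, percent distinct, null count,
    min/max-length values, and a few sample values.
    """
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    distinct = set(non_null)
    # Sort by string length to find the min/max-length values.
    by_length = sorted(non_null, key=lambda v: len(str(v)))
    return {
        "RecordCount": len(values),
        "DistinctValues": len(distinct),
        "PercentDistinct": round(100.0 * len(distinct) / len(values), 2) if values else 0.0,
        "NumberOfNulls": len(values) - len(non_null),
        "MinLengthValue": by_length[0] if by_length else None,
        "MaxLengthValue": by_length[-1] if by_length else None,
        "SampleValues": sorted(map(str, distinct))[:5],
    }

# Hypothetical sample records.
rows = [
    {"TYPE": None, "ADDRESS": "100 MAIN STREET"},
    {"TYPE": None, "ADDRESS": "10 STUART PL."},
    {"TYPE": None, "ADDRESS": "100 MAIN STREET"},
]
profile = column_profile(rows, "ADDRESS")
```

Running this over every column of every table yields rows for the profile tables described above.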

Page 13:

Data Profiling (and Metadata Import)

[Diagram: Roads (Data Dictionary), Parcels (no metadata, end user), and Buildings (FGDC XML metadata) feed a Data Profiler and an XML Metadata Import, which populate the Integration Support DB.]

How Data Profiling Works: the profiler is an application that reads through data and gathers names, structure, contents, patterns, and summary statistics. You can also learn about data through documentation and end users.

Page 14:

Data Profiling: Table Information

• The Table Profile gives a good overview of record counts, number of columns, nulls, general completeness, and a list of column names.
• Very helpful for quickly getting an idea of what's in a database and comparing sets of data which need to be integrated.

Example:

Database: Acme, Table: FACILITY, Records: 53, Columns: 12, Values: 636, Nulls: 19, Complete: 97.01
Columns: ID, TYPE, FIRST_NAME, LAST_NAME, POS, ADDRESS, TOWN_STATE, ZIP, State, Tel, SECTION, AREA, INSPECT

Database: Techno, Table: OFFICE, Records: 14, Columns: 12, Values: 168, Nulls: 0, Complete: 100
Columns: ID, FacilId, Num, Name, Address, Town, Area Code, Tel, Job, Dept, Value

Page 15:

Data Profiling: Column Information

• The Column Profile provides detailed information on the structure and contents of the fields in the database.
• It provides the foundation for much of the integration work that follows.

Example (Database: Acme, Table: Facility):

ADDRESS - String, length 50, 50 distinct (94.3%), 3 nulls, 94.34% complete; sample values: 10 STUART PL., 100 MAIN STREET, 100 Thompson Pl, 125 Stratford Pl; min/max-length values: 10 STUART PL. / REGINA ROAD WEST
TYPE - String, length 50, 1 distinct (1.89%), 53 nulls, 0% complete
FIRST_NAME - String, length 50, 51 distinct (96.2%), 0 nulls, 100% complete

Page 16:

Data Profiling: Domain Information

• Domains provide one of the most essential data quality control elements.
• List Domains are lists of valid values for given database columns, e.g., standard state abbreviations MA, CT, RI, etc. for State.
• Range Domains provide minimum and maximum values, primarily used for numeric data types.
• Many domains can be discovered and/or generated from the profile.

Example POS values: Accountant, admin, Administration, ARCH, Architect, CEO, CFO, Consultant, CTO, Data Collection, Data Entry, Database, DBA, Developer, Dismissed, Lead Developer, Lead programmer, Database Admin

Question: Do you use profiling to get a concise summary of data details?

Page 17:

Workshop Exercise: Example Profile

Page 18:

Workshop Exercise: Column Profile Analysis

The profile itself is generated automatically. The real value is in the results of the analysis: what needs to be done to the data?

1) Are the column name and definition clear?
2) Are there other columns in other tables with the same kind of data and intent, with a better name?
3) Is the data type appropriate? E.g., many times numeric data and dates can be found in text fields. Is the length appropriate?
4) Is this a unique primary key? Does the number of values equal the number of records? Is it a foreign key?
5) Is this column being used? Is it empty? How many null values are there? What percent complete is the value set?
6) Is there a rule which can be used to validate data in this column, such as:
   1) a List Domain or Range Domain
   2) a specific format (regular expression)
   3) other rules, possibly involving other columns
7) Does the data indicate that the column should be split into another table using a 1->N parent-child relationship?

Page 19:

Data Profilers

• http://www.talend.com/products-data-quality/talend-open-profiler.php
• http://en.wikipedia.org/wiki/DataCleaner
• http://weblogs.sqlteam.com/derekc/archive/2008/05/20/60603.aspx
• http://www.dba-oracle.com/oracle_news/2005_12_29_Profiling_and_Cleansing_Data_using_OWB_Part1.htm
• Request form at www.scribekey.com for profiler (shareware)

Learn about and test profilers with your own data.

Page 20:

Schema Generation Options

• Use a data-driven approach. Define the new schema as more formally defined meaning, structure, and contents of the source data.
• Use an external, independent target schema. Sometimes this is a requirement.
• In a divide-and-conquer approach, use the data-driven schema first as a staging schema. Improve the data in and of itself. Then consider ETL to a more formal, possibly standard, external target schema.
• Use a hybrid, combining elements of both data-driven and external target schemas.

Page 21:

Data Model Differences: Production vs. Decision Support

The data models and supporting tools used in data warehousing are significantly different from those found across the geospatial community. Geospatial data modelers tend to incorrectly use production models for decision support databases.

Production: normalized for referential integrity; complex and slower-performing queries; data is edited.
Decision support: de-normalized for easily formed and faster-performing queries; data is read-only.

Page 22:

Normalization

• Normalization can get complicated: 1st, 2nd, 3rd Normal Forms, Boyce-Codd, etc.
• Some important basics:
  - Don't put multiple values in a single field
  - Don't grow a table column-wise by adding values over time
  - Have a primary key
• However, you should de-normalize when designing read-only decision support databases to facilitate easy query formation and better performance.

Page 23:

De-Normalization and Heavy Indexing Makes Queries Easier and Faster

• 1 de-normalized table:
  SELECT TYPE, LOCATION FROM FACILITIES
• 3 normalized tables:
  SELECT FACILITY_TYPES.TYPE, LOCATIONS.LOCATION FROM (FACILITIES INNER JOIN FACILITY_TYPES ON FACILITIES.TYPE = FACILITY_TYPES.ID) INNER JOIN LOCATIONS ON FACILITIES.LOCATIONID = LOCATIONS.ID;
• NAVTEQ SDC data is a good example: de-normalized (e.g., County Name and FIPS), highly indexed, very fast and easy to use.

[Diagram: a single FACILITIES table vs. FACILITIES + FACILITY_TYPES + LOCATIONS.]

Page 24:

Distinguish Between Decision Support and Production Database Models! Use Both When Necessary

Production OLTP database solutions typically use a middle tier for representing higher-level business objects and rules. This middle tier is often designed using UML and implemented with an object-oriented programming language. (Presentation Layer -> Business Logic Middle Tier -> Data Access Layer -> OLTP Database)

Decision Support OLAP database solutions typically have no middle tier. They present and access data directly through query language behind pivot tables and report generators. (Presentation Layer -> Data Access Layer -> OLAP Database)

Page 25:

Standardization, Modeling, and Mapping

Use real data to inform and keep your modeling efforts focused.

ABSTRACT SCHEMA: strong types, highly normalized, lots of domains, perfect.
REAL RAW DATA: inconsistent types, not normalized, no domains, imperfect.

[Diagram: the Data/Schema Gap between the abstract schema and the real raw data is where the solution lives.]

Close the gap between the source data and the target schema.

Page 26:

Database Refactoring Approach

• Patterns approach: the Gang of Four book, great for new systems.
• Innovation: Martin Fowler, Code Refactoring - fix what you have.
• Agile Database: Scott Ambler, Refactoring Databases.
• You are not starting from scratch; you need to make modifications to something which is being used.
• See the book's list of refactorings.

Page 27:

Sidebar: Relationship Discovery and Schema Matching: Entities, Attributes, and Domain Values

Schema matching is necessary to discover and specify how categories, entities, attributes, and domains from one system map into another. Matching discovers relationships; Mapping specifies transforms.

Match attributes: 1. Source Category, 2. Source Entity, 3. Target Category, 4. Target Entity, 5. Match Score, 6. Match Type, 7. SQL Where, 8. Notes

These maps are stored in XML documents (FME, Pervasive, etc.). As with metadata, it can be useful to store these in an RDB as well.

Standardized schema matching relationships:
• Equal
• Overlap
• Superset
• Subset
• Null

• Basic set theory operations
• Can be used in end-user query tools
• Can be used by other schema matching efforts and combined for thesaurus compilation.

Page 28:

Schema Matching: Map & Gap

• A Gap Analysis exercise can precede a full mapping to get a general idea of how two datasets relate.
• Gap Analysis is always in a direction, from one set of data elements to another.
• Simple scores can be added to give an overall metric of how things relate.

Relationship types:
Equal - A is equal to B (e.g., First Name) - Synonym
Subtype - A is a more specific subtype of B (Supervisor - Employee) - Hyponym
Supertype - A is a more general supertype of B (Employee - Supervisor) - Hypernym
Part - A is a part of B (Employee - Department) - Meronym
Container - A is a container of B (Department - Employee) - Holonym
Related - A is related to B (Department - Operation)

Page 29:

Sidebar: Schema Matching Entities Is Also Important

[Diagram examples: multiple hierarchical feature sets, node and edge networks, multiple geometric representations, multiple occurrences, polygon-centroid relationships, multiple locations.]

Well-documented schema matching information, as metadata, helps reduce and/or eliminate confusion for integration developers and end users.

Page 30:

Mechanics of ETL: The Heart of the Matter

• Change Name
• Change Definition
• Add Definition
• Change Type
• Change Length
• Trim
• Use List Domain
• Use Range Domain
• Split
• Merge
• Reformat
• Use a Default
• Create View
• Change Case
• Add Primary Key
• Add Foreign Key
• Add Constraints, Use RDB
• Split Table to N->1
• Pivot
• Merge Tables
• Remove Duplicates
• Remove 1-to-1 Tables
• Fill in Missing Values
• Remove Empty Fields
• Remove Redundant Fields
• Verify with 2nd Source

Page 31:

Use a Staging Data Store, Separate MR

[Diagram: Source -> Staging -> Target, with a separate Metadata Repository (MR).]

• Keep a definitive snapshot copy of the source; don't change it.
• Execute ETL in a staging data store. Expect that multiple iterations and temporarily relaxed data types will be necessary.
• Build the final target dataset from the staging data store.
• Don't mix actual data with metadata repository information; keep separate databases.

Page 32:

Choosing the Right Tool(s)

• SQL
• FME
• ESRI Model Builder
• Pervasive (formerly Data Junction)
• Microsoft SQL Server Integration Services
• Oracle Warehouse Builder
• Talend
• C#/VB.NET/OLE-DB
• Java/JDBC
• Scripts: VB, JS, Python, Perl
• Business Objects, Informatica

Make the best use of the skills you have on your team. There are DB-oriented vs. code-oriented situations and teams. Use a combination of methods.

Page 33:

Sidebar: Use SQL Views to Simplify Read-Only Data

• SQL Views provide a flexible and easy mechanism for de-normalizing data coming from production databases.
• Views are typically what the end user in a database application sees. They hide the multi-table complexity lying underneath in the physical model.
• Think of the analysis database as a user that doesn't need or want to see this underlying complexity.
• Good approach for generating GIS parcel datasets from CAMA property records.
• Views can be instantiated as hard tables and updated on some regular basis with batch SQL.

[Diagram: MULTIPLE LINKED TABLES -> SQL -> VIEW]

Page 34:

ETL: SQL Data Manipulation

• left(Address, instr(Address, ' ')) as AddressNum
• left(POCName, instr(POCName, ' ')) as FirstName
• right(POCName, len(POCName)-instr(POCName, ' ')) as LastName
• int(numEmployees) as NumPersonnel
• right(Address, len(Address)-instr(Address, ' ')) as StreetName
• '(' & [Area Code] & ')' & '-' & Tel as Telephone
• iif(instr(Tel, 'x')>0, right(Tel, len(Tel)-instr(Tel, 'x')), null) as TelExtension
• ucase(Town) as City
• iif(len(Zip)=5, null, right(Zip,4)) AS Zip4
• iif(len(Zip)=5, Zip, left(Zip,5)) AS Zip5
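The same string manipulations can be done in a scripting language when SQL is not available; a minimal Python sketch of a few of the expressions above (function names are illustrative, not from the slides):

```python
def split_address(address):
    """Split '100 MAIN STREET' into the leading number and the street name,
    mirroring the left/right + instr expressions above."""
    num, _, street = address.partition(" ")
    return num, street

def split_zip(zip_code):
    """Return (Zip5, Zip4); Zip4 is None when only 5 digits are present."""
    if len(zip_code) == 5:
        return zip_code, None
    return zip_code[:5], zip_code[-4:]

def format_phone(area_code, tel):
    """Rebuild the '(' & [Area Code] & ')' & '-' & Tel expression."""
    return "(" + area_code + ")-" + tel
```

Whether these live in SQL or in script code, the point is the same: each target column is a small, explicit transform of one or more source columns.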

Page 35:

ETL: Look-Up Tables

• Clean and consistent domains are one of the most important things you can use to help improve data quality.
• As an example, consider use of a single central master street list for a state, town, utility, etc.
• One approach is to collect all of the variations in a single look-up table and match them with the appropriate value from the master table.

Original -> Master
Main St. -> MAIN ST
Elm St. -> ELM ST
ELM STREET -> ELM ST
Main Street -> MAIN ST
NORTH STREET -> NORTH ST
North -> NORTH ST
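The variation-to-master look-up table above maps naturally onto a dictionary; a minimal sketch (the lookup contents mirror the slide's example, the flagging convention is an assumption):

```python
# Variation -> master match list, as collected in a look-up table.
street_lookup = {
    "Main St.": "MAIN ST",
    "Main Street": "MAIN ST",
    "Elm St.": "ELM ST",
    "ELM STREET": "ELM ST",
    "NORTH STREET": "NORTH ST",
    "North": "NORTH ST",
}

def standardize(street, lookup):
    """Return the master value; unmatched originals are flagged for review
    so they can be added to the look-up table later."""
    return lookup.get(street, "UNMATCHED: " + street)
```

Each flagged value is a candidate new row in the look-up table, so the table grows as new variations are encountered.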

Page 36:

ETL: Domain and List Cleaning Tools

• There are powerful tools available to help match variation and master values.
• Example: SQL Server Integration Services Fuzzy Look Up and Fuzzy Grouping tools: http://msdn.microsoft.com/en-us/library/ms137786.aspx
• These can be used to create easily reusable batch processing tools.
• These list cleansing tools are often unfamiliar to geospatial data teams.
• The saved match lists are also valuable for processing new data sets.
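The idea behind fuzzy look-up tools can be sketched with the standard library's difflib; this is a simple approximation of what products like the SSIS Fuzzy Lookup do, not their actual algorithm (the cutoff value and master list here are assumptions):

```python
import difflib

def fuzzy_lookup(value, master_values, cutoff=0.6):
    """Return the closest master value and a similarity score in [0, 1],
    or (None, 0.0) when nothing clears the cutoff."""
    matches = difflib.get_close_matches(value.upper(), master_values, n=1, cutoff=cutoff)
    if not matches:
        return None, 0.0
    best = matches[0]
    score = difflib.SequenceMatcher(None, value.upper(), best).ratio()
    return best, score

master = ["MAIN ST", "ELM ST", "NORTH ST"]
```

Low-scoring matches can be routed to a person for review; accepted matches are saved to the look-up table so the same variation never needs fuzzy matching twice.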

Page 37:

ETL: Regular Expressions

• Regular Expressions provide a powerful means for validation and ETL, through sophisticated matching and replacement routines.
• Example: the original Notes field has line-feed control characters. We want to replace them with spaces.
• Solution: Match "[\x00-\x1f]", replace with " ".
• Needs code in C#, VB, Python, etc., e.g.:

newVal = Regex.Replace(origVal, regExpMatch, regExpReplace);
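The same match-and-replace in Python, using the pattern from the slide:

```python
import re

def remove_control_chars(value):
    """Replace ASCII control characters (\\x00-\\x1f) with spaces,
    the same match/replace shown on the slide."""
    return re.sub(r"[\x00-\x1f]", " ", value)
```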

Page 38:

ETL: Regular Expressions (cont.)

Start and grow a list of regular expression match and replace values. Keep these in the Metadata Repository. Needs code in C#, VB, Python, etc.

RegExp -> Name
^[1-9]{3}-\d{4}$ -> ShortPhoneNumber
^[0-9]+$ -> PosInteger
^[1-9]{3}-[1-9]{3}-\d{4}$ -> LongPhoneNumber
^[-+]?([0-9]+(\.[0-9]+)?|\.[0-9]+)$ -> Double
(^[0-9]+$)|(^[a-z]+$) -> AllLettersOrAllNumbers
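A named pattern list like the one above can drive validation directly; a minimal sketch using the slide's own expressions (the function name is illustrative):

```python
import re

# Patterns from the slide's match list, keyed by name.
PATTERNS = {
    "ShortPhoneNumber": r"^[1-9]{3}-\d{4}$",
    "PosInteger": r"^[0-9]+$",
    "LongPhoneNumber": r"^[1-9]{3}-[1-9]{3}-\d{4}$",
    "Double": r"^[-+]?([0-9]+(\.[0-9]+)?|\.[0-9]+)$",
}

def validates(value, pattern_name):
    """True when the value matches the named expression."""
    return re.match(PATTERNS[pattern_name], value) is not None
```

Keeping the dictionary in the Metadata Repository rather than in code means new patterns can be added without redeploying the validator.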

Page 39:

ETL: Custom Code is Sometimes Required

• There is no simple SQL solution.
• The problem is more complicated than a simple name, type, or conditional value change.
• You use your preferred programming language and write custom ETL.

Page 40:

ETL: Extending SQL

• Most SQL dialects (Oracle, SQL Server, Access, etc.) allow you to develop custom functions in a language like C#, Java, etc.
• For example, you can build a RegExpMatchAndReplace function and then use it in SQL.
• You can also add table look-up, scripting, etc.
• Very powerful and flexible approach.
• Example: UPDATE FACILITIES SET NOTES = REGEXP("[\x00-\x1f]", " "), FACILITY_TYPE = LOOKUP(THISVAL)
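SQLite's Python binding shows the same technique concretely: `sqlite3.Connection.create_function` registers a Python function as a SQL scalar function. A minimal sketch (the function and table names are illustrative, not from the slides):

```python
import re
import sqlite3

def regexp_replace(value, pattern, repl):
    """Custom scalar function exposed to SQL."""
    return None if value is None else re.sub(pattern, repl, value)

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP_REPLACE", 3, regexp_replace)

conn.execute("CREATE TABLE FACILITIES (ID INTEGER, NOTES TEXT)")
conn.execute("INSERT INTO FACILITIES VALUES (1, 'line one\nline two')")

# The pattern is passed as raw text; re interprets the \x escapes.
conn.execute(
    "UPDATE FACILITIES SET NOTES = REGEXP_REPLACE(NOTES, ?, ' ')",
    (r"[\x00-\x1f]",),
)
cleaned = conn.execute("SELECT NOTES FROM FACILITIES WHERE ID = 1").fetchone()[0]
```

Oracle, SQL Server, and others offer the same extension point through PL/SQL, CLR functions, etc.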

Page 41:

ETL: Use Geo-Processing to Fill In Attributes

• Example: we want to fill in missing Zip4 values on Facility points from a polygon set.
• This is particularly valuable for creating hierarchical roll-up/drill-down aggregation datasets.
• Use the ArcToolbox Spatial Join.

Zip5 / Zip4
01234 / 1234
01234 / 5678
01234 / 6789
01235 / 3456
01235 / 3456
01235 / 0123
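The core of a point-in-polygon spatial join can be sketched without any GIS library, using the standard ray-casting test. This is a bare-bones illustration of what a tool like ArcToolbox Spatial Join does internally, ignoring projections, edge cases on boundaries, and spatial indexing; the record layout is an assumption:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a ray extending in the +x direction.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def fill_zip4(facilities, zip4_polygons):
    """Assign Zip4 from the first containing polygon when the attribute is missing."""
    for fac in facilities:
        if fac.get("Zip4") is None:
            for zip4, poly in zip4_polygons:
                if point_in_polygon(fac["x"], fac["y"], poly):
                    fac["Zip4"] = zip4
                    break
    return facilities

# Hypothetical data: one square Zip4 polygon and two facility points.
polygons = [("1234", [(0, 0), (10, 0), (10, 10), (0, 10)])]
facilities = fill_zip4(
    [{"x": 5, "y": 5, "Zip4": None}, {"x": 20, "y": 5, "Zip4": None}],
    polygons,
)
```

In practice you would use the GIS tool; the sketch just shows that the join is an attribute transfer driven by a containment test.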

Page 42:

ETL: Breaking Out N to 1

• This problem occurs very frequently when cleaning up datasets.
• We have repeating columns to capture annual facility inspections.
• This data should be pivoted and moved to another child table.
• We can use SQL and the UNION capability to get what we want.
• We can also reformat for consistency at the same time.

Page 43:

ETL: Use of UNION to Pivot and Break Out

SELECT Id as FacilityId, 2000 as Year,
  iif(ucase(Insp2000) = 'Y' or ucase(Insp2000) = 'T', 'Yes', Insp2000) as Inspected from AcmeFac
UNION
SELECT Id as FacilityId, 2005 as Year,
  iif(ucase(Inspect05) = 'Y' or ucase(Inspect05) = 'T', 'Yes', Inspect05) as Inspected from AcmeFac
UNION
SELECT Id as FacilityId, 2010 as Year,
  iif(ucase(Insp_2010) = 'Y' or ucase(Insp_2010) = 'T', 'Yes', Insp_2010) as Inspected from AcmeFac
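The same pivot can be run end-to-end against an in-memory SQLite database; `upper()` and `CASE` stand in for Access's `ucase()` and `iif()`, and the sample rows are assumptions for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AcmeFac (Id INTEGER, Insp2000 TEXT, Inspect05 TEXT, Insp_2010 TEXT)"
)
conn.executemany(
    "INSERT INTO AcmeFac VALUES (?, ?, ?, ?)",
    [(1, "Y", "No", "t"), (2, "No", "T", "Yes")],
)

# Each repeating column becomes one SELECT in the UNION,
# normalizing Y/T variants to 'Yes' along the way.
sql = """
SELECT Id AS FacilityId, 2000 AS Year,
       CASE WHEN upper(Insp2000) IN ('Y', 'T') THEN 'Yes' ELSE Insp2000 END AS Inspected
FROM AcmeFac
UNION
SELECT Id, 2005,
       CASE WHEN upper(Inspect05) IN ('Y', 'T') THEN 'Yes' ELSE Inspect05 END
FROM AcmeFac
UNION
SELECT Id, 2010,
       CASE WHEN upper(Insp_2010) IN ('Y', 'T') THEN 'Yes' ELSE Insp_2010 END
FROM AcmeFac
"""
rows = conn.execute(sql + " ORDER BY FacilityId, Year").fetchall()
```

The result set is the child table: one (FacilityId, Year, Inspected) row per original repeating column.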

Page 44:

ETL: One-Off vs. Repeatable Transforms

• Relying on manual tweaks and adjustments to fill in correct data values is problematic if you need to set up repeatable ETL processes.
• It's much harder and more complicated to set up repeatable, ordered ETL routines, but well worth it if the data is going to be updated on an ongoing basis.
• Example: a dataset is cleaned with lots of SQL, scripts, manual tweaks, etc. When a newer dataset is made available, the same tasks need to be repeated, but the details and the order in which they were performed were not saved.
• Suggestion: be very aware of whether you are doing ETL as a one-off vs. something that will have to be repeated, and plan accordingly; save your work.

Page 45:

Bite the Bullet: Manual ETL

• Sometimes the data problems are so tricky that you decide to do a manual clean-up.
• You could probably come up with code containing a very large number of conditional tests and solutions (a brain-teaser), but it would take longer than just cleaning the data by hand.
• Depends on whether you are doing a one-off or need to build something for repeatable import.
• This also applies to positioning geospatial features against a base map or ortho-image, e.g., after geocoding, for populating x,y attributes.

Example "Site Plan Location" values: rm 203 drawer A112 B / Room 100 dr. 1 / Bld A Rm 209 Drawer 1 / 1200 20 rm 500 d59 d 33 / Jones Rm G12 Bld 9 / Room 100 Drawer 108 30 6

Page 46:

Checking Against External Sources

• One of the only ways to actually ensure that a list of feature values is complete is by checking against an external source.
• In this case, the data in and of itself does not necessarily provide a definitive version of the truth.
• You cannot tell what may be missing or what may be incorrectly included in the list.
• Get redundant external information whenever it's available.
• In some cases the only way to fill in a missing value is to contact the original source of the data.
• This can be highly granular and time consuming.
• Make a decision on how important it is to have 100% complete data.
• This can be a case of diminishing returns.

Page 47:

Storing ETL Details in the Metadata Repository

• The combination of the MR and code-based tools provides a very flexible and powerful environment for improving data quality.
• Many actions and parameters can be stored in the MR, including LookUp, RegExp, SQL Select clauses, and even runtime evaluation routines for code and scripts.
• Example: a translator makes use of a configurable ETL Map and Action table found in the Metadata Repository.

TargetEntity / TargetAttribute / ETLType / Params / SourceAttribute
Facility / AreaCode / SQLSelect / left(PhoneNumber,3) /
Facility / StreetName / LookUp / StreetList / Address
Facility / Notes / RegExp / RemoveControlChars /
Facility / SizeSqFeet / Script / CalcFacilityArea /
Facility / Zip9 / SQLSelect / Zip5 & '-' & Zip4 /
Facility / FacilityType / LookUp / FacilityType / FacilityCategory
Facility / FacilityName / RegExp / RemoveMultiSpace /
Facility / SizeAcres / Script / CalcFacilityArea /
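A translator driven by such a table is essentially a dispatch loop over configured actions. A hypothetical minimal runtime, with the actions written as callables for brevity (a real MR would store the parameters as data and interpret them):

```python
import re

STREET_LIST = {"Main St.": "MAIN ST"}

# Each row: target attribute, ETL type, and the action it configures.
ETL_MAP = [
    ("AreaCode",   "SQLSelect", lambda rec: rec["PhoneNumber"][:3]),
    ("StreetName", "LookUp",    lambda rec: STREET_LIST.get(rec["Address"], rec["Address"])),
    ("Notes",      "RegExp",    lambda rec: re.sub(r"[\x00-\x1f]", " ", rec["Notes"])),
]

def run_etl(record, etl_map):
    """Apply each configured action, writing results to the target record."""
    target = {}
    for attr, _etl_type, action in etl_map:
        target[attr] = action(record)
    return target

source = {"PhoneNumber": "6175551212", "Address": "Main St.", "Notes": "a\nb"}
result = run_etl(source, ETL_MAP)
```

Because the map is data, adding a new target attribute means adding a row, not editing the translator.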

Page 48:

ETL: Staging Table Design and SQL Command Types

• Separate Source and Target tables: requires joins (INSERT, UPDATE, SELECT).
• Merge Source and Target into one staging table: harder to create, and column names must be unique, but no joins are needed (UPDATE, then SELECT out).
• Decide what kind of complexity is preferred.
• Can also split attributes and geometry (a size factor) and use keys: initial insert, then updates, then recombine.
• Build a single large SQL statement for creating a view or hard table from the results.

Page 49: Finding and Fixing Data Quality Problems NEARC Fall, 2010 Newport, RI Brian Hebert, Solutions Architect .

ETL: Loop Design: Row, Column, Value Processing Order

• Row-wise – UPDATE, INSERT, SELECT INTO; a SQL error fails the entire row, and no values are changed.

• Column-wise – a single UPDATE statement against all columns in the table; fast performance, making use of the database engine.

• Cell-wise – an UPDATE handles each Row.Column value individually; high granularity and control, slower performance.

49www.scribekey.com

[Diagram: the same 3x3 sample table (values A/B/C, 100/101/102, X/Y/Z) processed row-wise, column-wise, and cell-wise]
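The trade-off can be sketched with SQLite (table and values are illustrative): column-wise is one fast UPDATE, while cell-wise visits each value individually, so a single bad value can be caught and logged instead of silently mishandled or aborting the whole statement.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, qty TEXT, qty_num INTEGER)")
con.executemany("INSERT INTO t (qty) VALUES (?)", [("10",), ("oops",), ("30",)])

# Column-wise: one statement; note SQLite's CAST silently turns 'oops' into 0.
con.execute("UPDATE t SET qty_num = CAST(qty AS INTEGER)")

# Cell-wise: per-value control; failed conversions are recorded, not hidden.
bad = []
for row_id, qty in con.execute("SELECT id, qty FROM t").fetchall():
    try:
        con.execute("UPDATE t SET qty_num = ? WHERE id = ?", (int(qty), row_id))
    except ValueError:
        bad.append((row_id, qty))
print(bad)  # [(2, 'oops')]
```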

Page 50

Finding Errors and Validation

• Run a post-ETL profile to see the before and after data.

• A contents checker, driven by attribute names, looks at actual values and checks them against list domains, range domains, and regular expressions.

• Output describing data problems is written to an audit table and used to help find and fix errors. Audit info includes table, lookup key, column name, rule, and value.

Question: How do you validate data after it has been transformed?

Table       Id     Column        Rule             Value
Facilities  101    Name          Not Null
Facilities  25543  SubType       Domain: MyList   Shipin
Facilities  563    NumPersonnel  Range: 1 - 1000  -500
Facilities  223    Inspected     RegExp: Date     0/930/1955
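A minimal contents-checker sketch producing audit rows like those above (the rule set and domain values are illustrative assumptions): each rule inspects one column's value, and every failed check becomes a (table, id, column, rule, value) audit record.

```python
import re

# Hypothetical rules: (rule label, predicate that returns True for valid values).
RULES = {
    "Name": ("Not Null", lambda v: v not in (None, "")),
    "SubType": ("Domain: MyList", lambda v: v in {"Ship", "Dock", "Pier"}),
    "NumPersonnel": ("Range: 1 - 1000", lambda v: 1 <= v <= 1000),
    "Inspected": ("RegExp: Date",
                  lambda v: re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", v) is not None),
}

def check_row(table, row_id, row):
    """Return audit rows for every rule the given row violates."""
    audit = []
    for column, (rule, is_valid) in RULES.items():
        value = row.get(column)
        if not is_valid(value):
            audit.append((table, row_id, column, rule, value))
    return audit

row = {"Name": "", "SubType": "Shipin",
       "NumPersonnel": -500, "Inspected": "0/930/1955"}
for problem in check_row("Facilities", 101, row):
    print(problem)
```

Writing these tuples to an audit table gives analysts a queryable to-do list of fixes.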


Page 51

Overall Integration Operations Tracking

• It is very helpful to record and document all of the different operations being performed against the data.

• In general, record who did what, when, how, and why, plus any other detailed information.

• Both manual and automatic processing need tracking: tools can automatically record their activities in a tracking table; people's activities are more difficult to track.

Action      Who  Date       Notes
Collection  JS   3/24/2008
Profiling   BH   2/30/2007  Did not complete, need new data
Profiling   RG   7/7/1008   Completed
Mapping     PB   6/30/2008  Dataset incomplete

Question: How do you handle tracking overall operations on data?


Page 52

Repeat Data Updates and/or New Data

• A good test of how well the data integration operations have succeeded is to process a new, more up-to-date dataset destined to replace existing data in the system.

• This is where keeping track of what was done to the data, and in what order, is so important.

• Validate structure and contents. Changes in structure or content require re-tooling the mapping information.

• Next, consider having to include a new and different data source, e.g., merging organizations combining facility data.

Question: How do you handle refreshing data from a new set of update data?


Page 53

Agile vs. Waterfall Workflows

ITERATIVE/EVOLUTIONARY: Perform each step with a subset of the material and functionality. Loop back and continue with lessons learned and more information. Divide and conquer. Like steering a car: you can't fix your position at the onset; you need to adjust as unforeseen conditions are encountered. An analogy to sculpting. Do something useful in a reasonable amount of time, for faster benefits to end users.

WATERFALL: Perform each step completely before moving to the next.

[Diagram: numbered steps 1–12, shown as iterative loops vs. a single linear sequence]

Page 54

The Tower of Babel

[Diagram: GIS Users (layers, attributes, symbols), Data Modelers (UML, XSD, GML), and Standards Bodies (ISO 19XXX, …) each speaking a different language]

Use Clear Communication Artifacts

Use table-centric documents and models, e.g., RDB, Excel, HTML, to communicate with end users and stakeholders, in addition to UML, XSD, etc.

Not an either/or – we need both.


Page 55

Data Dictionary and Metadata

• When you're done cleaning up the data, make sure you fully describe the meaning, structure, and contents metadata in a data dictionary.

• Must-haves: Who, What, When, Where, Why, Accuracy, and Attribute Definitions!

• Present the data dictionary in an easy-to-use tabular format, e.g., Excel, HTML.

• Ideally the metadata should live in the database alongside the data it describes.

• Separating metadata from data, and using fundamentally different physical formats and structures, leads to serious synchronization problems.

• Data providers are encouraged to produce FGDC/ISO geospatial metadata as a report from the repository.


Page 56

Example Census Data Dictionary HTML Browser


Page 57

Sidebar: FGDC/ISO XML Metadata in an RDB

NUM  ELEMENT
1    Originator
2    Publication_Date
3    Title
4    Abstract
5    Purpose
6    Calendar_Date
7    Currentness_Reference
8    Progress
9    Maintenance_and_Update_Frequency
10   West_Bounding_Coordinate
11   East_Bounding_Coordinate
12   North_Bounding_Coordinate
13   South_Bounding_Coordinate
14   Theme_Keyword_Thesaurus
15   Theme_Keyword
16   Access_Constraints
17   Metadata_Date
18   Contact_Person
19   Address_Type
20   Address
21   City
22   State_or_Province
23   Postal_Code
24   Contact_Voice_Telephone
25   Metadata_Standard_Name
26   Metadata_Standard_Version

[Diagram: XML metadata is imported into the RDB and exported back out as XML]

When this metadata is imported into an RDB, the full flexibility of SQL becomes available for managing and querying a large collection of metadata as a set.
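A sketch of the import step (the table schema and sample document are illustrative, not the actual tool): walk the XML tree, flattening each element into a row whose dotted lineage path preserves the hierarchy, so plain SQL can query the collection.

```python
import sqlite3
import xml.etree.ElementTree as ET

xml_doc = """
<metadata>
  <idinfo>
    <citation><origin>NOAA</origin><title>NauticalNAVAIDS</title></citation>
  </idinfo>
</metadata>"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE meta (name TEXT, node_value TEXT, lineage_name TEXT)")

def walk(elem, lineage):
    """Insert one row per XML element, recording its dotted lineage path."""
    path = lineage + "." + elem.tag if lineage else elem.tag
    value = (elem.text or "").strip()
    con.execute("INSERT INTO meta VALUES (?, ?, ?)", (elem.tag, value, path))
    for child in elem:
        walk(child, path)

walk(ET.fromstring(xml_doc), "")

# The hierarchy is now queryable as a set with ordinary SQL.
rows = con.execute(
    "SELECT node_value FROM meta WHERE lineage_name LIKE '%citation.origin'"
).fetchall()
print(rows)  # [('NOAA',)]
```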


Page 58

Sidebar: XML Metadata After Import into RDB – Hierarchy Preserved

[Table: sample imported rows with columns Name, NodeValue, ParentId, ParentName, LineageId, and LineageName – e.g., origin = NOAA, title = NauticalNAVAIDS, purpose = Homeland Security, westbc = -178.047715 – where lineage paths such as metadata.idinfo.citation.citeinfo.title preserve the XML hierarchy]


Page 59

Meta-Map for Data QA/QC


Map the metadata itself to summarize and highlight datasets by their validation results.

Page 60

Applications: Business/Geo-Intelligence Pivot Tables/Maps

Business Intelligence data exploration and viewing solutions make heavy use of pivot tables, drill-down, and drill-through. With a data warehousing approach, geospatial intelligence solutions can use a similar approach, with maps.

[Diagram: drill-down hierarchy from STATE to COUNTY to TOWN to CENSUS TRACT, each level carrying its own attribute table]


View and analyze both data and metadata: quality, completeness, etc.
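The drill-down mechanics can be sketched with SQLite (the fact table and values are hypothetical): the same facts aggregated at a coarser or finer geography level, exactly as a BI pivot table does when you expand a row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE facts (state TEXT, county TEXT, town TEXT, pop INTEGER)")
con.executemany("INSERT INTO facts VALUES (?,?,?,?)", [
    ("RI", "Newport", "Newport", 24000),
    ("RI", "Newport", "Middletown", 16000),
    ("RI", "Providence", "Providence", 178000),
])

# Drilling down is just adding a column to the GROUP BY.
by_state = con.execute(
    "SELECT state, SUM(pop) FROM facts GROUP BY state").fetchall()
by_county = con.execute(
    "SELECT state, county, SUM(pop) FROM facts GROUP BY state, county").fetchall()
print(by_state)   # [('RI', 218000)]
print(sorted(by_county))  # [('RI', 'Newport', 40000), ('RI', 'Providence', 178000)]
```

With geometry columns added, each aggregation level becomes a map layer rather than just a pivot row.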

Page 61

Use a Data Integration Repository Database

The Data Integration Repository, implemented as an RDBMS, can be populated by both manual and automated methods and then used to generate metadata outputs, data dictionary content, schemas, maps, etc.

[Diagram: source data layers flow through ETL and Validation into the Data Integration Repository (Areas, Entities, Attributes, Domains), which in turn produces Metadata, Documents, Assessments, Derivative Datasets, Meta-Maps, Pivot Tables, Schemas, the Data Dictionary, and Enhanced User Views]

Page 62

Data Quality Knowledge, Tools and Techniques

There is a wide variety of highly developed, metadata-repository-centric data quality knowledge, refactoring tools, and techniques available in mainstream IT data warehousing to draw on when improving geospatial datasets.


Page 63

Sidebar: Physical Formats and Simplicity

• If data is in table format, CSV can be much easier to work with than XML, and can be 1/10th the size.

• Look at newer, smaller, less-verbose data exchange formats, e.g., JSON: http://www.json.org/

• XML is best suited for variable-length data structures and nesting, e.g., object models.

• RESTful Web Services vs. SOAP.

• Keep it simple.
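The size claim is easy to see for yourself: the same small table serialized as CSV, JSON, and hand-built XML (the data is made up; exact ratios vary with real datasets).

```python
import csv
import io
import json

rows = [{"id": i, "name": f"Facility{i}", "town": "Newport"} for i in range(100)]

# CSV: header once, then one compact line per row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "town"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON: repeats the key names in every row.
json_text = json.dumps(rows)

# XML: repeats full open/close tags around every value.
xml_text = "<rows>" + "".join(
    f"<row><id>{r['id']}</id><name>{r['name']}</name><town>{r['town']}</town></row>"
    for r in rows) + "</rows>"

print(len(csv_text), len(json_text), len(xml_text))  # CSV smallest, XML largest
```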


Page 64

Recommendation: Use a Broader Array of Mainstream IT Tools and Techniques to Solve Data Quality Problems (Look Outside the Geospatial World)

• Decision Support, Data Warehousing
• RDBMS Metadata Repositories
• Data Profiling, Refactoring
• Business Intelligence, ETL, OLAP Cubes
• Structured vs. Unstructured Data Access, Semantic Web
• Flexibility through standard RDBMS logical/physical separation and the use of Views
• Agile Solution Development
• Data Quality Paradigm
• Lean Manufacturing


Page 65

Recap: Take-Aways

• Data quality is determined by end users' understanding of meaning, structure, and contents.
• Look at data quality and data integration tools and techniques from mainstream IT, data warehousing, and business intelligence.
• CLEARLY DISTINGUISH BETWEEN DECISION-SUPPORT AND PRODUCTION DATABASE MODELS – DON'T USE A HIGHLY NORMALIZED SCHEMA FOR DECISION-SUPPORT DATABASES! MAKE THE DB EASY AND FAST TO QUERY.
• Use database profiling and refactoring approaches.
• Use a relational database (a metadata repository, MR) to capture, centralize, and manage data quality and integration information.
• Distinguish one-offs from ongoing updates, and build repeatable ETL processing when necessary.
• SQL, coding skills, and regular expressions for data manipulation are all important.
• Choose tools that leverage the skills you have on hand, preferred languages/scripts, etc.
• Communicate clearly with stakeholders, end users, and your team, using table-centric artifacts.
• Don't ignore data meaning as an important element of data quality: build data dictionaries (metadata), use clear definitions, and don't make up words.
• Save ETL mappings, work, notes, scripts, etc., to help grow and reuse skills (MR).
• Use an iterative, Agile approach to help ensure you reach goals in a timely manner.


Page 66

Related Papers and Tools

Database Template Kit – Municipal Permit Tracking System
http://www.mass.gov (search for MPTS) – lots of SQL data cleansing info

NEARC Fall 09 Presentation: How Data Profiling Can Help Clean and Standardize GIS Datasets

NEARC Spring 10 Presentation: Using Meta-Layers for GIS Dataset Tracking and Management
http://gis.amherstma.gov/data/SpringNearc2010/C_Session3_0115-0245/GovPlanning/KeepingTrackOfAllThoseLayers.pdf

Page 67

Thank You

Questions and Answers
