GIS Data Quality

GIS Data Quality: Producing better data quality through robust business processes. Kim Ollivier, BrightStar


Transcript of GIS Data Quality

Page 1: GIS Data Quality

GIS Data Quality

Producing better data quality through robust business processes

Kim Ollivier BrightStar

TRAINING

Page 2: GIS Data Quality

Schedule Day One

Suggested breaks for the following times:

Start: 9:00
Session 1 (90 min)
Morning tea: 10:30 to 10:45
Session 2 (105 min)
Lunch: 12:30 to 1:30
Session 3 (90 min)
Afternoon tea: 3:00 to 3:15
Session 4 (105 min)
Finish: 5:00

Each session will have an exercise or interactive discussion

Page 3: GIS Data Quality

Today

Introduction
What causes poor quality

Lunch

Assessing Quality processes
GIS upgrade project examples

Page 4: GIS Data Quality

Tomorrow

Metadata
Designing rules

Lunch

Data warehouse and ETL
Feature maintenance

Page 5: GIS Data Quality

Overview

Introduce yourself
Your goals for this course?

Build a data quality system
Avoid the worst traps
Be able to describe a project scope
• Budget, timeline, priorities

Page 6: GIS Data Quality

Sections of course based on

With permission from the author

ISBN 978-0-09771400-2

Page 7: GIS Data Quality

What is Data Quality?

“If they are fit for their intended uses in operations, decision making and planning.”

“If they correctly represent the real-world construct to which they refer.”

Page 8: GIS Data Quality

Spatial Accuracy

Page 9: GIS Data Quality
Page 10: GIS Data Quality

Statistical Accuracy

Completeness Score = Relevant / (Relevant + Missing)
Accuracy Score = (Relevant - Errors) / Relevant
Overall Score = (Relevant - Errors) / (Relevant + Missing)
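
As a worked illustration of the three scores, here is a minimal Python sketch. The function name `quality_scores` and the sample counts (900 relevant records, 100 missing, 45 in error) are made up for the example and do not come from the course material.

```python
def quality_scores(relevant: int, missing: int, errors: int) -> dict:
    """Return the three slide scores as fractions between 0 and 1."""
    if relevant <= 0:
        raise ValueError("relevant must be a positive record count")
    expected = relevant + missing  # records that should exist
    return {
        "completeness": relevant / expected,
        "accuracy": (relevant - errors) / relevant,
        "overall": (relevant - errors) / expected,
    }


# Hypothetical counts, for illustration only.
scores = quality_scores(relevant=900, missing=100, errors=45)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these definitions the overall score is the product of the completeness and accuracy scores, so a dataset can score well on one and still score poorly overall.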

Page 11: GIS Data Quality

Completeness

LINZ Bulk Data Extract
metadata\meta.html

Page 12: GIS Data Quality

Data Profiling

Find out what is there
Assess the risks
Understand data challenges early
Have an enterprise view of all data

Page 13: GIS Data Quality

Profile Metrics

Integrity
Consistency
Completeness, Density
Validity
Timeliness
Accessibility
Uniqueness
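
A few of these metrics can be computed with very little code. The sketch below profiles one field of a small in-memory table for completeness, uniqueness and validity. The function `profile_field`, the sample `parcels` records and the list of allowed land-use codes are all hypothetical, and the way each metric is calculated is one reasonable reading, not a definition taken from the course.

```python
from collections import Counter


def profile_field(records, field, allowed=None):
    """Profile one field: share filled, share distinct, share in allowed set."""
    values = [r.get(field) for r in records]
    filled = [v for v in values if v not in (None, "")]
    counts = Counter(filled)
    profile = {
        "count": len(values),
        "completeness": len(filled) / len(values) if values else 0.0,
        "uniqueness": len(counts) / len(filled) if filled else 0.0,
    }
    if allowed is not None:
        valid = sum(1 for v in filled if v in allowed)
        profile["validity"] = valid / len(filled) if filled else 0.0
    return profile


# Hypothetical sample data: one duplicated id, one blank and one bad code.
parcels = [
    {"parcel_id": "A1", "land_use": "RES"},
    {"parcel_id": "A2", "land_use": "COM"},
    {"parcel_id": "A2", "land_use": ""},
    {"parcel_id": "A4", "land_use": "XXX"},
]
id_profile = profile_field(parcels, "parcel_id")
use_profile = profile_field(parcels, "land_use", allowed={"RES", "COM", "IND"})
print(id_profile)
print(use_profile)
```

Timeliness and accessibility are not covered here because they depend on information outside the table itself, such as load dates and access rights.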

Page 14: GIS Data Quality

Security

Confidentiality
Possession
Integrity
Authenticity
Availability
Utility

Page 15: GIS Data Quality

Consistency

Discrepancies between attributes
Exceptions in a cluster
Spatial discrepancies
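
The first and last kinds of discrepancy can be sketched as simple rules over a single record. The pipe record layout below (`installed_year`, `abandoned_year`, `start`, `end`, `length_m`) and the 0.5 m tolerance are assumptions made for the example, not a schema from the course.

```python
import math


def consistency_issues(pipe, tolerance=0.5):
    """Return a list of rule violations found in one pipe record."""
    issues = []
    # Attribute against attribute: dates must be in a sensible order.
    abandoned = pipe["abandoned_year"]
    if abandoned is not None and abandoned < pipe["installed_year"]:
        issues.append("abandoned before installed")
    # Attribute against geometry: recorded length should match the
    # straight-line length computed from the end points.
    (x1, y1), (x2, y2) = pipe["start"], pipe["end"]
    measured = math.hypot(x2 - x1, y2 - y1)
    if abs(measured - pipe["length_m"]) > tolerance:
        issues.append("recorded length differs from geometry")
    return issues


# Hypothetical records in projected metres.
good = {"installed_year": 1985, "abandoned_year": None,
        "start": (0.0, 0.0), "end": (30.0, 40.0), "length_m": 50.0}
bad = {"installed_year": 1985, "abandoned_year": 1979,
       "start": (0.0, 0.0), "end": (30.0, 40.0), "length_m": 65.0}
print(consistency_issues(good))
print(consistency_issues(bad))
```

Exceptions in a cluster need a comparison across neighbouring records and are not shown here.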

Page 16: GIS Data Quality
Page 17: GIS Data Quality
Page 18: GIS Data Quality

A GIS Data Quality System

Assess
Data Quality Assessment
Data Profiling

Improve, Prevent, Recognise
Data Cleaning
Monitoring Data Integration Interfaces
Ensuring Quality of Data Conversion and Consolidation
Building Data Quality Metadata Warehouse

Monitor
Recurrent Data Quality Assessment

Page 19: GIS Data Quality

Course examples

LINZ coordinate upgrade 1998-2003
NSCC services upgrade 2008
Valuation roll structure and matching
ETL of utilities from SDE to AutoCAD
Address location issues NAR, DRA

Documents and examples on memory stick

Page 20: GIS Data Quality

Exercise 1: Nominate your database

Select a representative example dataset for later discussion

You may be responsible for it
Or, you have to integrate it
Or, you have to load it
Or, you supply it to others

Morning Tea

Page 21: GIS Data Quality

Assessing Quality

1. Project steps
2. Required roles
3. Defining the objectives
4. Designing rules
5. Scorecard and Metadata
6. Frequency of assessment
7. Common mistakes

Page 22: GIS Data Quality

Processes Affecting Data Quality

Processes bringing data from outside
• Initial Data Conversion
• System Consolidations
• Manual Data Entry
• Batch Feeds
• Real-Time Interfaces

Processes causing data decay
• Changes not captured
• System Upgrades
• New Data Uses
• Loss of Expertise
• Process Automation

Processes changing data from within
• Data processing
• Data cleaning
• Data purging

Database

Page 23: GIS Data Quality

Outside: Initial Data Conversion

Define data mapping
Extract, Transform, Load (ETL)
Drown in Data Problems
Find Scapegoat

Page 24: GIS Data Quality

Outside: System Consolidation

Often from mergers (Auckland?)
• Unplanned, unreasonable timeframes

Head-on two car wreck
Square pegs into round holes
Winner – loser merging (50% wrong)

Page 25: GIS Data Quality

Outside: Manual Data Entry

High error rate
Complex and poor entry forms
Users find ways around checks
Forcing non blanks does not work
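
One reason forcing non-blanks does not work is that users type a filler value to get past the form. A frequency count of likely fillers will usually expose this. The sketch below is illustrative only: the `PLACEHOLDERS` set and the sample address values are assumptions, and a real list would be built by looking at the data.

```python
from collections import Counter

# Hypothetical filler values; extend after inspecting the actual data.
PLACEHOLDERS = {"", "n/a", "na", "none", "unknown", "x", "xxx", ".", "-", "tbc"}


def placeholder_report(values):
    """Count values that look like fillers typed to satisfy a mandatory field."""
    normalised = [str(v).strip().lower() for v in values]
    hits = Counter(v for v in normalised if v in PLACEHOLDERS)
    return dict(hits)


# Made-up entries from a mandatory address field.
addresses = ["12 High St", "N/A", "x", " n/a ", "Unknown", "3 Kea Rd"]
report = placeholder_report(addresses)
print(report)
```

Here four of the six "filled" values carry no information, although a non-blank check would pass all of them.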

Page 26: GIS Data Quality

Outside: Batch Feeds

Large volumes mean lots of errors
Source system subject to changes
Errors accumulate
Especially dangerous if triggers activated

Page 27: GIS Data Quality

Outside: Real-Time Interfaces

Data between db’s in synchronisation
Data in small packets out of context
Too fast to validate
Rejection loses record, so accepted

Faster or better but not both!

Page 28: GIS Data Quality

Decay: Changes Not Captured

Object changes are unnoticed by computers
Retroactive changes may not be propagated

Page 29: GIS Data Quality

Decay: System Upgrades

The data is assumed to comply with the new requirements
Upgrades are tested against what the data is supposed to be, not what is actually there
Once upgrades are implemented everything goes haywire

Page 30: GIS Data Quality

Decay: New Data Uses

“Fitness to the purpose of use” may not apply
Acceptable error rates may now be an issue
Value granularity, map scale
Data retention policy

Page 31: GIS Data Quality

Decay: Loss of Expertise

Meaning of codes that only “experts” know may change over time
Experts know when data looks wrong
Retirees rehired to work systems
Auckland address points were entered on corners and the rest guessed, later used as exact.

Page 32: GIS Data Quality

Decay: Process Automation

Web 2.0 bots automate form filling
Transactions are generated without ever being checked by people
Customers given automated access are more sensitive to errors in their own data

Page 33: GIS Data Quality

Within: Data Processing

Changes in the programs
Programs may not keep up with changes in data collection
Processing may be done at the wrong time

Page 34: GIS Data Quality

Special GIS Data Issues

Coordinate data not usually readable
Data models CAD v GIS
Fuzzy matching is not Boolean (near)
Atomic objects harder to define
Features have 2, 3, 4, 5 dimensions
Projection systems are not exact
Topology requires special operators
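
The point about fuzzy matching can be illustrated with a tolerance test on two coordinate pairs. This is a minimal sketch that assumes both points are in the same projected coordinate system with units of metres. The function names, the 1 m tolerance and the sample coordinates are invented for the example.

```python
import math


def match_distance(a, b):
    """Planar distance between two (easting, northing) pairs in metres."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def classify_match(a, b, tolerance_m=1.0):
    """Return a label and the distance, not just True or False."""
    d = match_distance(a, b)
    if d == 0:
        return "exact", d
    if d <= tolerance_m:
        return "near", d
    return "no match", d


# Hypothetical surveyed and digitised positions of the same feature.
surveyed = (1756432.10, 5920110.55)
digitised = (1756432.40, 5920110.95)
label, distance = classify_match(surveyed, digitised)
print(label, round(distance, 3))
```

Returning the distance along with the label lets a later step rank candidate matches or tighten the tolerance, which a plain Boolean equality test cannot support.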

Page 35: GIS Data Quality

Within: Data Purging

Highly risky for data quality
Relevant data may be purged
Erroneous data may fit criteria
It may not work the next year

Page 36: GIS Data Quality

Within: Data Cleaning

En masse processes may add errors
Cleaning processes may have bugs
Incomplete information about data

Page 37: GIS Data Quality

Assessing Data Quality

Data profiling
Interview users
Examine data model
Data Gazing

Page 38: GIS Data Quality

Data Gazing

Count the records
Just open the sources and scroll
Sort and look at the ends
Run some simple frequency reports
See if the field names make sense
What is missing that should be there
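
Three of these steps (count, sort and look at the ends, frequency report) can be sketched in a few lines of Python. The function name `gaze` and the sample `install_years` values are hypothetical. The sample deliberately includes a 0 and a 9999, the kind of dummy value that tends to show up at the ends of a sorted column.

```python
from collections import Counter


def gaze(values, ends=3, top=5):
    """Count the values, show both ends of the sorted list and the commonest."""
    present = sorted(v for v in values if v is not None)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "lowest": present[:ends],
        "highest": present[-ends:],
        "most_common": Counter(present).most_common(top),
    }


# Made-up installation years from an asset table.
install_years = [1987, 1992, 0, 1992, None, 2004, 9999, 1992, 1975]
report = gaze(install_years)
for key, value in report.items():
    print(key, value)
```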

Lunch

Page 39: GIS Data Quality

Data Cleaning

There are always lots of errors
It is too much to inspect all by hand
Data experts are rare and too busy
It does not fix process errors
You may make it worse

Page 40: GIS Data Quality

Automated Cleaning

The only practical method
Needs sophisticated pattern analysis
Allow for backtracking
Data quality rules are interdependent
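
One way to allow for backtracking is to log every change a cleaning rule makes so that it can be undone. The sketch below is an assumption about how that could look, not the course's method. `apply_rule`, `rollback` and the sample road names are all hypothetical.

```python
def apply_rule(records, field, rule, log):
    """Apply a cleaning rule to one field, logging each old value it replaces."""
    for index, record in enumerate(records):
        old = record[field]
        new = rule(old)
        if new != old:
            log.append((index, field, old))
            record[field] = new


def rollback(records, log):
    """Undo logged changes in reverse order, restoring the original values."""
    while log:
        index, field, old = log.pop()
        records[index][field] = old


# Made-up road names with spacing and capitalisation problems.
roads = [{"name": "queen  st"}, {"name": "Kea Rd"}, {"name": " high st "}]
original = [dict(r) for r in roads]
log = []
apply_rule(roads, "name", lambda s: " ".join(s.split()), log)  # tidy whitespace
apply_rule(roads, "name", str.title, log)                      # then capitalise
cleaned = [r["name"] for r in roads]
print(cleaned)

rollback(roads, log)
print(roads == original)
```

Because the second rule works on the output of the first, the log has to be undone in reverse order, which is a small instance of cleaning rules being interdependent.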

Page 41: GIS Data Quality

Common Mistakes

1. Inadequate Staffing of Data Quality Teams
2. Hoping That Data Will Get Better by Itself
3. Lack of Data Quality Assessment
4. Narrow Focus
5. Bad Metadata
6. Ignoring Data Quality During Data Conversions
7. Winner-Loser Approach in Data Consolidation
8. Inadequate Monitoring of Data Interfaces
9. Forgetting About Data Decay
10. Poor Organization of Data Quality Metadata

Page 42: GIS Data Quality

Metadata

Data model
Business rules, relations, state
Subclasses (lookup tables)
GIS Metadata (NZGLS or ISO) XML
Readme.txt

Includes everything known about the data

Page 43: GIS Data Quality

Data Exchange

Batch or interactive
ETL (Extract Transform Load)
Replication
Time differences in data

Page 44: GIS Data Quality

GIS in Business Processes

Integrates many different sources
Spatial patterns are revealed
Display thousands of records simultaneously with direct access
Location now seen as important

Page 45: GIS Data Quality

Scorecard

DQ Score
Score Summary
Score Decompositions
Intermediate Error Reports
Atomic Level Data Quality Information
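
The layers of the scorecard can be read as successive roll-ups of the atomic error list. The sketch below shows one possible roll-up. It assumes the atomic level is a list of (record id, rule name) pairs with each pair listed at most once. The function name, the rule names and the sample data are invented for the example.

```python
from collections import Counter


def build_scorecard(record_ids, errors):
    """Roll atomic (record_id, rule) errors up into per-rule and overall scores."""
    total = len(record_ids)
    if total == 0:
        raise ValueError("no records to score")
    per_rule = Counter(rule for _, rule in errors)  # intermediate error report
    bad_records = {rid for rid, _ in errors}        # records with any error
    return {
        "dq_score": (total - len(bad_records)) / total,
        "by_rule": {rule: (total - n) / total for rule, n in per_rule.items()},
        "error_counts": dict(per_rule),
    }


# Hypothetical dataset of ten address points with three atomic errors.
record_ids = list(range(1, 11))
errors = [
    (2, "missing address"),
    (5, "missing address"),
    (5, "outside boundary"),
]
card = build_scorecard(record_ids, errors)
print(card)
```

Record 5 fails two rules but counts only once against the overall score, so the per-rule decomposition cannot simply be summed to get the DQ score.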

Page 46: GIS Data Quality

Case Study

Outline a GIS data quality system
Measles Chart
Prioritise
Interview
Build up a scorecard

Afternoon Tea

Page 47: GIS Data Quality

Assessment Exercise

Split into pairs
Interview one person about their dataset
Collect basic information
Devise a strategy for a profile

Rotate pair with another
Interview other person

Verbal reports to class

Page 48: GIS Data Quality

Major Upgrade Projects

LINZ Coordinate upgrade
NSCC Coordinate upgrade

Page 49: GIS Data Quality

References

Data Quality Assessment – Arkady Maydanchik