Information Quality in Context (web.mit.edu/tdqm/www/winter/L3Winter03.pdf)
February 18-19, 2003 1
Joint UC Berkeley – MIT Winter 2003 Data Quality Workshop
Yang W. [email protected], [email protected]
Northeastern University
Phone: 1-617-373-5052
Fax: 1-617-373-3166
Information Quality in Context
Outline
Introduction
Examples: What is Data Quality?
Background: Motivation and Related Work
Research Questions
Concepts
Study: Sites, Projects, Data, Analysis
Results: 3 Data Quality (DQ) Problem Patterns
DQ Improvement: 10 Potholes (Root Causes)
Summary and Lessons Learned
Introduction
Examples
Rosetta Stone found in 1799, inscription deciphered and published in 1822.
The overture of 1805
FedEx in 2002
The 1805 Overture
In 1805, the Austrian and Russian Emperors agreed to join forces against Napoleon. The Russians said their forces would be in the field in Bavaria by Oct. 20. The Austrian staff planned based on that date in the Gregorian calendar. Russia, however, used the ancient Julian calendar, which lagged 10 days behind. The difference allowed Napoleon to surround Austrian General Mack's army at Ulm on Oct. 21, well before the Russian forces arrived.
Source: David Chandler, The Campaigns of Napoleon, New York: MacMillan 1966, p. 390.
Acknowledgement: A. Morton and S. Madnick
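The overture example turns on a date label that means different days under different calendars. A minimal sketch of that arithmetic follows; the function names are illustrative and not from the slides. It uses the standard century rule for the Julian/Gregorian gap, which gives 12 days for 1805 (the 10-day gap applies to dates between 1582 and 1700). Either way, the same label "Oct. 20" points at two different days.

```python
from datetime import date, timedelta


def julian_gregorian_offset(year: int) -> int:
    """Days by which the Julian calendar lags the Gregorian one.

    Standard century rule; valid for dates after February of the
    given year, which is enough for an October date.
    """
    century = year // 100
    return century - century // 4 - 2


def julian_to_gregorian(year: int, month: int, day: int) -> date:
    """Convert a Julian-calendar date label to a Gregorian date.

    Python's date type is proleptic Gregorian, so the label is built
    as a Gregorian date and then shifted by the offset. This shortcut
    fails for a Julian 29 February in a year that is not a Gregorian
    leap year.
    """
    return date(year, month, day) + timedelta(days=julian_gregorian_offset(year))


# "Oct. 20" as the Russian staff meant it (Julian label)
russian_plan = julian_to_gregorian(1805, 10, 20)
# The same label as the Austrian staff read it (Gregorian)
austrian_reading = date(1805, 10, 20)
gap_days = (russian_plan - austrian_reading).days
```

Storing the calendar alongside the date, rather than the bare label, is the kind of context the rest of the talk argues for.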
FedEx 2002
111502
010203
Background and Motivation
Documented risk, cost, impact of poor-quality data
Undocumented lost opportunities
Everyday inconvenience
A global consumer product company wants to identify products made of the same materials for its global procurement plan
A major hospital faces difficulties in conducting cross-patient trend analysis for its proactive patient care program
An insurance company faces the dilemma of using its poor-quality marketing analysis results for making strategic business decisions.
Cumulative impact of poor DQ on organizational performance
Consumer dissatisfaction, unstable business operation, misguided business strategies, and missed business opportunities.
Related Work and Work-in-progress
Information Manufacturing Model (Ballou et al, 1998)
Data quality Dimensions (Wang and Strong, 1996)
Data quality in Context (Strong, Lee, Wang, 1997)
Data quality Measurement (Pipino, Lee, and Wang, 2002)
Data quality Assessment (Lee, Pipino, and Wang, 2002)
Information Product (Wang, Lee et al, 1998)
Information Product-MAP (Pierce et al, 2002)
Quality Information and Knowledge (Huang, Lee, and Wang, 1999)
Interdependencies: Data and Process (Lee and Katz-Hass, 2002)
Process-embedded Data Integrity (Lee et al)
Knowledge at Work for Data Quality (Lee et al)
Rules in Data Quality (Lee et al)
Context-reflective DQ Problem-solving (Lee et al)
Journey to Data Quality (Lee et al, MIT Press, Forthcoming)
Research Questions
How do organizations define data quality?
What data quality problems arise in organizations?
How do organizations identify, analyze, and resolve data quality problems?
Are there common data quality patterns?
Across organizations
Across DQ projects
Concepts
Data Production System
Data Consumer’s View
Multiple Data Quality Categories
Data Production System
Data Consumer’s View of DQ
Quality data is data that is fit for use by data consumers (Wang et al, 1996)
IQ Category: IQ Dimensions
Intrinsic IQ: Accuracy, Objectivity, Believability, Reputation
Contextual IQ: Relevancy, Value-Added, Timeliness, Completeness, Amount of information
Representational IQ: Interpretability, Ease of understanding, Concise representation, Consistent representation
Accessibility IQ: Accessibility, Access security
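The category table above can be held as a simple lookup structure, which is convenient when coding interview transcripts by dimension. This is only a sketch: the names `IQ_CATEGORIES` and `category_of` are illustrative, and the dimension lists are copied from the table.

```python
# IQ categories and their dimensions, as listed in the table above.
IQ_CATEGORIES = {
    "Intrinsic": ["Accuracy", "Objectivity", "Believability", "Reputation"],
    "Contextual": [
        "Relevancy", "Value-Added", "Timeliness",
        "Completeness", "Amount of information",
    ],
    "Representational": [
        "Interpretability", "Ease of understanding",
        "Concise representation", "Consistent representation",
    ],
    "Accessibility": ["Accessibility", "Access security"],
}


def category_of(dimension: str) -> str:
    """Return the IQ category that a dimension belongs to (case-insensitive)."""
    wanted = dimension.strip().lower()
    for category, dimensions in IQ_CATEGORIES.items():
        if wanted in (d.lower() for d in dimensions):
            return category
    raise KeyError(f"unknown IQ dimension: {dimension!r}")


timeliness_category = category_of("timeliness")
total_dimensions = sum(len(dims) for dims in IQ_CATEGORIES.values())
```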
Multiple DQ Categories
Intrinsic DQ: information has quality in its own right.
Contextual DQ: information quality must be considered within the context of the task at hand.
Representational DQ and Accessibility DQ emphasize the importance of the role of systems.
Study Concepts
DQ Project
Data-related actions taken to manage DQ problems
Problem finding (inquiry)
Problem analysis (framing)
Problem resolution (action)
DQ Problem
Any difficulty in collecting, storing/maintaining, and utilizing data.
DQ Stakeholders
Data collectors
Data custodians (IS professionals)
Data consumers
Study Sites
3 data-intensive and service-critical companies
Airline (GoldenAir)
Hospital (BetterCare)
HMO (HyCare)
Seriously attend to their IQ problems
Vary in their computing environment
Vary in how they attend to IQ
Software tools
IQA/DQA
TQM
Data Collection
Collected 42 DQ project histories
DQ Project: data-related actions taken in an organization to manage DQ problems
DQ Problem: difficulties in collecting, storing, or using data.
Interviewed information stakeholders:
Information collectors
Information custodians (IS professionals)
Information consumers
Managers for information collectors, custodians, and consumers
Data Analysis
Performed content analysis of 42 transcribed DQ project histories
DQ dimensions are the content analysis codes
Performed pattern analysis of coded projects
Classified projects: by overriding DQ concern into four DQ categories
Within project: chronological order of DQ dimensions
Across projects: group by common patterns of chronological DQ dimensions
Performed embedded case analysis of each project
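The across-project step (grouping projects whose DQ dimensions appear in the same chronological order) can be sketched as follows. The project identifiers and coded sequences are invented for illustration; the slides do not publish the coded data, and the study's actual grouping procedure may have been less mechanical than an exact-sequence match.

```python
from collections import defaultdict

# Hypothetical coded projects: (project id, DQ dimensions in chronological order).
coded_projects = [
    ("proj-01", ["believability", "reputation", "value-added"]),
    ("proj-02", ["accessibility", "timeliness"]),
    ("proj-03", ["believability", "reputation", "value-added"]),
    ("proj-04", ["objectivity", "reputation", "value-added"]),
]


def group_by_sequence(projects):
    """Group project ids by their exact chronological sequence of dimensions."""
    groups = defaultdict(list)
    for project_id, dimensions in projects:
        # A tuple is hashable, so the ordered sequence can serve as the key.
        groups[tuple(dimensions)].append(project_id)
    return dict(groups)


patterns = group_by_sequence(coded_projects)
```

Sequences shared by several projects are candidates for a common pattern; singletons would need the embedded case analysis to interpret.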
Example DQ Project: Hospital
PROBLEM FINDING:
TRACE DQA noticed a large increase in infectious disease patients
PROBLEM ANALYSIS:
A possible error in collection and storage of data
Called admissions to confirm this cause
PROBLEM RESOLUTION:
Process: Trained personnel; checked and revised emergency room procedures
Data: Admissions and IS work together to change data
Results
Three DQ Patterns Identified
Intrinsic DQ pattern
Information not used by consumers
believability, reputation, objectivity
Accessibility DQ pattern
Consumers experience any barriers to accessing information
accessibility, access security, timeliness
Representational DQ dimensions show up as underlying causes of accessibility DQ
consistent, concise representation
Contextual DQ pattern
Consumer’s (multiple) task (changing) context as critical context
Pattern 1: Intrinsic DQ
Mis-match between several sources of the “same” data
Hospital: TRACE vs. STATUS (e.g., daily hospital bed utilization)
“consistency” vs. “accuracy”
Airline: manual vs. warehouse, MMS vs. warehouse
Starts as a believability issue
Over time, poor reputation of sources
STATUS develops poor reputation for quality
MMS develops poor reputation for quality
Subjective production of data
Human judgment in coding
Figure: DQ Pattern 1: Intrinsic DQ
Root causes: (1) Multiple sources of same data; (2) Judgement involved in data production
Affected dimensions: Questionable Believability, Questionable Objectivity, Poor Reputation, Little Added Value
Outcome: Data not used
Annotations:
Mismatches exist
Information about causes of mismatches accumulates
Data production process viewed as subjective
Information about subjectivity accumulates
Poor intrinsic data quality becomes common knowledge
Data not used because of little Added Value and poor reputation
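Pattern 1 starts with mismatches between two sources of the "same" data. A minimal sketch of surfacing such mismatches is shown below. The bed-utilization figures are invented, the tolerance is an arbitrary choice, and `find_mismatches` is a hypothetical helper; only the system names TRACE and STATUS come from the slides.

```python
def find_mismatches(source_a, source_b, tolerance=0):
    """Return {key: (a_value, b_value)} for shared keys that differ by more than tolerance."""
    shared_keys = sorted(source_a.keys() & source_b.keys())
    return {
        key: (source_a[key], source_b[key])
        for key in shared_keys
        if abs(source_a[key] - source_b[key]) > tolerance
    }


# Invented daily bed-utilization counts reported by two hospital systems.
trace = {"2003-02-17": 412, "2003-02-18": 398, "2003-02-19": 405}
status = {"2003-02-17": 412, "2003-02-18": 391, "2003-02-19": 407}

strict = find_mismatches(trace, status)                # any difference counts
tolerant = find_mismatches(trace, status, tolerance=5)  # ignore small gaps
```

Detecting the mismatch is the easy part. Per the pattern, the damage comes later, when consumers conclude that one source cannot be believed and stop using it.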
Pattern 2: Accessibility DQ
Technical Accessibility
Physical access (Airline)
Computing resources (HMO)
Time to Access / Ease of Access:
Amount of data (HMO)
Privacy, confidentiality (HMO, Hospital)
Interpretability and Understandability:
Coding, such as DRG coding (HMO, Hospital)
Representation and its Analyzability:
Image and text data (HMO, Hospital)
Figure: DQ Pattern 2: Accessibility DQ
Problem: Barriers to data accessibility
Root causes (numbered (3) to (7) in the figure): Lack of computing resources; Privacy and confidentiality; Computerizing and data analyzing
Affected dimensions: Poor Accessibility, Access Security, Interpretability and Understandability, Concise and Consistent Representation, Amount of Data, Timeliness
Annotations:
Systems difficult to access; e.g., unreliable network
Computerized data inaccessible due to insufficient systems resources
Must protect confidentiality
Computerized data inaccessible due to time and effort to get authorized permission to access
Technical data across multiple specialties included in databases; e.g., medical terminology, medical measurements, and engineering specifications
Computerized data coded, e.g., DRG and procedure codes
Computerized data inaccessible because multiple specialists are needed to interpret data across multiple specialties
Advanced IT permits storage of image and text data
Computerized data inaccessible for analysis due to limited capabilities to summarize across image and text data
Large amount of data accumulated
Processing slowed due to large data volume; e.g., weekend batch extracts
Computerized data inaccessible when needed
Pattern 3: Contextual DQ
Mis-match between information available and what information is relevant and adds value for information consumers
Missing data -- the easy case
Data bundling and analyzability -- the hard case
Consider the hard case
Pattern 3: Contextual DQ
Data bundling and analyzability
Issue is aggregation
Across record (transaction) analysis of data
Often across distributed systems
Incompatible, distributed systems (HMO)
Bundling Unit (Hospital)
1970’s: procedures performed in the hospital
1980’s: patient visit, disease
1990’s: patient across all visits, diseases
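The shift in bundling unit can be illustrated by aggregating the same records at different levels. The records, field names and costs below are invented, and `bundle` is a hypothetical helper. The point of the sketch is that each new bundling unit only works if the field it groups on was captured in the first place.

```python
from collections import defaultdict

# Field layout of the invented procedure-level records below.
FIELDS = ("patient", "visit", "disease", "procedure", "cost")

records = [
    ("pat-1", "visit-1", "asthma", "x-ray", 120),
    ("pat-1", "visit-1", "asthma", "spirometry", 80),
    ("pat-1", "visit-2", "fracture", "x-ray", 120),
    ("pat-2", "visit-3", "asthma", "spirometry", 80),
]


def bundle(rows, unit):
    """Sum cost per bundling unit; raises ValueError if the unit was never recorded."""
    index = FIELDS.index(unit)
    totals = defaultdict(int)
    for row in rows:
        totals[row[index]] += row[-1]
    return dict(totals)


by_procedure = bundle(records, "procedure")  # procedure-level view
by_visit = bundle(records, "visit")          # patient-visit view
by_patient = bundle(records, "patient")      # patient across all visits
```

Asking for a unit that is not in `FIELDS` (for example an employer or health plan) fails immediately, which is the hard case: the data exist but cannot be bundled the way the new task requires.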
Figure: DQ Pattern 3: Contextual DQ
Problem: Data utilization difficulty
Root causes (numbered (8) to (10) in the figure): Operational data production problems; Changing data consumers' needs; Distributed Computing
Affected dimensions: Incomplete Data, Poor Relevancy, Inconsistent Representation, Little Value Added
Annotations:
Data producers fail to supply complete data
Computerized data are not relevant to current data consumers' tasks due to incomplete data for analysis and aggregation
Need for new data
Need to aggregate data based on "fields" (attributes) that do not exist in the data
Need to aggregate, report and integrate across autonomous and heterogeneous systems
Integrated data from different systems add little value due to inconsistently represented data
Inability to integrate or aggregate data results in poor contextual DQ (data with little value-added or relevancy to data consumers' tasks)
Organizational DQ Principles
Intrinsic DQ:
Information has quality in its own right (Internal View)
Accessibility DQ:
Information must be accessible, but secure
Information must be presented in a concise, but understandable representation.
Contextual DQ:
DQ must be considered within the context of the task at hand
Figure: Data Quality Problem Pattern (root causes (1) to (10) across the three patterns)
Intrinsic DQ, causes (1) to (2): Multiple sources of same data; Judgement involved in data production. Dimensions: Questionable Believability, Questionable Objectivity, Poor Reputation, Little Added Value. Outcome: Data not used.
Accessibility DQ, causes (3) to (7), barriers to data accessibility: Lack of computing resources; Privacy and confidentiality; Computerizing and data analyzing. Dimensions: Poor Accessibility, Access Security, Interpretability and Understandability, Concise and Consistent Representation, Amount of Data, Timeliness.
Contextual DQ, causes (8) to (10), data utilization difficulty: Operational data production problems; Changing data consumers' needs; Distributed Computing. Dimensions: Poor Relevancy, Incomplete Data, Inconsistent Representation, Little Value Added.
The Road to Data Quality
Improving Data Quality
Attend to Data Production Processes
Data collection
Data storage
Data utilization
Attend to Key DQ Problems
The Information Production Road
The Ten Potholes
1. Subjective information production
2. Multiple sources of the same information
3. Information production errors
4. Too much information
5. Distributed, inconsistent information
6. Storage of non-numeric information
7. Lack of algorithms for non-numeric information
8. Changing task environment of information consumers
9. Security and privacy vs. accessibility
10. Lack of computing resources
Subjective Judgment
Multiple Sources of Same Data
Systemic Errors in Data Production
Large Volume vs. Timely Access
Distributed Heterogeneous Systems
Advanced Analysis: Image and Text
Nonnumeric Data
Environment/Market Change
Access vs. Security and Privacy
Lack of Computing Resources
Figure: Ten Potholes in the Road to Information Quality
Components: Information Sources, Information Systems Infrastructure, Computerized Database, Task Environment
Processes: Information Production Process, Information Storage & Maintenance Process, Information Utilization Process
Roles: Information Producers, Information Custodians, Information Consumers
Potholes (labelled P1 to P10 in the figure): Multiple Sources, Subjective Production, Production Errors, Too Much Information, Non-numeric Information, Distributed Systems, Advanced Analysis Requirements, Changing Task Needs, Security & Privacy Requirements, Lack of Computing Resources
The Information Collection Road
Key IQ Problems
Multiple Sources of the Same Information (duplicate production)
Subjective Information Production
Information Production Errors
The Information Storage Road
Key IQ Problems
Too much information
Distributed, inconsistent information
Storage of non-numeric information
The Information Utilization Road
Key IQ Problems
Lack of algorithms for non-numeric information
Changing task environment of information consumers
Security and Privacy vs. Accessibility
Lack of Computing Resources
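The three road slides partition the ten potholes across collection, storage and utilization. A small sketch of that mapping, with a check that every pothole is assigned to exactly one road, is given below. The pothole numbers follow the numbered list of ten potholes; the names `ROADS`, `road_of` and `coverage_gaps` are illustrative only.

```python
# Pothole numbers as in the numbered list of ten potholes.
POTHOLES = {
    1: "Subjective information production",
    2: "Multiple sources of the same information",
    3: "Information production errors",
    4: "Too much information",
    5: "Distributed, inconsistent information",
    6: "Storage of non-numeric information",
    7: "Lack of algorithms for non-numeric information",
    8: "Changing task environment of information consumers",
    9: "Security and privacy vs. accessibility",
    10: "Lack of computing resources",
}

# Assignment of potholes to the three roads, as on the road slides.
ROADS = {
    "collection": [1, 2, 3],
    "storage": [4, 5, 6],
    "utilization": [7, 8, 9, 10],
}


def road_of(pothole_number):
    """Return the road (process) on which a pothole sits."""
    for road, numbers in ROADS.items():
        if pothole_number in numbers:
            return road
    raise KeyError(pothole_number)


def coverage_gaps():
    """Return (missing, duplicated) pothole numbers in the road assignment."""
    assigned = [n for numbers in ROADS.values() for n in numbers]
    missing = sorted(set(POTHOLES) - set(assigned))
    duplicated = sorted({n for n in assigned if assigned.count(n) > 1})
    return missing, duplicated


gaps = coverage_gaps()
```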
Lessons Learned
Accuracy is necessary, but not sufficient for high DQ.
Attend to evolving DQ problems: DQ problems change as business needs change over time (global, cross-functional, integration).
Attend to the entire Information Production System.
Attend to the root causes of key common IQ problems.
Look beyond technical accessibility.
Recognize that DQ is evaluated in the context of the changing tasks of multiple data consumers.
Key References
Lee, Y., D. Strong, B. Kahn, and R. Wang, “AIMQ: A Methodology for Information Quality Assessment,” Information & Management, Vol. 40, Issue 2, December 2002, pp. 133-146.
Huang, K. T., Y. Lee, and R. Wang, Quality Information and Knowledge, Upper Saddle River, NJ: Prentice Hall, 1999.
Strong, D., Y. Lee, and R. Wang, “Data Quality in Context,” Communications of the ACM, May 1997, pp. 103-110.
http://web.mit.edu/tdqm