Data Quality (a.k.a. “Data Heterogeneity”)

Data Quality (a.k.a. “Data Heterogeneity”) Kent Bailey, Susan Rea Welch, Lacey Hart, Kevin Bruce, Susan Fenton


Transcript of Data Quality (a.k.a. “ Data Heterogeneity ” )

Page 1: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Data Quality (a.k.a. “Data Heterogeneity”)

Kent Bailey, Susan Rea Welch, Lacey Hart, Kevin Bruce, Susan Fenton

Page 2: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Objectives

Assess Data variability within and across institutions

Assess impact of this variability on Secondary Use of EMR

Generate specifications for Widgets
– “Warning Label” for suspect data categories
– Data quality audits with logs
– Batch data correction / removal

Page 3: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Current Research: Effects of Variation on Diabetes Phenotyping Algorithm

Purpose: Compare data relevant to Type 2 DM eMERGE phenotyping algorithm between Intermountain and Mayo

Methods:

1. Identify adult subjects with evidence in any semantic category of the algorithm:
– ICD-9-CM codes for Diabetes Mellitus
– Abnormal glucose or HbA1C
– Antihyperglycemic medications
– Capillary glucose (Glucometer) procedures
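The selection step above can be sketched as follows. This is a minimal illustration, not the eMERGE algorithm itself: the `PatientRecord` fields, the age threshold of 18, and the use of the ICD-9-CM 250 category as the only diagnosis test are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    # Hypothetical record layout; real EMR extracts will differ per site.
    patient_id: str
    age: int
    icd9_codes: list = field(default_factory=list)
    has_abnormal_glucose_or_hba1c: bool = False
    on_antihyperglycemic_med: bool = False
    has_capillary_glucose_procedure: bool = False


def evidence_categories(rec):
    """Return the set of semantic categories in which a record shows evidence."""
    cats = set()
    # Assumption: any code in ICD-9-CM category 250 counts as a DM code.
    if any(code.startswith("250") for code in rec.icd9_codes):
        cats.add("icd9_dm")
    if rec.has_abnormal_glucose_or_hba1c:
        cats.add("abnormal_lab")
    if rec.on_antihyperglycemic_med:
        cats.add("medication")
    if rec.has_capillary_glucose_procedure:
        cats.add("capillary_glucose")
    return cats


def select_candidates(records, min_age=18):
    """Keep adults with evidence in at least one category (min_age is assumed)."""
    return [r for r in records if r.age >= min_age and evidence_categories(r)]
```

Returning the set of categories, rather than a single yes/no flag, keeps the per-category counts available for the later between-site comparison.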

Page 4: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Methods

2. Collect relevant data on these subjects
– ICD-9-CM codes
– Procedure codes
– Demographic data
– Smoking status
– Body Mass index
– Specialty of provider
– Geographic info
– Frequency of health care encounters

3. Describe variation between institutions

Page 5: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Analysis

Compare (between institutions) frequencies of data elements
– ICD9 codes: overall and specific codes

Compare lab values
– number and values

Compare medications

Control for:
– Provider specialty
– Geographic variables
– Demographic variables
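One simple way to compare code frequencies between two institutions is sketched below. It assumes each site's data is a mapping from subject ID to that subject's ICD9 codes (a hypothetical layout), and it does not adjust for provider specialty, geography, or demographics; a stratified or model-based comparison would be needed for that.

```python
from collections import Counter


def code_frequencies(subject_codes):
    """Fraction of subjects who carry each code at least once."""
    n = len(subject_codes)
    if n == 0:
        return {}
    # set() so that repeated codes for one subject are counted once.
    counts = Counter(
        code for codes in subject_codes.values() for code in set(codes)
    )
    return {code: c / n for code, c in counts.items()}


def compare_sites(site_a, site_b):
    """Map each code to its (site A frequency, site B frequency) pair."""
    fa = code_frequencies(site_a)
    fb = code_frequencies(site_b)
    return {
        code: (fa.get(code, 0.0), fb.get(code, 0.0))
        for code in sorted(set(fa) | set(fb))
    }
```

Counting subjects rather than code occurrences keeps the comparison from being dominated by differences in encounter frequency between sites.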

Page 6: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Interpretation

Assess impact of data heterogeneity on phenotyping at different institutions

Recommendations for
– High throughput Phenotyping
– High throughput screening for clinical trials

Generalization to other phenotypes

Hypothesis generation

Page 7: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Preliminary Mayo Results

Mayo Data: (ICD or abnormal labs or capillary glucose, limited to Olmsted and surrounding counties)
– 13,754 subjects
– 89% Caucasian, 2.5% African-American, 2.0% Asian, 6.5% Native American, Pacific Islander, other, unknown, or refused
– Mean current age 64, range 20 to 104
– Sex: 53% male, 47% female

Page 8: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Preliminary Mayo results (N=13,754)

Smoking (n=11,626)
– Current 66%, past 16%, never 13%, unknown 6%

BMI (limited to < 60) (n=6,338)
– Mean 32.6 +/- 7.2
– Median 31.6, quartiles (27.5, 36.6)

Page 9: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Preliminary Results: ICD9 codes

Complications
– None 6743 (250.0)
– Ketoacidosis 1 (250.1)
– Hyperosmolality 2 (250.2)
– Renal 398 (250.4)
– Ophthalmic 1385 (250.5)
– Neuro 586 (250.6)
– Peripheral Circ. 25 (250.7)
– “other specified” 312 (250.8)
– Unspecified 336 (250.9)

Page 10: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Preliminary Results: ICD9 codes

250.X0 Type 2 or unspecified, controlled or not specified as uncontrolled

250.X1 Type 1, controlled or not specified as uncontrolled

250.X2 Type 2 or unspecified, uncontrolled

250.X3 Type 1, uncontrolled
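The 250.XY structure lends itself to a small decoder. The sketch below is illustrative only: the function name and return shape are hypothetical, and the complication labels are limited to the fourth digits listed in these slides, so any other fourth digit is reported as `None`.

```python
import re

# Fourth digit: complication, using only the labels shown on the slides.
COMPLICATION = {
    "0": "none",
    "1": "ketoacidosis",
    "2": "hyperosmolality",
    "4": "renal",
    "5": "ophthalmic",
    "6": "neuro",
    "7": "peripheral circulatory",
    "8": "other specified",
    "9": "unspecified",
}

# Fifth digit: (diabetes type, stated as uncontrolled?).
FIFTH_DIGIT = {
    "0": ("type 2 or unspecified", False),
    "1": ("type 1", False),
    "2": ("type 2 or unspecified", True),
    "3": ("type 1", True),
}


def decode_250(code):
    """Decode a five-digit ICD-9-CM 250.XY code.

    Returns (complication, dm_type, uncontrolled), or None if the code is
    not a five-digit 250 code with a fifth digit of 0 to 3.
    """
    m = re.fullmatch(r"250\.(\d)(\d)", code)
    if m is None or m.group(2) not in FIFTH_DIGIT:
        return None
    dm_type, uncontrolled = FIFTH_DIGIT[m.group(2)]
    return (COMPLICATION.get(m.group(1)), dm_type, uncontrolled)
```

For example, `decode_250("250.42")` yields `("renal", "type 2 or unspecified", True)`.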

Page 11: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Type 2/U vs. Type 1 DM codes

Mayo Data: n=13707

                      Type 2/U DM codes
Type 1 DM codes       0              1+
0                     6339 (46%)     6631 (48%)
1+                    483 (4%)       254 (2%)
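As a check on the arithmetic, the cell percentages can be recomputed from the counts. The snippet below uses only the four Mayo counts given in the table; the variable names are illustrative.

```python
# Keys are (Type 1 code count, Type 2/U code count) categories.
counts = {
    ("0", "0"): 6339,
    ("0", "1+"): 6631,
    ("1+", "0"): 483,
    ("1+", "1+"): 254,
}

total = sum(counts.values())  # 13707, matching the stated n

# Whole-number percentage of all subjects in each cell.
percents = {cell: round(100 * n / total) for cell, n in counts.items()}
# percents == {("0","0"): 46, ("0","1+"): 48, ("1+","0"): 4, ("1+","1+"): 2}

# Among subjects with any Type 1 code, the share who also carry a Type 2/U code.
type1_total = counts[("1+", "0")] + counts[("1+", "1+")]
overlap_share = counts[("1+", "1+")] / type1_total
```

The overlap share (254 of 737 subjects with a Type 1 code) is the kind of within-site inconsistency that a phenotyping algorithm has to resolve.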

Page 12: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Intermountain peek (sic)

                      Type 2/U ICD9 codes
Type 1 ICD9 codes     0              1+
0                     --             65,983
1+                    2,083          6,629

Disclaimer: don’t assume data are ready to compare between sites at this point

Page 13: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Back to Mayo Summary: Sample Lab data

Test name            N         Min    1%     Med.   99%     Max
Glucose (P)          40,786    1      67     127    394     1300
Glucose POCT         211,746   25     63     141    392     600
Hemoglobin A1c, B    35,206    4.0%   5.1%   6.9%   12.1%   16.7%
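A summary row of this shape (N, min, 1st percentile, median, 99th percentile, max) can be produced with the Python standard library. This is a sketch under assumptions: the slides do not say which percentile method was used, so the "inclusive" interpolation chosen here may not reproduce the reported values exactly.

```python
import statistics


def lab_summary(values):
    """Return N, min, 1st percentile, median, 99th percentile, and max."""
    data = sorted(values)
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    return {
        "n": len(data),
        "min": data[0],
        "p1": cuts[0],
        "median": statistics.median(data),
        "p99": cuts[98],
        "max": data[-1],
    }
```

Reporting the 1st and 99th percentiles alongside min and max makes implausible extremes, such as a plasma glucose of 1, easy to spot without letting them distort the summary.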

Page 14: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Future Directions

Carry out inter-institution comparison

Study effects of geography, race, etc.

Implement chart review (on random sample) for “gold standard” definition of Type 2 DM

Use of lab values / meds for definition of continuous phenotype (DM-ness)

Extrapolation / generalization to other diseases / phenotypes

Page 15: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Data Quality (a.k.a. “Data Heterogeneity”)

Susan Rea Welch

Page 16: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Conclusions: PhD Research

Cohort Amplification
– Knowledge Discovery from Databases (KDD)
– Associative Classification Methods
– Classification Rules for Diabetes and Asthma
    comparably accurate
    Concise
    consistent with domain knowledge
– Contributed new knowledge
    Attributes for cohort identification
    Unanticipated comorbidity associations

Page 17: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Consistency and Novelty: Diabetes

Elevated quantitative lab glucose assays
– Frequency 19%, Likelihood 87%
– Less predictive than glucose by glucometer or Urine Microalbumin

Abnormal HbA1c test
– Equivalent predictive power of HbA1c test order

Antihyperglycemic medications
– Variable predictive strength: Metformin, Insulin, Insulin Release Stimulators, Insulin Response Enhancers

Page 18: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Consistency and Novelty: Asthma

Medications were most predictive
– High Likelihood: Salmeterol, Leukotriene receptor antagonist
– Albuterol / Glucocorticoid combine:
    Pulmonary Procedures (CPT hierarchy)
    Female gender
    Abnormal CBC

Unexpected comorbidity associations
– Suggests discovery of shared pathways

Page 19: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Associative Classification – What?

• Pattern discovery in transaction database

• Independent of domain expertise

• Deductive, global associations in data

• Induce a general & accurate classifier
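The idea can be illustrated with a toy rule miner. This is a brute-force sketch, not the method used in the research described here: it enumerates small attribute sets, scores each "attributes → class" rule by the standard association-rule measures of support and confidence, and classifies with the first matching rule. The thresholds, the `max_len` limit, and the function names are assumptions made for the example.

```python
from itertools import combinations


def mine_class_rules(transactions, min_support=0.1, min_confidence=0.8, max_len=2):
    """Mine rules (itemset, label, support, confidence) from labeled transactions.

    transactions: list of (set_of_attributes, class_label) pairs.
    """
    n = len(transactions)
    items = sorted({item for attrs, _ in transactions for item in attrs})
    rules = []
    for k in range(1, max_len + 1):
        for itemset in combinations(items, k):
            wanted = set(itemset)
            # Labels of all transactions that contain every item in the set.
            covered = [label for attrs, label in transactions if wanted <= attrs]
            if not covered:
                continue
            for label in sorted(set(covered)):
                hits = covered.count(label)
                support = hits / n
                confidence = hits / len(covered)
                if support >= min_support and confidence >= min_confidence:
                    rules.append((itemset, label, support, confidence))
    # Rank: highest confidence, then highest support, then shortest rule.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules


def classify(rules, attrs, default=None):
    """Return the label of the first (best-ranked) rule whose itemset matches."""
    for itemset, label, _, _ in rules:
        if set(itemset) <= attrs:
            return label
    return default
```

Because a rule only tests for attributes that are present, a record with absent attributes simply matches fewer rules instead of needing imputation. A production miner would use an Apriori-style search rather than full enumeration.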

Page 20: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Associative Classification – Why?

• No domain expertise attribute selection

• Not affected by missing data

• Proven accuracy

• Understandable rules

• Independent rules

Page 21: Data Quality (a.k.a.  “ Data Heterogeneity ” )

Core Candidate Attributes

Diagnosis codes
Provider specialty
Lab observations
Procedure codes
‘Abnormal’ lab obs.
Imaging procedures
Medication list
Age groups
Female gender

Page 22: Data Quality (a.k.a.  “ Data Heterogeneity ” )

SHARPn Y2 Research Aims

Associations reliable across EHRs?

Improve algorithms’ sensitivity / specificity?

– AC attribute selection + other classifiers