Taming EHR Data

Taming EHR Data: Using Semantic Similarity to Reduce Dimensionality. Jim Weatherall, PhD; Head, Advanced Analytics Centre, AstraZeneca; Visiting Lecturer, School of Computer Science, University of Manchester. 14th World Congress on Medical & Health Informatics, August 2013, Copenhagen. On behalf of the authors: Leila Kalankesh (School of Computer Science, UoM), James Weatherall (AstraZeneca), Thamer Ba-Dhfari (School of Computer Science, UoM), Iain Buchan (Institute of Population Health, UoM), Andy Brass (School of Computer Science, UoM).


Transcript of Taming EHR Data

Page 1: Taming EHR Data

Taming EHR Data

Jim Weatherall, PhD
Head, Advanced Analytics Centre, AstraZeneca
Visiting Lecturer, School of Computer Science, University of Manchester

14th World Congress on Medical & Health Informatics, August 2013, Copenhagen

Using Semantic Similarity to Reduce Dimensionality

On behalf of the authors:

Leila Kalankesh, School of Computer Science, UoM
James Weatherall, AstraZeneca
Thamer Ba-Dhfari, School of Computer Science, UoM
Iain Buchan, Institute of Population Health, UoM
Andy Brass, School of Computer Science, UoM

Page 2: Taming EHR Data

Biometrics & Information Sciences | GMD

Introduction

Problems with mining healthcare data

J.Weatherall | August 2013

Read Code   Rubric
C10F.       Type II Diabetes Mellitus
1372.       Trivial smoker < 1 cig/day
bd3j.       Prescription of “Atenolol 25mg tablets”
G20.        Essential hypertension
2469.       Measurement of Diastolic Blood Pressure
246A.       Assessment of Diastolic Blood Pressure

Research not primary purpose for collection

100s of 1000s of codes

10s of 1000s of dimensions

Large collections not easily visualised or interpreted

Page 3: Taming EHR Data


Data

The Salford Integrated Record (SIR)


• Population ~220,000
• Integrated primary and secondary care information
• Individual Read Code entries captured in primary care information systems
    • Codes for diagnosis
    • Codes for procedures
• All clinical transactions in primary care and some in secondary care
• Data extract for this analysis based on:
    • GP data in date range 2003-2009, containing 136M Read code entries
    • Selected 24K patients with chronic conditions, containing 443K Read code entries

Page 4: Taming EHR Data

Methods

Semantic Similarity


How alike are the meanings of two terms?

From Sanchez, J.Biomed.Inform, 2011

Measure ontological distance?

Or not?

Measure depth?
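The two questions on this slide (measure distance, or measure depth?) can be made concrete with a small sketch. It rests on an assumption not stated on the slides: that the codes follow the Read v2 convention in which each successive character adds one level to the hierarchy and trailing dots are padding. The function names are hypothetical and only illustrate edge-counting versus common-ancestor depth; this is not the authors' implementation.

```python
def read_path(code: str) -> str:
    """Strip the '.' padding; the remaining characters are taken as the path from the root.

    Assumption: one character per hierarchy level (Read v2 style).
    """
    return code.rstrip(".")


def common_ancestor_depth(a: str, b: str) -> int:
    """Depth of the deepest shared ancestor, i.e. the length of the common prefix."""
    depth = 0
    for x, y in zip(read_path(a), read_path(b)):
        if x != y:
            break
        depth += 1
    return depth


def edge_distance(a: str, b: str) -> int:
    """Number of edges on the path a -> deepest common ancestor -> b."""
    d = common_ancestor_depth(a, b)
    return (len(read_path(a)) - d) + (len(read_path(b)) - d)


# The two diastolic blood pressure codes from the introduction are siblings,
# while the diabetes and hypertension codes only meet at the root.
bp_depth = common_ancestor_depth("2469.", "246A.")
bp_distance = edge_distance("2469.", "246A.")
far_distance = edge_distance("C10F.", "G20.")
```

Edge counting treats every link as equally long, which is one reason information-based measures such as Resnik's are often preferred.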

Page 5: Taming EHR Data

Methods

Semantic Similarity – which method?


Semantic Similarity Method
    Ontological
        Node-based
        Edge-based
        Hybrid
    Corpus-based
        Frequency
        Context
        Proximity
    Combined

An ontology of methods!

Page 6: Taming EHR Data

Semantic similarity calculation

The Resnik measure


1. Term probability, based on frequency, including descendants and annotations:

   $$P(c) = \frac{\sum_{c' \in \mathrm{codes}(c)} \mathrm{count}(c')}{N}$$

2. Log transformation, gives “Information Content”:

   $$IC(c) = -\log P(c)$$

3. IC of the “Most Informative Common Ancestor” gives the similarity measure:

   $$\mathrm{sim}_{\mathrm{Res}}(c_1, c_2) = IC(c_{\mathrm{MICA}})$$

P. Resnik, “Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language”, J Artif Intell Res, 1999
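A minimal sketch of the three steps of the Resnik measure on a toy taxonomy. The taxonomy, the annotation counts and all function names are hypothetical, made up for illustration only. The sketch also assumes that codes(c) means c together with all of its descendants, that N is the total number of annotations, and that the logarithm is natural; the slides do not state these details.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "endocrine": "root",
    "circulatory": "root",
    "diabetes": "endocrine",
    "type2_diabetes": "diabetes",
    "hypertension": "circulatory",
    "angina": "circulatory",
}

# Hypothetical number of direct annotations per term (N = 20 in total).
COUNT = {
    "root": 0, "endocrine": 2, "circulatory": 0, "diabetes": 3,
    "type2_diabetes": 5, "hypertension": 6, "angina": 4,
}


def ancestors(c):
    """Return c followed by all of its ancestors up to the root."""
    out = []
    while c is not None:
        out.append(c)
        c = PARENT[c]
    return out


def probability(c):
    """Step 1: P(c) = annotations of c and its descendants, divided by N."""
    n_total = sum(COUNT.values())
    subtree = sum(n for term, n in COUNT.items() if c in ancestors(term))
    return subtree / n_total


def information_content(c):
    """Step 2: IC(c) = -log P(c); rarer terms carry more information."""
    return -math.log(probability(c))


def resnik(c1, c2):
    """Step 3: IC of the most informative common ancestor of c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c) for c in common)


sim_same_branch = resnik("hypertension", "angina")      # share "circulatory"
sim_cross_branch = resnik("type2_diabetes", "angina")   # share only the root
```

Terms that meet only at the root score zero, because the root has probability 1 and therefore no information content.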

Page 7: Taming EHR Data

Analysis Plan

Stepwise approach to dimensionality reduction


1. Map patient records from diagnosis space into a similarity space

2. Map patient records into a low-dimensional vector space via PCA

3. Project patient records onto low-dimensional vector space and cluster patients by similarity

Page 8: Taming EHR Data

Analysis – Step 1

Mapping from diagnosis space to similarity space

“The Similarity Matrix”

      p1           p2           …   pn
p1    sim(p1,p1)   sim(p1,p2)   …   sim(p1,pn)
p2    sim(p2,p1)   sim(p2,p2)   …   sim(p2,pn)
…     …            …            …   …
pn    sim(pn,p1)   sim(pn,p2)   …   sim(pn,pn)

pi = patient i
sim(pi,pj) = similarity score between patients i and j
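A sketch of how a patient-by-patient similarity matrix could be built from code-level similarities. The slides do not say how code-level scores are aggregated into a patient-level score, so the best-match average used here is an assumption, and `toy_code_sim` is a hypothetical stand-in (shared-prefix length) for a real measure such as Resnik's. The patient records are invented.

```python
def toy_code_sim(a: str, b: str) -> float:
    """Hypothetical code-level similarity: shared-prefix length scaled to [0, 1]."""
    pa, pb = a.rstrip("."), b.rstrip(".")
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return shared / max(len(pa), len(pb))


def patient_sim(codes_i, codes_j, code_sim) -> float:
    """Best-match average (an assumed aggregation, not taken from the slides).

    Each code is paired with its most similar code in the other record,
    and the best scores from both directions are averaged.
    """
    best_i = [max(code_sim(a, b) for b in codes_j) for a in codes_i]
    best_j = [max(code_sim(a, b) for a in codes_i) for b in codes_j]
    return (sum(best_i) + sum(best_j)) / (len(best_i) + len(best_j))


def similarity_matrix(patients, code_sim):
    """Return the n x n matrix of sim(pi, pj), rows and columns in sorted id order."""
    ids = sorted(patients)
    return [[patient_sim(patients[i], patients[j], code_sim) for j in ids] for i in ids]


# Invented records (non-empty sets of Read codes) for three patients.
patients = {
    "p1": {"C10F.", "G20."},
    "p2": {"C10F.", "2469."},
    "p3": {"G20.", "246A."},
}
S = similarity_matrix(patients, toy_code_sim)
```

Each patient is now described by n similarity scores instead of tens of thousands of code dimensions, and the matrix is symmetric with ones on the diagonal.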

Page 9: Taming EHR Data

Analysis – Steps 2 + 3

PCA on the similarity matrix, visualisation & clustering

Natural co-morbidity: Diabetes is a risk factor for angina due to its accelerating effect on atherosclerosis
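Steps 2 and 3 can be sketched with NumPy. Several choices here are assumptions, since the slides do not name them: each row of the similarity matrix is treated as a patient's feature vector, PCA is computed via SVD of the centred matrix, and clustering uses a small k-means with a deterministic farthest-point start. The 5 x 5 matrix is invented and the function names are hypothetical.

```python
import numpy as np


def pca_project(S: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project the rows of S onto the top principal components."""
    X = S - S.mean(axis=0)                       # centre each column
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T               # patient coordinates in low-D space


def farthest_point_init(points: np.ndarray, k: int) -> np.ndarray:
    """Deterministic start: first point, then repeatedly the point farthest from all centres."""
    centres = [points[0]]
    while len(centres) < k:
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centres], axis=0)
        centres.append(points[d.argmax()])
    return np.array(centres)


def kmeans(points: np.ndarray, k: int, n_iter: int = 20) -> np.ndarray:
    """Plain k-means (assumed here; the clustering method is not named on the slides)."""
    centres = farthest_point_init(points, k)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # assign each patient to nearest centre
        for c in range(k):
            members = points[labels == c]
            if len(members):                     # keep the old centre if a cluster empties
                centres[c] = members.mean(axis=0)
    return labels


# Invented similarity matrix: patients 0-2 resemble each other, as do patients 3-4.
S = np.array([
    [1.00, 0.90, 0.80, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.15, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.10],
    [0.10, 0.15, 0.20, 1.00, 0.90],
    [0.20, 0.10, 0.10, 0.90, 1.00],
])
coords = pca_project(S, n_components=2)
labels = kmeans(coords, k=2)
```

The two-dimensional `coords` are what would be plotted for visualisation; `labels` gives the cluster of each patient.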

Page 10: Taming EHR Data

Discussion & Conclusion

Review & Outlook

• Patients with similar diagnosis codes are grouped together
• Therefore, the semantic similarity technique works, to some degree
• Therefore, this is a viable route to dimensionality reduction in complex healthcare data sets

New biomedical hypotheses?

Transferability of method?

Population level characterisation?

New data mining paradigms?

Exploring co-morbidity and co-treatment effects?

Page 11: Taming EHR Data

Thank You!
