Taming EHR Data
Jim Weatherall, PhD
Head, Advanced Analytics Centre, AstraZeneca
Visiting Lecturer, School of Computer Science, University of Manchester
14th World Congress on Medical & Health Informatics, August 2013, Copenhagen
Using Semantic Similarity to Reduce Dimensionality
On behalf of the authors:
Leila Kalankesh, School of Computer Science, UoM
James Weatherall, AstraZeneca
Thamer Ba-Dhfari, School of Computer Science, UoM
Iain Buchan, Institute of Population Health, UoM
Andy Brass, School of Computer Science, UoM
Biometrics & Information Sciences | GMD
Introduction
Problems with mining healthcare data
J.Weatherall | August 2013
Read Code   Rubric
C10F.       Type II Diabetes Mellitus
1372.       Trivial smoker < 1 cig/day
bd3j.       Prescription of “Atenolol 25mg tablets”
G20.        Essential hypertension
2469.       Measurement of Diastolic Blood Pressure
246A.       Assessment of Diastolic Blood Pressure
Research not primary purpose for collection
100s of 1000s of codes
10s of 1000s of dimensions
Large collections not easily visualised or interpreted
Data
The Salford Integrated Record (SIR)
• Population ~220,000
• Integrated primary and secondary care information
• Individual Read Code entries captured in primary care information systems
  • Codes for diagnosis
  • Codes for procedures
• All clinical transactions in primary care and some in secondary care
• Data extract for this analysis based on:
  • GP data in date range 2003-2009, containing 136M Read code entries
  • Selected 24K patients with chronic conditions, containing 443K Read code entries
Methods
Semantic Similarity
How alike are the meanings of two terms?
From Sanchez, J.Biomed.Inform, 2011
Measure ontological distance?
Or not?
Measure depth?
Methods
Semantic Similarity – which method?
Semantic Similarity Method
• Ontological
  • Node-based
  • Edge-based
  • Hybrid
• Corpus-based
  • Frequency
  • Context
  • Proximity
• Combined
An ontology of methods!
Semantic similarity calculation
The Resnik measure
1. Term probability, based on frequency, including descendants and annotations:
   $P(c) = \frac{1}{N} \sum_{c' \in \mathrm{codes}(c)} \mathrm{count}(c')$
2. Log transformation, gives “Information Content”:
   $IC(c) = -\log P(c)$
3. IC of “Most Informative Common Ancestor” gives similarity measure:
   $sim_{Res}(c_1, c_2) = IC(C_{MICA})$
P. Resnik, “Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language”, J Artif Intell Res, 1999
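The three steps of the Resnik measure (term probability, information content, most informative common ancestor) can be sketched in a few lines of Python. This is only an illustration: the taxonomy, the term names and the counts below are invented toy values, not Read codes or figures from the Salford data, and the helper names are hypothetical.

```python
import math

# Hypothetical toy taxonomy (child -> parent); these are NOT real Read codes.
PARENT = {
    "type2_diabetes": "diabetes",
    "type1_diabetes": "diabetes",
    "diabetes": "endocrine",
    "thyroid_disorder": "endocrine",
    "endocrine": "root",
    "hypertension": "circulatory",
    "angina": "circulatory",
    "circulatory": "root",
}

# Hypothetical number of times each term is recorded directly.
COUNT = {
    "type2_diabetes": 40,
    "type1_diabetes": 10,
    "thyroid_disorder": 10,
    "hypertension": 30,
    "angina": 10,
}


def ancestors(term):
    """Return the term itself followed by its ancestors up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def probabilities(count):
    """Step 1: P(c) = occurrences of c or any descendant, divided by N."""
    n = sum(count.values())
    cumulative = {}
    for term, k in count.items():
        for a in ancestors(term):
            cumulative[a] = cumulative.get(a, 0) + k
    return {term: k / n for term, k in cumulative.items()}


def information_content(p):
    """Step 2: IC(c) = -log P(c)."""
    return -math.log(p)


def resnik(c1, c2, prob):
    """Step 3: IC of the most informative common ancestor of c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(prob[a]) for a in common)


PROB = probabilities(COUNT)
SIM_SIBLINGS = resnik("type2_diabetes", "type1_diabetes", PROB)
SIM_UNRELATED = resnik("type2_diabetes", "hypertension", PROB)
```

In this toy setup the two diabetes terms share the ancestor "diabetes", which covers half of all recorded terms, so their similarity is log 2. Terms whose only common ancestor is the root score zero, because the root has probability 1.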
Analysis Plan
Stepwise approach to dimensionality reduction
1. Map patient records from diagnosis space into a similarity space
2. Map patient records into a low-dimensional vector space via PCA
3. Project patient records onto low-dimensional vector space and cluster patients by similarity
Analysis – Step 1
Mapping from diagnosis space to similarity space
“The Similarity Matrix”
      p1           p2           …   pn
p1    sim(p1,p1)   sim(p1,p2)   …   sim(p1,pn)
p2    sim(p2,p1)   sim(p2,p2)   …   sim(p2,pn)
…     …            …            …   …
pn    sim(pn,p1)   sim(pn,p2)   …   sim(pn,pn)
pi = patient i
sim(pi,pj) = similarity score between patients i and j
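A patient-by-patient similarity matrix of this shape could be built as in the sketch below. The slides do not say how code-level similarities are combined into a patient-level score, so the best-match average used here is an assumption chosen for illustration. The code labels and similarity values are also invented.

```python
# Hypothetical code-level similarity scores (symmetric). In practice these
# would come from a semantic similarity measure; unlisted pairs score 0.
CODE_SIM = {
    ("T2DM", "T2DM"): 2.0,
    ("HTN", "HTN"): 1.5,
    ("ANGINA", "ANGINA"): 2.5,
    ("HTN", "ANGINA"): 1.0,
}


def code_sim(a, b):
    """Look up the similarity of two codes regardless of argument order."""
    return CODE_SIM.get((a, b), CODE_SIM.get((b, a), 0.0))


def patient_sim(codes_a, codes_b):
    """Best-match average: an ASSUMED aggregation, not taken from the slides.

    For each code of one patient, take its best match among the other
    patient's codes, average those, and symmetrise over both directions.
    Both code sets must be non-empty.
    """
    def directed(xs, ys):
        return sum(max(code_sim(x, y) for y in ys) for x in xs) / len(xs)

    return 0.5 * (directed(codes_a, codes_b) + directed(codes_b, codes_a))


def similarity_matrix(patients):
    """Fill the symmetric n x n matrix of sim(pi, pj) values."""
    n = len(patients)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = patient_sim(patients[i], patients[j])
    return matrix


# Three hypothetical patients, each represented by a set of diagnosis codes.
PATIENTS = [{"T2DM", "HTN"}, {"T2DM", "ANGINA"}, {"HTN"}]
S = similarity_matrix(PATIENTS)
```

Each row of `S` then describes one patient by their similarity to every other patient, which is the representation the later PCA step works on.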
Analysis – Steps 2 + 3
PCA on the similarity matrix, visualisation & clustering
Natural co-morbidity: Diabetes is a risk factor for angina due to its accelerating effect on atherosclerosis
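Steps 2 and 3 can be illustrated with a small, dependency-free sketch: centre the rows of a similarity matrix, find the leading principal component by power iteration, project the patients onto it, and split them into two groups. The slides do not name a clustering algorithm, so the one-dimensional two-means split is a stand-in, and only one component is kept for brevity. The 4 x 4 matrix is invented.

```python
import math


def center_columns(rows):
    """Subtract each column mean so the data are centred for PCA."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of the centred rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_component(cov, iterations=200):
    """Power iteration: approximates the first principal component."""
    v = [1.0 + 0.1 * i for i in range(len(cov))]  # deterministic start vector
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Score of each patient along the given component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


def two_means_1d(scores, iterations=20):
    """Stand-in clustering (an assumption): 1-D k-means with k = 2."""
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return [0] * len(scores)
    labels = []
    for _ in range(iterations):
        labels = [0 if abs(s - lo) <= abs(s - hi) else 1 for s in scores]
        group0 = [s for s, lab in zip(scores, labels) if lab == 0]
        group1 = [s for s, lab in zip(scores, labels) if lab == 1]
        lo, hi = sum(group0) / len(group0), sum(group1) / len(group1)
    return labels


# Hypothetical similarity matrix: patients 0-1 and 2-3 form two groups.
S_TOY = [
    [2.0, 1.8, 0.2, 0.1],
    [1.8, 2.0, 0.1, 0.2],
    [0.2, 0.1, 2.0, 1.7],
    [0.1, 0.2, 1.7, 2.0],
]
CENTERED = center_columns(S_TOY)
SCORES = project(CENTERED, leading_component(covariance(CENTERED)))
LABELS = two_means_1d(SCORES)
```

With real data one would normally keep two or three components for plotting and use a library routine for the eigendecomposition. The sign of a principal component is arbitrary, so only the grouping is meaningful, not which group gets label 0.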
Discussion & Conclusion
Review & Outlook
• Patients with similar diagnosis codes are grouped together
• Therefore, the semantic similarity technique works, to some degree
• Therefore, this is a viable route to dimensionality reduction in complex healthcare data sets
New biomedical hypotheses?
Transferability of method?
Population level characterisation?
New data mining paradigms?
Exploring co-morbidity and co-treatment effects?
Thank You!