Health Data Science
Wellcome Genome Campus Conference Centre, Hinxton, Cambridge, UK
11-12 June 2019
Scientific Programme Committee:
John Danesh
University of Cambridge, UK
Caroline Relton
University of Bristol, UK
Cathie Sudlow
University of Edinburgh, UK
Tweet about it: #HDS19
@ACSCevents /ACSCevents /c/WellcomeGenomeCampusCoursesandConferences
Wellcome Genome Campus Scientific Conferences Team:
Rebecca Twells
Head of Advanced Courses and
Scientific Conferences
Treasa Creavin
Scientific Programme
Manager
Nicole Schatlowski
Scientific Programme
Officer
Jemma Beard
Conference and Events
Organiser
Lucy Criddle
Conference and Events
Organiser
Sarah Offord
Conference and Events
Office Administrator
Zoey Willard
Conference and Events
Organiser
Laura Wyatt
Conference and Events
Manager
Dear colleague,
I would like to offer you a warm welcome to the Wellcome Genome Campus Advanced Courses and
Scientific Conferences: Health Data Science. I hope you will find the talks interesting and stimulating,
and find opportunities for networking throughout the schedule.
The Wellcome Genome Campus Advanced Courses and Scientific Conferences programme is run on a
not-for-profit basis, heavily subsidised by the Wellcome Trust.
We organise around 50 events a year on the latest biomedical science for research, diagnostics and
therapeutic applications for human and animal health, with world-renowned scientists and clinicians
involved as scientific programme committees, speakers and instructors.
We offer a range of conferences and laboratory-, IT- and discussion-based courses, which enable the
dissemination of knowledge and discussion in an intimate setting. We also organise invitation-only
retreats for high-level discussion on emerging science, technologies and strategic direction for select
groups and policy makers. If you have any suggestions for events, please contact me at the email address
below.
The Wellcome Genome Campus Scientific Conferences team are here to help this meeting run smoothly,
and at least one member will be at the registration desk between sessions, so please do come and ask
us if you have any queries. We also appreciate your feedback and look forward to your comments to
continually improve the programme.
Best wishes,
Dr Rebecca Twells
Head of Advanced Courses and Scientific Conferences
[email protected]
General Information
Conference Badges
Please wear your name badge at all times to promote networking and to assist staff in identifying you.
Scientific Session Protocol
Photography and audio or video recording of the scientific sessions, including the poster session, are not
permitted.
Social Media Policy
To encourage the open communication of science, we would like to support the use of social media at
this year’s conference. Please use the conference hashtag #HDS19. You will be notified at the start of
a talk if a speaker does not wish their talk to be open. For posters, please check with the presenter to
obtain permission.
Internet Access
Wifi access instructions:
Join the ‘ConferenceGuest’ network
Enter your name and email address to register
Click ‘continue’ – this will provide a few minutes of wifi access and send an email to the
registered email address
Open the registration email, follow the link ‘click here’ and confirm the address is valid
Enjoy seven days’ free internet access!
Repeat these steps on up to 5 devices to link them to your registered email address
Presentations
Please provide an electronic copy of your talk to a member of the AV team who will be based in the
meeting room.
Poster Sessions
Posters will be displayed throughout the conference. Please display your poster in the Conference
Centre on arrival. There will be one poster session during the conference, taking place on Tuesday
11 June at 18:00 – 19:30. The page number of your abstract in the abstract book indicates your
assigned poster board number. An index of poster numbers appears in the back of this book.
Conference Meals and Social Events
Lunch and dinner will be served in the Hall. Please refer to the conference programme in this book as
times will vary based on the daily scientific presentations. Please note there are no lunch or dinner
facilities available outside of the conference times.
All conference meals and social events are for registered delegates. Please inform the conference
organiser if you are unable to attend the dinner.
A cash bar will be open from 19:00 – 23:00 each day.
Dietary Requirements
If you have advised us of any dietary requirements, you will find a coloured dot on your badge. Please
make yourself known to the catering team and they will assist you with your meal request.
If you have a gluten or nut allergy, please note that we are unable to guarantee that dishes are free of
gluten or nuts, even if these are not used as direct ingredients, because gluten and nut ingredients are
used in the kitchen.
For Wellcome Genome Campus Conference Centre Guests
Check in
If you are staying on site at the Wellcome Genome Campus Conference Centre, you may check into
your room from 14:00. The Conference Centre reception is open 24 hours.
Breakfast
Your breakfast will be served in the Hall restaurant from 07:30 – 09:00
Telephone
If you are staying on-site and would like to use the telephone in your room, you will need to
contact the Reception desk (Ext. 5000) to have your phone line activated – they will require your
credit card number and expiry date to do so.
Departures
You must vacate your room by 10:00 on the day of your departure. Please ask at reception for
assistance with luggage storage in the Conference Centre.
Taxis
Please find a list of local taxi numbers on our website. The Conference Centre reception will also
be happy to book a taxi on your behalf.
Return Ground Transport
Complimentary return transport has been arranged for 15:30 on Wednesday, 12 June to Cambridge
station and city centre (Downing Street), and Stansted and Heathrow airports.
A sign-up sheet will be available at the conference registration desk from 15:30 on Tuesday, 11 June.
Places are limited so you are advised to book early.
Please allow a 30-40-minute journey time to both Cambridge city and Stansted airport, and up to 3
hours to Heathrow airport.
Messages and Miscellaneous
Lockers are located outside the Conference Centre toilets and are free of charge.
All messages will be available for collection from the registration desk in the Conference Centre.
A number of toiletry and stationery items are available for purchase at the Conference Centre
reception. Cards for our self-service laundry are also available.
Certificate of Attendance
A certificate of attendance can be provided. Please request one from the conference organiser
based at the registration desk.
Contact numbers
Wellcome Genome Campus Conference Centre – 01223 495000 (or Ext. 5000)
Wellcome Genome Campus Conference Organiser (Zoey) – 07747 024256
If you have any queries or comments, please do not hesitate to contact a member of staff who will
be pleased to help you.
Conference Summary
Tuesday, 11 June 2019
09:00 – 10:00 Registration with refreshments
10:00 – 10:10 Welcome and introduction
10:10 – 11:10 Keynote lecture: Andrew Morris, Health Data Research UK, UK
11:10 – 12:40 Session 1: Computational medicine
12:40 – 14:00 Lunch
14:00 – 15:30 Session 2: Precision medicine
15:30 – 16:00 Afternoon tea
16:00 – 17:30 Session 3: Integrating multi-omics into health research
17:30 – 18:00 Lightning talks for poster session abstracts
18:00 – 19:30 Poster session and drinks reception
19:30 Served conference dinner
Wednesday, 12 June 2019
09:00 – 10:30 Session 4: Molecular epidemiology and population health
10:30 – 11:00 Morning coffee
11:00 – 12:00 Session 5: Platform and technologies for health data science
12:00 – 13:30 Lunch
13:30 – 15:00 Session 6: Digital health
15:00 – 15:10 Closing remarks
15:10 – 15:30 Take away refreshment
15:30 Coaches depart to Cambridge City Centre and Train Station, and Heathrow
Airport via Stansted Airport
Lectures to be held in the Francis Crick Auditorium
Lunch and dinner to be held in the Hall Restaurant
Poster sessions to be held in the Conference Centre
Spoken presentations - If you are an invited speaker, or your abstract has been selected for a
spoken presentation, please give an electronic version of your talk to the AV technician.
Poster presentations – If your abstract has been selected for a poster, please display this in the
Conference Centre on arrival.
Conference programme
Tuesday, 11 June 2019
09:00 - 10:00 Registration
10:00 - 10:10 Welcome and Introduction
Programme Committee: Caroline Relton, University of Bristol, UK
10:10 - 11:10 Keynote Lecture
Chair: Caroline Relton, University of Bristol, UK
Health Data Research UK (HDR UK): A national institute for health data
science
Andrew Morris
Health Data Research UK, UK
11:10 - 12:40 Session 1: Computational medicine
Chair: John Danesh, University of Cambridge, UK
11:10 Using the EHR for research and intervention and enriching the clinical
phenotype through smartphones and wearables
Richard Dobson
King’s College London, UK
11:40 Large-scale correction of faulty diagnoses in population-wide Danish
registry data
Isabella Friis Joergensen
Novo Nordisk Foundation Center for Protein Research, Denmark
12:10 Mining health and healthcare utilization patterns during pregnancy to
discover risk factors for postpartum depression
Jyotishman Pathak
Weill Cornell Medicine, USA
12:25 Validating the identification of neurological diseases in UK Biobank
Kristiina Rannikmae
University of Edinburgh, UK
12:40 - 14:00 Lunch
14:00 - 15:30 Session 2: Precision medicine
Chair: Cathie Sudlow, University of Edinburgh, UK
14:00 Characterization of inpatient trajectories and analysis of missingness
patterns from Electronic Health Records data
Maria Herrero-Zazo
EMBL-EBI, UK
14:15 Implementing genomic medicine in the US and globally
Teri Manolio
Division of Genomic Medicine NHGRI, USA
14:45 Understanding genetic effects on gene expression in health and disease
Sarah Kim-Hellmuth
New York Genome Center, USA
15:15 Validation of a machine learning algorithm for diagnosing type 1
myocardial infarction on a population-scale dataset
Dimitrios Doudesis
University of Edinburgh, UK
15:30 - 16:00 Afternoon Tea
16:00 - 17:30 Session 3: Integrating multi-omics into health research
Chair: Teri Manolio, Division of Genomic Medicine NHGRI, USA
16:00 Regulation of the plasma proteome by polygenic disease risk
Michael Inouye
University of Cambridge, UK
16:30 (Gen)-omics of obesity
Cecilia Lindgren
Li Ka Shing Centre - BDI, UK
17:00 A transcriptome-wide Mendelian randomization study to uncover
tissue-dependent regulatory mechanisms across the human phenome
Tom Richardson
MRC Integrative Epidemiology Unit, UK
17:15 Common genetic variation influences the probability that new-born
babies are classified as clinically small or large for gestational age
Robin Beaumont
University of Exeter, UK
17:30 - 18:00 Lightning Talks
Chair: John Danesh, University of Cambridge, UK
18:00 - 19:30 Poster Session with Drinks Reception
19:30 Conference Dinner
Wednesday, 12 June 2019
09:00 - 10:30 Session 4: Molecular epidemiology and population health
Chair: Caroline Relton, University of Bristol, UK
09:00 Polygenic risk scores and potential clinical applications
Brent Richards
McGill University, Canada
09:30 Taking genomics into the clinic: Whole-genome sequencing of 13,000
rare disease patients in a national healthcare system
Chris Penkett
NIHR BioResource, UK
10:00 Antenatal exposure to ultraviolet B radiation and learning disabilities:
Population cohort study of 422,512 children
Claire Hastie
University of Glasgow, UK
10:15 A comparison of stroke in urban and rural populations: Exploring
stroke risk and survival in Lancashire
Frances Biggin
Lancaster University, UK
10:30 - 11:00 Morning Coffee
11:00 - 12:00 Session 5: Platform and technologies for health data science
Chair: Michael Inouye, University of Cambridge, UK
11:00 A novel unsupervised approach to biomedical named entity
recognition and linking using neural networks
Zeljko Kraljevic
King's College London, UK
11:15 Creating knowledge graphs from literature to enable prioritisation of
potential drug targets
Joseph Mullen
SciBite Ltd, UK
11:30 The Accelerating Medicines Partnership Type 2 Diabetes Knowledge
Portal
Aravind Sankar
European Bioinformatics Institute (EMBL-EBI), UK
11:45 Using Privacy Enhancing Technologies (PETs) to integrate clinical
phenotypes, routine healthcare data and genomics in the public cloud
Neil Walker
NIHR BioResource, UK
12:00 - 13:30 Lunch
13:30 - 15:00 Session 6: Digital health
Chair: Richard Dobson, King’s College London, UK
13:30 Creating ‘operating systems’ for biomedical discovery and care
Calum MacRae
Harvard Medical School, USA
14:00 Refining national approaches to low value care
Aoife Molloy
Imperial College London, UK
14:15 Using wearable devices and genetics to estimate and validate
mechanisms of sleep
Andrew Wood
University of Exeter, UK
14:30 Towards a self-learning healthcare system
Tianxi Cai
Harvard University, USA
15:00 - 15:10 Closing Remarks
Programme Committee: Cathie Sudlow, University of Edinburgh, UK
15:10 - 15:30 Take Away Refreshments
15:30 Coaches depart to Cambridge City Centre and Train Station, and
Heathrow Airport via Stansted Airport
These abstracts should not be cited in bibliographies. Materials contained herein should be
treated as personal communication and should be cited as such only with consent of the
author.
Spoken Presentations
Health Data Research UK (HDR UK): A national institute for health data science
Andrew Morris
Health Data Research UK, UK
Healthcare is arguably the last major industry to be transformed by the information age.
Deployments of information technology have only scratched the surface of possibilities for
the potential influence of information and computer science on the quality and cost-
effectiveness of healthcare. In this talk, the vision, objectives and scientific strategy of HDR
UK will be discussed; specifically, the opportunities provided by computer science, artificial
intelligence and "big data" to transform health care delivery models. Examples will be given
from nationwide research and development programmes that integrate electronic patient
records with biologic and health system data. Two themes will be explored; specifically:
How the size of the UK (65M residents), allied to a relatively stable population and unified
health care structures, facilitates the application of data science to support nationwide
quality-assured provision of care.
How population-based datasets can be integrated with biologic information and enabled by
data science to facilitate (i) epidemiology; (ii) drug safety studies; (iii) enhanced efficiency of
clinical trials through automated follow-up of clinical events and treatment response; and, (iv)
the conduct of large-scale genetic, pharmacogenetics, and family-based studies essential for
precision medicine.
Using the EHR for research and intervention and enriching the clinical phenotype
through smartphones and wearables
Richard Dobson
King's College London, UK
Digital health records present potentially transformative opportunities for novel research of
direct clinical relevance. There is much value in the unstructured portion of the electronic
health record (EHR), often in the form of rich narrative, which is currently unused, yet
important for understanding health interactions that are either not recorded in, or are less
obvious from, the structured data (e.g. multi-morbidity, cancer diagnosis). I will describe
efforts (and open source tools) to use this unstructured data in hospital records to enable
research-ready, actionable, real-time and large-scale EHRs.
The integration of NLP-derived phenotypes from EHRs with other modalities including
imaging, mobile health and genomics will help generate a more complete picture of the
patient. The data streamed from devices such as wearables and smartphones enables the
shift from sporadic and sparse patient engagement to pervasive and continuous monitoring
throughout the disease continuum from pre-diagnosis (risk, health and wellness), through
diagnosis (early diagnosis, progression monitoring, non-pharmacological interventions) to
post-intervention (adherence, relapse prediction, efficacy monitoring). As such, remote monitoring
provides insight into what is happening to patients in real time, and also a direct intervention
by feeding back to patients. I will describe progress and challenges faced in mobile health
studies such as the IMI2 RADAR-CNS (RADAR-CNS.org) programme and the generalised
open source platforms developed to support this work.
Large-scale correction of faulty diagnoses in population-wide Danish registry data
Isabella Friis Joergensen1, Søren Brunak1
1Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3B, DK-2200 Copenhagen N, Denmark.
Diagnostic errors are common, and it has been estimated that everyone will experience at
least one diagnostic error in their lifetime. Common and potentially harmful diagnostic errors
include under-, mis- and over-diagnoses. We use Chronic Obstructive Pulmonary Disease
(COPD) to showcase a new approach to identify faulty diagnoses. Underdiagnosis of COPD
has been estimated to be 50-80%, while 5-60% are misdiagnosed. Earlier studies evaluate
the airway obstruction in population-wide data and identify undiagnosed patients as people
with airway obstruction but no COPD diagnosis, while misdiagnosed patients have a COPD
diagnosis, but no airway obstruction. Here we use an alternative disease trajectory
approach.
We used the Danish National Patient Registry containing hospital diagnoses for the entire
Danish population (6.9 million people) covering more than two decades. COPD patients
were used to identify common, temporal diagnostic correlations that were combined into
longer disease trajectories. These typical disease trajectories were compared with non-
COPD patients to discover undiagnosed COPD. Low similarity between COPD patients and
typical disease trajectories identified mis- and over-diagnosed patients.
In total, 284,154 COPD patients resulted in 61,521 and 428 common disease trajectories
with three and five consecutive diseases, respectively. This systematic, data-driven analysis
of common disease trajectories and patient similarity from an entire population identified
3,569 potentially under-diagnosed COPD patients, whose comorbidities are very similar to those of
COPD patients. The potentially undiagnosed patients were repeatedly diagnosed with
respiratory infections, like influenza and pneumonia, indicating undiagnosed COPD.
Furthermore, we identified 4,553 potentially mis- or over-diagnosed COPD patients that do
not display the typical COPD comorbidities. The group of potentially under- and mis-
diagnosed patients die significantly earlier and have significantly fewer respiratory tests than
the general COPD patient. They die shortly after the COPD diagnosis and more than 10%
are diagnosed with lung cancer.
The method is entirely general and not limited to COPD and could improve the diagnostics
processes to reduce errors, avoid harm, and find optimal diagnoses faster in population-wide
health data.
Keywords: Diagnostic error, underdiagnosis, misdiagnosis, overdiagnosis, Chronic
Obstructive Pulmonary Disease
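As a rough illustration of the trajectory idea described in this abstract, the sketch below counts directed diagnosis pairs (A recorded before B) across patients, keeps the pairs seen in enough patients, and chains overlapping pairs into three-step trajectories. The toy histories, the `min_patients` threshold and all function names are assumptions made for illustration. The abstract does not state how significant temporal correlations were selected, so a simple count threshold stands in for that step here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, time-ordered diagnosis histories (not real registry data).
histories = {
    "p1": ["bronchitis", "pneumonia", "COPD"],
    "p2": ["bronchitis", "pneumonia", "COPD"],
    "p3": ["bronchitis", "COPD"],
    "p4": ["pneumonia", "fracture"],
}

def frequent_pairs(histories, min_patients=2):
    """Return directed pairs (a, b) where a precedes b in at least min_patients patients."""
    counts = Counter()
    for diagnoses in histories.values():
        # Keep only the first occurrence of each diagnosis, preserving order.
        ordered = list(dict.fromkeys(diagnoses))
        # combinations() preserves input order, so a always precedes b.
        for a, b in combinations(ordered, 2):
            counts[(a, b)] += 1
    return {pair for pair, n in counts.items() if n >= min_patients}

def chain_pairs(pairs):
    """Join pairs (a, b) and (b, c) into three-step trajectories (a, b, c)."""
    return sorted(
        (a, b, c)
        for a, b in pairs
        for b2, c in pairs
        if b == b2 and c != a
    )

def follows(diagnoses, trajectory):
    """True if the trajectory occurs as an ordered subsequence of the history."""
    remaining = iter(diagnoses)
    return all(step in remaining for step in trajectory)

trajectories = chain_pairs(frequent_pairs(histories))
print(trajectories)
```

A patient without a COPD code whose history follows the early steps of a common trajectory, checked for example with `follows(history, trajectory[:2])`, would be a candidate for closer review under this simplified scheme.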
Mining health and healthcare utilization patterns during pregnancy to discover risk factors for postpartum depression
Shuojia Wang, MS; Alison Herman, MD; Jyotishman Pathak, PhD; Yiye Zhang, PhD, MS
Weill Cornell Medicine, Cornell University, New York, NY, USA; School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
Objective: The US Preventive Services Task Force recently asserted a level B
recommendation that women at risk for postpartum depression be offered or referred for
preventive psychotherapy. However, there are significant knowledge gaps in determining
which women are at risk, particularly in regard to medical contributors. Furthermore, scales
that have been developed thus far to assess risk do not consider how risk may evolve over
the course of pregnancy and may have limited feasibility on a population-wide scale. This
project attempts to address this screening problem using machine learning technology and
electronic health records to describe patterns of health and healthcare utilization during
pregnancy that may predict risk for postpartum depression (PPD).
Methods: Using electronic health record data from Weill Cornell Medicine and NewYork-
Presbyterian Hospital from 2015 to 2017, we followed pregnant women's (n=9978) clinical
visits from their first trimesters to childbirth. Clinical events, such as billing diagnoses and
medication orders from physician encounters, were mined using machine learning
algorithms and modeled into clusters based on patterns of healthcare utilization. Clusters
were subsequently analyzed in relation to the outcome of PPD within one year after delivery.
Results: We discovered three clusters, which differ in the distribution of demographics,
health, and healthcare utilization. The cluster with the highest prevalence of PPD had
statistically significantly higher age, higher body mass index, and a higher likelihood of being
unmarried. This cluster also had statistically significantly higher rates of medical
and/or obstetric complications during pregnancy and a higher number of thyroid-related
prescriptions.
Conclusions: Applying machine learning algorithms to electronic health record data may be a
powerful tool for more accurately identifying women at risk for PPD, particularly when that
risk changes over the course of pregnancy due to medical or obstetric complications.
Acknowledgments: Research reported in this manuscript is supported in part by the Walsh
McDermott Scholarship, NIH R01 MH105384, NIH P50 MH113838, and the Chinese
Scholarship Council.
Validating the identification of neurological diseases in UK Biobank
Kristiina Rannikmae, Tim Wilkinson, Kathryn Bush, Kenneth Ngoh, Naomi Allen, Christian Schnier, Qiuli Zhang, John Nolan, Debbie Bathgate, Michael Bennett, David Breen, Jingjie Cheng, Richard Davenport, Fergus Doubal, Gordon Duncan, Robin Flaig, David Henshall, Aidan Hutchison, Chris Lerpiniere, Conor Maguire, John O’Brien, Scott Osborne, Suvankar Pal, Tom Russ, Rustam Al-Shahi Salman, Neshika Samarasekera, Naomi Warren, Will Whiteley, Kirsty Wilson, Rebecca Woodfield, Cathie Sudlow
KR: Centre for Medical Informatics, University of Edinburgh, Edinburgh, UK and UK Biobank
Background: Disease cases among the 500,000 UK Biobank (UKB) participants can be
identified through self-report and linkage to routinely collected coded national health data.
We assessed the accuracy of these datasets in identifying neurological diseases.
Methods: We (1) created disease code lists for stroke, Parkinson's disease (PD) and
dementia; (2) identified all Edinburgh-based UKB participants with a code(s) in hospital
admission, primary care, death record data or self-report; (3) reviewed participants' medical
records to generate reference standard diagnoses; (4) calculated positive predictive values
(PPV) for all codes combined and stratified by source.
Results: There were 225 stroke, 78 PD and 120 dementia coded cases among 17,249
participants. 39% to 53% of codes occurred only in primary care data. PPVs ranged from
79% to 91%; they were higher for stroke and lower for PD in hospital versus primary care
data (89% versus 80%; 84% versus 95%), and similar for dementia in both datasets (87%).
There were too few cases in death record and self-report data to draw firm conclusions.
Conclusions: Neurological disease cases in UKB can be identified through linked coded data
with sufficient accuracy for many genetic and epidemiological studies. Primary care data is
an important source.
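The positive predictive value used in this abstract is the proportion of coded cases that the reference standard confirms. The sketch below shows that calculation stratified by data source. The counts are hypothetical placeholders rather than the study's data, and the Wilson score interval is an added assumption, since the abstract does not say how (or whether) confidence intervals were computed.

```python
import math

def ppv(confirmed, coded):
    """Positive predictive value: confirmed cases divided by all coded cases."""
    return confirmed / coded

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical counts per source: (confirmed by record review, coded cases).
coded_cases = {
    "hospital_admissions": (45, 50),
    "primary_care": (32, 40),
}

for source, (confirmed, coded) in coded_cases.items():
    low, high = wilson_interval(confirmed, coded)
    print(f"{source}: PPV={ppv(confirmed, coded):.2f} (95% CI {low:.2f} to {high:.2f})")
```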
Characterization of inpatient trajectories and analysis of missingness patterns from Electronic Health Records data
Maria Herrero-Zazo1, Thore Buergel2, Tomas Fitzgerald1, Victoria L. Keevil3,4, John Bradley4 and Ewan Birney1
1 The European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom.
2 Health Data Science Unit, University Clinics Heidelberg.
3 Department of Medicine for the Elderly, Addenbrooke's Hospital, Cambridge, United Kingdom.
4 Clinical Gerontology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom.
5 Department of Medicine, NIHR Cambridge Biomedical Research Centre, University of Cambridge, Cambridge, United Kingdom.
The implementation of Electronic Health Record (EHR) systems in secondary care, such as
the EPIC system at Cambridge University Hospitals NHS Foundation Trust (CUHNFT),
represents an unprecedented way to record real-time clinical information at the patient's
bedside. EHR systems collect clinically relevant information about patients, including vital
signs (such as blood pressure or temperature), blood test results and medications, providing
a time-based description of clinical progress from hospital admission to discharge. This
large-scale data provides a unique source of information for studying the relationships
between clinical observations, treatments and final hospital outcomes. While traditional
statistical techniques cannot study these multiple and interrelated relationships in their
entirety, advanced statistical methods such as machine learning algorithms can recognise
patterns within these data and relate multiple patient characteristics to defined outcomes, for
example inpatient death.
We introduce a collaboration between Addenbrooke's Hospital (as part of CUHNFT) and the
European Bioinformatics Institute (EMBL-EBI) to bring together clinical and technical
expertise for the development of machine learning models that can process large amounts of
complex clinical data collected over time during hospital admission episodes.
This clinical longitudinal data is a type of irregular multivariate time series characterized by
high intra- and inter-individual heterogeneity due to unevenly sampled observations, different
lengths of follow-up periods and high rates of missingness. Many time series analysis and
feature extraction approaches rely on regular time series data and therefore imputation of
missing observations is a common pre-processing step.
In this work, we present the results of several imputation methods proposed for time series
data, such as linear interpolation, moving average or last observation carried forward,
applied to various vital signs and laboratory test results data. Missingness patterns are
extracted from EHR data present in the public Multiparameter Intelligent Monitoring in
Intensive Care (MIMIC-III) database and replicated on examples of the same data set
without missing observations. The error between the real and imputed values is calculated
providing an accurate estimate of the performance of these imputation methods on the
actual data. The methodology is flexible, enabling the analysis of novel imputation
approaches and their comparison with baseline methods providing an imputation
assessment framework that can be applied to clinical data from different EHR systems.
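The evaluation scheme described above (remove values from a complete series according to a missingness pattern, impute them, then compare with the true values) can be sketched as follows. This is a minimal illustration under stated assumptions: the series is regularly sampled, the heart-rate values and mask are invented, and mean absolute error stands in for whichever error measure the authors used.

```python
def apply_mask(complete, mask):
    """Hide the values flagged True in mask, mimicking an observed missingness pattern."""
    return [None if missing else value for value, missing in zip(complete, mask)]

def locf(values):
    """Last observation carried forward; leading gaps stay None."""
    out, last = [], None
    for value in values:
        if value is not None:
            last = value
        out.append(last)
    return out

def linear_interpolate(values):
    """Fill gaps between observed points on a regular index; edge gaps stay None."""
    out = list(values)
    known = [i for i, value in enumerate(values) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out

def mean_absolute_error(complete, imputed, mask):
    """Average absolute error over the masked positions that were actually imputed."""
    errors = [
        abs(true - guess)
        for true, guess, missing in zip(complete, imputed, mask)
        if missing and guess is not None
    ]
    return sum(errors) / len(errors)

# Hypothetical complete heart-rate series and a missingness pattern to replicate.
complete = [80.0, 82.0, 84.0, 86.0, 88.0]
mask = [False, True, True, False, False]
observed = apply_mask(complete, mask)

for name, method in [("LOCF", locf), ("linear", linear_interpolate)]:
    error = mean_absolute_error(complete, method(observed), mask)
    print(f"{name}: MAE={error:.2f}")
```

Because the error is computed only at masked positions, the same functions can score any other imputation method that takes a list with `None` gaps and returns a filled list.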
Implementing genomic medicine in the US and globally
Teri A. Manolio, M.D., Ph.D.
National Human Genome Research Institute (NHGRI), National Institutes of Health
The growing availability of reliable, cost-effective genetic testing and increasing knowledge
about the influence of genetic variation on human health have spurred the implementation of
genomic medicine into clinical care. As defined by the National Human Genome Research
Institute (NHGRI), genomic medicine is an emerging medical discipline that involves using
genomic information about an individual as part of their clinical care, and the health
outcomes and policy implications of that clinical use. Genomic medicine is already advancing
diagnosis, treatment, and prevention in the fields of oncology, pharmacology, rare and
undiagnosed diseases, and infectious diseases. Yet many barriers remain, including limited
evidence of efficacy, scarcity of genomics expertise, lack of standards, and difficulties in
integrating genomic results into electronic medical records.
In the past several years, the governments of over a dozen countries have established
national genomic-medicine initiatives to address these barriers. Some have focused on
genomic testing of large numbers of patients with rare diseases or cancer, paired with
development of workforce and infrastructure. Others have launched large-scale, population-based
sequencing projects returning genomic results directly to participants, and still others are
focusing on development of informatics infrastructure such as data standards and policies
and platforms for data sharing. These research and implementation programs are expected to
speed the evaluation and incorporation, where appropriate, of genomic technologies and
findings into routine clinical care.
Actual adoption of successful approaches in the clinic will depend upon the willingness,
interest, and energy of practitioners, participants, professional societies, and payers to share
their data and experience and to promote the responsible use of these approaches.
Understanding genetic effects on gene expression in health and disease
Sarah Kim-Hellmuth1,2,3
1New York Genome Center, New York, NY, USA; 2Max-Planck-Institute of Psychiatry,
Munich, Germany; 3Department of Systems Biology, Columbia University, New York, NY,
USA
Over the last decade, genome-wide association studies (GWAS) have identified thousands
of genetic variants robustly associated with complex traits and diseases. Detailed
characterization of cellular effects of genetic variants is essential for understanding biological
processes that underlie these genetic associations to disease. One approach to address this
challenge is to map genetic effects on the transcriptome across numerous conditions. In this
talk, I will discuss our recent work within the Genotype-Tissue Expression (GTEx)
consortium, which has built the most comprehensive atlas to date of expression quantitative
trait loci (cis-eQTLs) across human tissues. Using the final v8 data release, with RNA-
sequencing data from 17,382 samples across 49 tissues of 838 individuals with genome
sequencing data, we gain insights into the tissue-, cell type-, sex- and age-specificity of cis-
eQTLs at an unprecedented breadth and depth. Taking into account this context-specificity
of cis-eQTLs significantly improves our understanding of the underlying mechanism of
genetic risk to disease, which can ultimately inform management of disease risk and
treatment.
Validation of a machine learning algorithm for diagnosing type 1 myocardial infarction on a population-scale dataset
Dimitrios Doudesis1,2, Jason Yang2, Thanasis Tsanas1, Catherine Stables2, Anoop Shah2, Atul Anand2, Fiona Strachan2, Ken Lee2, Nicholas Mills1,2
1 Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK. 2 BHF Centre for Cardiovascular Science, University of Edinburgh, Edinburgh, UK.
Introduction
A number of rapid rule-in and rule-out pathways have been developed to assist in the
diagnosis of myocardial infarction in the acute setting. However, most do not take into
account the interaction between troponin levels, age, and gender. A proprietary statistical
model which takes these factors into account to predict the likelihood of type 1 myocardial
infarction (MI) has previously been created and validated (manuscript currently under
review), but it has not yet been tested against a large population-scale dataset.
Methods
We validated the model against a dataset consisting of 48,282 patients from High-STEACS,
a trial evaluating the use of high-sensitivity troponin assays (hs-cTnI) in the diagnosis of
suspected acute coronary syndrome. The model takes into account two consecutive hs-cTnI
results, the time between samples, age, and gender. It outputs the predicted risk of type 1 MI
(0-100).
The area under the receiver-operating-characteristic curve (AUC) was calculated for the
cohort as a whole. We also evaluated model performance statistics (sensitivity, specificity,
negative predictive value [NPV] and positive predictive value [PPV]) at two thresholds (1.6 and 49.7)
suggested previously in the manuscript under review, corresponding to low-risk (rule-out) and
high-risk (rule-in) classification respectively.
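The performance statistics named above can all be read off a confusion matrix at each threshold. The following is a minimal sketch on invented toy scores, not the High-STEACS data or the proprietary model; the function name `metrics_at_threshold` and the convention that a score at or above the threshold counts as test-positive are assumptions made for illustration.

```python
from typing import Dict, Sequence


def metrics_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Dict[str, float]:
    """Confusion-matrix statistics at one threshold.

    Assumption: a score >= threshold is treated as test-positive.
    labels are 1 (type 1 MI) or 0 (no type 1 MI).
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN rather than raising when a cell combination is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Invented scores on the 0-100 scale and invented outcome labels.
toy_scores = [0.5, 1.0, 3.0, 20.0, 55.0, 80.0, 95.0, 1.2]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 0]

rule_out = metrics_at_threshold(toy_scores, toy_labels, 1.6)   # low-risk cut-off
rule_in = metrics_at_threshold(toy_scores, toy_labels, 49.7)   # high-risk cut-off
```

For a rule-out threshold the quantities of interest are sensitivity and NPV; for a rule-in threshold they are specificity and PPV, which is how the results are reported.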
Results
Of the 48,282 patients, 22,057 had two consecutive hs-cTnI results available and were
included in the analysis. Type 1 MI occurred in 3,881 (17.6%) patients. The model was found
to have good discrimination (AUC = 0.95 [95% CI 0.947 to 0.953]) overall. Using the
prespecified thresholds, 60.26% of patients were identified as low-risk (sensitivity 0.993
[95% CI 0.991 to 0.996], NPV 0.998 [95% CI 0.997 to 0.999]) and 16.51% were identified as
high-risk (specificity 0.945 [95% CI 0.942 to 0.949], PPV 0.728 [95% CI 0.715 to 0.741]).
These findings are preliminary, subject to final data quality assurance checks, and are
comparable to the findings of the original analysis in a different dataset.
Conclusion
The model performs favourably when tested against a large population-scale dataset. By
taking into account the interaction between age, gender, and troponin levels, the model
could improve diagnostic pathways for MI by accurately identifying high-risk patients to be
targeted for prompt treatment and allowing early discharge in low-risk patients.
S19
Regulation of the plasma proteome by polygenic disease risk
Michael Inouye
University of Cambridge, UK
Abstract not available at the time of printing.
S21
(Gen)-omics of obesity
Cecilia Lindgren
Li Ka Shing Centre – BDI, UK
Obesity is a result of excess body fat accumulation, which is associated with adverse health
effects such as CVD, type 2 diabetes, and cancer. It’s a multifaceted, common chronic
disease with no current effective treatment strategy. Progress in understanding the aetiology
has been slow up until recently, with findings largely restricted to monogenic, severe forms
of obesity. However, technological and analytical advances have enabled detection of more
than 1000 obesity susceptibility loci and the number is rapidly increasing. These contain
genes suggested to be involved in the regulation of food intake through action in the central
nervous system (overall obesity) as well as in adipocyte function (body fat distribution).
These results provide plausible biological pathways that may, in the future, be targeted as
part of treatment or prevention strategies. Although the proportion of heritability explained by
these genes remains small, their detection heralds a new phase in understanding the
aetiology of common obesity. I will discuss recent genetic and genomic studies of adiposity
that indicate a causal role for general and central adiposity in cardiometabolic disease, and
highlight potential mechanisms including insulin resistance and gene expression.
S23
A transcriptome-wide Mendelian randomization study to uncover tissue-dependent regulatory mechanisms across the human phenome
Tom G Richardson, Gibran Hemani, Tom R Gaunt, Caroline L Relton, George Davey Smith
MRC Integrative Epidemiology Unit (IEU), Population Health Sciences, Bristol Medical School, University of Bristol, Oakfield House, Oakfield Grove, Bristol, BS8 2BN, United Kingdom
Background: Developing insight into tissue-specific transcriptional mechanisms can help
improve our understanding of how genetic variants exert their effects on complex traits and
disease. By applying the principles of Mendelian randomization, we have undertaken a
systematic analysis to evaluate transcriptome-wide associations between gene expression
across 48 different tissue types and 395 complex traits.
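For readers unfamiliar with Mendelian randomization, the simplest single-variant estimator is the Wald ratio: the variant-outcome effect divided by the variant-exposure effect (here the exposure would be gene expression in a given tissue). The sketch below is illustrative only; the abstract does not state which estimator was used, the function name `wald_ratio` is hypothetical, and the input numbers are invented.

```python
from typing import Tuple


def wald_ratio(
    beta_exposure: float, beta_outcome: float, se_outcome: float
) -> Tuple[float, float]:
    """Wald ratio estimate with a first-order standard error.

    Assumption: uncertainty in the variant-exposure effect is ignored,
    so the standard error is se_outcome / |beta_exposure|.
    """
    if beta_exposure == 0:
        raise ValueError("the variant must be associated with the exposure")
    estimate = beta_outcome / beta_exposure
    standard_error = abs(se_outcome / beta_exposure)
    return estimate, standard_error


# Invented summary statistics for one variant, one gene and one trait.
estimate, standard_error = wald_ratio(
    beta_exposure=0.5,   # variant effect on expression
    beta_outcome=0.1,    # variant effect on the trait
    se_outcome=0.02,     # standard error of the trait effect
)
z_score = estimate / standard_error
```

Repeating such a calculation for each gene, tissue and trait combination is what makes an analysis transcriptome-wide and phenome-wide.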
Results: Overall, we identified 100,025 gene-trait associations based on conventional
genome-wide corrections (P < 5 x 10^-8) that also provided evidence of genetic
colocalization. These results indicated that genetic variants which influence gene expression
levels in multiple tissues are more likely to influence multiple complex traits. We identified
many examples of tissue-specific effects, such as genetically-predicted TPO, NR3C2 and
SPATA13 expression only associating with thyroid disease in thyroid tissue. Additionally,
FBN2 expression was associated with both cardiovascular and lung function traits, but only
when analysed in heart and lung tissue respectively. We also demonstrate that conducting
phenome-wide evaluations of our results can help flag adverse on-target side effects for
therapeutic intervention, as well as propose drug repositioning opportunities. Moreover, we
find that exploring the tissue-dependency of associations identified by genome-wide
association studies (GWAS) can help elucidate the causal genes and tissues responsible for
effects, as well as uncover putative novel associations.
Conclusions: The atlas of tissue-dependent associations we have constructed should prove
extremely valuable to future studies investigating the genetic determinants of complex
disease. The follow-up analyses we have performed in this study are merely a guide for
future research. Similar evaluations can be undertaken systematically at
http://mrcieu.mrsoftware.org/Tissue_MR_atlas/.
S25
Common genetic variation influences the probability that new-born babies are classified as clinically small or large for gestational age
Robin N Beaumont, Sarah Kotecha, Bridget Knight, Sylvain Sebert, Andrew T. Hattersley, Marjo-Riitta Järvelin, Nicholas J. Timpson, Sailesh Kotecha, Rachel M Freathy
1. Institute of Biomedical and Clinical Science, University of Exeter, Exeter, UK 2. Department of Child Health, School of Medicine, Cardiff University, Cardiff, UK 3. Center For Life-course Health Research, University of Oulu, Finland 4. Department of Epidemiology and Biostatistics, Imperial College, London, UK 5. Medical Research Council Integrative Epidemiology Unit, University of Bristol, Bristol, UK
Introduction
Birth weight (BW) is an important determinant of neonatal mortality and morbidity. Babies
clinically Small or Large for Gestational Age (SGA or LGA), defined as sex- and gestational
age-adjusted BW below the 10th or above the 90th percentile, respectively, have higher risk
of complications at birth. Babies classified SGA/LGA may have experienced fetal growth
restriction or overgrowth, respectively, but the proportion of SGA/LGA babies experiencing
growth restriction or overgrowth is unknown. Recently, common genetic variants have been
identified for normal BW variation. We used genetic scores (GS) for BW, fasting glucose
(FG) and systolic blood pressure (SBP) to examine the effect of genetics on probability of
SGA and LGA.
Methods
We calculated maternal and fetal GS for BW in participants from the ALSPAC, EFSOCH,
NFBC1966 and NFBC1986 studies (n=12,125 babies and n=5,187 mothers) and maternal
FG and SBP GS from the ALSPAC and EFSOCH cohorts. We tested associations between
the GS and SGA/LGA to assess the extent to which the genetics of normal variation in birth
weight influences probability of being classified as SGA/LGA.
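The two building blocks described above, a genetic score and the percentile-based SGA/LGA classification, can be sketched as follows. This is a toy illustration, not the authors' pipeline: the weights, allele counts and birth weight values are invented, the function names are hypothetical, the label "AGA" for babies between the cut-offs is an added convention, and the percentile method is whatever `statistics.quantiles` uses by default.

```python
import statistics
from typing import Sequence


def genetic_score(allele_counts: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted sum of effect-allele counts (0, 1 or 2 per variant)."""
    if len(allele_counts) != len(weights):
        raise ValueError("one weight is needed per variant")
    return sum(count * weight for count, weight in zip(allele_counts, weights))


def classify_size(adjusted_bw: float, p10: float, p90: float) -> str:
    """Classify a sex- and gestational-age-adjusted birth weight."""
    if adjusted_bw < p10:
        return "SGA"  # below the 10th percentile
    if adjusted_bw > p90:
        return "LGA"  # above the 90th percentile
    return "AGA"      # assumed label for everything in between


# Invented adjusted birth weights (for example, z-scores).
toy_adjusted_bw = [-2.0, -1.0, -0.5, 0.0, 0.1, 0.2, 0.3, 0.6, 1.0, 1.5, 2.2]
deciles = statistics.quantiles(toy_adjusted_bw, n=10)  # nine cut points
p10, p90 = deciles[0], deciles[-1]
size_labels = [classify_size(bw, p10, p90) for bw in toy_adjusted_bw]

# Invented allele counts and per-allele weights for three variants.
toy_score = genetic_score([0, 1, 2], [0.10, 0.20, 0.30])
```

With scores and labels in hand, the association between the two would typically be tested with logistic regression, which is the source of the odds ratios reported in the results.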
Results
Both maternal and fetal GS for BW showed strong association with SGA and LGA. A 1
decile higher maternal BW GS was associated with SGA (odds ratio [OR] 0.80 (95% CI:
0.76, 0.87); P=2.0x10^-9) and LGA (OR 1.23 (1.15, 1.31); P=3.9x10^-10). Fetal BW GS also
showed association with SGA (OR 0.65 (0.60, 0.71); P=2.4x10^-23) and LGA (OR 1.47
(1.36, 1.59); P=1.7x10^-21). Maternal FG GS showed similar association with both SGA (OR 0.74
(0.58, 0.94); P=1.5x10^-2) and LGA (OR 1.49 (1.19, 1.87); P=5.7x10^-4). Maternal SBP GS showed
strong association with SGA (OR 1.31 (1.07, 1.60); P=9.5x10^-3), but weaker association with
LGA, with wide confidence intervals (OR 0.87 (0.72, 1.05); P=1.4x10^-1).
Conclusions
The strong associations of SGA and LGA with the GS suggest a large proportion of babies
classified as SGA and LGA are the tails of the normal range of the BW distribution. In the
case of SGA it suggests a large number of babies classified as SGA are not growth
restricted, but in the lowest quantiles of the normal range of fetal growth.
S27
Polygenic risk scores and potential clinical applications
Brent Richards
McGill University, Canada
One of the goals of the Human Genome Project has been to use genetic information to
identify individuals at increased risk of disease. This goal has been difficult to realize since
most identified genetic variants have small effects and the resultant variance explained in
disease risk has been low. Using very large sample sizes it has become possible to sum up
the effect of many small effect size variants into polygenic risk scores, which when combined
with relatively simple machine learning techniques, have recently achieved clinically-relevant
amounts of variance explained.
In this talk, I will discuss how the genetic prediction of bone density could potentially be
applied to screening programs for osteoporosis—amongst the most common and costly
diseases. We have recently developed a polygenic risk score from 341,449 individuals which
is highly correlated with measured bone density (r = 0.48). Validating this measure
in 10,199 individuals from 5 cohorts, we demonstrate how genetic prediction can help to
improve osteoporosis screening efficiency by removing individuals from the screening
program unlikely to have low bone density. Such incorporation of a genetic prediction
reduces the number of people who need to be screened for osteoporosis by ~50% yet
maintains sensitivity to identify individuals eligible for therapy at 92% and leaves the
specificity largely unchanged.
Nevertheless, multiple barriers exist prior to incorporation of such polygenic risk scores into
clinical practice and I will provide an overview of some of these challenges.
S29
Taking genomics into the clinic: Whole-genome sequencing of 13,000 rare disease patients in a national healthcare system
Christopher J Penkett [1,2], Kathleen Stirrups [1,2], Salih Tuna [1,2], Olga Sharmardina [1,2], Sri V.V. Deevi [1,2], Karyn Mégy [1,2], Rutendo Mapeta [1,2], Ernest Turro [1,2,3], Daniel Greene [1,2,3], Stefan Gräf [1,2], Matthias Haimel [1,2], Hana Lango Allen [1,2], F. Lucy Raymond [1,4,5], Willem H. Ouwehand [1,2,4,6] on behalf of the NIHR BioResource-Rare Diseases Consortium
1. NIHR BioResource, Cambridge University Hospitals, Cambridge, UK. 2. Department of Haematology, Cambridge University, Cambridge, UK. 3. MRC Biostatistics Unit, Cambridge Institute of Public Health, Cambridge, UK. 4. Wellcome Sanger Institute, Hinxton, Cambridge, UK. 5. Department of Medical Genetics, CIMR, Cambridge University, Cambridge, UK. 6. NHS Blood and Transplant, Cambridge Biomedical Campus, Cambridge, UK.
To study genetic sequence variants underlying unresolved Mendelian disorders and improve
interpretation of already identified high penetrance variants, a collection of ~13,000
individuals with a rare disease and their relatives has been whole genome sequenced with
an average 30x coverage. Participants were mainly recruited through the NIHR BioResource
at NHS hospitals in the UK using approved eligibility criteria for 15 different rare disease
domains.
We describe the population structure including ethnicity and relatedness estimation, high
level phenotypes collected using Human Phenotype Ontology (HPO) terms and quality
control and summary metrics for samples and variants. The resource contains over 170
million unique variants in the ~10,000 genetically independent samples with 47% of variants
previously unobserved in other large scale publicly available genome data-sets (e.g.
gnomAD, TOPMed, HGMD, UK10K).
We summarise the curation of gene lists and pertinent findings in diagnostic-grade genes for
the 15 domains. Over 1000 reports assigning pathogenic or likely pathogenic causal variants
have been issued following review by Multi-Disciplinary Teams. We show the power of a
recently developed rapid Bayesian association test, BeviMed, to identify ~25 novel genes
and to provide independent validation of recent rare disease gene discoveries by others.
Some of these novel genes have been reported and led to changes in patient management.
More NIHR BioResource rare disease project details can be found at:
https://bioresource.nihr.ac.uk/rare-diseases/rare-diseases; variants can be browsed at:
https://bioresource.nihr.ac.uk/wgs; and a biorxiv preprint is available here:
https://www.biorxiv.org/content/10.1101/507244v1 .
S31
Antenatal exposure to ultraviolet B radiation and learning disabilities: Population cohort study of 422,512 children
Claire E Hastie, Daniel F Mackay, Tom L Clemens, Mark PC Cherrie, Albert King, Chris Dibben, Jill P Pell
University of Glasgow, University of Edinburgh, Scottish Government
Background
Learning disability varies by month of conception. The underlying mechanism is unknown
but vitamin D, necessary for normal brain development, is commonly deficient over winter in
high latitude countries due to insufficient ultraviolet radiation. This study aimed to determine
whether antenatal exposure to ultraviolet B radiation was associated with learning disability.
Methods
We linked the annual School Pupil Censuses conducted between 2007 and 2016 to
maternity records at an individual level across Scotland. Mean monthly ultraviolet B radiation
levels were derived from NASA satellite data. Univariate and multivariate logistic regression
analyses were used to explore the associations between overall, and trimester-specific,
ultraviolet B exposure and learning disabilities, adjusting for the potential confounding effects
of month of conception and sex.
Findings
Of the 422,512 eligible, singleton schoolchildren born at term in Scotland, 79,616 (18·8%)
had a learning disability. Total antenatal ultraviolet B exposure was associated with learning
disabilities (highest quintile; adjusted OR 0·553, 95% CI 0·509-0·601, p<0·001) with
evidence of a dose-response relationship. The association was independent of ultraviolet A exposure
(highest quintile; adjusted OR 0·501, 95% CI 0·456-0·549, p<0·001). Significant
associations were demonstrated for exposure in all three trimesters.
Interpretation
Lack of maternal exposure to ultraviolet B radiation may play a role in the seasonal
patterning of learning disabilities. Trials are required to evaluate the effectiveness of vitamin
D supplements.
S33
A comparison of stroke in urban and rural populations: Exploring stroke risk and survival in Lancashire
Frances Biggin, Kelly Heys, Matt Heys, Jo Knight, Peter Diggle
Lancaster University (Frances Biggin, Jo Knight, Peter Diggle), University Hospitals Morecambe Bay Trust (Kelley Heys, Matt Heys)
Stroke is a major cause of death and disability in the UK and the risk factors leading to
stroke are well researched. The aim of this study is to explore whether the effects of those
risk factors vary according to whether the patient lives in a rural or an urban area, and
whether survival after a stroke is also affected by this geographical distinction.
The data consisted of 1,045 cases and 54,356 controls. Cases were all first strokes admitted
to hospital in the University Hospitals Morecambe Bay Trust between 01/01/12 and
01/06/18. Hospital data was linked to GP data through the Trust's community data
warehouse, and this in turn was linked to census data to determine the urban or rural status
of each patient. Controls were also selected from the hospital database, such that they were
representative of the same source population as the cases.
A generalised linear model was used to analyse the data. The initial model included
covariates determined by previous studies to be important: age, gender, deprivation index,
smoking status, diabetes, previous TIA (Transient Ischaemic Attack), systolic blood pressure
and rurality. Stepwise backwards selection and likelihood ratio tests were then used to
determine the most appropriate model from these covariates. The final model showed that
after covariate adjustment, those living in the urban areas of Lancashire are at a slightly
elevated risk of stroke, with a risk ratio of 1.20 (95% CI 1.02 to 1.41). A Cox proportional
hazards model for survival analysis was fitted using the same procedure and showed that
there were no statistically significant differences between the survival of urban and rural
patients after a first stroke for either 30 day or overall survival.
Although this study shows a small but statistically significant effect of living in an urban
area on stroke risk, this result is based on data from just one county. Lancashire has a
large rural population, but none of the areas classed as rural fall into the lower end of the
deprivation index scale, which may affect the results of this study. Future studies will extend
these results into the neighbouring county of Cumbria which has a more diverse rural
population.
S35
A novel unsupervised approach to biomedical named entity recognition and linking using neural networks
Zeljko Kraljevic, Rebecca Bendayan, Daniel M. Bean, Richard J. B. Dobson
1. Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, U.K. (D.M.B., R.B., R.J.B.D., Z.K.). 2. NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, U.K. (R.B., R.J.B.D., Z.K.) 3. Health Data Research UK London, University College London, 222 Euston Road, London, U.K. (R.J.B.D.). 4. Institute of Health Informatics, University College London, 222 Euston Road, London, U.K. (R.J.B.D.).
Biomedical documents like Electronic Health Records (EHRs) contain a large amount of
information in an unstructured format. The data in EHRs is a hugely valuable resource
documenting clinical narratives and decisions at the point of care, but whilst the text can be
easily understood by human doctors it's very difficult to work with at scale for machine
learning and statistical modelling.
To uncover the huge potential of biomedical documents we need to extract and structure the
information in them. The task at hand can be split into two Natural Language Processing
problems: Named Entity Recognition (NER) and Named Entity Linking (NEL). The number of
entities, ambiguity of words, overlapping and nesting make this task significantly more
difficult than the traditional NLP problems. Besides the size and the complexity of the
entities, one of the main problems we face is the lack of training data: most biomedical
documents, and especially patient records, cannot be made publicly available. Even when
they can, there are millions of biomedical concepts, and manually labelling the data is not a
viable option.
We are proposing an unsupervised approach to NER+NEL that learns to extract and link
entities given a dictionary of medical concepts and a database of text documents. We have
validated our solution on the MedMentions dataset [Biomedical papers annotated with
mentions from the Unified Medical Language System - UMLS]. The dataset contains 4,392
biomedical documents with 352,496 annotated UMLS concepts. We have compared our
results to SciSpaCy, a recently published state-of-the-art NLP library for biomedical documents.
Preliminary results show a soft F1 score of 0.798 (vs 0.692 for SciSpaCy) on all concepts
from UMLS, and soft F1 of 0.897 (vs 0.846 for SciSpaCy) for disease identification.
Overall, we show that our approach can detect and link millions of different biomedical
concepts with state-of-the-art performance, while still being lightweight and fast. Follow-up
validations will be performed using other clinical cases from the Clinical Record Interactive
Search system [From South London and Maudsley NHS Foundation Trust] and King's
College Hospital where we have connected our solution with CogStack - an information
retrieval and extraction service for EHRs. This tool will potentially contribute to clinical
coding, temporal modeling of patients, phenotyping and event prediction.
S37
Creating knowledge graphs from literature to enable prioritisation of potential drug targets
J.Mullen, O.Giles, R.Turner, M.Hughes
SciBite Limited, BioData Innovation Centre, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1DR, United Kingdom
Here we present an integrated knowledgebase that harmonises data from the literature with
structured data from relevant open source data sets, including ChEMBL and the GWAS
Catalog. These data are aligned to SciBite life science ontologies (e.g. indication, gene,
drug). The knowledgebase brings together multiple evidence types around target
prioritisation into a single point of query. We show examples of how elaborate questions can
be asked of the knowledgebase in order to identify and prioritise potential targets relevant to
different disease areas.
Current trends in computational medicine focus on understanding disease mechanisms and
the diagnosis and treatment of human disease. Through understanding disease mechanisms
one can identify potential drug targets, a vital task in drug discovery. In the big data era,
voluminous datasets capturing data relevant to target identification are not only presented in
various formats but are also stored in decentralised locations, using multiple naming
conventions to describe the same entities. These issues make it very difficult to ask domain
specific questions, where answers depend on the consolidation of evidence from many
sources. Unstructured datasources, such as scientific journal articles, are particularly
problematic when it comes to extracting structured data. These difficulties arise from the fact
that, within the life sciences, entities are known to be highly synonymous (i.e. many terms
can refer to the same entity) as well as incredibly ambiguous (the same term may refer to
different entity types depending on the context). In order to extract meaningful data from the
literature, as well as integrate this with alternative structured datasources, one needs
accurate ontologies that are reflective of the area of application. Once these ontologies have
been created, a means of identifying instances of these ontology concepts within the
literature is needed. At the heart of SciBite's technologies is TERMite, a named
entity recognition (NER) engine. TERMite makes use of over 80 manually curated life
science ontologies to identify instances of entities, such as genes or diseases, contained
within unstructured text. By making use of SciBite's NER engine, associations between
entity types can also be identified, e.g. how many documents contain a sentence that
mentions disease X and gene Y? Once this structured data has been extracted from the
literature, it can be fed into numerous downstream analyses. These include enhancing the
extracted associations (semantic triples) with relevant structured sources of data, which allows
knowledgebases to be built around a particular domain of interest, such as target prioritisation.
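The co-occurrence question posed above (how many documents contain a sentence that mentions disease X and gene Y?) can be illustrated with a deliberately naive sketch. This is not TERMite and does not use SciBite's ontologies: it assumes plain case-insensitive matching against small hand-written synonym sets, a crude regular-expression sentence splitter, and hypothetical function names, and it ignores the ambiguity problem the abstract describes.

```python
import re
from typing import Iterable, List, Set


def split_sentences(text: str) -> List[str]:
    """Crude sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence: str, synonyms: Set[str]) -> bool:
    """True if any synonym appears in the sentence as a whole term."""
    lowered = sentence.lower()
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
        for term in synonyms
    )


def count_cooccurring_documents(
    documents: Iterable[str], disease_terms: Set[str], gene_terms: Set[str]
) -> int:
    """Count documents with at least one sentence naming both entities."""
    count = 0
    for document in documents:
        if any(
            mentions(sentence, disease_terms) and mentions(sentence, gene_terms)
            for sentence in split_sentences(document)
        ):
            count += 1
    return count


# Invented example documents; synonym sets stand in for ontology entries.
toy_documents = [
    "Variants in TCF7L2 are associated with type 2 diabetes. Further work is needed.",
    "Type 2 diabetes is common. TCF7L2 is a transcription factor.",
    "T2D risk is influenced by transcription factor 7-like 2.",
]
disease_terms = {"type 2 diabetes", "T2D"}
gene_terms = {"TCF7L2", "transcription factor 7-like 2"}

n_documents = count_cooccurring_documents(toy_documents, disease_terms, gene_terms)
```

The second toy document mentions both entities but in different sentences, so it is not counted; the third is counted only because the synonym sets map "T2D" and the full gene name to the same entities, which is the role the curated ontologies play.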
S39
The Accelerating Medicines Partnership in Type 2 Diabetes Knowledge Portal
Aravind Sankar, Dylan Spalding, the Accelerating Medicines Partnership in Type 2 Diabetes.
European Bioinformatics Institute (EMBL-EBI)
The Accelerating Medicines Partnership in Type 2 Diabetes Knowledge Portal
(http://www.type2diabetesgenetics.org) is the result of a collaborative effort between
researchers from the Broad Institute, EMBL-EBI, the University of Michigan and the
Wellcome Centre for Human Genetics. It is an open-access web resource comprising DNA
sequence, genetic association results and functional genomic information from studies on
type 2 diabetes (T2D) that aims to speed up the development of new diagnostics and
treatments by identifying promising biological targets. There are 66 datasets in the T2D
Knowledge Portal (T2DKP) to date, varying from small studies to large meta-analyses.
The T2DKP is supported by a set of Knowledge Bases (KBs) which supply and perform the
analyses on the data for the T2DKP. In order to conform to heterogeneous ethical, legal and
societal issues (ELSI) of data from different geographical locations, datasets are hosted at
multiple federated KBs, such as the Broad Institute and EMBL-EBI, which respond to remote
queries from the T2DKP. All T2DKP tools and interfaces can query data housed at all
federated KBs, enabling joint analysis of all datasets across the network of KBs regardless
of geographical location of the datasets. Examples of federated datasets include the Oxford
BioBank and the GoDARTS datasets.
Genetic association results in the T2DKP may be explored using interactive Manhattan plots,
viewed on pages displaying epigenomic data, or comprehensive information about specific
genes and variants. The Genetic Association Interactive Tool (GAIT) allows users to
securely interact with individual level data to perform on-the-fly association tests, filtering
sample sets by multiple criteria: custom online analyses and association tests can then be
performed based on these criteria. Results include p-values and odds ratios. GAIT also
enables assessment of disease burden for genes by focussing on high impact variants.
LocusZoom visualizes associations across a region and custom association analysis; other
modules display PheWAS and Forest plots showing associations for a variant across all
phenotypes in the T2DKP or across a selection of phenotypes from UK Biobank. The
Genetic Risk Score (GRS) module allows association of 243 high-risk variants with different
phenotypes to determine how they are genetically related.
S41
Using Privacy Enhancing Technologies (PETs) to integrate clinical phenotypes, routine healthcare data and genomics in the public cloud
Neil Walker, NIHR BioResource, CUHP/EAHSN HDR UK Sprint Exemplar Innovation Programme
NIHR BioResource for Translational Research, Cambridge Biomedical Campus, Cambridge, UK; Membership of the project team is listed on the Cambridge University Health Partners, Eastern Academic Health Science Network websites http://cuhp.org.uk/ , https://www.eahsn.org/
Rare diseases, cumulatively affecting 1 in 17 of the population, can be difficult to diagnose
and often have an unidentified genetic cause. Utilising integrated data from advances in
imaging, pathology, and genomic analyses is hampered by separation of data at different
sites, with privacy risk cited as the chief barrier to data sharing. An HDR UK award brought
together academic, NHS and industry partners to build the governance, data flows and IT
architecture required to enable data to be integrated and interrogated across data centres, with
targeted subsets de-identified and transferred to the public cloud for analysis.
Participants from 3 rare disease cohorts recruited into the NIHR BioResource all consented
to access to their healthcare records for research.
Source data is held at 2 sites: a secure data centre (AIMES) with NHS- and wider internet-
connections, and the University of Cambridge High Performance Computing service (HPC).
AIMES holds curated clinical details coded as SNOMED-CT or HPO terms, and is receiving
routine healthcare data requested from 5 pilot NHS Trusts, and activity data from Public
Health England and NHS Digital. The HPC has bulk (2.5PB) genomic data on the same
participants, summarised in OpenCGA. Data at AIMES is transformed through ETL
processes to FHIR, NHS Digital's preferred electronic health record exchange format. An
i2b2 data warehouse acts as a cohort discovery tool.
We describe 2 use cases:
1. Academic: using machine learning on longitudinal full blood count data with genomics -
using routine healthcare data for research
2. Industry partner: bringing graph database tools and a drug discovery knowledge-base to
the data - integrating genomic and routine healthcare data
For these use cases one widely-used PET - a Trusted Execution Environment - fails, through
want of computational flexibility: the answer to data existing in separate silos is not always
to build a bigger silo. Instead, we describe the use of a Differential Privacy PET (Privitar) to
apply privacy policies to de-identify subsets of data that are then assembled from the 2
sources in the public cloud (Microsoft Azure).
S43
Creating 'operating systems' for biomedical discovery and care
Calum A. MacRae, Rahul C. Deo, Benjamin Scirica, Dana Vuzman, Stanley Y. Shaw
Brigham and Women's Hospital, Harvard Medical School, Broad Institute of Harvard and
MIT, Harvard Stem Cell Institute
Among the major constraints on the rigorous use of modern analytic approaches in
biomedicine are the low information content of many datasets, their generally modest scale,
sparse metadata and limited systematic integration of clinical, translational and basic
science. Many of these challenges directly align with current efforts to introduce digital
technologies to redesign care delivery, to implement genomics in clinical practice and to
transform the precision (resolution and scale) of biomedical data collection. Specific
examples of practical efforts to address each of these areas will be outlined and the potential
for integrated systems where data collection, analysis, interpretation and iterative
implementation converge around 'biological' transactions or operators will be highlighted.
S45
Refining national approaches to low value care
Aoife Molloy (1,2), Marcus Dawson (2,3), Ben Crittenden (3), Johannes Wolff (2)
1 = Imperial College London, 2 = NHS England, 3 = Faculty AI
Low value care is defined as interventions that are ineffective or only effective in certain
narrow clinical circumstances. The Evidence Based Interventions (EBI) Programme is a
national partnership between clinicians, purchasers, regulators and guideline producers for
health care in England, which provides statutory guidance on 17 interventions that should
not be routinely performed or should only be performed in certain cases.
Identification of low value care is challenging, with limitations in search strategies and
database capture. In this study we sought to identify inappropriate interventions from
administrative data using variation analysis and a prediction model. Data was extracted from
the Secondary Uses Service (SUS) database comprising 14,333,224 records of elective
hospital admissions in England between April 2017 and December 2018. Intervention codes
were generated using procedure (OPCS)/diagnosis (ICD-10) pairs so that both the clinical indication and
the procedure performed were captured. The degree of geographical variation between clinical
commissioning groups (CCGs) was analysed for each of the 306,993 total unique
procedure/diagnosis codes. The interdecile range was then calculated for each treatment
across all CCGs. The complete data extract (14,333,224 records) was randomly split into
separate training, validation and test sets. A random forest classifier model was fitted to the
balanced training data and optimised on the validation set by systematically varying the
model hyperparameters. The analysis focused on records that were incorrectly classified as
inappropriate (i.e. the 1,986,083 records that were predicted to be class 1, but where the
actual label was class 0). The mean prediction probability was calculated for each
procedure/diagnosis code within this false positive group, which created a shortlist of 92,647
unique treatments with associated mean probabilities. The geographical variation analysis
and the random forest classifier results were combined. Interventions that had a calculable
interdecile range and that the random forest classifier incorrectly predicted to be
inappropriate were selected. This created a new shortlist of 1,048 recommendations that can
be considered for the EBI programme.
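The variation step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each procedure/diagnosis code has one rate per CCG, takes the interdecile range as the 90th minus the 10th percentile, and treats the range as not calculable when too few CCGs report the code. The function name, the `min_ccgs` threshold and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def interdecile_ranges(rates, min_ccgs=2):
    """Return {code: interdecile range across CCGs, or None if not calculable}.

    rates: iterable of (code, ccg, rate) tuples (hypothetical layout).
    min_ccgs: assumed minimum number of CCGs needed to compute deciles.
    """
    by_code = defaultdict(list)
    for code, _ccg, rate in rates:
        by_code[code].append(rate)

    result = {}
    for code, values in by_code.items():
        # statistics.quantiles needs at least two data points.
        if len(values) < max(min_ccgs, 2):
            result[code] = None
            continue
        deciles = quantiles(values, n=10, method="inclusive")
        # deciles[0] is the 10th percentile, deciles[-1] the 90th.
        result[code] = deciles[-1] - deciles[0]
    return result


if __name__ == "__main__":
    toy = [("A", f"ccg{i}", float(i)) for i in range(11)] + [("B", "ccg0", 3.0)]
    print(interdecile_ranges(toy))
```

Codes with a `None` result would be dropped before combining with the classifier shortlist, mirroring the "calculable interdecile range" condition in the abstract.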
The project outlines data-driven approaches to identify inappropriate interventions from
administrative data. Using variation analysis and machine learning to identify inappropriate
interventions for consideration for policy making, guidance production and further research
represents a novel and promising method for extracting clinically-relevant insights from the
vast resource of administrative patient-level data within the NHS.
S47
Using wearable devices and genetics to estimate and validate mechanisms of sleep
Andrew R Wood1,16, Samuel E. Jones1, Vincent T. van Hees2, Diego R. Mazzotti3, Pedro
Marques-Vidal4, Séverine Sabia5,6, Ashley van der Spek7, Hassan S. Dashti8,9, Jorgen
Engmann5, Desana Kocevska7, Melvyn Hillsdon1, Annemarie I. Luik7, Najaf Amin7, Jacqueline M.
Lane8,9, Richa Saxena8,10, Martin K. Rutter11,12, Henning Tiemeier5,13, Zoltan Kutalik13,14, Meena
Kumari15, Cecilia Lindgren16, Aiden Doherty16, Timothy M. Frayling1, Michael N. Weedon1, Philip
Gehrman3
1) University of Exeter, Exeter, UK; 2) Netherlands eScience Center, Amsterdam, The
Netherlands; 3) University of Pennsylvania, Philadelphia, PA, USA; 4) Lausanne University
Hospital, Lausanne, Switzerland; 5) University College London, UK; 6) Université de Paris, Paris,
France; 7) Erasmus MC University Medical Center, Rotterdam, Netherlands; 8) Massachusetts
General Hospital, Boston, MA, USA; 9) Broad Institute of MIT and Harvard, Cambridge, MA,
USA; 10) Harvard Medical School, Boston, MA, USA; 11) University of Manchester, Manchester,
UK; 12) Manchester University NHS Foundation Trust, Oxford Road, Manchester, UK; 13)
Harvard TH Chan School of Public Health, Boston, MA, USA; 14) Swiss Institute of
Bioinformatics, Lausanne, Switzerland; 15) University of Essex, Colchester, UK; 16) Big Data
Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford,
UK.
The widespread availability of wearable devices, such as wrist worn accelerometers, offers the
potential to objectively measure sleep in millions of people, and thus better understand the role of
sleep in disease development. While polysomnography (PSG) is regarded as the gold standard
method of measuring sleep, it is impractical to perform in large cohorts. To better understand the
mechanisms of sleep, we aimed to derive accelerometer estimates of sleep duration, timing and
quality and perform subsequent genetic analyses on these phenotypes to identify novel genetic
associations.
Using accelerometer data in up to 85,670 participants from the UK Biobank study, we derived
measures of sleep duration (duration and variability), quality (sleep efficiency and number of
nocturnal sleep episodes), and timing (including nocturnal sleep and the least active 5 hours)
using an algorithm previously validated against PSG data. Genetic data available from the UK
Biobank was used for genome-wide association analyses and validation of phenotypes.
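As an illustration of one timing measure mentioned above, the least active 5 hours can be found by sliding a 5-hour window around the 24-hour clock and keeping the window with the lowest total activity. This is only a sketch under simple assumptions (one activity value per clock hour, a circular day); it is not the validated algorithm used in the study, and the function name is hypothetical.

```python
def least_active_window(hourly_activity, window=5):
    """Return (start_hour, midpoint_hour) of the least active window.

    hourly_activity: 24 activity values, index = clock hour (assumed layout).
    The window wraps around midnight, so hour 23 is followed by hour 0.
    """
    n = len(hourly_activity)
    if n != 24:
        raise ValueError("expected 24 hourly values")

    best_start, best_total = 0, None
    for start in range(n):
        total = sum(hourly_activity[(start + k) % n] for k in range(window))
        if best_total is None or total < best_total:
            best_start, best_total = start, total

    midpoint = (best_start + window / 2) % n
    return best_start, midpoint


if __name__ == "__main__":
    activity = [10.0] * 24
    for hour in (22, 23, 0, 1, 2):  # quiet period spanning midnight
        activity[hour] = 0.0
    print(least_active_window(activity))
```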
The phenotypic correlations between accelerometer estimates of sleep timing and measures of
sleep duration and quality were low (-0.10 ≤ R ≤ 0.12), consistent with data from self-reported
chronotype (“morningness”) and sleep duration in the UK Biobank (R = -0.01). We observed a
stronger correlation between sleep duration and sleep efficiency (R = 0.57). Heritability estimates
ranged from 2.8% (95% CI 2.0%, 3.6%) for sleep duration variability, to 22.3% (95% CI 21.5%,
23.1%) for the number of nocturnal sleep episodes. We identified 47 genetic associations at
P < 5×10⁻⁸. These include 26 novel associations with measures of sleep quality and 10 with
nocturnal sleep duration. We replicate a previously reported gene locus (PAX8) associated with
sleep duration based on self-report data. We observe that variants previously associated with
restless legs syndrome are associated with multiple sleep traits. As a group, sleep quality loci
are enriched for serotonin processing genes.
Using accelerometer data in the UK Biobank, we have derived eight measures of sleep
characteristics. Genetic associations between these measures and known restless legs
syndrome associations provide validation of our methods. Work will now focus on the effects of
sleep and classified daytime activity types on disease risk. Activity types will be derived from
machine learning methods using labelled accelerometer data from camera experiments (random
forests and Hidden Markov Models). Associations with adverse health outcomes will be
assessed using prospective data from the UK Biobank.
S49
Towards a self-learning healthcare system
Tianxi Cai
Harvard University, USA
The wide adoption of electronic health record (EHR) systems has made large clinical datasets
available for discovery research. EHR data, linked with bio-repositories, are a valuable new
source for deriving real-world, data-driven prediction models of disease risk
and progression. Yet, they also bring analytical difficulties. Precise information on clinical
outcomes is not readily available and requires labor intensive manual chart review.
Synthesizing information across healthcare systems is also challenging due to heterogeneity
and privacy. In this talk, I’ll discuss analytical approaches for mining EHR data with a focus
on scalability, reproducibility and automated knowledge extraction. These methods will be
illustrated using EHR data from Partners HealthCare and the Veterans Health Administration.
P1
Poster Presentations
An empirical study of the effects of black-box anonymisation on observational research
Mehrdad A. Mizani, Aziz Sheikh
Usher Institute of Population Health Sciences and Informatics, The University of Edinburgh
Access to individual-level data is essential for generalisable observational studies. Data
holders provide individual-level data only after anonymisation to protect patient privacy. The
commonly applied anonymisation methods in the UK are variations of k-anonymity, synthetic
datasets, statistical disclosure control, and differential privacy. These methods incorporate
deleting, randomising, perturbing, or aggregating data elements at the feature, record or
cell levels. Consequently, they introduce noise and potentially reduce data utility. These
anonymisation methods are typically applied as a black-box approach leading to potential
uncertainties regarding the integrity and representativeness of anonymised data. This
uncertainty can lead to undetected bias, faulty assumptions and wrong interpretations in
observational studies.
We aim to empirically analyse the effects of anonymisation on the outcome and
interpretation of observational studies based on machine learning (ML) and artificial
intelligence (AI). We have developed pilot analytic software for assessing the effects of
anonymisation techniques on data utility in a range of observational study designs (i.e.
ecological, cross-sectional, case-control, and retrospective cohort studies). We have tested
our platform on the Adult dataset of the University of California, Irvine (UCI) Machine
Learning Repository and aim to extend our research to the whole cohort of the UK Biobank
for asthma outcome.
We will use algorithmically-defined asthma outcome data from UK Biobank as a baseline
gold standard to assess the effects of anonymisation at three levels, as observed in the pilot
stage: 1) feature engineering, 2) ML/AI and epidemiological models and 3) interpretation of
the outcomes. We will apply prominent anonymisation methods (such as aggregation,
feature suppression, value generalisation, and clustering) to the baseline data. We will then
split the data into training, validation and test sets and apply ML methods (e.g. logistic
regression). Feature engineering (e.g. changes in probability distribution and
correlation), ML/AI models (e.g. distance measurement) and epidemiological characteristics
(e.g. odds ratio) will be compared across the anonymisation methods.
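To make the idea of value generalisation concrete, the sketch below coarsens an age field into bands and measures k-anonymity (the size of the smallest group of records sharing the same quasi-identifier values) before and after. It is an illustration under assumed field names (`age`, `sex`) and an assumed 10-year band; it does not reproduce the pilot software described here.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def generalise_age(record, band=10):
    """Replace an exact age with a band such as '30-39' (value generalisation)."""
    lower = (record["age"] // band) * band
    return {**record, "age": f"{lower}-{lower + band - 1}"}


if __name__ == "__main__":
    # Toy records with hypothetical fields; not real patient data.
    records = [
        {"age": 31, "sex": "F"},
        {"age": 38, "sex": "F"},
        {"age": 45, "sex": "M"},
        {"age": 47, "sex": "M"},
    ]
    print(k_anonymity(records, ["age", "sex"]))
    generalised = [generalise_age(r) for r in records]
    print(k_anonymity(generalised, ["age", "sex"]))
```

The trade-off studied in the abstract is visible even here: k rises after generalisation, but any analysis that needs exact age loses resolution.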
We seek to stimulate discussion on approaches currently used to maintain data anonymity
and to consider the potential impact of these on analytic considerations and the development
of novel approaches to maintain data granularity whilst at the same time preserving
anonymity.
Funding: Mehrdad A. Mizani's Newton International Fellowship is funded by the Academy of
Medical Sciences and the Newton Fund.
P2
Negation Detection in NHS Brain Imaging Reports
Dominic Sykes, Andreas Grivas, Claire Grover, Richard Tobin, Cathie Sudlow, William Whiteley, Andrew McIntosh, Heather Whalley, Beatrice Alex
Centre for Clinical Brain Sciences, Usher Institute and School of Informatics, University of Edinburgh
With the advancement of natural language processing (NLP) technology, it is now possible
to extract structured information from Electronic Healthcare Record (EHR) raw text at
reasonably high accuracy. However, a challenge affecting performance is the accurate
distinction between positive and negative mentions of clinical terms. For example,
distinguishing between 'evidence of tumour present' and 'no evidence of tumour present' has
important implications. It is essential therefore that NLP methods include a negation
detection component.
Using a validated set of manually-annotated brain imaging reports from the Edinburgh
Stroke Study (ESS), we tested four negation detection methods, including two rule-based
systems and two neural network (NN) methods trained on a subset of ESS:
(i) 'EdIE-R-Neg' - part of an existing pipeline developed to process radiology reports
(ii) 'pyConText' - a rule-based implementation and extension of 'ConText' developed for
medical text
(iii) 'biLSTM-Neg' - a bidirectional Long Short-Term Memory (biLSTM) architecture
(iv) 'FFNN-Neg' - a feed-forward neural network (FFNN).
Radiology reports (N=630) from the ESS data were annotated for 12 named entity types
(e.g. tumour, subdural haematoma) and 4 modifiers (e.g. location, temporal information).
The automated methods were tested on the unseen ESS test set (N=266 reports) containing
961 negated annotations.
The EdIE-R-Neg system (specifically developed for radiological reports) performs best, at
98.02 (96.92-98.72) precision (P; positive predictive value), 97.61 (96.43-98.40) recall (R;
sensitivity) and 97.81 F1 score (harmonic mean of P and R) for negation annotations (with
95% confidence intervals for P and R reported in brackets). This is compared to a rule-based
negation detection system developed for medical text more generally, pyConText, which
performs worst (P=91.52 (89.61-93.09), R=94.28 (92.62-95.57), F1=92.88). The two NN
systems (both trained on ESS data) perform almost as well as the local rule-based method
(biLSTM-Neg: P=98.40 (97.38-99.02), R=96.05 (94.61-97.10), F1=97.21; FFNN-Neg: P=97.81
(96.68-98.56), R=97.61 (96.43-98.40), F1=97.71). The difference between the two NN
approaches is that biLSTM-Neg has higher precision than recall for negative annotations,
while FFNN-Neg is more balanced across metrics.
In summary, we report comparably high levels of performance for the NN methods and a
rule-based system designed specifically for such records. This comparison demonstrates the
suitability of different approaches for the use of negation detection to improve NLP accuracy
in this context.
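The F1 scores quoted above follow directly from the reported precision and recall, since F1 is their harmonic mean. The short check below recomputes them from the point estimates given in the abstract; the function name is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Point estimates (in %) as reported in the abstract.
    reported = {
        "EdIE-R-Neg": (98.02, 97.61),
        "pyConText": (91.52, 94.28),
        "biLSTM-Neg": (98.40, 96.05),
        "FFNN-Neg": (97.81, 97.61),
    }
    for system, (p, r) in reported.items():
        print(system, round(f1_score(p, r), 2))
```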
P3
Semantics and graph databases: developing, applying and mapping a spectrum of
phenotype ontologies to integrate health-related data
Tim Beck1, Richard Bramley2, Tom Shorter1 and Anthony J Brookes1
1 Department of Genetics and Genome Biology, University of Leicester, Leicester 2 NIHR Leicester Biomedical Research Centre, Glenfield Hospital, Leicester
The term “phenotype” is used to define an aggregated set of medically and semantically
distinct concepts such as a trait (e.g., blood glucose level), medical signs and symptoms
(e.g., hyperglycemia), and disease (e.g., type 2 diabetes). A spectrum of publicly available
ontologies defines overlapping phenotype domains. Two use cases
demonstrate our approach for harmonising both non-clinical and clinical qualitative research
data: integrating phenotype content across thousands of summary-level genome-wide
association studies (GWAS); and enabling the Genetics and Vascular Health Check study
(GENVASC) primary care dataset which consists of many thousands of codes to be
interrogated using a specialised disease-specific application ontology.
GWAS Central (https://www.gwascentral.org/) is the world’s largest collection of summary-
level GWAS data. The phenotype content is annotated using MeSH and the Human
Phenotype Ontology (HPO), where MeSH provides detailed disease descriptors and HPO
provides granular signs and symptoms coverage. We have created an automated pipeline
to map between the phenotypes coded in GWAS Central and phenotypes coded to other
ontologies in GWAS-related databases to create the comprehensive ‘PhenoMap’ resource.
The use of an underlying graph database allows simple and fast retrieval of complex
hierarchical structures that are difficult to model in relational systems.
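The kind of hierarchical retrieval referred to above amounts to a graph traversal: starting from one ontology term, collect every more specific term beneath it, even when a term has several parents. The sketch below shows this with an in-memory adjacency map. The term labels and the parent-child links are toy examples, not actual MeSH or HPO content, and the code stands in for a graph database query rather than reproducing the PhenoMap implementation.

```python
from collections import deque


def descendants(children, root):
    """Return all terms reachable from root by following parent -> child edges."""
    seen = set()
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:  # a term may be reachable via several parents
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # Toy hierarchy with one multi-parent term (hypothetical labels).
    children = {
        "metabolic disease": ["diabetes mellitus"],
        "glucose metabolism disorder": ["diabetes mellitus"],
        "diabetes mellitus": ["type 1 diabetes", "type 2 diabetes"],
    }
    print(sorted(descendants(children, "metabolic disease")))
```

In a relational schema the same query typically needs recursive joins, which is the modelling difficulty the abstract alludes to.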
GENVASC uses primary care data to investigate how risk prediction for coronary artery
disease (CAD) can be improved. The introduction of both the General Data Protection
Regulation, and SNOMED CT as the single structured vocabulary used within the NHS, has
presented two distinct challenges. Firstly, there is a necessity to justify the collection of all
SNOMED CT codes from the primary care record, and secondly, SNOMED CT is a verbose
clinical terminology which requires an interface with focussed specialised terms to allow it to
be fully accessible and useful to clinical researchers. We have developed an application
ontology for CAD (225 terms) ensuring alignment with external initiatives, such as the NIHR
HIC Metadata Catalogue. Using automated concept recognition techniques, we mapped our
CAD ontology to the UK Edition of SNOMED CT (>10K mapped terms) to produce a subset
of the terms in SNOMED CT which can be extracted from the primary care records for
participants.
P4
A scalable federated solution for UK health data encryption, linking and discovery
Professor Anthony Brookes, Theodoros Arvanitis(1), Tony Avery(2), Simon Ball(3), Don Campbell(4), Andy Carruthers(5), Saiful Choudhury(5), Cheryl Davenport(6), Spencer Gibson(7), Reiko Heckel(7), Umesh Kadam(7), Joe Kai(2), Danny Kirby(5), Darren Palmer(8), Phil Quinlan(2), Ajitha Ranasinghe(8), David Roberts(4), Mark Robertson(4), Des Shanahan(4), David Shepherd(7), Warren Stone(9), Phil Townes(8) and Marc Wadsley(7)
1: University of Warwick, UK 2: University of Nottingham, UK 3: University Hospital Birmingham, UK 4: Privitar 5: University Hospital Leicester, UK 6: Leicester City CCG, UK 7: University of Leicester, UK 8: Leicester Health Informatics Service, UK 9: NTT Data
As part of the HDR UK Digital Innovation Hub program we are running a 'Sprint Exemplar'
project to establish a method for responsible record-level linkage between effectively
anonymised healthcare datasets (primary care, secondary care, social care, etc) in a
federated network, along with a process for making these linked datasets safely
discoverable by researchers that might wish to access them. The project includes academic,
industry, biobanking and healthcare partners across Leicestershire and the Midlands, plus
the privacy software company Privitar. To date, we have combined our partner technologies,
and defined and begun implementing a viable architecture for the intended federated linkage
service. Source datasets have been agreed and Information Governance work undertaken to
enable access appropriate to the planned use. A generalised model for enabling linked data
discovery has been formulated, and this is being applied via previously developed 'Café
Variome' software. A UK wide 'Strategic Committee' has been formed to oversee progress
and make wider connections.
P5
Learn from the unknown: Leveraging missingness for outcome prediction in clinical time series data
Thore Buergel 1, Maria Herrero-Zazo 2, Tomas Fitzgerald 2 and Ewan Birney 2
1 Health Data Science Unit, University Clinics Heidelberg, Germany 2 The European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom.
The implementation of electronic health records (EHRs), with their ability to overcome the
reign of pen and paper in the clinical context, marks the birth of medical big data. A typical
EHR is composed of descriptors of very diverse origins and scales, such as
physiological measurements, laboratory test results, pharmacological treatments, diagnoses
and patient demographics, from hospital admission to discharge. The density of the clinical
time series depends on both the severity of the disease and the ward, with the
densest sampling found in intensive care units (ICUs). In summary, EHRs are
high-dimensional, multi-scaled, irregular multivariate time series containing information on
patient descriptors and outcomes.
Potential applications of EHRs beyond data collection lie in the development of machine
learning models for clinical decision support, risk modelling and patient stratification. The
foundation of all these approaches is the correct representation of clinical time series data
for machine learning purposes. This is a non-trivial problem, as these data exhibit strong
missingness. As a consequence, clinical time series vary between patients in both the recorded
descriptors and their specific frequency of recording. In contrast to other time series data,
however, this missingness is also informative: laboratory tests, for instance, are only
requested when there is a clinical reason. Thus, in order to fully exploit clinical time series
data, an optimal representation capturing the characteristic patterns of missingness is crucial.
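One common way to expose missingness to a model, shown here purely as an illustration, is to accompany each forward-filled value with a binary observation mask and a count of time steps since the last observation. The sketch assumes a single variable on a regular hourly grid with `None` marking unrecorded hours; the function name and the fill value are hypothetical, and the abstract does not state that this is the representation the authors selected.

```python
def encode_with_missingness(series, fill_value=0.0):
    """Return (values, mask, delta) for one variable on a regular time grid.

    values: forward-filled measurements (fill_value before the first observation)
    mask:   1 where a measurement was recorded, 0 where it was missing
    delta:  time steps since the last recorded measurement
            (counted from the start of the series before any observation)
    """
    values, mask, delta = [], [], []
    last, since = fill_value, 0
    for x in series:
        if x is None:
            since += 1
            mask.append(0)
        else:
            last, since = x, 0
            mask.append(1)
        values.append(last)
        delta.append(since)
    return values, mask, delta


if __name__ == "__main__":
    lactate = [None, 7.2, None, None, 6.9]  # toy hourly lab values
    print(encode_with_missingness(lactate))
```

The mask and delta channels can be concatenated with the values as extra features for a random forest or as extra input channels for a recurrent or convolutional network.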
In this work, we set out to investigate representations and identify the methods of choice for
machine learning from clinical time series data. Utilising data from the publicly available
MIMIC-III database, we assess combinations of time series representation techniques with
random forests, recurrent neural networks and temporal convolutional neural networks with
special emphasis on the question of leveraging the hidden information in missingness
patterns.
To be able to benchmark these diverse representations and modelling techniques, we define
a supervised learning problem and evaluate models based on their predictive performance.
More specifically, we assess the general static outcome of inpatient mortality from the first
48 hours of clinical time series data after admission to ICU. Here, we demonstrate that
predictive performance increases when information on missingness is added. We report model
performances and give concrete insights into our flexible and transferable benchmark
environment covering the whole pipeline, from data preprocessing through encoding to
prediction.
P6
Early stages of type 2 diabetes: cohort study linking genetic liability with repeated metabolomics across early life
Caroline J. Bull 1,2,3*, Joshua A. Bell 1,2*, Marc J. Gunter 4, David Carslake 1,2, George Davey Smith 1,2, Nicholas J. Timpson 1,2, Emma E. Vincent 1,2,3
* Contributed equally 1 MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, UK 2 Bristol Population Health Science Institute, Bristol Medical School, Bristol, UK 3 School of Cellular and Molecular Medicine, University of Bristol, UK 4 Section of Nutrition and Metabolism, International Agency for Research on Cancer, Lyon, France
Type 2 diabetes develops for many years before diagnosis. We aimed to reveal its early
metabolic stages by examining genetic liability to adult type 2 diabetes in relation to detailed
metabolic traits across early life.
Data were from up to 4,765 offspring participants of the Avon Longitudinal Study of Parents
and Children (ALSPAC) birth cohort study. Linear regression models were used to examine
effects of a genetic risk score (GRS) for adult type 2 diabetes, which included 162 genetic
variants explaining 18% of variance from the most recent genome-wide association study,
on 4 repeated measures of 229 traits from targeted nuclear magnetic resonance
metabolomics. These traits included lipoprotein subclass-specific cholesterol and triglyceride
content, branched chain and aromatic amino acids, fatty acids, inflammatory glycoprotein
acetyls, and others, and were measured in childhood (age 8y), adolescence (age 15y),
young-adulthood (age 18y), and adulthood (age 25y).
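For readers unfamiliar with genetic risk scores, the sketch below shows one usual construction: a sum of risk-allele dosages weighted by per-variant effect sizes, then standardised so that regression coefficients read as "per SD-higher GRS". Whether the score used here was weighted in exactly this way is an assumption, and the dosages and weights are toy numbers.

```python
from statistics import mean, pstdev


def genetic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1 or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("one weight per variant is required")
    return sum(d * w for d, w in zip(dosages, weights))


def standardise(scores):
    """Convert raw scores to SD units (z-scores) across participants."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / sd for s in scores]


if __name__ == "__main__":
    weights = [0.1, 0.2, 0.3]  # hypothetical per-variant effect sizes
    participants = [[0, 1, 2], [1, 1, 0], [2, 2, 2]]
    raw = [genetic_risk_score(d, weights) for d in participants]
    print(standardise(raw))
```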
Participants were 49.7% male and had progressively higher body mass index (BMI) across
time points: mean (standard deviation, SD) BMI in kg/m² was 16.2 (2.0) at age 8y and 24.8
(4.9) at age 25y. The prevalence of type 2 diabetes was low across time points
(< 5 cases when first assessed at age 15y; 7 cases (0.4%) when assessed at age 25y). At
age 8y, type 2 diabetes liability (per SD-higher GRS) was associated with lower lipids in
high-density lipoprotein (HDL) particle subtypes - e.g. -0.03 SD (95% CI=-0.06, 0.00;
P=0.03) for total lipids in very-large HDL. At age 15y, associations remained strongest with
lower lipids in HDL and became stronger with pre-glycemic traits including citrate (-0.06 SD,
95% CI=-0.09, -0.02; P=0.001) and with glycoprotein acetyls (0.05 SD, 95% CI=0.01, 0.08;
P=0.01). At age 18y, associations were stronger with branched chain amino acids including
valine (0.06 SD; 95% CI=0.02, 0.09; P=0.001), while at age 25y, associations had
strengthened with very-low density lipoprotein (VLDL) lipids and remained consistent with
previously altered traits including HDL lipids.
Early metabolic stages of type 2 diabetes likely involve lower HDL lipids, followed by higher
branched chain amino acid and inflammatory glycoprotein acetyl levels. These perturbations
are apparent in childhood as early as age 8y, decades before the clinical onset of disease.
Knowledge of traits specific to type 2 diabetes development may inform the targeting of key
pathways or the early prediction of individuals on course to more debilitating stages of
disease.
P7
Genome-wide association study of tramadol dose in Generation Scotland using Electronic Health Records
Toni-Kim Clarke, Johnathan Hafferty1, Mark Adams1, Archie Campbell2, David Porteous2, Caroline Hayward3 and Andrew McIntosh1
1Division of Psychiatry, University of Edinburgh, Edinburgh, UK 2Centre for Genomic and Experimental Medicine , Institute of Genetics & Molecular Medicine, University of Edinburgh, UK 3MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, UK
The increased use of and dependence on opioids is a global concern. The recent opioid
epidemic in the United States largely has its origin in the over-prescription of opioid
analgesics to treat chronic pain. Deaths from both illicit and prescription opioids have
continued to rise steadily since 2000 and for those seeking treatment for opioid use disorder
(OUD), 75% cite prescription drugs as their first exposure to opioids. A risk factor for OUD is
the prescription of higher maximum doses of opioid analgesics for chronic pain. In order to
determine genetic factors associated with higher doses of opioid analgesics a genome-wide
association study (GWAS) of tramadol dose was performed in the Generation Scotland (GS)
cohort. Using electronic health records, members of GS were linked to the National
Prescribing Information System to obtain prescription dose information. 945 GS participants
were identified who had received 3 or more tramadol prescriptions from April 2009 -
February 2015 (69% female, mean age 53.2 [SD=13.5]). Maximum tramadol dose was
derived for each participant by extracting the dosage, frequency and medication strength
from the prescribing information. GWAS analyses did not find any loci associated with
maximum tramadol dose at the level of genome-wide significance; however, SNPs in genes
including CREB5 and TRPV2 were amongst those most significantly associated. CREB5 has
previously been associated with opioid disorder and TRPV2 polymorphisms are associated
with fibromyalgia. Although this is only a small sample, these findings demonstrate how
record linkage can be used to better understand the genetics of opioid dose and chronic
pain. Approaches such as these will also help to increase power in studies of OUD where it
is difficult to obtain large sample sizes.
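The derivation of maximum dose from prescribing fields can be sketched as below. It assumes the daily dose of a prescription is units per dose × doses per day × strength in mg, and keeps only participants with at least three prescriptions, as in the cohort definition above. The field names and the exact formula are assumptions for illustration; the abstract does not give the authors' derivation in detail.

```python
from collections import defaultdict


def max_daily_dose_mg(prescriptions, min_prescriptions=3):
    """Return {participant: maximum daily dose in mg} for eligible participants.

    prescriptions: iterable of dicts with hypothetical keys
        participant, units_per_dose, doses_per_day, strength_mg
    """
    daily_doses = defaultdict(list)
    for p in prescriptions:
        daily = p["units_per_dose"] * p["doses_per_day"] * p["strength_mg"]
        daily_doses[p["participant"]].append(daily)

    return {
        pid: max(doses)
        for pid, doses in daily_doses.items()
        if len(doses) >= min_prescriptions
    }


if __name__ == "__main__":
    toy = [
        {"participant": "a", "units_per_dose": 1, "doses_per_day": 4, "strength_mg": 50},
        {"participant": "a", "units_per_dose": 2, "doses_per_day": 4, "strength_mg": 50},
        {"participant": "a", "units_per_dose": 2, "doses_per_day": 3, "strength_mg": 50},
        {"participant": "b", "units_per_dose": 1, "doses_per_day": 2, "strength_mg": 100},
    ]
    print(max_daily_dose_mg(toy))
```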
P8
particleMDI: A Julia Package for the Integrative Cluster Analysis of Multiple Datasets
Nathan Cunningham, Jim E Griffin, David L Wild, Anthony Lee
NC: University of Warwick and The Alan Turing Institute, JEG: University College London, DLW: University of Warwick, AL: University of Bristol and The Alan Turing Institute
We introduce particleMDI, a Julia package for performing integrative cluster analysis on
multiple heterogeneous data sets with applications in genomic research. Inference uses
a particle Monte Carlo method in which cluster allocations are updated using a particle
filter within a Gibbs sampler. This particle Gibbs approach offers better mixing of the
MCMC chain, but at greater computational cost, than the original MDI algorithm. We outline
approaches to improving the
computational performance of our algorithm, finding the potential for greater than an order of
magnitude improvement in performance. We apply our algorithm to the task of discovering
risk cohorts amongst 243 patients presenting with kidney cancer, using samples from the
Cancer Genome Atlas, for which there are gene expression, copy number variation,
methylation, protein expression and microRNA data. We identify 4 distinct consensus
subtypes and show they are prognostic for survival rate (p < 0.0001). This is driven mainly
by the copy number data, but the effect is strengthened by the other four data types,
demonstrating the value of integrating multiple data types.
P9
Enabling better access to healthcare data for the UK SME community
Mark Davies, Joanne Hartley, Tim Newton, Nicola Heron and Mark Samuels
Medicines Discovery Catapult, Alderley Park, Macclesfield, Cheshire, SK10 4TG.
Opportunity:
The ability to access and use health data offers many opportunities for SMEs to improve
healthcare, such as training AI algorithms or testing mobile health apps. These could help
answer questions ranging from the epidemiology of rare diseases to understanding patients'
natural history of disease. Technological developments are addressing key issues around data
privacy, such as the emergence of federated learning to solve the problem of moving patient
data. Appropriate access to health data gives SMEs the chance to find disease interventions
at far earlier stages than has been the case, ultimately leading to better development of new
medicines. Additionally, an ability to identify trends in healthcare data can be used to
optimise public health through better targeted treatments.
Problem statement:
The UK has a national healthcare system from cradle to grave, which provides extensive
opportunities for health research. Yet healthcare data and electronic patient records are
often fragmented. SMEs, in particular, can find access procedures complex. Whilst there
exist a number of large data sets, e.g. UK Biobank, Hospital Episode Statistics (HES) and
Clinical Practice Research Datalink (CPRD), they can be inaccessible for many smaller
companies: the cost or the analytical capabilities required can preclude SMEs. Furthermore,
these companies would most benefit from linking these large datasets to complementary data
(e.g. mobile health or lifestyle records).
Solution:
The Medicines Discovery Catapult is a not-for-profit company, funded by Government to
support the development of new medicines by SMEs. The Catapult provides scarce scientific
capabilities and acts as a national gateway to specialist facilities, technology and expertise
within the UK. Through research and engaging with SMEs, knowledge has been built around
what SMEs need in terms of healthcare data. With extensive know-how of the UK's health
data, the Catapult is helping SMEs to support development of new medicines for patients.
On top of its series of data access guides, the Catapult is already providing SMEs with
tailored advice and recommendations about where and how to source the data they need.
We have already provided support to a number of SMEs for accessing data from a range of
UK data sources. Furthermore, the results of our survey demonstrate the current challenges
facing SMEs and the workflow we are developing to help.
P10
ELIXIR: Providing a coordinated European Infrastructure for managing Human Genomics Translational Data and Services
Jennifer Harrow, On Behalf of ELIXIR
ELIXIR
ELIXIR unites Europe's leading life science organisations in managing and safeguarding the
increasing volume of data being generated by publicly funded research. It coordinates,
integrates and sustains bioinformatics resources across its member states and enables
users in academia and industry to access services that are vital for their research. There are
currently 22 countries involved in ELIXIR, bringing together more than 200 institutes and 600
scientists.
ELIXIR's activities are coordinated across five areas called 'Platforms', which have made
significant progress over the past few years. For instance, the Data Platform has developed
a process to identify data resources that are of fundamental importance to research and
committed to long-term preservation of data, known as core data resources. The Tools
Platform has services to help search for appropriate software tools and workflows,
benchmarking services, and a BioContainers registry to enable software to be run on any
operating system. The Compute Platform has services to store, share and analyse large data
sets and has developed the Authentication and Authorisation Infrastructure (AAI) single
sign-on service across ELIXIR. The Interoperability Platform develops and encourages adoption of
standards such as FAIRsharing, and the Training Platform helps scientists and developers
find the training they need via the Training e-Support System (TeSS).
The Beacon Project is an open sharing platform that allows any genomic data centre in the
world to make its data discoverable. The project is a first-of-its-kind effort to make the
massive amounts of life sciences data being collected in healthcare and research settings
around the globe accessible and is being supported and funded by ELIXIR. To date, 70
beacons have been "lit," including seven in the UK and another nine across Europe, allowing
users unprecedented discovery of genomic variants in national and international cohorts.
The Authentication and Authorisation Infrastructure (AAI) provides a centralised user identity
and access management service (ELIXIR AAI). ELIXIR AAI will be used to access the
European Genome-Phenome Archive (EGA) resources and ELIXIR is working with the
GA4GH to have ELIXIR AAI approved as a standard. The focus now for ELIXIR Human
Genomics and Translational Data is to establish a federated suite of EGA services across
Europe, coordinating the national roadmaps and large EU projects to enable population
scale genomic, phenotypic, and biomolecular data to be accessible across international
borders.
P11
Generation Scotland: Scottish Family Health Study: Linkage to Electronic Health Records
Caroline Hayward, MRCHGU QTL in Health and Disease group.
MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, EH4 2XU, Scotland. Generation Scotland, Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, EH4 2XU, Scotland
Generation Scotland: Scottish Family Health Study (GS:SFHS) is a biobank designed to
study the genetics of areas of importance to public health. Over 24,000 adults were recruited
to the study from 2006 to 2011, with broad and enduring written informed consent for
biomedical research. Consent was obtained from most participants for GS:SFHS study data
to be linked to their Scottish National Health Service records, using their Community Health
Index number. This identifying number is used for NHS Scotland procedures and allows
healthcare records for individuals to be linked across time and location.
NHS Electronic Health Record (EHR) data on GS:SFHS participants with consent and
mechanisms for record linkage, plus genome-wide genetic information, are available, and
associations with quantitative trait phenotypes and the occurrence of prevalent and
incident diseases are being investigated. The existing baseline study phenotypes and new "omics"
and biomarker data obtained from analysis of stored biological samples (serum/urine)
present opportunities to gain insights into a range of complex conditions. This demonstrates
that EHR is a rich resource of real world data that can be used in research to characterise
the health trajectory of participants.
P12
Exploring chronic disease and multimorbidity through linked healthcare records in a local cohort
Catherine John, Robert C Free, Nicola Reeve, David J Shepherd, Louise V Wain, Martin D Tobin
University of Leicester (Department of Health Sciences, NIHR Leicester Biomedical Research Centre and Department of Respiratory Sciences), UK
Multimorbidity is the presence of multiple diseases or conditions in one patient. It is now the
norm in some age groups in high-income countries. However, many research studies are
designed to consider individual conditions in relative isolation. Studies with linked electronic
health records (EHR) can incorporate a broad range of diseases and risk factors to study
multimorbidity, as well as longitudinal follow-up of disease progression, treatment outcomes
and ageing.
EXCEED (Extended Cohort for E-health, Environment and DNA) is a population-based study
of over 10,000 adults based in Leicester. Recruitment is ongoing, with particular emphasis
on increasing representation from Black, Asian and minority ethnic (BAME) communities.
Data includes baseline anthropometry, spirometry, lifestyle factors, longitudinal health
information from EHR (primary care and Hospital Episode Statistics) and genome-wide
genotyping. Participants also consent to be contacted for recall-by-genotype and recall-by-
phenotype studies, facilitating precision medicine research and validation of EHR.
Mean participant age is 58.4 years (sd 8.5). Participants are slightly healthier than average
for common risk factors, in line with similar cohorts. For example, the total proportion who
are overweight/obese (64.2%) is slightly lower than similar age groups in Health Survey for
England. Nevertheless, 60% of participants with primary care EHR linkage have at least one
chronic condition (any qualifying diagnostic code for ≥1 of 16 conditions prioritised for
management in primary care by the Quality and Outcomes Framework), and over a quarter
have evidence of multimorbidity (≥2 of 16 conditions). Over 90% have ≥4 recordings of blood
pressure over time, and many have multiple results from routine blood tests such as serum
cholesterol (62.6% with ≥3 results) or creatinine (72.4% with ≥3 results), enabling the study
of progression and outcomes of common chronic diseases over time.
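As an illustration of the counting rule described above (at least one versus at least two
of the 16 prioritised conditions), a minimal Python sketch follows. The participant IDs,
condition names and the `classify_participants` helper are hypothetical and are not part of
EXCEED; real use would first map qualifying diagnostic codes to conditions.

```python
from collections import Counter


def classify_participants(conditions_by_participant):
    """Map each participant ID to 'none', 'single' or 'multimorbid'."""
    status = {}
    for pid, conditions in conditions_by_participant.items():
        n = len(set(conditions))  # distinct conditions, not repeated codes
        if n >= 2:
            status[pid] = "multimorbid"
        elif n == 1:
            status[pid] = "single"
        else:
            status[pid] = "none"
    return status


# Hypothetical toy records: participant ID -> conditions with a qualifying code
records = {
    "p1": ["hypertension", "diabetes", "hypertension"],
    "p2": ["asthma"],
    "p3": [],
}
status = classify_participants(records)
counts = Counter(status.values())
```

Counting distinct conditions rather than codes keeps repeat recordings of the same diagnosis
from inflating the multimorbidity estimate.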
Linked primary care EHRs contain a wealth of data on chronic disease. Whilst there are
limitations - including misclassification, miscoding and the "clinical iceberg" effect - repeat
recording and multiple types of data can be used to improve classification. Many disease
definitions have been validated already - for example, using CPRD - and EXCEED allows for
further validation by recontacting participants. In addition, combining records such as
prescriptions, diagnoses and symptom codes can define complex phenotypes that may not
otherwise be easily studied. EXCEED is an important resource for the study of chronic
disease and multimorbidity in the era of genomics and precision medicine.
P13
MegaCox: Large-Scale implementation for Cox's Proportional Hazards Model
Alexander W. Jung, Moritz Gerstung
EMBL-EBI
The availability of population-scale health data sets ranging from medical records, wearable
devices, to various 'omics data allows for data analysis on an unprecedented scale and depth.
However, the vast amount of data requires the development of new statistical methods. Of
particular interest in health data science is the prediction of the time to an event such as
death or disease diagnosis.
Due to heavily skewed distributions and censoring, standard statistical methods are no
longer applicable and more advanced models in the context of survival analysis are needed.
Cox's Proportional Hazards Model (CPH) is extensively used in research, thanks to sound
statistical theory, flexibility, and meaningful parameter estimates.
Nevertheless, current implementations of the CPH require the data to fit into memory,
limiting the scope for big data applications. Therefore, we developed an
implementation of the CPH that scales to massive amounts of data. We provide a simulation
study to show the consistency of our method and its scalability to big data.
This forms the methodological basis to analyze a population-wide hospital admission record
of 8+ million people to infer predictive disease trajectories.
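The abstract does not describe how the scaling is achieved, so the following is only a
sketch of the standard objective that any CPH implementation has to evaluate: the negative
log partial likelihood in Breslow form. The function name and the toy data are assumptions
for illustration. Once records are sorted by time, the risk-set sum can be accumulated in a
single pass.

```python
import math


def neg_log_partial_likelihood(beta, times, events, covariates):
    """Breslow-form negative log partial likelihood for a Cox model.

    beta: list of coefficients; covariates: one feature list per subject;
    events: 1 if the event was observed, 0 if the subject was censored.
    """
    eta = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    # Visit subjects from the latest time to the earliest.
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    risk_sum = 0.0  # running sum of exp(eta) over the current risk set
    nll = 0.0
    i = 0
    while i < len(order):
        # Add everyone sharing this time so that ties enter the risk set together.
        j = i
        while j < len(order) and times[order[j]] == times[order[i]]:
            risk_sum += math.exp(eta[order[j]])
            j += 1
        for k in order[i:j]:
            if events[k]:
                nll -= eta[k] - math.log(risk_sum)
        i = j
    return nll


# Toy check: with beta = 0 and three uncensored subjects, the risk sets have
# sizes 3, 2 and 1, so the value is log(3) + log(2) + log(1).
value = neg_log_partial_likelihood([0.0], [1, 2, 3], [1, 1, 1], [[0.5], [1.0], [2.0]])
censored = neg_log_partial_likelihood([0.0], [1, 2, 3], [1, 0, 1], [[0.5], [1.0], [2.0]])
```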
P14
Adherence to cardiovascular medication across Scotland
Kirstin Leslie, Professor Jill Pell, Professor Colin McCowan, Professor Andrew Morris, Professor Dave Robertson.
University of Glasgow MRC Doctoral Training Partnership in Precision Medicine
Adherence to prescription drugs is a longstanding but increasing problem in disease
management. In recent years, a rise in chronic diseases, longer life-expectancies, increased
multi-morbidity, and use of medicine in disease prevention as well as treatment, have led to
an increase in the number of people taking drugs for extended periods of time, exacerbating
the problem of nonadherence. Failure to adhere to medication can contribute to reduced
clinical effectiveness, reduced cost-effectiveness, poorer overall health, and health
inequalities.
Despite the availability of efficacious drugs, cardiovascular disease (CVD) remains a leading
cause of global mortality, and prevalence of CVD is higher in Scotland than in other
developed countries; consequently, research into adherence to cardiovascular drugs could
be particularly useful in this setting. Furthermore, Scotland has valuable nation-wide
administrative databases which can be used for the study of adherence at a population level.
With these datasets, it is possible to define different levels of disease severity, identify
polypharmacy and comorbidities, and compare adherence across a range of drug classes.
This could prove important in targeting future interventions to improve adherence and for
informing prescribing practices.
Using the Prescribing Information System (PIS) linked to hospital admissions data (SMR01,
SMR04) and death certificates (NRS) we have defined four key patient subgroups: primary
prevention (n= 1,656,566), treatment for symptomatic CVD (n = 260,516), secondary
prevention (n=25,283), and secondary prevention with treatment (n=23,866). These groups
differ across key baseline characteristics such as age, gender, and commonly prescribed
CVD drugs.
This study is a novel take on the study of adherence to cardiovascular medication: a
longitudinal, Scotland-wide, retrospective study of adherence with cardiovascular drugs,
covering all ages and sociodemographic classes, giving greater insight into modifiable and
unmodifiable risk factors, and allowing identification of patient groups requiring extra support.
P15
Use of Hospital Episode Statistics in the UK Biobank to aid independent replication of genetic associations
Nadezda Lipunova1, Anke Wesselius2, Kar K. Cheng3, Frederik-Jan van Schooten4, Richard T. Bryan1, Jean-Baptiste Cazier1,5, Maurice P. Zeegers1,2
1Institute of Cancer and Genomic Sciences, University of Birmingham, United Kingdom, B15 2TT; 2Department of Complex Genetics, Maastricht University, The Netherlands, 3Institute for Applied Health Research, University of Birmingham, United Kingdom, 4Department of Pharmacology and Toxicology, Maastricht University, The Netherlands, 5Centre for Computational Biology, University of Birmingham, United Kingdom
Urinary bladder cancer (UBC) is a disease with great burden on healthcare systems and
patients. It is recognised that research into the management of existing cases would provide
the most benefit; however, practical advances are scarce. This problem exists partly due to
the lack of adequate sample sizes with the required phenotype data (i.e. recurrence,
progression), which are not collected routinely and rely on specific cohorts with long follow-up.
We aim to aid current research by mining Hospital Episode Statistics (HES) to construct
meaningful UBC outcomes.
We have used HES on procedures carried out in a hospital setting, all coded under
Classification of Interventions and Procedures (version 4, OPCS4). Our developed algorithm
aims to identify specific patterns of administered procedures and their timing that would
represent UBC recurrence and/or progression. To validate their usefulness in a research
setting, we are currently carrying out an independent replication analysis of previously
reported genetic associations with UBC recurrence, progression, and survival (both overall-
and cancer-specific).
We encourage the use of HES data to aid both exploratory and confirmative research on
bladder cancer. Our proposed algorithm is currently specific to the OPCS4, used throughout
the United Kingdom. However, we believe the methodology is transferable and can be used
with other classifications of procedural codes.
P16
Analysis of Pulmonary Surfactant Pathway Genes and Association with Lung Cancer Risk
Jennifer Luyapan, Younghun Han, Todd A. MacKenzie, Christopher I. Amos
Quantitative Biomedical Science, Geisel School of Medicine, Dartmouth, Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth, Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine
Surfactant genes play a key role in maintaining the integrity of pulmonary airways by
encoding for proteins involved in the production, function, and catabolism of pulmonary
surfactant. Mutations in genes responsible for surfactant homeostasis have been associated
with fibrotic lung disease and lung cancer development. Previous studies identified genetic
variations in SFTPA, a gene responsible for surfactant metabolism and function, and its
association with lung cancer. Surfactant metabolism is regulated by many genes, and
comprehensive evaluations of genetic associations with lung cancer risk are lacking.
However, little attention has been paid to other surfactant proteins and their association with
lung cancer risk. Our comprehensive assessment of genes associated with surfactant
metabolism may provide new evidence to biological mechanisms by which surfactant genes
play a role in lung cancer etiology. We propose to perform a genetic association analysis of
single nucleotide polymorphisms (SNPs) genotyped within 100 kb of all surfactant genes
and associated proteins involved in their production and maturation. Genome-wide genotype
data for 29,266 cases and 56,450 controls were collected by the OncoArray Consortium to
study lung cancer risk. The OncoArray and Affymetrix chip arrays were configured to include
all known functional SNPs around surfactant genes to analyze their association with lung
cancer. We identified 3 susceptibility loci in the cathepsin H (CTSH) gene associated with overall
lung cancer. We also identified 1 missense variant in surfactant protein B (SFTPB), 3
susceptibility loci in CTSH, and 34 susceptibility loci in telomerase reverse transcriptase
(TERT) genes that are associated with the lung histological subtype adenocarcinoma. In
addition, we identified 2 susceptibility loci in CTSH and 1 locus in adenosine deaminase 2
(ADA2) genes that are associated with small cell lung cancer. All identified susceptibility
loci achieved genome-wide significance. Given the importance of functional annotation of
discovered variants, all analyses will implement an approach to infer surfactant gene
expression levels by integrating data from expression quantitative trait loci (eQTL)
experiments from normal lung and whole blood tissues.
P17
Assessing for interaction between APOE e4, sex and lifestyle on cognitive abilities.
Dr Donald M. Lyall, Carlos Celis-Morales, Laura M. Lyall, Christopher Graham, Nicholas Graham, Daniel F. Mackay, Rona J. Strawbridge, Joey Ward, Jason M. Gill, Naveed Sattar, Jonathan Cavanagh, Daniel J. Smith, Jill P. Pell.
Institute of Health & Wellbeing, University of Glasgow, Scotland, UK. Institute of Cardiovascular and Medical Sciences, University of Glasgow, Scotland UK. Department of Medicine Solna, Karolinska Institute, Stockholm, Sweden.
Background: There are relatively well-known lifestyle risk factors for worse cognitive abilities,
such as smoking cigarettes. There is some evidence that people with apolipoprotein E (APOE)
e4 genotype may be more vulnerable to such risk factors; i.e. that associations are
significantly larger in magnitude for e4 carriers vs. non-carriers. Our objective was to test for
interactions between APOE e4 genotype, and lifestyle factors on worse cognitive test scores
in middle-aged to older participants from the general population.
Methods: Using UK Biobank cohort data, we tested for interactions between APOE e4 allele
presence, lifestyle factors of high vs. low alcohol intake, smoking history, total physical
activity, obesity, and male vs. female sex, on cognitive tests of reasoning, information
processing speed and executive function (n range=70,988-324,725 depending on the test).
We statistically adjusted for potential confounders of age, sex, deprivation, cardiometabolic
conditions, and educational attainment.
Results: There were significant associations between APOE e4 and worse cognitive abilities,
independent of potential confounders, and between lifestyle risk factors and worse cognitive
abilities; however, there were no interactions at multiple correction-adjusted P<0.05, against
our hypotheses.
Conclusions: Our results do not provide support for the idea that e4 genotype increases
vulnerability to the negative effects of lifestyle risk factors on cognitive ability, but rather
support a primarily outright association between APOE e4 genotype and worse cognitive
ability.
P18
A digital health approach for assessing glucose continuously in epidemiology cohort studies, including novel software for analysing continuously measured glucose data
Louise A C Millard [1,2,3], Nashita Patel [5], Kate Tilling [1,3], Melanie Lewcock [3], Peter A Flach [2], Debbie A Lawlor [1,3,4]
1 MRC Integrative Epidemiology Unit (IEU), Bristol Medical School (Population Health Sciences), University of Bristol, Bristol, UK 2 Intelligent Systems Laboratory, Department of Computer Science, University of Bristol, Bristol, UK 3 Population Health Sciences, Bristol Medical School, University of Bristol 4 Bristol NIHR Biomedical Research Centre 5 Department of Women and Children’s Health, School of Life Course Sciences, King's College London, UK
While epidemiology studies typically measure glucose levels in the blood at a single or
widely spaced time points (e.g. every few years), there is increasing appreciation that
glucose levels and variability in free-living conditions during both the day and night, may
provide important health measures in clinical (e.g. diabetic or obese) and 'healthy'
populations. Continuous glucose monitors (CGM) record interstitial glucose 'continuously',
producing a sequence of measurements for each participant (e.g. glucose levels every 5
minutes over several days, both day and night). To analyze CGM data, researchers tend to
derive summary variables such as the area under the curve (AUC), to then use in
subsequent analyses. However, to date, a lack of consistency and transparency of precise
definitions used for these summary variables has hindered interpretation, replication and
comparison of results across studies.
We have developed GLU, an open-source software package for deriving a consistent set of
summary variables from CGM data. GLU derives a diverse set of summary variables
covering six broad domains: overall average levels, overall variability, glycaemic excursions,
fasting levels, variability from one moment to the next and post-event (e.g. after meals)
levels. For example, overall variability is depicted by the AUC while post-meal levels are
depicted by the AUC 1- and 2-hours after eating and time-to-peak-glucose. GLU also
provides two day-oriented approaches to dealing with missing data. The 'complete days'
approach includes only days with no missing values, while the 'approximal imputation'
approach imputes missing time periods using nearby non-missing values. GLU is
implemented in R and is available on GitHub at [https://github.com/MRCIEU/GLU].
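To make two of the ideas above concrete, here is a short Python sketch of an AUC summary
variable and the 'complete days' filter. It is not GLU's code (GLU is implemented in R) and
GLU's exact definitions may differ; the function names, the mmol/L units and the toy readings
are assumptions for illustration.

```python
def trapezoid_auc(minutes, glucose):
    """Area under the glucose curve by the trapezoidal rule (mmol/L x minutes)."""
    return sum(
        (t1 - t0) * (g0 + g1) / 2.0
        for t0, t1, g0, g1 in zip(minutes, minutes[1:], glucose, glucose[1:])
    )


def complete_days(readings_by_day, expected_per_day):
    """Keep only days that have a full set of non-missing readings."""
    return {
        day: values
        for day, values in readings_by_day.items()
        if len(values) == expected_per_day and all(v is not None for v in values)
    }


# Hypothetical readings taken every 5 minutes
auc = trapezoid_auc([0, 5, 10], [5.0, 6.0, 5.0])

# Hypothetical days with three expected readings each; None marks a missing value
days = {"day1": [5.0, 5.2, 5.1], "day2": [5.0, None, 5.3]}
kept = complete_days(days, expected_per_day=3)
```

Stating the rule and units explicitly, as here, is the kind of transparency the abstract
argues has been missing from published CGM summary variables.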
We have used pilot data from the Avon Longitudinal Study of Parents and Children
Generation 2 (ALSPAC-G2) cohort to illustrate GLU's utility. Pregnant women are invited to
wear a CGM device on their buttock, abdomen or arm, for 6 days. We present the results of
preliminary analyses in 43 women, assessing the association of BMI with each GLU
summary variable. We found that a higher gestational BMI was associated with several
aspects of these CGM data - a higher overall mean glucose during both the day- and night-
time, higher time spent in hyper-glycaemia during the night-time, shorter post-prandial time
to peak, and higher variability during the night-time. Results using the complete days and
approximal imputed approaches to dealing with missing data were broadly consistent.
P19
Population Health meets Systems Biomedicine (POPMOL)
Jonathan G. L. Mullins, Ronan A. Lyons
Swansea University Medical School, Singleton Park, Swansea SA2 8PP, Wales, UK
Health data science repositories are being joined with systems biomedicine platforms. The
Human3DProteome platform (http://human.3dproteome.swan.ac.uk) is the first fully
integrated in silico platform of the entire human proteome (all 23,000 receptors, enzymes
etc., encoded by the human genome), providing a unique molecular level structural database
including characterisation of thousands of active sites, and millions of protein-compound
interactions, developed for drug discovery and virtual toxicity screening but now also being
developed as a powerful and comprehensive mechanistic framework for advancing health
data research. The Human3DProteome project includes 3D molecular structures for 830,000
polymorphic variant proteins (as listed in the UniProt database) - the largest structural library
of human disease phenotypes, with automated linkage to protein-protein interactions,
metabolic / signalling pathways and potential therapeutic areas. Machine learning
approaches are applied to widening discovery of interactions in biological and chemical
space, downstream metabolic and phenotypic effects.
The POPMOL project addresses:
Technical challenges of linking population and patient genomic data with the 3DProteome
architecture - bridging between sequence information stored in health data repositories
(whole genome, partial genome data, specific genes) and systems biomedicine visualisation
and analysis. This pipeline facilitates the key linkage of genome information to automated
computation of structural and functional impacts of missense, nonsense, insertion, deletion,
duplication, frameshift and repeat expansion mutations at molecular, pathway and disease
network levels.
System usability and application to specific areas of research interest. The 3DProteome
web interfaces were designed primarily for use by molecular and discovery scientists in
industry and academia. The usability challenge is to determine how these might best be
adapted to the needs of health data researchers and integrated with currently used
architectures in epidemiology studies.
This foundation will widen the scope and accessibility of mechanistic molecular
computational research (sequence-structure-function-phenotype relationships), to facilitate a
community approach to profiling and clustering specific variants to phenotypes; classification
of hitherto uncharacterised genetic variants; to stimulate development of new solutions in the
broad area of population genomics, furthering interaction with partners involved in research
incorporating wider genomic, metabolomic and proteomic data.
An early focus of the genome to molecular function framework is on developing capability for
identifying patterns and clusters of genetic differences associated with specific disease
manifestations and severity, working towards indices of risk and informing new precision
medicine strategies.
P20
Trends in two- and five-year survival after glioblastoma diagnosis: a systematic review and meta-analysis of population-based studies
Michael TC Poon, Cathie LM Sudlow, Paul M Brennan, Jonine D Figueroa
Usher Institute for Population Health Sciences and Informatics, The University of Edinburgh Centre for Clinical Brain Sciences, The University of Edinburgh
Background - Glioblastoma is the most common malignant primary brain tumour with a
median survival of ~14 months. The major advance in treatment came from a clinical trial
published in 2005 demonstrating radiotherapy with concomitant and adjuvant temozolomide
improved survival. We aimed to review and summarise 2- and 5-year survival in patients with
glioblastoma before and after the landmark trial. Methods - We searched MEDLINE and
Embase for population-based observational studies of adults with glioblastoma reporting
overall survival at ≥2 years published between January 2003 and 15 March 2019. Studies with
fewer than 50 patients were excluded. Study quality was assessed using a risk of bias tool
based on ROBINS-E. We stratified studies according to whether the recruitment period
ended before, or began in or after 2005. Overall survival at 2 and 5 years from non-
duplicating patients were meta-analysed using a random effects model. The I2 statistic
quantified heterogeneity. Results - The search identified 492 studies of which 57 met our
inclusion/exclusion criteria. The eligible studies represent 18 countries and most studies
described the United States population (n=28, 48%). The majority of studies reported
2-year survival (n=49, 86%). Pooled estimate of 2-year survival was 10% (95% confidence
interval [CI] 6-14%; n/N=681/5926; 7 studies; I2=96%) for patients recruited before 2005;
and 18% (95% CI 14-21%; n/N=5127/30678; 12 studies; I2=98%) for patients recruited in or
after 2005. The pooled estimate of 5-year survival was 3% (95% CI 1-4%; n/N=418/15685; 6
studies; I2=96%) for patients recruited before 2005; and 3% (95% CI 2-4%; n/N=471/15083;
4 studies; I2=93%) for patients recruited in or after 2005. Two-year survival reported within
the same populations showed improvement in the later recruitment period, which was not
observed in 5-year survival. Interpretation - There may be an improvement in 2-year but not
5-year survival. Few populations allowed internal comparison and multiple unmeasured
confounders contributed to the high heterogeneity, which limit interpretation of these
observations in relation to the adoption of multimodal therapy. Uncertainties about survival
trends and the effect of modern treatment are best addressed using a large-scale
population-based study with detailed clinical information.
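For readers unfamiliar with the pooling step, the sketch below shows one common way to fit a
random effects model and compute the I2 statistic: the DerSimonian-Laird estimator applied to
per-study estimates and their variances. The abstract does not state which estimator or
transformation of the survival proportions was used, so this choice, the function name and
the toy inputs are all assumptions.

```python
def dersimonian_laird(effects, variances):
    """Random-effects pooling; returns (pooled, standard_error, tau2, i2_percent)."""
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q measures dispersion of the studies around the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    standard_error = (1.0 / sum(re_weights)) ** 0.5
    # I2: share of total variation attributed to heterogeneity, as a percentage.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, standard_error, tau2, i2


# Two hypothetical studies with very different estimates and small variances
pooled, se, tau2, i2 = dersimonian_laird([0.0, 1.0], [0.01, 0.01])

# Three hypothetical studies that agree exactly: no heterogeneity
same_pooled, _, same_tau2, same_i2 = dersimonian_laird([0.1, 0.1, 0.1], [0.01, 0.02, 0.03])
```

In the first toy case almost all of the variation is between studies, which is the situation
the high I2 values reported above describe.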
P21
Using changes in protein structure to guide clinical decisions: a case for Rifampicin Resistance
Stephanie Portelli 1, Nicholas Furnham 2 , Douglas E.V. Pires 1 and David B. Ascher 1
1 Department of Biochemistry and Molecular Biology, Bio21 Institute, University of Melbourne, Victoria, Australia 2 Department of Pathogen Molecular Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, United Kingdom.
Since its discovery over 50 years ago, rifampicin has been extensively used in the clinic to
treat the mycobacterial infections tuberculosis and leprosy. The rapid spread of rifampicin
resistance, mainly through missense mutations within the drug target gene rpoB, presents a
major challenge in treating both diseases. However, although genetic sequencing
techniques are able to identify known resistance causing mutations, they fall short of
understanding underlying mechanisms. I have explored the use of structural information to
bridge this gulf in knowledge, to guide the identification and characterisation of novel
mutations within the gene rpoB.
M. tuberculosis missense mutations were initially obtained from the literature and divided
into resistant (n=203) and susceptible (n=28) phenotypes. Next, the mCSM suite of
computational tools was used to qualitatively analyse the structural and molecular changes
brought about by these mutations. Our analysis included measurements for changes in
protein stability, dynamics, interaction binding affinities and physicochemical properties.
Resistance mutations having large detrimental effects were seen at a lower frequency within
a genome-wide association study and are thought to reduce bacterial fitness. These
measurements were used to train a binary classifier to identify likely resistance mutations
using scikit-learn.
Our best classifier was trained using the KNN algorithm, which, following testing with an
independent, non-redundant blind test (n=88) performed with a precision of 88.2% and an
accuracy of 89.7%. Analysis of this model highlighted that changes in interactions within the
RNA polymerase complex, including to the nucleic acid and rifampicin, were a significant
driver of resistance. When tested using clinically resistant M. leprae mutations (n=40) from
the American Leprosy Mission, all mutations were predicted correctly, showing evidence of
clinical translatability for both mycobacterial infections.
This work illustrates the power of using structural information to interpret genomic variants.
Our structure-based tool is able to analyse missense mutations located throughout the RpoB
structure in both mycobacteria, and is the first genetics-based tool for drug resistance in
leprosy. Within the clinic, our tool is important for rifampicin stewardship, ultimately reducing
further resistance development.
P22
Generation Scotland: refining disease and improving prediction through electronic health record linkage and epigenetics.
David Porteous, on behalf of Generation Scotland
Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK.
NHS Scotland has some of the best and longest-established electronic health records (EHR)
available for research. Here we show how EHR linkage transforms Generation Scotland:
Scottish Family Health Study (GS:SFHS), a family-based genetic epidemiology study of
~24,000 volunteers from ~7000 families, from a cross-sectional to a longitudinal study of
disease prevalence and incidence. In Scotland, EHR is made possible through the
Community Health Index (CHI), a unique identifier that tracks all NHS Scotland patients
through primary, secondary and tertiary care, from birth to death. We have now established
EHR linkage to all secondary care, dental, biochemical and dispensing NHS Scotland
datasets, overcoming technical and governance issues in the process. Linkage to primary
care data is in progress and hospital based image data will follow. Participants answered
extensive demographic, health and lifestyle questions, provided biological samples and
underwent clinical assessment at the time of recruitment. In addition to these baseline
measures of physical health, participants undertook tests of psychological, cognitive and
mental health. Biological samples of blood, serum, plasma and urine combine to form a
resource with broad consent for health related research, including genetics. Genome-wide
SNP association data is available on 20,000 participants and blood-based genome-wide
methylation on 10,000.
To exemplify the broad utility of Generation Scotland, here we show how:
1. Pedigree-based heritability improves significantly on SNP-based heritability for multiple
metabolic and anthropometric traits.
2. Family structure also facilitates parent-of-origin studies that can identify couple-based and
epigenetic effects not detectable in UK Biobank.
3. Quantitative trait measures of personality and cognition can be used to refine the
diagnosis and categorisation of depression, while prescription switching can identify non-
responders.
4. Genome wide methylation explains moderate to large proportions of the variance in
lifestyle factors and complex traits, including smoking, alcohol consumption, cardiometabolic
outcomes, and lung function. Typically, the methylation-based predictors outperform GWAS-
based polygenic scores.
Researchers, both academic and commercial, can now use the linked datasets through a
managed access process (www.generationscotland.org) to undertake discovery, replication
and meta studies; identify prevalent and incident disease cases and matched healthy
controls; test potential biomarkers for prediction using stored biological samples; or invite
selected participants to volunteer for follow on studies, including repeat sample collection.
P23
Mental Health Pathfinder: Creating a mental health research resource in Scotland
Carys Pugh, Mark Adams, Toni-Kim Clarke, Matthew Iveson, Heather Whalley & Andrew McIntosh
Division of Psychiatry, Centre for Clinical Brain Sciences, University of Edinburgh
We will invite over 250,000 people to complete a mental health questionnaire and, with their
permission, link their responses to their electronic health records to create a Scottish
resource for mental health research.
In Scotland, people who use the NHS have a unique Community Health Index (CHI) number
that links their health records. SHARE (https://www.registerforshare.org) is a register of over
250,000 people in Scotland who have expressed an interest in taking part in health research.
Registering for SHARE takes less than one minute and the tendency to recruit healthy
volunteers is offset by recruiting people within health care settings. Participants have their
records linked via their CHI and are also able to opt-in to the collection of spare bloods from
routine health testing. So far, over 30,000 people have been genotyped using these bloods
and work is ongoing to increase capacity.
We intend to leverage the unique linkage possibilities of the Scottish NHS and the SHARE
register to build a resource for mental health research. All SHARE participants will be invited
to complete an online mental health questionnaire that has been adapted from the
questionnaire used by UK Biobank and the GLAD study (https://gladstudy.org.uk). The
questionnaire includes elements of the short form CIDI to assess lifetime and current
depression and anxiety, and we also ask about lifestyle, subjective wellbeing, experiences of
trauma and the use of alcohol and illegal substances. The questions are comparable to
previous questionnaires so it will be possible to substantially increase sample sizes
associated with prior analyses. By linking responses to routinely collected health data and to
genotype data, we will create a resource for genomic investigations of mental illnesses and
epidemiological studies of comorbidities. We hope to develop a better mechanistic
understanding of mental illnesses, to help to stratify medicines and interventions by specific
mental health conditions, so that the treatment is tailored to the individual and so that the
burden of mental health illness is minimised for the individual, their family and society as a
whole.
P24
Integrating professional software engineering practices in medical research software
Patricia Ryser-Welch, Olly Butters
Newcastle University
Health data sets are getting bigger, more complex, and are increasingly being linked up with
other data sources. With this trend there is an increasing risk of patient identification and
disclosure. Two different ways of mitigating this risk are to use a federated analysis
approach or to use a data safe haven.
DataSHIELD (www.datashield.ac.uk) is an established federated data analysis tool that is
used in the medical sciences. This software has a variety of methods to reduce the risk of
disclosure built in. Here we describe the steps we are taking to apply modern software
engineering methodologies to DataSHIELD. The upcoming Medical Devices legislation
requires software to be tested more rigorously. While this legislation does not
directly apply to software used for research, we think it is important that the ideas behind it
filter down to research software. For us these principles include testing that functions work,
as well as testing that they produce the correct answers. Using a static standard data set to
test against (that is publicly available) is also an important aspect. This work is being done in
a continuous integration framework using Microsoft Azure. Additionally all our software is
developed as open source.
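The distinction between testing that a function works and testing that it gives the correct answer can be illustrated with a minimal sketch. This is not DataSHIELD code (DataSHIELD itself is written in R); the function `safe_mean`, the reference values and the disclosure threshold of 5 are all hypothetical choices made for illustration.

```python
import math

# Hypothetical static reference data set; a real project would load a
# published, versioned file instead of hard-coding values.
REFERENCE_VALUES = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]
MIN_COUNT = 5  # assumed disclosure threshold, chosen only for this sketch


def safe_mean(values, min_count=MIN_COUNT):
    """Return the mean, refusing to summarise potentially disclosive small groups."""
    if len(values) < min_count:
        raise ValueError("too few observations to release a summary")
    return sum(values) / len(values)


def test_function_runs():
    # "It works": the call completes and returns a number.
    assert isinstance(safe_mean(REFERENCE_VALUES), float)


def test_answer_is_correct():
    # "It is right": compare with a value worked out independently (108 / 6).
    assert math.isclose(safe_mean(REFERENCE_VALUES), 18.0)


def test_small_groups_are_blocked():
    # The disclosure control is part of the behaviour under test.
    try:
        safe_mean(REFERENCE_VALUES[:3])
    except ValueError:
        return
    raise AssertionError("expected a disclosure error")
```

Tests of this shape can be run by any test runner on every commit, which is the role a continuous integration service plays.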
In addition to the protection DataSHIELD provides on its own we are also integrating it into
our Trusted Research Environment as part of Connected Health Cities North East and North
Cumbria. This will give an extra level of protection to data that may automatically flow from
multiple data sources. Additionally, because analysis can be done in a federated way,
data does not need to leave its data controller's environment. This opens up the
possibility of analysis happening across trusts and regions.
P25
Using phenotype ontologies to integrate human genetics and model-organism
research data
Tom Shorter, Anthony J Brookes and Tim Beck
Department of Genetics and Genome Biology, University of Leicester, Leicester
Phenotype ontologies are used to standardise phenotypic descriptions within many
biomedical research databases. In order to effectively search across databases that use
different ontologies, equivalent terms can be mapped. The “GWAS Phenomap” project
enables the discovery of genome-wide association study (GWAS) findings across a range of
GWAS and GWAS-related data sources by mapping their phenotype content with that of
GWAS Central (https://www.gwascentral.org/). GWAS Central currently provides a set of
online tools for integrating and comparing summary-level GWAS data from over 3000
studies measuring over 70 million p-values. These tools have been extended to allow a
range of mapped ontologies (MeSH, HPO, EFO, ICD-10) to be visualised and searched.
Running an ontology-driven query from the Phenomap interface interrogates the source
datasets and displays the matching studies, publications, and database cross-links. All data
sources can be queried using any of the ontologies, and phenotype data that are not defined
using any ontology are automatically annotated using concept recognition.
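The idea of querying sources annotated with one ontology using a term from another can be sketched as follows. This is an illustrative assumption about how mapped terms might be used, not the GWAS Central implementation; the term identifiers and study names are placeholders rather than real ontology IDs.

```python
from collections import defaultdict

# Hypothetical term equivalences; identifiers are placeholders, not real ontology IDs.
MAPPINGS = [
    ("MeSH:demo-1", "HPO:demo-2"),
    ("HPO:demo-2", "EFO:demo-3"),
    ("EFO:demo-3", "ICD10:demo-4"),
]

# Hypothetical studies, each annotated with a term from a single ontology.
STUDIES = {
    "study_A": "MeSH:demo-1",
    "study_B": "ICD10:demo-4",
    "study_C": "MeSH:demo-9",
}


def build_graph(mappings):
    """Store each mapping in both directions so equivalence is symmetric."""
    graph = defaultdict(set)
    for left, right in mappings:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def equivalent_terms(graph, term):
    """Collect every term reachable from `term` through chains of mappings."""
    seen, stack = {term}, [term]
    while stack:
        for neighbour in graph[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def search(graph, studies, term):
    """Return studies annotated with the query term or any mapped equivalent."""
    terms = equivalent_terms(graph, term)
    return sorted(name for name, annotation in studies.items() if annotation in terms)


GRAPH = build_graph(MAPPINGS)
print(search(GRAPH, STUDIES, "EFO:demo-3"))  # ['study_A', 'study_B']
```

Treating mappings as strictly transitive is a simplification; in practice mappings may be broader or narrower rather than exact.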
Furthermore, we have mapped between human and mouse phenotype ontologies to link
mouse phenotypic observations with their human equivalents in GWAS as part of a novel
approach to refine the regulatory loci identified in the mouse for lifestyle diseases. Our
automated pipeline maps Medical Subject Headings to Mammalian Phenotype (MP)
ontology terms, and using a concept recognition technique, associates free-text mouse trait
descriptions to MP ontology terms. For a range of lifestyle disease-related phenotypes, we
are able to identify the human intergenic regions that are syntenic to the mouse loci that are
associated with the equivalent phenotypes. By extending this approach and integrating
disease related mouse genetic studies with GWAS on a large scale, we can validate the
clinical relevance of mouse studies across many disease groups.
P26
CNTN5 genetic variation in Metabolic and Mental Health
Dr Rona Strawbridge, Soddy Sau Yu Leung1, Mark E.S. Bailey2, Breda Cullen1, Amy Ferguson1, Nicholas Graham1, Keira J.A. Johnston1,2,3, Donald M. Lyall1, Laura M. Lyall1, Claire L. Niedzwiedz1, Robert Pearsall1, Richard J Shaw1, Rachana Tank1, Joey Ward1, Daniel J. Smith1
1 Institute of Health and Wellbeing, University of Glasgow, Glasgow, UK. 2 School of Life Sciences, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, UK 3 Division of Psychiatry, College of Medicine, University of Edinburgh, Edinburgh, UK 4 Cardiovascular Medicine Unit, Department of Medicine Solna, Karolinska Institutet, Stockholm, Sweden
Epidemiology has convincingly demonstrated that people suffering from serious mental
illness (SMI, such as schizophrenia, bipolar disorder and major depressive disorder) have an
excess risk of obesity, diabetes and cardiovascular disease. It is unclear to what extent the
increased risk of cardiometabolic disease in people with SMI is due to social determinants
and lifestyle factors or whether there are shared biological mechanisms connecting mental
and physical illness.
The CNTN5 locus has been associated with a number of cardiometabolic traits, as well as
with SMI and suicidal behaviour, and might represent a mechanism common to mental and
physical illness. In this study we explored whether CNTN5 associations with cardiometabolic
traits overlap with, or are distinct from, the associations with psychological traits, including
suicidal behaviour, and assessed trans-ancestral consistency in effects.
In the UK Biobank study (N=129,000-377,000) significant associations (multiple testing
corrected p < 1x10^-6) were observed between genetic variants in CNTN5 and neuroticism,
systolic and diastolic blood pressure, with suggestive associations (p < 1x10^-5) being
observed with mood instability, generalised anxiety disorder, major depressive disorder and
central adiposity. Linkage disequilibrium analysis indicates that these signals are
independent. Future analyses considering the multiple signals in this locus could provide
novel insights into behaviours relating to SMI.
P27
Discovering the UK’s Cohorts
Matt Styles, Tom Giles, Jurgen Mitsch, Philip Quinlan
Advanced Data Analysis Centre, Digital Research Service, University of Nottingham, Nottingham, UK
Medical research in the UK is still hindered by a perceived lack of suitable data and sample
resources that can be used by the research community. This perception contrasts starkly
with the knowledge that there are hundreds of potential resources that can supply data and
samples. The work of the Tissue Directory and Coordination Centre (TDCC), backed by
the Medicines Discovery Catapult (MDC), has shown that the actual problem experienced by
both academic and commercial researchers is the inability to find suitable resources. The
consequence is that researchers tend to use the resources that are physically close to them
(because they can go and talk to the resource directly) rather than those closest to their research
goals. This pattern persists because there is no easy mechanism by which to discover
other potential national resources.
TDCC is the UK's national centre that is tasked with coordinating biobanks (samples and
data) to ensure that researchers re-use existing resources before seeking to collect new
samples or datasets. TDCC is a joint endeavour between the University of Nottingham and
University College London. Researchers can search on very high-level classification of
disease, gender and age and resources matching those criteria are displayed. This Directory
acts as a first filter but does not allow researchers to ask detailed feasibility questions.
It is clear that in order to support and utilise existing UK investment in health care resources
we must have a more efficient mechanism by which a researcher can ask a relatively detailed
question and understand whether resources exist that could support them. It is especially
important that researchers find out quickly if something is not feasible (rather than investing
time and finding it is not) and equally that referrals to the resources (such as Genomics
England) from the search system do not create irrelevant traffic. We are making a leap in
search capability across four of the UK's leading cohorts (Genomics England, UK Biobank,
ALSPAC and Generation Scotland) via a research collaboration with a leading technology
provider, BC Platforms, in order to deliver researchers a single place in which in-depth
queries can be performed in order to understand the feasibility of research using cohorts.
P28
Exploring the functional effects of genetic risk scores in pseudotime
Shu Mei Teo1,2, Christopher Yau 3,4, Michael Inouye 1,2,4,5,6
1 Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Australia 2 Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, United Kingdom 3 Centre for Computational Biology, Institute of Cancer and Genomic Sciences, University of Birmingham, United Kingdom 4 The Alan Turing Institute, London, United Kingdom 5 Department of Clinical Pathology, University of Melbourne, Parkville, Australia 6 School of BioSciences, The University of Melbourne, Parkville, Australia
Many complex human diseases are caused by a combination of genetic and environmental
exposures that interact with the age of an individual. Time-series molecular data obtained from
longitudinal studies are ideal, but such studies are challenging to conduct at a large scale. Recent advances in
the inference of pseudotime trajectories from single time-point data have shown that latent
temporal information could be extrapolated from cross-sectional data sets, therefore offering
new insight into dynamic disease processes without longitudinal studies. In parallel,
increasingly large genome-wide association studies have enabled the construction of
genomic risk scores (GRS) that represent an individual's risk of disease at birth. Yet, the
molecular pathways by which GRS confers disease risk have not been explored. I use
covariate-aware pseudotime approaches to model cross-sectional surveys of the human
proteome and its interactions with GRS of multiple complex human traits and diseases.
Proteins with significant GRS-pseudotime interaction effects are of interest; these represent
proteins whose trajectories are modulated by the genetic susceptibility to specific diseases.
P29
Oral anticoagulation in patients with Atrial Fibrillation: A longitudinal data linkage study.
Fatemeh Torabi, Daniel Harris, Ashley Akbari, Mike Gravenor, Ronan Lyons, Julian Halcox
Swansea University Medical School
Introduction
Atrial Fibrillation (AF) is a common condition conferring a five-fold increase in the risk of stroke, which can be reduced by two thirds or more by use of anticoagulant (AC) drugs, where appropriate. It is unknown how variation in AC use impacts stroke incidence at a population level. We are developing a dynamic model to predict the impact of AF treatment on stroke risk at an individual and population level.
Methodology
We used the Secure Anonymised Information Linkage (SAIL) Databank to identify all people with linked data who had a diagnosis of AF prior to the index date (01-01-2016) and those with a diagnosis of stroke. Antithrombotic treatment prescriptions, stroke risk (CHA2DS2-VASc) and bleeding risk (HAS-BLED) scores were evaluated on a representative index date. Subsequent analyses will be undertaken on a combined dataset of AF patients, identifying stroke-related events, patient risk factors and treatments over a study period of 2012-18. A multivariable model will be developed to evaluate the impact of changes in AC treatment on the incidence of AF-associated adverse outcomes in the Welsh AF population.
Results
We identified 57,721 patients with AF on 01-01-2016, of whom 35,443 (61.4%) were receiving AC. Of the 22,728 patients not receiving AC (non-AC), 35% were treated with anti-platelet drugs and 39% had previously been treated with AC. The proportions of patients at increased bleeding risk (HAS-BLED ≥ 3) were similar in AC and non-AC populations across all levels of stroke risk. A preliminary survey found that the prevalence of AF increased from 2012 to 2016, but the incidence of stroke in known AF patients remained stable.
Conclusions
These data show that over a third of AF patients in Wales eligible for AC therapy in 2016 were not receiving it, despite having, on average, a similar stroke and bleeding risk profile to the treated population. We also found that despite sub-optimal AC in 2016 and an increase in the numbers of patients with AF, the incidence of stroke in this population did not increase between 2012 and 2016. We are undertaking statistical analyses to explore in detail relationships between stroke and bleeding risk factors, use of AC medication and adverse clinical outcomes in AF patients over 2012-2018. A better understanding of these relationships will inform development of strategies to optimise use of AC therapy for stroke prevention at an individual and population level in Wales and beyond.
P30
Predicting genetic ethnicity and relatedness from genotyping data (pretrel)
Salih Tuna [1,2], Chris Penkett [1,2], Will Astle [3], Kathy Stirrups [1,2], Willem H. Ouwehand [1,2,4,5], Ernest Turro [1,2,3]
1. NIHR BioResource, Cambridge University Hospitals, Cambridge, UK. 2. Department of Haematology, Cambridge University, Cambridge, UK. 3. MRC Biostatistics Unit, Cambridge Institute of Public Health, Cambridge, UK. 4. Wellcome Sanger Institute, Hinxton, Cambridge, UK. 5. NHS Blood and Transplant, Cambridge Biomedical Campus, Cambridge, UK.
Prediction of ethnicity and relatedness from genomic markers is a crucial step in Quality
Control (QC) analyses for any large scale sequencing and genotyping study. By identifying
any mislabelled samples at an early stage in the project, the researcher can investigate any
potential issues in their laboratory/informatics pipelines, such as possible sample swaps or
duplications, and ensure the correct samples are included in the final dataset.
Furthermore, by using an unrelated set, bias in population genetics parameters can be
removed from the mainstream analysis. This allows, for example, accurate calculation of
Allele Frequencies (AF) for a particular mutation.
In the NIHR BioResource, we have developed an R package (pretrel) that supports initial
SNP selection from the overall dataset, then calculates ethnicity and relatedness between
samples and finishes by identifying a maximum set of unrelated samples. pretrel can be run
on genotyping data obtained from a wide variety of platforms including Illumina sequencing
data and microarrays. The full analysis suite scales to large sample sizes, and has been run
on the GEL Whole-Genome Sequencing (WGS) data (~50,000 samples), NIHR
BioResource WGS data, targeted NGS data for rare disease panels and data from Axiom
arrays.
As a first step pretrel can select a reliable set of common SNPs from your dataset. The
analysis also uses genotypes from a reference project such as the 1000 Genomes Project to
infer ethnicities and part of the selection involves finding SNPs intersecting both the project
data and reference samples. VCF files from the project and reference samples are then
filtered for this SNP set whilst checking for data completeness, before being merged to one
VCF file with high quality genotypes. The main statistical method used for predicting
ethnicities is Principal Component Analysis (PCA) where the principal components are
calculated using the reference set and the samples of interest are then projected onto the
reference set using the GENESIS R package. By using these projections, we further assign
ethnicities for each sample based on their similarity to the reference samples. Pairwise
relatedness is then calculated with the GENESIS package, which starts with the ethnicity
PCA loadings and then calculates the IBD ratios and a kinship score.
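Once pairwise kinship scores are available, selecting a set of unrelated samples can be sketched as below. This is not the pretrel implementation (pretrel is an R package): the greedy rule is a simple heuristic that does not guarantee a maximum set, and the threshold of 0.0442 is an assumed cut-off for third-degree relatedness; the sample names and scores are invented.

```python
def unrelated_subset(samples, kinship, threshold=0.0442):
    """Greedily drop the sample with the most relatives until no related pair remains.

    `kinship` maps (sample_a, sample_b) pairs to a kinship score. Pairs scoring
    above `threshold` (an assumed cut-off) are treated as related.
    """
    related = {sample: set() for sample in samples}
    for (a, b), score in kinship.items():
        if score > threshold:
            related[a].add(b)
            related[b].add(a)

    keep = set(samples)
    while keep:
        # Sorting first makes tie-breaking deterministic.
        worst = max(sorted(keep), key=lambda s: len(related[s] & keep))
        if not related[worst] & keep:
            break  # nobody left in the set has a relative in the set
        keep.remove(worst)
    return sorted(keep)


# Hypothetical example: S1 is related to S2 and S3; S4 and S5 are duplicates.
SAMPLES = ["S1", "S2", "S3", "S4", "S5"]
KINSHIP = {
    ("S1", "S2"): 0.25,
    ("S1", "S3"): 0.25,
    ("S2", "S3"): 0.01,
    ("S4", "S5"): 0.50,
}
print(unrelated_subset(SAMPLES, KINSHIP))  # ['S2', 'S3', 'S5']
```

Dropping the most connected sample first tends to keep more samples than removing one member of each related pair arbitrarily, which is why S1 is removed rather than both S2 and S3.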
P31
Café Variome, A Data Discovery Lattice
Colin Veal, Dhiwagaran Thangavelu, Gregory Warren, Marc Wadsley, Vagelis Ladas, Spencer Gibson, Tim Beck, and Anthony J Brookes
Department of Genetics and Genome Biology, University of Leicester, Leicester, UK
Health data sharing often occurs in small-scale interactions between a limited number of
individuals or institutions. This approach helps to ensure correct data use and protects
patient confidentiality, but runs counter to widespread data re-use. A more effective
approach would be based upon a supportive "data discovery" layer that would allow data
seekers to query for the "existence" rather than the "substance" of particular datasets (i.e.,
without exposing the data themselves). Such services could grow into a comprehensive
lattice of resources that could be universally interrogated by suitably authorized users. To
this end, we have built Café Variome (www.cafevariome.org) as a flexible/customisable,
web-based, data discovery tool that can be quickly installed by any genotype-phenotype
data owner or network. This technology makes safe or sensitive content appropriately
discoverable, and facilitates effective patient 'match-making' with match settings controlled
by the user.
Via Café Variome, data owners/custodians retain full control of their data. Approved data
elements (or obfuscated copies thereof) are remotely queried while remaining on the
custodian's local server. No data are transferred to the browser or other installations, and query
responses can be strictly limited (i.e., yes/no, or a count of query "hits"). The platform
includes:
- A data component, whereby different records and/or fields can be made differentially
discoverable by different users/networks.
- A query component, including simple and advanced "query-builder" interfaces, optionally
constrained by ontologies.
- An administration component, whereby a data custodian controls: who can perform
discovery on which datasets; whether any data can be viewed; the creation and
management of federated networks; and the site appearance/branding.
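The difference between querying for the "existence" and the "substance" of data can be illustrated with a minimal sketch. It is not the Café Variome API; the `Record` fields, the `discover` function and its response modes are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical fields; a real installation defines its own schema.
    subject_id: str
    gene: str
    phenotype: str


def discover(records, predicate, mode="exists"):
    """Answer a discovery query without returning any record content."""
    hits = sum(1 for record in records if predicate(record))
    if mode == "exists":
        return hits > 0  # yes/no response
    if mode == "count":
        return hits  # number of matching records only
    raise ValueError(f"unsupported response mode: {mode}")


# Invented records that stay on the custodian's side of the query.
RECORDS = [
    Record("s1", "GENE1", "seizures"),
    Record("s2", "GENE1", "ataxia"),
    Record("s3", "GENE2", "seizures"),
]

print(discover(RECORDS, lambda r: r.gene == "GENE1"))                # True
print(discover(RECORDS, lambda r: r.gene == "GENE1", mode="count"))  # 2
```

Because only a boolean or an integer leaves the function, the caller learns whether matching data exist without seeing any field values.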
Current implementations support matching across RD mutation registries (EDS Consortium),
multi-site Alzheimer subject recruitment (EU-EPAD), and biomaterial finding across biobanks
(e.g., in the Leicester Bioresource). Café Variome can be and has been configured to
interoperate with alternative discovery providers (e.g., GA4GH Beacons) and supports
querying by consent and data use conditions (e.g., using the GA4GH "ADA-M" standard).
We are now developing and applying specific modules for the discovery and real time
phenotypic similarity matching requirements of the Solve-RD Horizon 2020 project.
Applications across the UK will also be explored within the HDR-UK national institute.
Funding: Provided by EU projects Solve-RD (#779257) and EJP-RD (#825575) and HDRUK.
P32
Real-time phenotypic similarity matching across federated datasets
Marc Wadsley, Colin Veal, Gregory Warren, Tim Beck, Dhiwagaran Thangavelu, Anthony Brookes
University of Leicester
Solving diseases increasingly relies on large resources spread across many sites. This is
particularly true for rare disease where obtaining sufficient numbers of subjects for thorough
analysis or trials is difficult. Finding subjects with similar phenotypes can often reveal new
knowledge such as shared variants or increase patient numbers for robust statistical
analysis. However, finding similar patients can be time consuming due to cumbersome
communication between sites, slow patient by patient pairwise mapping or limited matching
options.
Here we describe a platform that allows real time phenotypic similarity matching across a
federated network. The core functionality resides within a graph based model combining the
HPO ontology and patient phenotype descriptions and builds upon existing semantic
similarity methods based on information content (Resnik, Lin, Jiang-Conrath, relevance
information coefficient). The platform can be deployed within our existing Café Variome
architecture, providing flexible ontology based query interfaces, access controls and further
filtering options based on demographics, variant genotypes and consent amongst many
other options.
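Information-content similarity of the Resnik type can be sketched on a toy ontology as follows. This is an illustrative assumption rather than the platform's graph-based implementation: the ontology, the patients and the best-match averaging used to compare a query with a patient are all invented for the example.

```python
import math

# Toy ontology (child -> parents); term names are placeholders, not HPO IDs.
PARENTS = {"root": [], "A": ["root"], "B": ["root"],
           "A1": ["A"], "A2": ["A"], "B1": ["B"]}

# Hypothetical patients and their annotated terms.
PATIENTS = {"p1": {"A1"}, "p2": {"A2"}, "p3": {"B1"}, "p4": {"A1", "B1"}}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def information_content(patients):
    """IC(t) = -log(fraction of patients annotated with t or a descendant of t)."""
    counts = {}
    for terms in patients.values():
        expanded = set().union(*(ancestors(t) for t in terms))
        for term in expanded:
            counts[term] = counts.get(term, 0) + 1
    return {term: -math.log(n / len(patients)) for term, n in counts.items()}


def resnik(ic, term_a, term_b):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max((ic.get(term, 0.0) for term in common), default=0.0)


def patient_similarity(ic, query, terms):
    """Average, over query terms, of the best match among the patient's terms."""
    return sum(max(resnik(ic, q, t) for t in terms) for q in query) / len(query)


def rank(ic, query, patients):
    """Order patients from most to least similar; ties are broken by name."""
    scored = [(patient_similarity(ic, query, terms), name)
              for name, terms in patients.items()]
    return [name for _, name in sorted(scored, key=lambda item: (-item[0], item[1]))]


IC = information_content(PATIENTS)
print(rank(IC, {"A1"}, PATIENTS))  # ['p1', 'p4', 'p2', 'p3']
```

Patients sharing the exact query term rank first, a patient sharing only the broader parent term ranks next, and a patient sharing only the root scores zero.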
Benchmarking using a lightweight Linux server (2 cores, 4GB RAM) and populating the
platform with 10,000 patient equivalents (2 to 15 HPO terms per patient) generated from
7500 OMIM disease descriptions, we found that queries of 1-5 HPO terms matched similar patients
in <100ms. This shows potential for high scalability not only within a single system but
across a large, international federated network.
This is now being deployed in the Solve-RD Horizon 2020 project and we are currently
adapting this to additional ontologies and for use in other health data contexts.
P33
Consumer-led breakthroughs in insulin-treated diabetes
Cyndi Williams, Peter Wooldridge
Quin Technology Ltd
100M people use insulin to treat diabetes. No doctor can tell anyone exactly how much
insulin to take and when. People with diabetes use trial-and-error to make hundreds of
decisions a day to keep themselves going. They're 2-3 times more likely to suffer from
fatigue, anxiety, stress and depression. Only 8% of them achieve outcomes that will keep
them from complications or an early death due to diabetes.
The outcomes are poor because there are just too many long-lasting, variable, overlapping
and poorly understood factors that affect blood glucose. Some are studied by researchers -
insulin injections, eating, activity - but still many more - stress, sleep, illness, menstruation,
travel, etc. - remain unstudied. It's impossible for any individual to know about or recall all
these events and manage them effectively with insulin. Doctors and researchers are equally
stuck. There's no way to study the events in isolation, the endocrine system is too complex
and poorly understood, and the molecular taxonomy for diabetes remains incomplete,
making only generic solutions possible.
The experience and expertise of the millions of people who are keeping themselves alive
every day taking insulin is a massive source of untapped diabetes knowledge. We are using
it to unlock the development of more personalised solutions for insulin-treated diabetes.
Our mobile app takes data from existing diabetes devices, wearables and phones, and uses
it to formalise and classify studied and unstudied events and evaluate their effects on blood
glucose. Each individual's unique self-care experience is codified into our Insulin Intervention
Taxonomy (patent pending). These rich highly structured data sets are the basis for
modelling and training machine learning algorithms to give personalised decision-making
guidance to our users via our app, and stratify them according to their experiences and
outcomes. New stratifications then network "users like me" via our app for peer support, and
drive new multi-omic research into more personalised treatments and pathways.
We are currently running a research program with 50+ people with diabetes who take
insulin. We are using their self-care insights to help shape the personalised decision-making
services in our app.
P34
Contextualised concept embedding for efficiently adapting natural language processing models for phenotype identification
Honghan Wu, Karen Hodgson, Susan Dyson, Katherine I. Morley, Zina M. Ibrahim, Ehtesham Iqbal, Robert Stewart, Richard JB Dobson, Cathie Sudlow
Centre for Medical Informatics, Usher Institute, University of Edinburgh, United Kingdom; IoPPN, King’s College London, United Kingdom; Health Data Research UK, UCL, United Kingdom
*Background*
Much effort has gone into using automated approaches, such as natural language
processing (NLP), to mine or extract data from free-text medical records to build
comprehensive patient profiles for delivering better health care. Reusing NLP models in new
settings, however, remains cumbersome, requiring validation and/or retraining on new data
iteratively to achieve convergent results.
*Materials and Methods*
We formally define and analyse the NLP model adaptation problem, particularly in
phenotype identification tasks, and identify two types of common unnecessary or wasted
efforts: duplicate waste and imbalance waste. A distributed representation approach is
proposed to represent familiar language patterns for an NLP model by learning phenotype
embeddings from its training data. Computations on these language patterns are then
introduced to help avoid or reduce unnecessary efforts by combining both geometric and
semantic similarities. To evaluate the approach, we cross validate NLP models developed for
six physical morbidity studies (23 phenotypes; 17 million documents) on anonymised medical
records of South London Maudsley NHS Trust, United Kingdom.
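One way to picture the geometric part of this comparison is a sketch in which each mention is represented by an embedding vector and a new mention counts as "familiar" if it lies close to a pattern the model has already seen. This is an assumption made for illustration, not the authors' method: the vectors, the cosine measure and the threshold of 0.9 are invented, and the semantic similarity component is not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_by_familiarity(known_patterns, new_mentions, threshold=0.9):
    """Separate mentions resembling known patterns from those that look novel.

    Mentions in the first list might be accepted without further validation;
    those in the second would be prioritised for validation and retraining.
    """
    familiar, novel = [], []
    for name, vector in new_mentions.items():
        best = max(cosine(vector, pattern) for pattern in known_patterns)
        (familiar if best >= threshold else novel).append(name)
    return familiar, novel


# Hypothetical two-dimensional embeddings, for illustration only.
KNOWN = [[1.0, 0.0], [0.0, 1.0]]
NEW = {"mention_1": [0.9, 0.1], "mention_2": [1.0, 1.0]}

print(split_by_familiarity(KNOWN, NEW))  # (['mention_1'], ['mention_2'])
```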
*Results*
Two metrics are introduced to quantify the reductions for both duplicate and imbalance
wastes. We conducted various experiments on reusing NLP models in four phenotype
identification tasks. Our approach can choose the best model for a given new task, which can
identify up to 76% of mentions needing no validation or model retraining while achieving
very good performance (93-97% accuracy). It can also provide guidance for validating and
retraining the model for novel language patterns in new tasks, which can help save around
80% of the efforts required in blind model-adaptation approaches.
*Conclusion*
Adapting pre-trained NLP models for new tasks can be more efficient and effective if the
language pattern landscapes of old settings and new settings can be made explicit and
comparable. Our experiments show that the phenotype embedding approach is an effective
way to model language patterns for phenotype identification tasks and its operations can
guide efficient NLP model reuse.
P35
Population level individual risk assessment informed by microbial genomics: assessing TB transmission risk in England
David Wyllie, Esther Robinson, Eliza Alexander, Vlad Nikolayevskyy, Colin Campbell, Grace Smith
Public Health England
Background
WHO has proposed that countries with low tuberculosis incidence eliminate in-country
transmission. In the UK, as in other low incidence countries, many TB cases are imported,
and so incidence does not necessarily reflect UK-based transmission. However, neither the
best method of measuring in-country transmission, nor methods for rapid detection of groups
at risk of onward transmission, are well established. Routine whole genome sequencing of
TB isolates may be able to help both goals.
Methods
Maximum likelihood phylogenetic relationships were determined between M. tuberculosis
sequences from a universal, prospective M. tuberculosis sequencing program operating in
England. For the first M. tuberculosis isolation from each individual, we computed the time to
putative transmission, defined as subsequent isolation of an organism compatible with
descent from the first, but from a different individual. Survival analyses were used to determine
risk factors for putative transmission. A model derived using one year's data from
part of England was validated using independent data from the whole country obtained the
subsequent year.
Results
We examined TB isolates from 1420 individuals obtained in 2016/2017 from Midlands and
North of England, and 2653 from across England in 2018/19. Kaplan-Meier estimates of
transmission were generated both for the 2016/17 dataset and across England in 2018/19.
Univariate (KM) analyses showed transmission risk was higher in patients with multiple
recognised risk factors for TB transmission. However, in a multivariable proportional hazards
model derived from 2016/17 cases, only pulmonary disease, young age, imprisonment, and
the isolation of similar samples in the last 120 days significantly increased putative
transmission; recent immigration, and lineage 1 disease decreased risk. From these
variables, a risk prediction score was generated and successfully validated in the 2018/19
data set. Prediction was impacted relatively little by imprisonment status and recent
immigration status, the only variables not available from laboratory data.
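For readers unfamiliar with Kaplan-Meier estimation, the calculation can be sketched as below, treating putative transmission as the event and the end of follow-up as censoring. The follow-up times and event indicators are invented for illustration and are not study data.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct time with at least one event.

    `durations` are follow-up times; `events` are 1 if the event (here, putative
    transmission) was observed at that time and 0 if follow-up was censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        time = data[i][0]
        observed = leaving = 0
        # Group all individuals sharing the same follow-up time.
        while i < len(data) and data[i][0] == time:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((time, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up in days.
DURATIONS = [30, 60, 60, 90, 120]
EVENTS = [1, 0, 1, 1, 0]

for time, surv in kaplan_meier(DURATIONS, EVENTS):
    # The estimated probability of transmission by `time` is 1 - survival.
    print(time, round(1 - surv, 3))
```

With these invented values the event-free probability falls to 0.8, 0.6 and 0.3 at days 30, 60 and 90, so the estimated cumulative transmission probabilities are 0.2, 0.4 and 0.7.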
Conclusions
Analysis of routine whole genome sequencing data can provide M. tuberculosis transmission
estimates; such analyses suggest declining transmission in the Midlands and North area of
the UK, which is compatible with known UK trends. Importantly, routine laboratory data
allow in-country tuberculosis transmission risk prediction in England, something
which could be added to laboratory reporting to drive targeted public health interventions.
More refined estimates are likely achievable were more complex data incorporated,
perhaps using Bayesian models. Overall, the approach represents an application of data
fusion, and the provision of personalised medicine to disease control.
Delegate List
Mehrdad A Mizani
Usher Institute of Population Health
Sciences and Informatics
Ashley Akbari
HDR-UK Wales & Northern Ireland
Beatrice Alex
University of Edinburgh
Chantal Babb de Villiers
PHG Foundation
Chiara Batini
University of Leicester
Robin Beaumont
University of Exeter
Tim Beck
University of Leicester
Fran Biggin
Lancaster University
Veronique Birault
Francis Crick Institute
William Bradlow
University Hospitals Birmingham NHS
Foundation Trust
Anthony Brookes
University of Leicester
Thore Manuel Buergel
EMBL-EBI
Caroline Bull
University of Bristol
Tianxi Cai
Harvard University
Archie Campbell
University of Edinburgh
Toni Clarke
University of Edinburgh
Adrian Cortes
GSK
Nathan Cunningham
University of Warwick
John Danesh
University of Cambridge
Mark Davies
Medicines Discovery Catapult
Carol Dezateux
QMUL & HDRUK London
Emanuele Di Angelantonio
University of Cambridge
Richard Dobson
King's College London
Dimitrios Doudesis
University of Edinburgh
Elliot Dryer Beers
Imperial College London
Demitra Ellina
F1000
Tomas Fitzgerald
EMBL-EBI
David Ford
HDR-UK Wales & Northern Ireland
Thorsten Forster
LifeArc
Isabella Friis Joergensen
Novo Nordisk Foundation Center for
Protein Research
Kate Gardner
Guys and St Thomas’ NHS Trust
Katie Harron
UCL
Jennifer Harrow
ELIXIR
Joanne Hartley
Medicines Discovery Catapult
Claire Hastie
University of Glasgow
Caroline Hayward
MRC Human Genetics Unit
Maria Herrero Zazo
EMBL-EBI
Kathryn Holt
Monash University
Richard Houghton
Wellcome Sanger Institute
Michael Inouye
University of Cambridge
Catherine John
University of Leicester
Alexander Wolfgang Jung
EBI
Stephen Kaptoge
University of Cambridge
Sarah Kim Hellmuth
New York Genome Center
Nathalie Kingston
University of Cambridge
Jo Knight
Lancaster University
Zeljko Kraljevic
King's College London
Loic Lannelongue
University of Cambridge
Kirstin Leslie
University of Glasgow
Melissa Lewis Brown
Health Data Research UK
Waty Lilaonitkul
University College London
Cecilia Lindgren
Li Ka Shing Centre -BDI
Nadia Lipunova
University of Birmingham
Jennifer Luyapan
Dartmouth College
Donald Lyall
University of Glasgow
Ronan Lyons
HDRUK Wales & Northern Ireland
Calum MacRae
Harvard Medical School
Teri Manolio
NHGRI
Andrew McIntosh
University of Edinburgh
Louise Millard
Integrative Epidemiology Unit, University
of Bristol
Aoife Molloy
Imperial College London
Sarah Morgan
EMBL-EBI
Andrew Morris
Health Data Research UK
Joseph Mullen
SciBite Ltd
Jonathan Mullins
Swansea University
Vasilis Nikolaou
University of Surrey
Povilas Norvisas
BenevolentAI
Jyotishman Pathak
Weill Cornell Medicine
Chris Penkett
NIHR BioResource
John Perry
University of Cambridge
Michael Poon
Usher Institute of Population Health
Sciences and Informatics
Stephanie Portelli
University of Melbourne
David Porteous
University of Edinburgh
Carys Pugh
University of Edinburgh
Kristiina Rannikmae
University of Edinburgh
Caroline Relton
University of Bristol
Christopher Rentsch
LSHTM/Yale
Brent Richards
McGill University
Tom Richardson
MRC Integrative Epidemiology Unit
Thorsten Roser
Loughborough University London
Patricia Ryser-Welch
Newcastle University
Crina Samarghitean
ICE/University of Cambridge
Aravind Sankar
European Bioinformatics Institute (EMBL-EBI)
Robert Scott
GlaxoSmithKline
Syed Ahmar Shah
The University of Edinburgh
Tom Shorter
University of Leicester
Kamen Shoylev
KCL
Avinoam Shye
Tel Aviv University
Jonathan Smellie
University of Edinburgh
Rona Strawbridge
University of Glasgow
Matthew Styles
University of Nottingham
Cathie Sudlow
University of Edinburgh
Praveen Surendran
University of Cambridge
Dominic Sykes
The University of Edinburgh
Shu Mei Teo
Baker Heart and Diabetes Institute
Fatemeh Torabi
Swansea University
Salih Tuna
NIHR BioResource
Rachael Turner
SciBite
Julius Upmeier zu Belzen
Charité - Universitätsmedizin Berlin
Colin Veal
University of Leicester
Marc Wadsley
University of Leicester
Rhos Walker
Health Data Research UK
Neil Walker
NIHR BioResource
Rosemary Walmsley
University of Oxford
Cyndi Williams
Quin
Andrew Wood
University of Exeter
Angela Wood
University of Cambridge
Peter Wooldridge
Quin
Honghan Wu
University of Edinburgh
David Wyllie
Public Health England
Index
A Mizani, M P1
Alex, B P2
Beaumont, R S25
Beck, T P3
Biggin, F S33
Brookes, A P4
Buergel, T P5
Bull, C P6
Cai, T S49
Clarke, T P7
Cunningham, N P8
Davies, M P9
Dobson, R S3
Doudesis, D S17
Friis Joergensen, I S5
Harrow, J P10
Hastie, C S31
Hayward, C P11
Herrero-Zazo, M S11
Inouye, M S19
John, C P12
Jung, A W P13
Kim-Hellmuth, S S15
Kraljevic, Z S35
Leslie, K P14
Lindgren, C S21
Lipunova, N P15
Luyapan, J P16
Lyall, D P17
MacRae, C S43
Manolio, T S13
Millard, L P18
Molloy, A S45
Morris, A S1
Mullen, J S37
Mullins, J P19
Pathak, J S7
Penkett, C S29
Poon, M P20
Portelli, S P21
Porteous, D P22
Pugh, C P23
Rannikmae, K S9
Richards, B S27
Richardson, T S23
Ryser-Welch, P P24
Sankar, A S39
Shorter, T P25
Strawbridge, R P26
Styles, M P27
Teo, S M P28
Torabi, F P29
Tuna, S P30
Veal, C P31
Wadsley, M P32
Walker, N S41
Wood, A S47
Wooldridge, P P33
Wu, H P34
Wyllie, D P35