Health Data Science


Page 1: Health Data Science

1

Name:

Health Data Science

Wellcome Genome Campus Conference Centre, Hinxton, Cambridge, UK

11-12 June 2019

Scientific Programme Committee:

John Danesh

University of Cambridge, UK

Caroline Relton

University of Bristol, UK

Cathie Sudlow

University of Edinburgh, UK

Tweet about it: #HDS19

@ACSCevents /ACSCevents /c/WellcomeGenomeCampusCoursesandConferences

Page 2: Health Data Science

2

Wellcome Genome Campus Scientific Conferences Team:

Rebecca Twells – Head of Advanced Courses and Scientific Conferences

Treasa Creavin – Scientific Programme Manager

Nicole Schatlowski – Scientific Programme Officer

Jemma Beard – Conference and Events Organiser

Lucy Criddle – Conference and Events Organiser

Sarah Offord – Conference and Events Office Administrator

Zoey Willard – Conference and Events Organiser

Laura Wyatt – Conference and Events Manager

Page 3: Health Data Science

3

Dear colleague,

I would like to offer you a warm welcome to the Wellcome Genome Campus Advanced Courses and

Scientific Conferences: Health Data Science. I hope you will find the talks interesting and stimulating,

and find opportunities for networking throughout the schedule.

The Wellcome Genome Campus Advanced Courses and Scientific Conferences programme is run on a

not-for-profit basis, heavily subsidised by the Wellcome Trust.

We organise around 50 events a year on the latest biomedical science for research, diagnostics and

therapeutic applications for human and animal health, with world-renowned scientists and clinicians

involved as scientific programme committees, speakers and instructors.

We offer a range of conferences and laboratory-, IT- and discussion-based courses, which enable the

dissemination of knowledge and discussion in an intimate setting. We also organise invitation-only

retreats for high-level discussion on emerging science, technologies and strategic direction for select

groups and policy makers. If you have any suggestions for events, please contact me at the email address

below.

The Wellcome Genome Campus Scientific Conferences team are here to help this meeting run smoothly,

and at least one member will be at the registration desk between sessions, so please do come and ask

us if you have any queries. We also appreciate your feedback and look forward to your comments, which help us to continually improve the programme.

Best wishes,

Dr Rebecca Twells
Head of Advanced Courses and Scientific Conferences
[email protected]

Page 4: Health Data Science

4

General Information

Conference Badges

Please wear your name badge at all times to promote networking and to assist staff in identifying you.

Scientific Session Protocol

Photography, audio or video recording of the scientific sessions, including the poster session, is not permitted.

Social Media Policy

To encourage the open communication of science, we would like to support the use of social media at

this year’s conference. Please use the conference hashtag #HDS19. You will be notified at the start of

a talk if a speaker does not wish their talk to be open. For posters, please check with the presenter to

obtain permission.

Internet Access

Wifi access instructions:

1. Join the ‘ConferenceGuest’ network
2. Enter your name and email address to register
3. Click ‘continue’ – this will provide a few minutes of wifi access and send an email to the registered email address
4. Open the registration email, follow the link ‘click here’ and confirm the address is valid
5. Enjoy seven days’ free internet access!

Repeat these steps on up to 5 devices to link them to your registered email address.

Presentations

Please provide an electronic copy of your talk to a member of the AV team who will be based in the

meeting room.

Poster Sessions

Posters will be displayed throughout the conference. Please display your poster in the Conference

Centre on arrival. There will be one poster session during the conference, taking place on Tuesday 11 June from 18:00 to 19:30. The page number of your abstract in the abstract book indicates your

assigned poster board number. An index of poster numbers appears in the back of this book.

Conference Meals and Social Events

Lunch and dinner will be served in the Hall. Please refer to the conference programme in this book as

times will vary based on the daily scientific presentations. Please note there are no lunch or dinner

facilities available outside of the conference times.

All conference meals and social events are for registered delegates. Please inform the conference

organiser if you are unable to attend the dinner.

A cash bar will be open from 19:00 – 23:00 each day.

Dietary Requirements

If you have advised us of any dietary requirements, you will find a coloured dot on your badge. Please

make yourself known to the catering team and they will assist you with your meal request.

If you have a gluten or nut allergy, please note that we are unable to guarantee the absence of gluten or nuts in dishes, even if they are not used as a direct ingredient, because gluten and nut ingredients are used in the kitchen.

Page 5: Health Data Science

5

For Wellcome Genome Campus Conference Centre Guests

Check in

If you are staying on site at the Wellcome Genome Campus Conference Centre, you may check into

your room from 14:00. The Conference Centre reception is open 24 hours.

Breakfast

Your breakfast will be served in the Hall restaurant from 07:30 – 09:00

Telephone

If you are staying on-site and would like to use the telephone in your room, you will need to

contact the Reception desk (Ext. 5000) to have your phone line activated – they will require your

credit card number and expiry date to do so.

Departures

You must vacate your room by 10:00 on the day of your departure. Please ask at reception for

assistance with luggage storage in the Conference Centre.

Taxis

Please find a list of local taxi numbers on our website. The Conference Centre reception will also

be happy to book a taxi on your behalf.

Return Ground Transport

Complimentary return transport has been arranged for 15:30 on Wednesday, 12 June to Cambridge

station and city centre (Downing Street), and Stansted and Heathrow airports.

A sign-up sheet will be available at the conference registration desk from 15:30 on Tuesday, 11 June.

Places are limited so you are advised to book early.

Please allow a 30-40-minute journey time to both Cambridge city and Stansted airport, and up to 3

hours to Heathrow airport.

Messages and Miscellaneous

Lockers are located outside the Conference Centre toilets and are free of charge.

All messages will be available for collection from the registration desk in the Conference Centre.

A number of toiletry and stationery items are available for purchase at the Conference Centre

reception. Cards for our self-service laundry are also available.

Certificate of Attendance

A certificate of attendance can be provided. Please request one from the conference organiser

based at the registration desk.

Contact numbers

Wellcome Genome Campus Conference Centre – 01223 495000 (or Ext. 5000)

Wellcome Genome Campus Conference Organiser (Zoey) – 07747 024256

If you have any queries or comments, please do not hesitate to contact a member of staff who will

be pleased to help you.

Page 6: Health Data Science

6

Conference Summary

Tuesday, 11 June 2019

09:00 – 10:00 Registration with refreshments

10:00 – 10:10 Welcome and introduction

10:10 – 11:10 Keynote lecture Andrew Morris, Health Data Research, UK

11:10 – 12:40 Session 1: Computational medicine

12:40 – 14:00 Lunch

14:00 – 15:30 Session 2: Precision medicine

15:30 – 16:00 Afternoon tea

16:00 – 17:30 Session 3: Integrating multi-omics into health research

17:30 – 18:00 Lightning talks for poster session abstracts

18:00 – 19:30 Poster session and drinks reception

19:30 Served conference dinner

Wednesday, 12 June 2019

09:00 – 10:30 Session 4: Molecular epidemiology and population health

10:30 – 11:00 Morning coffee

11:00 – 12:00 Session 5: Platform and technologies for health data science

12:00 – 13:30 Lunch

13:30 – 15:00 Session 6: Digital health

15:00 – 15:10 Closing remarks

15:10 – 15:30 Take away refreshment

15:30 Coaches depart to Cambridge City Centre and Train Station, and Heathrow

Airport via Stansted Airport

Page 7: Health Data Science

7

Health Data Science

Wellcome Genome Campus Conference Centre,

Hinxton, Cambridge

11 – 12 June 2019

Lectures to be held in the Francis Crick Auditorium

Lunch and dinner to be held in the Hall Restaurant

Poster sessions to be held in the Conference Centre

Spoken presentations - If you are an invited speaker, or your abstract has been selected for a

spoken presentation, please give an electronic version of your talk to the AV technician.

Poster presentations – If your abstract has been selected for a poster, please display this in the

Conference Centre on arrival.

Conference programme

Tuesday, 11 June 2019

09:00 - 10:00 Registration

10:00 - 10:10 Welcome and Introduction

Programme Committee: Caroline Relton, University of Bristol, UK

10:10 - 11:10 Keynote Lecture

Chair: Caroline Relton, University of Bristol, UK

Health Data Research UK (HDR UK): A national institute for health data

science

Andrew Morris

Health Data Research, UK

11:10 - 12:40 Session 1: Computational medicine

Chair: John Danesh, University of Cambridge, UK

11:10 Using the EHR for research and intervention and enriching the clinical

phenotype through smartphones and wearables

Richard Dobson

King’s College London, UK

11:40 Large-scale correction of faulty diagnoses in population-wide Danish

registry data

Isabella Friis Joergensen

Novo Nordisk Foundation Center for Protein Research, Denmark

12:10 Mining health and healthcare utilization patterns during pregnancy to

discover risk factors for postpartum depression

Jyotishman Pathak

Weill Cornell Medicine, USA

Page 8: Health Data Science

8

12:25 Validating the identification of neurological diseases in UK Biobank

Kristiina Rannikmae

University of Edinburgh, UK

12:40 - 14:00 Lunch

14:00 - 15:30 Session 2: Precision medicine

Chair: Cathie Sudlow, University of Edinburgh, UK

14:00 Characterization of inpatient trajectories and analysis of missingness

patterns from Electronic Health Records data

Maria Herrero-Zazo

EMBL-EBI, UK

14:15 Implementing genomic medicine in the US and globally

Teri Manolio

Division of Genomic Medicine NHGRI, USA

14:45 Understanding genetic effects on gene expression in health and disease

Sarah Kim-Hellmuth

New York Genome Center, USA

15:15 Validation of a machine learning algorithm for diagnosing type 1

myocardial infarction on a population-scale dataset

Dimitrios Doudesis

University of Edinburgh, UK

15:30 - 16:00 Afternoon Tea

16:00 - 17:30 Session 3: Integrating multi-omics into health research

Chair: Teri Manolio, Division of Genomic Medicine NHGRI, USA

16:00 Regulation of the plasma proteome by polygenic disease risk

Michael Inouye

University of Cambridge, UK

16:30 (Gen)-omics of obesity

Cecilia Lindgren

Li Ka Shing Centre - BDI, UK

17:00 A transcriptome-wide Mendelian randomization study to uncover

tissue-dependent regulatory mechanisms across the human phenome

Tom Richardson

MRC Integrative Epidemiology Unit, UK

17:15 Common genetic variation influences the probability that new-born

babies are classified as clinically small or large for gestational age

Robin Beaumont

University of Exeter, UK

Page 9: Health Data Science

9

17:30 - 18:00 Lightning Talks

Chair: John Danesh, University of Cambridge, UK

18:00 - 19:30 Poster Session with Drinks Reception

19:30 Conference Dinner

Wednesday, 12 June 2019

09:00 - 10:30 Session 4: Molecular epidemiology and population health

Chair: Caroline Relton, University of Bristol, UK

09:00 Polygenic risk scores and potential clinical applications

Brent Richards

McGill University, Canada

09:30 Taking genomics into the clinic: Whole-genome sequencing of 13,000

rare disease patients in a national healthcare system

Chris Penkett

NIHR BioResource, UK

10:00 Antenatal exposure to ultraviolet B radiation and learning disabilities:

Population cohort study of 422,512 children

Claire Hastie

University of Glasgow, UK

10:15 A comparison of stroke in urban and rural populations: Exploring

stroke risk and survival in Lancashire

Frances Biggin

Lancaster University, UK

10:30 - 11:00 Morning Coffee

11:00 - 12:00 Session 5: Platform and technologies for health data science

Chair: Michael Inouye, University of Cambridge, UK

11:00 A novel unsupervised approach to biomedical named entity

recognition and linking using neural networks

Zeljko Kraljevic

King's College London, UK

11:15 Creating knowledge graphs from literature to enable prioritisation of

potential drug targets

Joseph Mullen

SciBite Ltd, UK

11:30 The Accelerating Medicines Partnership Type 2 Diabetes Knowledge

Portal

Aravind Sankar

European Bioinformatics Institute (EMBL-EBI), UK

Page 10: Health Data Science

10

11:45 Using Privacy Enhancing Technologies (PETs) to integrate clinical

phenotypes, routine healthcare data and genomics in the public cloud

Neil Walker

NIHR BioResource, UK

12:00 - 13:30 Lunch

13:30 - 15:00 Session 6: Digital health

Chair: Richard Dobson, King’s College London, UK

13:30 Creating ‘operating systems’ for biomedical discovery and care

Calum MacRae

Harvard Medical School, USA

14:00 Refining national approaches to low value care

Aoife Molloy

Imperial College London, UK

14:15 Using wearable devices and genetics to estimate and validate

mechanisms of sleep

Andrew Wood

University of Exeter, UK

14:30 Towards a self-learning healthcare system

Tianxi Cai

Harvard University, USA

15:00 - 15:10 Closing Remarks

Programme Committee: Cathie Sudlow, University of Edinburgh, UK

15:10 - 15:30 Take Away Refreshments

15:30 Coaches depart to Cambridge City Centre and Train Station, and

Heathrow Airport via Stansted Airport

Page 11: Health Data Science

11

These abstracts should not be cited in bibliographies. Materials contained herein should be

treated as personal communication and should be cited as such only with consent of the

author.

Page 12: Health Data Science

12

Notes

Page 13: Health Data Science

S1

Spoken Presentations

Health Data Research UK (HDR UK): A national institute for health data science

Andrew Morris

Health Data Research UK, UK

Healthcare is arguably the last major industry to be transformed by the information age.

Deployments of information technology have only scratched the surface of possibilities for

the potential influence of information and computer science on the quality and cost-

effectiveness of healthcare. In this talk, the vision, objectives and scientific strategy of HDR

UK will be discussed; specifically, the opportunities provided by computer science, artificial

intelligence and "big data" to transform health care delivery models. Examples will be given

from nationwide research and development programmes that integrate electronic patient

records with biologic and health system data. Two themes will be explored; specifically:

How the size of the UK (65 million residents), allied to a relatively stable population and unified health care structures, facilitates the application of data science to support nationwide, quality-assured provision of care.

How population-based datasets can be integrated with biologic information and enabled by data science to facilitate (i) epidemiology; (ii) drug safety studies; (iii) enhanced efficiency of clinical trials through automated follow-up of clinical events and treatment response; and (iv) the conduct of large-scale genetic, pharmacogenetic, and family-based studies essential for precision medicine.

Page 14: Health Data Science

S2

Notes

Page 15: Health Data Science

S3

Using the EHR for research and intervention and enriching the clinical phenotype

through smartphones and wearables

Richard Dobson

King's College London, UK

Digital health records present potentially transformative opportunities for novel research of

direct clinical relevance. There is much value in the unstructured portion of the electronic

health record (EHR), often in the form of rich narrative, which is currently unused, yet

important for understanding health interactions that are either not recorded in, or are less

obvious from, the structured data (e.g. multi-morbidity, cancer diagnosis). I will describe

efforts (and open source tools) to use this unstructured data in hospital records to enable

research-ready, actionable, real-time and large-scale EHRs.

The integration of NLP derived phenotypes from EHRs with other modalities including

imaging, mobile health and genomics will help generate a more complete picture of the

patient. The data streamed from devices such as wearables and smartphones enables the

shift from sporadic and sparse patient engagement to pervasive and continuous monitoring

throughout the disease continuum from pre-diagnosis (risk, health and wellness), through

diagnosis (early diagnosis, progression monitoring, non-pharmacological interventions) to post-intervention (adherence, relapse prediction, efficacy monitoring). As such, remote monitoring

provides insight into what is happening to patients in real time, and also a direct intervention

by feeding back to patients. I will describe progress and challenges faced in mobile health

studies such as the IMI2 RADAR-CNS (RADAR-CNS.org) programme and the generalised

open source platforms developed to support this work.

Page 16: Health Data Science

S4

Notes

Page 17: Health Data Science

S5

Large-scale correction of faulty diagnoses in population-wide Danish registry data

Isabella Friis Joergensen1, Søren Brunak1

1Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3B, DK-2200 Copenhagen N, Denmark.

Diagnostic errors are common, and it has been estimated that everyone will experience at

least one diagnostic error in their lifetime. Common and potentially harmful diagnostic errors

include under-, mis- and over-diagnoses. We use Chronic Obstructive Pulmonary Disease

(COPD) to showcase a new approach to identify faulty diagnoses. Underdiagnosis of COPD

has been estimated to be 50-80%, while 5-60% are misdiagnosed. Earlier studies evaluate

the airway obstruction in population-wide data and identify undiagnosed patients as people

with airway obstruction but no COPD diagnosis, while misdiagnosed patients have a COPD

diagnosis, but no airway obstruction. Here we use an alternative disease trajectory

approach.

We used the Danish National Patient Registry containing hospital diagnoses for the entire

Danish population (6.9 million people) covering more than two decades. COPD patients

were used to identify common, temporal diagnostic correlations that were combined to

longer disease trajectories. These typical disease trajectories were compared with non-

COPD patients to discover undiagnosed COPD. Low similarity between COPD patients and

typical disease trajectories identified mis- and over-diagnosed patients.
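
As a rough illustration of the trajectory-building idea described above (not the authors' implementation), the sketch below counts how often one diagnosis precedes another within a patient's history, keeps the common ordered pairs, and chains them into three-step trajectories. The diagnosis codes, data structure and `min_support` threshold are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

# Illustrative sketch only: count ordered diagnosis pairs across patient
# histories and chain frequent pairs into three-diagnosis trajectories.
# `histories` maps a patient id to a time-ordered list of diagnosis codes;
# `min_support` is an arbitrary illustrative threshold, not from the abstract.

def frequent_pairs(histories, min_support=1000):
    pair_counts = Counter()
    for codes in histories.values():
        seen = set()
        for earlier, later in combinations(codes, 2):  # preserves temporal order
            if earlier != later and (earlier, later) not in seen:
                seen.add((earlier, later))
                pair_counts[(earlier, later)] += 1
    return {p: n for p, n in pair_counts.items() if n >= min_support}

def three_step_trajectories(pairs):
    # Chain (A, B) and (B, C) into a candidate trajectory A -> B -> C.
    trajectories = set()
    for (a, b) in pairs:
        for (b2, c) in pairs:
            if b == b2 and len({a, b, c}) == 3:
                trajectories.add((a, b, c))
    return trajectories

# Toy example:
histories = {
    1: ["J44", "J18", "C34"],   # COPD -> pneumonia -> lung cancer
    2: ["J44", "J18"],
}
pairs = frequent_pairs(histories, min_support=1)
print(three_step_trajectories(pairs))
```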

In total, the 284,154 COPD patients yielded 61,521 and 428 common disease trajectories with three and five consecutive diseases, respectively. This systematic, data-driven analysis of common disease trajectories and patient similarity from an entire population identified 3,569 potentially under-diagnosed COPD patients whose comorbidities are very similar to those of diagnosed COPD patients. The potentially undiagnosed patients were repeatedly diagnosed with respiratory infections, such as influenza and pneumonia, indicating undiagnosed COPD. Furthermore, we identified 4,553 potentially mis- or over-diagnosed COPD patients who do not display the typical COPD comorbidities. The group of potentially under- and mis-diagnosed patients die significantly earlier and have significantly fewer respiratory tests than the general COPD patient; they die shortly after the COPD diagnosis and more than 10% are diagnosed with lung cancer.

The method is entirely general, not limited to COPD, and could improve diagnostic processes to reduce errors, avoid harm, and find optimal diagnoses faster in population-wide health data.

Keywords: Diagnostic error, underdiagnosis, misdiagnosis, overdiagnosis, Chronic

Obstructive Pulmonary Disease

Page 18: Health Data Science

S6

Notes

Page 19: Health Data Science

S7

Mining health and healthcare utilization patterns during pregnancy to discover risk factors for postpartum depression

Shuojia Wang, MS; Alison Herman, MD; Jyotishman Pathak, PhD; Yiye Zhang, PhD, MS

Weill Cornell Medicine, Cornell University, New York, NY, USA; School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China

Objective: The US Preventive Services Task Force recently asserted a level B

recommendation that women at risk for postpartum depression be offered or referred for

preventive psychotherapy. However, there are significant knowledge gaps in determining

which women are at risk, particularly in regard to medical contributors. Furthermore, scales

that have been developed thus far to assess risk do not consider how risk may evolve over

the course of pregnancy and may have limited feasibility on a population-wide scale. This

project attempts to address this screening problem using machine learning technology and

electronic health records to describe patterns of health and healthcare utilization during

pregnancy that may predict risk for postpartum depression (PPD).

Methods: Using electronic health record data from Weill Cornell Medicine and NewYork-

Presbyterian Hospital from 2015 to 2017, we followed pregnant women's (n=9978) clinical

visits from their first trimesters to childbirth. Clinical events, such as billing diagnoses and

medication orders from physician encounters, were mined using machine learning

algorithms and modeled into clusters based on patterns of healthcare utilization. Clusters

were subsequently analyzed in relation to the outcome of PPD within one year after delivery.
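
As a schematic of this kind of analysis (not the study's pipeline), the sketch below assumes a simple bag-of-events representation of each pregnancy, clusters it with k-means, and compares PPD prevalence across the resulting clusters. The table layout, column names and choice of three clusters are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Illustrative sketch, not the study code. `events` is assumed to be a long
# table with one row per clinical event during pregnancy:
#   patient_id, code (billing diagnosis or medication order)
# `ppd_outcome` maps patient_id -> 1 if PPD within a year of delivery, else 0.

def cluster_utilization(events: pd.DataFrame, ppd_outcome: pd.Series, k: int = 3):
    # Bag-of-events matrix: patients x codes, cell = number of occurrences.
    counts = events.groupby(["patient_id", "code"]).size().unstack(fill_value=0)
    X = StandardScaler().fit_transform(counts)

    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    clusters = pd.Series(labels, index=counts.index, name="cluster")

    # PPD prevalence per cluster, for comparison across utilization patterns.
    prevalence = ppd_outcome.reindex(counts.index).groupby(clusters).mean()
    return clusters, prevalence
```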

Results: We discovered three clusters, which differ in the distribution of demographics,

health, and healthcare utilization. The cluster with the highest prevalence of PPD had

a statistically significantly higher age, body mass index, and likelihood of being unmarried. In addition, this cluster also had statistically significantly higher rates of medical and/or obstetric complications during pregnancy and a higher number of thyroid-related prescriptions.

Conclusions: Applying machine learning algorithms to electronic health record data may be a

powerful tool for more accurately identifying women at risk for PPD, particularly when that

risk changes over the course of pregnancy due to medical or obstetric complications.

Acknowledgments: Research reported in this manuscript is supported in part by the Walsh

McDermott Scholarship, NIH R01 MH105384, NIH P50 MH113838, and the Chinese

Scholarship Council.

Page 20: Health Data Science

S8

Notes

Page 21: Health Data Science

S9

Validating the identification of neurological diseases in UK Biobank

Kristiina Rannikmae, Tim Wilkinson, Kathryn Bush, Kenneth Ngoh, Naomi Allen, Christian Schnier, Qiuli Zhang, John Nolan, Debbie Bathgate, Michael Bennett, David Breen, Jingjie Cheng, Richard Davenport, Fergus Doubal, Gordon Duncan, Robin Flaig, David Henshall, Aidan Hutchison, Chris Lerpiniere, Conor Maguire, John O’Brien, Scott Osborne, Suvankar Pal, Tom Russ, Rustam Al-Shahi Salman, Neshika Samarasekera, Naomi Warren, Will Whiteley, Kirsty Wilson, Rebecca Woodfield, Cathie Sudlow

KR: Centre for Medical Informatics, University of Edinburgh, Edinburgh, UK and UK Biobank

Background: Disease cases among the 500,000 UK Biobank (UKB) participants can be

identified through self-report and linkage to routinely collected coded national health data.

We assessed the accuracy of these datasets in identifying neurological diseases.

Methods: We (1) created disease code lists for stroke, Parkinson's disease (PD) and

dementia; (2) identified all Edinburgh-based UKB participants with a code(s) in hospital

admission, primary care, death record data or self-report; (3) reviewed participants' medical

records to generate reference standard diagnoses; (4) calculated positive predictive values

(PPV) for all codes combined and stratified by source.
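
The PPV calculation in step (4) reduces to the proportion of coded cases, within each source, that the reference-standard review confirmed. A minimal sketch with assumed column names:

```python
import pandas as pd

# Minimal sketch of PPV stratified by data source. Assumes `cases` has one row
# per coded case with columns: disease, source (e.g. 'hospital', 'primary_care',
# 'death', 'self_report') and confirmed (True if medical-record review agreed).

def ppv_by_source(cases: pd.DataFrame) -> pd.DataFrame:
    grouped = cases.groupby(["disease", "source"])["confirmed"]
    return pd.DataFrame({
        "n_coded": grouped.size(),   # coded cases from that source
        "ppv": grouped.mean(),       # proportion confirmed = TP / (TP + FP)
    })
```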

Results: There were 225 stroke, 78 PD and 120 dementia coded cases among 17,249

participants. 39% to 53% of codes occurred only in primary care data. PPVs ranged from

79% to 91%; they were higher for stroke and lower for PD in hospital versus primary care

data (89% versus 80%; 84% versus 95%), and similar for dementia in both datasets (87%).

There were too few cases in death record and self-report data to draw firm conclusions.

Conclusions: Neurological disease cases in UKB can be identified through linked coded data

with sufficient accuracy for many genetic and epidemiological studies. Primary care data is

an important source.

Page 22: Health Data Science

S10

Notes

Page 23: Health Data Science

S11

Characterization of inpatient trajectories and analysis of missingness patterns from Electronic Health Records data

Maria Herrero-Zazo1, Thore Buergel2, Tomas Fitzgerald1, Victoria L. Keevil3,4, John Bradley4 and Ewan Birney1

1 The European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom. 2 Health Data Science Unit, University Clinics Heidelberg, Germany. 3 Department of Medicine for the Elderly, Addenbrooke's Hospital, Cambridge, United Kingdom. 4 Clinical Gerontology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom. 5 Department of Medicine, NIHR Cambridge Biomedical Research Centre, University of Cambridge, Cambridge, United Kingdom.

The implementation of Electronic Health Record (EHR) systems in secondary care, such as

the EPIC system at Cambridge University Hospitals NHS Foundation Trust (CUHNFT),

represents an unprecedented way to record real-time clinical information at the patient's

bedside. EHR systems collect clinically relevant information about patients, including vital

signs (such as blood pressure or temperature), blood test results and medications, providing

a time-based description of clinical progress from hospital admission to discharge. This

large-scale data provides a unique source of information for studying the relationships

between clinical observations, treatments and final hospital outcomes. While traditional

statistical techniques cannot study these multiple and interrelated relationships in their

entirety, advanced statistical methods such as machine learning algorithms can recognise

patterns within these data and relate multiple patient characteristics to defined outcomes, for

example inpatient death.

We introduce a collaboration between Addenbrooke's Hospital (as part of CUHNFT) and the

European Bioinformatics Institute (EMBL-EBI) to bring together clinical and technical

expertise for the development of machine learning models that can process large amounts of

complex clinical data collected over time during hospital admission episodes.

This clinical longitudinal data is a type of irregular multivariate time series characterized by

high intra- and inter-individual heterogeneity due to unevenly sampled observations, different

lengths of follow-up periods and high rates of missingness. Many time series analysis and

feature extraction approaches rely on regular time series data and therefore imputation of

missing observations is a common pre-processing step.

In this work, we present the results of several imputation methods proposed for time series

data, such as linear interpolation, moving average or last observation carried forward,

applied to various vital signs and laboratory test results data. Missingness patterns are

extracted from EHR data present in the public Multiparameter Intelligent Monitoring in

Intensive Care (MIMIC-III) database and replicated on examples of the same data set

without missing observations. The error between the real and imputed values is calculated

providing an accurate estimate of the performance of these imputation methods on the

actual data. The methodology is flexible, enabling the analysis of novel imputation

approaches and their comparison with baseline methods providing an imputation

assessment framework that can be applied to clinical data from different EHR systems.
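
A minimal sketch of this masking-and-imputation evaluation (an illustration of the general idea, not the authors' code) might look as follows; the example series, missingness mask and moving-average window are assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative sketch of the imputation-assessment idea: hide known values
# according to a missingness pattern, impute, and compare against the truth.

def impute(series: pd.Series, method: str) -> pd.Series:
    if method == "linear":
        return series.interpolate(method="linear", limit_direction="both")
    if method == "locf":
        return series.ffill().bfill()
    if method == "moving_average":
        # Rolling mean over observed values in a trailing window (window size is arbitrary).
        return series.fillna(series.rolling(window=5, min_periods=1).mean()).ffill().bfill()
    raise ValueError(method)

def evaluate(truth: pd.Series, mask: np.ndarray, method: str) -> float:
    masked = truth.copy()
    masked[mask] = np.nan                     # replicate the missingness pattern
    imputed = impute(masked, method)
    return float(np.sqrt(np.mean((imputed[mask] - truth[mask]) ** 2)))  # RMSE on held-out points

# Example: a complete heart-rate series with a synthetic missingness mask.
truth = pd.Series([72, 74, 73, 80, 85, 90, 88, 84], dtype=float)
mask = np.array([False, True, False, False, True, True, False, False])
for m in ("linear", "locf", "moving_average"):
    print(m, round(evaluate(truth, mask, m), 2))
```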

Page 24: Health Data Science

S12

Notes

Page 25: Health Data Science

S13

Implementing genomic medicine in the US and globally

Teri A. Manolio, M.D., Ph.D.

National Human Genome Research Institute (NHGRI), National Institutes of Health

The growing availability of reliable, cost-effective genetic testing and increasing knowledge about the influence of genetic variation on human health have spurred the implementation of genomic medicine into clinical care. As defined by the National Human Genome Research Institute (NHGRI), genomic medicine is an emerging medical discipline that involves using genomic information about an individual as part of their clinical care, and the health outcomes and policy implications of that clinical use. Genomic medicine is already advancing diagnosis, treatment, and prevention in the fields of oncology, pharmacology, rare and undiagnosed diseases, and infectious diseases. Yet many barriers remain, including limited evidence of efficacy, scarcity of genomics expertise, lack of standards, and difficulties in integrating genomic results into electronic medical records. In the past several years, the governments of over a dozen countries have established national genomic-medicine initiatives to address these barriers. Some have focused on genomic testing of large numbers of patients with rare diseases or cancer, paired with development of workforce and infrastructure. Others have launched large-scale, population-based sequencing projects returning genomic results directly to participants, and still others are focusing on development of informatics infrastructure such as data standards and policies and platforms for data sharing. These research and implementation programs are expected to speed the evaluation and incorporation, where appropriate, of genomic technologies and findings into routine clinical care. Actual adoption of successful approaches in the clinic will depend upon the willingness, interest, and energy of practitioners, participants, professional societies, and payers to share their data and experience and to promote the responsible use of these approaches.

Page 26: Health Data Science

S14

Notes

Page 27: Health Data Science

S15

Understanding genetic effects on gene expression in health and disease

Sarah Kim-Hellmuth1,2,3

1New York Genome Center, New York, NY, USA; 2Max-Planck-Institute of Psychiatry,

Munich, Germany; 3Department of Systems Biology, Columbia University, New York, NY,

USA

Over the last decade, genome-wide association studies (GWAS) have identified thousands

of genetic variants robustly associated with complex traits and diseases. Detailed

characterization of cellular effects of genetic variants is essential for understanding biological

processes that underlie these genetic associations to disease. One approach to address this

challenge is to map genetic effects on the transcriptome across numerous conditions. In this

talk, I will discuss our recent work within the Genotype-Tissue Expression (GTEx)

consortium, which has built the most comprehensive atlas to date of expression quantitative

trait loci (cis-eQTLs) across human tissues. Using the final v8 data release, with RNA-

sequencing data from 17,382 samples across 49 tissues of 838 individuals with genome

sequencing data, we gain insights into the tissue-, cell type-, sex- and age-specificity of cis-

eQTLs at an unprecedented breadth and depth. Taking into account this context-specificity

of cis-eQTLs significantly improves our understanding of the underlying mechanism of

genetic risk to disease, which can ultimately inform management of disease risk and

treatment.
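
At its core, a single cis-eQTL test regresses a gene's normalised expression on the allele dosage of a nearby variant, with covariates capturing context and confounding. The sketch below is a generic single-variant illustration with simulated data, not the GTEx pipeline, which uses dedicated software and permutation-based significance testing.

```python
import numpy as np
import statsmodels.api as sm

# Generic single-gene, single-variant cis-eQTL test (not the GTEx pipeline).
# expression: normalised expression values per individual
# dosage: alternative-allele dosage (0-2) for a variant near the gene
# covariates: matrix of confounders / context variables (e.g. sex, age, PCs)

def eqtl_test(expression, dosage, covariates):
    X = sm.add_constant(np.column_stack([dosage, covariates]))
    fit = sm.OLS(expression, X).fit()
    return fit.params[1], fit.pvalues[1]   # effect size and p-value for the dosage term

# Simulated example data:
rng = np.random.default_rng(0)
n = 200
dosage = rng.integers(0, 3, size=n)
covariates = rng.normal(size=(n, 2))
expression = 0.4 * dosage + covariates @ np.array([0.2, -0.1]) + rng.normal(size=n)
print(eqtl_test(expression, dosage, covariates))
```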

Page 28: Health Data Science

S16

Notes

Page 29: Health Data Science

S17

Validation of a machine learning algorithm for diagnosing type 1 myocardial infarction on a population-scale dataset

Dimitrios Doudesis1,2, Jason Yang2, Thanasis Tsanas1, Catherine Stables2, Anoop Shah2, Atul Anand2, Fiona Strachan2, Ken Lee2, Nicholas Mills1,2

1 Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK. 2 BHF Centre for Cardiovascular Science, University of Edinburgh, Edinburgh, UK.

Introduction

A number of rapid rule-in and rule-out pathways have been developed to assist in the

diagnosis of myocardial infarction in the acute setting. However, most do not take into

account the interaction between troponin levels, age, and gender. A proprietary statistical

model which takes these factors into account to predict the likelihood of type 1 myocardial

infarction (MI) has previously been created and validated (manuscript currently under

review), but it has not yet been tested against a large population-scale dataset.

Methods

We validated the model against a dataset consisting of 48,282 patients from High-STEACS,

a trial evaluating the use of high-sensitivity troponin assays (hs-cTnI) in the diagnosis of

suspected acute coronary syndrome. The model takes into account two consecutive hs-cTnI

results, the time between samples, age, and gender. It outputs the predicted risk of type 1 MI

(0-100).

The area under the receiver-operating-characteristic curve (AUC) was calculated for the

cohort as a whole. We also evaluated model performance statistics (sensitivity, specificity,

NPV, PPV) at two thresholds (1.6 and 49.7) suggested previously in the manuscript under

review, corresponding to low-risk (rule-out) and high-risk (rule-in) classifications respectively.
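
For readers who want the mechanics, the threshold-based statistics above can be derived from a simple 2x2 classification at each cut-off. The sketch below is generic and uses made-up risk scores and outcomes, not trial data.

```python
import numpy as np

# Generic sketch: classify predicted type 1 MI risk (0-100) at the rule-out and
# rule-in thresholds quoted above and compute the usual 2x2-table statistics.

def threshold_stats(risk, label, threshold):
    pred = risk >= threshold
    tp = np.sum(pred & (label == 1))
    fp = np.sum(pred & (label == 0))
    fn = np.sum(~pred & (label == 1))
    tn = np.sum(~pred & (label == 0))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Toy example (values are illustrative, not trial data):
risk = np.array([0.5, 2.0, 30.0, 55.0, 80.0, 1.0])
label = np.array([0, 0, 1, 1, 1, 0])
print("rule-out (<1.6):", threshold_stats(risk, label, 1.6))   # patients below 1.6 are ruled out
print("rule-in (>=49.7):", threshold_stats(risk, label, 49.7))
```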

Results

Of the 48,282 patients, 22,057 had two consecutive hs-cTnI results available and were

included in the analysis. Type 1 MI occurred in 3,881 (17.6%) patients. The model was found

to have good discrimination (AUC = 0.95 [95% CI 0.947 to 0.953]) overall. Using the

prespecified thresholds, 60.26% of patients were identified as low-risk (sensitivity 0.993

[95% CI 0.991 to 0.996], NPV 0.998 [95% CI 0.997 to 0.999]) and 16.51% were identified as

high-risk (specificity 0.945 [95% CI 0.942 to 0.949], PPV 0.728 [95% CI 0.715 to 0.741]).

These findings are preliminary, subject to final data quality assurance checks, and are comparable to the findings of the original analysis in a different dataset.

Conclusion

The model performs favourably when tested against a large population-scale dataset. By

taking into account the interaction between age, gender, and troponin levels, the model

could improve diagnostic pathways for MI by accurately identifying high-risk patients to be

targeted for prompt treatment and allowing early discharge in low-risk patients.

Page 30: Health Data Science

S18

Notes

Page 31: Health Data Science

S19

Regulation of the plasma proteome by polygenic disease risk

Michael Inouye

University of Cambridge, UK

Abstract not available at the time of printing.

Page 32: Health Data Science

S20

Notes

Page 33: Health Data Science

S21

(Gen)-omics of obesity

Cecilia Lindgren

Li Ka Shing Centre – BDI, UK

Obesity is a result of excess body fat accumulation, which is associated with adverse health

effects such as CVD, type 2 diabetes, and cancer. It is a multifaceted, common chronic

disease with no current effective treatment strategy. Progress in understanding the aetiology

has been slow up until recently, with findings largely restricted to monogenic, severe forms

of obesity. However, technological and analytical advances have enabled detection of more

than 1000 obesity susceptibility loci and the number is rapidly increasing. These contain

genes suggested to be involved in the regulation of food intake through action in the central

nervous system (overall obesity) as well as in adipocyte function (body fat distribution).

These results provide plausible biological pathways that may, in the future, be targeted as

part of treatment or prevention strategies. Although the proportion of heritability explained by

these genes remains small, their detection heralds a new phase in understanding the

etiology of common obesity. I will discuss recent genetic and genomic studies of adiposity

that indicate a causal role for general and central adiposity in cardiometabolic disease, and

highlight potential mechanisms including insulin resistance and gene expression.

Page 34: Health Data Science

S22

Notes

Page 35: Health Data Science

S23

A transcriptome-wide Mendelian randomization study to uncover tissue-dependent regulatory mechanisms across the human phenome

Tom G Richardson, Gibran Hemani, Tom R Gaunt, Caroline L Relton, George Davey Smith

MRC Integrative Epidemiology Unit (IEU), Population Health Sciences, Bristol Medical School, University of Bristol, Oakfield House, Oakfield Grove, Bristol, BS8 2BN, United Kingdom

Background: Developing insight into tissue-specific transcriptional mechanisms can help

improve our understanding of how genetic variants exert their effects on complex traits and

disease. By applying the principles of Mendelian randomization, we have undertaken a

systematic analysis to evaluate transcriptome-wide associations between gene expression

across 48 different tissue types and 395 complex traits.
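
As background on the approach (a generic illustration, not the analysis pipeline used for this atlas), a single gene-trait test of this kind can be expressed as a Wald ratio, the SNP-trait effect divided by the SNP-expression effect, with multiple eQTL instruments combined by inverse-variance weighting. The summary statistics below are placeholders.

```python
import numpy as np

# Generic two-sample MR sketch: Wald ratio per instrument and a fixed-effect
# inverse-variance-weighted (IVW) combination across independent eQTL instruments.
# beta_exp: SNP effects on gene expression; beta_out / se_out: SNP effects on
# the trait from GWAS summary statistics (all arrays below are placeholders).

def wald_ratios(beta_exp, beta_out, se_out):
    ratio = beta_out / beta_exp        # causal effect estimate per SNP
    se = np.abs(se_out / beta_exp)     # first-order standard error
    return ratio, se

def ivw(ratio, se):
    w = 1.0 / se**2
    return np.sum(w * ratio) / np.sum(w), np.sqrt(1.0 / np.sum(w))

beta_exp = np.array([0.30, 0.25, 0.40])     # eQTL effects on expression
beta_out = np.array([0.060, 0.045, 0.090])  # SNP effects on the trait
se_out = np.array([0.010, 0.012, 0.015])
ratio, se = wald_ratios(beta_exp, beta_out, se_out)
print(ivw(ratio, se))
```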

Results: Overall, we identified 100,025 gene-trait associations based on conventional

genome-wide corrections (P < 5 × 10⁻⁸) that also provided evidence of genetic

colocalization. These results indicated that genetic variants which influence gene expression

levels in multiple tissues are more likely to influence multiple complex traits. We identified

many examples of tissue-specific effects, such as genetically-predicted TPO, NR3C2 and

SPATA13 expression only associating with thyroid disease in thyroid tissue. Additionally,

FBN2 expression was associated with both cardiovascular and lung function traits, but only

when analysed in heart and lung tissue respectively. We also demonstrate that conducting

phenome-wide evaluations of our results can help flag adverse on-target side effects for

therapeutic intervention, as well as propose drug repositioning opportunities. Moreover, we

find that exploring the tissue-dependency of associations identified by genome-wide

association studies (GWAS) can help elucidate the causal genes and tissues responsible for

effects, as well as uncover putative novel associations.

Conclusions: The atlas of tissue-dependent associations we have constructed should prove

extremely valuable to future studies investigating the genetic determinants of complex

disease. The follow-up analyses we have performed in this study are merely a guide for

future research. Conducting similar evaluations can be undertaken systematically at

http://mrcieu.mrsoftware.org/Tissue_MR_atlas/.

Page 36: Health Data Science

S24

Notes

Page 37: Health Data Science

S25

Common genetic variation influences the probability that new-born babies are classified as clinically small or large for gestational age

Robin N Beaumont, Sarah Kotecha, Bridget Knight, Sylvain Sebert, Andrew T. Hattersley, Marjo-Riitta Järvelin, Nicholas J. Timpson, Sailesh Kotecha, Rachel M Freathy

1. Institute of Biomedical and Clinical Science, University of Exeter, Exeter, UK 2. Department of Child Health, School of Medicine, Cardiff University, Cardiff, UK 3. Center For Life-course Health Research, University of Oulu, Finland 4. Department of Epidemiology and Biostatistics, Imperial College, London, UK 5. Medical Research Council Integrative Epidemiology Unit, University of Bristol, Bristol, UK

Introduction

Birth weight (BW) is an important determinant of neonatal mortality and morbidity. Babies

clinically Small or Large for Gestational Age (SGA or LGA), defined as sex- and gestational

age-adjusted BW below the 10th or above the 90th percentile, respectively, have higher risk

of complications at birth. Babies classified SGA/LGA may have experienced fetal growth

restriction or overgrowth, respectively, but the proportion of SGA/LGA babies experiencing

growth restriction or overgrowth is unknown. Recently, common genetic variants have been

identified for normal BW variation. We used genetic scores (GS) for BW, fasting glucose

(FG) and systolic blood pressure (SBP) to examine the effect of genetics on probability of

SGA and LGA.

Methods

We calculated maternal and fetal GS for BW in participants from the ALSPAC, EFSOCH,

NFBC1966 and NFBC1986 studies (n=12,125 babies and n=5,187 mothers) and maternal

FG and SBP GS from the ALSPAC and EFSOCH cohorts. We tested associations between

the GS and SGA/LGA to assess the extent to which the genetics of normal variation in birth

weight influences probability of being classified as SGA/LGA.
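
One minimal form of such a test is a logistic regression of the binary SGA (or LGA) indicator on the genetic score expressed in deciles, so that the odds ratio corresponds to a one-decile change in the score. The sketch below, with assumed column names, illustrates the idea rather than the study's analysis, which also accounted for cohort and covariates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative sketch: OR per 1-decile increase in a birth-weight genetic score
# for the odds of SGA. `df` is assumed to contain a continuous score `bw_gs`
# and a binary indicator `sga`; real analyses adjust for cohort and covariates.

def or_per_decile(df: pd.DataFrame, score: str = "bw_gs", outcome: str = "sga"):
    df = df.copy()
    df["gs_decile"] = pd.qcut(df[score], 10, labels=False) + 1   # deciles 1-10
    fit = smf.logit(f"{outcome} ~ gs_decile", data=df).fit(disp=False)
    odds_ratio = np.exp(fit.params["gs_decile"])
    ci = np.exp(fit.conf_int().loc["gs_decile"])
    return odds_ratio, tuple(ci)
```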

Results

Both maternal and fetal GS for BW showed strong association with SGA and LGA. A 1-decile higher maternal BW GS was associated with SGA (odds ratio (OR) 0.80 (95% CI: 0.76, 0.87); P = 2.0 × 10⁻⁹) and LGA (OR 1.23 (1.15, 1.31); P = 3.9 × 10⁻¹⁰). The fetal BW GS also showed association with SGA (OR 0.65 (0.60, 0.71); P = 2.4 × 10⁻²³) and LGA (OR 1.47 (1.36, 1.59); P = 1.7 × 10⁻²¹). The maternal FG GS showed similar association with both SGA (OR 0.74 (0.58, 0.94); P = 1.5 × 10⁻²) and LGA (OR 1.49 (1.19, 1.87); P = 5.7 × 10⁻⁴). The maternal SBP GS showed strong association with SGA (OR 1.31 (1.07, 1.60); P = 9.5 × 10⁻³), but weaker association with LGA, with wide confidence intervals (OR 0.87 (0.72, 1.05); P = 1.4 × 10⁻¹).

Conclusions

The strong associations of SGA and LGA with the GS suggest a large proportion of babies

classified as SGA and LGA are the tails of the normal range of the BW distribution. In the

case of SGA, it suggests that a large number of babies classified as SGA are not growth restricted, but are in the lowest quantiles of the normal range of fetal growth.

Page 38: Health Data Science

S26

Notes

Page 39: Health Data Science

S27

Polygenic risk scores and potential clinical applications

Brent Richards

McGill University, Canada

One of the goals of the Human Genome Project has been to use genetic information to

identify individuals at increased risk of disease. This goal has been difficult to realize since

most identified genetic variants have small effects and the resultant variance explained in

disease risk has been low. Using very large sample sizes it has become possible to sum up

the effect of many small effect size variants into polygenic risk scores, which when combined

with relatively simple machine learning techniques, have recently achieved clinically-relevant

amounts of variance explained.
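
The construction being described is, at its core, a weighted sum of effect-allele dosages with weights taken from GWAS effect sizes. A minimal sketch follows; the variant IDs and weights are placeholders, and real scores use many more variants plus quality control and LD handling.

```python
import pandas as pd

# Minimal polygenic risk score sketch: PRS_i = sum_j (dosage_ij * weight_j),
# where dosages count effect alleles (0, 1 or 2) and weights come from GWAS
# summary statistics. Variant names and numbers below are placeholders.

weights = pd.Series({"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.08})

dosages = pd.DataFrame(
    {"rs0001": [0, 1, 2], "rs0002": [2, 1, 0], "rs0003": [1, 1, 2]},
    index=["sample_A", "sample_B", "sample_C"],
)

prs = pd.Series(dosages[weights.index].to_numpy() @ weights.to_numpy(),
                index=dosages.index, name="prs")
print(prs)

# Scores are typically standardised before being used for risk stratification.
prs_z = (prs - prs.mean()) / prs.std()
```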

In this talk, I will discuss how the genetic prediction of bone density could potentially be

applied to screening programs for osteoporosis—amongst the most common and costly

diseases. We have recently developed a polygenic risk score from 341,449 individuals which

is highly correlated with measured bone density measure (r = 0.48). Validating this measure

in 10,199 individuals from 5 cohorts, we demonstrate how genetic prediction can help to

improve osteoporosis screening efficiency by removing individuals from the screening

program unlikely to have low bone density. Such incorporation of a genetic prediction

reduces the number of people who need to be screened for osteoporosis by ~50% yet

maintains sensitivity to identify individuals eligible for therapy at 92% and leaves the

specificity largely unchanged.

Nevertheless, multiple barriers exist prior to incorporation of such polygenic risk scores into

clinical practice and I will provide an overview of some of these challenges.

Page 40: Health Data Science

S28

Notes

Page 41: Health Data Science

S29

Taking genomics into the clinic: Whole-genome sequencing of 13,000 rare disease patients in a national healthcare system

Christopher J Penkett [1,2], Kathleen Stirrups [1,2], Salih Tuna [1,2], Olga Sharmardina [1,2], Sri V.V. Deevi [1,2], Karyn Mégy [1,2], Rutendo Mapeta [1,2], Ernest Turro [1,2,3], Daniel Greene [1,2,3], Stefan Gräf [1,2], Matthias Haimel [1,2], Hana Lango Allen [1,2], F. Lucy Raymond [1,4,5], Willem H. Ouwehand [1,2,4,6] on behalf of the NIHR BioResource-Rare Diseases Consortium

1. NIHR BioResource, Cambridge University Hospitals, Cambridge, UK. 2. Department of Haematology, Cambridge University, Cambridge, UK. 3. MRC Biostatics Unit, Cambridge Institute of Public Health, Cambridge, UK. 4. Wellcome Sanger Institute, Hinxton, Cambridge, UK. 5. Department of Medical Genetics, CIMR, Cambridge University, Cambridge, UK. 6. NHS Blood and Transplant, Cambridge Biomedical Campus, Cambridge, UK.

To study genetic sequence variants underlying unresolved Mendelian disorders and improve

interpretation of already identified high penetrance variants, a collection of ~13,000

individuals with a rare disease and their relatives has been whole genome sequenced with

an average 30x coverage. Participants were mainly recruited through the NIHR BioResource

at NHS hospitals in the UK using approved eligibility criteria for 15 different rare disease

domains.

We describe the population structure including ethnicity and relatedness estimation, high

level phenotypes collected using Human Phenotype Ontology (HPO) terms and quality

control and summary metrics for samples and variants. The resource contains over 170

million unique variants in the ~10,000 genetically independent samples with 47% of variants

previously unobserved in other large-scale publicly available genome datasets (e.g.

gnomAD, TOPMed, HGMD, UK10K).

We summarise the curation of gene lists and pertinent findings in diagnostic-grade genes for

the 15 domains. Over 1000 reports assigning pathogenic or likely pathogenic causal variants

have been issued following review by Multi-Disciplinary Teams. We show the power of a

recently developed rapid Bayesian association test, BeviMed, to identify ~25 novel genes

and to provide independent validation of recent rare disease gene discoveries by others.

Some of these novel genes have been reported and led to changes in patient management.

More NIHR BioResource rare disease project details can be found at:

https://bioresource.nihr.ac.uk/rare-diseases/rare-diseases; variants can be browsed at:

https://bioresource.nihr.ac.uk/wgs; and a biorxiv preprint is available here:

https://www.biorxiv.org/content/10.1101/507244v1.

Page 42: Health Data Science

S30

Notes

Page 43: Health Data Science

S31

Antenatal exposure to ultraviolet B radiation and learning disabilities: Population cohort study of 422,512 children

Claire E Hastie, Daniel F Mackay, Tom L Clemens, Mark PC Cherrie, Albert King, Chris Dibben, Jill P Pell

University of Glasgow, University of Edinburgh, Scottish Government

Background

Learning disability varies by month of conception. The underlying mechanism is unknown

but vitamin D, necessary for normal brain development, is commonly deficient over winter in

high latitude countries due to insufficient ultraviolet radiation. This study aimed to determine

whether antenatal exposure to ultraviolet B radiation was associated with learning disability.

Methods

We linked the annual School Pupil Censuses conducted between 2007 and 2016 to

maternity records at an individual level across Scotland. Mean monthly ultraviolet B radiation

levels were derived from NASA satellite data. Univariate and multivariate logistic regression

analyses were used to explore the associations between overall, and trimester-specific,

ultraviolet B exposure and learning disabilities, adjusting for the potential confounding effects

of month of conception and sex.

Findings

Of the 422,512 eligible, singleton schoolchildren born at term in Scotland, 79,616 (18·8%)

had a learning disability. Total antenatal ultraviolet B exposure was associated with learning

disabilities (highest quintile; adjusted OR 0·553, 95% CI 0·509-0·601, p<0·001) with

evidence of a dose-relationship. The association was independent of ultraviolet A exposure

(highest quintile; adjusted OR 0·501, 95% CI 0·456-0·549, p<0·001). Significant

associations were demonstrated for exposure in all three trimesters.

Interpretation

Lack of maternal exposure to ultraviolet B radiation may play a role in the seasonal

patterning of learning disabilities. Trials are required to evaluate the effectiveness of vitamin

D supplements.

Page 44: Health Data Science

S32

Notes

Page 45: Health Data Science

S33

A comparison of stroke in urban and rural populations: Exploring stroke risk and survival in Lancashire

Frances Biggin, Kelly Heys, Matt Heys, Jo Knight, Peter Diggle

Lancaster University (Frances Biggin, Jo Knight, Peter Diggle), University Hospitals Morecambe Bay Trust (Kelley Heys, Matt Heys)

Stroke is a major cause of death and disability in the UK and the risk factors leading to

stroke are well researched. The aim of this study is to explore whether the effects of those

risk factors vary according to whether the patient lives in a rural or an urban area, and

whether survival after a stroke is also affected by this geographical distinction.

The data consisted of 1,045 cases and 54,356 controls. Cases were all first strokes admitted

to hospital in the University Hospitals Morecambe Bay Trust between 01/01/12 and

01/06/18. Hospital data was linked to GP data through the Trust's community data

warehouse, and this in turn was linked to census data to determine the urban or rural status

of each patient. Controls were also selected from the hospital database, such that they were

representative of the same source population as the cases.

A generalised linear model was used to analyse the data. The initial model included

covariates determined by previous studies to be important: age, gender, deprivation index,

smoking status, diabetes, previous TIA (Transient Ischaemic Attack), systolic blood pressure

and rurality. Stepwise backwards selection and likelihood ratio tests were then used to

determine the most appropriate model from these covariates. The final model showed that

after covariate adjustment, those living in the urban areas of Lancashire are at a slightly

elevated risk of stroke, with a risk ratio of 1.20 (95% CI 1.02 to 1.41). A Cox proportional

hazards model for survival analysis was fitted using the same procedure and showed that

there were no statistically significant differences between the survival of urban and rural

patients after a first stroke for either 30 day or overall survival.
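
In outline, the modelling combines a logistic regression for case-control stroke risk with a Cox proportional hazards model for survival among cases. The sketch below, using statsmodels and lifelines with assumed, numerically encoded column names, is a schematic of that approach rather than the study code, which also applied stepwise backwards selection and likelihood ratio tests.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter

# Schematic of the two models described above; column names are assumptions
# and covariates are assumed to be numeric/encoded (e.g. gender and urban as 0/1).

def fit_risk_model(df):
    # Case-control stroke risk: logistic GLM with the covariates listed in the text.
    formula = ("stroke ~ age + gender + deprivation_index + smoking + diabetes "
               "+ previous_tia + systolic_bp + urban")
    return smf.glm(formula, data=df, family=sm.families.Binomial()).fit()

def fit_survival_model(cases):
    # Survival after first stroke among cases: Cox proportional hazards model,
    # with follow-up time in days and an event indicator for death.
    covariates = ["age", "gender", "deprivation_index", "smoking", "diabetes",
                  "previous_tia", "systolic_bp", "urban"]
    cph = CoxPHFitter()
    cph.fit(cases[["time_days", "died"] + covariates],
            duration_col="time_days", event_col="died")
    return cph
```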

Although this study shows that there is a small but statistically significant effect on stroke risk of living in an urban area, this result is based on data from just one county. Lancashire has a

large rural population, but none of the areas classed as rural fall into the lower end of the

deprivation index scale, which may affect the results of this study. Future studies will extend

these results into the neighbouring county of Cumbria which has a more diverse rural

population.

Page 46: Health Data Science

S34

Notes

Page 47: Health Data Science

S35

A novel unsupervised approach to biomedical named entity recognition and linking using neural networks

Zeljko Kraljevic, Rebecca Bendayan, Daniel M. Bean, Richard J. B. Dobson

1. Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, U.K. (D.M.B., R.B., R.J.B.D., Z.K.). 2. NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, U.K. (R.B., R.J.B.D., Z.K.) 3. Health Data Research UK London, University College London, 222 Euston Road, London, U.K. (R.J.B.D.). 4. Institute of Health Informatics, University College London, 222 Euston Road, London, U.K. (R.J.B.D.).

Biomedical documents like Electronic Health Records (EHRs) contain a large amount of

information in an unstructured format. The data in EHRs is a hugely valuable resource

documenting clinical narratives and decisions at the point of care, but whilst the text can be

easily understood by human doctors, it is very difficult to work with at scale for machine

learning and statistical modelling.

To uncover the huge potential of biomedical documents we need to extract and structure the

information in them. The task at hand can be split into two Natural Language Processing

problems: Named Entity Recognition (NER) and Named Entity Linking (NEL). The number of entities, the ambiguity of words, and overlapping and nested mentions make this task significantly more difficult than traditional NLP problems. Besides the size and complexity of the entities, one of the main problems we face is the lack of training data: most biomedical documents, and especially patient records, cannot be made publicly available. Even when they can be, there are millions of biomedical concepts and manually labelling the data is not a viable option.

We are proposing an unsupervised approach to NER+NEL that learns to extract and link

entities given a dictionary of medical concepts and a database of text documents. We have

validated our solution on the MedMentions dataset [Biomedical papers annotated with

mentions from the Unified Medical Language System - UMLS]. The dataset contains 4,392

biomedical documents with 352,496 annotated UMLS concepts. We have compared our

results to SciSpaCy, a recently published state-of-the-art NLP library for biomedical documents.

Preliminary results show a soft F1 score of 0.798 (vs 0.692 for SciSpaCy) on all concepts

from UMLS, and soft F1 of 0.897 (vs 0.846 for SciSpaCy) for disease identification.
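
The exact 'soft' matching criterion is not specified here; as a rough stand-in, the sketch below scores micro-averaged F1 over predicted versus gold (document, UMLS concept) pairs, which is one relaxed way to evaluate NER+NEL output. The concept identifiers in the example are illustrative.

```python
# Rough illustration only: micro-averaged F1 over (document, concept) pairs.
# The abstract's "soft" matching criterion is not specified, so this is an
# assumed, simplified scoring scheme, not the evaluation used in the study.

def micro_f1(gold: dict, predicted: dict) -> float:
    """gold/predicted map a document id to a set of linked UMLS concept ids."""
    tp = fp = fn = 0
    for doc in gold.keys() | predicted.keys():
        g = gold.get(doc, set())
        p = predicted.get(doc, set())
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(micro_f1({"doc1": {"C0011849", "C0020538"}},
               {"doc1": {"C0011849", "C0027051"}}))  # 0.5
```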

Overall, we show that our approach can detect and link millions of different biomedical

concepts with state-of-the-art performance, while still being lightweight and fast. Follow up

validations will be performed using other clinical cases from the Clinical Record Interactive

Search system [From South London and Maudsley NHS Foundation Trust] and King's

College Hospital where we have connected our solution with CogStack - an information

retrieval and extraction service for EHRs. This tool will potentially contribute to clinical

coding, temporal modeling of patients, phenotyping and event prediction.

Page 48: Health Data Science

S36

Notes

Page 49: Health Data Science

S37

Creating knowledge graphs from literature to enable prioritisation of potential drug targets

J. Mullen, O. Giles, R. Turner, M. Hughes

SciBite Limited, BioData Innovation Centre, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1DR, United Kingdom

Here we present an integrated knowledgebase that harmonises data from the literature with

structured data from relevant open source data sets, including ChEMBL and the GWAS

Catalog. These data are aligned to SciBite life science ontologies (e.g. indication, gene,

drug). The knowledgebase brings together multiple evidence types around target

prioritisation into a single point of query. We show examples of how elaborate questions can

be asked of the knowledgebase in order to identify and prioritise potential targets relevant to

different disease areas.

Current trends in computational medicine focus on understanding disease mechanisms and

the diagnosis and treatment of human disease. Through understanding disease mechanisms

one can identify potential drug targets, a vital task in drug discovery. In the big data era,

voluminous datasets capturing data relevant to target identification are not only presented in various formats but are also stored in decentralised locations, using multiple naming

conventions to describe the same entities. These issues make it very difficult to ask domain

specific questions, where answers depend on the consolidation of evidence from many

sources. Unstructured datasources, such as scientific journal articles, are particularly

problematic when it comes to extracting structured data. These difficulties arise from the fact

that, within the life sciences, entities are known to be highly synonymous (i.e. many terms

can refer to the same entity) as well as incredibly ambiguous (the same term may refer to

different entity types depending on the context). In order to extract meaningful data from the

literature, as well as integrate this with alternative structured datasources, one needs

accurate ontologies that are reflective of the area of application. Once these ontologies have

been created a means of identifying instances of these ontology concepts within the

literature must be achieved. At the heart of SciBite's technologies is TERMite, a named

entity recognition (NER) engine. TERMite makes use of over 80 manually curated life

science ontologies to identify instances of entities, such as genes or diseases, contained

within unstructured text. By making use of SciBite's NER engine, associations between

entity types can also be identified, e.g. how many documents contain a sentence that

mentions disease X and gene Y? Once this structured data has been extracted from the literature, it can be fed into numerous downstream analyses, including enriching the resulting semantic triples with relevant structured data sources, allowing knowledgebases to be built around a particular domain of interest, such as target prioritisation.
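The sentence-level co-occurrence question posed above ("how many documents contain a sentence that mentions disease X and gene Y?") can be illustrated with a minimal Python sketch; the keyword sets below stand in for NER output, and TERMite itself is not reproduced here.

# Count documents containing at least one sentence that mentions a gene and a disease.
from collections import Counter
from itertools import product

GENES = {"BRCA1", "TP53"}
DISEASES = {"breast cancer", "glioma"}

def sentence_cooccurrence(documents):
    counts = Counter()
    for doc in documents:
        pairs_in_doc = set()
        for sentence in doc.split("."):          # naive sentence split for illustration
            lowered = sentence.lower()
            genes = {g for g in GENES if g.lower() in lowered}
            diseases = {d for d in DISEASES if d in lowered}
            pairs_in_doc.update(product(genes, diseases))
        counts.update(pairs_in_doc)               # each pair counted once per document
    return counts

docs = ["BRCA1 mutations increase breast cancer risk. TP53 is also studied.",
        "TP53 alterations are frequent in glioma."]
print(sentence_cooccurrence(docs))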

Page 50: Health Data Science

S38

Notes

Page 51: Health Data Science

S39

The Accelerating Medicines Partnership in Type 2 Diabetes Knowledge Portal

Aravind Sankar, Dylan Spalding, the Accelerating Medicines Partnership in Type 2 Diabetes.

European Bioinformatics Institute (EMBL-EBI)

The Accelerating Medicines Partnership in Type 2 Diabetes Knowledge Portal

(http://www.type2diabetesgenetics.org) is the result of a collaborative effort between

researchers from the Broad Institute, EMBL-EBI, the University of Michigan and the

Wellcome Centre for Human Genetics. It is an open-access web resource comprising DNA

sequence, genetic association results and functional genomic information from studies on

type 2 diabetes (T2D) that aims to speed up the development of new diagnostics and

treatments by identifying promising biological targets. There are 66 datasets in the T2D

Knowledge Portal (T2DKP) to date, varying from small studies to large meta-analyses.

The T2DKP is supported by a set of Knowledge Bases (KBs) which supply and perform the

analyses on the data for the T2DKP. In order to comply with the heterogeneous ethical, legal and societal issues (ELSI) associated with data from different geographical locations, datasets are hosted at

multiple federated KBs, such as the Broad Institute and EMBL-EBI, which respond to remote

queries from the T2DKP. All T2DKP tools and interfaces can query data housed at all

federated KBs, enabling joint analysis of all datasets across the network of KBs regardless

of geographical location of the datasets. Examples of federated datasets include the Oxford

BioBank and the GoDARTS datasets.

Genetic association results in the T2DKP may be explored using interactive Manhattan plots,

viewed on pages displaying epigenomic data, or comprehensive information about specific

genes and variants. The Genetic Association Interactive Tool (GAIT) allows users to

securely interact with individual level data to perform on-the-fly association tests, filtering

sample sets by multiple criteria: custom online analyses and association tests can then be

performed based on these criteria. Results include p-values and odds ratios. GAIT also

enables assessment of disease burden for genes by focussing on high impact variants.

LocusZoom visualizes associations across a region and custom association analysis; other

modules display PheWAS and Forest plots showing associations for a variant across all

phenotypes in the T2DKP or across a selection of phenotypes from UK Biobank. The

Genetic Risk Score (GRS) module allows association of 243 high-risk variants with different

phenotypes to determine how they are genetically related.
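For illustration, a weighted genetic risk score of the kind the GRS module associates with phenotypes can be sketched as a simple weighted allele count; the variant IDs, effect alleles and weights below are hypothetical.

# Minimal weighted genetic risk score for one individual.
effect_weights = {"rs0000001": 0.12, "rs0000002": 0.08, "rs0000003": 0.05}  # e.g. log-odds per allele

def genetic_risk_score(dosages, weights):
    """dosages: dict of variant -> effect-allele dosage (0, 1 or 2) for one individual."""
    return sum(weights[v] * dosages.get(v, 0.0) for v in weights)

person = {"rs0000001": 2, "rs0000002": 0, "rs0000003": 1}
print(round(genetic_risk_score(person, effect_weights), 3))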

Page 52: Health Data Science

S40

Notes

Page 53: Health Data Science

S41

Using Privacy Enhancing Technologies (PETs) to integrate clinical phenotypes, routine healthcare data and genomics in the public cloud

Neil Walker, NIHR BioResource, CUHP/EAHSN HDR UK Sprint Exemplar Innovation Programme

NIHR BioResource for Translational Research, Cambridge Biomedical Campus, Cambridge, UK; Membership of the project team is listed on the Cambridge University Health Partners, Eastern Academic Health Science Network websites http://cuhp.org.uk/ , https://www.eahsn.org/

Rare diseases, cumulatively affecting 1 in 17 of the population, can be difficult to diagnose

and often have an unidentified genetic cause. Utilising integrated data from advances in

imaging, pathology, and genomic analyses is hampered by separation of data at different

sites, with privacy risk cited as the chief barrier to data sharing. An HDR UK award brought

together academic, NHS and industry partners to build the governance, data flows and IT architecture required to enable data to be integrated and interrogated across data centres, with targeted subsets de-identified and transferred to the public cloud for analysis.

Participants from 3 rare disease cohorts recruited into the NIHR BioResource all consented

to access to their healthcare records for research.

Source data is held at 2 sites: a secure data centre (AIMES) with NHS and wider internet connections, and the University of Cambridge High Performance Computing service (HPC).

AIMES holds curated clinical details coded as SNOMED-CT or HPO terms, and is receiving

routine healthcare data requested from 5 pilot NHS Trusts, and activity data from Public

Health England and NHS Digital. The HPC has bulk (2.5PB) genomic data on the same

participants, summarised in OpenCGA. Data at AIMES is transformed through ETL

processes to FHIR, NHS Digital's preferred electronic health record exchange format. An

i2b2 data warehouse acts as cohort discovery tool.

We describe 2 use cases:

1. Academic: using machine learning on longitudinal full blood count data with genomics -

using routine healthcare data for research

2. Industry partner: bringing graph database tools and a drug discovery knowledge-base to

the data - integrating genomic and routine healthcare data

For these use cases, one widely used PET, a Trusted Execution Environment, fails for want of computational flexibility: the answer to data existing in separate silos is not always to build a bigger silo. Instead, we describe the use of a differential privacy PET (Privitar) to apply privacy policies that de-identify subsets of data, which are then assembled from the 2 sources in the public cloud (Microsoft Azure).
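As a generic illustration of the differential privacy idea (not a description of the Privitar product or of this project's policies), the Laplace mechanism for releasing a noisy count can be sketched as follows; the epsilon value and the query are hypothetical.

# Release a count with Laplace noise calibrated to sensitivity / epsilon.
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0, rng=np.random.default_rng(0)):
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

print(dp_count(true_count=412, epsilon=0.5))  # e.g. a noisy cohort size for a discovery query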

Page 54: Health Data Science

S42

Notes

Page 55: Health Data Science

S43

Creating 'operating systems' for biomedical discovery and care

Calum A. MacRae, Rahul C. Deo, Benjamin Scirica, Dana Vuzman, Stanley Y. Shaw

Brigham and Women's Hospital, Harvard Medical School, Broad Institute of Harvard and

MIT, Harvard Stem Cell Institute

Among the major constraints on the rigorous use of modern analytic approaches in

biomedicine are the low information content of many datasets, their generally modest scale,

sparse metadata and limited systematic integration of clinical, translational and basic

science. Many of these challenges align directly with current efforts to introduce digital technologies to redesign care delivery, to implement genomics in clinical practice and to transform the precision (resolution and scale) of biomedical data collection. Specific

examples of practical efforts to address each of these areas will be outlined and the potential

for integrated systems where data collection, analysis, interpretation and iterative

implementation converge around 'biological' transactions or operators will be highlighted.

Page 56: Health Data Science

S44

Notes

Page 57: Health Data Science

S45

Refining national approaches to low value care

Aoife Molloy (1,2), Marcus Dawson (2,3), Ben Crittenden (3), Johannes Wolff (2)

1 = Imperial College London, 2 = NHS England, 3 = Faculty AI

Low value care is defined as interventions that are ineffective or only effective in certain

narrow clinical circumstances. The Evidence Based Interventions (EBI) Programme is a

national partnership between clinicians, purchasers, regulators and guideline producers for

health care in England, which provides statutory guidance on 17 interventions that should

not be routinely performed or should only be performed in certain cases.

Identification of low value care is challenging, with limitations in search strategies and

database capture. In this study we sought to identify inappropriate interventions from

administrative data using variation analysis and a prediction model. Data was extracted from

the Secondary Uses Service (SUS) database comprising 14,333,224 records of elective

hospital admissions in England between April 2017 and December 2018. Intervention codes

were generated using procedure (OPCS)/diagnosis (ICD-10) pairs, so that both the clinical indication and the procedure performed were captured. The degree of geographical variation between clinical

commissioning groups (CCGs) was analysed for each of the 306,993 total unique

procedure/diagnosis codes. The interdecile range was then calculated for each treatment

across all CCGs. The complete data extract (14,333,224 records) was randomly split into

separate training, validation and test sets. A random forest classifier model was fitted to the

balanced training data and optimised on the validation set by systematically varying the

model hyperparameters. The analysis focused on records that were incorrectly classified as

inappropriate (i.e. the 1,986,083 records that were predicted to be class 1, but where the

actual label was class 0). The mean prediction probability was calculated for each

procedure/diagnosis code within this false positive group, which created a shortlist of 92,647

unique treatments with associated mean probabilities. The geographical variation analysis

and the random forest classifier results were combined. Interventions with a calculable

interdecile range and which the random forest classifier incorrectly predicted to be

inappropriate were selected. This created a new shortlist of 1,048 recommendations that can

be considered for the EBI programme.

The project outlines data-driven approaches to identify inappropriate interventions from

administrative data. Using variation analysis and machine learning to identify inappropriate

interventions for consideration for policy making, guidance production and further research

represents a novel and promising method for extracting clinically-relevant insights from the

vast resource of administrative patient-level data within the NHS.
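The two analysis components described above can be illustrated on toy data: (i) the interdecile range of per-CCG intervention rates for each procedure/diagnosis code, and (ii) a random forest whose predicted probabilities are averaged within the false-positive group per code. The column names, toy data and 0.5 probability threshold are assumptions for illustration; the real study used separate training, validation and test sets.

# Toy sketch of variation analysis plus false-positive shortlisting.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ccg": rng.choice([f"CCG{i}" for i in range(20)], 2000),
    "code": rng.choice(["A1/I10", "B2/K40", "C3/M17"], 2000),   # hypothetical OPCS/ICD-10 pairs
    "age": rng.integers(18, 90, 2000),
    "label": rng.integers(0, 2, 2000),                          # 1 = flagged inappropriate (toy)
})

# (i) geographical variation: interdecile range of per-CCG rates for each code
rates = df.pivot_table(index="code", columns="ccg", values="label", aggfunc="mean")
idr = rates.quantile(0.9, axis=1) - rates.quantile(0.1, axis=1)

# (ii) random forest; mean predicted probability among false positives, per code
X = pd.get_dummies(df[["ccg", "code"]]).assign(age=df["age"])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, df["label"])
proba = clf.predict_proba(X)[:, 1]
fp = (proba > 0.5) & (df["label"] == 0)
fp_mean = pd.Series(proba[fp]).groupby(df.loc[fp, "code"].values).mean()

print(idr.sort_values(ascending=False).head())
print(fp_mean.sort_values(ascending=False).head())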

Page 58: Health Data Science

S46

Notes

Page 59: Health Data Science

S47

Using wearable devices and genetics to estimate and validate mechanisms of sleep

Andrew R Wood1,16, Samuel E. Jones1, Vincent T. van Hees2, Diego R. Mazzotti3, Pedro

Marques-Vidal4, Séverine Sabia5,6, Ashley van der Spek7, Hassan S. Dashti8,9, Jorgen

Engmann5, Desana Kocevska7, Melvyn Hillsdon1, Annemarie I. Luik7, Najaf Amin7, Jacqueline M.

Lane8,9, Richa Saxena8,10, Martin K. Rutter11,12, Henning Tiemeier5,13, Zoltan Kutalik13,14, Meena

Kumari15, Cecilia Lindgren16, Aiden Doherty16, Timothy M. Frayling1, Michael N. Weedon1, Philip

Gehrman3

1) University of Exeter, Exeter, UK; 2) Netherlands eScience Center, Amsterdam, The

Netherlands; 3) University of Pennsylvania, Philadelphia, PA, USA; 4) Lausanne University

Hospital, Lausanne, Switzerland; 5) University College London, UK; 6) Université de Paris, Paris,

France; 7) Erasmus MC University Medical Center, Rotterdam, Netherlands; 8) Massachusetts

General Hospital, Boston, MA, USA; 9) Broad Institute of MIT and Harvard, Cambridge, MA,

USA; 10) Harvard Medical School, Boston, MA, USA; 11) University of Manchester, Manchester,

UK; 12) Manchester University NHS Foundation Trust, Oxford Road, Manchester, UK; 13)

Harvard TH Chan School of Public Health, Boston, MA, USA; 14) Swiss Institute of

Bioinformatics, Lausanne, Switzerland; 15) University of Essex, Colchester, UK; 16) Big Data

Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford,

UK.

The widespread availability of wearable devices, such as wrist worn accelerometers, offers the

potential to objectively measure sleep in millions of people, and thus better understand the role of

sleep in disease development. While polysomnography (PSG) is regarded as the gold standard

method of measuring sleep, it is impractical to perform in large cohorts. To better understand the

mechanisms of sleep, we aimed to derive accelerometer estimates of sleep duration, timing and

quality and perform subsequent genetic analyses on these phenotypes to identify novel genetic

associations.

Using accelerometer data in up to 85,670 participants from the UK Biobank study, we derived

measures of sleep duration (duration and variability), quality (sleep efficiency and number of

nocturnal sleep episodes), and timing (including nocturnal sleep and the least active 5-hours)

using an algorithm previously validated against PSG data. Genetic data available from the UK

Biobank was used for genome-wide association analyses and validation of phenotypes.

The phenotypic correlations between accelerometer estimates of sleep timing and measures of sleep duration and quality were low (-0.10 ≤ R ≤ 0.12), consistent with data from self-reported chronotype (“morningness”) and sleep duration in the UK Biobank (R = -0.01). We observed a

stronger correlation between sleep duration and sleep efficiency (R = 0.57). Heritability estimates

ranged from 2.8% (95% CI 2.0%, 3.6%) for sleep duration variability, to 22.3% (95% CI 21.5%,

23.1%) for the number of nocturnal sleep episodes. We identified 47 genetic associations at

P<5×10-8. These include 26 novel associations with measures of sleep quality and 10 with

nocturnal sleep duration. We replicate a previously reported gene locus (PAX8) associated with

sleep duration based on self-report data. We observe that variants previously associated with restless legs syndrome also associate with multiple sleep traits. As a group, sleep quality loci are enriched for

serotonin processing genes.

Using accelerometer data in the UK Biobank, we have derived eight measures of sleep

characteristics. Genetic associations between these measures and known restless legs

syndrome associations provide validation of our methods. Work will now focus on the effects of

sleep and classified daytime activity types on disease risk. Activity types will be derived from

machine learning methods using labelled accelerometer data from camera experiments (random

forests and Hidden Markov Models). Associations with adverse health outcomes will be

assessed using prospective data from the UK Biobank.
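As a loose sketch of the planned activity-classification step (a random forest trained on labelled accelerometer windows), the snippet below uses entirely synthetic window features and labels; the feature definitions and window length are invented, this is not the UK Biobank processing pipeline, and accuracy on random labels is expected to be at chance.

# Classify activity types from windowed accelerometer features with a random forest (toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_windows = 600
features = np.column_stack([
    rng.normal(size=n_windows),        # e.g. mean vector magnitude per 30 s window (assumed)
    rng.gamma(2.0, 1.0, n_windows),    # e.g. variance of acceleration (assumed)
    rng.uniform(0, 1, n_windows),      # e.g. dominant frequency power (assumed)
])
labels = rng.choice(["sleep", "sedentary", "walking"], n_windows)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, features, labels, cv=5).mean())  # ~chance on random labels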

Page 60: Health Data Science

S48

Notes

Page 61: Health Data Science

S49

Towards a self-learning healthcare system

Tianxi Cai

Harvard University, USA

The wide adoption of electronic health records (EHR) systems has led to the availability of large clinical datasets for discovery research. EHR data, linked with bio-repositories, are a valuable new source for deriving real-world, data-driven prediction models of disease risk

and progression. Yet, they also bring analytical difficulties. Precise information on clinical

outcomes is not readily available and requires labor intensive manual chart review.

Synthesizing information across healthcare systems is also challenging due to heterogeneity

and privacy. In this talk, I’ll discuss analytical approaches for mining EHR data with a focus

on scalability, reproducibility and automated knowledge extraction. These methods will be

illustrated using EHR data from Partners HealthCare and the Veterans Health Administration.

Page 62: Health Data Science

S50

Notes

Page 63: Health Data Science

P1

Poster Presentations

An empirical study of the effects of black-box anonymisation on observational research

Mehrdad A. Mizani, Aziz Sheikh

Usher Institute of Population Health Sciences and Informatics, The University of Edinburgh

Access to individual-level data is essential for generalisable observational studies. Data

holders provide individual-level data only after anonymisation to protect patient privacy. The

commonly applied anonymisation methods in the UK are variations of k-anonymity, synthetic

datasets, statistical disclosure control, and differential privacy. These methods incorporate

deleting, randomising, perturbing, or aggregating data elements at the feature, record or

cell levels. Consequently, they introduce noise and potentially reduce data utility. These

anonymisation methods are typically applied as a black-box approach leading to potential

uncertainties regarding the integrity and representativeness of anonymised data. This

uncertainty can lead to undetected bias, faulty assumptions and wrong interpretations in

observational studies.

We aim to empirically analyse the effects of anonymisation on the outcome and

interpretation of observational studies based on Machine learning (ML) and Artificial

Intelligence (AI). We have developed pilot analytic software for assessing the effects of

anonymisation techniques on data utility in a range of observational study designs (i.e.

ecological, cross-sectional, case-control, and retrospective cohort studies). We have tested

our platform on the adult dataset of the University of California, Irvine (UCI) Machine

Learning Repository and aim to extend our research to the whole cohort of the UK Biobank

for asthma outcome.

We will use algorithmically-defined asthma outcome data from UK Biobank as a baseline

gold standard to assess the effects of anonymisation at three levels, as observed in the pilot

stage: 1) feature engineering, 2) ML/AI and epidemiological models and 3) interpretation of

the outcomes. We will apply prominent anonymisation methods - such as aggregation, feature suppression, value generalisation, and clustering - to the baseline data. Data will then be split into training, validation and test sets and ML methods (e.g. logistic regression) applied. The effects on feature engineering (e.g. changes in probability distributions and correlations), ML/AI models (e.g. distance measures) and epidemiological characteristics (e.g. odds ratios) will be compared across the various anonymisation methods.

We seek to stimulate discussion on approaches currently used to maintain data anonymity

and to consider the potential impact of these on analytic considerations and the development

of novel approaches to maintain data granularity whilst at the same time preserving

anonymity.
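To make the pilot idea concrete, a toy sketch of one black-box-style anonymisation step (age generalisation plus small-cell suppression) and its effect on a simple epidemiological estimate (an odds ratio) is given below; the simulated data, the k=10 threshold and the chosen estimate are illustrative assumptions, not the platform itself.

# Generalise age to bands, suppress small equivalence classes, and compare an odds ratio.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 80, 5000),
    "smoker": rng.integers(0, 2, 5000),
})
df["outcome"] = rng.binomial(1, 0.05 + 0.05 * df["smoker"])

def odds_ratio(d):
    a = ((d.smoker == 1) & (d.outcome == 1)).sum()
    b = ((d.smoker == 1) & (d.outcome == 0)).sum()
    c = ((d.smoker == 0) & (d.outcome == 1)).sum()
    e = ((d.smoker == 0) & (d.outcome == 0)).sum()
    return (a * e) / (b * c)

anon = df.assign(age_band=(df.age // 10) * 10)                       # 10-year bands
sizes = anon.groupby(["age_band", "smoker"])["outcome"].transform("size")
anon = anon[sizes >= 10]                                             # suppress classes with k < 10

print("OR before:", round(odds_ratio(df), 2), "OR after:", round(odds_ratio(anon), 2))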

Funding: Mehrdad A. Mizani's Newton International Fellowship is funded by the Academy of

Medical Sciences and the Newton Fund.

Page 64: Health Data Science

P2

Negation Detection in NHS Brain Imaging Reports

Dominic Sykes, Andreas Grivas, Claire Grover, Richard Tobin, Cathie Sudlow, William Whiteley, Andrew McIntosh, Heather Whalley, Beatrice Alex

Centre for Clinical Brain Sciences, Usher Institute and School of Informatics, University of Edinburgh

With the advancement of natural language processing (NLP) technology, it is now possible

to extract structured information from Electronic Healthcare Record (EHR) raw text at

reasonably high accuracy. However, a challenge affecting performance is the accurate

distinction between positive and negative mentions of clinical terms. For example,

distinguishing between 'evidence of tumour present' and 'no evidence of tumour present' has

important implications. It is essential therefore that NLP methods include a negation

detection component.

Using a validated set of manually-annotated brain imaging reports from the Edinburgh

Stroke Study (ESS), we tested four negation detection methods, including two rule based

systems and two neural network (NN) methods trained on a subset of ESS:

(i) 'EdIE-R-Neg' - part of an existing pipeline developed to process radiology reports

(ii) 'pyConText' - a rule-based implementation and extension of 'ConText' developed for

medical text

(iii) 'biLSTM-Neg' - a bidirectional Long Short-Term Memory (biLSTM) architecture

(iv) 'FFNN-Neg' - a feed-forward neural network (FFNN).

Radiology reports (N=630) from the ESS data were annotated for 12 named entity types

(e.g. tumour, subdural haematoma) and 4 modifiers (e.g. location, temporal information).

The automated methods were tested on the unseen ESS test set (N=266 reports) containing

961 negated annotations.

The EdIE-R-Neg system (specifically developed for radiological reports) performs best at

98.02 (96.92-98.72) precision (P; positive predictive value), 97.61 (96.43-98.40) recall (R; sensitivity) and 97.81 F1 score (harmonic mean of P and R) for negation annotations (with 95% confidence intervals for P and R reported in brackets). This is compared to a rule-based negation detection system developed for medical text more generally, pyConText, which performs worst (P=91.52 (89.61-93.09), R=94.28 (92.62-95.57), F1=92.88). The two NN

systems (both trained on ESS data) perform almost as well as the local rule based method

(biLSTM-Neg: P=98.40 (97.38-99.02), R=96.05 (94.61-97.10), F1=97.21; FFNN-Neg: P=97.81 (96.68-98.56), R=97.61 (96.43-98.40), F1=97.71). The difference between the two NN

approaches is that biLSTM-Neg has higher precision than recall for negative annotations,

while FFNN-Neg is more balanced across metrics.

In summary, we report comparable high levels of performance for the NN methods to a rule

based system designed specifically for such records. This comparison demonstrates the

suitability of different approaches for the use of negation detection to improve NLP accuracy

in this context.
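For readers unfamiliar with rule-based negation detection, a minimal trigger-and-scope sketch in the spirit of ConText-style systems is shown below; the trigger list, window size and scoring are illustrative only and do not reproduce any of the four systems compared.

# A mention is treated as negated if a trigger phrase appears within a fixed window before it.
NEGATION_TRIGGERS = ("no ", "no evidence of ", "without ", "negative for ")

def is_negated(sentence, mention, window=60):
    pos = sentence.lower().find(mention.lower())
    if pos == -1:
        return False
    context = sentence.lower()[max(0, pos - window):pos]
    return any(trigger in context for trigger in NEGATION_TRIGGERS)

examples = [
    ("No evidence of tumour present.", "tumour", True),
    ("Evidence of tumour present.", "tumour", False),
]
predictions = [is_negated(s, m) for s, m, _ in examples]
gold = [g for _, _, g in examples]
tp = sum(p and g for p, g in zip(predictions, gold))
precision = tp / max(sum(predictions), 1)
recall = tp / max(sum(gold), 1)
print(predictions, "precision:", precision, "recall:", recall)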

Page 65: Health Data Science

P3

Semantics and graph databases: developing, applying and mapping a spectrum of

phenotype ontologies to integrate health-related data

Tim Beck1, Richard Bramley2, Tom Shorter1 and Anthony J Brookes1

1 Department of Genetics and Genome Biology, University of Leicester, Leicester 2 NIHR Leicester Biomedical Research Centre, Glenfield Hospital, Leicester

The term “phenotype” is used to define an aggregated set of medically and semantically

distinct concepts such as a trait (e.g., blood glucose level), medical signs and symptoms

(e.g., hyperglycemia), and disease (e.g., type 2 diabetes). A spectrum of publicly available

ontologies is available that define overlapping phenotype domains. Two use cases

demonstrate our approach for harmonising both non-clinical and clinical qualitative research

data: integrating phenotype content across thousands of summary-level genome-wide

association studies (GWAS); and enabling the Genetics and Vascular Health Check study

(GENVASC) primary care dataset, which consists of many thousands of codes, to be interrogated using a specialised disease-specific application ontology.

GWAS Central (https://www.gwascentral.org/) is the world’s largest collection of summary-

level GWAS data. The phenotype content is annotated using MeSH and the Human

Phenotype Ontology (HPO), where MeSH provides detailed disease descriptors and HPO

provides granular signs and symptoms coverage. We have created an automated pipeline

to map between the phenotypes coded in GWAS Central and phenotypes coded to other

ontologies in GWAS-related databases to create the comprehensive ‘PhenoMap’ resource.

The use of an underlying graph database allows simple and fast retrieval of complex

hierarchical structures that are difficult to model in relational systems.

GENVASC uses primary care data to investigate how risk prediction for coronary artery

disease (CAD) can be improved. The introduction of both the General Data Protection

Regulation, and SNOMED CT as the single structured vocabulary used within the NHS, has

presented two distinct challenges. Firstly, there is a necessity to justify the collection of all

SNOMED CT codes from the primary care record, and secondly, SNOMED CT is a verbose

clinical terminology which requires an interface with focussed specialised terms to allow it to

be fully accessible and useful to clinical researchers. We have developed an application

ontology for CAD (225 terms) ensuring alignment with external initiatives, such as the NIHR

HIC Metadata Catalogue. Using automated concept recognition techniques, we mapped our

CAD ontology to the UK Edition of SNOMED CT (>10K mapped terms) to produce a subset

of the terms in SNOMED CT which can be extracted from the primary care records for

participants.
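A minimal sketch of automated concept recognition by normalised string matching between application-ontology labels and a terminology is given below; the labels and codes are invented examples, not the CAD ontology or SNOMED CT content.

# Map application-ontology labels to terminology codes by normalised exact matching.
import re

def normalise(label):
    return re.sub(r"[^a-z0-9 ]", "", label.lower()).strip()

app_ontology = {"Coronary artery disease": "CAD:0001", "Myocardial infarction": "CAD:0002"}
terminology = {"coronary artery disease": "SCT-HYPOTHETICAL-1",
               "myocardial infarction": "SCT-HYPOTHETICAL-2",
               "angina pectoris": "SCT-HYPOTHETICAL-3"}

mappings = {
    app_id: terminology[normalise(label)]
    for label, app_id in app_ontology.items()
    if normalise(label) in terminology
}
print(mappings)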

Page 66: Health Data Science

P4

A scalable federated solution for UK health data encryption, linking and discovery

Professor Anthony Brookes, Theodoros Arvanitis(1), Tony Avery(2), Simon Ball(3), Don Campbell(4), Andy Carruthers(5), Saiful Choudhury(5), Cheryl Davenport(6), Spencer Gibson(7), Reiko Heckel(7), Umesh Kadam(7), Joe Kai(2), Danny Kirby(5), Darren Palmer(8), Phil Quinlan(2), Ajitha Ranasinghe(8), David Roberts(4), Mark Robertson(4), Des Shanahan(4), David Shepherd(7), Warren Stone(9), Phil Townes(8) and Marc Wadsley(7)

1: University of Warwick, UK 2: University of Nottingham, UK 3: University Hospital Birmingham, UK 4: Privitar 5: University Hospital Leicester, UK 6: Leicester City CCG, UK 7: University of Leicester, UK 8: Leicester Health Informatics Service, UK 9: NTT Data

As part of the HDR UK Digital Innovation Hub program we are running a 'Sprint Exemplar'

project to establish a method for responsible record-level linkage between effectively

anonymised healthcare datasets (primary care, secondary care, social care, etc) in a

federated network, along with a process for making these linked datasets safely

discoverable by researchers that might wish to access them. The project includes academic,

industry, biobanking and healthcare partners across Leicestershire and the Midlands, plus

the privacy software company Privitar. To date, we have combined our partner technologies,

and defined and begun implementing a viable architecture for the intended federated linkage

service. Source datasets have been agreed and Information Governance work undertaken to

enable access appropriate to the planned use. A generalised model for enabling linked data

discovery has been formulated, and this is being applied via previously developed 'Café

Variome' software. A UK wide 'Strategic Committee' has been formed to oversee progress

and make wider connections.

Page 67: Health Data Science

P5

Learn from the unknown: Leveraging missingness for outcome prediction in clinical time series data

Thore Buergel 1, Maria Herrero-Zazo 2, Tomas Fitzgerald 2 and Ewan Birney 2

1 Health Data Science Unit, University Clinics Heidelberg, Germany 2 The European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom.

The implementation of electronic health records (EHRs), replacing pen and paper in the clinical context, marks the beginning of medical big data. A typical EHR is composed of descriptors of very diverse origins and scales, such as physiological measurements, laboratory test results, pharmacological treatments, diagnoses and patient demographics, from hospital admission to discharge. The density of the clinical time series depends on both the severity of the disease and the ward, with the densest sampling found in intensive care units (ICUs). In summary, EHRs are high-dimensional, multi-scale, irregular multivariate time series containing information on patient descriptors and outcomes.

Potential applications of EHRs beyond data collection lie in the development of machine learning models for clinical decision support, risk modelling and patient stratification. The foundation of all these approaches is the correct representation of clinical time series data for machine learning purposes. This is a non-trivial question, as these data exhibit strong missingness. As a consequence, clinical time series vary between patients in both the descriptors recorded and their frequency of recording. In contrast to other time series data, however, this missingness is itself informative, as laboratory tests, for instance, are only requested when clinically indicated. Thus, in order to fully exploit clinical time series data, an optimal representation capturing the characteristic patterns of missingness is crucial.

In this work, we set out to investigate representations and identify the methods of choice for

machine learning from clinical time series data. Utilising data from the publicly available

MIMIC-III database, we assess combinations of time series representation techniques with

random forests, recurrent neural networks and temporal convolutional neural networks with

special emphasis on the question of leveraging the hidden information in missingness

patterns.

To be able to benchmark these diverse representations and modelling techniques, we define

a supervised learning problem and evaluate models based on their predictive performance.

More specifically, we assess the static outcome of inpatient mortality from the first 48 hours of clinical time series data after admission to the ICU. Here, we demonstrate that predictive performance increases when information on missingness is added. We report model performances and give concrete insights into our flexible and transferable benchmark environment, covering the whole pipeline from data preprocessing through encoding to prediction.
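One common representation of informative missingness, pairing each imputed value with a binary "was measured" indicator before fitting a classifier, can be sketched as follows; the synthetic data and single-feature example are assumptions, and the MIMIC-III preprocessing itself is not reproduced here.

# Missingness-indicator features for a toy mortality prediction task.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
lactate = rng.normal(2.0, 1.0, n)
measured = rng.random(n) < 0.3              # the test is only ordered for a subset of patients
observed = np.where(measured, lactate, np.nan)

# outcome partly driven by whether the test was ordered at all (informative missingness)
mortality = rng.binomial(1, 0.05 + 0.20 * measured)

X = np.column_stack([
    np.nan_to_num(observed, nan=np.nanmean(observed)),  # mean-imputed value
    measured.astype(float),                             # missingness indicator
])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, mortality)
print(dict(zip(["value", "missing_mask"], clf.feature_importances_.round(2))))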

Page 68: Health Data Science

P6

Early stages of type 2 diabetes: cohort study linking genetic liability with repeated metabolomics across early life

Caroline J. Bull 1,2,3*, Joshua A. Bell 1,2*, Marc J. Gunter 4, David Carslake 1,2, George Davey Smith 1,2, Nicholas J. Timpson 1,2, Emma E. Vincent 1,2,3

* Contributed equally 1 MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, UK 2 Bristol Population Health Science Institute, Bristol Medical School, Bristol, UK 3 School of Cellular and Molecular Medicine, University of Bristol, UK 4 Section of Nutrition and Metabolism, International Agency for Research on Cancer, Lyon, France

Type 2 diabetes develops for many years before diagnosis. We aimed to reveal its early

metabolic stages by examining genetic liability to adult type 2 diabetes in relation to detailed

metabolic traits across early life.

Data were from up to 4,765 offspring participants of the Avon Longitudinal Study of Parents

and Children (ALSPAC) birth cohort study. Linear regression models were used to examine

effects of a genetic risk score (GRS) for adult type 2 diabetes, which included 162 genetic

variants explaining 18% of variance from the most recent genome-wide association study,

on 4 repeated measures of 229 traits from targeted nuclear magnetic resonance

metabolomics. These traits included lipoprotein subclass-specific cholesterol and triglyceride

content, branched chain and aromatic amino acids, fatty acids, inflammatory glycoprotein

acetyls, and others, and were measured in childhood (age 8y), adolescence (age 15y),

young-adulthood (age 18y), and adulthood (age 25y).

Participants were 49.7% male and had progressively higher body mass index (BMI) across

time points: mean (SD) BMI in kg/m2 was 16.2 (2.0) at age 8y and 24.8 (4.9) at age 25y. The prevalence of type 2 diabetes was low across time points

(< 5 cases when first assessed at age 15y; 7 cases (0.4%) when assessed at age 25y). At

age 8y, type 2 diabetes liability (per SD-higher GRS) was associated with lower lipids in

high-density lipoprotein (HDL) particle subtypes - e.g. -0.03 SD (95% CI=-0.06, 0.00;

P=0.03) for total lipids in very-large HDL. At age 15y, associations remained strongest with

lower lipids in HDL and became stronger with pre-glycemic traits including citrate (-0.06 SD,

95% CI=-0.09, -0.02; P=0.001) and with glycoprotein acetyls (0.05 SD, 95% CI=0.01, 0.08;

P=0.01). At age 18y, associations were stronger with branched chain amino acids including

valine (0.06 SD; 95% CI=0.02, 0.09; P=0.001), while at age 25y, associations had

strengthened with very-low density lipoprotein (VLDL) lipids and remained consistent with

previously altered traits including HDL lipids.

Early metabolic stages of type 2 diabetes likely involve lower HDL lipids, followed by higher

branched chain amino acid and inflammatory glycoprotein acetyl levels. These perturbations

are apparent in childhood as early as age 8y, decades before the clinical onset of disease.

Knowledge of traits specific to type 2 diabetes development may inform the targeting of key

pathways or the early prediction of individuals on course to more debilitating stages of

disease.
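A minimal sketch of the per-trait model described above (a standardised metabolic trait regressed on a standardised type 2 diabetes GRS, with covariates) is shown below on simulated data; the variable names and effect sizes are illustrative only.

# Effect of a standardised GRS on a standardised metabolic trait, per SD-higher GRS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 4000
grs = rng.normal(size=n)                                     # already standardised
sex = rng.integers(0, 2, n)
age = rng.normal(0, 1, n)
hdl_lipids = -0.03 * grs + 0.1 * sex + rng.normal(size=n)    # simulated trait in SD units

X = sm.add_constant(np.column_stack([grs, sex, age]))
fit = sm.OLS((hdl_lipids - hdl_lipids.mean()) / hdl_lipids.std(), X).fit()
print("beta per SD GRS:", round(fit.params[1], 3), "95% CI:", fit.conf_int()[1].round(3))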

Page 69: Health Data Science

P7

Genome-wide association study of tramadol dose in Generation Scotland using Electronic Health Records

Toni-Kim Clarke, Johnathan Hafferty1, Mark Adams1, Archie Campbell2, David Porteous2, Caroline Hayward3 and Andrew McIntosh1

1Division of Psychiatry, University of Edinburgh, Edinburgh, UK 2Centre for Genomic and Experimental Medicine , Institute of Genetics & Molecular Medicine, University of Edinburgh, UK 3MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, UK

The increased use and dependence on opioids is a global concern. The recent opioid

epidemic in the United States largely has its origin in the over-prescription of opioid

analgesics to treat chronic pain. Deaths from both illicit and prescription opioids have

continued to rise steadily since 2000 and for those seeking treatment for opioid use disorder

(OUD), 75% cite prescription drugs as their first exposure to opioids. A risk factor for OUD is

the prescription of higher maximum doses of opioid analgesics for chronic pain. In order to

determine genetic factors associated with higher doses of opioid analgesics a genome-wide

association study (GWAS) of tramadol dose was performed in the Generation Scotland (GS)

cohort. Using electronic health records, members of GS were linked to the National

Prescribing Information System to obtain prescription dose information. 945 GS participants

were identified who had received 3 or more tramadol prescriptions from April 2009 -

February 2015 (69% female, mean age 53.2 [S.D=13.5]). Maximum tramadol dose was

derived for each participant by extracting the dosage, frequency and medication strength

from the prescribing information. GWAS analyses did not find any loci associated with

maximum tramadol dose at the level of genome-wide significance; however, SNPs in genes

including CREB5 and TRPV2 were amongst those most significantly associated. CREB5 has

previously been associated with opioid use disorder and TRPV2 polymorphisms are associated

with fibromyalgia. Although this is only a small sample, these findings demonstrate how

record linkage can be used to better understand the genetics of opioid dose and chronic

pain. Approaches such as these will also help to increase power in studies of OUD where it

is difficult to obtain large sample sizes.
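The dose derivation step can be illustrated with a short pandas sketch that combines strength, units per dose and dosing frequency into a daily dose and takes the per-participant maximum; the column names and records are hypothetical, not the PIS extract itself.

# Derive a per-participant maximum daily dose from prescribing fields.
import pandas as pd

rx = pd.DataFrame({
    "participant": [1, 1, 2],
    "strength_mg": [50, 50, 100],      # tramadol strength per tablet/capsule (assumed field)
    "units_per_dose": [2, 1, 1],
    "doses_per_day": [4, 3, 4],
})
rx["daily_dose_mg"] = rx.strength_mg * rx.units_per_dose * rx.doses_per_day
max_dose = rx.groupby("participant")["daily_dose_mg"].max()
print(max_dose)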

Page 70: Health Data Science

P8

particleMDI: A Julia Package for the Integrative Cluster Analysis of Multiple Datasets

Nathan Cunningham, Jim E Griffin, David L Wild, Anthony Lee

NC: University of Warwick and The Alan Turing Institute, JEG: University College London, DLW: University of Warwick, AL: University of Bristol and The Alan Turing Institute

We introduce particleMDI, a Julia package for performing integrative cluster analysis on

multiple heterogeneous data sets with applications in genomic research. Our approach uses

a particle Monte Carlo method for inference where the cluster allocations are updated using

a particle filter within a Gibbs sampling approach. particleMDI updates cluster allocations

using a particle Gibbs approach which offers better mixing of the MCMC chain, but at greater

computational cost, than the original MDI algorithm. We outline approaches to improving the

computational performance of our algorithm, finding the potential for greater than an order of

magnitude improvement in performance. We apply our algorithm to the task of discovering

risk cohorts amongst 243 patients presenting with kidney cancer, using samples from the

Cancer Genome Atlas, for which there are gene expression, copy number variation,

methylation, protein expression and microRNA data. We identify 4 distinct consensus

subtypes and show they are prognostic for survival rate (p < 0.0001). This is driven mainly

by the copy number data, but the effect is strengthened by the other four data types,

demonstrating the value of integrating multiple data types.

Page 71: Health Data Science

P9

Enabling better access to healthcare data for the UK SME community

Mark Davies, Joanne Hartley, Tim Newton, Nicola Heron and Mark Samuels

Medicines Discovery Catapult, Alderley Park, Macclesfield, Cheshire, SK10 4TG.

Opportunity:

The ability to access and use health data offers many opportunities for SMEs to improve

healthcare, such as training AI algorithms or testing mobile health Apps. These could help

answer questions ranging from the epidemiology of rare diseases to understanding patients'

natural history of disease. Technological developments addressing key issues around data

privacy, such as the emergence of federated learning to solve the problem of moving patient

data. Appropriate access to health data gives SMEs the chance to find disease interventions

at far earlier stages than has been the case, ultimately leading to better development of new

medicines. Additionally, an ability to identify trends in healthcare data can be used to

optimise public health through better targeted treatments.

Problem statement:

The UK has a national healthcare system from cradle to grave, which provides extensive

opportunities for health research. Yet healthcare data and electronic patient records are

often fragmented. SMEs, in particular, can find access procedures complex. Whilst there

exist a number of large data sets, e.g. UK Biobank, Hospital Episode Statistics (HES) and

Clinical Practice Research Datalink (CPRD), they can be inaccessible for many smaller

companies: the cost or requisite for analytical capabilities can preclude SMEs. Furthermore,

these companies would most benefit from linking these large datasets to complementary data

(e.g. mobile health or lifestyle records).

Solution:

The Medicines Discovery Catapult is a not-for-profit company, funded by Government to

support the development of new medicines by SMEs. The Catapult provides scarce scientific

capabilities and acts as a national gateway to specialist facilities, technology and expertise

within the UK. Through research and engaging with SMEs, knowledge has been built around

what SMEs need in terms of healthcare data. With extensive know-how of the UK's health

data, the Catapult is helping SMEs to support development of new medicines for patients.

On top of its series of data access guides, the Catapult is already providing SMEs with

tailored advice and recommendations about where and how to source the data they need.

We have already provided support to a number of SMEs for accessing data from a range of

UK data sources. Furthermore, the results of our survey demonstrate the current challenges

facing SMEs and the workflow we are developing to help.

Page 72: Health Data Science

P10

ELIXIR: Providing a coordinated European Infrastructure for managing Human Genomics Translational Data and Services

Jennifer Harrow, On Behalf of ELIXIR

ELIXIR

ELIXIR unites Europe's leading life science organisations in managing and safeguarding the

increasing volume of data being generated by publicly funded research. It coordinates,

integrates and sustains bioinformatics resources across its member states and enables

users in academia and industry to access services that are vital for their research. There are

currently 22 countries involved in ELIXIR, bringing together more than 200 institutes and 600

scientists.

ELIXIR's activities are coordinated across five areas called 'Platforms', which have made

significant progress over the past few years. For instance, the Data Platform has developed

a process to identify data resources that are of fundamental importance to research and

committed to long term preservation of data, known as core data resources. The Tools

Platform has services to help users search for appropriate software tools, workflows and benchmarks, as well as a BioContainers registry to enable software to be run on any operating system. The

Compute Platform has services to store, share and analyse large data sets and has

developed the Authorization and Authentication Infrastructure (AAI) single-sign on service

across ELIXIR. The Interoperability Platform develops and encourages adoption of

standards such as FAIRsharing, and the Training Platform helps scientists and developers

find the training they need via the Training e-Support System (TeSS).

The Beacon Project is an open sharing platform that allows any genomic data centre in the

world to make its data discoverable. The project is a first-of-its kind effort to make the

massive amounts of life sciences data being collected in healthcare and research settings

around the globe accessible and is being supported and funded by ELIXIR. To date, 70

beacons have been "lit," including seven in the UK and another nine across Europe, allowing

users unprecedented discovery of genomic variants in national and international cohorts.

The Authentication and Authorisation Infrastructure (AAI) provides a centralised user identity

and access management service (ELIXIR AAI). ELIXIR AAI will be used to access the

European Genome-Phenome Archive (EGA) resources and ELIXIR is working with the

GA4GH to have ELIXIR AAI approved as a standard. The focus now for ELIXIR Human

Genomics and Translational Data is to establish a federated suite of EGA services across

Europe, coordinating the national roadmaps and large EU projects to enable population

scale genomic, phenotypic, and biomolecular data to be accessible across international

borders.

Page 73: Health Data Science

P11

Generation Scotland: Scottish Family Health Study: Linkage to Electronic Health Records

Caroline Hayward, MRCHGU QTL in Health and Disease group.

MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, EH4 2XU, Scotland. Generation Scotland, Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, EH4 2XU, Scotland

Generation Scotland: Scottish Family Health Study (GS:SFHS) is a biobank designed to

study the genetics of areas of importance to public health. Over 24,000 adults were recruited

to the study from 2006 to 2011, with broad and enduring written informed consent for

biomedical research. Consent was obtained from most participants for GS:SFHS study data

to be linked to their Scottish National Health Service records, using their Community Health

Index number. This identifying number is used for NHS Scotland procedures and allows

healthcare records for individuals to be linked across time and location.

NHS Electronic Health Record (EHR) data are available for GS:SFHS participants with consent and mechanisms for record linkage; together with genome-wide genetic information, these data allow associations with quantitative trait phenotypes and with the occurrence of prevalent and incident diseases to be investigated. The existing baseline study phenotypes and new "omics"

and biomarker data obtained from analysis of stored biological samples (serum/urine)

present opportunities to gain insights into a range of complex conditions. This demonstrates

that the EHR is a rich resource of real-world data that can be used in research to characterise

the health trajectory of participants.

Page 74: Health Data Science

P12

Exploring chronic disease and multimorbidity through linked healthcare records in a local cohort

Catherine John, Robert C Free, Nicola Reeve, David J Shepherd, Louise V Wain, Martin D Tobin

University of Leicester (Department of Health Sciences, NIHR Leicester Biomedical Research Centre and Department of Respiratory Sciences), UK

Multimorbidity is the presence of multiple diseases or conditions in one patient. It is now the

norm in some age groups in high-income countries. However, many research studies are

designed to consider individual conditions in relative isolation. Studies with linked electronic

health records (EHR) can incorporate a broad range of diseases and risk factors to study

multimorbidity, as well as longitudinal follow-up of disease progression, treatment outcomes

and ageing.

EXCEED (Extended Cohort for E-health, Environment and DNA) is a population-based study

of over 10,000 adults based in Leicester. Recruitment is ongoing, with particular emphasis

on increasing representation from Black, Asian and minority ethnic (BAME) communities.

Data includes baseline anthropometry, spirometry, lifestyle factors, longitudinal health

information from EHR (primary care and Hospital Episode Statistics) and genome-wide

genotyping. Participants also consent to be contacted for recall-by-genotype and recall-by-

phenotype studies, facilitating precision medicine research and validation of EHR.

Mean participant age is 58.4 years (sd 8.5). Participants are slightly healthier than average

for common risk factors, in line with similar cohorts. For example, the total proportion who

are overweight/obese (64.2%) is slightly lower than similar age groups in Health Survey for

England. Nevertheless, 60% of participants with primary care EHR linkage have at least one

chronic condition (any qualifying diagnostic code for ≥1 of 16 conditions prioritised for

management in primary care by the Quality and Outcomes Framework), and over a quarter

have evidence of multimorbidity (≥2 of 16 conditions). Over 90% have ≥4 recordings of blood

pressure over time, and many have multiple results from routine blood tests such as serum

cholesterol (62.6% with ≥3 results) or creatinine (72.4% with ≥3 results), enabling the study

of progression and outcomes of common chronic diseases over time.

Linked primary care EHRs contain a wealth of data on chronic disease. Whilst there are

limitations - including misclassification, miscoding and the "clinical iceberg" effect - repeat

recording and multiple types of data can be used to improve classification. Many disease

definitions have been validated already - for example, using CPRD - and EXCEED allows for

further validation by recontacting participants. In addition, combining records such as

prescriptions, diagnoses and symptom codes can define complex phenotypes that may not

otherwise be easily studied. EXCEED is an important resource for the study of chronic

disease and multimorbidity in the era of genomics and precision medicine.
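As an illustration of the multimorbidity definition used above (at least one, or at least two, of a prioritised set of conditions), the sketch below counts distinct mapped conditions per participant; the code-to-condition map and records are toy examples, not the QOF business rules.

# Count prioritised chronic conditions per participant and derive multimorbidity flags.
import pandas as pd

code_to_condition = {"H33": "asthma", "C10": "diabetes", "G30": "hypertension"}  # illustrative codes

records = pd.DataFrame({
    "participant": [1, 1, 1, 2, 3],
    "code": ["H33", "H33", "C10", "G30", "XYZ"],
})
records["condition"] = records["code"].map(code_to_condition)
n_conditions = records.dropna(subset=["condition"]).groupby("participant")["condition"].nunique()

summary = pd.DataFrame({"n_conditions": n_conditions}).reindex([1, 2, 3], fill_value=0)
summary["any_chronic"] = summary.n_conditions >= 1
summary["multimorbid"] = summary.n_conditions >= 2
print(summary)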

Page 75: Health Data Science

P13

MegaCox: Large-Scale implementation for Cox's Proportional Hazards Model

Alexander W. Jung, Moritz Gerstung

EMBL-EBI

The availability of population-scale health data sets ranging from medical records, wearable

devices, to various 'omics data allows for data analysis on unprecedented scale and depth.

However, the vast amount of data requires the development of new statistical methods. Of

particular interest in health data science is the prediction of time-to-event outcomes such as death or

disease diagnosis.

Due to heavily skewed distributions and censoring, standard statistical methods are no

longer applicable and more advanced models in the context of survival analysis are needed.

Cox's Proportional Hazards Model (CPH) is extensively used in research, thanks to sound

statistical theory, flexibility, and meaningful parameter estimates.

Nevertheless, current implementations of the CPH require the data to fit into memory, limiting their scope for big data applications. Therefore, we developed an implementation of the CPH that scales to massive amounts of data. We provide a simulation study to show the consistency of our method and its scalability to big data. This forms the methodological basis for analysing population-wide hospital admission records of 8+ million people to infer predictive disease trajectories.
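The idea behind an out-of-memory Cox fit can be illustrated as follows: with records streamed in descending time order, the risk-set sums needed for the partial-likelihood score can be accumulated chunk by chunk, so the full design matrix never has to sit in memory. The sketch evaluates one gradient at a fixed beta, ignores tied event times and is not the MegaCox implementation.

# Streaming evaluation of the Cox partial-likelihood gradient (no ties handling).
import numpy as np

def cox_gradient_streaming(chunks, beta):
    """chunks: iterable of (times, events, X) with rows sorted by descending time overall."""
    s0, s1 = 0.0, np.zeros_like(beta)
    grad = np.zeros_like(beta)
    for times, events, X in chunks:
        risk = np.exp(X @ beta)
        for t, d, x, r in zip(times, events, X, risk):
            s0 += r                 # running sum of exp(x'beta) over the current risk set
            s1 += r * x             # running sum of x * exp(x'beta)
            if d:                   # event: score contribution x - E[x | risk set]
                grad += x - s1 / s0
    return grad

rng = np.random.default_rng(0)
n, p = 10_000, 3
X = rng.normal(size=(n, p))
times = rng.exponential(1.0, n)
events = rng.random(n) < 0.7
order = np.argsort(-times)                      # descending time
X, times, events = X[order], times[order], events[order]
chunks = ((times[i:i+1000], events[i:i+1000], X[i:i+1000]) for i in range(0, n, 1000))
print(cox_gradient_streaming(chunks, beta=np.zeros(p)))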

Page 76: Health Data Science

P14

Adherence to cardiovascular medication across Scotland

Kirstin Leslie, Professor Jill Pell, Professor Colin McCowan, Professor Andrew Morris, Professor Dave Robertson.

University of Glasgow MRC Doctoral Training Partnership in Precision Medicine

Nonadherence to prescription drugs is a longstanding and growing problem in disease

management. In recent years, a rise in chronic diseases, longer life-expectancies, increased

multi-morbidity, and use of medicine in disease prevention as well as treatment, has led to

an increase in the number of people taking drugs for extended periods of time, exacerbating

the problem of nonadherence. Failure to adhere to medication can contribute to reduced

clinical effectiveness, reduced cost-effectiveness, poorer overall health, and health

inequalities.

Despite the availability of efficacious drugs, cardiovascular disease (CVD) remains a leading

cause of global mortality, and prevalence of CVD is higher in Scotland than in other

developed countries; consequently, research into adherence to cardiovascular drugs could

be particularly useful in this setting. Furthermore, Scotland has valuable nation-wide

administrative databases which can be used for the study of adherence at a population level.

With these datasets, it is possible to define different levels of disease severity, identify

polypharmacy and comorbidities, and compare adherence across a range of drug classes.

This could prove important in targeting future interventions to improve adherence and for

informing prescribing practices.

Using the Prescribing Information System (PIS) linked to hospital admissions data (SMR01,

SMR04) and death certificates (NRS) we have defined four key patient subgroups: primary

prevention (n= 1,656,566), treatment for symptomatic CVD (n = 260,516), secondary

prevention (n=25,283), and secondary prevention with treatment (n=23,866). These groups

differ across key baseline characteristics such as age, gender, and commonly prescribed

CVD drugs.

This is a novel approach to studying adherence to cardiovascular medication: a longitudinal, Scotland-wide, retrospective study of adherence to cardiovascular drugs,

covering all ages and sociodemographic classes, giving greater insight into modifiable and

unmodifiable risk factors, and allowing identification of patient groups requiring extra support.

Page 77: Health Data Science

P15

Use of Hospital Episode Statistics in the UK Biobank to aid independent replication of genetic associations

Nadezda Lipunova1, Anke Wesselius2, Kar K. Cheng3, Frederik-Jan van Schooten4, Richard T. Bryan1, Jean-Baptiste Cazier1,5, Maurice P. Zeegers1,2

1Institute of Cancer and Genomic Sciences, University of Birmingham, United Kingdom, B15 2TT; 2Department of Complex Genetics, Maastricht University, The Netherlands, 3Institute for Applied Health Research, University of Birmingham, United Kingdom, 4Department of Pharmacology and Toxicology, Maastricht University, The Netherlands, 5Centre for Computational Biology, University of Birmingham, United Kingdom

Urinary bladder cancer (UBC) is a disease with great burden on healthcare systems and

patients. It is recognised that research into the management of existing cases would provide the most benefit; however, practical advances are scarce. This is partly due to the lack of adequately sized samples with the required phenotype data (i.e. recurrence, progression), which are not collected routinely and rely on specific cohorts with long follow-up. We aim to aid current research by mining Hospital Episode Statistics (HES) to construct

meaningful UBC outcomes.

We have used HES on procedures carried out in a hospital setting, all coded under

Classification of Interventions and Procedures (version 4, OPCS4). Our developed algorithm

aims to identify specific patterns of administered procedures and their timing that would

represent UBC recurrence and/or progression. To validate their usefulness in a research

setting, we are currently carrying out an independent replication analysis of previously

reported genetic associations with UBC recurrence, progression, and survival (both overall-

and cancer-specific).

We encourage the use of HES data to aid both exploratory and confirmative research on

bladder cancer. Our proposed algorithm is currently specific to the OPCS4, used throughout

the United Kingdom. However, we believe the methodology is transferable and can be used

with other classifications of procedural codes.

Page 78: Health Data Science

P16

Analysis of Pulmonary Surfactant Pathway Genes and Association with Lung Cancer Risk

Jennifer Luyapan, Younghun Han, Todd A. MacKenzie, Christopher I. Amos

Quantitative Biomedical Science, Geisel School of Medicine, Dartmouth, Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth, Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine

Surfactant genes play a key role in maintaining the integrity of pulmonary airways by

encoding for proteins involved in the production, function, and catabolism of pulmonary

surfactant. Mutations in genes responsible for surfactant homeostasis have been associated

with fibrotic lung disease and lung cancer development. Previous studies identified genetic

variations in SFTPA, a gene responsible for surfactant metabolism and function, and its

association with lung cancer. Surfactant metabolism is regulated by many genes, and

comprehensive evaluations of genetic associations with lung cancer risk are lacking.

However, little attention has been paid to other surfactant proteins and their association with

lung cancer risk. Our comprehensive assessment of genes associated with surfactant

metabolism may provide new evidence to biological mechanisms by which surfactant genes

play a role in lung cancer etiology. We propose to perform a genetic association analysis of

single nucleotide polymorphisms (SNPs) genotyped within 100 KB of all surfactant genes

and associated proteins involved in their production and maturation. Genome-wide genotype

data for 29,266 cases and 56,450 controls were collected by the OncoArray Consortium to

study lung cancer risk. The OncoArray and Affymetrix chip arrays were configured to include

all known functional SNPs around surfactant genes to analyze their association with lung

cancer. We identified 3 susceptibility loci in the cathepsin H (CTSH) gene associated with overall lung cancer. We also identified 1 missense variant in surfactant protein B (SFTPB), 3

susceptibility loci in CTSH, and 34 susceptibility loci in telomerase reverse transcriptase

(TERT) genes that are associated with the lung histological subtype adenocarcinoma. In

addition, we identified 2 susceptibility loci in CTSH and 1 locus in adenosine deaminase 2

(ADA2) genes that are associated with small cell lung cancer. Here, all identified susceptibility loci achieved genome-wide significance. Given the importance of functional annotation of

discovered variants, all analyses will implement an approach to infer surfactant gene

expression levels by integrating data from expression quantitative trait loci (eQTL)

experiments from normal lung and whole blood tissues.

Page 79: Health Data Science

P17

Assessing for interaction between APOE e4, sex and lifestyle on cognitive abilities.

Dr Donald M. Lyall, Carlos Celis-Morales, Laura M. Lyall, Christopher Graham, Nicholas Graham, Daniel F. Mackay, Rona J. Strawbridge, Joey Ward, Jason M. Gill, Naveed Sattar, Jonathan Cavanagh, Daniel J. Smith, Jill P. Pell.

Institute of Health & Wellbeing, University of Glasgow, Scotland, UK. Institute of Cardiovascular and Medical Sciences, University of Glasgow, Scotland UK. Department of Medicine Solna, Karolinska Institute, Stockholm, Sweden.

Background: There are relatively well-known lifestyle risk factors for worse cognitive abilities, such as cigarette smoking. There is some evidence that people with the apolipoprotein E (APOE) e4 genotype may be more vulnerable to such risk factors, i.e. that associations are significantly larger in magnitude for e4 carriers vs. non-carriers. Our objective was to test for interactions between APOE e4 genotype and lifestyle factors on worse cognitive test scores in middle-aged to older participants from the general population.

Methods: Using UK Biobank cohort data, we tested for interactions between APOE e4 allele presence, lifestyle factors (high vs. low alcohol intake, smoking history, total physical activity and obesity) and male vs. female sex, on cognitive tests of reasoning, information processing speed and executive function (n range = 70,988-324,725 depending on the test). We statistically adjusted for the potential confounders of age, sex, deprivation, cardiometabolic conditions and educational attainment.

Results: There were significant associations between APOE e4 and worse cognitive abilities, independent of potential confounders, and between lifestyle risk factors and worse cognitive abilities; however, contrary to our hypotheses, there were no interactions at a multiple-correction-adjusted P<0.05.

Conclusions: Our results do not support the idea that the e4 genotype increases vulnerability to the negative effects of lifestyle risk factors on cognitive ability, but rather support a primarily direct association between APOE e4 genotype and worse cognitive ability.
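As a minimal sketch of the interaction test described above (not the authors' UK Biobank analysis), the Python example below fits a linear model with an APOE e4 x smoking product term while adjusting for the listed confounders; the data are synthetic placeholders.

# Illustrative interaction test: synthetic data stand in for the UK Biobank variables.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "apoe_e4":     rng.integers(0, 2, n),        # e4 carrier yes/no
    "smoker":      rng.integers(0, 2, n),        # ever-smoker yes/no
    "age":         rng.normal(57, 8, n),
    "sex":         rng.integers(0, 2, n),
    "deprivation": rng.normal(0, 1, n),
    "education":   rng.integers(0, 2, n),
})
df["reasoning_score"] = (13 - 0.3 * df.apoe_e4 - 0.5 * df.smoker
                         - 0.05 * df.age + rng.normal(0, 2, n))

# 'apoe_e4 * smoker' expands to both main effects plus their product term;
# the product term's p-value is the interaction test of interest.
model = smf.ols(
    "reasoning_score ~ apoe_e4 * smoker + age + sex + deprivation + education",
    data=df,
).fit()
print(model.summary().tables[1])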

Page 80: Health Data Science

P18

A digital health approach for assessing glucose continuously in epidemiology cohort studies, including novel software for analysing continuously measured glucose data

Louise A C Millard [1,2,3], Nashita Patel [5], Kate Tilling [1,3], Melanie Lewcock [3], Peter A Flach [2], Debbie A Lawlor [1,3,4]

1 MRC Integrative Epidemiology Unit (IEU), Bristol Medical School (Population Health Sciences), University of Bristol, Bristol, UK; 2 Intelligent Systems Laboratory, Department of Computer Science, University of Bristol, Bristol, UK; 3 Population Health Sciences, Bristol Medical School, University of Bristol; 4 Bristol NIHR Biomedical Research Centre; 5 Department of Women and Children's Health, School of Life Course Sciences, King's College London, UK

While epidemiology studies typically measure blood glucose levels at a single time point or at widely spaced time points (e.g. every few years), there is increasing appreciation that glucose levels and variability in free-living conditions, during both the day and night, may provide important health measures in clinical (e.g. diabetic or obese) and 'healthy' populations. Continuous glucose monitors (CGM) record interstitial glucose 'continuously', producing a sequence of measurements for each participant (e.g. glucose levels every 5 minutes over several days, both day and night). To analyse CGM data, researchers tend to derive summary variables such as the area under the curve (AUC), to then use in subsequent analyses. However, to date, a lack of consistency and transparency in the precise definitions used for these summary variables has hindered interpretation, replication and comparison of results across studies.

We have developed GLU, an open-source software package for deriving a consistent set of summary variables from CGM data. GLU derives a diverse set of summary variables covering six broad domains: overall average levels, overall variability, glycaemic excursions, fasting levels, variability from one moment to the next, and post-event (e.g. after meals) levels. For example, overall variability is captured by the AUC, while post-meal levels are captured by the AUC 1 and 2 hours after eating and by the time to peak glucose. GLU also provides two day-oriented approaches to dealing with missing data: the 'complete days' approach includes only days with no missing values, while the 'approximal imputation' approach imputes missing time periods using nearby non-missing values. GLU is implemented in R and is available on GitHub at [https://github.com/MRCIEU/GLU].

We have used pilot data from the Avon Longitudinal Study of Parents and Children Generation 2 (ALSPAC-G2) cohort to illustrate GLU's utility. Pregnant women are invited to wear a CGM device on their buttock, abdomen or arm for 6 days. We present the results of preliminary analyses in 43 women, assessing the association of BMI with each GLU summary variable. We found that a higher gestational BMI was associated with several aspects of these CGM data: a higher overall mean glucose during both the day- and night-time, more time spent in hyperglycaemia during the night-time, a shorter post-prandial time to peak, and higher variability during the night-time. Results using the complete days and approximal imputation approaches to dealing with missing data were broadly consistent.
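GLU itself is an R package; purely to illustrate two of the summary variables named above (AUC and post-meal time to peak), the sketch below computes them in Python for a toy 5-minute CGM series. The function names and example data are assumptions, not GLU's definitions.

# Illustrative CGM summaries: trapezoidal AUC and time from a meal to peak glucose.
import numpy as np
import pandas as pd

def glucose_auc(series: pd.Series) -> float:
    """Trapezoidal area under the glucose curve, in (glucose units) x hours."""
    hours = (series.index - series.index[0]).total_seconds().to_numpy() / 3600.0
    y = series.to_numpy()
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(hours)))

def time_to_peak(series: pd.Series, meal_time: pd.Timestamp,
                 window: str = "2h") -> pd.Timedelta:
    """Time from a meal to the maximum glucose value within the given window."""
    post = series.loc[meal_time: meal_time + pd.Timedelta(window)]
    return post.idxmax() - meal_time

# Example with 5-minute readings over one day (random placeholder values):
idx = pd.date_range("2019-06-11 00:00", periods=288, freq="5min")
cgm = pd.Series(5.5 + np.random.default_rng(1).normal(0, 0.4, 288), index=idx)
print(glucose_auc(cgm), time_to_peak(cgm, pd.Timestamp("2019-06-11 08:00")))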

Page 81: Health Data Science

P19

Population Health meets Systems Biomedicine (POPMOL)

Jonathan G. L. Mullins, Ronan A. Lyons

Swansea University Medical School, Singleton Park, Swansea SA2 8PP, Wales, UK

Health data science repositories are being joined with systems biomedicine platforms. The Human3DProteome platform (http://human.3dproteome.swan.ac.uk) is the first fully integrated in silico platform covering the entire human proteome (all 23,000 receptors, enzymes and other proteins encoded by the human genome). It provides a unique molecular-level structural database, including the characterisation of thousands of active sites and millions of protein-compound interactions, originally developed for drug discovery and virtual toxicity screening but now also being developed as a powerful and comprehensive mechanistic framework for advancing health data research. The Human3DProteome project includes 3D molecular structures for 830,000 polymorphic variant proteins (as listed in the UniProt database) - the largest structural library of human disease phenotypes - with automated linkage to protein-protein interactions, metabolic/signalling pathways and potential therapeutic areas. Machine learning approaches are applied to widen the discovery of interactions in biological and chemical space, and of downstream metabolic and phenotypic effects.

The POPMOL project addresses:

- Technical challenges of linking population and patient genomic data with the 3DProteome architecture - bridging between sequence information stored in health data repositories (whole genome, partial genome data, specific genes) and systems biomedicine visualisation and analysis. This pipeline facilitates the key linkage of genome information to automated computation of the structural and functional impacts of missense, nonsense, insertion, deletion, duplication, frameshift and repeat expansion mutations at molecular, pathway and disease network levels.

- System usability and application to specific areas of research interest. The 3DProteome web interfaces were designed primarily for use by molecular and discovery scientists in industry and academia. The usability challenge is to determine how these might best be adapted to the needs of health data researchers and integrated with the architectures currently used in epidemiology studies.

This foundation will widen the scope and accessibility of mechanistic molecular computational research (sequence-structure-function-phenotype relationships): to facilitate a community approach to profiling and clustering specific variants to phenotypes; to classify hitherto uncharacterised genetic variants; and to stimulate the development of new solutions in the broad area of population genomics, furthering interaction with partners involved in research incorporating wider genomic, metabolomic and proteomic data.

An early focus of the genome-to-molecular-function framework is on developing the capability to identify patterns and clusters of genetic differences associated with specific disease manifestations and severity, working towards indices of risk and informing new precision medicine strategies.

Page 82: Health Data Science

P20

Trends in two- and five-year survival after glioblastoma diagnosis: a systematic review and meta-analysis of population-based studies

Michael TC Poon, Cathie LM Sudlow, Paul M Brennan, Jonine D Figueroa

Usher Institute for Population Health Sciences and Informatics, The University of Edinburgh Centre for Clinical Brain Sciences, The University of Edinburgh

Background - Glioblastoma is the most common malignant primary brain tumour, with a median survival of ~14 months. The major advance in treatment came from a clinical trial published in 2005 demonstrating that radiotherapy with concomitant and adjuvant temozolomide improved survival. We aimed to review and summarise 2- and 5-year survival in patients with glioblastoma before and after this landmark trial.

Methods - We searched MEDLINE and Embase for population-based observational studies of adults with glioblastoma reporting overall survival at ≥2 years, published between January 2003 and 15 March 2019. Studies with fewer than 50 patients were excluded. Study quality was assessed using a risk of bias tool based on ROBINS-E. We stratified studies according to whether the recruitment period ended before, or began in or after, 2005. Overall survival at 2 and 5 years from non-overlapping patient populations was meta-analysed using a random effects model. The I2 statistic quantified heterogeneity.

Results - The search identified 492 studies, of which 57 met our inclusion/exclusion criteria. The eligible studies represented 18 countries, and most studies described the United States population (n=28, 48%). The majority of studies reported 2-year survival (n=49, 86%). The pooled estimate of 2-year survival was 10% (95% confidence interval [CI] 6-14%; n/N=681/5926; 7 studies; I2=96%) for patients recruited before 2005, and 18% (95% CI 14-21%; n/N=5127/30678; 12 studies; I2=98%) for patients recruited in or after 2005. The pooled estimate of 5-year survival was 3% (95% CI 1-4%; n/N=418/15685; 6 studies; I2=96%) for patients recruited before 2005, and 3% (95% CI 2-4%; n/N=471/15083; 4 studies; I2=93%) for patients recruited in or after 2005. Two-year survival reported within the same populations showed improvement in the later recruitment period, which was not observed for 5-year survival.

Interpretation - There may be an improvement in 2-year but not 5-year survival. Few populations allowed internal comparison, and multiple unmeasured confounders contributed to the high heterogeneity, limiting the interpretation of these observations in relation to the adoption of multimodal therapy. Uncertainties about survival trends and the effect of modern treatment are best addressed using a large-scale population-based study with detailed clinical information.
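As an illustration of the random-effects pooling and I2 heterogeneity statistic used above, the sketch below implements DerSimonian-Laird pooling of survival proportions on the logit scale; the study counts in the example call are placeholders, not the review's data.

# Illustrative DerSimonian-Laird random-effects pooling of proportions with I^2.
import numpy as np

def pooled_proportion(events, totals):
    events, totals = np.asarray(events, float), np.asarray(totals, float)
    p = events / totals
    theta = np.log(p / (1 - p))                    # logit transform
    var = 1 / events + 1 / (totals - events)       # approximate variance of logit(p)
    w = 1 / var                                    # fixed-effect weights
    theta_fe = np.sum(w * theta) / np.sum(w)
    q = np.sum(w * (theta - theta_fe) ** 2)        # Cochran's Q
    dfree = len(theta) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - dfree) / c)               # between-study variance
    w_re = 1 / (var + tau2)                        # random-effects weights
    theta_re = np.sum(w_re * theta) / np.sum(w_re)
    i2 = max(0.0, (q - dfree) / q) * 100 if q > 0 else 0.0
    return 1 / (1 + np.exp(-theta_re)), i2         # back-transform to a proportion

print(pooled_proportion([50, 80, 120], [400, 500, 900]))   # placeholder studies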

Page 83: Health Data Science

P21

Using changes in protein structure to guide clinical decisions: a case for Rifampicin Resistance

Stephanie Portelli 1, Nicholas Furnham 2 , Douglas E.V. Pires 1 and David B. Ascher 1

1 Department of Biochemistry and Molecular Biology, Bio21 Institute, University of Melbourne, Victoria, Australia 2 Department of Pathogen Molecular Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, United Kingdom.

Since its discovery over 50 years ago, rifampicin has been used extensively in the clinic to treat the mycobacterial infections tuberculosis and leprosy. The rapid spread of rifampicin resistance, mainly through missense mutations within the drug target gene rpoB, presents a major challenge in treating both diseases. Although genetic sequencing techniques are able to identify known resistance-causing mutations, they fall short of explaining the underlying mechanisms. We have explored the use of structural information to bridge this gap in knowledge, and to guide the identification and characterisation of novel mutations within rpoB.

M. tuberculosis missense mutations were initially obtained from the literature and divided into resistant (n=203) and susceptible (n=28) phenotypes. Next, the mCSM suite of computational tools was used to analyse the structural and molecular changes brought about by these mutations. Our analysis included measurements of changes in protein stability, dynamics, interaction binding affinities and physicochemical properties. Resistance mutations with large detrimental effects were seen at a lower frequency within a genome-wide association study and are thought to reduce bacterial fitness. These measurements were used to train a binary classifier to identify likely resistance mutations using scikit-learn.

Our best classifier was trained using the KNN algorithm, which, following testing on an independent, non-redundant blind test set (n=88), performed with a precision of 88.2% and an accuracy of 89.7%. Analysis of this model highlighted that changes in interactions within the RNA polymerase complex, including those with the nucleic acid and with rifampicin, were a significant driver of resistance. When tested on clinically resistant M. leprae mutations (n=40) from the American Leprosy Mission, all mutations were predicted correctly, showing evidence of clinical translatability for both mycobacterial infections.

This work illustrates the power of using structural information to interpret genomic variants. Our structure-based tool is able to analyse missense mutations located throughout the RpoB structure in both mycobacteria, and is the first genetics-based tool for drug resistance in leprosy. In the clinic, our tool could support rifampicin stewardship, ultimately reducing further resistance development.
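The classification step described above can be sketched as follows with scikit-learn; the feature matrix here is a synthetic placeholder rather than real mCSM outputs, so this shows the workflow only, not the reported model.

# Illustrative KNN classifier on structure-derived features (synthetic placeholders).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score

rng = np.random.default_rng(0)
X = rng.normal(size=(231, 4))        # e.g. stability, affinity, flexibility, charge changes
y = rng.integers(0, 2, 231)          # 1 = resistant, 0 = susceptible (placeholder labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy", accuracy_score(y_test, pred), "precision", precision_score(y_test, pred))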

Page 84: Health Data Science

P22

Generation Scotland: refining disease and improving prediction through electronic health record linkage and epigenetics.

David Porteous, on behalf of Generation Scotland

Centre for Genomic and Molecular Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK.

NHS Scotland has some of the best and longest-established electronic health records (EHR) available for research. Here we show how EHR linkage transforms Generation Scotland: Scottish Family Health Study (GS:SFHS), a family-based genetic epidemiology study of ~24,000 volunteers from ~7,000 families, from a cross-sectional study to a longitudinal study of disease prevalence and incidence. In Scotland, EHR linkage is made possible through the Community Health Index (CHI), a unique identifier that tracks all NHS Scotland patients through primary, secondary and tertiary care, from birth to death. We have now established EHR linkage to all secondary care, dental, biochemical and dispensing NHS Scotland datasets, overcoming technical and governance issues in the process. Linkage to primary care data is in progress, and hospital-based image data will follow. Participants answered extensive demographic, health and lifestyle questions, provided biological samples and underwent clinical assessment at the time of recruitment. In addition to these baseline measures of physical health, participants undertook tests of psychological, cognitive and mental health. Biological samples of blood, serum, plasma and urine combine to form a resource with broad consent for health-related research, including genetics. Genome-wide SNP data are available on 20,000 participants and blood-based genome-wide methylation data on 10,000.

To exemplify the broad utility of Generation Scotland, here we show how:

1. Pedigree-based heritability improves significantly on SNP-based heritability for multiple metabolic and anthropometric traits.

2. Family structure also facilitates parent-of-origin studies that can identify couple-based and epigenetic effects not detectable in UK Biobank.

3. Quantitative trait measures of personality and cognition can be used to refine the diagnosis and categorisation of depression, while prescription switching can identify non-responders.

4. Genome-wide methylation explains moderate to large proportions of the variance in lifestyle factors and complex traits, including smoking, alcohol consumption, cardiometabolic outcomes and lung function. Typically, the methylation-based predictors outperform GWAS-based polygenic scores.

Researchers, both academic and commercial, can now use the linked datasets through a managed access process (www.generationscotland.org) to undertake discovery, replication and meta-analyses; identify prevalent and incident disease cases and matched healthy controls; test potential biomarkers for prediction using stored biological samples; or invite selected participants to volunteer for follow-on studies, including repeat sample collection.

Page 85: Health Data Science

P23

Mental Health Pathfinder: Creating a mental health research resource in Scotland

Carys Pugh, Mark Adams, Toni-Kim Clarke, Matthew Iveson, Heather Whalley & Andrew McIntosh

Division of Psychiatry, Centre for Clinical Brain Sciences, University of Edinburgh

We will invite over 250,000 people to complete a mental health questionnaire and, with their permission, link their responses to their electronic health records to create a Scottish resource for mental health research.

In Scotland, people who use the NHS have a unique Community Health Index (CHI) number that links their health records. SHARE (https://www.registerforshare.org) is a register of over 250,000 people in Scotland who have expressed an interest in taking part in health research. Registering for SHARE takes less than one minute, and the tendency to recruit healthy volunteers is offset by recruiting people within health care settings. Participants have their records linked via their CHI and are also able to opt in to the collection of spare bloods from routine health testing. So far, over 30,000 people have been genotyped using these bloods, and work is ongoing to increase capacity.

We intend to leverage the unique linkage possibilities of the Scottish NHS and the SHARE register to build a resource for mental health research. All SHARE participants will be invited to complete an online mental health questionnaire that has been adapted from the questionnaire used by UK Biobank and the GLAD study (https://gladstudy.org.uk). The questionnaire includes elements of the short-form CIDI to assess lifetime and current depression and anxiety, and we also ask about lifestyle, subjective wellbeing, experiences of trauma and the use of alcohol and illegal substances. The questions are comparable to previous questionnaires, so it will be possible to substantially increase the sample sizes associated with prior analyses. By linking responses to routinely collected health data and to genotype data, we will create a resource for genomic investigations of mental illnesses and epidemiological studies of comorbidities. We hope to develop a better mechanistic understanding of mental illnesses and to help stratify medicines and interventions by specific mental health conditions, so that treatment is tailored to the individual and the burden of mental illness is minimised for the individual, their family and society as a whole.

Page 86: Health Data Science

P24

Integrating professional software engineering practices in medical research software

Patricia Ryser-Welch, Olly Butters

Newcastle University

Health data sets are getting bigger and more complex, and are increasingly being linked with other data sources. With this trend there is an increasing risk of patient identification and disclosure. Two different ways of mitigating this risk are to use a federated analysis approach or to use a data safe haven.

DataSHIELD (www.datashield.ac.uk) is an established federated data analysis tool used in the medical sciences, with a variety of built-in methods to reduce the risk of disclosure. Here we describe the steps we are taking to apply modern software engineering methodologies to DataSHIELD. The upcoming Medical Devices legislation requires more rigorous testing of software. While this legislation does not directly apply to software used for research, we think it is important that the ideas behind it filter down to research software. For us, these principles include testing not only that functions work but also that they produce the correct answers, as illustrated in the sketch below. Testing against a static, publicly available standard data set is also an important aspect. This work is being done in a continuous integration framework using Microsoft Azure, and all our software is developed as open source.

In addition to the protection DataSHIELD provides on its own, we are also integrating it into our Trusted Research Environment as part of Connected Health Cities North East and North Cumbria. This will give an extra level of protection to data that may automatically flow from multiple data sources. Additionally, because analysis can be done in a federated way, the data do not need to leave the data controller's environment. This opens up the possibility of analysis happening across trusts and regions.
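The sketch referenced above is given here in Python purely as a language-agnostic illustration (DataSHIELD itself is implemented in R): unit tests check both that a disclosure-safe summary function runs and that it returns the expected answer on a fixed, publicly known reference dataset. The function and threshold are hypothetical.

# Illustrative tests (runnable with pytest): correctness against a static fixture,
# plus a check that small, potentially disclosive cells are rejected.
def safe_mean(values, min_n=5):
    """Return the mean only if enough values are present to limit disclosure risk."""
    if len(values) < min_n:
        raise ValueError("cell count below disclosure threshold")
    return sum(values) / len(values)

REFERENCE_DATA = [4.0, 6.0, 8.0, 10.0, 12.0]   # fixed test fixture, checked into the repo

def test_safe_mean_correct_answer():
    assert safe_mean(REFERENCE_DATA) == 8.0

def test_safe_mean_blocks_small_cells():
    try:
        safe_mean([1.0, 2.0])
    except ValueError:
        pass
    else:
        raise AssertionError("small cells should be rejected")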

Page 87: Health Data Science

P25

Using phenotype ontologies to integrate human genetics and model-organism research data

Tom Shorter, Anthony J Brookes and Tim Beck

Department of Genetics and Genome Biology, University of Leicester, Leicester

Phenotype ontologies are used to standardise phenotypic descriptions within many biomedical research databases. In order to search effectively across databases that use different ontologies, equivalent terms can be mapped. The "GWAS Phenomap" project enables the discovery of genome-wide association study (GWAS) findings across a range of GWAS and GWAS-related data sources by mapping their phenotype content to that of GWAS Central (https://www.gwascentral.org/). GWAS Central currently provides a set of online tools for integrating and comparing summary-level GWAS data from over 3,000 studies measuring over 70 million p-values. These tools have been extended to allow a range of mapped ontologies (MeSH, HPO, EFO, ICD-10) to be visualised and searched. Running an ontology-driven query from the Phenomap interface interrogates the source datasets and displays the matching studies, publications and database cross-links. All data sources can be queried using any of the ontologies, and phenotype data that are not defined using any ontology are automatically annotated using concept recognition.

Furthermore, we have mapped between human and mouse phenotype ontologies to link mouse phenotypic observations with their human equivalents in GWAS, as part of a novel approach to refining the regulatory loci identified in the mouse for lifestyle diseases. Our automated pipeline maps Medical Subject Headings to Mammalian Phenotype (MP) ontology terms and, using a concept recognition technique, associates free-text mouse trait descriptions with MP ontology terms. For a range of lifestyle disease-related phenotypes, we are able to identify the human intergenic regions that are syntenic to the mouse loci associated with the equivalent phenotypes. By extending this approach and integrating disease-related mouse genetic studies with GWAS on a large scale, we can validate the clinical relevance of mouse studies across many disease groups.

Page 88: Health Data Science

P26

CNTN5 genetic variation in Metabolic and Mental Health

Dr Rona Strawbridge, Soddy Sau Yu Leung1, Mark E.S. Bailey2, Breda Cullen1, Amy Ferguson1, Nicholas Graham1, Keira J.A. Johnston1,2,3, Donald M. Lyall1, Laura M. Lyall1, Claire L. Niedzwiedz1, Robert Pearsall1, Richard J Shaw1, Rachana Tank1, Joey Ward1, Daniel J. Smith1

1 Institute of Health and Wellbeing, University of Glasgow, Glasgow, UK. 2 School of Life Sciences, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, UK 3 Division of Psychiatry, College of Medicine, University of Edinburgh, Edinburgh, UK 4 Cardiovascular Medicine Unit, Department of Medicine Solna, Karolinska Institutet, Stockholm, Sweden

Epidemiology has convincingly demonstrated that people suffering from serious mental illness (SMI, such as schizophrenia, bipolar disorder and major depressive disorder) have an excess risk of obesity, diabetes and cardiovascular disease. It is unclear to what extent the increased risk of cardiometabolic disease in people with SMI is due to social determinants and lifestyle factors, or whether there are shared biological mechanisms connecting mental and physical illness.

The CNTN5 locus has been associated with a number of cardiometabolic traits, as well as with SMI and suicidal behaviour, and might represent a mechanism common to mental and physical illness. In this study we explored whether CNTN5 associations with cardiometabolic traits overlap with, or are distinct from, the associations with psychological traits, including suicidal behaviour, and assessed the trans-ancestral consistency of effects.

In the UK Biobank study (N=129,000-377,000), significant associations (multiple-testing-corrected p<1x10-6) were observed between genetic variants in CNTN5 and neuroticism and systolic and diastolic blood pressure, with suggestive associations (p<1x10-5) observed with mood instability, generalised anxiety disorder, major depressive disorder and central adiposity. Linkage disequilibrium analysis indicates that these signals are independent. Future analyses considering the multiple signals in this locus could provide novel insights into behaviours relating to SMI.

Page 89: Health Data Science

P27

Discovering the UK’s Cohorts

Matt Styles, Tom Giles, Jurgen Mitsch, Philip Quinlan

Advanced Data Analysis Centre, Digital Research Service, University of Nottingham, Nottingham, UK

Medical research in the UK is still hindered by a perceived lack of suitable data and sample resources that can be used by the research community. This perception contrasts starkly with the knowledge that there are hundreds of potential resources that can supply data and samples. The work of the Tissue Directory and Coordination Centre (TDCC), backed by the Medicines Discovery Catapult (MDC), has shown that the actual problem experienced by both academic and commercial researchers is the inability to find suitable resources. The consequence is that researchers tend to use the resources that are closest in proximity (because they can go and talk to the resource directly) rather than those closest to their research goals. This behaviour is driven by the lack of any easy mechanism for discovering other potential national resources.

The TDCC is the UK's national centre tasked with coordinating biobanks (samples and data) to ensure that researchers re-use existing resources before seeking to collect new samples or datasets. The TDCC is a joint endeavour between the University of Nottingham and University College London. Researchers can search on a very high-level classification of disease, gender and age, and resources matching those criteria are displayed. This Directory acts as a first filter but does not allow researchers to ask detailed feasibility questions.

It is clear that, in order to support and utilise existing UK investment in health care resources, we must have a more efficient mechanism by which a researcher can ask a relatively detailed question and understand whether resources exist that could support them. It is especially important that researchers find out quickly if something is not feasible (rather than investing time and then finding it is not), and equally that referrals to the resources (such as Genomics England) from the search system do not create irrelevant traffic. We are making a leap in search capability across four of the UK's leading cohorts (Genomics England, UK Biobank, ALSPAC and Generation Scotland) via a research collaboration with a leading technology provider, BC Platforms, in order to give researchers a single place in which in-depth queries can be performed to assess the feasibility of research using these cohorts.

Page 90: Health Data Science

P28

Exploring the functional effects of genetic risk scores in pseudotime

Shu Mei Teo1,2, Christopher Yau 3,4, Michael Inouye 1,2,4,5,6

1 Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Australia 2 Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, United Kingdom 3 Centre for Computational Biology, Institute of Cancer and Genomic Sciences, University of Birmingham, United Kingdom 4 The Alan Turing Institute, London, United Kingdom 5 Department of Clinical Pathology, University of Melbourne, Parkville, Australia 6 School of BioSciences, The University of Melbourne, Parkville, Australia

Many complex human diseases are caused by a combination of genetic and environmental exposures that interact with the age of an individual. Time-series molecular data from longitudinal studies are ideal, but such studies are challenging to conduct at a large scale. Recent advances in the inference of pseudotime trajectories from single-time-point data have shown that latent temporal information can be extrapolated from cross-sectional data sets, offering new insight into dynamic disease processes without longitudinal studies. In parallel, increasingly large genome-wide association studies have enabled the construction of genomic risk scores (GRS) that represent an individual's risk of disease at birth. Yet the molecular pathways by which a GRS confers disease risk have not been explored. I use covariate-aware pseudotime approaches to model cross-sectional surveys of the human proteome and its interactions with the GRS of multiple complex human traits and diseases. Proteins with significant GRS-pseudotime interaction effects are of interest; these represent proteins whose trajectories are modulated by the genetic susceptibility to specific diseases.

Page 91: Health Data Science

P29

Oral anticoagulation in patients with Atrial Fibrillation: A longitudinal data linkage study.

Fatemeh Torabi, Daniel Harris, Ashley Akbari, Mike Gravenor, Ronan Lyons, Julian Halcox

Swansea University Medical School

Introduction: Atrial fibrillation (AF) is a common condition conferring a five-fold increase in the risk of stroke, which can be reduced by two thirds or more by the use of anticoagulant (AC) drugs, where appropriate. It is unknown how variation in AC use impacts stroke incidence at a population level. We are developing a dynamic model to predict the impact of AF treatment on stroke risk at an individual and population level.

Methodology: We used the Secure Anonymised Information Linkage (SAIL) Databank to identify all people with linked data who had a diagnosis of AF prior to the index date (01-01-2016), and those with a diagnosis of stroke. Antithrombotic treatment prescriptions, stroke (CHA2DS2-VASc) and bleeding (HAS-BLED) risk scores were evaluated on a representative index date. Subsequent analyses will be undertaken on a combined dataset of AF patients, identifying stroke-related events, patient risk factors and treatments over a study period of 2012-18. A multivariable model will be developed to evaluate the impact of changes in AC treatment on the incidence of AF-associated adverse outcomes in the Welsh AF population.

Results: We identified 57,721 patients with AF on 01-01-2016, of whom 35,443 (61.4%) were receiving AC. Of the 22,728 patients not receiving AC (non-AC), 35% were treated with anti-platelet drugs and 39% had previously been treated with AC. The proportions of patients at increased bleeding risk (HAS-BLED ≥3) were similar in the AC and non-AC populations across all levels of stroke risk. A preliminary survey found that the prevalence of AF increased from 2012 to 2016, but the incidence of stroke in known AF patients remained stable.

Conclusions: These data show that over a third of AF patients in Wales eligible for AC therapy in 2016 were not receiving it, despite these patients having, on average, a similar stroke and bleeding risk profile to the treated population. We also found that, despite sub-optimal AC use in 2016 and an increase in the number of patients with AF, the incidence of stroke in this population did not increase between 2012 and 2016. We are undertaking statistical analyses to explore in detail the relationships between stroke and bleeding risk factors, use of AC medication and adverse clinical outcomes in AF patients over 2012-2018. A better understanding of these relationships will inform the development of strategies to optimise the use of AC therapy for stroke prevention at an individual and population level in Wales and beyond.
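For reference, the sketch below computes the published CHA2DS2-VASc stroke-risk score referred to above; it reflects the standard scoring weights and is not the SAIL study code.

# Illustrative CHA2DS2-VASc calculation (standard published weights).
def cha2ds2_vasc(chf, hypertension, age, diabetes, prior_stroke_tia,
                 vascular_disease, female):
    score = 0
    score += 1 if chf else 0                 # congestive heart failure / LV dysfunction
    score += 1 if hypertension else 0
    score += 2 if age >= 75 else (1 if age >= 65 else 0)
    score += 1 if diabetes else 0
    score += 2 if prior_stroke_tia else 0    # prior stroke / TIA / thromboembolism
    score += 1 if vascular_disease else 0
    score += 1 if female else 0
    return score

# Example: a 70-year-old woman with hypertension scores 3.
print(cha2ds2_vasc(False, True, 70, False, False, False, True))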

Page 92: Health Data Science

P30

Predicting genetic ethnicity and relatedness from genotyping data (pretrel)

Salih Tuna [1,2], Chris Penkett [1,2], Will Astle [3], Kathy Stirrups [1,2], Willem H. Ouwehand [1,2,4,5], Ernest Turro [1,2,3]

1. NIHR BioResource, Cambridge University Hospitals, Cambridge, UK. 2. Department of Haematology, Cambridge University, Cambridge, UK. 3. MRC Biostatistics Unit, Cambridge Institute of Public Health, Cambridge, UK. 4. Wellcome Sanger Institute, Hinxton, Cambridge, UK. 5. NHS Blood and Transplant, Cambridge Biomedical Campus, Cambridge, UK.

Prediction of ethnicity and relatedness from genomic markers is a crucial step in quality control (QC) analyses for any large-scale sequencing and genotyping study. By identifying mislabelled samples at an early stage in the project, the researcher can investigate potential issues in their laboratory or informatics pipelines, such as possible sample swaps or duplications, and ensure that the correct samples are included in the final dataset. Furthermore, by using an unrelated set of samples, bias in population genetics parameters can be removed from the mainstream analysis. This allows, for example, accurate calculation of allele frequencies (AF) for a particular mutation.

In the NIHR BioResource, we have developed an R package (pretrel) that supports initial SNP selection from the overall dataset, then calculates ethnicity and relatedness between samples, and finishes by identifying a maximal set of unrelated samples. pretrel can be run on genotyping data obtained from a wide variety of platforms, including Illumina sequencing data and microarrays. The full analysis suite scales to large sample sizes and has been run on the GEL whole-genome sequencing (WGS) data (~50,000 samples), NIHR BioResource WGS data, targeted NGS data for rare disease panels and data from Axiom arrays.

As a first step, pretrel can select a reliable set of common SNPs from the dataset. The analysis also uses genotypes from a reference project, such as the 1000 Genomes Project, to infer ethnicities, and part of the selection involves finding SNPs that intersect both the project data and the reference samples. VCF files from the project and reference samples are then filtered for this SNP set, whilst checking for data completeness, before being merged into one VCF file with high-quality genotypes. The main statistical method used for predicting ethnicities is principal component analysis (PCA), where the principal components are calculated using the reference set and the samples of interest are then projected onto the reference set using the GENESIS R package. Using these projections, we further assign ethnicities to each sample based on their similarity to the reference samples. Pairwise relatedness is then calculated with the GENESIS package, which starts with the ethnicity PCA loadings and then calculates IBD ratios and a kinship score.
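A minimal sketch of the ancestry-inference idea described above: fit PCA on reference genotypes, project study samples onto those components and assign each sample the label of the nearest reference-population centroid. This is an illustration only, not the pretrel/GENESIS implementation, and the genotypes are random placeholders.

# Illustrative reference-based PCA projection and nearest-centroid ancestry assignment.
import numpy as np
from sklearn.decomposition import PCA

def assign_ancestry(ref_geno, ref_labels, study_geno, n_components=10):
    pca = PCA(n_components=n_components).fit(ref_geno)       # PCs from reference only
    ref_pcs, study_pcs = pca.transform(ref_geno), pca.transform(study_geno)
    labels = np.unique(ref_labels)
    centroids = np.vstack([ref_pcs[ref_labels == l].mean(axis=0) for l in labels])
    dists = np.linalg.norm(study_pcs[:, None, :] - centroids[None, :, :], axis=2)
    return labels[np.argmin(dists, axis=1)]

# Toy example with random genotypes coded 0/1/2 at 200 SNPs:
rng = np.random.default_rng(0)
ref = rng.integers(0, 3, size=(100, 200)).astype(float)
study = rng.integers(0, 3, size=(10, 200)).astype(float)
print(assign_ancestry(ref, np.repeat(["EUR", "AFR"], 50), study, n_components=5))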

Page 93: Health Data Science

P31

Café Variome, A Data Discovery Lattice

Colin Veal, Dhiwagaran Thangavelu, Gregory Warren, Marc Wadsley, Vagelis Ladas, Spencer Gibson, Tim Beck, and Anthony J Brookes

Department of Genetics and Genome Biology, University of Leicester, Leicester, UK

Health data sharing often occurs in small-scale interactions between a limited number of individuals or institutions. This approach helps to ensure correct data use and protects patient confidentiality, but runs counter to widespread data re-use. A more effective approach would be based upon a supportive "data discovery" layer that would allow data seekers to query for the "existence" rather than the "substance" of particular datasets (i.e., without exposing the data themselves). Such services could grow into a comprehensive lattice of resources that could be universally interrogated by suitably authorised users. To this end, we have built Café Variome (www.cafevariome.org) as a flexible, customisable, web-based data discovery tool that can be quickly installed by any genotype-phenotype data owner or network. This technology makes safe or sensitive content appropriately discoverable, and facilitates effective patient 'match-making' with match settings controlled by the user.

Via Café Variome, data owners/custodians retain full control of their data. Approved data elements (or obfuscated copies thereof) are remotely queried while remaining on the custodian's local server. No data are transferred to the browser or to other installations, and query responses can be strictly limited (i.e., yes/no, or a count of query "hits"); a minimal sketch of this principle is given below. The platform includes:

- A data component, whereby different records and/or fields can be made differentially discoverable by different users/networks.

- A query component, including simple and advanced "query-builder" interfaces, optionally constrained by ontologies.

- An administration component, whereby a data custodian controls: who can perform discovery on which datasets; whether any data can be viewed; the creation and management of federated networks; and the site appearance/branding.

Current implementations support matching across RD mutation registries (EDS Consortium), multi-site Alzheimer subject recruitment (EU-EPAD), and biomaterial finding across biobanks (e.g., in the Leicester Bioresource). Café Variome can be, and has been, configured to interoperate with alternative discovery providers (e.g., GA4GH Beacons) and supports querying by consent and data use conditions (e.g., using the GA4GH "ADA-M" standard). We are now developing and applying specific modules for the discovery and real-time phenotypic similarity matching requirements of the Solve-RD Horizon 2020 project. Applications across the UK will also be explored within the HDR-UK national institute.

Funding: Provided by EU projects Solve-RD (#779257) and EJP-RD (#825575) and HDR-UK.
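The sketch referenced above illustrates the "existence, not substance" principle: a discovery function that answers a query with yes/no or a hit count but never returns the records themselves. The record fields and query format are assumptions for illustration, not Café Variome's API.

# Illustrative existence/count-only discovery query over locally held records.
RECORDS = [
    {"phenotype": "HP:0001250", "gene": "SCN1A"},
    {"phenotype": "HP:0001250", "gene": "KCNQ2"},
    {"phenotype": "HP:0000365", "gene": "GJB2"},
]

def discover(query: dict, mode: str = "exists"):
    """Return only whether matching records exist ('exists') or how many ('count')."""
    hits = [r for r in RECORDS if all(r.get(k) == v for k, v in query.items())]
    return bool(hits) if mode == "exists" else len(hits)

print(discover({"phenotype": "HP:0001250"}))                # True
print(discover({"phenotype": "HP:0001250"}, mode="count"))  # 2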

Page 94: Health Data Science

P32

Real-time phenotypic similarity matching across federated datasets

Marc Wadsley, Colin Veal, Gregory Warren, Tim Beck, Dhiwagaran Thangavelu, Anthony Brookes

University of Leicester

Solving diseases increasingly relies on large resources spread across many sites. This is particularly true for rare diseases, where obtaining sufficient numbers of subjects for thorough analysis or trials is difficult. Finding subjects with similar phenotypes can often reveal new knowledge, such as shared variants, or increase patient numbers for robust statistical analysis. However, finding similar patients can be time consuming due to cumbersome communication between sites, slow patient-by-patient pairwise matching, or limited matching options.

Here we describe a platform that allows real-time phenotypic similarity matching across a federated network. The core functionality resides within a graph-based model combining the HPO ontology and patient phenotype descriptions, and builds upon existing semantic similarity methods based on information content (Resnik, Lin, Jiang-Conrath, relevance information coefficient); an illustrative sketch is given below. The platform can be deployed within our existing Café Variome architecture, providing flexible ontology-based query interfaces, access controls and further filtering options based on demographics, variant genotypes and consent, amongst many other options.

Benchmarking on a lightweight Linux server (2 cores, 4 GB RAM), with the platform populated with 10,000 patient equivalents (2 to 15 HPO terms per patient) generated from 7,500 OMIM disease descriptions, we found that queries of 1-5 HPO terms matched similar patients in <100 ms. This shows potential for high scalability, not only within a single system but across a large, international federated network.

This platform is now being deployed in the Solve-RD Horizon 2020 project, and we are currently adapting it to additional ontologies and for use in other health data contexts.
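The sketch referenced above illustrates information-content similarity in the spirit of the Resnik and Lin measures on a toy ontology; the real platform operates over the full HPO graph, and the terms and annotation counts here are invented.

# Illustrative Resnik and Lin similarity over a tiny toy ontology.
import math

PARENTS = {"seizure": {"neuro"}, "ataxia": {"neuro"}, "neuro": {"root"}, "root": set()}
FREQ = {"seizure": 2, "ataxia": 1, "neuro": 4, "root": 8}   # toy annotation counts

def ancestors(term):
    out = {term}
    for p in PARENTS[term]:
        out |= ancestors(p)
    return out

def ic(term):
    return -math.log(FREQ[term] / FREQ["root"])     # information content = -log p(term)

def resnik(t1, t2):
    common = ancestors(t1) & ancestors(t2)
    return max(ic(t) for t in common)                # IC of the most informative common ancestor

def lin(t1, t2):
    return 2 * resnik(t1, t2) / (ic(t1) + ic(t2)) if ic(t1) + ic(t2) > 0 else 0.0

print(resnik("seizure", "ataxia"), lin("seizure", "ataxia"))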

Page 95: Health Data Science

P33

Consumer-led breakthroughs in insulin-treated diabetes

Cyndi Williams, Peter Wooldridge

Quin Technology Ltd

100 million people use insulin to treat diabetes. No doctor can tell anyone exactly how much insulin to take and when. People with diabetes use trial and error to make hundreds of decisions a day to keep themselves going. They are 2-3 times more likely to suffer from fatigue, anxiety, stress and depression, and only 8% of them achieve outcomes that will keep them from complications or an early death due to diabetes.

Outcomes are poor because there are too many long-lasting, variable, overlapping and poorly understood factors that affect blood glucose. Some are studied by researchers (insulin injections, eating, activity), but many more (stress, sleep, illness, menstruation, travel, etc.) remain unstudied. It is impossible for any individual to know about or recall all these events and manage them effectively with insulin. Doctors and researchers are equally stuck: there is no way to study the events in isolation, the endocrine system is too complex and poorly understood, and the molecular taxonomy for diabetes remains incomplete, making only generic solutions possible.

The experience and expertise of the millions of people who keep themselves alive every day by taking insulin is a massive source of untapped diabetes knowledge. We are using it to unlock the development of more personalised solutions for insulin-treated diabetes.

Our mobile app takes data from existing diabetes devices, wearables and phones, and uses it to formalise and classify studied and unstudied events and evaluate their effects on blood glucose. Each individual's unique self-care experience is codified into our Insulin Intervention Taxonomy (patent pending). These rich, highly structured data sets are the basis for modelling and for training machine learning algorithms to give personalised decision-making guidance to our users via our app, and to stratify them according to their experiences and outcomes. New stratifications then network "users like me" via our app for peer support, and drive new multi-omic research into more personalised treatments and pathways.

We are currently running a research programme with 50+ people with diabetes who take insulin. We are using their self-care insights to help shape the personalised decision-making services in our app.

Page 96: Health Data Science

P34

Contextualised concept embedding for efficiently adapting natural language processing models for phenotype identification

Honghan Wu, Karen Hodgson, Susan Dyson, Katherine I. Morley, Zina M. Ibrahim, Ehtesham Iqbal, Robert Stewart, Richard JB Dobson, Cathie Sudlow

Centre for Medical Informatics, Usher Institute, University of Edinburgh, United Kingdom; IoPPN, King’s College London, United Kingdom; Health Data Research UK, UCL, United Kingdom

*Background*

Many efforts have been made to use automated approaches, such as natural language processing (NLP), to mine or extract data from free-text medical records in order to build comprehensive patient profiles for delivering better health care. Reusing NLP models in new settings, however, remains cumbersome, requiring validation and/or retraining on new data iteratively to achieve convergent results.

*Materials and Methods*

We formally define and analyse the NLP model adaptation problem, particularly in phenotype identification tasks, and identify two types of common unnecessary or wasted effort: duplicate waste and imbalance waste. A distributed representation approach is proposed to represent the language patterns familiar to an NLP model by learning phenotype embeddings from its training data. Computations on these language patterns are then introduced to help avoid or reduce unnecessary effort by combining both geometric and semantic similarities; an illustrative sketch of this model-selection step is given below. To evaluate the approach, we cross-validate NLP models developed for six physical morbidity studies (23 phenotypes; 17 million documents) on anonymised medical records of the South London and Maudsley NHS Trust, United Kingdom.

*Results*

Two metrics are introduced to quantify the reductions in both duplicate and imbalance waste. We conducted various experiments on reusing NLP models in four phenotype identification tasks. Our approach can choose the best model for a given new task, identifying up to 76% of mentions that need no validation or model retraining while maintaining very good performance (93-97% accuracy). It can also provide guidance for validating and retraining the model for novel language patterns in new tasks, which can help save around 80% of the effort required by blind model-adaptation approaches.

*Conclusion*

Adapting pre-trained NLP models to new tasks can be more efficient and effective if the language pattern landscapes of the old and new settings can be made explicit and comparable. Our experiments show that the phenotype embedding approach is an effective way to model language patterns for phenotype identification tasks, and that operations on these embeddings can guide efficient NLP model reuse.
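The model-selection sketch referenced above: represent each pre-trained model by the centroid of its training-data phenotype embeddings and, for a new task, choose the model whose centroid is closest in cosine similarity. The embeddings below are random placeholders, not learned phenotype embeddings.

# Illustrative model selection by cosine similarity of phenotype embeddings.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
model_embeddings = {                      # centroid embedding per pre-trained model
    "hepatitis_model": rng.normal(size=50),
    "diabetes_model":  rng.normal(size=50),
    "stroke_model":    rng.normal(size=50),
}
new_task_embedding = rng.normal(size=50)  # centroid of the new task's language patterns

best = max(model_embeddings, key=lambda m: cosine(model_embeddings[m], new_task_embedding))
print("reuse:", best)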

Page 97: Health Data Science

P35

Population level individual risk assessment informed by microbial genomics: assessing TB transmission risk in England

David Wyllie, Esther Robinson, Eliza Alexander, Vlad Nikolayevskyy, Colin Campbell, Grace Smith

Public Health England

Background

WHO has proposed that countries with a low tuberculosis incidence eliminate in-country transmission. In the UK, as in other low-incidence countries, many TB cases are imported, so incidence does not necessarily reflect UK-based transmission. However, neither the best method of measuring in-country transmission, nor methods for the rapid detection of groups at risk of onward transmission, are well established. Routine whole genome sequencing of TB isolates may be able to help with both goals.

Methods

Maximum likelihood phylogenetic relationships were determined between M. tuberculosis sequences from a universal, prospective M. tuberculosis sequencing programme operating in England. For the first M. tuberculosis isolation from each individual, we computed the time to putative transmission, defined as the subsequent isolation of an organism compatible with descent from the first, but from a different individual. Survival analyses were used to determine risk factors for putative transmission; a minimal sketch of such a model is given below. A model derived from one year's data from part of England was validated using independent data from the whole country obtained in the subsequent year.

Results

We examined TB isolates from 1,420 individuals obtained in 2016/2017 from the Midlands and North of England, and 2,653 from across England in 2018/19. Kaplan-Meier estimates of transmission were generated both for the 2016/17 dataset and across England in 2018/19. Univariate (Kaplan-Meier) analyses showed that transmission risk was higher in patients with multiple recognised risk factors for TB transmission. However, in a multivariable proportional hazards model derived from the 2016/17 cases, only pulmonary disease, young age, imprisonment, and the isolation of similar samples in the previous 120 days significantly increased putative transmission; recent immigration and lineage 1 disease decreased risk. From these variables, a risk prediction score was generated and successfully validated in the 2018/19 data set. Prediction was impacted relatively little by imprisonment status and recent immigration status, the only variables not available from laboratory data.

Conclusions

Analysis of routine whole genome sequencing data can provide M. tuberculosis transmission estimates; such analyses suggest declining transmission in the Midlands and North of England, which is compatible with known UK trends. Importantly, routine laboratory data in England allow in-country tuberculosis transmission risk prediction, something which could be added to laboratory reporting to drive targeted public health interventions. More refined estimates are likely achievable were more complex data incorporated, perhaps using Bayesian models. Overall, the approach represents an application of data fusion, and the provision of personalised medicine to disease control.
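The sketch referenced in the Methods is given below: a multivariable Cox proportional hazards model fitted with the lifelines package on synthetic data, from which a relative risk score per isolate can be derived. Covariate names and values are placeholders, not the Public Health England data.

# Illustrative Cox proportional hazards model for time to putative transmission.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "pulmonary_disease":      rng.integers(0, 2, n),
    "young_age":              rng.integers(0, 2, n),
    "recent_similar_isolate": rng.integers(0, 2, n),
})
# Synthetic times: higher hazard (shorter times) with more risk factors present.
hazard = 0.002 * np.exp(0.8 * df.pulmonary_disease + 0.5 * df.young_age
                        + 1.0 * df.recent_similar_isolate)
df["time"] = rng.exponential(1 / hazard)
df["event"] = (df["time"] < 730).astype(int)          # observed within ~2 years
df["time"] = df["time"].clip(upper=730)               # administrative censoring

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
cph.print_summary()                                    # hazard ratios per risk factor
risk_scores = cph.predict_partial_hazard(df)           # relative risk score per isolate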

Page 98: Health Data Science

P36

Notes

Page 99: Health Data Science

P37

Notes

Page 100: Health Data Science

P38

Notes

Page 101: Health Data Science

Delegate List

Mehrdad A Mizani

Usher Institute of Population Health Sciences and Informatics

[email protected]

Ashley Akbari

HDR-UK Wales & Northern Ireland

[email protected]

Beatrice Alex

University of Edinburgh

[email protected]

Chantal Babb de Villiers

PHG Foundation

[email protected]

Chiara Batini

University of Leicester

[email protected]

Robin Beaumont

University of Exeter

[email protected]

Tim Beck

University of Leicester

[email protected]

Fran Biggin

Lancaster University

[email protected]

Veronique Birault

Francis Crick Institute

[email protected]

William Bradlow

University Hospitals Birmingham NHS Foundation Trust

[email protected]

Anthony Brookes

University of Leicester

[email protected]

Thore Manuel Buergel

EMBL-EBI

[email protected]

Caroline Bull

University of Bristol

[email protected]

Tianxi Cai

Harvard University

[email protected]

Archie Campbell

University of Edinburgh

[email protected]

Toni Clarke

University of Edinburgh

[email protected]

Adrian Cortes

GSK

[email protected]

Nathan Cunningham

University of Warwick

[email protected]

John Danesh

University of Cambridge

[email protected]

Mark Davies

Medicines Discovery Catapult

[email protected]

Carol Dezateux

QMUL & HDRUK London

[email protected]

Emanuele Di Angelantonio

University of Cambridge

[email protected]

Richard Dobson

King's College London

[email protected]

Page 102: Health Data Science

Dimitrios Doudesis

University of Edinburgh

[email protected]

Elliot Dryer Beers

Imperial College London

[email protected]

Demitra Ellina

F1000

[email protected]

Tomas Fitzgerald

EMBL-EBI

[email protected]

David Ford

HDR-UK Wales & Northern Ireland

[email protected]

Thorsten Forster

LifeArc

[email protected]

Isabella Friis Joergensen

Novo Nordisk Foundation Center for Protein Research

[email protected]

Kate Gardner

Guy's and St Thomas' NHS Trust

[email protected]

Katie Harron

UCL

[email protected]

Jennifer Harrow

ELIXIR

[email protected]

Joanne Hartley

Medicines Discovery Catapult

[email protected]

Claire Hastie

University of Glasgow

[email protected]

Caroline Hayward

MRC Human Genetics Unit

[email protected]

Maria Herrero Zazo

EMBL-EBI

[email protected]

Kathryn Holt

Monash University

[email protected]

Richard Houghton

Wellcome Sanger Institute

[email protected]

Michael Inouye

University of Cambridge

[email protected]

Catherine John

University of Leicester

[email protected]

Alexander Wolfgang Jung

EBI

[email protected]

Stephen Kaptoge

University of Cambridge

[email protected]

Sarah Kim Hellmuth

New York Genome Center

[email protected]

Nathalie Kingston

University of Cambridge

[email protected]

Jo Knight

Lancaster University

[email protected]

Zeljko Kraljevic

King's College London

[email protected]

Loic Lannelongue

University of Cambridge

[email protected]

Page 103: Health Data Science

Kirstin Leslie

University of Glasgow

[email protected]

Melissa Lewis Brown

Health Data Research UK

[email protected]

Waty Lilaonitkul

University College London

[email protected]

Cecilia Lindgren

Li Ka Shing Centre -BDI

[email protected]

Nadia Lipunova

University of Birmingham

[email protected]

Jennifer Luyapan

Dartmouth College

[email protected]

Donald Lyall

University of Glasgow

[email protected]

Ronan Lyons

HDR-UK Wales & Northern Ireland

[email protected]

Calum MacRae

Harvard Medical School

[email protected]

Teri Manolio

NHGRI

[email protected]

Andrew McIntosh

University of Edinburgh

[email protected]

Louise Millard

Integrative Epidemiology Unit, University of Bristol

[email protected]

Aoife Molloy

Imperial College London

[email protected]

Sarah Morgan

EMBL-EBI

[email protected]

Andrew Morris

Health Data Research UK

[email protected]

Joseph Mullen

SciBite Ltd

[email protected]

Jonathan Mullins

Swansea University

[email protected]

Vasilis Nikolaou

University of Surrey

[email protected]

Povilas Norvisas

BenevolentAI

[email protected]

Jyotishman Pathak

Weill Cornell Medicine

[email protected]

Chris Penkett

NIHR BioResource

[email protected]

John Perry

University of Cambridge

[email protected]

Michael Poon

Usher Institute of Population Health Sciences and Informatics

[email protected]

Stephanie Portelli

University of Melbourne

[email protected]

Page 104: Health Data Science

David Porteous

University of Edinburgh

[email protected]

Carys Pugh

University of Edinburgh

[email protected]

Kristiina Rannikmae

University of Edinburgh

[email protected]

Caroline Relton

University of Bristol

[email protected]

Christopher Rentsch

LSHTM/Yale

[email protected]

Brent Richards

McGill University

[email protected]

Tom Richardson

MRC Integrative Epidemiology Unit

[email protected]

Thorsten Roser

Loughborough University London

[email protected]

Patricia Ryser-Welch

Newcastle University

[email protected]

Crina Samarghitean

ICE/University of Cambridge

[email protected]

Aravind Sankar

European Bioinformatics Institute (EMBL-EBI)

[email protected]

Robert Scott

GlaxoSmithKline

[email protected]

Syed Ahmar Shah

The University of Edinburgh

[email protected]

Tom Shorter

University of Leicester

[email protected]

Kamen Shoylev

KCL

[email protected]

Avinoam Shye

Tel Aviv University

[email protected]

Jonathan Smellie

University of Edinburgh

[email protected]

Rona Strawbridge

University of Glasgow

[email protected]

Matthew Styles

University of Nottingham

[email protected]

Cathie Sudlow

University of Edinburgh

[email protected]

Praveen Surendan

University of Cambridge

[email protected]

Dominic Sykes

The University of Edinburgh

[email protected]

Shu Mei Teo

Baker Heart and Diabetes Institute

[email protected]

Fatemeh Torabi

Swansea University

[email protected]

Salih Tuna

NIHR BioResource

[email protected]

Page 105: Health Data Science

Rachael Turner

SciBite

[email protected]

Julius Upmeier zu Belzen

Charité - Universitätsmedizin Berlin

[email protected]

Colin Veal

University of Leicester

[email protected]

Marc Wadsley

University of Leicester

[email protected]

Rhos Walker

Health Data Research UK

[email protected]

Neil Walker

NIHR BioResource

[email protected]

Rosemary Walmsley

University of Oxford

[email protected]

Cyndi Williams

Quin

[email protected]

Andrew Wood

University of Exeter

[email protected]

Angela Wood

University of Cambridge

[email protected]

Peter Wooldridge

Quin

[email protected]

Honghan Wu

University of Edinburgh

[email protected]

David Wyllie

Public Health England

[email protected]

Page 106: Health Data Science

Index

A Mizani, M P1
Alex, B P2
Beaumont, R S25
Beck, T P3
Biggin, F S33
Brookes, A P4
Buergel, T P5
Bull, C P6
Cai, T S49
Clarke, T P7
Cunningham, N P8
Davies, M P9
Dobson, R S3
Doudesis, D S17
Friis Joergensen, I S5
Harrow, J P10
Hastie, C S31
Hayward, C P11
Herrero-Zazo, M S11
Inouye, M S19
John, C P12
Jung, A W P13
Kim-Hellmuth, S S15
Kraljevic, Z S35
Leslie, K P14
Lindgren, C S21
Lipunova, N P15
Luyapan, J P16
Lyall, D P17
MacRae, C S43
Manolio, T S13
Millard, L P18
Molloy, A S45
Morris, A S1
Mullen, J S37
Mullins, J P19
Pathak, J S7
Penkett, C S29
Poon, M P20
Portelli, S P21
Porteous, D P22
Pugh, C P23
Rannikmae, K S9
Richards, B S27
Richardson, T S23
Ryser-Welch, P P24
Sankar, A S39
Shorter, T P25
Strawbridge, R P26
Styles, M P27
Teo, S M P28
Torabi, F P29
Tuna, S P30
Veal, C P31
Wadsley, M P32
Walker, N S41
Wood, A S47
Wooldridge, P P33
Wu, H P34
Wyllie, D P35