Dekker trog - learning outcome prediction models from cancer data - 2017
Learning outcome prediction models from cancer data
Andre Dekker
Department of Radiation Oncology (MAASTRO)
GROW - Maastricht University Medical Centre+
Maastricht, The Netherlands
SLIDES AVAILABLE ON SLIDESHARE (slideshare.net/AndreDekker)

Disclosures
• Research collaborations incl. funding and speaker honoraria
– Varian (VATE, SAGE, ROO, chinaCAT, euroCAT), Siemens (euroCAT), Sohard (SeDI, CloudAtlas), Mirada Medical (CloudAtlas), Philips (EURECA, TraIT, SWIFT-RT, BIONIC), Xerox (EURECA), De Praktijkindex (DLRA), ptTheragnostic (DART, Strategy), CZ (My Best Treatment)
• Public research funding
– Radiomics (USA-NIH/U01CA143062), euroCAT (EU-Interreg), duCAT & Strategy (NL-STW), EURECA (EU-FP7), SeDI & CloudAtlas & DART (EU-EUROSTARS), TraIT (NL-CTMM), DLRA (NL-NVRO), BIONIC (NWO)
• Spin-offs and commercial ventures
– MAASTRO Innovations B.V. (CSO)
– Various patents on medical machine learning

TROG 2017 talks
• Learning outcome prediction models from cancer data
– Technical Research Workshop, Monday 840-910, followed by Panel Discussion
• Big Data in Radiation Oncology
– Statistical Methods, Evidence Appraisal and Research for Trainees, Monday 1450-1520
• Knowledge Engineering in Oncology
– TROG Plenary, Tuesday, 925-1000
• Radiomics for Oncology
– TROG Plenary, Thursday, 1150-1220
[Slide annotations: "Some Overlap", "No Overlap"]

Learning objectives
After the lecture, attendees should be able to
• Name the major sources of cancer data and their absolute and relative size
• Understand the challenges of sharing data and solutions to these
• Itemize steps in the methodology to go from data to models
• Appraise papers that describe models incl. using TRIPOD
The Data Part

Cancer Data?
Oncology, 2005-2015
• 140M patients
• 0.1-10 GB per patient
• 14-1400 PB
• 80% unstructured
• 100k hospitals
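The 14-1400 PB range follows directly from the patient count and the per-patient size. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 10^6 GB); the constant and function names are illustrative only:

```python
# Back-of-envelope check of the slide's data-volume range.
PATIENTS = 140_000_000         # 140M patients, from the slide
GB_PER_PATIENT = (0.1, 10.0)   # low and high estimate per patient, from the slide
GB_PER_PB = 1_000_000          # assumption: decimal units, 1 PB = 10^6 GB


def total_petabytes(patients, gb_per_patient):
    """Total volume in PB for a given per-patient size in GB."""
    return patients * gb_per_patient / GB_PER_PB


low_pb, high_pb = (total_petabytes(PATIENTS, gb) for gb in GB_PER_PATIENT)
print(f"{low_pb:.0f}-{high_pb:.0f} PB")
```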

Barriers to sharing data
"[..] the problem is not really technical […]. Rather, the problems are ethical, political, and administrative." Lancet Oncol 2011;12:933
1. Administrative (I don’t have the resources)
2. Political (I don’t want to)
3. Ethical (I am not allowed to)
4. Technical (I can’t)

Common approaches to sharing
• Sharing standardized, highly curated data from clinical research programs
– Very useful, but only 3% of patients (if that)
• Sharing standardized, highly curated data to clinical registries
– Very useful, but limited amount of features and a lot of work
• Big Data companies, usually cloud based (Watson Health Cloud, Flatiron/Google, ASCO/SAP CancerLinq)
– Worries about privacy, loss of control, limited reusability, silos

Data landscape
• Clinical research: 3% of patients, 100% of features, 5% missing, 285 data points
• Clinical registries: 100% of patients, 3% of features, 20% missing, 240 data points
• Clinical routine: 100% of patients, 100% of features, 80% missing, 2000 data points
[Figure axes: Data elements, Patients]
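The "data points" figures appear to be the product of patient coverage, feature coverage, and completeness. A short sketch under that assumption (the formula is inferred from the numbers, not stated on the slide):

```python
# Assumption: data points = % of patients x % of features x fraction not missing.
SOURCES = {
    "clinical research":   {"patients_pct": 3,   "features_pct": 100, "missing_pct": 5},
    "clinical registries": {"patients_pct": 100, "features_pct": 3,   "missing_pct": 20},
    "clinical routine":    {"patients_pct": 100, "features_pct": 100, "missing_pct": 80},
}


def data_points(patients_pct, features_pct, missing_pct):
    """Relative amount of usable data for one source."""
    return patients_pct * features_pct * (100 - missing_pct) / 100


for name, source in SOURCES.items():
    print(f"{name}: {data_points(**source):.0f} data points")
```

Under this reading, routine clinical data holds the most information despite being 80% missing.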

A different approach
• If sharing is the problem: Don’t share the data
• If you can’t bring the data to the research
• You have to bring the research to the data
• Challenges
– The research application has to be distributed (trains & track)
– The data has to be understandable by an application (i.e. not a human) -> FAIR data stations

CORAL: Community in Oncology for RApid Learning
[Map of participating networks; Map © Copyright Showeet.com]
• meerCAT (Lung - Dyspnea): U Michigan, MAASTRO, The Christie
• canCAT (Lung SBRT - Control): Princess Margaret, MAASTRO
• BIONIC (Radiomics): MAASTRO, Tata Memorial
• duCAT (Lung - Dysphagia): MAASTRO, Radboud, NKI
• euroCAT (Lung - Survival): UK Aachen, LOC Hasselt, Catharina, MAASTRO, CHU Liege
• ozCAT (Head&Neck - Survival): Liverpool, Illawarra, Newcastle, Westmead, MAASTRO, RTOG/NRG
• worldCAT (Rectum - Local Control): Fudan, Rome/EU, RTOG/NRG
• Interest to join: Erasmus (Breast), BCCA (Breast), Bloemfontein (Cervix), Odense (HN, Lung), Aalst (Lung), McGill (Brain)

Typical Data Quality challenges
• Data are unstructured
• Data are not understandable
• Data are missing
• Data are incorrect
• Data are contradicting
• Data are biased
• Data are missing in a biased way
• Garbage in – Garbage out?
Examples on the slide: 声门下区 (Chinese for "subglottic region"); T4N0M0 Stage IV patient; patient weighing 1000 kg; Grade 3+ toxicities
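Some of these problems (missing, incorrect, or non-numeric values) can be flagged automatically before modelling. A minimal sketch, assuming hypothetical field names and plausibility ranges that are not taken from the talk:

```python
# Hypothetical plausibility rules; field names and ranges are illustrative
# assumptions, not values from the talk.
PLAUSIBLE_RANGES = {"weight_kg": (20.0, 300.0), "age_years": (0.0, 120.0)}


def quality_issues(record):
    """Return a list of human-readable issues found in one patient record."""
    issues = []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None:
            issues.append(f"{field}: missing")
        elif not isinstance(value, (int, float)):
            issues.append(f"{field}: not numeric ({value!r})")
        elif not low <= value <= high:
            issues.append(f"{field}: implausible value {value}")
    return issues


records = [
    {"id": "p1", "weight_kg": 72.5, "age_years": 64},
    {"id": "p2", "weight_kg": 1000, "age_years": 58},       # the 1000 kg patient
    {"id": "p3", "weight_kg": None, "age_years": "sixty"},  # missing and free text
]
for record in records:
    print(record["id"], quality_issues(record))
```

Range checks like these catch only the obvious errors; contradictions and bias need domain-specific rules.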
For the techies…

Horizontal Partitions
[Figure: patients by data elements, one block for Maastricht patients and one for Shanghai patients]
• Reasonably well understood
• Distributed learning possible if data is FAIR
• No need for data to leave the hospital
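With horizontally partitioned data, many statistics can be computed from per-site aggregates alone. A minimal sketch, assuming each hospital returns only a count, a sum, and a sum of squares; the function names and the ages are made up for illustration:

```python
import math


def local_summary(values):
    """Runs inside a hospital: returns aggregates only, never patient rows."""
    return {"n": len(values), "sum": sum(values), "sum_sq": sum(v * v for v in values)}


def pooled_mean_sd(summaries):
    """Runs at the coordinator: combines the per-site aggregates."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)  # sample variance
    return mean, math.sqrt(max(variance, 0.0))


# Made-up ages: same data element, different patients at each site.
maastricht_ages = [61, 70, 58, 66]
shanghai_ages = [55, 63, 72]
mean, sd = pooled_mean_sd([local_summary(maastricht_ages), local_summary(shanghai_ages)])
print(f"pooled mean {mean:.1f}, pooled SD {sd:.1f}")
```

The pooled result is the same as if all rows had been combined in one place, yet no row leaves its site.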

Vertical and Complex Partitions
[Figure: patients by data elements, with some data elements held by MAASTRO and others by a registry]

A bit more technical detail
• Keep data locally
• Standardize it according to an ontology
• Make and send around learning “bots”
• Share the results - not the data!
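One way to picture a learning "bot" is a function that visits each site, computes a model update on local data, and returns only that update. The sketch below uses plain gradient descent for logistic regression as a stand-in; it is not the algorithm used in the projects named in this talk, and the toy data and function names are made up:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def local_gradient(weights, rows):
    """Runs at one hospital. rows = [(features, outcome), ...] stay local;
    only the summed gradient and the row count are returned."""
    grad = [0.0] * len(weights)
    for features, outcome in rows:
        x = [1.0] + list(features)  # prepend the intercept term
        error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - outcome
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return grad, len(rows)


def train(sites, n_features, rounds=500, learning_rate=0.5):
    """Coordinator: sends weights to every site and sums the returned gradients."""
    weights = [0.0] * (n_features + 1)
    for _ in range(rounds):
        replies = [local_gradient(weights, rows) for rows in sites]
        total_n = sum(n for _, n in replies)
        for j in range(len(weights)):
            weights[j] -= learning_rate * sum(g[j] for g, _ in replies) / total_n
    return weights


# Made-up toy data: one scaled feature and a binary outcome per patient.
site_a = [((0.1,), 0), ((0.4,), 0), ((0.8,), 1)]
site_b = [((0.2,), 0), ((0.6,), 0), ((0.9,), 1), ((0.5,), 1)]
distributed = train([site_a, site_b], n_features=1)
centralized = train([site_a + site_b], n_features=1)
print(distributed, centralized)
```

Because gradients add up across sites, the distributed fit matches the centralized one up to floating-point rounding.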

Even more technical details
• De-identification
• Semantic web, linked data
• Imaging/DICOM data & clinical data stream
The Modelling Part

Our modelling approach
• Hypothesis driven!!

How much data do you need?
• Rule of thumb. Min. 10 events per input feature
• 200 NSCLC patients
• 25% survival at two years
• 50 events
• 10 input features
• More is better
[Figure source: vitalflux.com (2017)]
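The rule of thumb can be written as a small calculation. In this sketch the rarer outcome class is counted as the events, which matches the slide's 50 events from 200 patients at 25% survival; the function name is illustrative. Taken strictly, 10 events per feature supports 5 features for this example, while 10 features corresponds to 5 events per feature.

```python
def max_input_features(n_patients, event_rate, events_per_feature=10):
    """Return (events, max features) under an events-per-feature rule of thumb.
    Assumption: the rarer outcome class is counted as the events."""
    events = round(n_patients * min(event_rate, 1.0 - event_rate))
    return events, events // events_per_feature


events, n_features = max_input_features(200, 0.25)
print(events, n_features)
print(max_input_features(200, 0.25, events_per_feature=5))
```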

Machine Learning
[Figure source: Jason Brownlee (2013)]

Considerations for machine learning
• Discrimination (AUC)
• Calibration (Brier)
• Interpretability (black box vs. transparent)
• Can it handle low data quality (of training and validation)?
• Can it be learned in a distributed setting?
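The two metrics named above can be computed in a few lines. A minimal sketch with made-up outcomes and predicted probabilities; a statistics library would normally be used in practice:

```python
def auc(outcomes, predictions):
    """Probability that a random event case gets a higher prediction than a
    random non-event case (ties count as half)."""
    pos = [p for y, p in zip(outcomes, predictions) if y == 1]
    neg = [p for y, p in zip(outcomes, predictions) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def brier_score(outcomes, predictions):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, predictions)) / len(outcomes)


# Made-up example: 1 = event occurred, prediction = model's probability.
outcomes = [1, 0, 1, 1, 0, 0]
predictions = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
print(f"AUC {auc(outcomes, predictions):.3f}, Brier {brier_score(outcomes, predictions):.3f}")
```

An AUC of 0.5 means no discrimination and 1.0 is perfect; for the Brier score, lower is better.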

Choose already
Simple and quick, but need complete data
• Logistic regression
• Support Vector Machines
Intuitive and can handle missing data
• Bayesian Networks
All can be learned in a distributed setting
Review pending
TRIPOD
https://www.tripod-statement.org/

Model validation
• Discrimination: Is the model able to classify the population into two or more groups with different observed survival?
• Calibration: Is the estimated probability of survival equal to the observed survival probability?
• Clinical usefulness: Are the data on which the model is based representative for my patient, and is the predicted outcome clinically relevant for my patient?
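Discrimination and calibration can be inspected together by splitting patients into risk groups and comparing the mean predicted probability with the observed frequency in each group. A minimal sketch with made-up data and a hypothetical helper name:

```python
def calibration_table(outcomes, predictions, n_groups=2):
    """Sort patients by predicted probability, split them into equal-sized
    groups, and return (mean predicted, observed frequency) per group."""
    pairs = sorted(zip(predictions, outcomes))
    size = len(pairs) // n_groups
    if size == 0:
        raise ValueError("more groups than patients")
    table = []
    for g in range(n_groups):
        start = g * size
        end = len(pairs) if g == n_groups - 1 else start + size
        group = pairs[start:end]
        mean_pred = sum(p for p, _ in group) / len(group)
        observed = sum(y for _, y in group) / len(group)
        table.append((round(mean_pred, 3), round(observed, 3)))
    return table


# Made-up example: 1 = patient alive at the time point, prediction = model output.
outcomes = [0, 0, 1, 0, 1, 1, 1, 0]
predictions = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
for mean_pred, observed in calibration_table(outcomes, predictions):
    print(f"predicted {mean_pred:.2f}, observed {observed:.2f}")
```

Groups with clearly different observed outcomes indicate discrimination; predicted values close to observed values indicate calibration.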

Laryngeal carcinoma model
• 994 MAASTRO patients
• 1990-2005
• www.predictcancer.org
• Input parameters
– Age
– Hemoglobin
– T-stage
– Radiotherapy Dose (Gy)
– Gender
– N+
– Tumor location
• Output parameters
– Overall survival
There is an app for that

Acknowledgements
• Fudan Cancer Center, Shanghai, China
• Varian, Palo Alto, CA, USA
• Siemens, Malvern, PA, USA
• RTOG, Philadelphia, PA, USA
• MAASTRO, Maastricht, Netherlands
• Policlinico Gemelli, Roma, Italy
• UH Ghent, Belgium
• UZ Leuven, Belgium
• Radboud, Nijmegen, Netherlands
• University of Sydney, Australia
• University of Michigan, Ann Arbor, USA
• Liverpool and Macarthur CC, Australia
• CHU Liege, Belgium
• Uniklinikum Aachen, Germany
• LOC Genk/Hasselt, Belgium
• Princess Margaret CC, Canada
• The Christie, Manchester, UK
• UH Leuven, Belgium
• State Hospital, Rovigo, Italy
• Illawarra Shoalhaven CC, Australia
• Catharina Zkh Eindhoven, Netherlands
• Philips, Eindhoven, Netherlands
More info on: www.predictcancer.org www.cancerdata.org www.eurocat.info www.mistir.info
Thank you for your attention

[Figure: ontology graph for a lung cancer patient record; node and relation labels as printed on the slide]
Patient and context
• Patient (ncit:C16960)
• Hospital (ncit:C19326) (uri=http://www.uhn.ca/PrincessMargaret)
• Non-small cell lung carcinoma (ncit:C2926)
• Sex (nci:C20197 and nci:C16576)
• Age at start RT (roo:100003)
• Age at diagnosis (roo:100002)
• Survival (roo:100063)
• Vital Status (ncit:C37987 or ncit:28554)
• FEV1 (nci:C38084)
• Percentage FEV1 (nci:C112376)
• ECOG performance status (nci:105722, nci:105723, nci:105725, nci:105726, nci:105727, nci:105728)
• Histology (nci:2926, nci:2852, nci:3780, nci:2929, nci:2852, nci:3915)
Staging and imaging
• Clinical TNM Finding (ncit:C48881)
• Generic T-stage 0-4 (ncit:48719), (ncit:48720), …, (ncit:48732)
• AJCC Edition (roo:100052) (roo:100053)
• Positive Lymph Node Stations (roo:100049)
• Diagnostic Procedure (ncit:C18020)
• Volume of primary tumor (roo:100054)
• RT Structure Set (sedi:RTStructureSet)
• MIA Version (mia:<version>)
Treatment
• Radiation Therapy (ncit:C15313) OR SBRT (ncit:C118286)
• Prescribed Radiotherapy Dose (roo:100013): Gray (uo:UO_0000134), Value (xsd:double)
• Delivered Radiotherapy Dose (roo:100012): Gray (uo:UO_0000134), Value (xsd:double)
• No. RT Fractions Per Treatment (roo:100356): Value (xsd:integer)
• No. RT Fractions Per Day (roo:100355): Value (xsd:integer)
• First radiotherapy fraction (roo:100058)
• Last radiotherapy fraction (roo:100059)
Toxicities and events
• Pneumonitis (ctcae:Pneumonitis)
• Fracture (ctcae:Fracture), Rib (fma:fma7574)
• Reaction (ctcae:Radiation_recall_reaction_dermatologic)
• Fatigue (ctcae:Fatigue)
• Dyspnea (ctcae:Dyspnea)
• Cough (ctcae:Cough)
• Anorexia (ctcae:Anorexia)
• Dysphagia (ctcae:Dysphagia)
• Hemoptysis (nci:C3094)
• Esophagitis (ctcae:Esophagitis)
• Pulmonary Fibrosis (ctcae:Pulmonary_fibrosis)
• Brachial plexopathy (ctcae:Brachial_plexopathy)
Units
• Year (uo:UO_0000036), Month (uo:UO_0000035), Liter (uo:UO_0000099), Percent (uo:UO_0000187), Count (uo:UO_0000189), Cubic centimeter (uo:UO_0000097), Gray (uo:UO_0000134)
Relations
• has_unit (roo:100027)
• has_clinical_t_stage (roo:100244)
• has_volume (roo:100315)
• at_date_time (roo:100041), linking events to DateTimeDescription nodes
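To make the graph above more concrete, the sketch below writes one patient's age at start of RT as linked-data triples in N-Triples form. The prefixed codes ncit:C16960, roo:100003, roo:100027 and uo:UO_0000036 come from the slide; the namespace URIs under example.org and the predicates ex:has_age and ex:has_value are placeholders assumed for illustration, not the real ontology IRIs.

```python
# Placeholder namespaces (assumptions), except rdf, which is the standard RDF namespace.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ncit": "http://example.org/ncit#",
    "roo": "http://example.org/roo#",
    "uo": "http://example.org/uo#",
    "ex": "http://example.org/data/",
}
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def expand(curie):
    """Turn a prefixed name such as 'roo:100003' into a bracketed IRI."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"


def age_triples(patient_id, age_years):
    """Triples stating that a patient has an age at start of RT, in years."""
    patient = f"ex:patient/{patient_id}"
    age = f"ex:patient/{patient_id}/age_at_start_rt"
    return [
        (patient, "rdf:type", "ncit:C16960"),       # Patient
        (patient, "ex:has_age", age),               # hypothetical predicate
        (age, "rdf:type", "roo:100003"),            # Age at start RT
        (age, "roo:100027", "uo:UO_0000036"),       # has_unit Year
        (age, "ex:has_value", f'"{age_years}"^^<{XSD_INTEGER}>'),  # hypothetical predicate
    ]


def to_ntriples(triples):
    """Serialize triples as N-Triples lines; literals are passed through as-is."""
    lines = []
    for s, p, o in triples:
        obj = o if o.startswith('"') else expand(o)
        lines.append(f"{expand(s)} {expand(p)} {obj} .")
    return "\n".join(lines)


print(to_ntriples(age_triples("42", 67)))
```

Data expressed this way can be queried identically at every site, which is what allows a distributed application to run without a human interpreting local column names.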

Tech used
• ETL (Pentaho, Talend)
• DICOM de-identification (CTP)
• RDF store & SPARQL endpoint (Blazegraph, Sesame)
• Ontology editing (Protégé)
• Ontology publishing (BioPortal)
• Database (PostgreSQL)
• Database to RDF (D2R)
• DICOM to RDF (SeDI)
• PACS (dcm4chee)
• Image processing pipeline (MIA-MAASTRO)
• Distributed application (Varian, Docker)
• Generic & Machine learning (Matlab, R, Java, Python)