Automating Machine Learning for Prevention Research

51
Automating Machine Learning for Prevention Research Jason H. Moore, PhD, FACMI Edward Rose Professor of Informatics Director, Institute for Biomedical Informatics Senior Associate Dean for Informatics Perelman School of Medicine University of Pennsylvania Philadelphia, PA, USA upibi.org epistasis.org [email protected]

Transcript of Automating Machine Learning for Prevention Research

Page 1: Automating Machine Learning for Prevention Research

Automating Machine Learning for

Prevention Research

Jason H. Moore, PhD, FACMI

Edward Rose Professor of Informatics

Director, Institute for Biomedical Informatics

Senior Associate Dean for Informatics

Perelman School of Medicine

University of Pennsylvania

Philadelphia, PA, USA

upibi.org epistasis.org [email protected]

Page 2: Automating Machine Learning for Prevention Research

• Bioinformatics

• Computational Biology

Basic Science

• Clinical Informatics

• Clinical Research Informatics

• Consumer Health InformaticsClinical

• Public Health InformaticsPopulation

Biomedical Informatics

American Medical Informatics Association (AMIA)

Page 3: Automating Machine Learning for Prevention Research

Golden Era of Biomedical InformaticsMoore and Holmes, BioData Mining (2016)

Why?

• Big data

• High-performance computing

• Talented trainees

• Government recognition

• Industry recognition

• Patient recognition

• University investment

Page 4: Automating Machine Learning for Prevention Research

Golden Era of Biomedical Informatics

What next?

• Artificial intelligence

• Biomedical devices

• Data integration

• Data science

• Informatician scientists

• Machine Learning

• No-boundary thinking

• Visual analytics

Page 5: Automating Machine Learning for Prevention Research

DNA Repair

Cell Cycle

Oxidative

Signaling

Methylation

Metabolism

RISK OF

BLADDER

CANCER

Adapted from Sing et al., (2003)

ARSENIC

EXERCISE

STRESS

DIET

SMOKING

SNP

Page 6: Automating Machine Learning for Prevention Research

Data Science Pipeline

D DI FS FC ML I A

*-

+

V

Page 7: Automating Machine Learning for Prevention Research

Big Data

D

http://www.kdnuggets.com/

Page 8: Automating Machine Learning for Prevention Research

Data Integration

DI Relational Database Graph Database

Michael Hunger – Neo4j

Page 9: Automating Machine Learning for Prevention Research

Feature Selection

FS

Ritchie – PLoS Genetics (2013)

Sohangir – J Soft Engin App (2013)

Page 10: Automating Machine Learning for Prevention Research

Feature Construction

FC

1 2 3

4 5 6

7 8 9

1 2 3 4 5 6 7 8 9

0 1 2

0

1

2

X1

X2

Z1

0 0 0

0 1 1

0 1 1

0 1 2

0

1

2

X3

X4 0 1

Z2 1 0 1

0 1 0

1 0 1

0 1 2

0

1

2

X5

X6 0 1

Z3

Page 11: Automating Machine Learning for Prevention Research

Machine Learning

ML

*-

+

http://suanfazu.com/

Page 12: Automating Machine Learning for Prevention Research

Statistical and Biological Interpretation

I

Page 13: Automating Machine Learning for Prevention Research

Biological Validation

V

Talbot, Zebrafish (2014) dev.biologists.org

Page 14: Automating Machine Learning for Prevention Research

Clinical Application

A

M.D. Anderson

Page 15: Automating Machine Learning for Prevention Research

Can the data science pipeline

construction process be automated?

Page 16: Automating Machine Learning for Prevention Research

Automated Machine Learning (AutoML)

D DI FS FC ML I A

*-

+

V

Page 17: Automating Machine Learning for Prevention Research

Tree-Based Pipeline Optimization Tool (TPOT)Towards Automated Data Science

https://github.com/epistasislab/tpot/

Dr. Randal Olson (postdoc)

Page 18: Automating Machine Learning for Prevention Research

Tree-Based Pipeline Optimization Tool (TPOT)

1) ML code base

2) Pipeline representation

3) Optimization algorithm

4) Overfitting control

Page 19: Automating Machine Learning for Prevention Research

ML code base

Page 20: Automating Machine Learning for Prevention Research

http://scikit-learn.org/

Page 21: Automating Machine Learning for Prevention Research

Building Blocks of a Pipeline

Variance

Univariate

Recursive

Importance

PCA

SVD

Factor

ICA

Feature Selection Feature ConstructionFeature Processing

Normalization

Encoding

Polynomial

Scaling

Page 22: Automating Machine Learning for Prevention Research

Building Blocks of a Pipeline

OLS

LDA

SVM

kNN

DT

RF

Machine Learning

Page 23: Automating Machine Learning for Prevention Research

Example Data Science Pipeline

Importance

PCA

Polynomial

Data

RF DT

SVM

kNN

Page 24: Automating Machine Learning for Prevention Research

Pipeline representation

Page 25: Automating Machine Learning for Prevention Research

TPOT Expression Tree

TPOT Pipeline

Data

DT

SVMkNN

PCAPoly

Imp

RF

DT

RF

Imp Poly

PCA kNN

SVM

Page 26: Automating Machine Learning for Prevention Research

Pipeline optimization

Page 27: Automating Machine Learning for Prevention Research

Children

ParentsDT

SVMkNN

PCAPoly

RF

Poly RF

Uni

Page 28: Automating Machine Learning for Prevention Research

Children

ParentsDT

SVMkNN

PCAPoly

RF

Poly RF

Uni

DT

RFkNN

PCAUni

RF

Var SVM

Poly

Page 29: Automating Machine Learning for Prevention Research

Pipeline overfitting

Page 30: Automating Machine Learning for Prevention Research

Data

75%

25%

75%

25%

Training

Testing

Validation

Cross-Validation

Page 31: Automating Machine Learning for Prevention Research

Pareto Optimization

Err

or

Complexity

Page 32: Automating Machine Learning for Prevention Research

Simulation Example

Page 33: Automating Machine Learning for Prevention Research

Olson et al., LNCS (2016)

Page 34: Automating Machine Learning for Prevention Research

Bladder Cancer Example

Page 35: Automating Machine Learning for Prevention Research

Application to Bladder CancerAndrew et al., Carcinogenesis (2006, 2008)

• 334 bladder cancer cases

• 580 controls

• From the state of NH

• Polymorphisms in DNA repair enzyme genes

– XPD

– APE

– XPC

– XRCC1

– XRCC3

• Pack-years of smoking, age, gender

© Jason H. Moore

Page 36: Automating Machine Learning for Prevention Research

© Jason H. Moore

Training Accuracy 0.66

Testing Accuracy 0.64

OR = 3 (95% CI 2.0-3.4)

P < 0.001

Page 37: Automating Machine Learning for Prevention Research

TPOT Building Blocks

Variance

Univariate

Recursive

Importance

PCA

SVD

Factor

ICA

Feature Selection Feature ConstructionFeature Processing

Normalization

Encoding

Polynomial

Scaling

MDR

DT NB

Machine Learning

Page 38: Automating Machine Learning for Prevention Research

MDRData DTMDR

2

48

2

22

XPD.312 XPD.751 pack.yrs age > 50

Testing Accuracy 0.64

Page 39: Automating Machine Learning for Prevention Research
Page 40: Automating Machine Learning for Prevention Research
Page 41: Automating Machine Learning for Prevention Research

Scaling for big data

Page 42: Automating Machine Learning for Prevention Research

An expert knowledge feature selector

EKKnowledge

ReliefF

SURF

SURF*

MultiSURF

Subset Size

k = 5

k = 10

k = 25

k = 50

Page 43: Automating Machine Learning for Prevention Research

An expert knowledge feature selector

Sohn et al., LNCS (2017)

Page 44: Automating Machine Learning for Prevention Research

Automated Machine Learning (AutoML)

D DI FS FC ML I A

*-

+

V

Page 45: Automating Machine Learning for Prevention Research

TPOT: The Data Science Assistant

DT

RFkNN

PCAUni

RF

Var SVM

Poly

Datanami.com

Page 46: Automating Machine Learning for Prevention Research

Data Science should be

accessible and easy

Page 47: Automating Machine Learning for Prevention Research
Page 48: Automating Machine Learning for Prevention Research
Page 49: Automating Machine Learning for Prevention Research

Penn IBI Idea FactoryConnecting Researchers with Ideas

Page 50: Automating Machine Learning for Prevention Research
Page 51: Automating Machine Learning for Prevention Research

Acknowledgments• Graduate Students and Postdocs

– Brett Beaulieu-Jones, Brian Cole, Molly Hall, Bill LaCava, Randy Olson,

Alena Orlenko, Patryk Orzechowski, Nadia Penrod, Elizabeth Piette,

Andrew Sohn, Ryan Urbanowicz

• Staff– Elisabetta Manduchi, Pete Schmitt

• NIH grants R01 AI11679, R01 LM009012, R01

LM010098

[email protected]

• epistasis.org, epistasisblog.org

• twitter.com: @moorejh