Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra...


Page 1: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University

Machine Learning for Biomedicine at Scale

Jennifer Chayes

Technical Fellow and Managing Director

Microsoft Research New England, New York, and Montreal


Page 2

Machine Learning Problems in Biomedicine

Page 3

CRISPR-ML Team

Microsoft Research: Jennifer Listgarten, Nicolo Fusi, Melih Elibol, Luong Hoang, Jake Crawford, Kevin Gao

Broad Institute of MIT and Harvard: John Doench, Meagan Sullender, Mudra Hegde, Emma W. Vaimberg, Katherine Donovan, Ian Smith, David Root

Dana-Farber Cancer Institute / Broad: Zuzana Tothova

Washington University School of Medicine: Craig Wilen, Robert Orchard, Herbert W. Virgin

University of California Los Angeles: Michael Weinstein

Harvard / Massachusetts General Hospital: Benjamin Kleinstiver, Keith Joung, Alex Sousa

And thanks to: Carl Kadie (Microsoft Research), Maximilian Haeussler (UCSC)

Page 4

Page 5

Page 6

Credit: Feng Zhang and the McGovern Institute for Brain Research at MIT

CRISPR 101

Page 7

Matching the Target

Two problems:

1. Better “on-target” (knocking out the gene of interest).

2. Elimination/reduction of “off-target” effects.

Page 8

Solution paths:

• Improving laboratory methods.

• Machine learning.

Page 9

ML on-target predictive modeling for CRISPR

[Figure: along the DNA to be edited, a model $f(\vec{x})$ maps each candidate guide to an efficacy label $y$; the examples shown are labeled not effective, effective, very effective, not effective.]

Problem scale: ~40M possible guides = 20,000 human genes × 10 potential transcripts (RNA) per gene × 200 potential edit locations per transcript
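To make the problem scale concrete, here is a minimal sketch that reproduces the ~40M figure from the three factors quoted on the slide and shows one possible way to turn a guide into the feature vector fed to $f$. The one-hot encoding and the names `candidate_guides` and `one_hot` are illustrative assumptions; the slides do not say which features the actual model uses.

```python
from typing import List

# Factors quoted on the slide.
GENES = 20_000
TRANSCRIPTS_PER_GENE = 10
EDIT_LOCATIONS_PER_TRANSCRIPT = 200

NUCLEOTIDES = "ACGT"


def candidate_guides() -> int:
    """Number of candidate guides implied by the slide's three factors."""
    return GENES * TRANSCRIPTS_PER_GENE * EDIT_LOCATIONS_PER_TRANSCRIPT


def one_hot(guide: str) -> List[int]:
    """Hypothetical featurization: four indicator values per nucleotide."""
    features: List[int] = []
    for base in guide.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features
```

A 21-nucleotide sequence such as the guide shown later in the talk would become a vector of 84 indicator values under this encoding.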

Page 10

20% Improvements in Accuracy

50% Savings in Cost and Time Per Gene

Page 11

Strong uptake in biology community

• Recommended by independent studies (Haeussler et al. 2016).

• Adopted by startups and academics/researchers worldwide.

• Azure ML service: ~1,500 requests/day, doubling every 3 months.

• Web service: ~300 requests/day.

• Over 3,000 open-source software downloads.

https://www.microsoft.com/en-us/research/project/crispr/ (Azimuth)

Nature Biotechnology 2016 (320 citations to date)

Page 12

Off-target prediction: Much harder problem

• Need to scan across all 3 billion nucleotides of the genome

• Sparse training data: only measure genes with observable effect

• Combinatorial explosion arising from tolerance of mismatches

• Need to combine into one off-target score for the user

Guide/intended target: GGCTGCTTTACCCCCTGTGGG

Potential off-target with 2 mismatches: …CTATAACTGGCAGCTCTACCCGGTGTGGGACAAG…

Page 13

Combinatorial explosion (for 1 guide in 1 gene)

1 mismatch: 69 sites

2 mismatches: 2,277 sites

3 mismatches: 47,817 sites

4 mismatches: 717,255 sites

5 mismatches: 8,176,707 sites

1 full example

very sparsely sampled across different genes

Combinatorial explosion + Sparse training data => Cleverness needed!
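The site counts above match a simple closed form: choose which k positions differ, then pick one of the 3 other nucleotides at each. A minimal sketch follows. The site length of 23 is an inference from the numbers (it is the length that reproduces all five counts, and would correspond to a 20-nt guide plus a 3-nt PAM); the slide itself does not state it.

```python
from math import comb

SITE_LENGTH = 23   # assumption: inferred from the slide's counts, not stated there
ALTERNATIVES = 3   # each mismatched position can hold any of the 3 other bases


def mismatch_sites(k: int, length: int = SITE_LENGTH) -> int:
    """Number of distinct sequences differing from a site at exactly k positions."""
    return comb(length, k) * ALTERNATIVES ** k


for k in range(1, 6):
    print(f"{k} mismatch(es): {mismatch_sites(k):,} sites")
```

The growth is roughly an order of magnitude per additional mismatch, which is why exhaustive measurement is out of reach and training data is sparse.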

Page 14

Three-step approach to off-target modeling

Given a guide:

1. Filter genome-wide potential off-targets to assess (all is too many; reduce to ~2,000 most serious)

2. Score each off-target from (1) for activity using a machine learning predictive model.

3. Aggregate the scores from (2) into a single overall off-target score.
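The three steps above can be sketched as a small pipeline. Everything in this block is a placeholder chosen for illustration: ranking candidates by mismatch count stands in for "most serious", `score_site` is a toy formula rather than the learned model, and `aggregate` is a plain sum rather than the aggregation actually used by Elevation.

```python
from typing import Iterable, List


def mismatches(guide: str, site: str) -> int:
    """Count positions at which guide and candidate site differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(a != b for a, b in zip(guide, site))


def filter_sites(guide: str, sites: Iterable[str], keep: int = 2000) -> List[str]:
    """Step 1 (placeholder): keep the `keep` sites with the fewest mismatches."""
    return sorted(sites, key=lambda s: mismatches(guide, s))[:keep]


def score_site(guide: str, site: str) -> float:
    """Step 2 (placeholder): toy activity score, halved per mismatch."""
    return 0.5 ** mismatches(guide, site)


def aggregate(scores: Iterable[float]) -> float:
    """Step 3 (placeholder): combine per-site scores into one number."""
    return sum(scores)


def off_target_score(guide: str, sites: Iterable[str], keep: int = 2000) -> float:
    """Run filter, score, and aggregate for a single guide."""
    kept = filter_sites(guide, sites, keep)
    return aggregate(score_site(guide, s) for s in kept)
```

For example, `off_target_score("ACGT", ["ACGA", "AGGA", "TTTT"], keep=2)` keeps the two closest sites (1 and 2 mismatches) and returns 0.5 + 0.25 = 0.75.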

Nature Biomedical Engineering 2018

https://www.microsoft.com/en-us/research/project/crispr/ (Elevation)

(or go to Microsoft AI blog and search for “CRISPR”)

Page 15

2. Score each off-target for activity

[Figure: a machine learning model $f(\vec{g}, \vec{t})$ scores each guide-target pair for activity $y$; the examples shown are labeled high, low, medium, high, …]

Page 16

Off-target: two-step guide-target modelling

i. Build single-mismatch model.

ii. Build multi-mismatch model that combines the output from step (i).

2-mismatch example

Guide: GGCTGCTTTACCCGCTGTGGG

Target: …ACTGGCAGCTCTAACCCGTGTGGGACA…

Step (i), the single-mismatch model $f(\vec{g}, \vec{t})$ applied to each mismatch separately: $f(\vec{g}_1, \vec{t}_1)$ and $f(\vec{g}_2, \vec{t}_2)$

Step (ii), the multi-mismatch model applied to those outputs: $h(f(\vec{g}_1, \vec{t}_1), f(\vec{g}_2, \vec{t}_2)) = h(0.77, 0.21)$

Values shown on the slide: 0.77, 0.21, 0.43
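A minimal sketch of the two-step idea, assuming a lookup table in place of the learned single-mismatch model and a plain product in place of the learned multi-mismatch model. Both stand-ins, the `(position, guide base, target base)` encoding, and the function names are hypothetical; they only show how the outputs of step (i) become the inputs of step (ii).

```python
from typing import Dict, Iterable, List, Tuple

# (position, guide base, target base); a hypothetical encoding of one mismatch.
Mismatch = Tuple[int, str, str]


def find_mismatches(guide: str, target: str) -> List[Mismatch]:
    """List every position at which an aligned guide and target differ."""
    if len(guide) != len(target):
        raise ValueError("guide and target must be aligned to the same length")
    return [(i, g, t) for i, (g, t) in enumerate(zip(guide, target)) if g != t]


def single_mismatch_score(m: Mismatch, table: Dict[Mismatch, float],
                          default: float = 1.0) -> float:
    """Step (i) stand-in: per-mismatch score looked up from a table."""
    return table.get(m, default)


def combine(scores: Iterable[float]) -> float:
    """Step (ii) stand-in: multiply the single-mismatch scores together."""
    result = 1.0
    for s in scores:
        result *= s
    return result


def guide_target_score(guide: str, target: str,
                       table: Dict[Mismatch, float]) -> float:
    """Score a guide-target pair by combining its single-mismatch scores."""
    per_mismatch = [single_mismatch_score(m, table)
                    for m in find_mismatches(guide, target)]
    return combine(per_mismatch)
```

The benefit of this decomposition is that the single-mismatch model can be trained on the comparatively well-sampled 1-mismatch data, while the combiner only has to learn how a handful of numbers interact.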

Page 18

Evaluation on held out data

[Figure: performance plotted against a knob of Spearman weights.]

• Evaluation depends on usage scenario

• May want to weight off-targets differently depending on their impact

• Our approach (Elevation) outperforms others, no matter what the weighting

Has now also been validated as state-of-the-art on two data sets.
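One way to weight off-targets differently in a rank-based evaluation is a weighted Spearman correlation, computed here as a weighted Pearson correlation on ranks. This is a generic sketch under that assumption; the slides do not spell out how the weights enter the metric, and the function names are illustrative.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def weighted_pearson(x: Sequence[float], y: Sequence[float],
                     w: Sequence[float]) -> float:
    """Pearson correlation with per-sample weights (undefined if x or y is constant)."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / sqrt(vx * vy)


def weighted_spearman(pred: Sequence[float], truth: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted Pearson correlation of the ranks of predictions and measurements."""
    return weighted_pearson(average_ranks(pred), average_ranks(truth), weights)
```

With uniform weights this reduces to the ordinary Spearman correlation; shifting weight toward the most active off-targets rewards models that rank those correctly.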

Page 19

Search Engine for CRISPR gene edits

[Figure: Azure Cloud architecture. Components shown: Notebooks, Storage, Input Datasets (FC: Flow Cytometry; RAS: Resistance Assays), Training, Model, All possible Guide RNA, Scoring, Pre-computations.]

Problem scale: ~80B possible guides = ~2,000 mismatches × (on-target) 20,000 genes × 10 potential transcripts (RNA) per gene × 200 potential edit locations per transcript

Pre-computation ran in 18 days on 16k cores (~7M CPU-hours)

Results stored in Azure Tables with a front-end web interface

Search uses hashing to return ranked results instantly
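The numbers and the lookup idea on this slide can be checked with a short sketch. The arithmetic uses only figures from the slide; the index is an assumption about what "uses hashing" means, with a Python `dict` (a hash table) standing in for Azure Tables and `build_index` and `search` as hypothetical names.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]  # (off-target site, score), best first

# Scale figures quoted on the slide.
PAIRS = 2_000 * 20_000 * 10 * 200   # off-targets per guide x candidate guides
CPU_HOURS = 18 * 24 * 16_000        # days x hours per day x cores


def build_index(scored: Dict[str, Dict[str, float]]) -> Dict[str, Ranked]:
    """Pre-compute, per guide, its off-targets sorted by descending score."""
    return {
        guide: sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
        for guide, hits in scored.items()
    }


def search(index: Dict[str, Ranked], guide: str, top: int = 10) -> Ranked:
    """Constant-time hash lookup; the ranking work was done at build time."""
    return index.get(guide, [])[:top]
```

Because all scoring and sorting happens during pre-computation, a query costs one key lookup regardless of how many guide/off-target pairs were evaluated.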

Page 20

CRISPR.ML – Putting it all together

Source code for everything on GitHub

Page 21

Usage per Day: 15.6K guides queried, 9.6M off-targets returned

Users: Academia, Research Inst., Pharma/Life Sci Co., Hospital

Page 22

Reading the immune system to diagnose disease

Slides courtesy of Jonathan Carlson, Ph.D.

Microsoft Research Healthcare NExT

Cloud-Scale Immunomics

T-cells killing a cancer cell. Credit: Alex Ritter, Jennifer Lippincott Schwartz and Gillian Griffiths, National Institutes of Health

Page 23

Learning to decode the immune system to diagnose disease

Blood Sample → Immunosequencing → Machine Learning → Empowered care

Page 28

The biology

Self-reactive

Page 29

The machine learning

$f_2 : \{\mathrm{TCR}\} \to \{\mathrm{disease}\}$

Page 30

The machine learning

$f_2 : \{\mathrm{TCR}\} \to \{\mathrm{disease}\}$

$f_1 : (\mathrm{TCR}, \mathrm{MHC}, \mathrm{peptide}) \to \mathbb{R}$ (binding energy)

Page 31

The data

$f_2 : \{\mathrm{TCR}\} \to \{\mathrm{disease}\}$

$f_1 : (\mathrm{TCR}, \mathrm{MHC}, \mathrm{peptide}) \to \mathbb{R}$ (binding energy)

pair1M TCR per clinical sample

MIRA: Pairwise binding data for 1k antigens against 1M TCR
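The two learning problems differ mainly in their input and output types, which a short typed sketch makes explicit. The implementations here are toy stand-ins invented for illustration (a table lookup for $f_1$, a signature-overlap rule for $f_2$); the sequences, disease labels, and function names are hypothetical and say nothing about the models actually used.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

TCR = str
MHC = str
Peptide = str

# f1: (TCR, MHC, peptide) -> real-valued binding energy
BindingModel = Callable[[TCR, MHC, Peptide], float]
# f2: set of TCRs from one sample -> set of diseases
DiagnosisModel = Callable[[Iterable[TCR]], Set[str]]


def make_f1(measured: Dict[Tuple[TCR, MHC, Peptide], float],
            default: float = 0.0) -> BindingModel:
    """Toy f1: return a measured value when available, else a default."""
    def f1(tcr: TCR, mhc: MHC, peptide: Peptide) -> float:
        return measured.get((tcr, mhc, peptide), default)
    return f1


def make_f2(signatures: Dict[str, Set[TCR]], min_hits: int = 1) -> DiagnosisModel:
    """Toy f2: report a disease when enough of its signature TCRs are present."""
    def f2(repertoire: Iterable[TCR]) -> Set[str]:
        seen = set(repertoire)
        return {disease for disease, tcrs in signatures.items()
                if len(seen & tcrs) >= min_hits}
    return f2
```

In this framing, a sample's repertoire is the input to $f_2$, while pairwise binding measurements are the kind of supervision a model of $f_1$ would need.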

Page 32

Toward a universal diagnostic

T-cells are nature’s universal diagnostic machine

Will generate billions of training samples per day

Cloud-scale machine learning to translate T-cells to antigens

Empowered care

Page 33