Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra...
Transcript of Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra...
![Page 1: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/1.jpg)
Machine Learning for Biomedicine at Scale
Jennifer Chayes
Technical Fellow and Managing Director
Microsoft Research New England, New York, and Montreal
1
![Page 2: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/2.jpg)
Machine Learning Problems in Biomedicine
![Page 3: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/3.jpg)
CRISPR-ML Team
Meagan Sullender
Mudra HegdeEmma W. VaimbergKatherine DonovanIan SmithDavid Root
Washington University School of Medicine
John Doench
Broad Institute of MIT and Harvard
University of California Los Angeles
Zuzana Tothova
Michael Weinstein
And thanks to: Carl Kadie (Microsoft Research), Maximilian Haeussler (UCSC)
Jennifer Listgarten
Microsoft Research
Dana-Farber Cancer Institute / BroadCraig WilenRobert OrchardHerbert W. Virgin
Melih ElibolLuong HoangJake CrawfordKevin Gao
Nicolo Fusi
Harvard / Massachusetts General HospitalBenjamin KleinstiverKeith YoungAlex Sousa
![Page 4: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/4.jpg)
4
![Page 5: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/5.jpg)
![Page 6: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/6.jpg)
Credit: Feng Zhang and the McGovern Institute for Brain Research at MIT
CRISPR 101
![Page 7: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/7.jpg)
Two problems:
1. Better “on-target” (knocking out the gene of interest):
2. Elimination/reduction of “off-target” effects:
Matching the Target
* * * *… …
… …?? ?
![Page 8: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/8.jpg)
Two problems:
1. Better “on-target” (knocking out the gene of interest):
2. Elimination/reduction of “off-target” effects:
Matching the Target
* * * *… …
… …?? ?
Solution paths:
• Improving laboratory methods.
•Machine learning.
![Page 9: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/9.jpg)
ML on-target predictive modeling for CRISPR
…
𝑦 = not effective
𝑦 = effective
𝑦 = very effective
𝑦 = not effective
𝑓( Ԧ𝑥)
𝑓( Ԧ𝑥)
𝑓( Ԧ𝑥)
𝑓( Ԧ𝑥)DNA to be edited
Problem scale: ~40M possible guides 20,000 human genes x 10 potential transcripts (RNA) per gene x 200 potential edit locations per transcript
![Page 10: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/10.jpg)
20%Improvements in
Accuracy
50%Savings in Cost and
Time Per Gene
![Page 11: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/11.jpg)
Strong uptake in biology community
• Recommended by independent studies (Haeussler et al. 2016).• Adopted by startups and academics/researchers worldwide.• Azure ML service ~1500 requests/day, doubling every 3 months• Web service ~300 requests/day.• Over 3000 open-source software downloads.
https://www.microsoft.com/en-us/research/project/crispr/(Azimuth)
Nature Biotechnology 2016 (320 citations to date)
![Page 12: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/12.jpg)
Off-target prediction: Much harder problem
• Need to scan across all 3 billion nucleotides of the genome
• Sparse training data: only measure genes with observable effect
• Combinatorial explosion arising from tolerance of mismatches
• Need to combine into one off-target score for the user
GGCTGCTTTACCCCCTGTGGGguide/intended target
…CTATAACTGGCAGCTCTACCCGGTGTGGGACAAG…potential off-target with 2 mismatches
* * * *… …
![Page 13: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/13.jpg)
Combinatorial explosion (for 1 guide in 1 gene)
1 mismatch: 69 sites
2 mismatches: 2277 sites
3 mismatches: 47,817 sites
4 mismatches: 717,255 sites
5 mismatches: 8,176,707 sites
1 full example
very sparsely sampled across different genes
Combinatorial explosion + Sparse training data => Cleverness needed!
![Page 14: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/14.jpg)
Three step-approach to off-target modeling
Given a guide:
1. Filter genome-wide potential off-targets to assess (all is too many; reduce to ~2,000 most serious)
2. Score each off-target from (1) for activity using a machine learning predictive model.
3. Aggregate the scores from (2) into a single overall off-target score.
Nature Biomedical Engineering 2018https://www.microsoft.com/en-us/research/project/crispr/ (Elevation)
(or go to Microsoft AI blog and search for “CRISPR”)
![Page 15: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/15.jpg)
2. Score each off-target for activity
…
𝑦 = high
𝑦 = low
𝑦 = medium
𝑦 = high
…
𝑓( Ԧ𝑔, Ԧ𝑡)
𝑓( Ԧ𝑔, Ԧ𝑡)
𝑓( Ԧ𝑔, Ԧ𝑡)
𝑓( Ԧ𝑔, Ԧ𝑡)
……
machine
learning
model
![Page 16: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/16.jpg)
ℎ(0.77, 0.21)
Off-target: two-step guide-target modelling
i. Build single-mismatch model.
ii. Build multi-mismatch model that combines the output from step (i).
GGCTGCTTTACCCGCTGTGGG
…ACTGGCAGCTCTAACCCGTGTGGGACA…
2-mismatch example
GGCTGCTTTACCCGCTGTGGG
…ACTGGCAGCTCTAACCCGTGTGGGACA…
GGCTGCTTTACCCGCTGTGGG
…ACTGGCAGCTCTAACCCGTGTGGGACA…
0.77
0.210.43
𝑓( Ԧ𝑔, Ԧ𝑡)
g(𝑓 𝑡1, Ԧ𝑔1 , 𝑓(𝑡2, Ԧ𝑔2))
𝑓 Ԧ𝑔1, 𝑡1
𝑓( Ԧ𝑔2, 𝑡2)
![Page 17: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/17.jpg)
ℎ(0.77, 0.21)
i. Build single-mismatch model.
ii. Build multi-mismatch model that combines the output from step (i).
GGCTGCTTTACCCGCTGTGGG
…ACTGGCAGCTCTAACCCGTGTGGGACA…
2-mismatch example
GGCTGCTTTACCCGCTGTGGG
…ACTGGCAGCTCTAACCCGTGTGGGACA…
GGCTGCTTTACCCGCTGTGGG
…ACTGGCAGCTCTAACCCGTGTGGGACA…
0.77
0.210.43
𝑓( Ԧ𝑔, Ԧ𝑡)
ℎ(𝑓 Ԧ𝑔1, 𝑡1 , 𝑓( Ԧ𝑔2, 𝑡2, ))
𝑓 Ԧ𝑔1, 𝑡1
𝑓( Ԧ𝑔2, 𝑡2)
Off-target: two-step guide-target modelling
![Page 18: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/18.jpg)
Evaluation on held out data
knob of Spearman weights
Has now also been validated as state-of-the-art on two data sets.
Perf
orm
ance
• Evaluation depends on usage scenario•May want to weight off-
targets differently depending on their impact•Our approach (Elevation)
outperforms others, no matter what the weighting
![Page 19: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/19.jpg)
Azure Cloud
Notebooks
Storage
FC - Flow Cytometry
RAS – Resistance Assays
Input DatasetsTraining
Model
All possibleGuide RNA
Scoring
Pre-computations
Search Engine for CRISPR gene edits
Problem scale: ~80B possible guides ~2000 mismatches x
(on-target) 20,000 genes x 10 potential transcripts (RNA) per gene x 200 potential edit locations per transcript
Pre-computation ran in 18 days on 16k cores (~7M CPU-hours)Results stored in Azure Tables with a front-end web interface
Search uses hashing to return ranked results instantly
![Page 20: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/20.jpg)
CRISPR.ML – Putting it all together
Source code for everything on GitHub
![Page 21: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/21.jpg)
15.6K guides queried 9.6M off-targets
returned
Usage per Day Users
Academia Research Inst.
Pharma/Life Sci Co. Hospital
![Page 22: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/22.jpg)
Reading the immune system to diagnose disease
Slides courtesy of Jonathan Carlson, Ph.D.
Microsoft Research Healthcare NExT
Cloud-Scale Immunomics
T-cells killing a cancer cellCredit: Alex Ritter, Jennifer Lippincott Schwartz
and Gillian Griffiths, National Institutes of Health
![Page 23: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/23.jpg)
Blood Sample Immunosequencing Machine Learning Empowered care
Learning to decode the immune system to diagnose disease
![Page 24: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/24.jpg)
Blood Sample Immunosequencing Machine Learning Empowered care
Learning to decode the immune system to diagnose disease
![Page 25: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/25.jpg)
Blood Sample Immunosequencing Machine Learning Empowered care
Learning to decode the immune system to diagnose disease
![Page 26: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/26.jpg)
Immunosequencing Machine Learning Empowered care
Learning to decode the immune system to diagnose disease
Blood Sample Machine Learning
![Page 27: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/27.jpg)
Immunosequencing Machine Learning Empowered care
Learning to decode the immune system to diagnose disease
Blood Sample
![Page 28: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/28.jpg)
The biology
Self-reactive
![Page 29: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/29.jpg)
The machine learning
𝑓2 {𝑇𝐶𝑅} → {𝑑𝑖𝑠𝑒𝑎𝑠𝑒}
![Page 30: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/30.jpg)
The machine learning
𝑓2 {𝑇𝐶𝑅} → {𝑑𝑖𝑠𝑒𝑎𝑠𝑒}
𝑓1 𝑇𝐶𝑅,𝑀𝐻𝐶, 𝑝𝑒𝑝𝑡𝑖𝑑𝑒 → ℝ (binding energy)
![Page 31: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/31.jpg)
The data
𝑓2 {𝑇𝐶𝑅} → {𝑑𝑖𝑠𝑒𝑎𝑠𝑒}
𝑓1 𝑇𝐶𝑅,𝑀𝐻𝐶, 𝑝𝑒𝑝𝑡𝑖𝑑𝑒 → ℝ (binding energy)
pair1M TCR per clinical sample
MIRAPairwise binding data for 1k antigens against 1M TCR
![Page 32: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/32.jpg)
Toward a universal diagnostic
T-cells are nature’s universal diagnostic machine
Will generate billions of training samples per day
Cloud-scale machine learning to translate T-cells to antigens
Empowered care
![Page 33: Machine Learning for Biomedicine at Scale · Machine Learning for Biomedicine at Scale ... Mudra Hegde Emma W. Vaimberg Katherine Donovan Ian Smith David Root Washington University](https://reader033.fdocuments.in/reader033/viewer/2022042210/5eaf4cfe1e3756503c401a25/html5/thumbnails/33.jpg)