Interpretable AI
Dimitris Bertsimas
MIT
September 2018
Interpretable AI
- AI, and especially deep learning, has made significant progress in computer vision, automatic translation, and voice recognition, with effects that are felt across society.
- Deep learning suffers from a lack of interpretability.
- A driverless car is involved in an accident with loss of life. Who is at fault? Can society tolerate not understanding?
- A student is not selected for freshman admissions. Is it an adequate response that an algorithm made the decision?
- Interpretability matters.
Goal: Develop AI algorithms that are interpretable and provide state-of-the-art performance.

Figure: the same patient profile (age: 30, gender: male, albumin: 2.8 g/dL, sepsis: none, INR: 1.1, diabetic: yes, ...) fed to a black-box model, which outputs "Mortality risk: 26.4%", and to an interpretable model, a small decision tree: Age < 25? yes → 13.2%; no → Male? yes → 26.4%, no → 18.3%.
Leo Breiman, On Interpretability: Trees Receive an A+

- Leo Breiman et al. (1984) introduced CART, a heuristic approach to make predictions (either binary or continuous) from data.
- Widespread use in academia and industry (~37,000 citations!).
- The Iris flower data set was introduced by Fisher in 1936 to classify flowers based on four measurements: petal width/length and sepal width/length.
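As a concrete aside (ours, not from the slides), a CART-style tree on Fisher's Iris data takes only a few lines with scikit-learn, and a shallow depth keeps it fully readable:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Cap the depth so the fitted tree stays human-readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

names = ["sepal length", "sepal width", "petal length", "petal width"]
print(export_text(tree, feature_names=names))
print("training accuracy:", tree.score(X, y))
```

This is greedy CART, not the Optimal Trees discussed later in the talk; it is shown only to fix ideas.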
The Iris data set
Figure: scatter plot of Sepal Width (2.0–4.5) vs. Sepal Length (4–8), colored by species (setosa, virginica).
The Tree Representation
Figure: the tree for the Iris data. The root splits on Sepal length < 5.75 vs. ≥ 5.75, and one branch splits further on Sepal width < 2.7 vs. ≥ 2.7; the leaves are labeled V (virginica) and S (setosa).
Leo again ...

- CART is fundamentally greedy: it makes a series of locally optimal decisions, but the final tree could be far from optimal.
- "Finally, another problem frequently mentioned (by others, not by us) is that the tree procedure is only one-step optimal and not overall optimal. ... If one could search all possible partitions ... the two results might be quite different."
- "We do not address this problem. At this stage of computer technology, an overall optimal tree growing procedure does not appear feasible for any reasonably sized data set."
- On interpretability, trees receive an A+.
B.+Dunn, “Optimal Trees”, Machine Learning, 2017
- Use Mixed-Integer Optimization (MIO) and local search to consider the entire decision-tree problem at once, and solve it to obtain the Optimal Tree for both regression and classification.
- The algorithms scale to n = 1,000,000, p = 10,000.
- Motivation: MIO is the natural form for the Optimal Tree problem:
  - Decisions: which variable to split on, which label to predict for a region.
  - Outcomes: which region a point ends up in, whether a point is correctly classified.
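A toy illustration of why optimizing the whole tree at once can beat greedy growing (our example, with brute-force search standing in for the MIO/local-search machinery): on XOR-like data, no single split improves on chance, so a greedy learner stalls, while a search over complete depth-2 trees finds a perfect classifier.

```python
import itertools

# XOR-like data: class = x0 XOR x1; no single axis-aligned split helps.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
y = [a ^ b for a, b in X]

def accuracy(pred):
    return sum(p == t for p, t in zip(pred, y)) / len(y)

def best_single_split():
    # Greedy first step: best accuracy achievable with one split.
    best = 0.0
    for f, thr in itertools.product([0, 1], [0.5]):
        for left_lbl, right_lbl in itertools.product([0, 1], repeat=2):
            pred = [left_lbl if x[f] <= thr else right_lbl for x in X]
            best = max(best, accuracy(pred))
    return best

def best_depth2_tree():
    # Exhaustive search over depth-2 trees: a root split, one split per
    # child, and a label for each of the four leaves.
    best = 0.0
    splits = list(itertools.product([0, 1], [0.5]))
    for root, left, right in itertools.product(splits, repeat=3):
        for labels in itertools.product([0, 1], repeat=4):
            pred = []
            for x in X:
                if x[root[0]] <= root[1]:
                    pred.append(labels[0] if x[left[0]] <= left[1] else labels[1])
                else:
                    pred.append(labels[2] if x[right[0]] <= right[1] else labels[3])
            best = max(best, accuracy(pred))
    return best

print(best_single_split())  # 0.5: no single split beats chance on XOR
print(best_depth2_tree())   # 1.0: the globally searched depth-2 tree is perfect
```

This mirrors Breiman's "one-step optimal, not overall optimal" remark quoted on the previous slide.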
OCT-H
Figure: the Iris scatter plot (Sepal Width 2.0–4.5 vs. Sepal Length 4–8), now with all three species (setosa, versicolor, virginica), classified by an OCT-H.
Performance of Optimal Classification Trees
- Average out-of-sample accuracy across 60 real-world datasets:

Figure: out-of-sample accuracy (70–85%) vs. maximum tree depth (2–10) for CART, OCT, and OCT-H.
Performance of Optimal Classification Trees
- Average out-of-sample accuracy across 60 real-world datasets:

Figure: out-of-sample accuracy (70–85%) vs. maximum tree depth (2–10) for CART, OCT, OCT-H, Random Forest, and XGBoost.
How do trees compare with Deep Learning?
- B. + Mazumder + Sobiesk, 2018.
- Theorem: Optimal classification and regression trees with hyperplanes are as powerful as classification and regression (feedforward, convolutional, and recurrent) neural networks; that is, given a NN we can find an OCT-H (or ORT-H) with the same in-sample performance.
- Out-of-sample performance is very comparable between NNs and OCT-Hs on 7 popular data sets.
- Why is this result important?
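The in-sample half of this equivalence is easy to demonstrate mechanically: an unrestricted-depth tree can always reproduce a trained network's predictions on the training set (assuming deterministic predictions, so duplicate inputs get consistent labels). A sketch of this "mimic" construction using scikit-learn, with an MLP as a stand-in network; this is our illustration, not the paper's actual procedure:

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A small feedforward network, standing in for an arbitrary NN.
nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)

# Fit an unrestricted-depth tree to the network's own in-sample predictions.
mimic = DecisionTreeClassifier(random_state=0).fit(X, nn.predict(X))

agreement = (mimic.predict(X) == nn.predict(X)).mean()
print(agreement)  # 1.0: identical in-sample behavior
```

The theorem is of course stronger: it concerns trees with hyperplane splits (OCT-H) of bounded structure, not memorization by depth.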
Surgical Outcomes Prediction - used at MGH
Figure: Decision tree for predicting any complication post surgery.
Surgical Outcomes Prediction - App
Figure: Surgical outcome prediction questionnaire based on Optimal Trees.
Mortality Prediction in Cancer Patients - used at Dana-Farber
Figure: Decision tree for predicting 60-day mortality in breast cancer patients.
Mortality Prediction in Cancer Patients - App
Figure: Cancer mortality prediction questionnaire based on Optimal Trees.
Saving Lives in Liver Transplantation
Using OCT, we designed a new system for prioritizing liver transplantation recipients that averts 400 deaths per year in the US compared to current practice.
Critical Brain Injury
Using OCT, we can identify critical brain injury in children using 40% fewer CT scans than CART, while missing only 5 children out of 337 (instead of 9 for CART).
Designing financial plans from transactions
- Using OCT, we can accurately predict whether a person is likely to buy a house or open an educational account based on transactional data (payroll, credit cards, ...).
- Based on these predictions, we create a financial plan that maximizes the probability of achieving those goals.
Optimal Prescriptive Trees
- B. + Dunn + Mundru, Optimal Prescriptive Trees, 2018.
- Consider a healthcare setting (personalized medicine; many other applications exist).
- Historical observational data (X_i, z_i, Y_i), i = 1, ..., n:
  - X_i ∈ R^d: features of patient i.
  - z_i ∈ {1, 2, ..., m}: treatment assigned to patient i by the doctor.
  - Y_i ∈ R: outcome recorded for patient i (the lower the better).
- Question: when a new patient arrives with features x, which treatment τ(x) ∈ {1, 2, ..., m} is best for this person?
Can we use Machine Learning?
- For each patient x_i: if we knew the best treatment (the one among the m options leading to the best outcome), this would be a standard multiclass classification problem.
- We could learn a classifier that predicts a label in {1, ..., m} given x ∈ R^d from this historical data.
- KEY CHALLENGE: we only observe the outcome under z_i (the treatment historically given), not under the others.
- We do not know what would have happened ("counterfactuals") to patient i under the other (m − 1) treatments.
Optimal Prescriptive Trees
- Objective: determine τ(x) to minimize
  μ · (mean outcome) + (1 − μ) · (prediction error), with 0 < μ < 1:

  μ Σ_{i=1}^{n} ( y_i I[τ(x_i) = z_i] + Σ_{t ≠ z_i} ŷ_i(t) I[τ(x_i) = t] ) + (1 − μ) Σ_{i=1}^{n} (y_i − ŷ_i(z_i))²,

  where ŷ_i(t) denotes the predicted (counterfactual) outcome of patient i under treatment t.
- Need to predict counterfactuals:
  1. For each subject i: if he/she received treatment 1, we know Y_i = Y_i(1).
  2. Estimate Y_i(0) as the average outcome of the patients in that leaf who received treatment 0.
  3. Can also use linear regression.
- Use the B. + Dunn OCT or ORT algorithms.
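Step 2 above can be sketched in a few lines (our hypothetical leaf data; lower outcome is better): within a leaf of the tree, estimate each treatment's counterfactual outcome as the mean observed outcome among leaf patients who actually received it, then prescribe the minimizer:

```python
from statistics import mean

# Hypothetical leaf: (treatment z_i, observed outcome y_i) pairs.
leaf = [(1, 7.2), (1, 7.8), (2, 6.9), (2, 6.5), (1, 7.5), (2, 6.8)]

def prescribe(leaf):
    # Leaf-mean counterfactual estimate for each treatment seen in the leaf,
    # then pick the treatment with the lowest estimated outcome.
    treatments = {z for z, _ in leaf}
    estimate = {t: mean(y for z, y in leaf if z == t) for t in treatments}
    return min(estimate, key=estimate.get), estimate

tau, est = prescribe(leaf)
print(tau)  # 2: treatment 2 has the lower estimated outcome in this leaf
```

Step 3 would replace the leaf-mean with a per-treatment linear regression fit within the leaf.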
Personalized Diabetes Management
- Data from the Boston Medical Center, 1999–2014.
- 100,000 patient visits for type 2 diabetes.
- 13 possible treatment options (regimens).
- Patient features include demographic information (sex, race, gender, etc.), treatment history, and diabetes progression.
- Outcome of interest: HbA1c level; the smaller the better.
- Varied the number of training samples from 1,000 to 50,000 to examine the effect on out-of-sample performance, averaging over ten different splits of the data.
OPT has a Performance and Interpretability Edge
Figure: three panels as a function of training-set size (10^3 to 10^4.5), comparing Baseline, Oracle, RC-kNN, RC-LASSO, RC-RF, and OPT: mean HbA1c change (−0.6 to 0.0), conditional HbA1c change (−0.6 to 0.0), and the proportion of prescriptions that differ from standard of care (0%–100%).
Conclusions
- OCT and OCT-H provide interpretable, state-of-the-art predictions.
- OPT provides state-of-the-art prescriptions directly from data.
- Exciting applications in medicine and many other fields: computer security, financial services, and drug discovery, among others.
- Rethink how we teach optimization.
- New class: Machine Learning and Personalized Medicine.