Fairness in Machine Learning (Justiça na Aprendizagem de Máquina)
Summer School 2020
Flavio Figueiredo
About Me
Flavio Figueiredo
Whenever someone asks me: What do you research?
● Distributed Systems
○ Dependability, File Sharing, Social Exchanges
● Information Retrieval
○ Folksonomies
● Social Networks
○ Popularity and evolution
● Human-Computer Interaction
○ User studies
● Machine Learning
○ Learning models from social data
Human Factors in Computer Science
Distributed Systems: Why and how do people share information?
Information Retrieval: How does unstructured human knowledge grow?
Social Dynamics: How do user actions impact popularity? How do users perceive popularity?
Machine Learning: How to capture complex human behavior?
Interest in Human-Centric Machine Learning
Fairness is one of the main issues!
Simplified Background
● Mathematically speaking, what is the goal of a supervised learning system?
● The goal is to learn a set of parameters
● These parameters should optimize some prediction objective over the labels y
This is just one view. Optimization vs. Bayesian and other framings are out of scope.
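A minimal way to write down this optimization view (the notation below is mine, not taken from the slides):

\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log p_{\theta}(y_i \mid x_i)
\qquad \text{or, equivalently,} \qquad
\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, f_{\theta}(x_i)\big)

where \ell is a loss such as the log loss for classification; the Bayesian view would instead place a posterior over \theta.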
Simplified Background
● The goal of a supervised learning algorithm is to discriminate
Notation
We observe a dataset D = {(x_i, y_i)} sampled from some joint distribution P(X, Y)
Our ML model creates a hypothesis h(x) ≈ y focused on good predictions
Pipeline
[Figure: the ML pipeline. Data is split over time into Training, Validation (Dev), and Test sets. Development works on past data; production faces future data, assumed to come from the same distribution. Hopefully. Let's assume so.]
Simplified Background
● The goal of a supervised learning algorithm is to discriminate
● Why are we now so worried that it does? It seems we can trust them.
Machine Bias
● ProPublica analysis of COMPAS (which stands for Correctional Offender Management Profiling for Alternative Sanctions)
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Criminal Prediction
https://callingbullshit.org/case_studies/case_study_criminal_machine_learning.html
Discrimination
● The overall goal of an ML model is discrimination
● The best hypothesis is a good discriminator
○ Both at training time and at testing time
● In society, discrimination has a different connotation
Regulation
Two Guiding Principles
Disparate Treatment: Individuals from a sensitive group must be treated equally
Disparate Impact: The impact of decisions must affect groups equally
Extreme case: A credit scoring system that denies all loans?
Real case: https://en.wikipedia.org/wiki/Ricci_v._DeStefano
COMPAS
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Machine Bias
There's software used across the country to predict future criminals. And it's biased against blacks.
by Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, ProPublica, May 23, 2016
COMPAS
Correctional Offender Management Profiling for Alternative Sanctions
● First developed in 1998
● Still used today
● Predicts chance of recidivism
● Developed by Northpointe (now Equivant)
● Most information comes from a manual:
"A Practitioner's Guide to COMPAS Core"
COMPAS
Used in (or was used in, may be outdated)
● Florida
● New York
● Wisconsin
● California
● and others...
Overall
It’s not the algorithm, it’s the data.
https://cacm.acm.org/magazines/2017/2/212422-its-not-the-algorithm-its-the-data/fulltext
“There is no debate that both of these types of technologies are being used on a fairly widespread basis in the U.S. According to a 2013 article published by Sonja B. Starr, a professor of law at the University of Michigan Law School, nearly every state has adopted some type of risk-based assessment tools to aid in sentencing.”
How is the model trained?
Who knows?!
How is the model trained?
We know the input
How is the model trained?
COMPAS Questionnaire
https://www.documentcloud.org/documents/2702103-Sample-Risk-Assessment-COMPAS-CORE.html
Is COMPAS Using Well Known ML?
● Linear Regression
● Logistic Regression
● Support Vector Machines
● Random Forests
● etc.
Who knows? But they cite such methods.
Several Citations to ML Work
Note that it is a regression model. It seems that they regress to their own risk score.
Decile Scores
My interpretation:
● It seems that COMPAS scales inputs and outputs in deciles
● Inputs come from the questionnaire and other variables
● The output is usually some kind of risk
● I would guess that this is fed to some black-box ML system
Decile Scores
Single Equation
Only one equation is present in the manual:
Violent Recidivism Risk Score = (age × −w) + (age-at-first-arrest × −w) + (history of violence × w) + (vocation education × w) + (history of noncompliance × w),
where w is weight, the size of which is "determined by the strength of the item’s relationship to person offense recidivism that we observed in our study data."
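Rendered as a formula (the subscripts on the weights are mine, added only for readability; the manual writes a generic w):

\text{ViolentRecidivismRisk} = -w_{1}\,\text{age} - w_{2}\,\text{ageAtFirstArrest} + w_{3}\,\text{historyOfViolence} + w_{4}\,\text{vocationEducation} + w_{5}\,\text{historyOfNoncompliance}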
ProPublica’s Study
● Most analysis of such software is done by the developers
● ProPublica decided to do its own analysis
● COMPAS was chosen due to its popularity
Anecdotal Evidence
Several examples in the article
Anecdotal Evidence
● Great for news articles
● But the problem needs to be better understood
● Use machine learning to understand machine learning
Data
Response
Some Initial Data Science
● Understanding Decile Scores (output of the model)
Some Initial Data Science
● Risk of Recidivism Score
Some Initial Data Science
● Violent Risk Score
Logistic Regression
● The model here is trying to predict the output of COMPAS
● It works somewhat like reverse engineering
● We are not predicting real recidivism!
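A rough sketch of this kind of reverse engineering (a sketch only; the file and column names below follow the public ProPublica compas-scores-two-years.csv release and are assumptions, not something taken from these slides):

    import pandas as pd
    import statsmodels.formula.api as smf

    # ProPublica's two-year recidivism file (path and column names assumed).
    df = pd.read_csv("compas-scores-two-years.csv")

    # Binarize the COMPAS output: Medium/High score text versus Low.
    df["high_score"] = (df["score_text"] != "Low").astype(int)

    # Logistic regression that tries to explain the COMPAS label, not real
    # recidivism: we are reverse engineering the tool's output.
    model = smf.logit(
        "high_score ~ C(race) + C(sex) + age + priors_count + C(c_charge_degree)",
        data=df,
    ).fit()
    print(model.summary())  # significant positive coefficients push towards a high score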
Results of the Model
● Significant values indicate that the features may be able to explain the COMPAS score
● Positive values point towards a recidivism indication by COMPAS
● Negative values are the opposite
● Higher values, more impact
Violent Recidivism
● Same reading as before, now for the COMPAS violent recidivism score: significant coefficients may explain the score, positive values point towards a recidivism indication, negative values the opposite, and larger values mean more impact
Fairness
21 fairness definitions and their politics
Arvind Narayanan - FAT Conference 2018 Tutorial
● Computer scientists are on a wild goose chase for a single definition
● There is value in the various definitions
● Each can lead to trustworthiness
What is Fairness?
Sahil Verma and Julia Rubin (2018) -- Fairness Definitions Explained
● A lot of these metrics worry about some form of equality
● Let S be the subset of sensitive attributes and N the non-sensitive ones:
S = { col(j, X) | column j is sensitive }
N = { col(i, X) | column i is not sensitive }
Balanced Representation
[Figure: a small binary data matrix X, with one sensitive column S and the remaining non-sensitive columns N, in which the sensitive attribute is equally represented.]
Is this fairness?
Classifier Evaluation
If we assume that the future has the same distribution as the past:
● We can measure the error rates in our development step
● This indicates that our model works
Or we can...
● Actually wait for the future
● Then make claims
In either case, there are multiple ways to evaluate a classifier: compare predictions with ground-truth labels.
Classifier Evaluation
● Different tasks will focus on different metrics
● Search engines usually optimize for precision (definition soon)
○ Retrieved cases
● When predicting some rare cancer, recall is more important
○ All cases
● How should we work in recidivism?
To understand the metrics we need a confusion matrix.
Parity in Predictions
Classifier Evaluation (COMPAS Example)
                 High Risk              Low Risk
Recidivism       True Positive (TP)     False Negative (FN)
Stayed Clean     False Positive (FP)    True Negative (TN)

What is the impact of each kind of error?
Classifier Evaluation (COMPAS Example)
                 High Risk                                 Low Risk
Recidivism       True Positive (TP)                        False Negative (FN): criminal set loose!
Stayed Clean     False Positive (FP): innocent arrested!   True Negative (TN)
COMPAS in Practice
● Each individual is assigned a score
● Scores are broken down into deciles
COMPAS in Practice
● Let's assume three cut-off points
● Purple individuals have recidivated
● Orange ones have not
● Now we need to pick a cut-off point
[Figure: seven individuals ordered by decile score: D1 D2 D2 D1 D3 D3 D1, shown next to the 2×2 confusion matrix (rows: true outcome, columns: prediction).]
COMPAS in Practice
● Let's say that everyone >= D2 is at risk
● What is each score?
COMPAS in Practice
● Let's say that everyone >= D2 is at risk
● What is each score?
● TP = 3, FP = 1, FN = 0, TN = 3.
COMPAS in Practice
● Let's say that everyone >= D1 is at risk
● What is each score?
COMPAS in Practice
● Let's say that everyone >= D1 is at risk
● What is each score?
● TP = 3, FP = 4, FN = 0, TN = 0.
COMPAS in Practice
● Let's say that everyone >= D3 is at risk
● What is each score?
● TP = 1, FP = 1, FN = 2, TN = 3.
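A small sketch that reproduces these counts for the seven hypothetical individuals (the recidivism labels below are my own assignment, chosen only so that the counts match the slides):

    # Decile scores of the seven toy individuals and whether each one recidivated
    # (1 = purple / recidivated, 0 = orange / stayed clean; assignment assumed).
    scores = [1, 2, 2, 1, 3, 3, 1]
    recid = [0, 1, 1, 0, 1, 0, 0]

    def confusion(cutoff):
        # Everyone with a decile >= cutoff is predicted to be at risk.
        tp = sum(1 for s, y in zip(scores, recid) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, recid) if s >= cutoff and y == 0)
        fn = sum(1 for s, y in zip(scores, recid) if s < cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, recid) if s < cutoff and y == 0)
        return tp, fp, fn, tn

    for cutoff in (1, 2, 3):
        print(cutoff, confusion(cutoff))
    # >= D1 -> (3, 4, 0, 0); >= D2 -> (3, 1, 0, 3); >= D3 -> (1, 1, 2, 3)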
COMPAS in Practice
● We usually explore such metrics in normalized terms
● TPR = True Positive Rate, or Recall
○ TP / (TP + FN)
○ Row normalization
● FPR = False Positive Rate
○ FP / (FP + TN)
○ Row normalization
● How do we maximize each?
○ Recall is maximized when we say everyone is going to recidivate
○ FPR is also maximized when we say everyone is going to recidivate (no one is predicted to stay clean)
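Continuing the toy sketch from before, the row-normalized rates can be computed as follows (helper names are mine):

    def rates(tp, fp, fn, tn):
        tpr = tp / (tp + fn)  # recall: fraction of actual recidivists flagged
        fpr = fp / (fp + tn)  # fraction of non-recidivists wrongly flagged
        return tpr, fpr

    # Flagging everyone (cut-off >= D1) maximizes both TPR and FPR.
    print(rates(3, 4, 0, 0))  # (1.0, 1.0)
    print(rates(3, 1, 0, 3))  # (1.0, 0.25)
    print(rates(1, 1, 2, 3))  # (0.33..., 0.25)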
COMPAS Evaluation (From the Developers)
Predictive Validity of the COMPAS Reentry Risk Scales
An Outcomes Study Conducted for the Michigan Department of Corrections: Updated Results on an Expanded Release Sample
https://epic.org/algorithmic-transparency/crim-justice/EPIC-16-06-23-WI-FOIA-201600805-MDOC_ReentryStudy082213.pdf
COMPAS in Practice
● TPR (Recall) of around 80%
○ 80% of the actual recidivism cases are correctly flagged
● FPR of around 43%
○ 43% of individuals who did not recidivate receive a high score
○ 57% of the non-recidivism cases are correctly classified
COMPAS in Practice
● Suggestion of a cut-off point at 4
● Anybody above that is predicted as a recidivism risk
● Below that is low risk
● AUC of 0.72
○ A good score
COMPAS in Practice
● In all fairness, we are only showing one result
● The study considers various aspects of COMPAS
● If you wish, you can present many of the studies of COMPAS
Two views of the confusion matrix
Other View of the Confusion Matrix
                 High Risk              Low Risk
Recidivism       True Positive (TP)     False Negative (FN)
Stayed Clean     False Positive (FP)    True Negative (TN)

Row normalization is data focused (recall); column normalization is prediction focused (precision).
Column Normalization
● Positive Predictive Value (PPV), or Precision
○ TP / (TP + FP)
● False Discovery Rate (FDR)
○ FP / (TP + FP)
○ 1 − PPV (i.e., 1 − Precision)
● Negative Predictive Value (NPV)
○ TN / (TN + FN)
● False Omission Rate (FOR)
○ 1 − NPV
Word of advice: some books transpose the matrix. Don't memorize the layout by rote!
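A sketch of the column-normalized quantities, to complement the row-normalized rates above (helper names are mine):

    def column_metrics(tp, fp, fn, tn):
        ppv = tp / (tp + fp)   # precision: how often a high-risk label is right
        fdr = fp / (tp + fp)   # false discovery rate, 1 - PPV
        npv = tn / (tn + fn)   # how often a low-risk label is right
        fom = fn / (tn + fn)   # false omission rate, 1 - NPV
        return ppv, fdr, npv, fom

    # Using the >= D2 cut-off from the toy example: TP=3, FP=1, FN=0, TN=3.
    print(column_metrics(3, 1, 0, 3))  # (0.75, 0.25, 1.0, 0.0)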
ProPublica Dataset
Now let's look into the ProPublica study.
Different dataset (Michigan vs. Broward County)
Let’s look at results comparing races
Table from Krishna Gummadi
[Figures: confusion matrices and column-normalized rates from the ProPublica data, broken down by race.]
Northpointe: FDR rates are comparable! COMPAS is fair!
ProPublica Dataset
● Now let's look at the rows
● Recall that this focuses on the data!
● Columns focus on predictions; in the previous results, the predictions were balanced
[Figures: row-normalized error rates from the ProPublica data, broken down by race. In this view a false negative is a criminal set loose and a false positive is an innocent person arrested.]
● Error rates are not comparable
● What are the consequences?
Rates on Imbalanced Datasets
● Both datasets are imbalanced to begin with
● There are arguments from both sides
● Presentation and project ideas:
○ Re-evaluate COMPAS
■ Other classifiers
■ Other metrics
○ Present counter-argument papers
Impossibility of Fairness
Fairness is Impossible in Practice
● Two proofs, one from Alexandra Chouldechova
● and one from Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan
● References at the end of the slides
In Practice
● Classifiers output some probability score
● This score is thresholded to produce a class
● This is exactly what we have done with the decile scores
Formalizing
● The classifier now outputs a score
● From which we can create classes (e.g., low, medium, and high risk)
● Essentially what COMPAS does
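A minimal sketch of turning a score into classes, mimicking COMPAS-style decile cut-offs (the cut points below are assumptions, not taken from the manual):

    def score_to_class(decile, low_cut=4, high_cut=8):
        # Map a 1-10 decile score into Low / Medium / High risk buckets.
        if decile <= low_cut:
            return "Low"
        if decile < high_cut:
            return "Medium"
        return "High"

    print([score_to_class(d) for d in (1, 5, 9)])  # ['Low', 'Medium', 'High']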
Formalizing
● Now, let each value be associated with a random variable:
○ R is for race
○ Y is for true values (recidivism or not), not predictions
○ S is for the score
● Each has a probability distribution, e.g., P(R = 1) and P(R = 0)
● Also, let σ be the true fraction of individuals that have recidivated. This is called our base rate (or prevalence): σ = P(Y = 1)
3 Definitions of Fairness
● When y = 1: recidivism
● When y = 0: no recidivism
Calibration
[Figure: bars of P(Y = 1 | S = s, R = r) for scores s = 1, 2, 3 and groups R = white and R = black, shown in three variants (titles below).]
Calibrated for all Scores (Deciles)
Not Calibrated for Score 2
Not Calibrated for all Scores
Intuition
Calibration: Scores have the same meaning.
Predictive Parity
Predictive Parity OK for sHR = 2
Intuition
Predictive Parity: I expect the same precision for each group. This was Northpointe's argument (see previous slides). Also, from the previous slides we can see that PP differs from Calibration.
Measuring Calibration and PP
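One way to eyeball both notions on the ProPublica data (a sketch; the file and the decile_score, race, two_year_recid, and score_text columns are the same assumptions used in the earlier logistic-regression snippet):

    import pandas as pd

    df = pd.read_csv("compas-scores-two-years.csv")
    df["high_score"] = (df["score_text"] != "Low").astype(int)

    # Calibration: P(Y = 1 | S = s, R = r) should match across groups for every score s.
    calibration = df.groupby(["decile_score", "race"])["two_year_recid"].mean().unstack()
    print(calibration)

    # Predictive parity: PPV = P(Y = 1 | predicted high risk, R = r) should match across groups.
    ppv_by_group = df[df["high_score"] == 1].groupby("race")["two_year_recid"].mean()
    print(ppv_by_group)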
Error Rate Balance
● Focused on the row normalization of the confusion matrix
● Note that it is conditioned on the true value
● Recall-like interpretation
● This is where ProPublica showed that COMPAS is unfair
● Easy to see with a plot (next slide)
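And the row-normalized view that ProPublica emphasized (same assumed file and columns as above):

    import pandas as pd

    df = pd.read_csv("compas-scores-two-years.csv")
    df["high_score"] = (df["score_text"] != "Low").astype(int)

    # Base rate (prevalence) per group: sigma = P(Y = 1 | R = r).
    print(df.groupby("race")["two_year_recid"].mean())

    # FPR and FNR per group: condition on the true outcome, then look at the prediction.
    fpr = df[df["two_year_recid"] == 0].groupby("race")["high_score"].mean()
    fnr = 1 - df[df["two_year_recid"] == 1].groupby("race")["high_score"].mean()
    print(fpr)
    print(fnr)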
Error Rate Balance
Intuition
Error Rate Balance: We cannot bias towards fewer (or more) errors for a particular group.
Error Rate Balance
Previous graph
● For different cut-off points
● Imbalanced error rates
● [Main Result] COMPAS is never fair from this point of view. Why?
○ Actually, this holds for any classifier
Proof that FPR and FNR cannot be equal
Knowing that:
● σ = P(Y = 1) = (TP + FN) / N
● 1 − σ = P(Y = 0) = (FP + TN) / N
● FPR = FP / (FP + TN)
● PPV = TP / (TP + FP)
● 1 − FNR = Recall = TP / (TP + FN)
We can write the identity below
FP / (FP + TN) = [(TP + FN) / (FP + TN)] × [FP / TP] × [TP / (TP + FN)]
Rewrite:
FPR = [σ / (1 − σ)] × [(1 − PPV) / PPV] × (1 − FNR)
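To check the rewrite, substitute the definitions above (a one-line verification, in my notation):

\frac{\sigma}{1-\sigma}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR})
= \frac{TP+FN}{FP+TN}\cdot\frac{FP}{TP}\cdot\frac{TP}{TP+FN}
= \frac{FP}{FP+TN} = \mathrm{FPR}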
Two Groups
Each group has its own confusion matrix. For group w:
FPR(w) = [σ(w) / (1 − σ(w))] × [(1 − PPV(w)) / PPV(w)] × (1 − FNR(w))
And for group b:
FPR(b) = [σ(b) / (1 − σ(b))] × [(1 − PPV(b)) / PPV(b)] × (1 − FNR(b))
Calibration and PP are equal for groups
If PP and Calibration are Equal
Calibration and predictive parity imply PPV(w) = PPV(b), so the factor (1 − PPV) / PPV is the same constant for both groups:
FPR(w) = [σ(w) / (1 − σ(w))] × [(1 − PPV) / PPV] × (1 − FNR(w))
FPR(b) = [σ(b) / (1 − σ(b))] × [(1 − PPV) / PPV] × (1 − FNR(b))
Divide one identity by the other: the PPV factors cancel.
The remaining terms can only be equal when σ(w) = σ(b).
This is not the case! From the data, the recidivism rate is not equal across groups.
Try setting FPR(w) = FPR(b) with σ(w) ≠ σ(b): it is impossible to also reach equal FNR.
Symmetrically, setting FNR(w) = FNR(b) with σ(w) ≠ σ(b) makes it impossible to reach equal FPR.
Proof sketch: σ / (1 − σ) is strictly increasing (a bijection). Write FPR(w) = a × b and FPR(b) = c × d, where a and c are the base-rate factors and b and d collect the remaining terms. Setting the two FPRs equal forces b = d only when a = c, that is, only when the base rates match.
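A tiny numeric illustration of the same point (all numbers are made up; they only exercise the identity):

    def fpr_from_identity(sigma, ppv, fnr):
        # FPR = sigma / (1 - sigma) * (1 - PPV) / PPV * (1 - FNR)
        return sigma / (1 - sigma) * (1 - ppv) / ppv * (1 - fnr)

    ppv = 0.6                    # predictive parity: same PPV for both groups
    sigma_w, sigma_b = 0.3, 0.5  # different base rates (assumed)
    fnr = 0.4                    # force equal FNR for both groups...

    print(fpr_from_identity(sigma_w, ppv, fnr))  # ~0.17
    print(fpr_from_identity(sigma_b, ppv, fnr))  # 0.40 -> the FPRs cannot also match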
Results
● It is impossible to achieve predictive parity and equal error rates (both FPR and FNR) at the same time when the base rates differ
● Our classifiers will always favor one group
● Where do we go from here:
○ We can tolerate some error threshold
○ Decide which group should be biased
○ More definitions of fairness [next class]
What can we do?
Hard Problem
Society (and, as a consequence, our datasets) is unfair
Accountability is difficult (who do we blame?)
Datasets and models are hard to understand
Thank You!
References
● Predictive Validity of the COMPAS Reentry Risk Scales. https://epic.org/algorithmic-transparency/crim-justice/EPIC-16-06-23-WI-FOIA-201600805-MDOC_ReentryStudy082213.pdf
● Kleinberg, Mullainathan, and Raghavan. Inherent Trade-Offs in the Fair Determination of Risk Scores. https://arxiv.org/pdf/1609.05807.pdf
● Chouldechova. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. https://www.andrew.cmu.edu/user/achoulde/files/disparate_impact.pdf
● Inherent Trade-offs in Algorithmic Fairness (talk). https://www.youtube.com/watch?v=p5yY2MyTJXA