
Data Ethics in Algorithmic Decision Making

Vassilis Christophides, Vasilis Efthymiou
{christop|vefthym}@csd.uoc.gr

    http://www.csd.uoc.gr/~hy562

    University of Crete, Fall 2020


    Plan

    • Introduction and Motivation

    • Discrimination Examples in Algorithmic Decision Making

    ▪Direct and Indirect Discrimination

• How Do Machines Learn to Discriminate?

    • Anti-Discrimination Learning

    ▪Associational Fairness Definitions

    ▪Causal Fairness Definitions

    • Upstream Discrimination Prevention in ML Pipelines

    ▪ Interplay of Data Preprocessing and Fairness Interventions

    • Progress so Far and Acknowledgments


    The Rise of Artificial Intelligence (AI)

• Widespread investments in Artificial Intelligence (AI) are made to enable computers to interpret what they see, communicate in natural language, answer complex questions, and interact with their physical environment
    ▪ improve people's lives, e.g., autonomous cars
    ▪ accelerate scientific discovery, e.g., precision medicine
    ▪ protect the environment, e.g., reduce energy footprint
    ▪ optimize business, e.g., targeted advertisement
    ▪ transform society, e.g., increasing automation


    Timeline of the Industrial Revolutions

• The huge societal challenges arriving with the increased automation enabled by AI resemble the challenges of the industrial revolutions
    ▪ Data is the driver of the new industrial era and actually fuels the development of AI

https://pocketconfidant.com/how-the-era-of-artificial-intelligence-will-transform-society/


    AI is Data Hungry!

https://www.slideshare.net/KeithKraus/gpuaccelerating-udfs-in-pyspark-with-numba-and-pygdf

[Chart: projected data volumes from 10^9 up to 10^21 bytes for application areas such as self-driving cars, "superhuman" doctors, life on other planets, creative arts, smart farms & food systems, "detoxifying" social media, personal assistants, smarter cybersecurity, and Earth challenge areas]

An ML model is only as good as its data; no matter how good a training algorithm is, the ultimate quality of automated decisions lies in the data itself!


Automated Decisions of Consequence

• Widespread algorithmic decision systems with many small interactions
    ▪ e.g., search, recommendations, social media, …
• Specialized algorithmic decision systems with fewer but higher-stakes interactions
    ▪ e.g., hiring and promotion, credit-worthiness and loans, criminal justice and predictive policing, child maltreatment screening, medical diagnosis, welfare eligibility, …
• At this level of impact, algorithmic decision-making can have unintended consequences in people's lives

[Figure: Hiring AI, Policing & Sentencing AI, Lending AI]


    Are Automated Decisions Impartial?

• Algorithmic decision making is more objective than human decision making, and yet…
    ▪ All traditional evils of discrimination, and many new ones, exhibit themselves in the Big Data & AI ecosystem
    ▪ Opaque automated decision systems
• Human decision-making is affected by greed, prejudice, fatigue, poor scalability, etc. and hence can be biased!
    ▪ Formal procedures can limit opportunities to exercise prejudicial discretion or fall victim to implicit bias
• High-stakes scenarios = ethical problems!
    ▪ Despite existing legal/regulatory efforts, current anti-discrimination laws are not yet well equipped to deal with various issues of discrimination in data analysis [S. Barocas, A.D. Selbst 2016]


    Discrimination is not a General Concept

• It is domain specific
    ▪ Concerned with important opportunities that affect people's life chances
• It is feature specific
    ▪ Concerned with socially salient qualities that have served as the basis for unjustified & systematically adverse treatment in the past


    Legally Recognized ‘Protected Classes’

https://www.slideshare.net/KrishnaramKenthapadi/fairnessaware-machine-learning-practical-challenges-and-lessons-learned-www-2019-tutorial

Societal categories (political ideology, income, language, physical traits), intersectional subpopulations (young white women, old black men), etc.


    Legal & Regulatory Frameworks in EU

• EU Council Directive 76/207/EEC of 9 February 1976 on the implementation of the principle of equal treatment for men and women as regards access to employment, vocational training and promotion, and working conditions
• GDPR 2016/679, Recital 71 [https://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX:32016R0679]
    ▪ […] In order to ensure fair and transparent processing […], the controller should use appropriate mathematical or statistical procedures for the profiling,
    ▪ […] and that prevents, inter alia, discriminatory effects on natural persons on the basis of racial or ethnic origin, political opinion, religion or beliefs, (direct)
    ▪ […] or that result in measures having such an effect. (indirect)


Discrimination Law: Two Doctrines

• Disparate treatment (DT) is the illegal practice of treating an entity, such as a creditor or employer, differently based on a protected/sensitive attribute such as race, gender, age, religion, sexual orientation, etc.
    ▪ avoid disparities between outputs for groups of people with the same (or similar) values of non-sensitive attributes but different values of sensitive ones
• Disparate impact (DI) is the result of systematic disparate treatment, where a disproportionate adverse impact is observed on members of a protected class [M. Feldman et al. 2015]
    ▪ minimize outputs that benefit (harm) a group of people sharing a value of a sensitive attribute more frequently than other groups of people


    What does Discrimination Law Aim to Achieve?

• Equality of opportunity (treatment):
    ▪ Procedural fairness
• Minimized inequality of outcome (impact):
    ▪ Outcome fairness

https://www.reddit.com/r/GCdebatesQT/comments/7qpbpp/food_for_thought_equality_vs_equity_vs_justice


    Example: Criminal Justice System

Dylan Fugett (right) was rated low risk while Bernard Parker (left) was rated high risk

• There is software used across the US to predict whether someone who has been arrested will be re-arrested in the future, on the basis of criminal history, demographics, and other information
    ▪ Assessing recidivism risk is biased against blacks [J. Angwin, et al. 2016]
• Recidivism risk: a defendant's likelihood of committing a crime
    ▪ Used to decide pretrial detention, bail and sentencing

https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing


    Example: Predictive Policing

    https://www.smithsonianmag.com/innovation/artificial-intelligence-is-now-used-predict-crime-is-it-biased-180968337/

    • PredPol identifies areas in a neighborhood where serious crimes are more likely to occur during a particular period

    ▪ using a wide range of “leading indicator” data, including reports of crimes, such as simple assaults, vandalism and disorderly conduct, and 911 calls about such things as shots fired or a person seen with a weapon

    • The American Civil Liberties Union [ACLU], the Brennan Center for Justice and various civil rights organizations have all raised questions about the risk of bias being baked into the software


Detention of High-Risk Criminals

• Judge (decision maker): Of those I've labeled high-risk, how many will recidivate? (Positive Predictive Value)
• Defense (think hiring rather than criminal justice): Is the selected set demographically balanced?
• Plaintiff: What's the probability I'll be incorrectly classified as high-risk? (False Positive Rate)

[Figure: confusion matrix over demographic populations, crossing "recidivate"/"did not recidivate" with "label high-risk"/"label low-risk" into true/false positives and negatives]

Tutorial: 21 fairness definitions and their politics: https://www.youtube.com/watch?v=jIXIuYdnyyk


Prediction Fails Differently for Black Defendants! [J. Dressel and H. Farid 2018]

• COMPAS: Correctional Offender Management Profiling for Alternative Sanctions

                                                White    African American
Labeled high-risk, but didn't re-offend (FPR)   23.5%    44.9%
Labeled low-risk, yet did re-offend (FNR)       47.7%    28.0%


Essence of the COMPAS Debate

• ProPublica's main charge:
    ▪ Black defendants face a higher false positive rate: among defendants who did not get rearrested, black defendants were twice as likely to be misclassified as high risk
    ▪ White defendants face a higher false negative rate: among defendants who got rearrested, white defendants were twice as likely to be misclassified as low risk
• Northpointe's (now Equivant, of Canton, Ohio) main defense:
    ▪ COMPAS was not made to make absolute predictions about success or failure; it was designed to inform probabilities of reoffending across three categories of risk (low, medium, & high)
    ▪ The system is well-calibrated: if a person is assigned to one of the three risk categories, we can treat them as having the corresponding risk level
• Word of caution:
    ▪ Neither calibration nor equality of false positive/negative rates rules out blatantly unfair practices [S. Corbett-Davies et al 2017]


    Discrimination in Credit & Consumer Markets

Redlining is the (indirect discrimination) practice of arbitrarily denying or limiting financial services to specific neighborhoods, generally because their residents are people of color or are poor


    Amazon Redlining

    No Amazon Free Same-day Delivery for Restricted Minority Neighborhoods


    Discrimination in Online Services

• Non-black hosts can charge ~12% more than black hosts on Airbnb [M. Luca, B. Edelman 2014]
• Price steering and discrimination in many online retailers [A. Hannak, G. Soeller, D. Lazer, A. Mislove, and C. Wilson 2014]
• Race and gender stereotypes reinforced on the Web [M. Kay, C. Matuszek, S. Munson 2015]
• China appears about 21% larger (by pixels) when shown in Google Maps for China [G. Soeller, K. Karahalios, C. Sandvig, and C. Wilson 2016]


    Representation Bias

• Gender bias found in word embeddings trained with word2vec on Google News [T. Bolukbasi et al. 2016]
• Represent each word with a high-dimensional vector
    ▪ Vector arithmetic: analogies like Paris − France = London − England
    ▪ Also found: man − woman = programmer − homemaker = surgeon − nurse (topic observed: "occupations")

Word clouds for the nearest neighbours of "man" (L) and "woman" (R).
https://towardsdatascience.com/gender-bias-word-embeddings-76d9806a0e17
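A minimal Python sketch of probing such analogies with gensim; the pretrained GoogleNews vector file (and whatever neighbours it actually returns) is an assumption made for illustration:

```python
from gensim.models import KeyedVectors

# Assumes the pretrained GoogleNews-vectors-negative300.bin file is available locally.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# "man is to programmer as woman is to ?"  ->  vector(programmer) - vector(man) + vector(woman)
print(vectors.most_similar(positive=["programmer", "woman"], negative=["man"], topn=3))

# "Paris is to France as ? is to England"  ->  vector(Paris) - vector(France) + vector(England)
print(vectors.most_similar(positive=["Paris", "England"], negative=["France"], topn=3))
```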


Toxicity: Twitter Taught AI Chatbot to be a Racist in Less than a Day

https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist


"Bugs" of Data-driven Decision Systems: Data Acquisition [H. Suresh, J. V. Guttag 2019]

• When data is about people, "bugs" related to various bias types can lead to discrimination!
    ▪ Historical bias
    ▪ Representation bias: selection or sampling bias
    ▪ Information bias: observation or measurement bias

[Figure: data acquisition pipeline from the world to the "world according to data", with a train/test split]

• "Bugs" in Data Collection: skewed samples, sample size disparity
• "Bugs" in Data Processing: limited or proxy features, data errors
• "Bugs" in Ground Truth: tainted samples


"Bugs" of Data-driven Decision Systems: Model Building and Evaluation [H. Suresh, J. V. Guttag 2019]

• Aggregation bias
• Evaluation bias

[Figure: model building and evaluation pipeline, from the "world according to data" to training and test sets]


    Three Different Performance Problems

    • Discovering unobserved differences in performance

    ▪ Skewed and tainted samples: Garbage in, garbage out (GIGO)

    –Samples might be biased

    –Labels might be incorrect

    • Coping with observed differences in performance

    ▪ Sample size disparity

    –Learn on majority

    ▪ Limited features

    –Errors concentrated in the minority class

    • Understanding the causes of disparities in predicted outcome

    ▪ Proxies (redline attributes): Data as a social mirror

    –Protected attributes redundantly encoded in observables


    Anti-Discrimination Learning

❶ Discrimination Discovery/Detection
    ▪ Unveil evidence of discriminatory practices by analyzing the historical dataset or the predictive model
❷ Discrimination Prevention/Removal
    ▪ Mitigate discriminatory effects by modifying the biased data, by adjusting the learning process, or by modifying the predictive model
        • Pre-processing: modify the training data
        • In-processing: adjust the learning process
        • Post-processing: directly change the predicted labels


    Discrimination Discovery Framework

[D. Pedreschi, S. Ruggieri, and F. Turini 2008]

INPUT:
• A database of historical decision records
• A criterion of (unlawful) discrimination
• A set of potentially discriminated groups

OUTPUT:
• A subset of decision records and potentially discriminated people for which the criterion holds


    Formal Setup

• X: features of an individual (e.g., criminal history, demographics, etc.)
    ▪ may contain redlining attributes R (e.g., neighborhood)
• A: a protected, sensitive attribute (e.g., race)
    ▪ binary attribute A = {a+, a−}
• D = d(X, A): predictor of the decision (e.g., criminality risk)
    ▪ binary decision D = {d+, d−}
• Y: target variable, labels (e.g., recidivate)

Notation: Pa{E} = P{E ∣ A=a}, i.e., probabilities conditioned on membership in group a; all random variables are drawn from the same probability distribution.


    Three Fundamental Criteria

• Independence: D independent of A (D ⊥ A)
    ▪ predictions are uncorrelated with the sensitive attribute
    ▪ also called Demographic Parity, Statistical Parity
• Separation: D independent of A conditional on Y (D ⊥ A ∣ Y)
    ▪ equal true positive/negative rates of predictions across groups
    ▪ also called Equalized Odds, Positive Rate Parity, Disparate Mistreatment, Equal Opportunity
• Sufficiency: Y independent of A conditional on D (Y ⊥ A ∣ D)
    ▪ similar rates of accurate predictions across groups
    ▪ also called Predictive Rate Parity, Outcome Test, Test-fairness, Well-Calibration
• The above criteria fall into a larger category called "group fairness"


Conditional Independence

• A is conditionally independent of B given C, denoted A ⊥ B | C, if the probability distribution governing A is independent of the value of B, given the value of C
    ▪ learning that B = b does not change your belief in A when you already know C = c, and this is true for all values b that B can take and all values c that C can take

    ∀ a ∈ dom(A), b ∈ dom(B), c ∈ dom(C):  P(A=a | B=b, C=c) = P(A=a | C=c)

• Note: conditional independence neither implies nor is implied by independence!
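A tiny numeric sketch of this definition on a constructed joint distribution (the probability values below are invented): A and B each depend only on C, so they are conditionally independent given C even though they are marginally dependent:

```python
import itertools

# P(C), P(A=1|C) and P(B=1|C) for binary variables; A and B only depend on C.
p_c = {0: 0.4, 1: 0.6}
p_a_given_c = {0: 0.2, 1: 0.7}
p_b_given_c = {0: 0.3, 1: 0.8}

def joint(a, b, c):
    """Joint probability P(A=a, B=b, C=c) under the construction above."""
    pa = p_a_given_c[c] if a == 1 else 1 - p_a_given_c[c]
    pb = p_b_given_c[c] if b == 1 else 1 - p_b_given_c[c]
    return p_c[c] * pa * pb

# Check P(A=1 | B=b, C=c) == P(A=1 | C=c) for all b, c (conditional independence).
for b, c in itertools.product([0, 1], repeat=2):
    p_abc = joint(1, b, c)
    p_bc = joint(0, b, c) + joint(1, b, c)
    print(f"P(A=1|B={b},C={c}) = {p_abc / p_bc:.2f}   P(A=1|C={c}) = {p_a_given_c[c]:.2f}")

# Marginally, however, observing B shifts our belief about C and hence about A,
# so conditional independence does not imply (unconditional) independence.
```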


    (Un)Conditional Independence Examples

[Figure: example graphs illustrating the possible combinations of A ⊥ B | C and A ⊥ B]


    First Criterion: Independence

• Require D and A to be independent, denoted D ⊥ A (demographic parity) [R. Zemel et al. 2013], [M. Feldman et al. 2015], [S. Corbett-Davies et al. 2018]
    ▪ the allocation of benefits and harms across groups can be examined by looking at the decision alone
• That is, for all groups a, b and all values d:
    Pa{D=d} = Pb{D=d}
• When D is a binary 0/1-variable, this means
    Pa{D=1} = Pb{D=1} for all groups a, b! (avoid disparate impact)
• Approximate versions (see the sketch below):
    ▪ Disparate Impact (DI): Pb{D=1} / Pa{D=1} ≥ 1 − ϵ (US legal context, cf. the four-fifths rule) [M. Feldman et al. 2015]
    ▪ Calders-Verwer (CV): |Pb{D=1} − Pa{D=1}| ≤ ϵ (UK legal context) [T. Calders, S. Verwer 2010]
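A minimal sketch of checking the two approximate versions on binary decisions; the decision vector and group labels below are illustrative only:

```python
import numpy as np

decisions = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])         # predicted decisions D
groups    = np.array(["prot", "prot", "prot", "prot", "prot",
                      "unprot", "unprot", "unprot", "unprot", "unprot"])

p_prot   = decisions[groups == "prot"].mean()    # P{D=1 | protected group}
p_unprot = decisions[groups == "unprot"].mean()  # P{D=1 | unprotected group}

di = p_prot / p_unprot          # Disparate Impact ratio (four-fifths rule: want >= 0.8)
cv = abs(p_unprot - p_prot)     # Calders-Verwer gap (want <= epsilon)

print(f"P_prot(D=1)={p_prot:.2f}  P_unprot(D=1)={p_unprot:.2f}  DI={di:.2f}  CV gap={cv:.2f}")
```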


    Simple Discrimination Measures

[D. Pedreschi, S. Ruggieri, F. Turini 2012]:

Protected group vs. unprotected group (rates p1 and p2):
    ▪ Risk difference: RD = p1 − p2  (UK law)
    ▪ Risk ratio or relative risk: RR = p1 / p2  (EU Court of Justice)
    ▪ Relative chance: RC = (1 − p1) / (1 − p2)  (US courts focus on the selection rates (1 − p1) and (1 − p2))
    ▪ Odds ratio = RR / RC

Protected group vs. entire population (rates p1 and p):
    ▪ Extended risk difference = p1 − p
    ▪ Extended risk ratio or extended lift = p1 / p
    ▪ Extended chance = (1 − p1) / (1 − p)
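A minimal sketch of these measures, reading p1, p2 and p as the rates of the negative (benefit-denying) decision in the protected group, the unprotected group and the entire population; the numeric rates below are invented:

```python
# Illustrative rates of the negative decision (e.g., loan denied).
p1, p2, p = 0.30, 0.15, 0.20

rd   = p1 - p2               # Risk difference (UK law)
rr   = p1 / p2               # Risk ratio / relative risk (EU Court of Justice)
rc   = (1 - p1) / (1 - p2)   # Relative chance (selection rates, US courts)
odds = rr / rc               # Odds ratio

erd = p1 - p                 # Extended risk difference (vs. entire population)
err = p1 / p                 # Extended risk ratio / extended lift
ec  = (1 - p1) / (1 - p)     # Extended chance

print(f"RD={rd:.2f} RR={rr:.2f} RC={rc:.2f} OR={odds:.2f}  "
      f"ext-RD={erd:.2f} ext-lift={err:.2f} ext-chance={ec:.2f}")
```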


    Second criterion: Separation

• Require D and A to be independent conditional on Y, denoted D ⊥ A ∣ Y [M. Hardt et al. 2016], [A. Chouldechova 2017], [S. Corbett-Davies et al. 2017], [M. Zafar et al. 2017]
• The distribution of the prediction D does not change after observing A once Y is known
    ▪ That is, for all groups a, b and all values d and y:
    Pa{D=d ∣ Y=y} = Pb{D=d ∣ Y=y}  (prevent disparate mistreatment)


Confusion Matrix: (Mis)Match between Target Variable Y and Decision D

Rates conditioned on the true label Y:
    TPR = P(D=1 | Y=1)    FPR = P(D=1 | Y=0)
    FNR = P(D=0 | Y=1)    TNR = P(D=0 | Y=0)

Rates conditioned on the decision D:
    PPV = P(Y=1 | D=1)    FDR = P(Y=0 | D=1)
    FOR = P(Y=1 | D=0)    NPV = P(Y=0 | D=0)


    Definitions from the Confusion Matrix

• For any box in the confusion matrix involving the decision D, we can require equality across groups [S. Mitchell et al. 2018] (see the sketch below)
• D ⊥ A | Y = 1:
    ▪ Equal TPRs: Pa[D = 1 | Y = 1] = Pb[D = 1 | Y = 1]
    ▪ Equal FNRs: Pa[D = 0 | Y = 1] = Pb[D = 0 | Y = 1]
• D ⊥ A | Y = 0:
    ▪ Equal FPRs: Pa[D = 1 | Y = 0] = Pb[D = 1 | Y = 0]
    ▪ Equal TNRs: Pa[D = 0 | Y = 0] = Pb[D = 0 | Y = 0]
• Named variants:
    ▪ Balanced error rates [A. Chouldechova 2016]: equal FPR & FNR
    ▪ Equal opportunity [M. Hardt, E. Price, N. Srebro 2016]: equal FNR
    ▪ Predictive equality [S. Corbett-Davies 2017]: equal FPR
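A minimal sketch computing these separation-side rates per group; the labels, decisions and group memberships below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a", "a", "a", "a", "b", "b", "b", "b"],   # sensitive attribute
    "Y": [1, 1, 0, 0, 1, 1, 0, 0],                    # true label
    "D": [1, 0, 1, 0, 1, 1, 0, 0],                    # predicted decision
})

for g, sub in df.groupby("A"):
    tpr = sub.loc[sub.Y == 1, "D"].mean()    # P[D=1 | Y=1, A=g]
    fnr = 1 - tpr                            # P[D=0 | Y=1, A=g]
    fpr = sub.loc[sub.Y == 0, "D"].mean()    # P[D=1 | Y=0, A=g]
    tnr = 1 - fpr                            # P[D=0 | Y=0, A=g]
    print(g, f"TPR={tpr:.2f} FNR={fnr:.2f} FPR={fpr:.2f} TNR={tnr:.2f}")

# Separation (equalized odds) asks these rates to match across groups;
# equal opportunity compares only TPR/FNR, predictive equality only FPR.
```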


    Third Criterion: Sufficiency

• Require Y and A to be independent conditional on D, denoted Y ⊥ A ∣ D
• Classifier accuracy should be the same for all the groups
    ▪ That is, for all groups a, b and all values d and y:
    Pa{Y=y ∣ D=d} = Pb{Y=y | D=d}, i.e., achieve accuracy equity


    Definitions from the Confusion Matrix

• For any box in the confusion matrix involving the label Y, we can require equality across groups [S. Mitchell et al. 2018]
• Y ⊥ A | D = 1:
    ▪ Equal PPVs: Pa[Y = 1 | D = 1] = Pb[Y = 1 | D = 1]
    ▪ Equal FDRs: Pa[Y = 0 | D = 1] = Pb[Y = 0 | D = 1]
• Y ⊥ A | D = 0:
    ▪ Equal FORs: Pa[Y = 1 | D = 0] = Pb[Y = 1 | D = 0]
    ▪ Equal NPVs: Pa[Y = 0 | D = 0] = Pb[Y = 0 | D = 0]
• Well-Calibration [G. Pleiss et al. 2017]: equal FDR & FOR (see the sketch below)
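A matching sketch for the sufficiency-side rates, conditioning on the decision D instead of the label Y (same kind of hypothetical data as before):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "Y": [1, 1, 0, 0, 1, 1, 0, 0],
    "D": [1, 0, 1, 0, 1, 1, 0, 0],
})

for g, sub in df.groupby("A"):
    ppv = sub.loc[sub.D == 1, "Y"].mean()    # P[Y=1 | D=1, A=g]
    fdr = 1 - ppv                            # P[Y=0 | D=1, A=g]
    fo  = sub.loc[sub.D == 0, "Y"].mean()    # P[Y=1 | D=0, A=g]  (FOR)
    npv = 1 - fo                             # P[Y=0 | D=0, A=g]
    print(g, f"PPV={ppv:.2f} FDR={fdr:.2f} FOR={fo:.2f} NPV={npv:.2f}")

# Sufficiency (predictive rate parity) asks these rates to match across groups.
```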


    Trade-offs Are Necessary!

• Any two of the three criteria we saw are mutually exclusive except in degenerate cases! [S. Mitchell, et al. 2018] [Kleinberg et al. 2017]
• Independence vs Sufficiency:
    ▪ Proposition. Assume balance for the negative and positive classes, and calibration within groups; then either independence holds or sufficiency, but not both
• Independence vs Separation:
    ▪ Proposition. Assume that equal FPRs, equal FNRs, and equal PPVs hold; then either independence holds or separation, but not both
• Separation vs Sufficiency:
    ▪ Proposition. Assume all events in the joint distribution of (A, D, Y) have positive probability; then either separation holds or sufficiency, but not both
• Variants observed by [Chouldechova 2016] [Kleinberg, Mullainathan, Raghavan 2016]


    Separation vs Sufficiency Tradeoffs: Example

• Suppose the following labels and predictions after optimizing our classifier without any fairness constraint
    ▪ We get the predictions for group a all correct but make one false negative mistake on group b

[Figure: per-group labels (Y=1 / Y=0) and predictions (D=1 / D=0)]
https://towardsdatascience.com/a-tutorial-on-fairness-in-machine-learning-3ff8ba1040cb


Separation vs Sufficiency Tradeoffs: Example

• Since we want to preserve separation (equalized odds), we decide to make two false negative mistakes on a as well
    ▪ Now the true negative rates (specificity) as well as the true positive rates (sensitivity) are equal: both groups have 1 (3/3 & 2/2) and 1/2 (2/4 & 1/2)

[Figure: per-group labels and predictions after the adjustment]
https://towardsdatascience.com/a-tutorial-on-fairness-in-machine-learning-3ff8ba1040cb


Separation vs Sufficiency Tradeoffs: Example

• However, although positive predictive parity is also preserved, negative predictive parity is violated with this setting (sufficiency)
    ▪ It is not possible to preserve negative predictive parity without sacrificing positive predictive parity

https://towardsdatascience.com/a-tutorial-on-fairness-in-machine-learning-3ff8ba1040cb
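A small numeric check of this trade-off, using confusion-matrix counts assumed to match the illustration above (group a: TP=2, FN=2, FP=0, TN=3; group b: TP=1, FN=1, FP=0, TN=2):

```python
counts = {
    "a": dict(tp=2, fn=2, fp=0, tn=3),
    "b": dict(tp=1, fn=1, fp=0, tn=2),
}

for g, c in counts.items():
    tpr = c["tp"] / (c["tp"] + c["fn"])   # sensitivity
    tnr = c["tn"] / (c["tn"] + c["fp"])   # specificity
    ppv = c["tp"] / (c["tp"] + c["fp"])   # positive predictive value
    npv = c["tn"] / (c["tn"] + c["fn"])   # negative predictive value
    print(g, f"TPR={tpr:.2f} TNR={tnr:.2f} PPV={ppv:.2f} NPV={npv:.2f}")

# TPR and TNR match across groups (separation holds) and so does PPV,
# but NPV differs (0.60 vs 0.67), so sufficiency is violated.
```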


    How About Other Criteria?

• Can we address the impossibility results for independence, separation and sufficiency with other fairness criteria?
• Fundamental issue: all criteria we've seen so far are observational, i.e., properties of the joint distribution of (A, X, D, Y)
    ▪ Passive observation of the world
    ▪ No "what if" scenarios or interventions (potential outcomes)
• This leads to inherent limitations


    Correlation vs. Causation

• Correlation means two variables are related, but does not tell why
• A strong correlation does not necessarily mean that changes in one variable cause changes in the other
    ▪ X and Y are correlated
    ▪ X causes Y, or Y causes X
    ▪ X and Y are caused by a third (common-cause) variable Z
• In order to imply causation, a true experiment must be performed where subjects are randomly assigned to different conditions
    ▪ Sometimes randomized experiments are expensive or immoral!
    ▪ In the context of fairness, sensitive attributes are typically immutable; hence, randomization is not even conceivable!

[Figure: X → Y, Y → X, and X ← Z → Y (common cause)]


Some History: Causation is a Matter of Perception

• Democritus (460-370 BC): "I would rather discover one causal law than be the king of Persia"
• David Hume (1711-1776): "We remember seeing the flame, and feeling a sensation called heat; without further ceremony, we call the one cause and the other effect"
• Karl Pearson (1857-1936), statistical ML: "Forget causation! Correlation is all you should ask for."
• Judea Pearl (1936-), mathematical foundations of causality: "Forget empirical observations! Define causality based on a network of known, physical, causal relationships"


    What do we Make of This?

• Answers to substantive social questions are not always provided by observational data
• Association does not mean causation, but discrimination is causal
    ▪ whether an individual would receive the same decision had the individual been of a different race, sex, age, religion, etc.!
• Knowledge about the causal relationships between all attributes should be taken into consideration
    ▪ fairness holds when there is no causal relationship from the protected attribute A to the decision D!
• Need for causality-aware methods for discovering and preventing discrimination in observational data, i.e., data recorded from the environment with no randomization or other controls


    Structural Causal Model

• Describes how causal relationships can be inferred from non-temporal data if one makes certain assumptions about the underlying process of data generation
• A causal model is a triple ℳ = ⟨𝑼, 𝑽, 𝑭⟩, where
    ▪ 𝑼 is a set of exogenous (hidden) variables whose values are determined by factors outside the model
    ▪ 𝑽 = {𝑋1, ⋯, 𝑋𝑖, ⋯} is a set of endogenous (observed) variables whose values are determined by factors within the model
    ▪ 𝑭 = {𝑓1, ⋯, 𝑓𝑖, ⋯} is a set of deterministic functions where each 𝑓𝑖 is a mapping from 𝑼 × (𝑽 ∖ 𝑋𝑖) to 𝑋𝑖: 𝑥𝑖 = 𝑓𝑖(𝒑𝒂𝑖, 𝒖𝑖)
        – where 𝒑𝒂𝑖 is a realization of 𝑋𝑖's parents in 𝑽, i.e., 𝑷𝒂𝑖 ⊆ 𝑽, and 𝒖𝑖 is a realization of 𝑋𝑖's parents in 𝑼, i.e., 𝑼𝑖 ⊆ 𝑼


    Causal Graph

• Each causal model ℳ is associated with a directed graph 𝒢 = (𝒱, ℰ), where
    ▪ 𝒱 is the set of nodes representing the variables 𝑼 ∪ 𝑽 in ℳ;
    ▪ ℰ is the set of edges determined by the structural equations in ℳ: for each 𝑋𝑖, there is an edge pointing from each of its parents in 𝑷𝒂𝑖 ∪ 𝑼𝑖 to it
        – Each directed edge represents a potential direct causal relationship
        – Absence of a direct edge represents zero direct causal relationship
• Assuming the acyclicity of causality, 𝒢 is a directed acyclic graph (DAG)
• Standard terminology
    ▪ parent, child, ancestor, descendant, path, directed path


    Markovian Model

• A causal model is Markovian if
    ❶ The causal graph is a DAG
    ❷ All variables in 𝑼 are mutually independent
• Equivalent expression: each node 𝑋 is conditionally independent of its non-descendants given its parents 𝑷𝒂𝑋
• Known as the local Markov condition (e.g., in Bayesian networks), or the causal Markov condition in the context of causal modeling
    ▪ it echoes the fact that information flows from direct causes to their effects, and every dependence between a node and its non-descendants involves the direct causes!


    Conditional Independence

• A node is conditionally independent of its non-descendants given its parents
• A node is conditionally independent of all other nodes in the network given its Markov blanket, i.e., its parents, children and children's parents


    A Markovian Model and its Graph

Observed variables: 𝑽 = {𝐼, 𝐻, 𝑊, 𝐸}
Hidden variables: 𝑼 = {𝑈𝐼, 𝑈𝐻, 𝑈𝑊, 𝑈𝐸}

Model (𝑀):
    𝑖 = 𝑓𝐼(𝑢𝐼)
    ℎ = 𝑓𝐻(𝑖, 𝑢𝐻)
    𝑤 = 𝑓𝑊(ℎ, 𝑢𝑊)
    𝑒 = 𝑓𝐸(𝑖, ℎ, 𝑤, 𝑢𝐸)

Graph (𝐺): 𝐼 → 𝐻 → 𝑊, with 𝐼, 𝐻 and 𝑊 all pointing into 𝐸, and each 𝑈 variable pointing into its endogenous variable
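A runnable sketch of a Markovian model with this shape; the concrete structural functions and noise distributions below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n=5):
    # Independent exogenous variables U_I, U_H, U_W, U_E (Markovian assumption).
    u_i, u_h, u_w, u_e = (rng.normal(size=n) for _ in range(4))
    i = u_i                                  # i = f_I(u_I)
    h = 0.5 * i + u_h                        # h = f_H(i, u_H)
    w = 0.8 * h + u_w                        # w = f_W(h, u_W)
    e = 0.3 * i + 0.2 * h + 0.4 * w + u_e    # e = f_E(i, h, w, u_E)
    return i, h, w, e

print(sample())
```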


    Causal Graph of Markovian Model

    Each node is associated with an observable

    conditional probability table (CPT) 𝑃(𝑥𝑖|𝒑𝒂𝑖)

    • We can read off from the causal graph all the conditional independence relationships

    encoded in the causal model (graph) by using a graphical criterion called d-separation


    Definition of d-separation

• A path 𝑞 is said to be blocked by conditioning on a set 𝒁 if
    ▪ 𝑞 contains a chain 𝑖 → 𝑚 → 𝑗 or a fork 𝑖 ← 𝑚 → 𝑗 such that the middle node 𝑚 is in 𝒁, or
    ▪ 𝑞 contains a collider 𝑖 → 𝑚 ← 𝑗 such that the middle node 𝑚 is not in 𝒁 and no descendant of 𝑚 is in 𝒁
• 𝒁 is said to d-separate 𝑋 and 𝑌 [P. Spirtes, C. Glymour, R. Scheines 2000] if 𝒁 blocks every path from 𝑋 to 𝑌, denoted by (𝑋 ⊥ 𝑌 | 𝒁)𝐺
    ▪ With the Markov condition, the d-separation criterion in the causal graph G and the conditional independence relations in a dataset D are connected: if we have (𝑋 ⊥ 𝑌 | 𝒁)𝐺, then we must have (𝑋 ⊥ 𝑌 | 𝒁)Data
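A minimal sketch of checking d-separation programmatically, assuming a networkx version that still exposes d_separated (newer releases rename it to is_d_separator); the small DAG below is illustrative, not the slide's graph:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("A", "R"),   # protected attribute -> redlining attribute
    ("R", "D"),   # redlining attribute -> decision
    ("U", "R"),   # another cause of R, making R a collider for A and U
])

print(nx.d_separated(G, {"A"}, {"D"}, {"R"}))   # True: conditioning on R blocks the chain A -> R -> D
print(nx.d_separated(G, {"A"}, {"U"}, set()))   # True: the collider R blocks A -> R <- U when unconditioned
print(nx.d_separated(G, {"A"}, {"U"}, {"R"}))   # False: conditioning on the collider R opens the path
```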


    d-separation Examples

In the example graph, the path from 𝑋 to 𝑌 is blocked by the following d-separation relations:
    (𝑋 ⊥ 𝑌 | 𝑍)𝐺, (𝑋 ⊥ 𝑌 | 𝑈)𝐺, (𝑋 ⊥ 𝑌 | 𝑍, 𝑈)𝐺,
    (𝑋 ⊥ 𝑌 | 𝑍, 𝑊)𝐺, (𝑋 ⊥ 𝑌 | 𝑈, 𝑊)𝐺, (𝑋 ⊥ 𝑌 | 𝑍, 𝑈, 𝑊)𝐺,
    (𝑋 ⊥ 𝑌 | 𝑉, 𝑍, 𝑈, 𝑊)𝐺

However, we do NOT have (𝑋 ⊥ 𝑌 | 𝑉, 𝑍, 𝑈)𝐺


    Causal Model of Predictions

Observed variables: 𝑽 = {A, …, Xi, …, D}    Hidden variables: 𝑼

Model (𝑀):
    a = 𝑓A(𝑝𝑎A, 𝑢A)
    xi = 𝑓i(𝑝𝑎𝑖, 𝑢xi), i = 1, …, m
    d = 𝑓D(𝑝𝑎D, 𝑢D)

Graph (𝐺): nodes A, 𝑿 and D, each annotated with its CPT, e.g., P(a | 𝑝𝑎A) and P(d | 𝑝𝑎D)

𝑼A, ⋯, 𝑼𝑖, ⋯, 𝑼D are mutually independent (Markovian assumption)


    Modeling Discrimination as Path Specific Effects

• Direct and indirect discrimination can be captured by the causal effects of A on D transmitted along different paths [L. Zhang et al. 2017], [N. Kilbertus et al. 2017], [R. Nabi, I. Shpitser 2018]
    ▪ Direct: causal effect along the direct edge 𝜋𝑑 from A to D
    ▪ Indirect: causal effect along causal paths 𝜋𝑖 from A to D that pass through redlining attributes R
    ▪ If we observe P(d+ | a+, x) ≠ P(d+ | a−, x) for each value assignment x of a set X that blocks all indirect paths from A to D, the difference must be due to the existence of the direct causal effect of A on D


Quantitative Measuring

• 𝜋𝑑-specific effect (Q ranges over the values of D's parents other than A):
    AD𝜋𝑑(𝑎+, 𝑎−) = Σq P(d+ | 𝑎+, q) P(q | 𝑎−) − P(d+ | 𝑎−)
• 𝜋𝑖-specific effect:
    AD𝜋𝑖(𝑎+, 𝑎−) = Σq [ P(d+ | 𝑎−, q) Π_{G ∈ A𝜋𝑖} P(g | 𝑎+, paG∖{A}) Π_{H ∈ Ā𝜋𝑖∖{D}} P(h | 𝑎−, paH∖{A}) Π_{O ∈ V∖ChildA} P(o | paO) ] − P(d+ | 𝑎−)
    ▪ A𝜋𝑖: A's children that lie on paths in 𝜋𝑖; Ā𝜋𝑖: A's children that do not lie on paths in 𝜋𝑖


    Causal Paths Example

• A bank makes loan decisions based on the zip codes, races, and incomes of the applicants
    ▪ Inadmissible attributes
        – Race: protected attribute
        – Zip Code: redlining attribute
    ▪ Admissible attributes
        – Income: non-protected
    ▪ Decision: Loan


    Causal Paths Fairness Metrics: Example

• AD𝜋𝑑(𝑎+, 𝑎−) = Σ_{Z,I} ( P(d+ | 𝑎+, z, i) − P(d+ | 𝑎−, z, i) ) P(z | 𝑎−) P(i | 𝑎−)
• AD𝜋𝑖(𝑎+, 𝑎−) = Σ_{Z,I} P(d+ | 𝑎−, z, i) ( P(z | 𝑎+) − P(z | 𝑎−) ) P(i | 𝑎−)
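A minimal sketch computing both path-specific effects for this loan example; the conditional probability tables for P(z|a), P(i|a) and P(d+|a,z,i) below are invented:

```python
import itertools

p_z = {"a+": {0: 0.3, 1: 0.7}, "a-": {0: 0.7, 1: 0.3}}   # P(Z=z | A)
p_i = {"a+": {0: 0.4, 1: 0.6}, "a-": {0: 0.6, 1: 0.4}}   # P(I=i | A)
p_d = {                                                   # P(D=d+ | A, Z, I)
    ("a+", 0, 0): 0.2, ("a+", 0, 1): 0.5, ("a+", 1, 0): 0.4, ("a+", 1, 1): 0.8,
    ("a-", 0, 0): 0.1, ("a-", 0, 1): 0.4, ("a-", 1, 0): 0.3, ("a-", 1, 1): 0.7,
}

# Direct (pi_d-specific) effect: only the direct edge A -> D receives a+.
direct = sum((p_d[("a+", z, i)] - p_d[("a-", z, i)]) * p_z["a-"][z] * p_i["a-"][i]
             for z, i in itertools.product([0, 1], repeat=2))

# Indirect (pi_i-specific) effect: only the redlining attribute Z receives a+.
indirect = sum(p_d[("a-", z, i)] * (p_z["a+"][z] - p_z["a-"][z]) * p_i["a-"][i]
               for z, i in itertools.product([0, 1], repeat=2))

print(f"direct (pi_d-specific) effect:   {direct:.3f}")
print(f"indirect (pi_i-specific) effect: {indirect:.3f}")
```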


    Causal Effect vs. Risk Difference

• The total causal effect of A (changing from 𝑎− to 𝑎+) on D is given by
    TE(𝑎+, 𝑎−) = P(d+ | do(𝑎+)) − P(d+ | do(𝑎−))
  transmitted along all causal paths from A to D
• Connection with the risk difference: when A has no parents in the causal graph (no confounding of A and D), conditioning on A coincides with intervening on it, so
    TE(𝑎+, 𝑎−) = P(d+ | 𝑎+) − P(d+ | 𝑎−)


    Total Causal Effect vs. Path-Specific Effect

• For any 𝜋𝑑 and 𝜋𝑖, we don't necessarily have
    SE𝜋𝑑(𝑎+, 𝑎−) + SE𝜋𝑖(𝑎+, 𝑎−) = SE𝜋𝑑∪𝜋𝑖(𝑎+, 𝑎−)
• If 𝜋𝑖 contains all causal paths from A to D except 𝜋𝑑, then
    TE(𝑎+, 𝑎−) = SE𝜋𝑑(𝑎+, 𝑎−) − SE𝜋𝑖(𝑎−, 𝑎+)
  where SE𝜋𝑖(𝑎−, 𝑎+) is the "reverse" 𝜋𝑖-specific effect


    Summing-Up

• Observational criteria can help discover discrimination, but are insufficient on their own
    ▪ No conclusive proof of (un-)fairness
• A causal viewpoint can help articulate problems and organize assumptions
    ▪ Social questions start with measurement
    ▪ Human scrutiny and expertise are irreplaceable
• What to do with the different flavors of fairness?
    ▪ constrain the decision function to satisfy a fairness flavor, or
    ▪ design interventions to reduce disparities in input variables and outcomes, which would reduce disparities in decisions in the long term [C. Barabas et al. 2018], [J. Jackson, T. VanderWeele 2018]


    End-to-End ML Pipelines

• Research on fairness, accountability, and transparency (FAT) of ML algorithms and their outputs focuses solely on the final steps of the data science lifecycle!
• The ML literature generally assumes clean training datasets (no missing, erroneous or duplicate values) and focuses on optimizing fairness metrics during pre-, in- or post-processing

[Figure: end-to-end ML pipeline, including a normalization step]


    Upstream Discrimination Prevention

• Industry practitioners more often turn their attention to the data first [K. Holstein et al. 2019]
    ▪ 65% of survey respondents reported having control over data collection or curation
• Need tools for creating datasets that support fairness upstream, i.e., that address the root cause of statistical bias in the data models are trained on [I. Chen, et al. 2018; B. Nushi et al. 2018]
    ▪ e.g., tools to diagnose whether a given fairness issue might be addressed by collecting more training data from a particular subpopulation or by better cleaning the existing training data
        … and to predict how much more data is needed or how much has to be repaired
    ▪ i.e., tools to help actively guide data collection and pre-processing pipelines in order to jointly optimize fairness and accuracy of downstream models


    Preliminary Remarks

• Data quality issues have the potential to introduce unintended bias and variability in the data that could have a crucial impact in high-stakes applications
• The impact of upstream data cleaning on the performance of downstream ML models depends on
    ▪ the data quality issues and their distribution in the datasets (unknown)
    ▪ the effectiveness of the data repairing algorithms (also unknown without ground truth)
    ▪ the internal structure of the ML models (hard to interpret for highly accurate models)
• It is known that the statistical distortion of datasets by cleaning tasks matters when estimating models' accuracy [T. Dasu, J. M. Loh 2012], yet its impact on the models' fairness has only recently started to be investigated [S. Schelter 2019] (for missing-value imputation)


    Responsible Data Science Pipelines

• Interplay of bias mitigation and data repairing interventions on data fairness and utility for downstream ML models
    ▪ e.g., data from ethnic minorities may be noisier than data collected from the majority ethnic group

[Figure: pipeline from raw data (with data bias and dirty data) through unfairness mitigation and data cleaning to an ML model evaluated for accuracy and fairness; tools shown include IBM AI Fairness 360, FairBench, FairTest, FairnessMetrics, Themis, FairPrep]


    Questions?

    https://www.lepoint.fr/invites-du-point/aurelie-jean-il-ne-faut-pas-reguler-les-algorithmes-mais-les-pratiques-15-09-2019-2335766_420.php
