Measuring fairness in AI

Machine learning (ML) is playing an ever-bigger role in everyday life by helping us make decisions – from choosing which movies to watch on Netflix or which news to read on Facebook to potentially life-changing calls on job applications, loans or healthcare. But can we call these automated decisions made by Artificial Intelligence (AI) systems fair? Can we rely on ML algorithms to decide on loan and job applications, or are they prone to biases that negatively impact certain groups of people? Unfortunately, biased decisions made by AI have been ubiquitous in recent years: Google’s search engine seems to think the majority of CEOs are white males, Amazon’s free delivery appears to be biased against black neighborhoods in US cities and its hiring model was biased against women, and job adverts have been proven to be shown more often to men than to women.

Such unfairness has different causes. Problems can stem from what an algorithm is trying to optimize for. A certain type of job advert being shown to more men than women can result from an algorithm that optimizes for advertising cost-efficiency in an industry that charges more to deliver ads to younger women than to men. More often, the problem is the data itself – for example, in the past, CEOs were mostly white and male, and such outdated information may still power the latest AI systems. Flawed data produces flawed results, or, as computer scientists like to say: “Garbage in, garbage out.” But how can we decide whether a system is unfair? What can a company do to avoid negative news headlines like those above?

How to measure fairness in the context of decision-making?

Fairness in the context of decision-making “is the absence of any prejudice or favoritism toward an individual or group based on their inherent or acquired characteristics,” as researchers at the University of Southern California’s Information Sciences Institute put it. But how can it be measured? Let’s imagine an ML model that assesses the creditworthiness of individuals applying for a bank loan. Using credit histories, income, age, criminal records etc., it classifies each person as having either good credit (and so likely to pay back a loan) or bad credit (likely to default). We need to take a closer look at the decisions the ML model made – and whether individuals deemed creditworthy actually paid back the loans they were given.

How we assess the fairness of classifications in ML models

The so-called German Credit Dataset, for example, contains information about 1000 loan applicants. Each person is classified as good or bad credit risk according to a set of attributes – what they need the loan for, their age, gender, whether they own housing, how much they have saved, etc. Using tools built by DAIN Studios, we can assess the fairness of these classifications in a number of ways. We first choose what data scientists call the “protected variable”, the variable that creates the population set we are interested in. This can be ethnicity, gender, age or income group, etc. – or a combination of these, if we’re interested in, say, how people of a certain age group and ethnicity are treated by the ML model.

Let’s use gender. One of the most common definitions of fairness is Demographic or Statistical Parity, which in this example would require that the outcomes of the algorithm are the same for males and females. This means the German Credit Dataset should show a loan-application acceptance rate that is roughly the same for men and women. In the credit dataset, 73% of males received the loan they applied for, whereas only 61% of women did. Whether we reject this result as unfair depends on how strict we are. One common approach is to accept a difference of up to 20 percentage points between the two values. In this case, that would allow us to say that the dataset’s ML model satisfies Demographic Parity when it comes to gender.
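The parity check above can be sketched in a few lines of Python. This is a hypothetical illustration, not DAIN Studios’ actual tooling; only the two acceptance rates (73% and 61%) come from the text.

```python
# Illustrative sketch of a Demographic Parity check on per-group
# acceptance rates. The rates below are those quoted in the article.

def demographic_parity_gap(acceptance_rates):
    """Largest difference in acceptance rate between any two groups."""
    rates = list(acceptance_rates.values())
    return max(rates) - min(rates)

rates = {"male": 0.73, "female": 0.61}
gap = demographic_parity_gap(rates)

# With a tolerance of 20 percentage points, the 12-point gap passes.
print(f"gap = {gap:.2f}, fair = {gap <= 0.20}")
```

The 20-point tolerance is a convention, not a law of nature; a stricter threshold would flag the same numbers as unfair.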


One problem with Demographic Parity is that it can be satisfied in unfair ways. For example, with the help of the model we could carefully select the top 80% of male applicants while randomly rejecting every fifth female applicant. The outcome would look fair according to Demographic Parity. An alternative fairness definition is called Equalized Odds, which in this case ensures that all creditworthy applicants have the same chance of getting a loan, regardless of whether they are men or women. This condition is fulfilled if male and female applicants have the same rates of so-called true positives (loans granted to creditworthy borrowers) and false positives (loans granted to risky borrowers).
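A quick simulation makes the gaming concrete. This is a hypothetical sketch with randomly generated credit scores; nothing here comes from the German Credit Dataset.

```python
# Accept the best-scored 80% of men but a random 80% of women:
# acceptance rates match, yet the two groups are treated differently.
import random

random.seed(0)  # reproducible illustration
men = sorted(random.random() for _ in range(1000))    # invented credit scores
women = sorted(random.random() for _ in range(1000))

accepted_men = men[200:]                    # the best-scored 80% of men
accepted_women = random.sample(women, 800)  # a random 80% of women

mean = lambda xs: sum(xs) / len(xs)
# Acceptance rates match, so Demographic Parity holds...
print(len(accepted_men) / 1000, len(accepted_women) / 1000)  # 0.8 0.8
# ...yet accepted men are systematically higher-scored than accepted women.
print(mean(accepted_men) > mean(accepted_women))  # True
```

Equalized Odds catches exactly this gap, because the randomly rejected women include creditworthy applicants while the rejected men do not.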

The so-called confusion matrix below helps to make sense of this. Taking data for males and females from our German Credit Dataset, we can determine which proportion of each gender was rightly given a loan (true positives) and which was wrongly given one (false positives). Doing this shows us that if we follow the model’s recommendations, 70% of creditworthy women and 84% of creditworthy men are given loans, while 48% of credit-unworthy women and 40% of credit-unworthy men are also able to borrow. Whether the ML model treats men and women the same again depends on how seriously we take the discrepancies between these pairs of error rates.
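These error rates can be reproduced from confusion-matrix counts. The counts below are invented so that they yield the percentages quoted above; they are not the actual dataset tallies.

```python
# Hypothetical sketch: the per-group true-positive and false-positive
# rates that Equalized Odds compares, from confusion-matrix counts.

def rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate)."""
    return tp / (tp + fn), fp / (fp + tn)

# Invented counts chosen to reproduce the percentages in the text.
male_tpr, male_fpr = rates(tp=84, fn=16, fp=40, tn=60)
female_tpr, female_fpr = rates(tp=70, fn=30, fp=48, tn=52)

# Equalized Odds asks both gaps to be (close to) zero.
print(round(male_tpr - female_tpr, 2))  # 0.14: creditworthy men favored
print(round(male_fpr - female_fpr, 2))  # -0.08: risky women favored
```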

[Image: the ML model’s decisions – confusion matrices for males and females]
A third fairness criterion is Sufficiency. As with Equalized Odds, two conditions have to be met to ensure fairness. With bank loans, this means, first, that the proportion of borrowers who actually pay back their loan out of those the model predicts will pay it back should be the same for males and females. Second, the proportion of people who actually default out of those the model predicts will default should be the same for males and females. As data scientists say, the positive predictive value (PPV) and the negative predictive value (NPV) have to be equal for the unprotected and protected groups – the latter being monitored for potential unfairness. The German Credit Dataset shows a big discrepancy in PPV (65% for women, 86% for men), while satisfying the equality of NPV (57% for both).
[Image: the ML model’s decisions – predictive values by gender]
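In code, the Sufficiency comparison reduces to checking the gaps in PPV and NPV across groups. This minimal sketch uses the predictive values quoted above rather than raw counts.

```python
# Sufficiency check: PPV (of those predicted to repay, the share who do)
# and NPV (of those predicted to default, the share who do) should be
# equal across groups. Values below are those quoted in the article.

ppv = {"male": 0.86, "female": 0.65}
npv = {"male": 0.57, "female": 0.57}

ppv_gap = round(max(ppv.values()) - min(ppv.values()), 2)
npv_gap = round(max(npv.values()) - min(npv.values()), 2)

print(ppv_gap)  # 0.21 -> PPV differs: this half of Sufficiency fails
print(npv_gap)  # 0.0  -> NPV is equal for both groups
```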
We see that there are many different ways to define and measure fairness. The different ways of measuring gender-based fairness of the German Credit Dataset tell us that a slightly smaller proportion of women (61% of female applicants) than men (73%) got a loan, that a smaller proportion of creditworthy women (70%) than creditworthy men (84%) were deemed eligible by the model, but that a slightly higher proportion of ineligible female borrowers can still get a loan (48% of credit-unworthy women compared with 40% of credit-unworthy men). In addition, the ML model was better at predicting whether male borrowers would repay loans, although it was equally good at predicting the rates of male and female defaulters.

To measure the fairness of AI systems, we need to select a fairness metric depending on our specific use case.

There are many more fairness definitions and metrics, but most draw on the three types described. How can we create algorithms that satisfy them all? Luckily, we don’t have to. The Impossibility Theorem of Fairness states that the three fairness criteria – Independence (aka Demographic Parity), Separation (aka Equalized Odds) and Sufficiency – cannot all be satisfied at the same time. This means we need to make trade-offs when thinking about the fairness of AI systems, and we need to select a fairness metric depending on our use case. Equalized Odds, for example, would not be the most suitable metric for studying the fairness of problems where false positives are rare, such as SARS-CoV-2 antigen tests. We should also think about whether the decision the ML model is meant to support is punitive (like determining parole) or assistive (like granting a loan).

[Image: fairness criteria in AI – the fairness tree]
The fairness tree developed by the Aequitas project at Carnegie Mellon University can help to steer us from our broad aim at its top to the recommended fairness metric at its base. With the German Credit Dataset, for example, we have focused on errors rather than representations. The granting of loans is an assistive intervention that affects a large percentage of the population. If we are most worried about being unfair to people without assistance (rejected loan applicants), the best fairness metric would be something called False Omission Rate (FOR) parity. (Since FOR = 1 - NPV, this metric also relates to the Sufficiency criterion.)
[Image: false omission rate by gender]
Looking at FOR values, we see that 43% of the women the model predicted to default would actually have paid back their loan – exactly the same proportion as among men. Even though the model wrongly denies loans to 43% of the male and female applicants it predicts will default, it treats men and women the same. Therefore, in this case, if we focus on FOR as our fairness metric and females as our protected demographic group, we can confirm that our ML-assisted credit-scoring decisions do not show “any prejudice or favoritism toward an individual or group”. Fairness can be measured – once we have identified the right fairness definition to use.
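The FOR computation itself is tiny. In this sketch the counts are invented to reproduce the 43% figure; only the NPV of 57% comes from the text.

```python
# False Omission Rate parity: FOR = FN / (FN + TN), the share of
# applicants predicted to default who would in fact have repaid.
# Equivalently, FOR = 1 - NPV.

def false_omission_rate(fn, tn):
    """Share of predicted defaulters who would actually have repaid."""
    return fn / (fn + tn)

# Invented counts reproducing the 43% figure, identical for both
# genders, so FOR parity holds.
female_for = false_omission_rate(fn=43, tn=57)
male_for = false_omission_rate(fn=43, tn=57)
print(female_for == male_for)  # True
print(round(1 - 0.57, 2))      # 0.43, the same value via 1 - NPV
```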

References

In the first of four articles, Máté Váradi explains the three main fairness criteria and why decision makers always have to choose one over the others, depending on what’s at stake. This work has been done as part of the Artificial Intelligence Governance and Auditing (AIGA) project, funded by Business Finland’s AI Business Program.

Details

Title: Fair and explainable AI: measuring fairness in automated decision-making
Author:
DAIN Studios, Data & AI Strategy Consultancy
Updated on September 11, 2022
