EECS 298: Social Consequences of Computing

Lab 8

Task

In this lab, you will build another simple machine learning model using the South German Credit dataset. After building the model, you will practice computing the demographic parity fairness metric and plotting results with matplotlib.

To get started, first download the South German Credit dataset using wget and then create a file called lab9.py in the same folder as the downloaded dataset.

$ wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/south-german-credit-lab9.csv

Data - South German Credit Dataset

The South German Credit dataset contains attributes of loan applicants that are relevant to their loan applications. The target variable is LoanDefault, which is 1 if the applicant does not default on their loan (i.e., they pay it back) and 2 if the applicant does default (i.e., they do not pay it back). Refer to the dataset documentation for more details on all of the included attributes.

Since we will calculate a fairness metric in this lab, we must define the sensitive attribute. In this setting, it is Age.

lab9.py

First, import numpy, csv, sklearn, and matplotlib as follows and set the numpy random seed. If you need to install matplotlib, run pip install matplotlib in your terminal.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection, linear_model
import csv

np.random.seed(298)

Read in south-german-credit-lab9.csv however you’d like and separate all the features into a list called credit_training_features and all the labels (the LoanDefault column) into a list called credit_training_labels. Make sure to cast all the data to ints as you read them in. As you read in the data, convert the Age variable to a binary variable as follows:

Age < 35, change the age to 0.
Age >= 35, change the age to 1.
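A minimal sketch of the reading and binarizing step, assuming the CSV has a header row with columns named Age and LoanDefault (check the actual file before relying on this). The in-memory sample data, the Duration and Amount column names, and the helper name load_credit_data are hypothetical and exist only so the sketch runs on its own.

```python
import csv
import io

# Hypothetical stand-in for the real file so this sketch is self-contained.
SAMPLE_CSV = """Duration,Amount,Age,LoanDefault
12,1000,22,1
24,2500,35,2
6,800,51,1
"""

def load_credit_data(csv_file):
    """Return (features, labels) with Age binarized: 0 if Age < 35, else 1."""
    features, labels = [], []
    for row in csv.DictReader(csv_file):
        # Cast every value to int as it is read in.
        values = {name: int(value) for name, value in row.items()}
        # Binarize the sensitive attribute.
        values["Age"] = 0 if values["Age"] < 35 else 1
        # The label is the LoanDefault column; everything else is a feature.
        labels.append(values.pop("LoanDefault"))
        features.append(list(values.values()))
    return features, labels

credit_training_features, credit_training_labels = load_credit_data(
    io.StringIO(SAMPLE_CSV)
)
print(credit_training_features)
print(credit_training_labels)
```

With the real data, io.StringIO(SAMPLE_CSV) would be replaced by a file object such as open("south-german-credit-lab9.csv", newline="").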

Next, split the data into training and testing sets using model_selection.train_test_split(credit_training_features, credit_training_labels). Then, train a LogisticRegression model on the training data. Finally, generate predictions on the testing data. Refer to Lab 7 for a refresher on these steps using sklearn.
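The split, fit, and predict steps could look roughly like the following sketch. The synthetic features and labels are made up so the example runs on its own, and max_iter=1000 is an optional setting chosen here to give the solver room to converge; it is not required by the lab.

```python
import numpy as np
from sklearn import model_selection, linear_model

# With random_state left unset, train_test_split draws from numpy's global
# random state, so seeding numpy makes the split repeatable.
np.random.seed(298)

# Made-up stand-in data: 40 applicants, 3 integer features, labels 1 or 2.
features = np.random.randint(0, 10, size=(40, 3)).tolist()
labels = [1, 2] * 20

# Default split: 75% training, 25% testing.
train_X, test_X, train_y, test_y = model_selection.train_test_split(
    features, labels
)

# Fit logistic regression on the training split only.
model = linear_model.LogisticRegression(max_iter=1000)
model.fit(train_X, train_y)

# One predicted label (1 or 2) per test input.
test_predictions = model.predict(test_X)
print(test_predictions)
```

Keeping test_X around after prediction is useful, since the age category of each test input is needed for the fairness calculation.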

Now, we will use these predictions to calculate empirical probabilities to use in our fairness metric calculation.

Demographic Parity

Let \(A\) be a random variable for the sensitive attribute. In this case, \(A\) is a binary variable indicating which of the age categories specified above an applicant belongs to. Let \(\hat{Y}\) be a random variable for the predicted label for each input. In this case, \(\hat{Y}\) is 1 if an applicant is predicted to not default and 2 if they are predicted to default.

Recall the definition of demographic parity from lecture: \(P(\hat{Y} =1| A = \text{Age < 35}) = P(\hat{Y}=1 | A = \text{Age >= 35})\)

Calculate each probability above from the predictions on the testing data and store the results in a dictionary keyed by age category, so that printing the dictionary gives output in the following form:

Test probabilities: {0: 0.697841726618705, 1: 0.8288288288288288}

From these outputs, we can see that demographic parity is not satisfied: the probabilities for the two age categories differ substantially (about 0.70 versus 0.83), which indicates that \(\hat{Y}\) is not statistically independent of \(A\).
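One way to estimate these probabilities: for each age category, divide the number of test inputs in that category with a prediction of 1 by the total number of test inputs in that category. The sketch below does this with made-up lists. The function name empirical_positive_rates and the example values are hypothetical, and the sketch assumes the binarized age of each test input is available in the same order as the predictions.

```python
def empirical_positive_rates(age_groups, predictions, positive_label=1):
    """Estimate P(Y_hat = positive_label | A = a) for each age category a."""
    totals = {}     # number of inputs seen per age category
    positives = {}  # number of those inputs predicted positive_label
    for age, prediction in zip(age_groups, predictions):
        totals[age] = totals.get(age, 0) + 1
        if prediction == positive_label:
            positives[age] = positives.get(age, 0) + 1
    return {age: positives.get(age, 0) / totals[age] for age in sorted(totals)}

# Made-up example: 3 applicants under 35 (0) and 4 aged 35 or older (1).
example_ages = [0, 0, 0, 1, 1, 1, 1]
example_predictions = [1, 2, 1, 1, 1, 1, 2]

test_probabilities = empirical_positive_rates(example_ages, example_predictions)
print("Test probabilities:", test_probabilities)
```

In this made-up example, two of the three inputs in category 0 and three of the four inputs in category 1 are predicted 1, so the estimated rates are 2/3 and 3/4.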

Finally, you will use matplotlib to plot a stacked bar graph which shows, for each age category, the number of inputs that received a prediction of 1 and the number of inputs that received a prediction of 2. You will use plt.bar() and you can read about the settings to create a stacked bar chart here
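A minimal sketch of a stacked bar chart, using made-up predictions rather than the lab's actual results. The key setting is the bottom argument of plt.bar(), which starts the second set of bars where the first set ends. The helper name count_predictions, the Agg backend, and the output filename are choices made here so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend; omit this to use plt.show()
import matplotlib.pyplot as plt

# Made-up age categories and predicted labels, for illustration only.
example_ages = [0, 0, 0, 1, 1, 1, 1]
example_predictions = [1, 2, 1, 1, 1, 1, 2]

def count_predictions(ages, predictions, category, label):
    """Count inputs in the given age category that received the given label."""
    return sum(1 for a, p in zip(ages, predictions) if a == category and p == label)

categories = [0, 1]
count_pred_1 = [count_predictions(example_ages, example_predictions, c, 1)
                for c in categories]
count_pred_2 = [count_predictions(example_ages, example_predictions, c, 2)
                for c in categories]

group_names = ["Age < 35", "Age >= 35"]
plt.bar(group_names, count_pred_1, color="tab:blue", label="Predicted 1")
# bottom= stacks the second series on top of the first.
plt.bar(group_names, count_pred_2, bottom=count_pred_1,
        color="tab:orange", label="Predicted 2")

plt.xlabel("Age category")
plt.ylabel("Number of test inputs")
plt.title("Predictions by age category")
plt.legend()
plt.savefig("stacked_bar_example.png")
```

The total height of each stacked bar equals the number of test inputs in that age category.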

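Your final bar graph should look as follows (don’t worry if your formatting is not exactly the same, but the shape of the graph should be the same):

[Figure: stacked bar graph showing, for each age category, the number of predictions of 1 and of 2]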

Turn in lab9.py on Gradescope and turn in your bar graph to the Lab 8 Graph Submission on Gradescope as well.

Tips

Plotting a graph

The example below creates a simple plot using matplotlib.

import matplotlib.pyplot as plt # Import the library

# Data to graph
x = [1, 2, 3, 4, 5] 
y = [5, 10, 15, 20, 25]

plt.scatter(x, y) # Create a graph
# Maps data pairs to x-y coordinates 
# graph will include the points (1, 5), (2, 10), and so on

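# Once a graph is created, you have a few options.
# If you want both, save before showing: after plt.show() closes the
# window, plt.savefig() may write a blank figure.
plt.savefig("my_plot.png") # Save the figure to the specified filepath
plt.show() # Open the graph in a pop-up window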

Customizing your graph

Find the list of all available colors here.

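# Example data for two lines (placeholder values)
x1, y1 = [1, 2, 3], [2, 4, 6]
x2, y2 = [1, 2, 3], [1, 3, 5]

# Specify the color and label for your line
plt.plot(x1, y1, color="tab:blue", label="Line 1 label")

# Can plot multiple lines per graph
plt.plot(x2, y2, color="tab:green", label="Line 2 label")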

# Label your axes as follows:
plt.xlabel('Name of X-axis') 
plt.ylabel('Name of Y-axis') 
  
# Display a title on the graph:
plt.title("Title of graph")

# Show the legend and optionally specify its location to display the labels of each line
plt.legend(loc="upper center")
plt.show()