EECS 298: Social Consequences of Computing

Lab 9

Task

In this lab, you will use pandas to build a simple machine learning model with the South German Credit dataset. After building the model, you will report some empirical probability measures using a specified assumption about moral desert in this setting.

To get started, first download the South German Credit dataset using wget and then create a file called lab9.py in the same folder as the downloaded dataset.

$ wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/south-german-credit-lab-10.csv

Data - South German Credit Dataset

The South German Credit dataset contains attributes of loan applicants that are related to their loan application. The target variable is LoanDefault, which is 1 if the applicant does not default on their loan (i.e., they pay it back) and 0 if the applicant does default (i.e., they do not pay it back). Refer to the dataset documentation for more details on all of the included attributes.

Since we will be calculating a fairness metric in this lab, we must define the sensitive attribute for this setting. Here, the sensitive attribute is Age.

lab9.py

First, import pandas and sklearn as follows. If you need to install pandas, run pip install pandas in your terminal. If you need to install sklearn, run pip install scikit-learn.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Read south-german-credit-lab-10.csv into a dataframe using the pandas function read_csv(filename). Again, change Age to a binary variable by using the pandas function .apply on the Age column of the dataframe you read in, as follows:

Age < 35, change the age to 0.
Age >= 35, change the age to 1.
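The rule above can be sketched on a toy dataframe. This is a minimal illustration, not the lab solution: the dataframe name toy and its values are made up, and only the Age and LoanDefault column names come from the lab.

```python
import pandas as pd

# Toy stand-in for the real dataset; these ages are made up for illustration.
toy = pd.DataFrame({"Age": [22, 34, 35, 61], "LoanDefault": [1, 0, 1, 1]})

# Map ages under 35 to 0 and ages 35 and over to 1.
toy["Age"] = toy["Age"].apply(lambda age: 0 if age < 35 else 1)

print(toy["Age"].tolist())  # [0, 0, 1, 1]
```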

Next, split the data into training and testing sets using train_test_split(dataframe, test_size=0.3, random_state=2) where dataframe is what you read in above with the edited Age column. This function will return two dataframes, one for training data and one for testing data. You will now split each into 2 dataframes, one for the features and one for the labels using the pandas function .loc. Each feature dataframe should include all rows and all columns except the LoanDefault column. Each label dataframe should then include all rows in only the LoanDefault column.

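After splitting the data into these 4 dataframes, reset the index of each one using .reset_index(drop=True). Note that .reset_index(drop=True) returns a new dataframe rather than modifying the original, so assign the result back to your variable (or pass inplace=True).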
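A minimal sketch of the split, assuming a toy dataframe in place of the real data. The Duration column and all values are hypothetical. The labels here end up as a pandas Series because a single column name is passed to .loc, which sklearn accepts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with made-up values; the Duration column is hypothetical.
toy = pd.DataFrame({
    "Age": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "Duration": [6, 12, 24, 36, 48, 9, 18, 30, 15, 21],
    "LoanDefault": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
})

train_df, test_df = train_test_split(toy, test_size=0.3, random_state=2)

# Features: all rows, every column except the label column.
X_train = train_df.loc[:, train_df.columns != "LoanDefault"].reset_index(drop=True)
X_test = test_df.loc[:, test_df.columns != "LoanDefault"].reset_index(drop=True)

# Labels: all rows, only the label column.
y_train = train_df.loc[:, "LoanDefault"].reset_index(drop=True)
y_test = test_df.loc[:, "LoanDefault"].reset_index(drop=True)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

Resetting the index means the rows of each piece are numbered from 0 again, which makes it safe to line up the test features with a separate array of predictions later.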

Then, train a LogisticRegression model on the training data. Finally, generate predictions on the testing data. Refer to Lab 7 for a refresher on these steps using sklearn. This time, set max_iter=3000 in the LogisticRegression model constructor to avoid convergence warnings.
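As a rough sketch of the training and prediction steps, assuming small made-up feature and label sets (the Duration column and all values are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training and testing data; the Duration column is hypothetical.
X_train = pd.DataFrame({"Age": [0, 0, 1, 1, 0, 1], "Duration": [6, 48, 12, 36, 24, 9]})
y_train = pd.Series([1, 0, 1, 0, 0, 1], name="LoanDefault")
X_test = pd.DataFrame({"Age": [1, 0], "Duration": [10, 40]})

# max_iter=3000 gives the solver more iterations to converge.
model = LogisticRegression(max_iter=3000)
model.fit(X_train, y_train)

# predict returns a numpy array with one 0/1 label per test row.
predictions = model.predict(X_test)
print(predictions)
```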

Now, we will use these predictions to calculate some empirical probabilities.

Moral Desert

Let \(A\) be a random variable for the sensitive attribute. In this case \(A\) is a binary variable to indicate each age category specified above. Let \(\hat{Y}\) be a random variable for the predicted label for each input. In this case \(\hat{Y}\) is 1 if an applicant is predicted to not default and 0 if they are predicted to default.

Consider the feature CreditHistory and refer to the link in the introduction for the values this feature can have. Suppose we have the following criteria for moral desert in this loan lending scenario:

Each Age category is equally deserving of a positive prediction (\(\hat{Y} = 1\)) given that each person has “existing credits paid back duly till now”.
Each Age category is equally deserving of a negative prediction (\(\hat{Y} = 0\)) given that each person has “critical account/ other credits existing (not at this bank)”.

For each moral desert criterion, use what you learned in lecture to compute the corresponding empirical probability for each Age category on the testing data. Output your results to 5 decimal places (see Float Formatting in Strings below) as follows:

First probability calculation:
A = 1: 0.77966
A = 0: 0.75510
Second probability calculation:
A = 1: 0.10417
A = 0: 0.10526
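One possible way to organize such a calculation is sketched below. It assumes each criterion is read as a conditional probability of the form \(P(\hat{Y} = y \mid A = a, \text{CreditHistory} = c)\), estimated as a fraction of matching test rows. Check this reading against the lecture. The helper name empirical_prob, the data, and the CreditHistory code 2 are all placeholders, not the dataset's real encoding.

```python
import pandas as pd

# Made-up test features and predictions; CreditHistory code 2 is a placeholder,
# not the dataset's real encoding.
X_test = pd.DataFrame({
    "Age":           [1, 1, 1, 0, 0, 0, 1, 0],
    "CreditHistory": [2, 2, 2, 2, 2, 4, 4, 2],
})
predictions = pd.Series([1, 0, 1, 1, 0, 1, 0, 0])

def empirical_prob(features, preds, age, history, label):
    """Fraction of rows with Age == age and CreditHistory == history predicted as label."""
    group = (features["Age"] == age) & (features["CreditHistory"] == history)
    if group.sum() == 0:
        return float("nan")  # no rows fall in this group
    return (preds[group] == label).mean()

for a in (1, 0):
    p = empirical_prob(X_test, predictions, age=a, history=2, label=1)
    print(f"A = {a}: {p:.5f}")
# A = 1: 0.66667
# A = 0: 0.33333
```

The boolean mask only selects the right predictions if the features and predictions share the same 0-based index, which is one reason for resetting the index after the split.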

Turn in lab9.py on Gradescope when you’re done.

Tips

Pandas

pandas is a package for data analysis and you can read the [full documentation here](https://pandas.pydata.org/docs/reference/index.html). The main data structure in pandas is the DataFrame, which represents data as a table, and the functions in this package are built around this structure. A DataFrame has named columns and rows that have a value for each of these columns. The DataFrame also has an Index that keeps track of the label of each row. Below is a simple example of reading a csv file with pandas and using some of its functions.

Suppose file.csv looks as follows:

Col1,Col2,Col3
1,red,2
2,blue,5
9,green,1

Then, we can use pandas as follows:

import pandas as pd

df = pd.read_csv("file.csv") # creates a dataframe from our data

# Accessing columns
print(df["Col1"]) # prints the values in Col1

# Using .apply function
df["Col2"] = df["Col2"].apply(lambda x: 0 if (x == "red") or (x =="blue") else 1) # change Col2 to be 0-1 valued

# Using .loc function
smaller_df = df.loc[:, df.columns != "Col1"] # make a new dataframe with all rows (given by ":") and all columns except Col1

even_smaller_df = smaller_df.loc[lambda d: d["Col3"] > 2, :] # the lambda receives the whole dataframe; keep the same columns, but only rows that have value >2 in Col3

# Print out index of smaller dataframe
print(even_smaller_df.index) # the remaining rows keep the index values they had when they were read in, even though we dropped rows!

# Resetting indices
even_smaller_df.reset_index(drop=True, inplace=True) # resets the index of even_smaller_df while dropping the old index list

Shallow vs. Deep Copy

As you may recall from C++, there are different ways to copy objects to new variable names. The same is true in Python. Below is an example of 3 ways to copy a list in Python:

x = [[1,2],[3,4]] # starting with a list

# Direct assignment
y = x # y points to the same place in memory as x
x.append("new value")
print(y) # [[1,2],[3,4], "new value"] -- y was also modified!

# Shallow copy
y = list(x) # y is a different list than x, but points to the same children elements as x
x.append([5,6])
print(y) # [[1,2],[3,4], "new value"] -- y does not get the new value now
x[0][0] = 0
print(y) # [[0,2],[3,4], "new value"] -- y was also modified!

# Deep copy
import copy
y = copy.deepcopy(x)
x[0][1] = 3
x.append("another new value")
print(y) # [[0,2],[3,4], "new value", [5,6]] -- y is a snapshot of x at copy time and is not modified at all now!
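The three cases can also be checked directly with the is operator, which tests whether two names refer to the same object in memory. This is a small supplementary sketch; the variable names alias, shallow, and deep are chosen here for illustration.

```python
import copy

x = [[1, 2], [3, 4]]

alias = x                 # direct assignment
shallow = list(x)         # shallow copy
deep = copy.deepcopy(x)   # deep copy

print(alias is x)          # True: same outer list
print(shallow is x)        # False: a new outer list
print(shallow[0] is x[0])  # True: the inner lists are shared
print(deep[0] is x[0])     # False: the inner lists were copied too
```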

Float Formatting in Strings

When printing out floats, you may not always want to print out all the decimals. You can specify the number of decimal places to show inside an f-string with {variable:.Nf}, where N is the number of places to show. For example:

fraction = 2.0/9.8 # 0.2040816327...

print(f"2 decimal places: {fraction:.2f}") # 0.20
print(f"4 decimal places: {fraction:.4f}") # 0.2041
print(f"8 decimal places: {fraction:.8f}") # 0.20408163