EECS 298: Social Consequences of Computing
Lab 9
Task
In this lab, you will use pandas to build a simple machine learning model with the South German Credit dataset. After building the model, you will report some empirical probability measures using a specified assumption about moral desert in this setting.
To get started, download the South German Credit dataset using wget, then create a file called lab9.py in the same folder as the downloaded dataset.
$ wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/south-german-credit-lab-10.csv
Data - South German Credit Dataset
The South German Credit dataset contains attributes of loan applicants related to their loan applications. The target variable is LoanDefault, which is 1 if the applicant does not default on their loan (i.e., they pay it back) and 0 if the applicant does default (i.e., they do not pay it back). Refer to the dataset documentation for more details on all of the included attributes.
Since we will calculate a fairness metric in this lab, we must define the sensitive attribute. In this setting, the sensitive attribute is Age.
lab9.py
First, import pandas and sklearn as follows. If you need to install pandas, run pip install pandas in your terminal.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
Read south-german-credit-lab-10.csv into a dataframe using the pandas function read_csv(filename). Then change Age to a binary variable by using the pandas function .apply on the Age column of the dataframe you read in, as follows:
- If Age < 35, change the age to 0.
- If Age >= 35, change the age to 1.
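The binarization step can be sketched as follows. This is a minimal illustration on a small made-up dataframe, not the lab solution; in lab9.py the dataframe would come from read_csv instead, and the values below are invented for the example.

```python
import pandas as pd

# Toy stand-in for the real CSV; these values are made up for illustration.
df = pd.DataFrame({"Age": [22, 35, 41, 34], "LoanDefault": [1, 0, 1, 1]})

# Map ages below 35 to 0 and ages 35 and above to 1.
df["Age"] = df["Age"].apply(lambda age: 0 if age < 35 else 1)

print(df["Age"].tolist())  # [0, 1, 1, 0]
```

Note that .apply returns a new column, so the result must be assigned back to df["Age"] for the change to persist.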
Next, split the data into training and testing sets using train_test_split(dataframe, test_size=0.3, random_state=2), where dataframe is what you read in above with the edited Age column. This function returns two dataframes, one for training data and one for testing data. You will now split each of these into 2 dataframes, one for the features and one for the labels, using the pandas function .loc. Each feature dataframe should include all rows and all columns except the LoanDefault column. Each label dataframe should include all rows of only the LoanDefault column.
After splitting the data into these 4 dataframes, reset the index of each one using .reset_index(drop=True).
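The split described above might look like the sketch below. It runs on a made-up 10-row dataframe, and the column name Feature1 is a hypothetical placeholder for the dataset's real feature columns. The labels here are selected as a pandas Series, which is one reasonable choice for a single column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; "Feature1" is a hypothetical column, and all values are made up.
df = pd.DataFrame({
    "Age": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "Feature1": [5, 3, 8, 1, 7, 2, 9, 4, 6, 0],
    "LoanDefault": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
})

train_df, test_df = train_test_split(df, test_size=0.3, random_state=2)

# Features: all rows, every column except LoanDefault.
X_train = train_df.loc[:, train_df.columns != "LoanDefault"].reset_index(drop=True)
X_test = test_df.loc[:, test_df.columns != "LoanDefault"].reset_index(drop=True)

# Labels: all rows, only the LoanDefault column.
y_train = train_df.loc[:, "LoanDefault"].reset_index(drop=True)
y_test = test_df.loc[:, "LoanDefault"].reset_index(drop=True)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

Resetting the index matters because train_test_split shuffles rows but keeps their original index labels; after the reset, row i of the features lines up with position i of the labels and of any predictions made later.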
Then, train a LogisticRegression model on the training data. Finally, generate predictions on the testing data. Refer to Lab 7 for a refresher on these steps using sklearn. This time, set max_iter=3000 in the LogisticRegression model constructor to avoid convergence warnings.
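As a reminder of the fit-then-predict pattern, here is a small sketch on made-up, well-separated data. The single feature Feature1 is hypothetical; in the lab, the feature and label dataframes from the previous step would be used instead.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training and testing data with one hypothetical feature.
X_train = pd.DataFrame({"Feature1": [0, 1, 2, 3, 10, 11, 12, 13]})
y_train = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
X_test = pd.DataFrame({"Feature1": [1, 12]})

# max_iter raises the cap on solver iterations.
model = LogisticRegression(max_iter=3000)
model.fit(X_train, y_train)

# predict returns a NumPy array with one 0/1 label per test row.
predictions = model.predict(X_test)
print(predictions)
```

Because predict returns a NumPy array rather than a pandas object, wrapping it with pd.Series(predictions) can make it easier to filter alongside the test features.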
Now, we will use these predictions to calculate some empirical probabilities.
Moral Desert
Let \(A\) be a random variable for the sensitive attribute. In this case, \(A\) is a binary variable indicating the age category specified above. Let \(\hat{Y}\) be a random variable for the predicted label for each input. In this case, \(\hat{Y}\) is 1 if an applicant is predicted to not default and 0 if they are predicted to default.
Consider the feature CreditHistory and refer to the dataset documentation mentioned in the introduction for the values this feature can take. Suppose we have the following criteria for moral desert in this loan lending scenario:
- Each Age category is equally deserving of a positive prediction (\(\hat{Y} = 1\)) given that each person has “existing credits paid back duly till now”.
- Each Age category is equally deserving of a negative prediction (\(\hat{Y} = 0\)) given that each person has “critical account/ other credits existing (not at this bank)”.
Use what you learned in lecture to compute, on the testing data, the empirical probability for each Age category under each moral desert criterion. Output your results to 5 decimal places (see float formatting below) as follows:
First probability calculation:
A = 1: 0.77966
A = 0: 0.75510
Second probability calculation:
A = 1: 0.10417
A = 0: 0.10526
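One way to read each criterion is as a conditional probability of the form \(P(\hat{Y} = y \mid A = a, \text{CreditHistory} = c)\), estimated as the fraction of matching test rows that received prediction \(y\). The sketch below illustrates that reading on made-up data. The function name empirical_probability and the CreditHistory codes 2 and 4 are hypothetical placeholders, not the dataset's real encoding, so check the dataset documentation for the actual values.

```python
import pandas as pd

# Toy test set; the CreditHistory codes 2 and 4 are placeholders only.
X_test = pd.DataFrame({
    "Age":           [1, 1, 1, 0, 0, 0, 1, 0],
    "CreditHistory": [2, 2, 4, 2, 2, 4, 2, 4],
})
y_pred = pd.Series([1, 0, 0, 1, 1, 0, 1, 1])  # made-up predictions


def empirical_probability(features, predictions, age, history, outcome):
    """Estimate P(Y_hat = outcome | A = age, CreditHistory = history)."""
    # Keep only the rows in the given age category and credit history.
    mask = (features["Age"] == age) & (features["CreditHistory"] == history)
    selected = predictions[mask]
    # Fraction of the selected rows that received the given prediction.
    return (selected == outcome).mean()


p1 = empirical_probability(X_test, y_pred, age=1, history=2, outcome=1)
p0 = empirical_probability(X_test, y_pred, age=0, history=2, outcome=1)
print(f"A = 1: {p1:.5f}")  # A = 1: 0.66667
print(f"A = 0: {p0:.5f}")  # A = 0: 1.00000
```

The boolean mask relies on the features and predictions sharing the same index, which is why the indices are reset after the train/test split.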
Turn in lab9.py on Gradescope when you’re done.
Tips
Pandas
pandas is a package for data analysis, and you can read the full documentation at https://pandas.pydata.org/docs/reference/index.html. The main data structure in pandas is a DataFrame, which represents data as a table (a matrix of rows and columns); the functions in this package are built around this structure. A DataFrame has named columns and rows that have values for each of these columns. The DataFrame also has an Index that labels each row and keeps track of its position. Below is a simple example of reading a csv file with pandas and using some of its functions.
Suppose file.csv looks as follows:
Col1,Col2,Col3
1,red,2
2,blue,5
9,green,1
Then, we can use pandas as follows:
import pandas as pd
df = pd.read_csv("file.csv") # creates a dataframe from our data
# Accessing columns
print(df["Col1"]) # prints the values in Col1
# Using .apply function
df["Col2"] = df["Col2"].apply(lambda x: 0 if (x == "red") or (x =="blue") else 1) # change Col2 to be 0-1 valued
# Using .loc function
smaller_df = df.loc[:, df.columns != "Col1"] # make a new dataframe with all rows (given by ":") and every column except Col1
even_smaller_df = smaller_df.loc[lambda d: d["Col3"] > 2, :] # the lambda receives the whole dataframe; keep the same columns, but only rows with value > 2 in Col3
# Print out index of smaller dataframe
print(even_smaller_df.index) # the indices of the rows will not change from the order they were read in even though we dropped a row!
# Resetting indices
even_smaller_df.reset_index(drop=True, inplace=True) # resets the index of even_smaller_df while dropping the old index list
Shallow vs. Deep Copy
As you may recall from C++, there are different ways to copy objects to new variable names. The same is true in Python. Below is an example of 3 ways to copy a list in Python:
x = [[1,2],[3,4]] # starting with a list
# Direct assignment
y = x # y points to the same place in memory as x
x.append("new value")
print(y) # [[1,2],[3,4], "new value"] -- y was also modified!
# Shallow copy
y = list(x) # y is a different list than x, but points to the same children elements as x
x.append([5,6])
print(y)# [[1,2],[3,4], "new value"] -- y does not get the new value now
x[0][0] = 0
print(y) # [[0,2],[3,4], "new value"] -- y was also modified!
# Deep copy
import copy
y = copy.deepcopy(x)
x[0][1] = 3
x.append("another new value")
print(y) # [[0,2],[3,4], "new value", [5,6]] -- y keeps x's contents from the time of the copy and is not modified at all now!
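The same distinction applies to pandas objects. As a small sketch (the variable names alias and independent are just illustrative), direct assignment gives a second name for the same DataFrame, while .copy() produces a separate DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2, 9]})

alias = df               # second name for the same DataFrame object
independent = df.copy()  # separate DataFrame with its own data

df["Col1"] = [0, 0, 0]   # replace the column in df

print(alias["Col1"].tolist())        # [0, 0, 0]
print(independent["Col1"].tolist())  # [1, 2, 9]
```

Calling .copy() before editing a column is a simple way to keep the original data around for comparison.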
Float Formatting in Strings
When printing out floats, you may not always want to print out all the decimals. You can specify the number of decimal places to show with variable:.Nf inside an f-string, where N is the number of places to show. For example:
fraction = 2.0/9.8 # 0.2040816327...
print(f"2 decimal places: {fraction:.2f}") # 0.20
print(f"4 decimal places: {fraction:.4f}") # 0.2041
print(f"8 decimal places: {fraction:.8f}") # 0.20408163