EECS 298: Social Consequences of Computing
Lab 8
Task
In this lab, you will explore (part of) the COMPAS dataset and get experience building a simple machine learning model using the package scikit-learn. This lab will walk you through the four stages of the machine learning pipeline discussed in lecture (Data, Representation, Loss, Optimization) and you will compute some probabilities with the model we create.
To get started, first download the simplified COMPAS dataset using wget and then create a file called lab8.py in the same folder as the downloaded dataset.
$ wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/compas-data-lab-8.csv
Data - COMPAS Dataset (Simplified)
The original COMPAS dataset contains many features, including demographic features, criminal history, and information about the current charge of defendants. In this lab, you will use the same simplified version of the dataset we used in Lab 7 so that we can follow along with part of the critical analysis ProPublica did with this dataset. In lecture, we saw part of how Northpointe created their tool, so this lab shows another way to choose features from a dataset to perform an analysis. We have repeated the explanation of the data below.
Following ProPublica’s analysis, the features we consider and their corresponding numeric values are:
sex: Male: 0, Female: 1
age: age < 25: 0, 25 <= age <= 45: 1, age > 45: 2
race: Caucasian: 0, African-American: 1, Asian: 2, Hispanic: 3, Native American: 4, Other: 5
priors_count: Number of prior offenses before the current charge.
c_charge_degree: Degree of the current charge. Misdemeanor: 0, Felony: 1
two_year_recid: Indicator variable for whether the defendant re-offended within two years after the current charge. No: 0, Yes: 1
Again, following ProPublica’s analysis, the target variable we are trying to predict and its corresponding numeric values are:
score_text: Whether the defendant was classified as Low or High/Medium risk by Northpointe’s tool. Low: 0, High/Medium: 1
The features and target variable are coded as numeric values because the model we are using needs numeric values to make predictions (see below).
Representation - Logistic Regression Model
Following ProPublica’s analysis, we are going to use a logistic regression model to fit the data. A logistic regression model uses the given features to predict the binary target variable by fitting a logistic curve to the data (see lab slides). Thus, for a given set of inputs (a vector of features), the logistic regression model outputs a score in $[0,1]$ which represents the probability of the target variable being equal to 1 (thus an output of 0 implies the target variable is 0 with probability 1).
In our setting, we have a dataset of defendants with associated features and a target variable (described above) that we will use to train the logistic regression model to determine how inputs (features) are mapped to an output (the probability of the target variable being 1).
Formally, if $X$ is a vector of input features (i.e., one defendant), then the logistic regression model learns parameters for the probability function $p(X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$ where $\beta_0$ is the intercept of the model and $\beta_1$ is the rate parameter. Therefore, the prediction, $\hat{Y}$, for an input $X$ is given as $\hat{Y} = 1$ if $p(X) \ge 0.5$ and $\hat{Y} = 0$ if $p(X) < 0.5$. See here for more details on the logistic regression model.
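To make the formula concrete, here is a small sketch of the probability function and the prediction rule; the parameter values and feature vector below are made up purely for illustration and are not the parameters the lab will learn.
import numpy as np

def logistic_probability(x, beta_0, beta_1):
    # p(X) = 1 / (1 + e^{-(beta_0 + beta_1 . X)})
    return 1.0 / (1.0 + np.exp(-(beta_0 + np.dot(beta_1, x))))

beta_0 = -1.0                                        # made-up intercept
beta_1 = np.array([0.2, -0.5, 0.1, 0.3, 0.4, 0.6])   # made-up rate parameters, one per feature
x = np.array([0, 2, 5, 0, 1, 0])                     # one defendant's feature vector

p = logistic_probability(x, beta_0, beta_1)
y_hat = 1 if p >= 0.5 else 0                         # prediction rule from above
print(p, y_hat)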
Loss - Fitting the Logistic Curve
The loss function associated with the logistic regression model is called cross entropy loss and you can read more about the details here. However, for the purposes of this lab assignment, you only need to know that minimizing this loss function suffices to find the parameters $\beta_0, \beta_1$ that best fit the logistic regression model given in the previous section.
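For intuition only (you will not need to implement it in this lab), here is a minimal sketch of the binary cross entropy loss for a single example, using hypothetical predicted probabilities:
import numpy as np

def cross_entropy_loss(y, p):
    # -[y*log(p) + (1-y)*log(1-p)] for one example with true label y and predicted probability p
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(cross_entropy_loss(1, 0.9))  # small loss: confident and correct
print(cross_entropy_loss(1, 0.1))  # large loss: confident and wrong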
Optimization - Training the Logistic Regression Model
Now that we have introduced the model and loss function we will use, we can begin coding the training process. To get started, import numpy and sklearn. If you need to install sklearn, run pip install scikit-learn in your terminal. Set the random seed here as well to ensure stability of the training results.
import numpy as np
from sklearn import linear_model, model_selection, metrics
np.random.seed(298298) # sets the random seed for stability of outputs
Next, read in the data from compas-data-lab-8.csv however you’d like, but make sure to create a list of features called compas_features and a list of target variable labels called compas_labels. That is, compas_features is a list of lists of features (all but the last column in the dataset), one list per defendant, and compas_labels is a list of target variables (the last column in the dataset). Make sure to cast the values to ints as you read them in and populate the lists.
For example, looking at the first couple of rows of compas-data-lab-8.csv, you want each variable to look as follows
compas_features = [[0,2,5,0,1,0], [0,1,1,0,1,1],[0,0,1,4,1,1], ...]
compas_labels = [0,0,0, ...]
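If you are unsure where to start, one possible sketch uses Python’s built-in csv module; it assumes the file has a header row, so remove the next(reader) line if yours does not.
import csv

compas_features = []
compas_labels = []
with open("compas-data-lab-8.csv") as f:
    reader = csv.reader(f)
    next(reader)  # assumption: skip a header row (remove this line if the file has no header)
    for row in reader:
        values = [int(v) for v in row]       # cast every value to an int
        compas_features.append(values[:-1])  # all but the last column are the features
        compas_labels.append(values[-1])     # the last column is the target variable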
Next, it is common practice to split the available data into separate sets for training and testing the model. The primary reason for doing this is to evaluate the performance of the model on unseen data. When we train a machine learning model, we use a portion of the available data to fit the model to the underlying patterns in the data. The goal is to find a model that can accurately predict the target variable for new, unseen data. However, if we use all of the available data for training, we have no way of knowing how well the model will generalize to new data. By splitting the data into training and testing sets, we can use the training data to fit the model and the testing data to evaluate its performance.
You can use the following scikit-learn function to arbitrarily split the data into train and test sets:
x_train, x_test, y_train, y_test = model_selection.train_test_split(compas_features, compas_labels)
When training our logistic regression model, we will let the model learn the target values (y_train) from the defendants in the training set (x_train). Then, once trained, we will give it the defendants in the test set (x_test) and ask it to predict the score_text for each.
We can use scikit-learn to train a logistic regression model in the following way
logistic_regression_model = linear_model.LogisticRegression().fit(x_train, y_train)
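If you want to connect the trained model back to the Representation section, scikit-learn stores the learned parameters on the fitted model: intercept_ corresponds to $\beta_0$ and coef_ to $\beta_1$ (one coefficient per feature). Printing them is optional and just for inspection.
print(logistic_regression_model.intercept_)  # beta_0
print(logistic_regression_model.coef_)       # beta_1, one value per feature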
After that, we can use the trained model to make predictions on the test data.
y_test_pred = logistic_regression_model.predict(x_test)
Evaluating the Model
We are going to use accuracy and true positive rate to evaluate our model and gain practice in computing probabilities from model outputs. We will use the test set to evaluate our model and specifically calculate empirical probabilities in this section.
Accuracy
Let $Y$ be a random variable for the actual risk score for each defendant in x_test (i.e., y_test) and let $\hat{Y}$ be a random variable for the predicted score for each defendant in x_test (i.e., y_test_pred). Further, let $n$ be the number of examples in x_test. Then, the accuracy of the model is defined as $\frac{1}{n}\sum^n_{i=1} \mathbb{I}[Y_i = \hat{Y}_i]$ where $\mathbb{I}$ is the indicator function and $Y_i, \hat{Y}_i$ are the risk score and predicted risk score for the $i$th defendant, respectively. To get practice using functions in sklearn, we will use the function metrics.accuracy_score(y_test, y_test_pred) to get the accuracy of our model.
Compute the accuracy of the model and print out the result as follows
Accuracy of model: 0.7524303305249513
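One possible sketch, reusing the variables defined in the earlier snippets; the last line is an equivalent by-hand computation of the same quantity and is included only to connect it to the formula above.
accuracy = metrics.accuracy_score(y_test, y_test_pred)
print(f"Accuracy of model: {accuracy}")

# equivalent by-hand computation: fraction of test defendants where Y_i == Y_hat_i
manual_accuracy = np.mean(np.array(y_test) == np.array(y_test_pred))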
True Positive Rate
The positive outcome of a machine learning model can be interpreted in many ways, but we will say that the positive outcome corresponds to a score_text label of 1. Using this definition, the true positive rate of a model is the probability that $\hat{Y}=1$ given that $Y = 1$. This is a conditional probability, which we would write as $P(\hat{Y} = 1 | Y=1)$.
For this section, you will compute the true positive rate for each race so that we can compare the rates among the different groups. Let $A$ be a random variable to represent the race of each defendant. Then, compute $P(\hat{Y} = 1 | Y=1, A=a)$ for $a \in \{0,1,2,3,4,5\}$ (since there are 6 possible race categories – see Data section). Print out the results of your calculation. Thus, your output (combined with the accuracy calculation) should look as follows
Accuracy of model: 0.7524303305249513
Race Category: 0, TPR: 0.6144578313253012
Race Category: 1, TPR: 0.7244444444444444
Race Category: 2, TPR: 0.6
Race Category: 3, TPR: 0.6521739130434783
Race Category: 4, TPR: 0.0
Race Category: 5, TPR: 0.42105263157894735
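One possible sketch, assuming the column order from the Data section (so race is stored at index 2 of each feature row); the variable names follow the earlier snippets.
x_test_arr = np.array(x_test)
y_test_arr = np.array(y_test)
for race in range(6):
    # positives of this race: actual label is 1 and the race category matches
    positives = (x_test_arr[:, 2] == race) & (y_test_arr == 1)
    # among those positives, the fraction predicted as 1 is P(Y_hat = 1 | Y = 1, A = race)
    tpr = np.mean(y_test_pred[positives] == 1)
    print(f"Race Category: {race}, TPR: {tpr}")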
Computing Conditional Expectation
For this section, you will compute the expected value of Y’ (i.e., the model prediction) conditioned on each age category for the test split so that we can compare the conditional expected values among the different groups. Recall that the conditional expected value is computed as the weighted sum of the possible values of Y’ (i.e., 0 or 1) where the weights are the conditional probabilities. That is,
\[E[Y' | Age = a] = 0 \cdot P(Y' = 0 | Age = a) + 1 \cdot P(Y' = 1 | Age = a)\]
Print the results of your calculations. Your output (combined with the accuracy and TPR calculations) should look as follows:
Accuracy of model: 0.7524303305249513
Race Category: 0, TPR: 0.6144578313253012
Race Category: 1, TPR: 0.7244444444444444
Race Category: 2, TPR: 0.6
Race Category: 3, TPR: 0.6521739130434783
Race Category: 4, TPR: 0.0
Race Category: 5, TPR: 0.42105263157894735
Age Category: 0, Expected Y' Value: 0.8181818181818182
Age Category: 1, Expected Y' Value: 0.3548034934497817
Age Category: 2, Expected Y' Value: 0.12794612794612795
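One possible sketch, again assuming the column order from the Data section (so age is stored at index 1 of each feature row); since Y’ only takes the values 0 and 1, the conditional expectation reduces to the conditional probability that Y’ = 1.
x_test_arr = np.array(x_test)
for age in range(3):
    in_group = x_test_arr[:, 1] == age                    # test defendants in this age category
    expected_value = np.mean(y_test_pred[in_group] == 1)  # E[Y' | Age = age] = P(Y' = 1 | Age = age)
    print(f"Age Category: {age}, Expected Y' Value: {expected_value}")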
Turn in lab8.py to Gradescope when you’re done.