EECS 298: Social Consequences of Computing

Lab 7

Task

In this lab, you will explore (part of) the COMPAS dataset and get experience calculating basic probabilities based on large datasets. This lab will walk you through the dataset you need to analyze and you will use Python to compute the types of probabilities discussed in lab.

To get started, first download the simplified COMPAS dataset using wget and then create a file called lab7.py in the same folder as the downloaded dataset.

$ wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/compas-data-lab-7.csv

Data - COMPAS Dataset (Simplified)

The original COMPAS dataset contains many features, including demographic features, criminal history, and information about the current charge of defendants. In this lab, you will use a simplified version of this dataset, contained in compas-data-lab-7.csv, so that we can follow along with part of the critical analysis ProPublica did with this dataset. In lecture, we saw part of how Northpointe created their tool, so this lab shows another way to choose features from a dataset to perform an analysis.

Following ProPublica’s analysis, the features we consider and their corresponding numeric values are:

sex: Male: 0, Female: 1
age: age < 25: 0, 25 <= age <= 45: 1, age > 45: 2
race: Caucasian: 0, African-American: 1, Asian: 2, Hispanic: 3, Native American: 4, Other: 5
priors_count: Number of offences prior to the current charge.
c_charge_degree: Degree of current charge. Misdemeanor: 0, Felony: 1
two_year_recid: Indicator variable for whether the defendant re-offended within two years of the current charge. No: 0, Yes: 1

Again, following ProPublica’s analysis, the target variable we are trying to predict and its corresponding numeric values are:

score_text: Whether the defendant was classified as Low or High/Medium risk by Northpointe’s tool. Low: 0, High/Medium: 1

The features and target variable are coded as numeric values because the model we are using needs numeric values to make predictions (see below).
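Since every column is numeric, loading the file can be kept simple. The following is a minimal sketch, not a required approach. It assumes the CSV has a header row whose column names match the feature names listed above and that every value parses as an integer; check both against the actual file. The helper name `load_rows` and the two sample rows are made up for illustration and are not real COMPAS data.

```python
import csv
import io


def load_rows(file_obj):
    """Read a CSV with a header row into a list of dicts of integer values.

    Assumption: every column holds integers, as in the coding described above.
    """
    reader = csv.DictReader(file_obj)
    return [{name: int(value) for name, value in row.items()} for row in reader]


# Made-up two-row sample in the assumed layout (NOT real COMPAS data).
sample_csv = io.StringIO(
    "sex,age,race,priors_count,c_charge_degree,two_year_recid,score_text\n"
    "0,1,1,2,1,0,1\n"
    "1,0,0,0,0,0,0\n"
)
rows = load_rows(sample_csv)
print(rows[0]["race"], rows[1]["score_text"])
```

With the real file, the same helper could be called as `with open("compas-data-lab-7.csv", newline="") as f: rows = load_rows(f)`, provided the assumptions above hold.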

Probabilities

We are going to use the above features of the COMPAS dataset to compute various probabilities. Read through the probabilities you are asked to compute below, then process compas-data-lab-7.csv accordingly.

We define the following random variables to use throughout the rest of this lab. Recall that a random variable represents the outcome of a probabilistic event:

$S$ denotes the sex a given person in our dataset belongs to
$A$ denotes the age category a given person in our dataset belongs to
$R$ denotes which of the 6 racial categories a given person in our dataset belongs to
$I$ denotes the count of prior offences of a given person in our dataset
$C$ denotes whether the charge of a given person in our dataset was a felony $(C = 1)$ or a misdemeanor $(C = 0)$
$W$ denotes whether the defendant re-offended within two years of the current charge $(W = 1)$ or did not $(W = 0)$
$T$ denotes whether a given person in our dataset was classified as high/medium risk $(T = 1)$ or low risk $(T = 0)$ by Northpointe’s tool
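For a random variable defined over a dataset, the probability of an outcome is estimated as the fraction of records with that outcome. The following is a minimal sketch on a made-up list of four values; the function name `empirical_distribution` and the toy data are assumptions for illustration only.

```python
from collections import Counter


def empirical_distribution(values):
    """Map each observed outcome to (number of occurrences) / (number of records)."""
    counts = Counter(values)
    total = len(values)
    return {outcome: counts[outcome] / total for outcome in sorted(counts)}


# Toy column of four records: three with value 0 and one with value 1.
toy_sex = [0, 0, 1, 0]
dist = empirical_distribution(toy_sex)
for s, p in dist.items():
    print(f"P(S = {s}) = {p}")
```

The probabilities over all outcomes of one variable should sum to 1, which is a quick check on any distribution computed this way.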

In lab7.py, compute the following probabilities using the data in compas-data-lab-7.csv:

$P(S = s)$: For each of the categories for the sex of a given person in our dataset, calculate $P(S = s)$ and print the result as follows:

(Question 1)
P(S = 0) = 0.8096241088788075
P(S = 1) = 0.1903758911211925

$P(R = r)$: For each of the racial categories, calculate $P(R = r)$ and print the result as follows:

(Question 2)
P(R = 0) = 0.34073233959818533
P(R = 1) = 0.5144199611147116
P(R = 2) = 0.005022683084899547
P(R = 3) = 0.08246921581335062
P(R = 4) = 0.0017822423849643552
P(R = 5) = 0.05557355800388853

$P(R = r | T = 1)$: For each of the racial categories, calculate the conditional probability $P(R = r | T = 1)$ and print the result as follows:

(Question 3)
P(R = 0 | T = 1) = 0.25299890948745907
P(R = 1 | T = 1) = 0.6648491457651763
P(R = 2 | T = 1) = 0.0025445292620865138
P(R = 3 | T = 1) = 0.051254089422028346
P(R = 4 | T = 1) = 0.0029080334423845873
P(R = 5 | T = 1) = 0.025445292620865142

$E[I]$: Compute the expected number of prior offences for a random person in our dataset and print the result as follows:

(Question 4)
E[I] = 3.2464355152300715

$E[I | R = r]$: For each of the racial categories, compute the conditional expectation of the number of prior offences and print the result as follows:

(Question 5)
E[I | R = 0] = 1640.285482825664
E[I | R = 1] = 6922.034996759558
E[I | R = 2] = 0.21095268956578095
E[I | R = 3] = 88.15959170447181
E[I | R = 4] = 0.10158781594296824
E[I | R = 5] = 33.28856124432923

Final Output

Turn in lab7.py to Gradescope when you’re done. Your code will be autograded, so the exact values from your calculations should match those of the Gradescope autograder. Your final output after all five questions should be:

(Question 1)
P(S = 0) = 0.8096241088788075
P(S = 1) = 0.1903758911211925
(Question 2)
P(R = 0) = 0.34073233959818533
P(R = 1) = 0.5144199611147116
P(R = 2) = 0.005022683084899547
P(R = 3) = 0.08246921581335062
P(R = 4) = 0.0017822423849643552
P(R = 5) = 0.05557355800388853
(Question 3)
P(R = 0 | T = 1) = 0.25299890948745907
P(R = 1 | T = 1) = 0.6648491457651763
P(R = 2 | T = 1) = 0.0025445292620865138
P(R = 3 | T = 1) = 0.051254089422028346
P(R = 4 | T = 1) = 0.0029080334423845873
P(R = 5 | T = 1) = 0.025445292620865142
(Question 4)
E[I] = 3.2464355152300715
(Question 5)
E[I | R = 0] = 1640.285482825664
E[I | R = 1] = 6922.034996759558
E[I | R = 2] = 0.21095268956578095
E[I | R = 3] = 88.15959170447181
E[I | R = 4] = 0.10158781594296824
E[I | R = 5] = 33.28856124432923

Tips

You might find the following definitions and formulas useful in calculating the above quantities.

Conditional Probability

We define conditional probability as the probability that an event happened given that another event occurred. We denote this as $P(X | Y)$ and its formal definition is as follows:

\[P(X | Y) = \frac{P(X \cap Y)}{P(Y)}\]
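When probabilities are estimated from counts over the same dataset, the total number of records cancels in the ratio, so $P(X | Y)$ reduces to the number of records where both events hold divided by the number where $Y$ holds. The following is a minimal sketch on made-up records; the helper `conditional_probability` and the toy data are illustrative assumptions, not part of the lab's starter code.

```python
def conditional_probability(records, event, given):
    """Estimate P(event | given) as count(event and given) / count(given)."""
    given_records = [r for r in records if given(r)]
    if not given_records:
        raise ValueError("the conditioning event never occurs in the data")
    both = sum(1 for r in given_records if event(r))
    return both / len(given_records)


# Toy records (NOT real COMPAS data): three have score_text == 1,
# and two of those three have race == 1.
toy_records = [
    {"race": 0, "score_text": 1},
    {"race": 1, "score_text": 1},
    {"race": 1, "score_text": 0},
    {"race": 1, "score_text": 1},
]
p = conditional_probability(
    toy_records,
    event=lambda r: r["race"] == 1,
    given=lambda r: r["score_text"] == 1,
)
print(f"P(R = 1 | T = 1) = {p}")
```

Raising an error when the conditioning event never occurs is one possible design choice; it avoids a silent division by zero.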

Expectation

We define the expectation of a random variable $X$ as the average of its possible outcome values, weighted by the probability of each outcome. We denote this as $E[X]$.
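Written out, this weighted average is $E[X] = \sum_{x} x \cdot P(X = x)$. The following is a minimal sketch of that sum on a made-up list of counts; the function name `expectation` and the toy values are assumptions for illustration.

```python
from collections import Counter


def expectation(values):
    """Compute sum over x of x * P(X = x), with P estimated from frequencies."""
    counts = Counter(values)
    total = len(values)
    return sum(x * (count / total) for x, count in counts.items())


# Toy data: P(0) = 0.5, P(1) = 0.25, P(3) = 0.25.
toy_priors = [0, 0, 1, 3]
e = expectation(toy_priors)
print(f"E[I] = {e}")
```

When every record is equally likely, this weighted sum is the same quantity as the ordinary mean of the column, up to floating-point rounding.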

Conditional Expectation

We define the conditional expectation of a random variable $X$ as the average of its possible outcome values weighted by a conditional probability. We denote this as $E[X | Y]$ and its formal definition is as follows:

\[E[X | Y] = \sum_{x} x \cdot P(X = x | Y)\]
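One way to read this formula: restrict the data to the records where $Y$ holds, then take the expectation of $X$ over only those records. The following is a minimal sketch under that reading, using made-up `(x, y)` pairs; the helper `conditional_expectation` is a hypothetical name, not part of the lab's starter code.

```python
from collections import Counter


def conditional_expectation(pairs, y):
    """Compute sum over x of x * P(X = x | Y = y) from a list of (x, y) pairs."""
    xs = [x for x, y_value in pairs if y_value == y]
    if not xs:
        raise ValueError("the conditioning event never occurs in the data")
    counts = Counter(xs)
    # P(X = x | Y = y) is the share of the Y = y records that have X = x.
    return sum(x * (count / len(xs)) for x, count in counts.items())


# Toy (x, y) pairs: two records with y == 0 and three with y == 1.
toy_pairs = [(0, 0), (2, 0), (4, 1), (6, 1), (2, 1)]
e0 = conditional_expectation(toy_pairs, 0)
e1 = conditional_expectation(toy_pairs, 1)
print(f"E[X | Y = 0] = {e0}")
print(f"E[X | Y = 1] = {e1}")
```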

Refer back to the lab slides for simple examples and further information about calculating probabilities.