EECS 298: Social Consequences of Computing

Homework 3: To Predict and Serve

Due 11:59 PM EST on April 11

Coding Submission: 40 points

Written Submission: 20 points

Total Points: 60 points

Submission

This assignment consists of two parts:

  1. Programming - submit HW3.py
  2. Written Reflection - submit a PDF with your responses

Both parts will be submitted on Gradescope. Part 1 will be submitted to Homework 3: Coding Submission and Part 2 will be submitted to Homework 3: Written Submission. To access Gradescope, use the link on Canvas.

Part 1 will be graded using an autograder, so you'll be able to get feedback as soon as you submit - you can submit any number of times until you feel happy with your score! Your code will be tested on private cases in addition to the public cases you are given, so your code should generalize properly to other, similar calculations. Your programming implementation will be graded on correctness. We encourage collaboration, but all work you submit must be your own.

Part 2 will be graded manually, but you can still resubmit as many times as you need to before the deadline. You are required to typeset your written responses in a document editor or a program like LaTeX.

All writing must be your own, and collaboration must not result in code or writing that is identifiably similar to other solutions.

Introduction to Predictive Policing

Predictive policing refers to the use of data analysis and machine learning techniques to identify patterns and make predictions about future criminal activity. Typical stated goals of predictive policing include allocating police resources more effectively and efficiently, reducing crime rates, and improving public safety. Predictive policing has seen increasingly widespread use in the US [1] [2]. Perhaps the predictive policing algorithm that has received the most media coverage is PredPol, developed by a private company (formerly called PredPol, then Geolitica, and now absorbed into SoundThinking, formerly ShotSpotter, as part of a similar product called ResourceRouter) [3].

PredPol is software developed by social scientists in collaboration with the Los Angeles Police Department (LAPD). It uses historical arrest data to predict the probability of future arrests occurring in specific areas using a machine learning model based on earthquake prediction called Epidemic Type Aftershock-Sequences (ETAS) [4]. The algorithm divides a city into grid cells as small as 500x500 feet and assigns each cell a risk score: the probability of an arrest occurring in that cell, which PredPol assumes is a proxy for a crime occurring there. The idea is then to increase the presence of officers in the cells with the highest risk of a crime. However, critics argue that flawed and systemically biased data results in racially discriminatory predictions and policing, where the use of such algorithms can help produce the very same flawed data that then gets fed back into these systems. The result is a predictive policing system that reinforces patterns of over-policing, creating a feedback loop that sustains a cycle of oppression [2] [5] [6].

In this assignment, we will investigate these claims using real arrest data from the city of Oakland, California from 2009 to 2011. For cells, we will use census tracts, the units of area that the US Census uses to collect population totals. These are convenient because they break up the city of Oakland into reasonably sized pieces, and the US Census conveniently collects lots of demographic data on the people who live in each tract. Using this data, we will investigate the racial distribution of the people arrested and of the people who could be affected by the use of the PredPol algorithm on this data. We will also investigate what happens when assigning additional police to a tract results in additional arrests, which then get fed back into the model. We will loosely follow the original analysis of Lum and Isaac [6].

Datasets

For this assignment, you are provided three datasets as described above. For your implementations of different classes and functions, you will read in the data from CSV files. Descriptions of each of these files are below and the HW3.py section will describe how to read in each dataset.

Arrest Data

You are provided real data collected from the Oakland police department about arrests in the city in the form of arrests.csv. This dataset contains details on drug-related arrests in Oakland from 2009 to 2011. The columns include information such as the description of the incident and the location of the arrest. Use wget to get the file arrests.csv.

wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/arrests.csv

The columns of the data are as follows:

Demographic Data

You are also provided racial demographic data from the 2010 US Census for each tract where the Oakland police can make arrests (as estimated by this dataset and Lum and Isaac [6]) in 2010_Oakland_Tract_Demographics.csv. This file has a column for the tract and the number of people living in that tract from each racial category the US Census collects. These racial categories are dictated by the US Census Bureau and are measured via people self-reporting on the US Census. You may assume that the list of unique tracts in arrests.csv is the same as the list of tracts in this file. Use wget to get the file 2010_Oakland_Tract_Demographics.csv.

wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/2010_Oakland_Tract_Demographics.csv

The first column of 2010_Oakland_Tract_Demographics.csv is Tract and specifies the census tract that the population numbers for each demographic in a single row correspond to.

NOTE: Tract is formatted differently here than in arrests.csv, and implementation details are given below for how to resolve this difference.

The remaining columns of the data correspond to demographic data of the tract and are as follows:

Drug Use Data

Finally, you are provided with rates of drug use, broken down by demographic category, in 2010_drug_use.csv. This data is from the National Survey on Drug Use and Health (NSDUH), and lists the percentage of people belonging to each demographic category who respond to the survey that they participated in illicit drug use in the last month (among persons aged 12 or older). The survey is from 2010. This will serve as a proxy for the ground truth, given that an anonymous survey conducted using careful sampling techniques will undoubtedly be better than arrest data in measuring illicit drug use. We assume that these national drug use rates also hold within each Oakland tract. Use wget to get the file 2010_drug_use.csv.

wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/2010_drug_use.csv
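Because the file is just a header row plus a single data row, it maps naturally onto csv.DictReader. The sketch below (the function name is hypothetical, and it keeps whatever category headers the CSV happens to contain rather than assuming specific ones) shows one way to load it:

```python
import csv

def read_drug_use_rates(path="2010_drug_use.csv"):
    """Read the two-row NSDUH file into {category: rate}, with rates as floats."""
    with open(path, encoding="utf-8-sig") as f:
        rows = list(csv.DictReader(f))
    # The file has exactly one data row; its keys are the category headers.
    return {category: float(rate) for category, rate in rows[0].items()}
```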

This file consists of only two rows: the CSV header and the drug use percentages for the following categories:

Part 1 - HW3.py

Use wget to download the starter file and PredPol model file.

wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/homeworks/HW3.py
wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/homeworks/pred_pol.py

There are two main classes you will implement: DataWrapper and ProbabilityAnalysis. DataWrapper will handle reading in and processing the data from the csv files so that it may be used in the PredPol model and for the probability analysis. ProbabilityAnalysis will use the PredPol model and DataWrapper to compute various probabilities and expectations as explained in the ProbabilityAnalysis section.

In the pred_pol.py file, you will find a function and a class that you should not change. Information for how to use each is given below:

In the HW3.py file, you will find the DataWrapper and ProbabilityAnalysis classes for you to implement. Each class includes constructors and member functions that will be useful for implementing further functions and completing the analysis questions. Information for how to implement each class is given below.

DataWrapper

This class contains all of the data that we need, including the census tracts, the demographic data for each tract, the arrest data for each tract, and drug use data in the population. Details for each of the functions you will write are below.

TIP: See the note about how the tract is stored in 2010_Oakland_Tract_Demographics.csv. It may be helpful to create an inner function here that extracts the tract as a 6-digit string, since this is how tracts are stored in self.tracts (i.e., append "00" to the 4-digit version). For example, a tract written as 4053.01 should be extracted as 405301.
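The conversion described in this tip can be sketched as a small helper (the name normalize_tract is hypothetical; in your implementation this could be the suggested inner function):

```python
def normalize_tract(raw):
    """Convert a demographics-file tract like '4053.01' or '4026'
    into the 6-digit string used in self.tracts ('405301', '402600')."""
    whole, _, frac = str(raw).partition(".")
    # Right-pad the fractional part with zeros, so a missing fraction
    # becomes '00' (i.e., '4026' -> '402600').
    return whole + frac.ljust(2, "0")
```

For example, normalize_tract("4053.01") returns "405301" and normalize_tract("4026") returns "402600".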

Use encoding='utf-8-sig' when you open the file (passed in as a keyword argument) if you run into issues.

{"402600": {0: 0,
            1: 0,
            ...
            10: 1,
            ...
            47: 1,
            ...}
}

TIP: It will help to use a library to process the dates and convert them to timestamps, such as the datetime library.
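Putting the datetime tip together with the nested-dictionary structure shown above, a per-tract log of arrest counts by day index could be built along these lines. The column names "Tract" and "Date", the date format, and the start date are illustrative assumptions only; check arrests.csv for the actual values:

```python
import csv
from datetime import datetime

def build_arrest_counts(path="arrests.csv", start="2009-01-01"):
    """Sketch: map each tract to {day_index: arrest_count}.

    Day indices are days elapsed since the assumed start date.
    """
    start_day = datetime.strptime(start, "%Y-%m-%d")
    counts = {}
    with open(path, encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            # Convert the date string to a day index relative to the start
            day = (datetime.strptime(row["Date"], "%Y-%m-%d") - start_day).days
            tract_counts = counts.setdefault(row["Tract"], {})
            tract_counts[day] = tract_counts.get(day, 0) + 1
    return counts
```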

TIP: You might find the dictionary get() method useful for filtering the arrests logs.
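For instance, get() with a default of 0 lets you sum counts over a window of days without a KeyError for days that have no entry (a toy illustration, not part of the required interface):

```python
# One tract's arrest log: day index -> number of arrests (toy data)
arrests_by_day = {10: 1, 47: 1}

# Sum arrests over the first 30 days; get() returns 0 for missing days
total_in_window = sum(arrests_by_day.get(day, 0) for day in range(30))
```

Here total_in_window is 1, since only day 10 falls inside the window.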

ProbabilityAnalysis

This class contains functions to compute various probabilities and expectations using the DataWrapper class and the PredPol model. We will use the following random variables throughout the class implementation details and in the following analysis.

H = 1 if sum_t H_t > 0, and H = 0 otherwise

Implementation details for each of the functions you will write are below.

TIP: You can calculate the expected number of times a person of each racial category was arrested for a single given arrest (this number will be no more than 1) and then sum over all arrests, because for any (independent or dependent!) random variables X and Y, E[X+Y]=E[X]+E[Y]. Further, the expected number of times a person of race a was arrested for a single given arrest in tract r is exactly P(A=a|R=r), since we assume each person is arrested uniformly at random from the population in the tract.
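As a toy numeric sketch of this tip (the function name and data layout are hypothetical, not part of the required interface): in a tract with 75 people of race a and 25 of race b, each arrest there contributes 0.75 to a's expected count and 0.25 to b's, and by linearity these contributions simply add across arrests:

```python
def expected_arrests_by_race(arrest_tracts, demographics):
    """arrest_tracts: list with the tract of each arrest (one entry per arrest).
    demographics: {tract: {race: population}}.
    Returns {race: expected number of arrests of people of that race}.
    """
    expected = {}
    for tract in arrest_tracts:
        pops = demographics[tract]
        total = sum(pops.values())
        for race, pop in pops.items():
            # P(A=a | R=r) for this single arrest, summed over arrests
            # (E[X+Y] = E[X] + E[Y] holds regardless of dependence)
            expected[race] = expected.get(race, 0.0) + pop / total
    return expected
```

With demographics {"405301": {"a": 75, "b": 25}} and two arrests in that tract, this returns 1.5 for a and 0.5 for b.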