EECS 298: Social Consequences of Computing
Homework 3: To Predict and Serve
Due 11:59 PM EST on April 11
Coding Submission: 40 points
Written Submission: 20 points
Total Points: 60 points
Submission
This assignment consists of two parts:
- Programming - submit HW3.py
- Written Reflection - submit a PDF with your responses
Both parts will be submitted on Gradescope. Part 1 will be submitted to Homework 3: Coding Submission and Part 2 will be submitted to Homework 3: Written Submission. To access Gradescope, use the link on Canvas.
Part 1 will be graded using an autograder, so you'll be able to get feedback as soon as you submit - you can submit any number of times until you feel happy with your score! Your code will be tested on private cases in addition to the public cases you are given, and as such, your code should be properly generalizable to other, similar calculations. Your programming implementation will be graded on correctness. We encourage collaboration, but all work you submit must be your own.
Part 2 will be graded manually, but you can still resubmit as many times as you need to before the deadline. You are required to typeset your written responses in a document editor or a program like LaTeX.
All writing must be your own, and collaboration must not result in code or writing that is identifiably similar to other solutions.
Introduction to Predictive Policing
Predictive policing refers to the use of data analysis and machine learning techniques to identify patterns and make predictions about future criminal activity. Typical stated goals of predictive policing include allocating police resources more effectively and efficiently, reducing crime rates, and improving public safety. Predictive policing has seen increasingly widespread use in the US [1] [2]. Perhaps the predictive policing algorithm that has received the most media coverage is an algorithm called PredPol, developed by a private company (formerly called PredPol, then called Geolitica, now absorbed into SoundThinking (which was formerly ShotSpotter) and part of a similar product, ResourceRouter) [3].
PredPol is software developed by social scientists in collaboration with the Los Angeles Police Department (LAPD). It uses historical arrest data to predict the probability of future arrests occurring in specific areas using a machine learning model based on earthquake prediction called Epidemic Type Aftershock-Sequences (ETAS) [4]. The algorithm divides a city into grid cells as small as 500x500 feet and assigns each cell a risk score: the probability of an arrest occurring in that cell, which PredPol assumes is a proxy for a crime occurring in that cell. The idea is then to increase the presence of officers in the cells that have the highest risk of a crime. However, critics argue that flawed and systemically biased data results in racially discriminatory predictions and policing, and that the use of such algorithms can help produce the very same flawed data that then gets fed back into these systems. This leads to a predictive policing system that reinforces patterns of over-policing, creating feedback loops that sustain a cycle of oppression [2] [5] [6].
In this assignment, we will investigate these claims using real arrest data from the city of Oakland, California from 2009 to 2011. For cells, we will use census tracts, units of area that the US Census uses to collect population totals. These are convenient because they break up the city of Oakland into reasonably sized pieces, and the US Census conveniently collects lots of demographic data on the people who live in each tract. Using this data, we will investigate the racial distribution of the people arrested in this data and the people who could be affected by the use of the PredPol algorithm on this data. We will also investigate what happens when assigning additional police to a tract results in additional arrests, which then get fed back into the model. We will loosely follow the original analysis of Lum and Isaac [6].
Datasets
For this assignment, you are provided three datasets as described above. For your implementations of different classes and functions, you will read in the data from CSV files. Descriptions of each of these files are below, and the HW3.py section will describe how to read in each dataset.
Arrest Data
You are provided real data collected from the Oakland police department about arrests in the city in the form of arrests.csv. This dataset contains details on drug-related arrests in Oakland from 2009 to 2011. The columns include information such as the description of the incident and the location of the arrest. Use wget to get the file arrests.csv:
wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/arrests.csv
The columns of the data are as follows (a short reading sketch follows the list):
- Date: The date of the arrest, formatted as YYMMDD.
- Category: The category of the drug-related arrest.
- Desc: The (brief) description of the crime committed.
- Addr: The address where the arrest was made.
- Lat: The latitude where the arrest was made.
- Long: The longitude where the arrest was made.
- Tract: The census tract where the arrest occurred, formatted as 6 digits. Ex: 402600.
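For concreteness, here is a minimal sketch of reading the rows with Python's standard csv module; it is not the required DataWrapper interface, just an illustration of accessing the columns named above (the example values are hypothetical).

import csv

# Each row is a dict keyed by the header names described above.
with open("arrests.csv", newline="") as f:
    for row in csv.DictReader(f):
        date, tract = row["Date"], row["Tract"]  # e.g., "090514", "402600"
        # accumulate per-tract, per-day counts here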
Demographic Data
You are also provided racial demographic data from the 2010 US Census for each tract where the Oakland police can make arrests (as estimated by this dataset and Lum and Isaac [6]) in 2010_Oakland_Tract_Demographics.csv. This file has a column for the tract and the number of people living in that tract from each racial category the US Census collects. These racial categories are dictated by the US Census Bureau and are measured via people self-reporting on the US Census. You may assume that the list of unique tracts in arrests.csv is the same as the list of tracts in this file. Use wget to get the file 2010_Oakland_Tract_Demographics.csv:
wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/2010_Oakland_Tract_Demographics.csv
The first column of 2010_Oakland_Tract_Demographics.csv is Tract and specifies the census tract that the population numbers for each demographic in a single row correspond to.
NOTE: Tract is formatted differently here than in arrests.csv; implementation details for how to resolve this difference are given below.
The remaining columns of the data correspond to demographic data of the tract and are as follows:
Total population
Hispanic or Latino
Total population, not Hispanic or Latino
One race total
White
Black or African American
American Indian and Alaska Native
Asian
Native Hawaiian and Other Pacific Islander
Some Other Race
Two or More Races
Drug Use Data
Finally, you are provided with rates of drug use, broken down by demographic category, in 2010_drug_use.csv. This data is from the National Survey on Drug Use and Health (NSDUH) and lists the percentage of people belonging to each demographic category who reported on the survey that they participated in illicit drug use in the last month (among persons aged 12 or older). The survey is from 2010. This will serve as a proxy for the ground truth, given that an anonymous survey conducted using careful sampling techniques will undoubtedly measure illicit drug use better than arrest data. We make the assumption that the national drug use rates also hold in the tracts in Oakland (a reading sketch follows the category list below). Use wget to get the file 2010_drug_use.csv:
wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/2010_drug_use.csv
This file consists of only two rows: the CSV header and the drug use percentages for the following categories:
TOTAL
AGE 12-17
AGE 18-25
AGE 26 or Older
Male
Female
Not Hispanic or Latino
White
Black or African American
American Indian or Alaska Native
Native Hawaiian or Other Pacific Islander
Asian
Two or More Races
Hispanic or Latino
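As an illustration, the single data row can be collapsed into a dictionary of rates. This is a sketch, not the required process_drug_data implementation, and it assumes the file stores percentages (e.g., 9.2 meaning 9.2%), hence the division by 100:

import csv

# Read the one data row of 2010_drug_use.csv into {category: rate in [0, 1]}.
with open("2010_drug_use.csv", newline="") as f:
    row = next(csv.DictReader(f))
drug_use = {category: float(pct) / 100 for category, pct in row.items()}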
Part 1 - HW3.py
Use wget to download the starter file and PredPol model file.
wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/homeworks/HW3.py
wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/homeworks/pred_pol.py
There are two main classes you will implement: DataWrapper and ProbabilityAnalysis. DataWrapper will handle reading in and processing the data from the CSV files so that it may be used in the PredPol model and for the probability analysis. ProbabilityAnalysis will use the PredPol model and DataWrapper to compute various probabilities and expectations, as explained in the ProbabilityAnalysis section.
In the pred_pol.py file, you will find a function and a class that you should not change. Information for how to use each is given below (a hedged usage sketch follows):
generate_counterfactual: Generates counterfactual arrest data for each of the given tracts at a given time step; see the analysis section for more on where we need to use this.
- Inputs: A DataWrapper instance and a list of tracts to generate counterfactuals for at the given time step t.
- Returns: A dictionary of counterfactually generated arrest numbers (values) for each given tract (keys).
PredPol: This class represents the PredPol model.
- In the constructor, the model is trained by default using the function train_model to build the model from training arrest data given by a DataWrapper.
- predict: This function is used to predict the probability of a crime occurring at a given time step for a given tract. Note that PredPol, even after the model is finished training, also needs the timestamps of all previous arrests to make predictions. Be sure to keep in mind this difference between what this function predicts and what the inputs are. (For those of you who have seen models like SVM: just as SVM makes predictions based on the support vectors, this model makes predictions based on the timestamps.) To get a prediction for the likelihood of a crime occurring, pass in the following:
  - The tract to make a prediction for.
  - The timestep t to make a prediction for.
  - All timesteps (in the same structure as DataWrapper.timesteps) previous to t.
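Assuming the argument order matches the descriptions above (the authoritative signatures are in pred_pol.py, so treat this as a hypothetical illustration rather than the exact API):

from pred_pol import PredPol

model = PredPol(dw)  # dw: a DataWrapper; training happens in the constructor
# Belief that a crime occurs in `tract` at timestep t, given all arrest
# timestamps strictly before t (same structure as DataWrapper.timesteps).
p = model.predict(tract, t, previous_timesteps)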
In the HW3.py file, you will find the DataWrapper and ProbabilityAnalysis classes for you to implement. Each class includes constructors and member functions that will be useful for implementing further functions and completing the analysis questions. Information for how to implement each class is given below.
DataWrapper
This class contains all of the data that we need, including the census tracts, the demographic data for each tract, the arrest data for each tract, and drug use data in the population. Details for each of the functions you will write are below.
__init__: Construct all of the attributes of the DataWrapper using the passed-in arguments or the other functions in the class.
- Attributes:
  - arrests_path: The file location of the arrest data.
  - demo_path: The file location of the demographic data.
  - drug_path: The file location of the drug use data.
  - training_percentage: The percentage of days that will be used for training a predictive model.
  - tracts: The list of tracts in the arrest data. Each tract should be stored as a string. Construct this with build_tracts().
  - demographics: Stores the demographic data for each tract. Construct this with process_demo_data().
  - drug_use: Stores the drug use data for different groups of the population. Construct this with process_drug_data().
  - num_days: Total number of days in the arrest data (including the first and last day). Compute and modify this value in process_arrests().
build_tracts: Create a list of all unique tracts in arrests.csv to store in the self.tracts attribute. Each tract should be stored as a 6-digit string.
- Returns: a list of all unique tracts in arrests.csv.
process_demo_data: Create a dictionary that stores the demographic data from 2010_Oakland_Tract_Demographics.csv for each demographic category in the header of the csv file.
- Returns: a dict whose keys are each tract in self.tracts and the values are dictionaries mapping each given category to their population numbers in that tract.
TIP: See the note about how tract is stored in 2010_Oakland_Tract_Demographics.csv. It may be helpful to create an inner function here to extract the tract as the 6-digit string, since this is how tracts are stored in self.tracts (i.e., add "00" to the end of the 4-digit version). For example, a tract written as 4053.01 should be extracted as 405301 instead. Use encoding='utf-8-sig' when you open the file (passed in as a keyword argument) if you run into issues. A sketch of such a helper follows.
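A minimal sketch of the normalization rules above, assuming the fractional part, when present, is always two digits:

# "4026" -> "402600" (pad with "00"); "4053.01" -> "405301" (drop the dot).
def normalize_tract(raw):
    raw = raw.strip()
    if "." in raw:
        whole, frac = raw.split(".")
        return whole + frac
    return raw + "00"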
process_arrests: Construct a dictionary for storing arrest data in arrests.csv. The keys of the dictionary are the tracts in self.tracts and the values are other dictionaries whose keys are date timestamps in the range [0, self.num_days - 1] and values are the count of arrests made on that day.
- Modifies: Compute self.num_days by counting the number of days between the first and last date in arrests.csv, inclusive. The dictionaries will be of length self.num_days, where each key represents a date between the first and last arrest.
- Returns: The dictionary of arrest data. For example, the dictionary will look something like the below example for tract 402600. This example shows that tract 402600 has 0 arrests on days 0 and 1, and 1 arrest on days 10 and 47.
{"402600": {0: 0,
            1: 0,
            ...
            10: 1,
            ...
            47: 1,
            ...}
}
TIP: It will help to use a library to process the dates and convert them to timestamps, such as the datetime library.
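For instance, converting the YYMMDD strings to 0-indexed day offsets might look like the following sketch (assuming every date parses with the %y%m%d format; the date strings here are hypothetical):

from datetime import datetime

# Convert "YYMMDD" strings to 0-indexed day offsets from the earliest date.
dates = [datetime.strptime(d, "%y%m%d") for d in ("090101", "090102", "090111")]
first = min(dates)
offsets = [(d - first).days for d in dates]   # [0, 1, 10]
num_days = (max(dates) - first).days + 1      # inclusive of first and last day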
process_drug_data: Create a dictionary to store illicit drug use rates from 2010_drug_use.csv for each demographic category in the header of the csv file.
- Returns: A dict whose keys are the given racial categories and the values are the drug use rates stored as floats between 0 and 1.
split_arrests_log: Split the arrests data log into a training set, for training the PredPol model, and a test set, for evaluating the model in the analysis section. Both the training data and the testing data should be in the same format as self.arrests_log. The training data should consist of only days in the first self.training_percentage (rounded down) of days in self.num_days, and the testing data should contain the rest of the days. split_arrests_log should filter out the days with 0 arrests in a given tract.
- Returns: A tuple of two dictionaries (training_arrests_log, testing_arrests_log), each in the same format as self.arrests_log. For example, if self.num_days = 14 total days of arrests in the data and self.training_percentage = 2/3, then the training data, for each tract in self.tracts, should consist of days up through day floor(2/3 * 14) = 9, and the testing data should consist of t=10 through t=13 (this is the last day since the first day t is 0-indexed).
TIP: You might find the dictionary get() method useful for filtering the arrests logs.
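A minimal sketch of the split, assuming self.arrests_log has the {tract: {day: count}} shape described above and using the boundary convention from the worked example (training runs up through day floor(training_percentage * num_days)); this is an illustration, not the required member function:

import math

def split_log(arrests_log, num_days, training_percentage):
    cutoff = math.floor(training_percentage * num_days)  # last training day
    train, test = {}, {}
    for tract, days in arrests_log.items():
        # Keep only days with at least one arrest, per the spec.
        train[tract] = {t: c for t, c in days.items() if t <= cutoff and c > 0}
        test[tract] = {t: c for t, c in days.items() if t > cutoff and c > 0}
    return train, test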
ProbabilityAnalysis
This class contains functions to compute various probabilities and expectations using the DataWrapper class and the PredPol model. We will use the following random variables throughout the class implementation details and in the following analysis.
- A denotes which of the four racial categories a given person in Oakland belongs to.
- Y denotes whether the Oakland resident has used illicit drugs in the last month (Y=1) or not (Y=0).
- R denotes the tract that the arrest was made in (i.e., one of the tracts in DataWrapper.tracts).
- H_t denotes whether a tract experiences a heightened police presence at time t (we will define heightened police presence below).
- To make this a binary variable, we introduce H, an indicator for whether a person will ever face a heightened police presence in their tract. Note: H = 1 if sum_t H_t > 0, and H = 0 otherwise.
Implementation details for each of the functions you will write are below.
__init__: Construct all of the attributes of the ProbabilityAnalysis instance using the passed-in arguments, and set self.pred_pol to be equal to an instance of the PredPol class with self.dw passed in.
- Attributes:
  - dw: An instance of a DataWrapper object to perform probability calculations on.
  - racial_categories: Defines the categories of the sensitive attribute: race.
  - pred_pol_model: An instance of PredPol, trained on the input DataWrapper object.
compute_P_A: Compute P(A) for each category in self.racial_categories, that is, the proportion of each race in Oakland (across all tracts).
- Returns: A dict whose keys are the racial categories and the values are P(A=a).
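For intuition, P(A=a) is the population of category a summed over tracts, divided by a citywide total. A sketch under two assumptions: demographics has the {tract: {category: count}} shape built above, and the denominator is the "Total population" column (normalizing over only the four chosen categories is an alternative reading of the spec):

def compute_p_a(demographics, racial_categories):
    total = sum(tract["Total population"] for tract in demographics.values())
    return {
        a: sum(tract[a] for tract in demographics.values()) / total
        for a in racial_categories
    }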
compute_P_Y_eq_1_given_A: Compute the drug use percentage of each race, i.e., P(Y=1|A=a) for each race in self.racial_categories.
- Returns: A dict whose keys are the racial categories and the values are P(Y=1|A=a).
compute_P_A_given_R: Compute the proportion of each race in each tract, i.e., P(A=a|R=r).
- Returns: A dict whose keys are the tracts and the values are dicts whose keys are each racial category and each value is P(A=a|R=r).
compute_expected_arrests_given_A: Compute the expected number of arrests each racial category in self.racial_categories will have. We assume that whenever an arrest is made in tract r, a uniformly random person is arrested from r: everyone in that tract is equally likely to be arrested.
- Returns: A dict whose keys are the racial categories and the values are the expected number of times people in that category were arrested, as floats.
TIP: You can calculate the expected number of times a person of each racial category was arrested for each given arrest made (this number will be no more than 1) and then add up over all arrests, because for any (independent or dependent!) random variables X and Y, E[X+Y] = E[X] + E[Y]. Further, the expected number of times a person of race a was arrested for a single given arrest in tract r is exactly P(A=a|R=r), since we assume each person is arrested uniformly at random from the population in the tract.
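Concretely, the TIP's linearity argument reduces the expectation to a weighted sum of arrest counts. A sketch, assuming arrests have been totaled into a {tract: count} dict (arrest_counts is a hypothetical name) and p_a_given_r comes from compute_P_A_given_R:

# E[arrests of category a] = sum over tracts r of (arrests in r) * P(A=a|R=r)
def expected_arrests(arrest_counts, p_a_given_r, racial_categories):
    return {
        a: sum(arrest_counts[r] * p_a_given_r[r][a] for r in arrest_counts)
        for a in racial_categories
    }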
compute_P_R_given_A: Compute the probability of being in a certain tract given a racial category, i.e., P(R=r|A=a).
- Returns: A dict whose keys are the racial categories and the values are dicts whose keys are each tract and each value is P(R=r|A=a).
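One way to read P(R=r|A=a) off the census counts is as the fraction of category a's citywide population that lives in tract r. A sketch under the same demographics shape as above:

# P(R=r|A=a) = (count of category a in tract r) / (count of a across all tracts)
def compute_p_r_given_a(demographics, racial_categories):
    out = {}
    for a in racial_categories:
        total_a = sum(tract[a] for tract in demographics.values())
        out[a] = {r: demographics[r][a] / total_a for r in demographics}
    return out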
update_previous_observations: Used in ProbabilityAnalysis.compute_P_H_eq_1_given_A() (see below) to update the dataset fed into PredPol.predict (a hedged sketch follows this list).
- Arguments:
  - previous_observations: The dictionary of arrests data that is to be updated. This dictionary should initially contain only the training_arrest_log when this function is called for the first time.
  - test_set: The ground truth testing_arrest_log found after running dw.split_arrests_log().
  - t: The current timestep in the range of testing data.
  - top_tracts: Default None. A dictionary with keys of tracts and values of predictions from PredPol.predict(). Contains only num_top_tracts tracts, as determined in ProbabilityAnalysis.compute_P_H_eq_1_given_A().
  - counterfactuals: Default False. A boolean to indicate whether true test data is fed into future timesteps or counterfactuals.
- Returns: An updated previous_observations constructed using the following steps:
  - If counterfactuals = True, call generate_counterfactual to generate arrest numbers for the top_tracts to add to the timestamps passed into the predict function. This function generates arrest numbers under the assumption that we actually do send a heightened police presence at each timestep. See pred_pol.py for the implementation details and inputs to generate_counterfactual.
  - Example: If generate_counterfactual predicts 5 arrests in a certain tract at time t, then you should set previous_observations[tract][t] equal to 5, so you can pass an updated previous_observations dictionary to the predict function at the next timestep.
  - For the rest of the tracts, or if counterfactuals = False, add the ground truth arrest data for the current timestep t from the test_set to previous_observations. Refer back to DataWrapper.process_arrests() and DataWrapper.split_arrests_log() for details.
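A hedged sketch of that update logic. It assumes generate_counterfactual takes (DataWrapper, list of tracts, t), per its description above; check pred_pol.py for the authoritative signature:

def update_previous_observations(self, previous_observations, test_set, t,
                                 top_tracts=None, counterfactuals=False):
    simulated = set()
    if counterfactuals and top_tracts is not None:
        # Simulated arrests for the tracts receiving heightened presence.
        generated = generate_counterfactual(self.dw, list(top_tracts), t)
        for tract, count in generated.items():
            previous_observations.setdefault(tract, {})[t] = count
            simulated.add(tract)
    # Every other tract gets the ground-truth arrests (if any) for day t.
    for tract, days in test_set.items():
        if tract not in simulated and t in days:
            previous_observations.setdefault(tract, {})[t] = days[t]
    return previous_observations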
compute_P_H_eq_1_given_A: Using self.pred_pol, compute the probability that a person of each racial category experiences a heightened police presence, i.e., P(H=1|A). The PredPol algorithm outputs its belief of the probability of a crime occurring for the given tract and timestamp. Assume we will send a heightened police presence to num_top_tracts tracts at each test timestamp. That is, we will choose the top num_top_tracts tracts with the highest probability of crimes occurring in that timestep after calling self.pred_pol.predict on each tract in self.dw.tracts.
- Arguments:
  - num_top_tracts: An integer to determine the number of tracts where there is a heightened police presence after each crime prediction.
  - counterfactuals: A boolean to indicate whether true test data is fed into future timesteps or counterfactuals.
  - track_odds_ratio: A boolean to indicate whether you should track odds ratios instead. See analysis question 7 for details.
- Returns: A dict whose keys are the racial categories and the values are P(H=1|A=a). The calculation of this probability requires the usage of PredPol to simulate police activity. Steps to implement this function and help you compute P(H=1|A=a) are given below; see the TIP after these steps for additional help computing this probability.
  - Split the data from self.dw into training and testing arrest logs.
  - For each timestep t in the range of the testing arrest log (the same range found in DataWrapper.split_arrests_log), you will do the following three things:
    - Get a prediction for the probability of crimes occurring in each tract in self.dw.tracts at t using self.pred_pol.predict. Remember to feed in all arrest data prior to t into the predict function. For example, for the first timestep, this will be training_arrest_log, and you will add future arrest data for future values of t (see the third step). See pred_pol.py for implementation details and inputs to self.pred_pol.predict.
    - Find the top num_top_tracts tracts in terms of the highest probability of crimes occurring. Make sure to mark each of these tracts as having received a heightened police presence, i.e., H=1 for these tracts!
    - Update the arrest data to feed into the predict function on the next iteration using ProbabilityAnalysis.update_previous_observations().
TIP: P[H|A] = sum_r P[H,R=r|A] and then use the definition of conditional probability to write P[H,R=r|A] in terms of P[H|R=r,A] and P[R=r|A]. The first probability can be calculated from the arrest data and the second from the previous function implementation.
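One reading of the TIP: since heightened presence is a property of the tract alone, P(H=1|R=r,A=a) collapses to an indicator of whether tract r was ever flagged during the simulation, so the sum reduces to adding up P(R=r|A=a) over flagged tracts. A sketch, with p_r_given_a from compute_P_R_given_A and flagged_tracts the set of tracts marked H=1:

# P(H=1|A=a) = sum over flagged tracts r of P(R=r|A=a)
def p_h_given_a(flagged_tracts, p_r_given_a, racial_categories):
    return {
        a: sum(p_r_given_a[a][r] for r in flagged_tracts)
        for a in racial_categories
    }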
You will notice an argument track_odds_ratio set to False by default. This should only be set to True for analysis question 7, to help you plot the data for that question. You will come back to edit this function when you get to question 7 and follow the instructions given there for implementation.
Analysis Questions
Answer each of the questions below as written responses and use the __main__ branch of the file to use the above classes/functions to help you find the answers. As a reminder, for Autograder questions, we will be grading your function implementations directly, so feel free to use the __main__ branch however you'd like. Use the default training_percentage of 2/3 in your DataWrapper instance.
For the sake of simplicity, this analysis will focus only on racial categories, and only four racial categories (which you will pass into the ProbabilityAnalysis class instance), each in our demographic data: Hispanic or Latino, White, Black or African American, and Asian. (The other categories all have relatively small numbers in Oakland.) Answer each of the following questions using these four racial categories (i.e., set racial_categories = ["Hispanic or Latino", "White", "Black or African American", "Asian"]).
Before we dive into the impact of PredPol, let's start by looking at underlying demographics and drug use.
1. [2 pts. Autograder] What is the proportion of each of the four racial categories in Oakland, i.e., what is P(A)? (This is shorthand notation for asking for a tuple of four numbers, P(A=a) for each of the four racial categories a.)
2. [2 pts. Autograder] What is the probability that a person in each of the racial categories uses illicit drugs, i.e. what is P(Y=1|A)?
Now let's see how these numbers compare to the arrests made.
3. a. [2 pts. Autograder] What is the expected number of times a person of each racial category was arrested in Oakland?
3. b. [2 pts.] What is the total expected number of times a person of each racial category was arrested as a percentage of all arrests? Which racial group(s) were arrested in an outsized proportion to their overall proportion in the population?
Now let's move to analyzing the outputs of the PredPol algorithm. To measure the potential for PredPol to be discriminatory, we will start with group fairness, as discussed in class, specifically demographic parity (aka independence). There are two differences from the definition of demographic parity we used in class. The first is that the task is not binary classification: the algorithm doesn't make a decision about a person only once, but rather once every day -- does a person face heightened police presence at timestamp t, or not (we assume that a person will always be subject to a heightened police presence in a given tract if they live there). The other difference is that the sensitive attribute A is not binary but rather quaternary, with four distinct values. We introduced H as our binary indicator variable to fix the first difference and we will define how to measure distance from demographic parity below.
4. [4 pts. Autograder] Run PredPol on each day in the test set and compute the set of twenty tracts to send a heightened police presence to. What is P(H=1|A)?
Because A is not binary, we will instead measure distance from demographic parity as the difference between the largest probability, P(H=1|A=a_max), and the smallest probability, P(H=1|A=a_min).
5. [2 pts.] How far away from demographic parity is PredPol? We can also compare the rate of heightened police presence to use rates for each racial group. Which groups faced an outsized police presence compared to their rates of illicit drug use? Report P(H=1|A=a) and P(Y=1|A=a) for each a.
However, running PredPol only on the existing data doesn't take into account that by assigning more police officers to a given tract, they are more likely to make more arrests than would have otherwise occurred by not using PredPol. But PredPol uses those same arrests: the more recent arrests in a given tract, the more crime PredPol thinks is going to be there. So could PredPol be creating a feedback loop where its initial choices are reinforced, leading to initial bias or discrimination getting reinforced? This would be even worse than PredPol merely repeating the initial bias!
6. [2 pts. Autograder] Repeat the same analysis, i.e., compute P(H=1|A), except use as test-time input to PredPol the arrests that would have happened if police officers were assigned according to PredPol. Because they were not the arrests that actually happened, but arrests that would have happened had the police acted according to PredPol, we call this a counterfactual. In order to compute the desired probabilities, set counterfactuals=True in ProbabilityAnalysis.compute_P_H_eq_1_given_A.
Here's another way of understanding the difference between the original dataset and the counterfactual dataset. Our concern is that on the counterfactual dataset, PredPol gets more and more confident of its choices because of a feedback loop between sending police to a location and PredPol's confidence that there is crime there. We now want to compare the probabilities of experiencing heightened police activity between the counterfactual and non-counterfactual datasets. For each dataset, let's split the tracts into two groups: the top ten tracts chosen by PredPol for that dataset, and every other tract.
7. a. [2 pts. Autograder] Run the same analysis again, but this time, for each of the two different approaches (i.e., counterfactuals=True/False), keep track of the following for each time step in the testing data:
(sum over r in top ten tracts of P(H_t|R=r)) / (sum over r not in top ten tracts of P(H_t|R=r))
This is a sequence of odds ratios, representing how much more confident PredPol was in the top ten tracts for that dataset than in all other tracts at each time step. Update ProbabilityAnalysis.compute_P_H_eq_1_given_A to track this odds ratio at each time step and conditionally return the odds ratios instead when track_odds_ratio=True.
7. b. [1 pt.] Plot these two sequences of odds ratios using matplotlib, and include the plot in the written write-up. The x-axis should be each timestamp in the testing data and the y-axis should be the odds ratio value. What does this plot say about how the counterfactuals affected PredPol's confidence over time?
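A minimal plotting sketch, assuming the two sequences have been collected into lists (odds_true and odds_false are hypothetical names; substitute your own results):

import matplotlib.pyplot as plt

# Placeholder data: replace with the odds ratios returned when
# track_odds_ratio=True, for counterfactuals=True and =False respectively.
timesteps = list(range(10))
odds_true = [1.0 + 0.3 * t for t in timesteps]
odds_false = [1.0 + 0.05 * t for t in timesteps]

plt.plot(timesteps, odds_true, label="counterfactuals=True")
plt.plot(timesteps, odds_false, label="counterfactuals=False")
plt.xlabel("Test timestep")
plt.ylabel("Odds ratio (top ten tracts vs. rest)")
plt.legend()
plt.savefig("odds_ratios.png")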
Part 2 - Reflection Questions
Answer the following short-response questions. Your responses should only be as long as necessary to answer the questions, but do make sure to briefly justify your answers for those questions that call for it.
8. a. [2 pts.] What is PredPol trying to predict, and what does it actually predict? That is, what is the construct PredPol is trying to measure, and what is its target variable? Hint: the target is not necessarily the predicted output of PredPol!
8. b. [2 pts.] Suppose PredPol achieves very low error with respect to its target variable. How can it still fail to accurately predict the construct it's trying to predict, despite having low error rates? Use evidence from your analysis in questions 1-4 to support your argument.
9. [3 pts.] How does PredPol constitute a feedback loop as defined in lecture? Using terminology from class, explain why this feedback mechanism can lead to unfair outcomes in predictive policing.
10. a. [2 pts.] In analysis question 5, we measured how far from demographic parity PredPol was. What are we considering a disadvantage in this analysis? What are we considering moral desert?
10. b. [2 pts.] Name and justify one other way we could define moral desert in this context.
11. [2 pts.] We had assumed that whenever an arrest is made in a given tract, a uniformly random person is arrested from that tract. How might this assumption be tested? If it's wrong, what consequences might this incorrect assumption have for our analysis? Be specific about which values in our analysis would change.
12. [2 pts.] Give and justify an example of harm (in terms of lack of fairness) that an algorithm like PredPol could exhibit that is not captured by statistical fairness definitions or feedback loops.
References
[1] Walter L. Perry, Brian McInnis, Carter C. Price, Susan C. Smith, and John S. Hollywood. Predictive Policing: The Role of Crime Forecasting in Law Enforcement Operations. National Institute of Justice, 2013. https://nij.ojp.gov/library/publications/predictive-policing-role-crime-forecasting-law-enforcement-operations
[2] Karen Hao. Police across the US are training crime-predicting AIs on falsified data. MIT Technology Review, 2019. https://www.technologyreview.com/2019/02/13/137444/predictive-policing-algorithms-ai-crime-dirty-data/
[3] ResourceRouter. Website accessed March 2024. https://www.soundthinking.com/law-enforcement/resource-deployment-resourcerouter/
[4] G. O. Mohler, M. B. Short, Sean Malinowski, Mark Johnson, G. E. Tita, Andrea L. Bertozzi, and P. J. Brantingham. Randomized Controlled Field Trials of Predictive Policing. Journal of the American Statistical Association, 2015. https://doi.org/10.1080/01621459.2015.1077710
[5] Rashida Richardson, Jason Schultz, and Kate Crawford. Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice. 94 N.Y.U. L. Rev. Online, 2019. https://www.nyulawreview.org/wp-content/uploads/2019/04/NYULawReview-94-Richardson_etal-FIN.pdf
[6] Kristian Lum and William Isaac. To predict and serve? Significance, 2016. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1740-9713.2016.00960.x