EECS 298: Social Consequences of Computing

Homework 2: Celebrity Linkage Attack

Due 11:59 PM EST on Friday, February 21st, 2025.

Written Points: 22 | Coding Points: 38 | Total Points: 60

Submission & Grading

This assignment is made up of two parts:

  1. Programming - submit HW2.py
  2. Written Reflection - submit a PDF with your responses

You’ll submit both parts on Gradescope. Part 1 will use the Code Submission assignment, while Part 2 will use the Homework 2: Written Responses assignment. To access Gradescope, use the link on Canvas.

[!IMPORTANT]
Please do not handwrite your written responses - using any word processor (Word, Google Docs, etc.) or LaTeX is fine!

Part 1 will be graded using an autograder, so you’ll be able to get feedback as soon as you submit - you can submit any number of times until you feel happy with your score!

Part 2 will be graded manually, but you can still resubmit as many times as you need to before the deadline.

Introduction: The NYC Taxicab Database

The New York Freedom of Information Law (FOIL) grants the public the ability to access, upon request, records maintained by state government agencies. This includes the New York City Taxi and Limousine Commission (NYC TLC), the agency that manages the city’s taxi cabs. In 2013, a data analyst named Chris Whong submitted a FOIL request to the NYC TLC and was in turn provided with a database containing full records of every taxi ride given in New York City that year. Chris Whong then published this data in its entirety. Because taxi medallions are publicly visible and the database was poorly anonymized by NYC officials, this database is highly vulnerable to a linkage attack.

Recall that a linkage attack joins two or more datasets through a (seemingly innocuous) shared feature or features, where one dataset contains personally identifiable information (PII) and the other contains sensitive information. For this assignment, you will perform a real linkage attack on this dataset using public paparazzi photos of celebrities in NYC taxi cabs, since many of these pictures show the cab’s unique medallion number. In this case, the taxi medallion number and the date of the picture/trip serve as the PII, and the sensitive information we are trying to recover is each celebrity’s tip amount for their taxi trip.

Part 1 - Executing and Preventing Database Attacks

First, use the wget command to get the HW2.py starter code file and the two csv files used in this homework.

$ wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/homeworks/HW2.py
$ wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/trips.csv
$ wget https://raw.githubusercontent.com/eecs298/eecs298.github.io/main/files/celebrities.csv

For this assignment, you will be provided access to a lightly modified version of the original 2013 NYC TLC database (trimmed for convenience) in trips.csv. In addition, you will be provided with celebrities.csv, a file containing information from another publicly available dataset: photographs of celebrities in NYC taxi cabs.

Your task will be to perform a linkage attack using these two datasets. The NYC TLC dataset has tips but no passenger names, while the celebrity photographs give us names but no tip amounts. The goal is to match celebrity information with trips in the taxi database. Not every celebrity in this dataset necessarily boarded a taxi contained in the NYC dataset, and there are many taxis in the NYC dataset for which we have no photographs, so your implementation should not assume a strict one-to-one relationship between the datasets. However, you may assume that a medallion and date match between the datasets is a correct match.

trips.csv

The original NYC dataset consists of hundreds of thousands of lines. You will be given a shortened, modified version of this dataset with 1000 lines. Each row in this dataset represents one trip made by an NYC taxi. The columns of this dataset are as follows:

The medallion number is used to identify the taxi cab and appears on the license plate of the taxi, the side of the taxi, and atop the taxi. The medallion numbers in this dataset follow a specific format: one digit, followed by one uppercase letter, followed by two digits, e.g. 1A23.

celebrities.csv

This dataset contains information about different celebrity trips as evidenced by paparazzi photos such as the following example pictures.

[Example paparazzi photos: Emma Roberts; Jake Gyllenhaal]

Each row in the dataset represents one celebrity trip. The columns of this dataset are as follows:

Deanonymizing the Medallion

Before you can execute the linkage attack, you will first need to reverse the anonymization that NYC TLC attempted. Hashing is the process of transforming one string of characters into another, often to make storage and access easier or to hide the original information. Performing a given hash on the same string will always produce the same output.

The medallion column in the trip dataset is hashed with a hash function called MD5. The hashing of the medallion data with MD5 appears to be evidence that NYC officials intended to anonymize the medallion and license numbers. However, hashing the same medallion always produces the same unique hash value, and the medallion format described above allows only 10 × 26 × 10 × 10 = 26,000 possibilities. Therefore, we can use our knowledge of the hashing function to map the unhashed medallions seen in celebrities.csv to their hashed versions in trips.csv. This, in combination with other data, becomes Personally Identifiable Information (PII).

To encode a string using MD5, you can use the md5 function from Python’s built-in hashlib library as follows:

from hashlib import md5

hashed_value = md5(my_value.encode()).hexdigest()

[!NOTE] MD5 is only one of many possible hash functions. While some hash functions can be used cryptographically to hide their input, MD5 is not one of them.
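
Since the medallion space is so small, even a hashed medallion can be recovered by brute force: hash all 26,000 candidate medallions once and store them in a reverse-lookup table. Here is a minimal sketch of that idea (the variable names are ours, not part of the starter code):

from hashlib import md5
from itertools import product
import string

# Enumerate every medallion of the form <digit><uppercase letter><digit><digit>
# and record which MD5 hash each one produces.
hash_to_medallion = {}
for d1, letter, d2, d3 in product(string.digits, string.ascii_uppercase,
                                  string.digits, string.digits):
    medallion = f"{d1}{letter}{d2}{d3}"
    hash_to_medallion[md5(medallion.encode()).hexdigest()] = medallion

# Any hashed medallion from trips.csv can now be reversed with one lookup:
# original_medallion = hash_to_medallion[hashed_value]

Precomputed tables like this are exactly why hashing a small, structured identifier provides essentially no anonymity.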

HW2.py

First, you will implement several classes to set up the taxi trip database with trips.csv as described below. Then, you will use the "__main__" block of the file to perform a linkage attack with the celebrities.csv information. We will then explore protecting against this attack by coarsening the queries allowed to the database, as well as by making the query function differentially private.

Implementing the classes

Linkage Attack: perform_linkage_attack

You are now going to attempt a linkage attack by implementing perform_linkage_attack. Your goal with this attack is to gather a list of linked (Trip, Celebrity) tuples so that you can figure out how much each Celebrity tipped on their trip.

[!TIP] Recall what our PII is in this setting to determine which attributes should be passed to query_trips to link a Celebrity to a Trip!

Use the "__main__" block of the Python program to read in celebrities.csv and use this function as you wish to answer the questions below; a rough sketch of the attack follows.
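
As a purely illustrative sketch of the overall flow (the names celebrities, db, and the query_trips signature below are assumptions; use the actual classes and functions you implemented in HW2.py):

from hashlib import md5

# Hypothetical sketch of the linkage attack loop.
linked_pairs = []
for celebrity in celebrities:
    # Hash the plaintext medallion read off the photo so it matches trips.csv.
    hashed = md5(celebrity.medallion.encode()).hexdigest()
    for trip in db.query_trips(medallion=hashed, date=celebrity.date):
        linked_pairs.append((trip, celebrity))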

1. [2 pts.] Report the celebrity who tipped the highest amount, along with the amount they tipped.

2. [2 pts.] Considering the data available in trips.csv (i.e., the column categories), name one thing an attacker could learn about the celebrity and describe how it might violate their sense of privacy.

Difference Attack: perform_difference_attack

For this part, you are only allowed access to trips.csv through query_mean_tip. Since you can only get mean tip values for a given date range, you will have to perform a difference attack by implementing perform_difference_attack to recover the tip information for a single Celebrity. Recall that a difference attack recovers information about an individual by comparing the result of a query computed with and without that individual’s data. You can filter a given celebrity out of a mean tip calculation using the filtered_medallion optional argument of query_mean_tip.

Your goal with this attack is to try to reconstruct the tip values for a particular Celebrity using only the query_mean_tip function.

[!TIP] You should only need two queries for each Celebrity.
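
To make the arithmetic concrete, here is a hedged sketch. Suppose n trips fall in the queried date range (how you obtain n depends on the starter code); then the celebrity’s tip can be isolated from two mean queries. The names db, date, celebrity_medallion, and n are assumptions, not the HW2.py API:

# Hypothetical two-query difference attack.
mean_all = db.query_mean_tip(start_date=date, end_date=date)
mean_without = db.query_mean_tip(start_date=date, end_date=date,
                                 filtered_medallion=celebrity_medallion)
# n * mean_all is the sum of all tips in the range, and (n - 1) * mean_without
# is that sum with the celebrity's trip excluded, so subtracting isolates it.
tip_guess = n * mean_all - (n - 1) * mean_without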

Again, use the "__main__" block of the Python program to read in celebrities.csv and use this function as you wish to answer the questions. Use your results from the linkage attack section to verify that you calculated the correct tip_amount for each celebrity!

[!NOTE] It is okay if your calculated tips differ slightly (within 0.005) from the true tip values, as this may happen with rounding errors.

Applying Differential Privacy (DP)

Refer to the query_mean_tip function you wrote for the previous section. Since we can recover Celebrity tip information with a difference attack, you will implement a differentially private algorithm to attempt to prevent the attack. You will implement the Laplace algorithm, which adds noise to the result of the query, making any difference in the output caused by the inclusion of one individual much harder to detect. Specifically, your task will be to return an epsilon-DP response from query_mean_tip whenever the epsilon argument is not None, using the Laplace algorithm. epsilon controls the privacy/accuracy trade-off: smaller values of epsilon mean more noise is added and stronger privacy.

The tip amount column in this database is a continuous value rather than a discrete value, and as such, we won’t be able to use the randomized response algorithm discussed in class. Instead, we can return the mean tip, plus a random value drawn from a distribution (the Laplace distribution, in this case).

[Figures from lecture: the Laplace distribution; the Laplace mechanism]

Recall the graph of the Laplace distribution and the overview of the Laplace mechanism from class. Given a query function f (query_mean_tip) and a database D (trips.csv), we want a D.P. algorithm that returns an estimate of the true answer f(D) by adding some randomness Y, i.e. it returns f(D) + Y.

The Laplace distribution (like many other distributions, e.g. the Gaussian distribution) is parameterized by a scale parameter, which is always a positive number (the variance is twice the square of the scale). The amount the output of the query can change when only one person’s data changes is called the sensitivity Δ(f). The Laplace algorithm draws its noise with scale Δ(f)/epsilon, so more variance is added when epsilon is smaller or the sensitivity is larger. Follow these steps to implement the Laplace algorithm for calculating the noised mean in your code:

The Laplace Algorithm
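
As a hedged reference point, a minimal Laplace mechanism for a mean query might look like the following (the function and parameter names are ours, not the HW2.py API):

import numpy as np

def laplace_noised_mean(true_mean, sensitivity, epsilon, rng=None):
    # Noise scale b = sensitivity / epsilon: a smaller epsilon or a larger
    # sensitivity means more noise, i.e. stronger privacy but less accuracy.
    rng = rng if rng is not None else np.random.default_rng()
    return true_mean + rng.laplace(loc=0.0, scale=sensitivity / epsilon)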

For the following questions, run the code that answers each question in the order described, so that your results remain consistent with the random number generator.

3. [1 pt.] First, we will check the accuracy of our noised queries. Query the mean tip for the month of January in 2013 (i.e., start_date="2013-01-01" and end_date="2013-01-31") without DP and with epsilon=0.8. Are the returned outputs close (within $0.50)?

4. [2 pts.] Set epsilon=0.8 and rerun perform_difference_attack from the previous section, using the same calculation to guess each celebrity’s tip amount. Report the tip guess for Judd Apatow and whether it is close to his actual tip value (within $0.50).

5. [2 pts.] Repeat the above with epsilon=0.1, epsilon=1, and epsilon=15. Were you able to calculate a tip guess for Judd Apatow that is within $0.50 of his actual tip value for any of these values? If so, which one(s)? Use your results to order the epsilon values from most to least accurate in terms of how close the calculated tip value was to the true tip value.

Part 2 - Written Reflection Questions

Answer the following short-response questions. In your responses, we are looking for an effort to apply concepts from lectures and readings to answer each of these questions. Make sure to briefly justify each answer.

6. [4 pts.] Give 3 differences between k-anonymity and differential privacy in terms of the properties described in lecture and explain the privacy implications of these differences. Consider the impact that implementing differential privacy had on predicting tip values in this dataset and consider the impact that implementing k-anonymity would have had. Which one is better at preventing difference attacks?

7. [2 pts.] Using the conception of privacy as restricted access to hidden information, describe whether this case study constitutes a violation of privacy. Why or why not?

8. Given that we were able to perform the linkage attack due to the poor anonymization of the taxi medallions, one reaction might be to remove the taxi medallion information altogether from the released data. However, FOIL is intended to provide transparency of government data to the public, so one should be careful about removing publicly available information.

a. [2 pts.] Describe one possible benefit to the community (NYC residents taking the taxis) of publicly releasing some version of the taxi medallion information (anonymized or not).

b. [2 pts.] Describe one possible way the linkage attack could still be performed even if the taxi medallion information was removed.

9. In this question, we will examine contextual integrity in this setting of celebrities using the taxi service in NYC.

a. [2 pts.] Describe the sociotechnical context for this setting: the sender and recipient, the subject, the type of information, and the relationship under which the information is transmitted.

b. [3 pts.] Describe a setting with a similar sociotechnical context that is not this one (it should not involve the NYC TLC dataset). Make sure to point out why the context is similar. Then describe the norms of appropriateness and the norms of flow for that other context.

c. [3 pts.] Using your answers to the previous two parts, describe how norms of appropriateness and norms of flow are violated in this setting. Conclude by stating whether contextual integrity holds in this setting.