Considerations for Consistent Terminology When Teaching Data Privacy Methods
Earlier this year, members of the Urban Institute’s Data Science Team worked with the Bureau of Economic Analysis (BEA) to provide trainings on data privacy and confidentiality. As part of a larger body of work on data privacy, we taught current techniques data curators can use to release data publicly while limiting the threat to individual confidentiality.
Given the recent explosion of publicly available data, protecting confidentiality has grown increasingly difficult. But among the interdisciplinary range of practitioners working on best practices (including policymakers, economists, computer scientists, and statisticians), the lack of consensus on the terminology used to describe these methods is a large barrier to training new practitioners.
Frequently used terms in the data privacy field often overlap, causing ambiguity in existing research and writing. For example, computer scientists refer to a question about the data as a “query,” whereas statisticians will use the term “statistic.” Although seasoned data privacy experts understand that these terms describe the same concept, beginners might find the distinction confusing.
Our struggles to reconcile these differences internally came to a head when we began creating materials to cover the 2020 Decennial Census Disclosure Avoidance System as an example application of a specific data privacy technique called differential privacy. Teaching this application required us to agree internally on a set of terms that fit with existing research but that were also distinct from each other. Through this process, we learned a few key lessons others can apply to their own work to ensure consistency and understanding when introducing data privacy applications.
The 2020 Decennial Census provides rich material for teaching about statistical disclosure control (SDC) methods, or avenues for preserving privacy in released datasets. The Census Bureau is obligated by law to protect the confidentiality of survey respondents, so it’s frequently at the forefront of the data privacy field.
The Census Bureau made headlines when they announced that for the release of 2020 Decennial Census results, they would apply a class of methods under the umbrella of “differential privacy” as part of their disclosure avoidance system. (This post will not discuss the particulars of differentially private methods, but Urban has also published an in-depth explainer of differential privacy in the census use case.)
The census case was a natural fit for our BEA trainings, as the bureau’s application of differential privacy is one of the first publicly available examples of these methods applied to public-facing data. However, we ran into difficulty trying to explain the methodology in greater detail, as many terms used to describe each step overlapped.
Data privacy workflow
The process of developing and applying an SDC method is often iterative and subjective, requiring an understanding of the acceptable disclosure risk (or the risk of identifying an individual in publicly released data) for a given dataset and organization. The process also requires understanding the minimum standards for data utility, or the degree to which the data can be distorted to protect privacy before its usefulness diminishes past acceptable levels. The figure below demonstrates what this process can look like. It starts with defining thresholds for privacy and utility, then moves to removing personally identifiable information and iterating on the methods used until the results are acceptable for release. This tension between privacy and utility is described further in this Urban report.
Figure: Data Privacy Workflow
Because our BEA trainings focused on advanced methodology, we spent the bulk of the course time diving into Step 4, developing an SDC method. Large and complex data (such as the 2020 Decennial Census results) often require developing and applying SDC methods with multiple steps, including pre- and postprocessing. In the census example, the raw survey responses need to be cleaned and aggregated (preprocessed) to define total population counts at each level of geography. Because the Census Bureau added differentially private noise from a Gaussian distribution, some of the resulting counts could be negative. Negative counts are impossible in the real world and would have confused lay data users, so the noisy values needed to be postprocessed to remove them. (The bureau also defined many other constraints that postprocessed values needed to satisfy.)
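To illustrate why small counts are the ones most at risk of turning negative, here is a minimal sketch. It assumes continuous Gaussian noise with a made-up standard deviation; the bureau's actual noise distribution and parameters are not described in this post, and the function name `prob_negative` is hypothetical.

```python
from statistics import NormalDist


def prob_negative(true_count: float, sigma: float) -> float:
    """Probability that true_count plus N(0, sigma^2) noise falls below zero."""
    # The noisy count is normally distributed around the true count,
    # so we evaluate its cumulative distribution function at zero.
    return NormalDist(mu=true_count, sigma=sigma).cdf(0.0)


# Hypothetical counts and noise level, chosen only for illustration.
for count in (0, 2, 10, 100):
    print(count, round(prob_negative(count, sigma=5.0), 4))
```

Under these assumptions, a geography with a true count near zero has a substantial chance of a negative noisy value, while a large geography almost never does.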
To teach these concepts, we defined the following steps of an SDC method (note that some SDC methods don’t have the last step):
- Preprocessing: Determine the priorities for which statistics to preserve (could be considered Step 3 in the workflow)
- Privacy: Apply a sanitizer to the desired statistic (i.e., alter the statistic)
- Postprocessing: Ensure the resulting statistics are consistent with realistic constraints (e.g., no negative population counts)
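The three steps above can be sketched as three small functions. This is a toy illustration under stated assumptions, not the bureau's method: the records are invented, the function names are hypothetical, the noise level `sigma` is arbitrary (so the sketch makes no formal privacy guarantee), and clipping and rounding stand in for the much more involved constraint handling used in the census.

```python
import random
from collections import Counter


def preprocess(records):
    """Preprocessing: aggregate raw records into the statistic to preserve
    (here, a population count per geography)."""
    return Counter(r["geography"] for r in records)


def sanitize(counts, sigma, rng):
    """Privacy: apply a sanitizer to each statistic (here, Gaussian noise)."""
    return {geo: n + rng.gauss(0.0, sigma) for geo, n in counts.items()}


def postprocess(noisy_counts):
    """Postprocessing: enforce realistic constraints
    (whole-number counts that are never negative)."""
    return {geo: max(0, round(v)) for geo, v in noisy_counts.items()}


# Hypothetical toy records, not census data.
records = [{"geography": "A"}] * 3 + [{"geography": "B"}] * 40
rng = random.Random(0)  # fixed seed so the sketch is repeatable

released = postprocess(sanitize(preprocess(records), sigma=5.0, rng=rng))
print(released)
```

Keeping each step in its own function mirrors the terminology: the sanitizer is only the middle function, and everything applied to its noisy output belongs to postprocessing.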
In creating these definitions, our primary difficulty was defining the privacy step and the postprocessing step. Specifically, the words “sanitizer,” “algorithm,” and “mechanism” are often used interchangeably to refer to a set of steps applied to generate noise and add it to data to obscure individual values. This NIST introduction to differential privacy uses both “mechanism” and “algorithm,” whereas this feature article in the Notices of the American Mathematical Society uses both “sanitizer” and “mechanism.”
“Mechanism” is not used universally, and “algorithm” generally refers to only one step of applying an SDC method. The census “Top Down Algorithm” doesn’t actually refer to the bureau’s application of differentially private noise, but rather to the postprocessing applied to the noisy results to ensure constraints are met. Even the term “sanitizer” can create confusion, as some authors use the term when naming new methods. For example, the “Laplace Sanitizer” is a specific method developed by John Abowd and Lars Vilhuber, but it’s different from the “Laplace sanitizer,” which informally refers to the application of Laplace noise. For practitioners who’ve never encountered these concepts, such as the BEA staff we were training, these distinctions are important, yet confusing.
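As a rough illustration of the informal, lowercase sense (simply adding Laplace noise to a statistic), consider the sketch below. It is not the Abowd and Vilhuber method. The function names and the `sensitivity` and `epsilon` parameters are assumptions introduced here, following the common textbook convention of setting the noise scale to sensitivity divided by epsilon.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two
    independent exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_sanitizer(true_value, sensitivity, epsilon, rng):
    """Informal 'Laplace sanitizer': add Laplace noise with
    scale = sensitivity / epsilon to a single statistic."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical statistic and parameters, for illustration only.
rng = random.Random(42)
noisy = laplace_sanitizer(true_value=120, sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy)
```

Note that nothing here enforces constraints on the output, which is why steps such as removing negative counts are treated as separate postprocessing.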
To remove ambiguity, the Urban team decided on the following terminology. We referred to the sanitizer applied in the privacy step of an SDC method using lowercase “sanitizer” to separate this word from the title of a specific method, and we used “algorithm” to encompass the postprocessing step. We decided not to use the term “mechanism” at all to cut down on overlapping technical terms.
Part of our motivation for partnering with the BEA to train their staff on data privacy was helping increase awareness of SDC methods and their importance in making usable data available publicly. But the data privacy field is doing itself no favors by continuing to use overlapping and piecemeal terminology. Even something as simple as defining a “method” posed problems for us, let alone the more technical aspects of SDC methods, such as defining “sanitizer” and “mechanism.”
As we continue our efforts to educate policymakers and data stakeholders about data privacy methods, finding consensus on these definitions remains top of mind. We are working to maintain internal consistency via a style guide for writing about privacy and hope to eventually make flashcards based on these definitions publicly available. We hope this work will set the stage for broader conversations within the field about removing ambiguity from SDC method terminology.