Communicating Synthetic Data Evaluation Results with a Parameterized R Markdown Report
Recently, a team of Urban Institute researchers partnered with the Allegheny County Department of Human Services and the Western Pennsylvania Regional Data Center to create a fully synthetic version of the department’s confidential data tracking individual usage of child welfare services, behavioral health services, aging services, developmental support services, homeless and housing supports, and family strengthening and youth supports. Synthetic data are statistically representative pseudo-records, which can enable Allegheny County to preserve many features and properties of the confidential data while mitigating privacy threats to individual participants reflected in the data. The synthetic dataset can be publicly shared, helping researchers, service providers, members of the public, and other stakeholders better understand these services and their recipients.
Our process for generating synthetic data with our county partners was highly iterative and collaborative. We varied components of our synthesis methodology over the course of the project, ultimately generating 77 synthetic candidates. We evaluated each of these candidates across a series of metrics before selecting the dataset that best balanced data utility with data privacy. While the Urban Institute team provided technical capabilities and expertise, we relied on the county partners to guide data utility priorities based on their knowledge of the data’s common use cases, ensure our assessments of data privacy aligned with county policies, and make decisions about tradeoffs between data utility and data privacy.
Given the iterative and collaborative nature of this process, we faced a central question: How can we evaluate dozens of these synthetic candidates and communicate a large amount of highly technical evaluation results to our county partners? We wanted to provide a one-stop shop for county partners to review and digest synthetic data evaluation results, and we determined that a parameterized R Markdown report solution met our needs along several key dimensions.
Parameterized R Markdown evaluation reports can facilitate comprehensive technical communication
Over the course of the project, we developed and continually refined a parameterized R Markdown template that aligned with the project’s iterative and collaborative nature, enabling us to share synthetic data utility and privacy metrics with our county partners and to respond to their feedback and requests.
Others at the Urban Institute have written about creating iterated fact sheets and using HTML pages for various project use cases. We drew from this work and used R Markdown not for final polished products, but as a flexible and workable internal tool throughout the life of the project. Below, we outline three key characteristics of parameterized reports that supported our use case.
1. R Markdown report templates can include parameters for use in the knit environment
Our reports featured three primary parameters: synthesis ID (`synth_num`), replicate number (`rep_num`), and confidential data structure type (`conf_type`). These parameters allowed us to quickly dictate the synthetic and confidential data sources to evaluate as well as the types of analyses to run on our dozens of candidate datasets. The update process was as simple as changing the parameters in one place and knitting the report.
We can declare these parameters using the `params` field in the YAML header of an individual R Markdown (`.Rmd`) file. However, instead of manually changing the YAML header for each of the many reports we created, we used a separate helper script to iterate over parameter values and pass these values to the `params` argument of the function `rmarkdown::render()`. This helper script was structured as follows:
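A minimal sketch of such a helper script might look like the code below. The file names, parameter values, and output naming scheme here are hypothetical, not the project's actual ones:

```r
# Helper script: render one evaluation report per synthetic candidate.
library(rmarkdown)
library(purrr)

# Grid of parameter combinations to evaluate (illustrative values)
report_params <- expand.grid(
  synth_num = 1:77,
  rep_num   = 1,
  conf_type = "cross_sectional",
  stringsAsFactors = FALSE
)

# Knit the template once per row, passing the row's values as params
pwalk(report_params, function(synth_num, rep_num, conf_type) {
  render(
    input       = "evaluation_template.Rmd",  # hypothetical template file
    output_file = sprintf("evaluation_synth%02d_rep%d.html", synth_num, rep_num),
    params      = list(
      synth_num = synth_num,
      rep_num   = rep_num,
      conf_type = conf_type
    ),
    envir = new.env()  # knit each report in a clean environment
  )
})
```

Rendering in a fresh environment for each report keeps objects from one candidate's evaluation from leaking into the next.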
Upon knitting, parameters are stored within the `params` object and can be used throughout the R Markdown report template file. You can access the synthesis ID number with `params$synth_num`, for example. With this parameterized template setup, we quickly generated evaluation reports for each new batch of synthesis results to review internally and share with county partners.
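On the template side, the YAML header declares the parameters along with default values used when the report is knit directly. A sketch with hypothetical defaults:

```yaml
---
title: "Synthetic Data Evaluation Report"
output: html_document
params:
  synth_num: 1                  # hypothetical default synthesis ID
  rep_num: 1                    # hypothetical default replicate number
  conf_type: "cross_sectional"  # hypothetical default structure type
---
```

Values passed to `rmarkdown::render()` override these defaults, so the same template serves both interactive development and batch rendering.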
2. R Markdown reports are flexible and allow for continuous updates
Our report template wasn’t created in a single day, week, or even month; it evolved over the course of the project as we continued the iterative synthetic data generation process. We continuously added, improved, and refined the metrics and analyses in the report as we received feedback and input from county data experts.
We sought to capture the following topics in our reports: typical data use cases such as tracking service usage over time and analyzing interactions between different services; priorities for preserving demographic characteristics like age, race, ethnicity, and gender; and considerations of disclosure risks and privacy policies. The report template we developed allowed us to easily respond to new information and areas of focus that we gained from our discussions with the county partners.
3. R Markdown reports cleanly display text, code, and results in a cohesive document
With R Markdown, we could weave narrative text with our computational analyses, which allowed us to better communicate technical results by including elements such as a table of contents, headers and body text, code chunks, inline code, and embedded visualizations in our HTML reports.
Although our partners were already experts in these datasets, the synthetic data generation process also required many specific technical decisions surrounding modeling and evaluation choices. To contextualize these decisions and results, we included descriptions of each evaluation metric alongside the relevant statistics and visualizations in the report template. We described the motivation for each metric, provided guidance for interpreting results (e.g., directionality and magnitude), and included references to outside resources.
The image below, for example, uses mock data points to display a sample report section on relative frequency comparisons:
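The computation behind a relative frequency comparison can be sketched with mock data and dplyr; the column names, categories, and proportions below are hypothetical:

```r
library(dplyr)
library(tidyr)

# Mock confidential and synthetic data (illustrative only)
set.seed(20220101)
services <- c("behavioral_health", "aging", "housing")
conf  <- tibble(service = sample(services, 1000, replace = TRUE,
                                 prob = c(0.50, 0.30, 0.20)))
synth <- tibble(service = sample(services, 1000, replace = TRUE,
                                 prob = c(0.48, 0.32, 0.20)))

# Compare each category's relative frequency across the two datasets
freq_compare <- bind_rows(confidential = conf, synthetic = synth,
                          .id = "source") %>%
  count(source, service) %>%
  group_by(source) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  pivot_wider(id_cols = service, names_from = source, values_from = prop) %>%
  mutate(difference = synthetic - confidential)
```

Small absolute differences in `difference` suggest the synthetic data preserve the confidential data's category proportions for that variable.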
All in all, R Markdown enabled our team to efficiently produce organized and comprehensive results, respond quickly to feedback, and communicate clearly with our county partners. Programmers working in R or other languages could also consider Quarto, a newer multilanguage, open-source scientific publishing system that builds on R Markdown and adds native support for languages such as Python and Julia. For our use case and given our technical capabilities and expertise, building on our existing R Markdown work made for an effective collaboration with our county partners.