Data-driven decisionmaking in city government has expanded rapidly in recent years, driven by advances in technology and the digitization of many city services. While we at the Urban Institute applaud the growth of data-driven decisionmaking, we also recognize that there are real concerns about the potential for bias in data used to guide public decisions. Left unchecked, unrepresentative data can directly lead to inequitable policy outcomes that harm vulnerable groups.
For example, many public works departments have started using citizen complaint data, like 311 requests, to allocate scarce city resources to perform sidewalk repairs and fix potholes. On the surface, this may seem like a way to make governments more responsive to citizen needs. The problem is that citizen complaint systems are more likely (PDF) to be used by certain demographic groups (PDF), namely white residents, highly educated residents, and high-income residents. And as a result, their neighborhoods are often the first to get sidewalks repaired or potholes fixed. Although significant progress is being made in developing tools to assess algorithmic fairness and transparency, such as Aequitas from the University of Chicago or the What-If tool from Google, we could not find many readily available tools that could solve this specific issue for local governments by assessing fairness and representativeness of the datasets themselves. This was especially true for the most common type of data we saw cities investigating and providing on their open data portals — geospatial data.
To fill this gap, the Urban Institute team has built and is making freely available our new Spatial Equity Data Tool. This tool allows users to upload their own city-level data and quickly assess spatial and demographic disparities, provided the data have geographic identifiers in the form of latitude and longitude columns. The tool uses census tract–level data from the five-year American Community Survey (ACS) and allows users to select from a precompiled list of ‘baseline’ ground truth datasets to compare their uploaded data against. When a user uploads their data, the tool automatically determines the dataset’s source city, pulls in the relevant census data, computes specific measures of geographic and demographic representativeness, and visualizes them on a dashboard. We hope this will be a useful tool for government officials, community advocates, and nonprofit organizations to assess the representativeness of city-level data before using them for decisionmaking and to advocate for a more equitable distribution of resources.
More broadly, we built this tool in order to:
1) Democratize data analysis capabilities: Measuring representativeness of datasets is difficult; it requires assembling a ground truth measure from different data sources to compare your data against. While large municipalities might have the technical expertise, time, staff, and budget to perform such analyses, it is out of reach for smaller municipalities and organizations. We hope that our tool will allow anyone with an interest in data, regardless of their technical background, to perform such equity analyses.
2) Standardize definitions of equity and representativeness: Our tool constructs standardized metrics to systematically measure disparities in user-uploaded data against a set of precompiled baseline variables. This means that users can easily and quickly make comparisons across time, across policy domains, and even across cities. This opens the door to many types of analyses that would previously have been prohibitively time consuming for busy city leaders and community groups. All users have to do is upload two different datasets and compare the results.
3) Embed equity into city decisionmaking: This tool is part of a larger research project to help cities more equitably use data and technology. Our accompanying research identified the need for accessible data tools that operationalize equity principles and facilitate regular and rigorous examination of the representativeness of data powering city decisionmaking. We think of this tool as one part of a larger decisionmaking process, involving community input, expert local knowledge and other community engagement strategies like data walks.
Our tool in action
To illustrate how the tool can be used, we’ll walk through an analysis using one of the sample datasets built into the tool: New York City public Wi-Fi hotspots.
Public Wi-Fi hotspots are an important public good, especially now that many schools and universities are transitioning to online learning. We know that even before COVID-19, students without home internet or computer access lagged behind their peers academically, and that this “homework gap” is more pronounced for Black, Hispanic, and low-income students. Many without internet at home rely on public Wi-Fi to connect to the internet for work and school. Providing equitable access to Wi-Fi is key to ensuring that these academic disparities are mitigated as much as possible during the pandemic. To see how equitable the current distribution of public Wi-Fi hotspots is, we use a dataset of public Wi-Fi hotspots available on New York City’s open data portal and ask which neighborhoods and demographic groups Wi-Fi hotspots serve. Users would only need to download the data from the city’s data portal, upload the data to our tool, click run analysis, and wait a few seconds for the dashboard to appear.
We now walk through the dashboard and each of its components below.
The first half of the dashboard displays a map of the geographic disparities present in the distribution of public Wi-Fi hotspots at the neighborhood (i.e., census tract) level. For each neighborhood in New York City, the tool calculates a geographic disparity score, which is the percentage-point difference between the share of all Wi-Fi hotspots that come from a neighborhood and the share of the baseline population that comes from that neighborhood. High disparity scores (blue areas of the map) mean that there are a lot of Wi-Fi hotspots in that neighborhood relative to the total population, which is our baseline population in this example. When you mouse over a tract on the map, you can see how the geographic disparity score is calculated.
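As a rough illustration of the definition above, here is a minimal Python sketch of a geographic disparity score. It is not the tool's actual code: the function name, the tract labels, and the toy numbers are all hypothetical, and it assumes each data point has already been assigned to a census tract.

```python
from collections import Counter


def geographic_disparity(point_tracts, baseline_pop):
    """Return {tract: data share minus baseline share}, in percentage points.

    point_tracts: one tract ID per data point (e.g., per Wi-Fi hotspot).
    baseline_pop: {tract ID: baseline population count}.
    Both inputs are hypothetical stand-ins for the uploaded data and ACS data.
    """
    counts = Counter(point_tracts)
    n_points = sum(counts.values())
    total_pop = sum(baseline_pop.values())
    scores = {}
    for tract, pop in baseline_pop.items():
        data_share = 100 * counts.get(tract, 0) / n_points
        baseline_share = 100 * pop / total_pop
        scores[tract] = data_share - baseline_share
    return scores


# Toy city (made-up numbers): 10 hotspots spread over three tracts.
hotspots = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
population = {"A": 2000, "B": 3000, "C": 5000}

scores = geographic_disparity(hotspots, population)
for tract, score in scores.items():
    print(f"tract {tract}: {score:+.1f} percentage points")
```

In this made-up city, tract A holds 60 percent of the hotspots but only 20 percent of the population, so it scores +40; tract C holds 10 percent of the hotspots and 50 percent of the population, so it scores -40.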
The baseline population is the ‘ground truth’ that you think your data should be compared against. In the above map, the selected baseline population is the total population, which is the default used by our tool for all datasets. However, one of the key features of the tool is the ability to change the baseline population to better suit the uploaded data. In this case, one might expect that a truly equitable Wi-Fi hotspot program should specifically serve residents without access to internet at home. The population without internet access is one of several baseline datasets we have built into the tool, so we can assess whether and how New York City’s current distribution of Wi-Fi hotspots equitably serves this population.
When we change the baseline population, the first part of the disparity score (the share of the data coming from each tract) stays the same. But the second part of the score (the share of the baseline population coming from each tract) changes accordingly. The full list of selectable baseline datasets is shown below. Each of the baseline populations is a specific population from the five-year ACS. We chose these specific baselines because we believe, based on conversations with key stakeholders and early beta testers, that they are common target populations for municipal programs and nonprofit services.
Once we select the population of households without internet access as our baseline dataset, the map of geographic disparity scores changes.
We now see more clearly that there are many neighborhoods in Manhattan, particularly in the Upper West Side and Midtown, that are overrepresented in Wi-Fi hotspots. This means that these neighborhoods have a low proportion of the city’s households without internet access but a high proportion of the city’s Wi-Fi hotspots. On the other hand, there are large swaths of Bronx County and Kings County that are slightly underrepresented. City officials could use this analysis, along with more robust community input and other ground truth measures, to identify where to place future Wi-Fi hotspots to reach underserved communities.
We also give users the ability to view a data comparison, which shows your data on a slider map side by side with the selected baseline data. The left side of the map displays the share of your data that comes from each tract while the right side displays the share of the population without internet access that comes from each tract. We created this view so users can get a better understanding of how the geographic disparity score is constructed — it is simply the difference between the left side and the right side of the map.
The second part of the dashboard visualizes the demographic disparities in your data for a few selected variables. For each variable, we compute a demographic disparity score. This is the percentage-point difference between the representation of a demographic group in the data (the data-implied percentage) and the representation of that demographic group in the city as a whole (the citywide percentage). Take a simple example city with two census tracts, each home to 50 percent of the city’s population. If tract A’s population is 20 percent Asian and tract B’s is 40 percent Asian, then the citywide percentage of Asian residents is (0.5)(20) + (0.5)(40) = 30 percent. The citywide percentage answers the question, “What share of the city’s total population is composed of Asian residents?”
Now imagine 80 percent of Wi-Fi hotspots come from tract A and 20 percent of Wi-Fi hotspots from tract B. The data-implied percentage of Asian residents would be (0.8)(20)+ (0.2)(40) = 24 percent. The data-implied percentage of Asian residents answers the question: “What is the share of Asian residents in the average tract from which the data originate?”
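The two weighted averages in this toy example can be reproduced in a few lines of Python. This is only a sketch of the arithmetic described here, with hypothetical names; the gap between the two results is the demographic disparity score as defined earlier.

```python
def weighted_percentage(shares, tract_pct):
    """Average tract-level percentages, weighting each tract by its share."""
    return sum(shares[tract] * tract_pct[tract] for tract in shares)


# Toy city from the text: two tracts, each with half of the population.
pct_asian = {"A": 20, "B": 40}            # percent Asian in each tract
population_share = {"A": 0.5, "B": 0.5}   # share of city population per tract
hotspot_share = {"A": 0.8, "B": 0.2}      # share of Wi-Fi hotspots per tract

citywide = weighted_percentage(population_share, pct_asian)
data_implied = weighted_percentage(hotspot_share, pct_asian)
disparity = data_implied - citywide  # in percentage points

print(f"citywide percentage:     {citywide:.1f}")
print(f"data-implied percentage: {data_implied:.1f}")
print(f"demographic disparity:   {disparity:+.1f} percentage points")
```

The only thing that changes between the two calculations is the set of weights: population shares for the citywide percentage, and shares of the uploaded data for the data-implied percentage.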
And finally, the demographic disparity score is the difference between the data-implied percentage and the citywide percentage, or 24 − 30 = −6 percentage points. In this toy example, this means that Asian residents are underserved with Wi-Fi hotspots by 6 percentage points. Note that this is only a toy example, and in reality, our tool uses non-Hispanic Asian, white, and Black racial categories. Below is the demographic disparity chart with our selected demographic variables.
Here we can see that non-Hispanic white residents and highly educated residents are significantly overserved by Wi-Fi hotspots. On the other hand, non-Hispanic Black and Latinx (Hispanic) residents are some of the most underserved groups. In other words, the neighborhoods where Wi-Fi hotspots are concentrated tend to have high numbers of non-Hispanic white and highly educated residents and low numbers of non-Hispanic Black and Latinx residents. Moreover, school-age children younger than 18 and cost-burdened renters are underserved, suggesting that the city’s current Wi-Fi hotspot distribution does not equitably serve children who may most need these services to attend school and online learning classes. For the demographic disparity chart, we use “Latinx” to refer to people of Latino or Hispanic origin as asked by the census. We recognize that this gender-neutral term may not be the preferred identifier for many, but we use it to remain inclusive of gender nonconforming and nonbinary people.
Eagle-eyed readers may have also noticed that one of the demographic variables on the chart is households without internet access, which is the same as our baseline variable. And in this case, the chart implies that New York’s Wi-Fi hotspots underserve households without internet across the city, as there is statistically significant underrepresentation reported by our tool. This result highlights the importance of looking at both the demographic and geographic representativeness scores together. While households without internet are underserved as a whole, as shown by the demographic disparity scores, there are specific pockets of the city where households without internet access are significantly overserved, as shown by the geographic disparity map.
The ‘baseline’ population for this portion of the dashboard cannot be changed. It will always compare the demographics of your data to the city’s total population because the US Census Bureau only publishes detailed tract-level demographic data about the total population. The bureau does not, for example, provide data on all race and ethnicity combinations at the tract level for the population ages 65 and older. Finally, we have also built in the ability to export both the data and the static images for the geographic disparity map and the demographic disparity chart. This makes it easy for users to include images from the dashboard in reports, slide decks, or other output when presenting the data to others and enables them to use the data to build off of our analysis.
Because the geographic and demographic disparity scores outlined above rely on tract-level ACS demographic variables, they have associated margins of error. To check whether our disparity scores are statistically significant rather than artifacts of sampling variation in the census figures, we generate 95 percent confidence intervals for our metrics using the appropriate margins of error and formulas in chapter 7 (PDF) of the ACS handbook. If the confidence interval contains 0 (that is, we don’t have enough evidence to conclude a disparity exists), we report that score as not statistically significant and color it gray.
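The significance check can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: it takes a disparity score and its margin of error as given (how the tool propagates tract-level margins of error into each score is not shown here), and it relies on the fact that published ACS margins of error correspond to a 90 percent confidence level, so dividing by 1.645 recovers a standard error. The function names and example numbers are hypothetical.

```python
Z_90 = 1.645  # ACS margins of error are published at the 90 percent level
Z_95 = 1.96   # critical value for a 95 percent confidence interval


def moe_to_standard_error(moe_90):
    """Convert a 90 percent margin of error into a standard error."""
    return moe_90 / Z_90


def confidence_interval_95(estimate, standard_error):
    """Return the (lower, upper) bounds of a 95 percent confidence interval."""
    half_width = Z_95 * standard_error
    return estimate - half_width, estimate + half_width


def is_significant(estimate, moe_90):
    """True if the 95 percent confidence interval excludes zero."""
    lower, upper = confidence_interval_95(estimate, moe_to_standard_error(moe_90))
    return not (lower <= 0 <= upper)


# Hypothetical disparity scores (percentage points) with made-up margins of error.
print(is_significant(-6.0, 4.0))  # interval lies entirely below zero
print(is_significant(1.5, 4.0))   # interval straddles zero, so it would be gray
```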
We heard from early beta testers that having the ability to manipulate the data before running the analysis would be helpful, so we added two advanced options that allow this. The most requested feature was the ability to filter the dataset before running the analysis. Say, for example, we want to limit our analysis to free Wi-Fi hotspots that allow unlimited data usage, given the potentially heavy data needs of online learning applications. When uploading their data, users can select Advanced Options > Filter Data, select the FREE_COL column from the dataset (an indicator specifying whether or not the Wi-Fi allows for free unlimited usage), and set the value as ‘Free’. Users also have the option of applying numeric or date filters for other columns in the data.
The second advanced feature is the ability to weight your data by a particular column. Though this isn’t present in this particular dataset, say we had a column in the data that denoted the average speed of each Wi-Fi hotspot. We could weight our data by that column, which means that a Wi-Fi hotspot with a speed of 100 MB/s would be weighted twice as heavily as a Wi-Fi hotspot with a speed of 50 MB/s. This is important because analyzing equity in the distribution of Wi-Fi hotspots requires taking into account both the number of Wi-Fi hotspots and the quality of each Wi-Fi hotspot (in this case measured by average speed). Weighting allows users to ensure that the tool is actually measuring the level or quality of service provided by each data point. Because we do not have a speed column in our data, we did not perform the weighting for this analysis.
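To make the two advanced options concrete, here is a small Python sketch of how filtering and weighting change each tract's share of the data. It is a hypothetical illustration, not the tool's code: the function, the `tract` and `speed` fields, the ‘Limited’ value, and the rows themselves are invented for the example, while `FREE_COL` and ‘Free’ follow the filter described above.

```python
from collections import defaultdict


def tract_shares(rows, filter_col=None, filter_value=None, weight_col=None):
    """Return each tract's share of the (optionally filtered, weighted) data."""
    # Keep every row when no filter is set; otherwise keep exact matches only.
    kept = [r for r in rows if filter_col is None or r[filter_col] == filter_value]
    totals = defaultdict(float)
    for row in kept:
        # Each row counts once unless a weight column is supplied.
        totals[row["tract"]] += row[weight_col] if weight_col else 1.0
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {tract: weight / grand_total for tract, weight in totals.items()}


# Made-up hotspots; "speed" is the hypothetical column discussed in the text.
rows = [
    {"tract": "A", "FREE_COL": "Free", "speed": 100},
    {"tract": "B", "FREE_COL": "Free", "speed": 50},
    {"tract": "B", "FREE_COL": "Limited", "speed": 50},
]

unfiltered = tract_shares(rows)
free_only = tract_shares(rows, filter_col="FREE_COL", filter_value="Free")
free_by_speed = tract_shares(
    rows, filter_col="FREE_COL", filter_value="Free", weight_col="speed"
)
print(unfiltered, free_only, free_by_speed)
```

With the filter alone, tracts A and B each hold half of the free hotspots. Once the rows are weighted by speed, tract A's single 100 MB/s hotspot counts twice as much as tract B's 50 MB/s hotspot, so A's share rises to two-thirds.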
So you’ve used the tool and identified some potential disparities. What now? While our tool can give users a rough sense of the size and scale of the disparities present in the data, it cannot tell you why those disparities exist in the first place. To understand why those disparities exist and how to address them, we urge users to consider the following potential drivers of unrepresentativeness within their data:
· Data collection issues: The design and implementation of data collection systems can yield unequal representation. For example, resident-generated datasets, such as 311 requests, may reflect higher utilization of the system by some groups of residents. Therefore, the data may not accurately represent the true need for city services.
· Program implementation: The program captured in the data may not have been designed for the equity objective our tool is assessing. Some cities put public Wi-Fi hotspots in government buildings or downtown commercial centers to cater to the tourist population. As a result, Wi-Fi hotspots would be very overrepresented in commercial neighborhoods but underrepresented in less-central residential neighborhoods.
· Historical inequities: Data reflect the biases of the systems that generate them. For example, police arrest data are often concentrated in low-income communities of color because of previous policy decisions to overpolice these communities. These biased data are often fed into predictive policing algorithms, which, in turn, send even more police officers into these neighborhoods, generating even more disproportionate arrest records.
· Mismatched baseline datasets: While our tool offers several baseline datasets, it may not offer the baseline that best represents the most equitable distribution of your data. For example, when analyzing disparities in pothole-repair-request data, the correct baseline dataset to compare against might be a dataset on traffic flow or some other measurement of likelihood of potholes. Unfortunately, that is not one of the datasets available in our tool.
These are certainly not the only potential drivers of unrepresentativeness in data, and often multiple drivers can be present simultaneously. We believe that the tool can raise some of the right questions to ask, and working with officials and community members with local expertise is key to answering them.
Taking a step back, this tool encapsulates a key Urban value: democratizing data and data analysis tools is key to making policymaking more equitable and evidence based. Tools like ours put the power of equitable data-driven decisionmaking into the hands of local policymakers, civil rights groups, and community activists.
We have worked with a small group of beta testers to decide on and build out the features for this initial version of the tool, and the fantastic feedback we received has left us with many ideas for additional features. These include adding more demographic variables and baseline populations, allowing users to upload their own baseline variables, and allowing other file types, such as GeoJSON, and other types of geographic data, like polygons. We hope to continue developing this tool by incorporating these features and others based on user feedback.
We have published all the code powering the tool on GitHub in case you want to build out these features yourself or modify our methodology. If you would like to request any additional features or have specific questions about the tool, please reach out to email@example.com. If you have any questions while using the tool, please check out our technical appendix and FAQ.
Finally, we hope to write a future Data@Urban post describing how we built the tool using a primarily serverless architecture on Amazon Web Services. For now, go forth and analyze!