Image by RyanJLane/Getty Images

Creating Better, More Informative Map Legends in R

Data@Urban
Dec 15, 2021

For many geography-oriented projects, maps have become the default vehicle for data visualization. But maps aren’t always the best way to visualize geographic data. Yes, they are familiar to people and have become easier to create, but they can also hide important takeaways. In some cases, the size of a geographic area may not correspond to the importance of the data value, or the legend may show aggregated groups instead of relaying a better sense of the distribution of the data.

Here, we propose a better way to use map legends that helps the reader understand how to read the map and gives the reader a view of the data distribution. We used this approach for a project in which we looked at how a variety of factors may correlate with the award rate for the Social Security Disability Insurance (SSDI) program. Below, we explain how we looked at patterns in SSDI awards across five categories of disabilities defined by the Social Security Administration and created nearly 200 maps in R Markdown (which you can explore in full at Urban’s public GitHub repo) by leveraging several tools.

The map legend problem

Typical map legends aggregate values into groups, which prevents readers from determining specific values. Take this relatively simple map showing the number of people awarded SSDI benefits in 2018 per 100,000 people. The map divides the US into roughly 2,300 Public Use Microdata Areas (PUMAs), which are geographic areas within states defined to include at least 100,000 individuals. For PUMAs with the lowest SSDI award rates, you only know that fewer than 200 people per 100,000 receive SSDI awards. You don’t know the exact values of the PUMAs or how different the values are: perhaps the award rate in one PUMA is 1 per 100,000 and in another is 199 per 100,000. There are a variety of ways to change the bins in the legend to address this aggregation problem, but converting the map legend to a bar chart that shows the distribution of the data, as we show below, is one way to make the data clearer.
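To make the aggregation problem concrete, here is a minimal sketch in base R. The three award rates and the legend breaks are made-up values chosen for illustration; they are not figures from the SSDI data.

```r
# Hypothetical award rates per 100,000 people for three PUMAs
award_rates <- c(puma_a = 1, puma_b = 199, puma_c = 201)

# Hypothetical legend breaks; each bin is closed on the left, e.g. [0,200)
legend_breaks <- c(0, 200, 400, 600)
legend_bins <- cut(award_rates, breaks = legend_breaks, right = FALSE)

# puma_a and puma_b share a bin even though their rates differ by 198,
# while puma_b and puma_c differ by only 2 but land in different bins
data.frame(
  puma = names(award_rates),
  award_rate = award_rates,
  legend_bin = legend_bins
)
```

A reader of the binned legend sees only the last column, so the first two PUMAs look identical.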

Consider this next map, which shows SSDI “hot spots,” which we define as PUMAs in which the share of awards for specific disabling conditions is in the top 10 percent relative to other areas in the same year (you can read more about this concept and other patterns in the full paper). This map shows hot spots for SSDI award rates for reasons of mental disorders, which you can see are particularly high in the New England states of Massachusetts, New Hampshire, and Vermont. Not only does this map’s legend make it difficult to see the exact values in each PUMA, it also fails to show how many PUMAs fall within each category. Do those dark blue areas represent 10 or 14 occurrences? How many more times are PUMAs classified at the top end of the range than the bottom?
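The hot spot definition above (top 10 percent within the same year) can be sketched with library(dplyr) as follows. The data are simulated, and the column names (puma_id, year, share_mental) are assumptions made for this example rather than names from the project’s code.

```r
library(dplyr)

# Simulated shares for illustration only; these are not SSDI data
set.seed(20211215)
awards <- tibble(
  puma_id = rep(sprintf("puma_%03d", 1:200), times = 2),
  year = rep(c(2017, 2018), each = 200),
  share_mental = runif(400, min = 0.05, max = 0.40)
)

# Flag PUMAs at or above the 90th percentile of their own year
hot_spots <- awards %>%
  group_by(year) %>%
  mutate(
    cutoff = quantile(share_mental, probs = 0.90),
    hot_spot = share_mental >= cutoff
  ) %>%
  ungroup()

# Count flagged and unflagged PUMAs in each year
count(hot_spots, year, hot_spot)
```

Grouping by year before computing the cutoff is what makes the comparison “relative to other areas in the same year”; a single pooled cutoff would let an unusual year dominate the flags.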

We used several libraries within R Markdown to more clearly highlight the true distributions within these maps.

R libraries to create maps

In addition to the standard tidyverse libraries, such as dplyr and ggplot2, used to work with data and create data visualizations in R, we used two other libraries to create the maps:

· library(tigris) makes it painless to pull shapefiles for US Census Bureau geographies straight into R. For instance, the code snippet below creates a “simple features” data frame with cartographic boundary shapefiles for all PUMAs in Alabama.

· Simple features, or sf, is the new framework for geospatial analysis in R, and it is implemented in library(sf). It allows geospatial points, lines, and polygons to be stored in the geometry column of a data frame instead of a complex hierarchical data structure, which means the data play well with data frame-focused functions from library(dplyr).

Once the shapefiles are loaded, they are easy to map with geom_sf() from library(ggplot2). Library(tigris) and library(sf) make it possible to create a map in just a few lines of code. A few years ago, the Urban Institute had dozens of people using ArcGIS weekly — now only a handful do because of the power of library(sf).

# load packages
library(tidyverse)
library(tigris)
library(sf)

# load PUMA shapefiles for Alabama
alabama <- pumas(
  state = "01",
  class = "sf",
  cb = TRUE
)

# map Alabama
ggplot(data = alabama) +
  geom_sf()

Previously, we’ve outlined the eight elements of the layered grammar of graphics. Arrangement, what we call the composition of distinct data visualizations, isn’t included in the layered grammar of graphics, but is useful for data visualizations, like maps, that use fill (color) as an aesthetic mapping.

When mapping data, a defining challenge is that the map uses the x and y dimensions to display geography instead of the variable of interest. To expand the utility of the map legend on the SSDI hot spots for mental disorders map, we expand its shape to turn it into a vertical bar chart, where the colors of the bars correspond with the color encoding in the map. Now, the reader can see not only the specific values but also the distribution of the data across the entire country. The addition of the histogram/legend benchmarks the frequency of the hot spots.
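One way to build such a legend is a histogram whose bars are filled with the same classes the map uses. The sketch below is illustrative only: the award rates are simulated, and the class breaks, bin width, and palette are assumptions rather than the values used in the project.

```r
library(ggplot2)

# Simulated award rates per 100,000 people; these are not SSDI data
set.seed(2018)
puma_rates <- data.frame(award_rate = rgamma(2300, shape = 4, rate = 0.01))

# Hypothetical class breaks that the map's fill scale would also use
class_breaks <- c(0, 200, 400, 600, 800, Inf)
puma_rates$rate_class <- cut(
  puma_rates$award_rate,
  breaks = class_breaks,
  include.lowest = TRUE
)

# Histogram bins are 25 wide and start at 0, so no bar straddles a class break
legend_chart <- ggplot(puma_rates, aes(x = award_rate, fill = rate_class)) +
  geom_histogram(binwidth = 25, boundary = 0, color = "white") +
  scale_fill_brewer(palette = "Blues", guide = "none") +
  labs(x = "Awards per 100,000 people", y = "Number of PUMAs") +
  theme_minimal()

legend_chart
```

Because the bar colors come from the same classes as the map, the chart can stand in for the usual legend (hence guide = "none") while also showing how many PUMAs sit at each value.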

Arranging the map and bar chart is easy to implement in R with Thomas Lin Pedersen’s R package patchwork. For this map, we add X, Y, Z to the basic geom_sf() code shown earlier:

library(patchwork)

combined_figure <- wrap_plots(mental_disorders, legend) +
  plot_annotation(theme = theme(plot.margin = margin()))
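The mental_disorders and legend objects are not defined in the snippet above, so here is a self-contained sketch of the same arrangement. It uses the North Carolina county shapefile bundled with library(sf) as a stand-in for the PUMA data; the rate variable, class breaks, palette, and width ratio are all assumptions made for illustration.

```r
library(ggplot2)
library(sf)
library(patchwork)

# Stand-in data: North Carolina counties bundled with library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc$sids_rate <- 1000 * nc$SID74 / nc$BIR74
nc$rate_class <- cut(nc$sids_rate, breaks = c(0, 1, 2, 3, Inf), right = FALSE)

# One fill scale, reused so the map and legend colors always agree
class_fill <- scale_fill_brewer(palette = "Blues", guide = "none")

# Map: counties filled by class, with the default legend switched off
nc_map <- ggplot(nc) +
  geom_sf(aes(fill = rate_class), color = "white") +
  class_fill +
  theme_void()

# Legend: a bar chart counting the counties in each class
nc_legend <- ggplot(st_drop_geometry(nc), aes(x = rate_class, fill = rate_class)) +
  geom_bar() +
  class_fill +
  labs(x = "SIDS deaths per 1,000 births", y = "Number of counties") +
  theme_minimal()

# Give the map three times the width of the legend
combined <- wrap_plots(nc_map, nc_legend, widths = c(3, 1)) +
  plot_annotation(theme = theme(plot.margin = margin()))

combined
```

Defining the fill scale once and adding it to both plots is a design choice that keeps the color encoding from drifting if the palette or breaks change later.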

Creating lots of maps

At the Urban Institute, we widely use R Markdown to organize analyses, and nearly 200 maps is a lot of maps. Two features are essential when working with such lengthy analyses. First, a floating table of contents makes it easy to navigate the document, and any team member can quickly and easily scan the Markdown page to find the maps they want to see. The table of contents can be added by including toc: TRUE in the YAML header of the R Markdown document (toc_float: TRUE makes it float alongside the page).
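The same table of contents options can also be set from R rather than in the YAML header, which may be convenient when rendering from a script. This is a minimal sketch: the file name in the commented-out call is hypothetical, and the project itself may configure these options only in YAML.

```r
library(rmarkdown)

# HTML output format with a floating table of contents;
# equivalent to toc: TRUE and toc_float: TRUE in the YAML header
maps_format <- html_document(toc = TRUE, toc_float = TRUE)

# "ssdi-maps.Rmd" is a hypothetical file name; uncomment to render it
# render("ssdi-maps.Rmd", output_format = maps_format)
```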

Second, tab sets collapse information and generate a webpage organized similarly to other pages on the internet, making it easy for team members to scroll and select maps. Here, we add {.tabset} next to a section header in the R Markdown code to collapse all of its subsections into a tab set.
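When a document holds many maps, the tab headers can be written by a loop instead of by hand. The sketch below assumes it runs inside an R Markdown chunk with the option results = "asis"; write_tabset(), the section title, and the tab labels are hypothetical names introduced for this example.

```r
# Write a section header with {.tabset}, then one subsection header per tab.
# Intended for a chunk with results = "asis" so the text is read as Markdown.
write_tabset <- function(section_title, tab_titles) {
  cat("## ", section_title, " {.tabset}\n\n", sep = "")
  for (tab_title in tab_titles) {
    cat("### ", tab_title, "\n\n", sep = "")
    # A real document would print the map for this tab here,
    # e.g. print(map_for(tab_title)) with a hypothetical map_for() helper
  }
}

# Hypothetical tab labels
write_tabset("Hot spot maps", c("Category A", "Category B", "Category C"))
```

Each subsection one level below the {.tabset} header becomes a tab, so the loop produces one tab per label.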

Our team consists of researchers at the Urban Institute and elsewhere, and the COVID-19 pandemic created new challenges for collaboration. R Markdown and GitHub Pages make it easy to collaborate on both large and incremental programming projects. GitHub Pages is a feature of GitHub repositories that provides free web hosting. R Markdown can create HTML documents, which we can easily share across the entire team.

In our workflow, we generate the HTML with R Markdown and then push the analysis to GitHub. Every time an updated analysis reaches the main branch, the results on the website update automatically. With this process, collaborators and reviewers, even if they are not programmers, can see all the results by visiting a URL rather than sharing code or image files. When working with so many maps, each presenting its own challenges for accurately representing the findings, making collaboration as easy as possible is critical, and the tools within R Markdown and GitHub facilitate that.

-Jonathan Schwabish

-Aaron R. Williams

Want to learn more? Sign up for the Data@Urban newsletter.
