Why the Urban Institute Visualizes Data with ggplot2
Researchers who visualize data need tools that are flexible, reproducible, scalable, and consistently branded. Many researchers at the Urban Institute use the popular R package ggplot2 because it satisfies all of these needs.
• Flexible: Visualizations should be informed by the insights and ideas of a researcher instead of the limitations or defaults of the software tool.
• Reproducible and scalable: Visualizations should be entirely reproducible, and the marginal effort required to produce the 2nd, 10th, or 100th visualization should be far less than the effort required to make the first.
• Consistently branded: The source of a visualization should be recognizable based solely on the thematic components of the plot.
The ggplot2 package meets these needs because it is based on a coherent theory (the layered Grammar of Graphics), is written in a programming language designed to be extended with additional functionality (an “extensible” language), and has a system for custom themes built into the package itself.
In this post, I’ll expand on these three qualities and the advantages of ggplot2 for data visualization.
ggplot2 is flexible
ggplot2 is an implementation and expansion of Leland Wilkinson’s Grammar of Graphics. Wilkinson’s work provides a framework for describing the components of a data visualization. Hadley Wickham extended the grammar of graphics with the idea of layers in his paper A Layered Grammar of Graphics and fully implemented the framework with ggplot2. Instead of relying on named graphics like bar plot or histogram, the layered grammar of graphics describes fundamental building blocks of static data visualizations. I’ve summarized Wickham’s take on Wilkinson’s building blocks below:
1. Data are the values represented in the visualization.
2. Aesthetic mappings are directions for how variables in the data are mapped to visual properties that we can perceive. Aesthetic mappings include linking variables to the x-position, y-position, color, shape, and size.
3. Geometric objects are representations of the data, including points, lines, and polygons.
4. Scales turn data values, which are quantitative or categorical, into aesthetic values.
5. Coordinate systems map scaled geometric objects to the position of objects on the plane of a plot. The two most popular coordinate systems are the Cartesian coordinate system and the polar coordinate system.
6. Facets (optional) break data into meaningful subsets.
7. Statistical transformations (optional) transform the data, typically through summary statistics and functions, before aesthetic mapping.
8. Theme controls the visual style of a plot with font types, font sizes, background colors, margins, and positioning.
The example below uses data from library(AmesHousing), an R package with housing data from Ames, Iowa, and demonstrates how the syntax of ggplot2 flows from the framework outlined by Wilkinson and Wickham. Note that only snippets of the code used to create the plots below are included here. The entire R Markdown document with all code is available in this GitHub repository.
The code starts with the data “ames” in ggplot(). Aesthetic mappings are in aes() with square_footage mapped to the x-position and Sale_Price mapped to the y-position. geom_point() represents the data as points and the argument alpha = 0.2 adds transparency because many points are mapped on top of each other.
ggplot2 automatically generates scales. Here, scale_x_continuous() and scale_y_continuous() are used to specify styles for the labels and customized breaks along the x-axis and y-axis. ggplot2 defaults to the Cartesian coordinate system, no facets are used, and no statistical transformations are used, so these elements are not represented in the code. Finally, source() at the top of the code chunk sources the Urban Institute ggplot2 theme.
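Since only the description of the code survives here, the following is a minimal sketch of what such a plot could look like. It assumes a synthetic stand-in for the ames data frame (the values are simulated, not the real Ames data), keeps the column names square_footage and Sale_Price from the text, and leaves out the source() call for the Urban theme so that it runs without a network connection. The break positions and label formats are my own illustrative choices.

```r
library(ggplot2)

# Synthetic stand-in for the post's `ames` data frame (simulated, NOT the real Ames data)
set.seed(20)
ames <- data.frame(square_footage = runif(500, min = 500, max = 4000))
ames$Sale_Price <- 40000 + 75 * ames$square_footage + rnorm(500, sd = 20000)

# Data and aesthetic mappings go in ggplot() and aes()
ames_plot <- ggplot(data = ames, mapping = aes(x = square_footage, y = Sale_Price)) +
  # Geometric object: points, with transparency because many points overlap
  geom_point(alpha = 0.2) +
  # Scales: custom breaks and comma-formatted labels on the x-axis
  scale_x_continuous(
    breaks = seq(0, 4000, by = 1000),
    labels = function(x) format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
  ) +
  # Scales: dollar-formatted labels on the y-axis
  scale_y_continuous(
    labels = function(x) paste0("$", format(x, big.mark = ",", scientific = FALSE, trim = TRUE))
  ) +
  labs(x = "Square footage", y = "Sale price")

# Printing the object draws the plot
ames_plot
```

The coordinate system, facets, and statistical transformations do not appear because the defaults are used.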
This example is brief. R for Data Science by Garrett Grolemund and Hadley Wickham, and ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham are both invaluable long-form resources for learning the fundamentals of ggplot2.
ggplot2 is reproducible and scalable
Creating data visualizations with the package ggplot2 in a scripting language like R makes it easy to reproduce and scale visualizations. How?
1. It gives you the opportunity to show your work to others and future you. A script breaks down a process into digestible and reproducible steps, allowing the creator and others to easily reproduce any graphic. Scripting remains the standard for research when it comes to summary and inferential statistics. Researchers should strive for the same standard for most data visualizations. Scripts also simplify teaching because the input is a recipe that can be understood by the computer and the user. At the Urban Institute, we provide guides with code and examples for dozens of different plot types and maps. It’s cookbookery as an on-ramp to understanding, and for many researchers, it works!
2. R and ggplot2 work well with version control systems like Git because the scripts and data are simple text files. Version control facilitates reproducibility, makes experimenting easier with branching, and captures the history of the development of the visualization.
3. ggplot2 scales well with faceting and iteration. Faceting breaks data into meaningful subsets and constructs similar plots for each subset. This is often called “small multiples”. The following example uses the same code as above, but adds facet_wrap(~zoning, ncol = 4) and tweaks some of the labels.
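A faceted version might look roughly like the sketch below. The data are again a synthetic stand-in, and the four zoning labels ("Zone A" through "Zone D") are hypothetical placeholders rather than the categories in the real Ames data.

```r
library(ggplot2)

# Synthetic stand-in data with a hypothetical four-level `zoning` variable
set.seed(26)
ames <- data.frame(
  square_footage = runif(400, min = 500, max = 4000),
  zoning = rep(c("Zone A", "Zone B", "Zone C", "Zone D"), each = 100)
)
ames$Sale_Price <- 40000 + 75 * ames$square_footage + rnorm(400, sd = 20000)

faceted_plot <- ggplot(ames, aes(x = square_footage, y = Sale_Price)) +
  geom_point(alpha = 0.2) +
  # One panel per zoning category, laid out in a single row of four
  facet_wrap(~zoning, ncol = 4) +
  labs(x = "Square footage", y = "Sale price")

faceted_plot
```

The only structural change from a single-panel plot is the facet_wrap() line; every panel shares the same mappings, geometry, and scales.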
Iteration is programmatically repeating operations. ggplot2 can iterate using loops or functionals to create dozens, hundreds, or even thousands of plots in seconds. The following example iterates ggplot() over four subsets of the data and saves visualizations of each subset to PDFs.
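One way to sketch this kind of iteration is with a small plotting function and a functional such as vapply(). The function name plot_zone(), the zoning labels, and the file names are all hypothetical, the data are a synthetic stand-in, and the PDFs are written to a temporary directory so the example does not touch your project folder.

```r
library(ggplot2)

# Synthetic stand-in data with a hypothetical four-level `zoning` variable
set.seed(27)
ames <- data.frame(
  square_footage = runif(400, min = 500, max = 4000),
  zoning = rep(c("Zone A", "Zone B", "Zone C", "Zone D"), each = 100)
)
ames$Sale_Price <- 40000 + 75 * ames$square_footage + rnorm(400, sd = 20000)

# Hypothetical helper: build the plot for one subset of the data
plot_zone <- function(zone) {
  ggplot(ames[ames$zoning == zone, ], aes(x = square_footage, y = Sale_Price)) +
    geom_point(alpha = 0.2) +
    labs(title = zone, x = "Square footage", y = "Sale price")
}

out_dir <- tempdir()

# Iterate over the subsets, save each plot as a PDF, and collect the file paths
pdf_paths <- vapply(unique(ames$zoning), function(zone) {
  path <- file.path(out_dir, paste0("ames_", gsub(" ", "_", tolower(zone)), ".pdf"))
  ggsave(filename = path, plot = plot_zone(zone), width = 6.5, height = 4)
  path
}, character(1))

pdf_paths
```

Adding a fifth or a five-hundredth subset requires no new plotting code, only more values to iterate over.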
Plots can also be iterated with R Markdown, which the Urban Institute uses to produce fact sheets for different localities, organizations, or periods of time. Creating graphics this way saves our researchers a tremendous amount of time.
ggplot2 allows for consistent branding
At an organization like the Urban Institute, a flexible, reproducible, and scalable data visualization won’t go far unless it complies with the Urban Institute Data Visualizations Style Guide. To that end, we built the Urban Institute ggplot2 theme, which adds Urban branding to ggplot2 plots with the addition of just one line of code to the top of any R script.
source("https://raw.githubusercontent.com/UrbanInstitute/urban_R_theme/master/urban_theme_windows.R")
This theme creates a set of rules for styling the components of the grammar of graphics. The first part of the theme creates the function theme_urban(), which handles sizes, spacing, font families, orientation, and the placement of elements.
The theme is based on ggplot2’s web of inheritances. At the top, the theme sets default line, rectangle, and text styles. These three styles are then passed as defaults to the next layer of theme elements. If nothing is specified for an element in the second layer, the defaults from the top are used. The second layer is then passed to the third, and so on.
text -> axis.text -> axis.text.x
The theme can control all text styling, the plot background, plot axes, the legend, and the style of faceting.
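To make the inheritance concrete, here is a small sketch with a hypothetical theme function, theme_sketch(). It is not the real theme_urban(), which sets many more elements; the font family and sizes are arbitrary choices for illustration. calc_element() is the ggplot2 function that resolves an element by walking up its chain of parents.

```r
library(ggplot2)

# Hypothetical stand-in for theme_urban(); the real theme sets many more elements
theme_sketch <- function(base_size = 12, base_family = "serif") {
  # Top level: base_size and base_family set the defaults for the `text` element
  theme_grey(base_size = base_size, base_family = base_family) +
    theme(
      # Second level: all axis text is drawn at 90% of the inherited text size
      axis.text = element_text(size = rel(0.9)),
      # Third level: x-axis text overrides only the font face
      axis.text.x = element_text(face = "bold"),
      # Rectangle elements follow the same pattern
      panel.background = element_rect(fill = "white")
    )
}

# Resolve axis.text.x by following text -> axis.text -> axis.text.x
resolved <- calc_element("axis.text.x", theme_sketch())
resolved$family  # inherited from `text`
resolved$size    # relative size from `axis.text` applied to the size of `text`
resolved$face    # set directly on `axis.text.x`
```

Each level only states what it changes, so a single edit near the top of the chain restyles every element below it.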
The second part of the theme controls color. First, it changes the default color of geometric objects from black to Urban’s standard blue. Second, it creates a nine-color palette for plots that map a variable to color.
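The two color steps could be sketched as follows. The hex value and the nine palette colors are placeholders I chose for illustration, not Urban's official colors, and scale_color_sketch() is a hypothetical helper rather than a function from the Urban theme. Note that update_geom_defaults() changes the default for the rest of the R session.

```r
library(ggplot2)

# Placeholder colors for illustration; NOT Urban's official palette
default_blue <- "#1696d2"
palette_sketch <- grDevices::hcl.colors(9, palette = "Dark 3")

# Step 1: change the default point color for the rest of the session
update_geom_defaults("point", list(colour = default_blue))

# Step 2: hypothetical helper that applies the nine-color palette to a discrete variable
scale_color_sketch <- function(...) {
  scale_color_manual(values = palette_sketch, ...)
}

df <- data.frame(x = 1:9, y = 1:9, group = letters[1:9])

# No color mapping: points use the new default color
single_color <- ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 3)

# Color mapped to a nine-level variable: points use the palette
nine_colors <- ggplot(df, aes(x = x, y = y, color = group)) +
  geom_point(size = 3) +
  scale_color_sketch()
```

Because both steps live in the theme script, individual plots need no color code at all unless they depart from the defaults.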
These plots aren’t necessarily publication-ready, but the addition of one line of code makes researchers’ lives easier on the front end, editors’ lives easier on the back end, and the Urban Institute brand a lot more consistent.
Flexible, reproducible, scalable, and consistently branded data visualizations improve the scientific process, enhance the communication of research insights, and save researchers time. Most tools for data visualization offer one or two of these qualities. R and the ggplot2 package deliver all three, providing a powerful means for furthering Urban’s mission.