Illustration by Rhiannon Newman for the Urban Institute

Visualizing Vaccinations: Web Scraping and a Live Tableau Dashboard

Data@Urban
11 min read · May 28, 2021

Over the past year, journalists and researchers have found creative ways to analyze and visualize COVID-19 data, including trends on infections, deaths, and vaccinations around the world. But getting, cleaning, contextualizing, and visualizing COVID-19 data can be challenging because it’s hard to know where to start.

Our small team at the Urban Institute partnered with the data visualization consulting firm HealthDataViz to collect COVID-19 vaccination data and create a daily updated public Tableau dashboard that allows users to explore vaccination trends across the United States. But we didn’t want to simply re-create others’ interactive dashboards and visualizations, especially the excellent ones produced by the New York Times and the Washington Post. Instead, we wanted to incorporate some perspective on disparities in the vaccine rollout and how socioeconomic factors like median income and job type correlate with vaccination rates.

Together, we were able to build a public-use Tableau dashboard that displays information about daily vaccinations alongside socioeconomic data to provide a perspective on vaccination disparities not readily available elsewhere. We hope other researchers and organizations will see how easy it is to collect and disseminate similar dashboards to help their users and stakeholders. Here, we provide a step-by-step walkthrough of how we created our final product.

Part 1: Web scraping CDC data

The Centers for Disease Control and Prevention (CDC) publishes daily vaccination data for the US on their COVID Data Tracker dashboard. We wanted an easy way to collect these data, so we wrote a script in R that downloaded the data from the CDC, cleaned them, and then uploaded them to a Google Sheet. The Tableau dashboard would then pull from the Google Sheet, automatically updating every day.

You can find the full R web scraping scripts in this repository, and we go over some important snippets of the code below. To begin, we downloaded the data in JSON form from the CDC API using the jsonlite library:


library(jsonlite)
# We got this URL from inspecting the network tab on the CDC vaccine tracker page
vaccination_data_url <- "https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"

d <- jsonlite::read_json(vaccination_data_url)["vaccination_data"][[1]]

Next, we conduct some data cleaning and convert the JSON data into an R dataframe. This is often the hardest part of working with JSON data because the format may not translate directly into tabular data frames, which, in this case, we need for the Tableau dashboard. Luckily, we were able to use the as_tibble() function to convert each JSON record into a one-row data frame and stitch them all together with map_df().


# Convert all fields in JSON into a dataframe
df_all <- d %>%
  # Apply the as_tibble() function to every item in d, and stitch all results together into a df
  map_df(as_tibble)

A lesson we've learned from scraping daily updated data is that it's always a good idea to check that the data structure has stayed the same from one day to the next; the data fields returned by the CDC API often changed without warning. We perform this check with the dplyr::all_equal() function and stop the script if the fields have changed so the research team can confirm the new fields won't affect the data points shown on the dashboard.


# Create a dataframe of column (ie field) names
colnames_df <- tibble(colnames = df_all %>%
  colnames() %>%
  sort())

# If the field names aren't the same as those in the cache, then stop the script
if (!isTRUE(dplyr::all_equal(colnames_df, read_csv("data/cdc_schema_db/fields.csv")))) {
  # Write out the new fields into the cache (ie a csv)
  colnames_df %>%
    write_csv("data/cdc_schema_db/fields.csv")
  # Stop the script
  stop("CDC API schema has changed. Please see `cdc_schema_db/fields.csv`")
}

We also scrape historic vaccination data from the Our World in Data (OWID) repository. The OWID team compiles data from the same CDC dashboard, but unlike the original CDC dashboard, they very helpfully maintain a time series record going back to January 12, 2021. We use these data for the time series plots on our dashboard.


owid <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/us_state_vaccinations.csv") %>%
  select(date, location, daily_vaccinations_raw, daily_vaccinations, everything())

Finally, we authenticate with Google and upload all the data into a public Google Sheet, which is made delightfully easy by the googlesheets4 package:


library(googlesheets4)
# Get env variables, which hold the Google Sheets ID and the service account JSON file
google_sheet_id <- Sys.getenv("google_sheets_id")
google_service_account_json <- Sys.getenv("google_key")

# Set up Google authentication with a service account. See here for more details:
# https://gargle.r-lib.org/articles/get-api-credentials.html#service-account-token
gs4_deauth()
gs4_auth(path = google_service_account_json)


## ---- Write to Google Sheets ----
owid %>%
  sheet_write(ss = google_sheet_id, sheet = "state")

We automated the process of running this script every day with a scheduled (cron) job on GitHub Actions, which is a powerful automation tool for repetitive tasks. Stay tuned for a future Data@Urban post explaining the power of GitHub Actions and delving into the specifics of how we used it for this project!

Part 2: Appending socioeconomic data from the ACS and BLS

In addition to displaying COVID-19 vaccination data, we wanted to provide context on the socioeconomic disparities present in the rollout. So we used the ‘tidycensus’ R package to download state-level socioeconomic data from the 2015–19 American Community Survey (ACS) and placed them in the same Google Sheet. Our full script that downloads these data can be found here, and we present a few important snippets below.

We decided to pull variables for race and ethnicity, age, median income, and total population at the state level. So we used the get_estimates() and get_acs() functions and joined the results together by state name.


library(tidycensus)
# Get ACS data via tidycensus
state_acs_data_breakdowns <- get_estimates(
  geography = "state",
  product = "characteristics",
  breakdown = c("AGEGROUP", "RACE", "HISP"),
  breakdown_labels = TRUE,
  year = 2019)
state_acs_data_medincome <- get_acs(
  geography = "state",
  variables = c(medincome = "B19013_001"),
  year = 2019)
state_acs_population <- get_estimates(
  geography = "state",
  product = "population",
  year = 2019)
state_acs_data <- state_acs_data_breakdowns %>%
  left_join(state_acs_data_medincome, by = "NAME") %>%
  left_join(state_acs_population, by = "NAME")

Next, we read in state-level employment estimates for workers in the health care and education sectors in 2019 from the US Bureau of Labor Statistics. By using the download.file() function and the readxl R package, we avoided having to manually download any Excel files and were able to incorporate every part of the data import process into the script. This was particularly helpful from a reproducibility standpoint, allowing anyone on the team to easily replicate the results. A helpful note for folks downloading .xlsx files with download.file(): you need to set mode = "wb" to ensure R downloads a binary file.
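
As a rough sketch, that download step could look like the code below (the URLs here are placeholders for illustration, not the actual BLS file locations from our script); the downloaded files are then read in with read_excel() below:


# Illustrative sketch only: download the BLS OES research files before reading them in.
# These URLs are placeholders; substitute the actual BLS file locations.
bls_education_url <- "https://example.com/oes_research_2019_sec_61.xlsx"
bls_healthcare_url <- "https://example.com/oes_research_2019_sec_62.xlsx"

dir.create("input", showWarnings = FALSE)
download.file(bls_education_url, destfile = "input/oes_research_2019_sec_61.xlsx", mode = "wb") # "wb" = write binary
download.file(bls_healthcare_url, destfile = "input/oes_research_2019_sec_62.xlsx", mode = "wb")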


library(readxl)
library(gdata)
employment_OLS_education_raw <- read_excel("input/oes_research_2019_sec_61.xlsx")
employment_OLS_healthcare_raw <- read_excel("input/oes_research_2019_sec_62.xlsx")

Next, we did a little bit of data cleaning to filter to just the top-line worker totals and calculate rough percentages of education and health care workers by state.
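
As a loose sketch, that cleaning step might look something like the code below (the BLS column names, the o_group filter, and the state_population lookup are assumptions for illustration; the real field names come from the OES files and our tidycensus pull):


library(dplyr)

# Illustrative sketch only: keep the top-line "all occupations" totals per state and
# compute rough shares of education and health care workers.
# Column names (area_title, o_group, tot_emp) are assumed, based on typical BLS OES layouts.
education_by_state <- employment_OLS_education_raw %>%
  filter(o_group == "total") %>%
  group_by(area_title) %>%
  summarize(education_workers = sum(tot_emp, na.rm = TRUE))

healthcare_by_state <- employment_OLS_healthcare_raw %>%
  filter(o_group == "total") %>%
  group_by(area_title) %>%
  summarize(healthcare_workers = sum(tot_emp, na.rm = TRUE))

# state_population is a hypothetical lookup with one row per state (columns NAME and population)
employment_by_state <- education_by_state %>%
  left_join(healthcare_by_state, by = "area_title") %>%
  left_join(state_population, by = c("area_title" = "NAME")) %>%
  mutate(pct_education = 100 * education_workers / population,
         pct_healthcare = 100 * healthcare_workers / population)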

Finally, we created two CSV files: one state-level dataset of employment estimates and one state-level demographic dataset. Because both datasets are updated only once a year, they didn't require the continual updating that the daily vaccination data do. The final CSV files were manually imported into our Google Sheet as additional sheets for easy ingestion into Tableau.

Part 3: Putting it all together in Tableau Public

As discussed, the vaccination data in the Google Sheet update daily, so we needed an interactive dashboard that would also refresh daily on Tableau Public. Tableau Public is Tableau Software's free platform where users can publicly share and explore data visualizations online. Because it is a free online publishing platform, Tableau Public offers only one way for data to refresh automatically: a Google Sheet data connection. The important caveat is that all data in the visualization workbook must live in the Google Sheet, or the data will not refresh. This meant we needed to add all the above data sources, plus our two mapping files (an Albers Projection polygon file and a tile map coordinate file), which do not change, to the Google Sheet.
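
One way to get those static files into the Sheet is with the same googlesheets4 workflow shown earlier (they could just as easily be added by hand). A minimal sketch, assuming the mapping files sit locally as CSVs, with hypothetical file paths and sheet names:


library(readr)
library(googlesheets4)

# Illustrative sketch only: add the two static mapping files to the same Google Sheet
# as extra tabs so Tableau Public can reach every data source it needs.
# File paths and sheet names are placeholders, not the ones from our actual workbook.
albers_polygons <- read_csv("data/albers_state_polygons.csv")
tile_coordinates <- read_csv("data/state_tile_map_coordinates.csv")

sheet_write(albers_polygons, ss = google_sheet_id, sheet = "albers_map")
sheet_write(tile_coordinates, ss = google_sheet_id, sheet = "tile_map")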

In Tableau, we used the Google Sheet connection option to point Tableau to the Google Sheet for all the data files.

Six of the seven files were connected using logical relationships between the files on the common “state” name to achieve the interactivity we wanted on the dashboard. The seventh file — the US vaccine time series data — was brought in separately because it was only needed for the overall daily vaccine trend chart and didn’t relate to any other state file.

With this relational model in Tableau, we could connect data from different tables without duplicating them in a typical join. This way, when we wanted to connect data that have one row per state with data that have multiple rows per state, the data integrity was maintained, and Tableau could bring in the appropriate fields without duplication.

Once all the data were connected in Tableau, we built the calculations that would drive the interactivity and the dashboard overall. Most of the calculations were relatively straightforward and involved calculating rates using population data.

One of the main features of the visualization is the ability to select a state and view that state’s data throughout the visualization, both in detail and for comparison. The map in the center of the visualization drives this filtering effect. We used an Albers Projection map as the default view, as regional differences become clear in a standard map of the United States.

However, smaller states on standard maps can be more difficult to see and select. Plus the larger states take up more visual weight, and we are more likely to focus on them. To avoid these problems, we decided to include an additional map option: the tile map.

The tile map gives each state equal visual weight and places states relatively close to their actual location.

In order to get this “map swapping” to work via a click of a button, we first had to create two individual but identical worksheets, one for each map type. Each map uses a different dataset to set the shapes for the map: the Albers Projection map uses a polygon file to draw the states, and the tile map uses the tile map dataset, which is a collection of x and y coordinates that places each state in its desired position on a grid.

We used parameters to initiate the swapping between the two worksheets on the dashboard. The parameter has two values: “Geographic Map” and “Tile Map.” We then created a calculation that contains only the parameter field, so it simply returns the current parameter value. This calculation is placed as a filter on each worksheet: on the tile map, the value “Tile Map” is selected, and on the projection map, “Geographic Map” is selected.

So when the parameter changes values, only the map that matches the selected parameter value populates any data, and the other map is blank.

We then drop both worksheets onto a dashboard inside a container. The container holds both worksheets, collapsing the one with no data while the one with data expands. Tableau parameters can be used directly on the dashboard as a drop-down toggle, but this method isn't as slick as setting up a custom button, which is what we did.

The button was designed in Figma and imported as two custom shapes (one for each map type) through Tableau's shape repository. To drive this action, we created a calculation based on the “Map Type” parameter that returns the opposite of the current parameter value:


CASE [p.Map Type]
WHEN 'Geographic Map' THEN 'Tile Map'
WHEN 'Tile Map' THEN 'Geographic Map'
END

This calculation is put on the shape marks card in Tableau, so that when the parameter value is “Tile Map,” the button shows the geographic map shape, indicating to users that selecting the button will switch the view.

The last step is to set up the parameter action on the dashboard. Under “Dashboard > Actions” in the Tableau dashboard, we added a “Change Map Type” parameter action. This action tells Tableau that when someone selects the button, the “Map Type” parameter will change based on the field “p.Next Map Type,” which will always be the opposite value of the parameter based on the calculation we created. Thus, the button and the map will continue to cycle between the two map views.

The visualization uses a number of parameter actions to drive interactivity, including the metric selection buttons at the top, where users can select one of four metrics. This level of interactivity packs a lot of depth and analysis into a single dashboard and gives people the freedom to choose which metrics and states they want to analyze or explore further.

What we learned

Visualizing these COVID-19 vaccination data will hopefully help users learn more about vaccine distribution and access disparities. By combining daily vaccination data with existing population counts and socioeconomic information, we can illustrate disparities in vaccine distribution — inequities we don’t completely explore in this project. All of our data can be downloaded from the Urban Institute Data Catalog and will continue to be updated each day. We hope these data can be a helpful baseline for others to bring in other state-level data and perform their own calculations.

For folks building real-time data pipelines, there can never (ever!) be enough data quality checks in your code.
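
For instance, beyond the schema check above, a few cheap assertions on the scraped data can catch a surprising number of upstream problems. The specific checks below are illustrative rather than taken from our script:


# Illustrative examples of extra sanity checks on the scraped OWID data;
# these particular assertions are not taken verbatim from our pipeline.
stopifnot(
  # The columns the dashboard relies on are still present
  all(c("date", "location", "daily_vaccinations") %in% colnames(owid)),
  # No negative daily vaccination counts
  all(owid$daily_vaccinations >= 0, na.rm = TRUE),
  # No dates in the future
  max(owid$date, na.rm = TRUE) <= Sys.Date()
)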

Viewing this product and process through a wider lens, we hope that other organizations can see the power of building teams of data visualization experts and content experts. Our small team has expertise across many different fields, and only by working together could we create an effective tool quickly. We also hope other organizations consider how they might use a dashboarding tool like Tableau (or others like Qlik, Flourish, and Datawrapper) to help communicate their data and analyses to wider audiences.

-Jon Schwabish

-Ajjit Narayanan

-Claire Boyd

-Lindsay Betzendahl

Want to learn more? Sign up for the Data@Urban newsletter.

