Illustration by Rhiannon Newman for the Urban Institute

Continuing Our Journey to Open Data with the Urban Data Catalog

In September 2019, Urban announced the public launch of the Urban Data Catalog, a web-based data repository powered by the open-source DKAN data platform. The data catalog has become crucial to furthering Urban’s mission by offering high-value datasets that analysts can use to create insights for policymakers, community-based organizations, and the public. For example, we published our version of the Census Household Pulse Survey Public Use Files with calculated estimates of COVID-19’s effects on US adults and their households by geography and race/ethnicity to further inform researchers about the crisis.

But our Data Catalog team (a cross-center effort of web development, communications, research, and tech staff) also wanted the catalog to become a resource within Urban, providing a way to organize Urban’s numerous datasets for easy access by our researchers and for us to analytically track which datasets are in demand.

Nearly three years into this project, we want to highlight some lessons we’ve learned on our open data journey and some challenges we’ve faced. Although much work remains, we believe our experience can help others with their own open data journey in their organizations.

Optimizing the catalog

An essential step in this journey to create a better data catalog has been enabling our researchers to easily post their data. We conduct one-on-one outreach to researchers who lead projects, recruiting them to contribute their datasets and leading them through the process. We also encourage researchers to pair their data catalog entries with corresponding products, including online data tools, features, or publications; to link from their publications to the catalog entry; and to make their code available in Urban’s public GitHub instance for features or data tools.

But we’ve found active uptake requires more than just encouragement. So, we’ve simplified our process as much as possible, providing an online intake form, guidelines, and recommendations. With these guidelines, we steer researchers through creating their data catalog entry, have them verify that they’ve accounted for data privacy and security restrictions and that the data conform with data-use agreements and relevant institutional review board data security plans, and provide recommendations for choosing a license (usually we suggest the Open Data Commons Attribution license).

Although simplifying the data catalog process on the front end has been crucial for researcher participation, we also recognized we needed a robust back end to accommodate dataset storage. Currently, we allow dataset uploads up to 10GB, which we can handle and store efficiently thanks to our web development team’s integration of the Drupal S3 File System module into our DKAN instance. This module enables the catalog to leverage Amazon’s Simple Storage Service as the back-end file system, rather than the more costly “active storage” default provided by our Drupal hosting service.
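As a rough illustration of the 10GB upload cap, here is a minimal Python sketch of a pre-upload size check. It is not our DKAN or S3 File System configuration: the function names are hypothetical, and reading "10GB" as 10 decimal gigabytes is an assumption.

```python
# Hypothetical pre-upload check, not part of DKAN or the S3 File System module.
# Assumption: "10GB" means 10 decimal gigabytes (10 * 1000**3 bytes).
MAX_UPLOAD_BYTES = 10 * 1000**3


def within_upload_limit(size_bytes: int, limit_bytes: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when a file of size_bytes fits under the upload limit."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return size_bytes <= limit_bytes


def human_size(size_bytes: int) -> str:
    """Format a byte count in decimal units, e.g. for an error message."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(size_bytes)
    for unit in units:
        # Stop at the first unit where the value is below 1000, or at the last unit.
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000
```

A form handler could call `within_upload_limit` before accepting a file and use `human_size` to tell the contributor how far over the limit the file is.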

To fully integrate a dataset into the catalog, we take full advantage of DKAN’s many metadata fields. In general, we include a summary description of the dataset, a license for use, links to a data dictionary or codebook, and a list of related content, such as publications or blog posts. This additional information helps our audiences understand the methods behind the data and use the data appropriately. Urban’s communications team copy-edits the text and documentation before making a dataset public in the catalog to ensure readability and consistency with Urban’s style.
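To make the documentation checklist concrete, the following Python sketch models a catalog entry and reports which of the fields named above are still empty. The `CatalogEntry` class and its field names are hypothetical stand-ins for illustration, not DKAN’s actual metadata schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """Hypothetical stand-in for a catalog entry; not DKAN's actual schema."""
    title: str
    description: str = ""
    license: str = ""
    data_dictionary_url: str = ""
    related_content: List[str] = field(default_factory=list)


def missing_fields(entry: CatalogEntry) -> List[str]:
    """List the documentation fields that are still empty, in a fixed order."""
    checks = {
        "description": entry.description.strip(),
        "license": entry.license.strip(),
        "data_dictionary_url": entry.data_dictionary_url.strip(),
        "related_content": entry.related_content,
    }
    return [name for name, value in checks.items() if not value]
```

An entry with only a title, description, and license would come back with `data_dictionary_url` and `related_content` flagged, which is the kind of gap a reviewer would want to close before a dataset goes public.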

When we invest time and effort to create a well-documented dataset with links to features, publications, and source code, Google and other search engines are more likely to promote it — and for Urban, this visibility translates to impact. Having the catalog available on the web and indexed by Google makes our data findable through Google’s Dataset Search, and many of our datasets are highly ranked. If you search for “rental assistance,” “fair housing,” or “spending on kids,” Urban Data Catalog entries will appear in the top 5 or 10 results.

Urban’s communications office and the data catalog team have also worked to cultivate internal and external use and help people access the catalog’s considerable resources. New datasets are regularly featured in Data@Urban blog posts, and we send a quarterly newsletter to our Data@Urban readers with information on new datasets. Within Urban, we’ve advertised the catalog to researchers through internal quarterly newsletters and presentations. And we have regular meetings with our team to keep our agenda current, track progress, and continuously improve.

As a result of this work, we have added 51 new public datasets since launch, and now have 72 public datasets in the catalog, across a spectrum of topics: urban and rural areas, health policy, safety net policy, and more.

Some of the most visited datasets on the data catalog include the following:

  • Estimated Low Income Jobs Lost to COVID-19: The Urban Institute data science team used data from the US Bureau of Labor Statistics, IPUMS 2012–18 American Community Survey microdata, and Urban’s 2018 census LODES (Longitudinal Employer-Household Dynamics Origin-Destination Employment Statistics) data to estimate the number of low-income jobs lost because of COVID-19 by industry for every US census tract based on workers’ residence. This project checks all the boxes for good data practices. All data for the interactive feature are integrated into the catalog, the code is supplied in GitHub, and the publication, repository, and feature links are included in the catalog. Work on this project was funded internally by the Urban Institute. The tool received significant attention: Raj Chetty and the Harvard University Opportunity Insights team used the data, Innovate Memphis constructed an early warning map, and many other groups found new and innovative ways to build on Urban’s work.
  • State-by-State Spending on Kids Dataset: This dataset provides a comprehensive accounting of public spending on children from 1997 through 2016. It draws on the US Census Bureau’s Annual Survey of State and Local Government Finances, as well as several federal and other non-Census sources, to capture state-by-state spending on education, income security, health, and other areas. The data were assembled by Julia Isaacs, Eleanor Lauderback, and Erica Greenberg of the Urban Institute, working in collaboration with Margot Jackson of Brown University for her study of public spending on children and class gaps in child development. This work has been supported by the Russell Sage Foundation and by the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health.
  • Debt in America 2021: This dataset contains information derived from a random sample of deidentified, consumer-level records from a major credit bureau. Estimates are also incorporated from summary tables of the US Census Bureau’s American Community Survey. The Debt in America 2021 work and data dashboard were funded by the Annie E. Casey Foundation, with additional support from the Ford Foundation in 2017.

Lessons learned about measuring dataset visibility

In making our datasets more visible, we’ve been aided by DKAN’s inherent capabilities as a Drupal-based system. The Drupal XML Sitemap Module allows us to generate an XML sitemap, which we submit to Google. From there, Google and other search engines that recognize the sitemap protocol can pull everything they need regarding dataset metadata. Our data catalog entries appear in Google’s Dataset Search. Web analytics allow us to see which datasets are trending and having an impact and which sites link to the datasets. Looking at the referrals analytics metric lets us see how much traffic comes from organic searches, Twitter, Urban newsletters, and Data@Urban. We continue to investigate the best ways to gauge the catalog’s impact internally and externally.
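For readers who have not worked with the sitemap protocol, here is a minimal Python sketch of what a sitemap file contains. It is not the Drupal XML Sitemap Module’s implementation: the `build_sitemap` function is hypothetical, and the `example.org` URLs and dates are placeholders rather than real catalog entries.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: Iterable[Tuple[str, str]]) -> str:
    """Build a sitemap document from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc          # page address
        ET.SubElement(url, "lastmod").text = lastmod  # date as YYYY-MM-DD
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Placeholder URLs and dates, for illustration only.
    print(build_sitemap([
        ("https://example.org/dataset/example-one", "2022-01-15"),
        ("https://example.org/dataset/example-two", "2022-03-02"),
    ]))
```

Each `<url>` element gives a crawler one page address and the date it last changed, which is how a search engine learns that a dataset page exists and when to revisit it.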

The catalog has been especially useful for projects that use the same datasets as part of a series — the Census Pulse Public Use Files and Estimated Low Income Jobs Lost to COVID-19 dataset entries fall into this category. We can update the catalog entry with new data, which allows features, web-based data tools, and other analyses to draw from that reference version of the data. The catalog is also an excellent resource for when funders require us to post data publicly, though we encourage posting even if it isn’t a funder requirement.

Overall, the web analytics show that this ecosystem of publication, feature, data, and source code is a good model for research dissemination, and it helps expand our audience and impact. We attract a number of audiences through the Urban Data Catalog — people who want to report on our conclusions, use the data for their own communities, and explore the data further or collaborate with us. Importantly, by making the data open and visible, we contribute to Urban’s reputation as a trusted source of data and rigorous analysis.

Future goals and aspirations

Our work on the data catalog is continually in progress. We want to expand the topics and datasets and reach new external audiences. We also seek to improve our outreach to our own researchers (maybe they will read this blog post!), so they will link from features or research papers to the data posted on the catalog, fostering a community of open data and reproducible research. Keeping the site on a current version of Drupal is a big priority for us now: we’re currently on Drupal 7 and will need to migrate to Drupal 9.

The catalog also offers a high-value asset for researchers putting together a proposal. Most of all, we want to offer and promote the catalog as a standard practice for our researchers to post research, code, and data, thereby improving the standards for the presentation of evidence-based research from Urban. We also hope our open data story inspires other research institutions to consider creating portals of their own.

-Doug Murray

-Kathy Pettit

-Graham MacDonald

Want to learn more? Sign up for the Data@Urban newsletter.

Data@Urban is a place to explore the code, data, products, and processes that bring Urban Institute research to life.
