Illustration by Sammby/Shutterstock

Announcing the Urban Institute Data Catalog

Data@Urban
Sep 10, 2019

We believe that data make the biggest impact when they are accessible to everyone.

Today, we are excited to announce the public launch of the Urban Institute Data Catalog, a place to discover, learn about, and download open data provided by Urban Institute researchers and data scientists. You can find data that reflect the breadth of Urban’s expertise — health, education, the workforce, nonprofits, local government finances, and so much more.

Built using open source technology, the catalog holds valuable data and metadata that Urban Institute staff have created, enhanced, cleaned, or otherwise added value to as part of our work. And it will provide, for the first time, a central, searchable resource to find many of Urban’s published open data assets.

We hope that researchers, data analysts, civic tech actors, application developers, and many others will use this tool to enhance their work, save time, and generate insights that elevate the policy debate. As Urban produces data for research, analysis, and data visualization, and as new data are released, we will continue to update the catalog.

We’re thrilled to put the power of data in your hands to better understand and respond to many critical issues facing us locally and nationally. If you have comments about the tool or the data it contains, or if you would like to share examples of how you are using these data, please feel free to contact us at datacatalog@urban.org.

Here are some current highlights of the Urban Data Catalog — both the data and research products we’ve built using the data — as of this writing:

- LODES data: The Longitudinal Employer-Household Dynamics Origin-Destination Employment Statistics (LODES) from the US Census Bureau provide detailed information on workers and jobs by census block. We have summarized these large, dispersed data into a set of census tract and census place datasets to make them easier to use. For more information, read our earlier Data@Urban blog post.

- Medicaid opioid data: Our Medicaid Spending and Prescriptions for the Treatment of Opioid Use Disorder and Opioid Overdose dataset is sourced from state drug utilization data and provides breakdowns by state, year, quarter, drug type, and brand name or generic drug status. For more information and to view our data visualization using the data, see the complete project page.

- Nonprofit and foundation data: Members of Urban’s National Center for Charitable Statistics (NCCS) compile, clean, and standardize data from the Internal Revenue Service (IRS) on organizations filing IRS forms 990 or 990-EZ, including private charities, foundations, and other tax-exempt organizations. To read more about these data, see our previous blog posts on redesigning our Nonprofit Sector in Brief Report in R and repurposing our open code and data to create your own custom summary tables.

- Education Data Portal: Researchers in Urban’s Center on Education Data and Policy assemble education data in our Education Data Portal from a number of federal sources, standardize the data, clean them, and provide them via flat files, an API, a point-and-click tool, and Stata and R packages for public use. Read our previous Data@Urban posts to learn why we built the Education Data Portal, why we power it using an API, how we built the API, and how we built the R and Stata packages for the portal.

- Earned income tax credit data: Researchers in the Tax Policy Center standardize, clean, and create files at different geographic levels based on zip code–level data for low-income individual tax filers. The data tool allows for the creation or download of extracts for specific subsets of the data.

- State and local finance data: Researchers in the Tax Policy Center standardize and clean 40 years of state expenditure data from the US Census Bureau’s Annual Survey of State and Local Government Finances. You can use the Data Query System to create and download comparisons across states for select subsets of the data.

- Alternative tax policies: These open data were produced as part of a research report and interactive visualization exploring alternatives to the 2017 Tax Cuts and Jobs Act. The dataset contains explorations of the effects of more than 9,000 potential plans.
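
To make the LODES item above more concrete, here is a minimal Python sketch of rolling block-level counts up to census tracts. It is an illustration only, not the code Urban used to build the catalog datasets. It relies on the standard structure of a 15-digit census block GEOID (2-digit state, 3-digit county, 6-digit tract, 4-digit block), so the first 11 digits identify the tract. The function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def block_to_tract(block_geoid: str) -> str:
    """Return the 11-digit tract GEOID for a 15-digit block GEOID."""
    if len(block_geoid) != 15 or not block_geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {block_geoid!r}")
    # state (2) + county (3) + tract (6) = first 11 digits
    return block_geoid[:11]


def summarize_by_tract(block_rows):
    """Sum job counts from (block_geoid, jobs) pairs into tract-level totals."""
    totals = defaultdict(int)
    for geoid, jobs in block_rows:
        totals[block_to_tract(geoid)] += jobs
    return dict(totals)


# Hypothetical sample rows: (block GEOID, number of jobs). Not real LODES values.
sample_rows = [
    ("110010001011000", 12),
    ("110010001011001", 8),
    ("110010002021000", 5),
]

tract_totals = summarize_by_tract(sample_rows)
print(tract_totals)
```

Keeping GEOIDs as strings rather than integers preserves leading zeros, which matter for states whose codes begin with 0.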

Our Journey to Open Data

Launching the open data portal required a wide range of skills to promote the idea internally, select the right platform for our needs, and adjust our staffing and communications plans after our pilot launch.

The true driving force started with the team of internal champions, whom we’ll call the data catalog committee. The two coauthors here have been part of a few of Urban’s past efforts to catalog data internally that were ultimately unsuccessful because of limited understanding among the staff of the potential benefits. But in 2015, the stars aligned and Urban senior staff banded together at a strategic planning retreat to recommend that Urban develop a data catalog. Not surprisingly for researchers, their primary motivation was not altruistic; they were interested in getting access to compelling new data from their colleagues. To cultivate further awareness of the idea, we launched an all-staff survey in 2016. The responses endorsed the tool’s usefulness, and Urban’s executive office decided that the data catalog would become a priority.

In response, Urban convened a team of researchers, web developers, digital communications specialists, data visualization developers, research programmers, and the newly minted data science team. After evaluating a number of data governance solutions (such as Collibra), the team landed on the DKAN Open Data Platform, largely because of the web development team, to whom we are very grateful. The web development team relies on well-supported, widely used, open source tools that allow them to more easily customize, update, maintain, and debug the platform over time. Because Urban required the ability to customize any tool in house to make it work for our research needs, DKAN was the obvious choice. We chose DKAN over its cousin CKAN because DKAN is built on Drupal and our web development team has significant in-house Drupal expertise (the Urban Institute website also runs on this open source content management system).

Technical expertise aside, DKAN also has no licensing fees. That made it a better fit than the more expensive software solutions we were considering (CKAN excluded), because at the time the catalog was just a pilot project that had yet to prove its worth and that needed investment to customize to our needs. The solution we developed was not free: we do budget for maintenance, bug fixes, upgrades, and web hosting (for those curious, we use Pantheon). However, our existing expertise in Drupal, combined with no licensing fees, allowed us to customize the solution while achieving lower overall costs for our team, which has made it easier to fit the data catalog into annual institutional budgets over time.

Once the vetting process was complete, the committee first used DKAN as an internal metadata system and as a repository for Urban’s most valuable datasets. The idea was that institutional awareness of existing data might help researchers save time and come up with more innovative research using newly discovered data.

After cataloging metadata about Urban’s most valuable datasets in our internal DKAN system in 2018, we quickly realized that the process faced two challenges: how to scale the solution and how to advertise it to our Urban colleagues. The pilot system offered little incentive for researchers to take time to update and publish, so it required a central team with additional funding. And although we built an internal search tool to make data easier to find, the catalog was often lost among the many productivity applications available to the typical researcher. To increase internal take-up, we needed to do more active promotion than we originally thought to remind our researchers why and how to contribute to and use the catalog.

Moving to Really Open, Open Data

In late 2018, the team decided that the next priority for the catalog would be to create a public-facing, open data catalog. The data catalog committee had long been pushing for a public, open data catalog. One set of motives was focused on benefits to the field — our support for open data in local, state, and federal governments; research reproducibility; giving back to the community; and democratizing data. Internally, we also needed to make the case about how open data had helped Urban and its researchers, addressing the more common requirements from funders and academic journals for open data, institutional branding of data, and easier searchability.

Going forward, we are not requiring researchers at Urban to post all of their data on the catalog — rather, those who want to share their data in a more seamless, public-facing portal now have the option. It also enables researchers whose projects require them to publicly share their data, our data visualization team, and researchers producing data for the public good to provide data and metadata all in one place. We know the catalog committee will still need to encourage our colleagues to do the extra work to post their data, but we hope the payoff becomes worth the effort and the practice becomes commonplace.

We hope that the data catalog will help Urban achieve both its internal goals and its mission to give back to the community by democratizing access to one of our most valuable resources — data.

-Graham MacDonald

-Kathy Pettit

Want to learn more? Sign up for the Data@Urban newsletter.


Data@Urban is a place to explore the code, data, products, and processes that bring Urban Institute research to life.