Why we built an API for Urban’s Education Data Portal
This summer, the Urban Institute published its new Education Data Portal, one of the first online resources that brings together a wide array of K–12 and higher education data for exploration and analysis. Instead of posting Excel, Stata, or CSV files — the formats in which many of the original agencies provide these data — we created an application programming interface (API) to make the data as accessible as possible for researchers and analysts. This post explores our rationale for developing an API to access these data and the parallel API we built to access the associated metadata. In a future post, we’ll dive into the technical details of building and implementing the API from the ground up so that you, too, can create your own flexible, accessible data portal.
What is an API?
APIs translate structured URLs into instructions that run on one or more servers. Each URL (or set of similar URLs) is a different “endpoint.” Some APIs, such as the Google Maps API or the Urban-Brookings Tax Policy Center’s Tax Proposal Calculator, run complex calculations on the server. Others filter or search data in a database and return the results. The Education Data Portal follows the second model: each endpoint corresponds to a different dataset a user can access and filter.
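To make the idea of a structured URL concrete, here is a minimal Python sketch of how a server might split a request URL into an endpoint and a set of filters. The host, path layout, and parameter names are hypothetical and chosen only for illustration; they are not the portal's actual URL scheme.

```python
from urllib.parse import urlsplit, parse_qs


def parse_request(url):
    """Split a URL into an endpoint (path segments) and filters (query string)."""
    parts = urlsplit(url)
    # Path segments identify which dataset (endpoint) is being requested.
    endpoint = tuple(segment for segment in parts.path.split("/") if segment)
    # Query parameters become filters; a repeated key yields several allowed values.
    filters = parse_qs(parts.query)
    return endpoint, filters


# Hypothetical URL: the host and path layout are assumptions for this sketch.
endpoint, filters = parse_request(
    "https://api.example.org/v1/schools/directory/2015/?state_fips=34&grade=9&grade=10"
)
print(endpoint)
print(filters)
```

A real server would then map the endpoint to a database table and the filters to a query, but the split between "which dataset" and "which rows" is the core of the translation.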
When should you build an API?
There are three primary advantages to building a data API over posting data in flat file formats.
First, data APIs are good for large, complex datasets. Not all data are best accessed through an API. Some datasets, such as the one hosted on our tool demonstrating adjustments to National Assessment of Educational Progress scores, are too small to justify the time and expense of building an API. The data can easily be downloaded and browsed as a CSV file.
On the other hand, data APIs are often built for large datasets, such as the American Community Survey API from the US Census Bureau (which contains more than 16,000 variables) or the campaign finance data API from ProPublica (which contains candidate and committee data dating back to 1980). The education data API has a similarly broad scope with millions of records. If all these data were stored in a single CSV, the CSV would need to contain thousands of columns and millions of rows. Programs such as Excel would be unable to open such a file, and finding a single piece of data would be difficult, even in a statistical programming language like Stata or R. By building an API, we allow users to quickly and efficiently browse, search, and filter the data.
Second, the Education Data Portal (and many similar data APIs) brings data together from many data sources. Without the API, a user who wanted information from each of these disparate sources would have to learn the ins and outs of each of those datasets: their file formats, entity identifiers, update schedules, and more. The API and the tools built on top of it aggregate all this information, normalize it across different datasets, and let users access it using a single interface. If a user wanted to combine general information about all primary and secondary schools in New Jersey with information about all colleges and universities in New Jersey (two datasets published by different parts of the National Center for Education Statistics), they could make a single trip to the data portal to access both.
Although these data are gathered from disparate sources, the API lets users access them via the same methods or code, filter them using the same variables (in this case, the New Jersey Federal Information Processing Standard code), and access data returned in the same format.
Finally, these data are frequently updated. If the data were accessed by downloading a file, users would have to keep an eye out for updates and then redownload and reanalyze that file when it changed. Such updates may work for one-off research projects but are time intensive and difficult to maintain for ongoing work. An API allows users to build scripts, websites, or applications that directly and automatically access the API data. If the data update, the same API call will return revised data, which flow through to the user’s application or scripts and generate updated tables, graphs, and other outputs.
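A small Python sketch illustrates why this matters for ongoing work. The summary function below never stores a copy of the data; it rebuilds its table from whatever the fetch function returns. The two fetch functions are stand-ins for the same API call made before and after an update, and the field names and numbers are invented toy values, not real records.

```python
from collections import Counter


def enrollment_by_county(fetch):
    """Rebuild a summary table from whatever records the API currently returns."""
    totals = Counter()
    for record in fetch():
        totals[record["county"]] += record["enrollment"]
    return dict(totals)


# Stand-ins for two runs of the same API call, before and after a data update.
# Field names and values are made up for illustration.
def fetch_before_update():
    return [
        {"county": "Essex", "enrollment": 100},
        {"county": "Bergen", "enrollment": 50},
    ]


def fetch_after_update():
    return fetch_before_update() + [{"county": "Essex", "enrollment": 25}]


print(enrollment_by_county(fetch_before_update))
print(enrollment_by_county(fetch_after_update))
```

In a real script, the fetch function would request data from the API, so rerunning the script would be enough to pick up revised data.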
Metadata: The API behind the API
It’s great that sites, applications, and scripts will automatically update when something in the data changes, but how do users know when we add new endpoints (i.e., datasets) to the data portal or update the variables in a dataset? Enter the metadata API.
The metadata API runs parallel to the data API and returns a smaller set of values that describe the data. We’re using the metadata API to power three products:
· The educationdata Stata package and the educationdata R package. These packages contain helpful functions that make requests to the metadata API to get summary information about the data and allow Stata and R users to easily access useful information, such as variable type, allowable filter values, and helpful error messages. Most importantly, the metadata API allows all data access requests made through the packages to retrieve data in the proper format, with human-readable variable name and value labels properly applied, a task that can save time.
· The API documentation site. This living document is built almost entirely from metadata API calls. To build out the list of endpoints on each topic page of the documentation (Schools, School Districts, and Colleges), we make a call to a metadata endpoint of this form:
This returns JSON-formatted data with information on each endpoint. We use this to build out the navigation menus in the left-hand column and the names, descriptions, data sources, and code samples for each endpoint.
Next, for each endpoint, we make two metadata API calls. This call
populates the table of values for each endpoint, and these calls
populate the list of CSV download files associated with each endpoint.
The site uses two other API calls to populate content separate from the endpoint lists:
This returns a list of data sources, displayed on the home page and at the top of each topic page, and
populates the version history on the home page.
These metadata will periodically update or expand. Because the documentation site and programming packages are built on API calls, we need to update the metadata only once, at the source. Once the API is updated, all three of these applications will automatically update, without any action required by our developers or users.
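The "update once, at the source" idea can be sketched as a single metadata payload feeding two consumers: a menu builder like the one a documentation site might use, and a labeling step like the one a statistical package might apply. The JSON structure and field names here are hypothetical; the real metadata API's responses may look quite different.

```python
import json

# Hypothetical metadata payload; the real metadata API's fields may differ.
METADATA_JSON = """
{
  "endpoints": [
    {"topic": "Schools", "name": "Directory",
     "variables": {"fips": {"label": "State FIPS code",
                            "values": {"34": "New Jersey"}}}},
    {"topic": "Colleges", "name": "Directory", "variables": {}}
  ]
}
"""


def build_menu(metadata):
    """Group endpoint names by topic, as a documentation sidebar might."""
    menu = {}
    for endpoint in metadata["endpoints"]:
        menu.setdefault(endpoint["topic"], []).append(endpoint["name"])
    return menu


def label_record(record, variables):
    """Swap raw variable names and coded values for human-readable labels."""
    labeled = {}
    for name, value in record.items():
        info = variables.get(name, {})
        label = info.get("label", name)
        # Fall back to the raw value when the metadata has no label for it.
        labeled[label] = info.get("values", {}).get(str(value), value)
    return labeled


metadata = json.loads(METADATA_JSON)
menu = build_menu(metadata)
labeled = label_record({"fips": 34, "enrollment": 500},
                       metadata["endpoints"][0]["variables"])
print(menu)
print(labeled)
```

Because both functions read from the same payload, adding an endpoint or a value label to the metadata would change the menu and the labeled output without touching either function.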
As we continue to update the education data API, the documentation site and associated packages will automatically update with no additional effort. We also plan to build more tools, such as a point-and-click data explorer and other stories and data tools we haven’t yet conceived, all powered by the API. Over the years, we will continue to expand the ecosystem of products that depend on the API, as will, we hope, other developers and researchers. By continuing to add and update the data at its source, we can ensure that all these products have long lives and ongoing utility for many audiences.