Building R and Stata packages for the Education Data Portal
We recently introduced the Urban Institute’s Education Data Portal, an online resource that makes K–12 and higher education data easily accessible to the public. Our previous posts explored why we created the portal, why we decided to build it using an API, and how we built the API. In this fourth and final post of this series, we explore the reasons we created R and Stata packages to interface with the API and how you can use them to access the data.
What is a package?
In both R and Stata, a package is a way of combining code and documentation in a uniform and organized manner so they can be easily shared with other users. If you have ever made a graph in R using ggplot2 or exported a regression model in Stata with outreg2, you have used a package that has added functionality to the base software. Typically, R users access packages via the Comprehensive R Archive Network (CRAN), while Stata users typically download packages from the Statistical Software Components (SSC) archive.
Why write a package?
The Education Data Portal API provides a wealth of data and metadata to users — one of the API’s biggest strengths — but these data can be a lot to parse for nontechnical researchers unfamiliar with HTTP requests or JSON data structures. For that reason, we built R and Stata packages that abstract much of the time-consuming and programming-intensive tasks from the researcher.
Let’s walk through a simple example of accessing the portal’s data in R without the package (for brevity, we do not include this same example in Stata, but even this simple example is more difficult for Stata users).
Suppose you want to obtain the 2013 enrollment data for third graders from the Common Core of Data (CCD) from the US Department of Education. The endpoint (the URL used to interact with the API) for that data source is https://educationdata.urban.org/api/v1/schools/ccd/enrollment/2013/grade-3/ and can be obtained in R using code such as this:
The code returns a list of four items. The “count” item returns the number of records in this endpoint (102,815) while the “results” item is a data frame structure containing 1,000 observations of six variables. Astute readers will note that 1,000 is considerably fewer than the more than 102,815 available rows. This leads us to the appropriately titled “next” item, which contains the URL for the next page in the endpoint. Going to https://educationdata.urban.org/api/v1/schools/ccd/enrollment/2013/grade-3/?page=2 will return the next 1,000 rows of this endpoint. To obtain the full dataset, we can use a bit more R code to loop through all iterations of “next”:
This code calculates the number of pages for the endpoint, constructs a vector of URLs with the correct page marker, obtains the data frame for each of the URLs, and finally binds them into a single data frame.
So far, we have obtained the 2013 data for a single grade. But what if you want the entirety of the CCD enrollment data across all years and grades? This endpoint currently covers all years from 1987 through 2014, for 22 grade-level options. One solution is to copy the above code and apply it for all 616 endpoints. A more effective solution would rework our code into a function that accepts year and grade level as input and then applies that function across all year-grade combinations.
Though possible, this task is more complex in Stata. First, you’d need to install the package insheetjson
from SSC. Then for each variable in the dataset, you’d need to consult the documentation and generate a new variable with the correct name and datatype. Then you’d need to call the insheetjson
function, specifying the table name, columns, and offset. To get all “pages” returned by the API, you’d need to use local variables to keep track of the pages and send the right URL to insheetjson
and ensure your dataset does not overwrite existing data. As you can see, Stata makes this process unwieldy and requires a significant amount of manual work from the researcher to ensure each dataset is read in without error.
One of the goals of the Education Data Portal is to combine and open education data from different organizations. With the R and Stata packages, we could leverage the strength of the Education Data Portal API into a single function that returns a rectangular data frame to a researcher.
Using the R and Stata packages
To use the Education Data Portal packages, first install one from our GitHub page for either R or Stata. In R, if you don’t yet have devtools
installed, make sure you install.packages(‘devtools’)
first. In Stata, you will first need to install the libjson
library using the “ssc install libjson”
command.
R:
Stata:
The prior code to fetch CCD enrollment data across all years and grades can now be accomplished in a single line of code in both R:
and Stata:
That’s all it takes. Behind the scenes, the package leverages the metadata API to get the full set of years for which there are data, and when new data are added, the packages detect the additional data automatically. We have also added validation checks that ensure the package is working as expected and returning useful error messages when it does not.
There are several optional parameters that can also be passed to the get_education_data function. A full function call looks like the following in R:
Where:
· level (required) — API data level to query.
· source (required) — API data source to query.
· topic (required) — API data topic to query.
· by (optional) — Optional list
of grouping parameters for an API call.
· filters (optional) — Optional list
query to filter the results from an API call.
· add_labels (optional) — Add variable labels (when applicable)? Defaults to FALSE
.
· csv (optional) — Download the full CSV file? Defaults to FALSE
.
A similar full-function call with multiple options in Stata might look like the following (you can find the full description of options by typing help educationdata
after installing the package):
Going back to our first example, if you wanted to obtain the 2013 CCD enrollment data for third graders, the function call would look like the following in R:
And Stata:
In this case, we set the add_labels flag to TRUE. The R package uses one of the metadata API endpoints to add to proper character labels to the race and sex variables in this dataset, making the dataset easier to interpret. In Stata, variable value and variable labels are added by default.
The Future of R and Stata Package Development
The best part about the Education Data Portal packages is that when the data update in the raw data catalog, you can access them immediately without needing to reinstall or alter the package. As data are added, your current installation should work without a hitch. Of course, we’ll also update the Stata package to ensure the latest information about the API options are available in the Stata help command.
We hope that these packages never have to change and will remain useful for education data analysts, researchers, data journalists, and others, but we know we make mistakes or can make things better. These packages are still in development, and you can view the most recent changes for each on Github (R here and Stata here).
We recommend checking back for updates if you’re struggling with a certain function or run into an error — we may have already fixed it in the latest release. If you do find an error or bug — or would like an additional feature — we welcome feedback, questions, and comments at educationdata@urban.org or on the Github Issues pages for the R or Stata packages. We look forward to hearing from you and can’t wait to see what you do with the data!