Illustration by the Urban Institute

Democratizing Big Data Processing for the Education Data Portal

Data@Urban
6 min readJan 28, 2020

Earlier this year, we announced our plan to create a central source of preprocessed, big, local datasets spanning various sectors and issues. Our long-term goal is to make it easier for people to use data to drive decisionmaking and hold decisionmakers accountable.

Thanks to the Amazon Web Services (AWS) Imagine Grant Program, our Technology and Data Science team had the opportunity to build a proof of concept system that extends the functionality of the Urban Institute’s Education Data Portal application programming interface (API) for a few of the portal’s datasets. Where the original API endpoints allow researchers to easily access millions of raw records for further analyses, these new prototype summary endpoints provide rapid, customized access to aggregated statistics in seconds. You can read more on the Education Data Portal from our previous posts on why we created the portal, why we decided to build it using an API, how we built the API, and our R and Stata packages for accessing the API.

Why write a new set of API endpoints?

Although the ability to pull millions of records is invaluable for some users, such as researchers who want access to the full, raw data, it can be difficult for others who may lack the time or technical skills needed to access and analyze the data. Consider a policy analyst who wants to know the total school-level enrollment by county for 2007. To get this information from the current Education Data Portal API, she would need to write and run code similar to the following:

This snippet uses the Education Data R Package to pull the National Center for Education Statistics’ Common Core of Data for both the enrollment and the directory endpoints of the Education Data Portal API. These two datasets are then joined, filtered, and summarized to produce the final results.

Pulling all the raw data from the API can be a time-consuming process. In all, this code took about seven minutes to run and involved downloading more than 750,000 records and more than 60 variables.

Instead, the new prototype summary endpoints allow users to pull the same data by visiting a single URL:

In contrast, this code took around seven seconds to get the same results — 60 times faster for the same analysis, achieved with less code.

The summary endpoints API also scales to multiple years without the need for additional processing time, thanks to our new technology. Let’s imagine a second, more complex example, where the policy analyst wants to know the total school-level enrollment across K–12 institutions by county over time and then wishes to drill down into zip code–level enrollment for a specific county. Pulling all of the raw data from the API and analyzing it takes hours on an average laptop, and in cases like this, the laptop often runs out of physical memory and crashes before it can complete the job. However, using the summary endpoints, the analyst could retrieve those results with just two URL calls:

This code took just 15 seconds to run — a few hundred times faster than is possible for even an advanced analyst on the current portal, assuming they have a computer with enough processing power. The summary endpoints eliminate the responsibility of data collection and analysis for the policy analyst who may not have the time, money, or knowledge to create the summary statistics herself.

But it does more than just enable a policy analyst. For example, Urban’s data visualization team, other web developers, and folks with expertise in business intelligence dashboard tools like Tableau and Power BI can now more easily develop custom visualizations for schools, school districts, and other groups that can access billions of rows of education-related data in different ways and visualize them in seconds. With the recent addition of our geographic endpoints and a future build-out of the summary endpoint capability, we hope that parents, policymakers, school officials, advocates — anyone with an internet connection — will be able to easily analyze and visualize data at any level on bullying, enrollment, test scores, and more in seconds, at the click of a mouse.

Using the prototype summary endpoints

The general form of an API call is as follows (see the Education Data Portal documentation site for more information on the topics, sources, endpoints, and variables available):

`https://educationdata.urban.org/api/v1/{section}/{source}/{topic}/summaries?var={var}&stat={stat}&by={by}`

Only a small subset of summary endpoints is available in this proof of concept stage. For additional information on using the summary endpoints, the subset of summary endpoints currently available. including the data sources available as part of this pilot release, please see our GitHub page.

How we built it

The summary endpoints are built on a fully serverless framework, where all computational resources are managed and provisioned by a cloud provider; in our case, Amazon Web Services. Employing a serverless framework allowed our team to focus on building out the application functionality without worrying about maintaining the necessary infrastructure.

To build the summary endpoint functionality, we used different services from the AWS serverless platform. First, the underlying raw data records from the Education Data Portal were converted to Apache Parquet files — a highly optimized, efficient storage format. For the Common Core of Data enrollment endpoint, more than 14 GB of CSV files were compressed to only 1.5 GB in Parquet. These data are stored on AWS Simple Storage Service (S3), and at $0.023 per GB per month, it costs less than a dollar a year to host these data in the cloud.

When a user goes to a summary endpoint URL, their request is routed to API Gateway, an AWS service that handles the creation, management, and security of running an API. In this case, API Gateway then calls AWS Lambda, a computing platform where code is run on Amazon’s servers and where we only pay for the exact amount of time the code is run. We use Lambda to execute a Python function that parses the URL requested into a properly formatted SQL query. This query is then run on AWS Athena, a serverless query service that runs directly on the data stored in S3, instead of a traditional database. Similar to AWS Lambda, when querying AWS Athena, we pay only for the time it takes to run a query. Finally, the Lambda function retrieves the results of the Athena query and passes them back to API Gateway, which returns them to the user.

With this serverless model, the summary endpoint API can be queried 5,000 times for less than $5. And even better, we do not need to change anything about the infrastructure to query 50,000 for less than $50; scaling is accomplished automatically on the cloud provider’s end.

What’s next

Currently, data are only available for 6 of our more than 30 data endpoints on the portal. Once we receive feedback from potential users, we hope to release a more complete rollout, which will include summary endpoints for the remaining data sources. We also hope to add functionality to the existing Education Data R and Stata packages to enable them to access these summary endpoints.

The summary endpoints are currently in an experimental proof of concept phase, and we appreciate all feedback. If you do find an error or bug — or would like an additional feature — we welcome suggestions, questions, and comments at educationdata@urban.org or on the GitHub issues page for the project. We look forward to hearing from you and can’t wait to see what you do with the data!

-Kyle Ueyama

Want to learn more? Sign-up for the Data@Urban newsletter.

--

--

Data@Urban

Data@Urban is a place to explore the code, data, products, and processes that bring Urban Institute research to life.