Building Summary Endpoints for the Education Data Portal
Last year, we announced our collaboration with the Cloudera Foundation to accelerate our progress toward a vision of providing free and open access to high-quality education data for a diverse set of stakeholders pushing for change. Our long-term goal is to make it easier for people to use data to inform decisionmaking and hold decisionmakers accountable.
Thanks to the financial, technical, and software support of the foundation and its top-tier engineering experts, along with generous product donations from Cloudera, Inc., the Urban Institute’s data engineering team has built summary endpoint functionality on the Education Data Portal that allows users to easily explore aggregated data across multiple dimensions, including geography, school, school district, congressional district, and school characteristic.
Where the original API endpoints allow researchers to easily access millions of raw records for further analysis, these new summary endpoints provide customized access to aggregated statistics in seconds. These endpoints build on our earlier prototype made possible by the Amazon Web Services (AWS) Imagine Grant program.
Why write a new set of endpoints?
In our previous announcement post, we laid out the following scenario:
Running a summary of enrollment over time by school district using our current infrastructure takes minutes to complete, and under high load, it can take more than an hour. Based on our discussions with the Cloudera Foundation, we’re excited to explore storing data in efficient Apache Parquet format on Amazon Web Services S3 and using big data services (such as Impala) on a Cloudera-powered big data cluster, which we think will allow us to summarize data in seconds, without increasing cost significantly.
To get this summary of school district enrollment by grade for all years available in the Education Data Portal, the user would previously have had to run code along these lines:
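```r
# Illustrative sketch of the previous workflow using the educationdata
# R package; the exact code in the original workflow may differ, and the
# disaggregation argument is named "by" in older package versions.
library(educationdata)
library(dplyr)

# Pull every raw enrollment-by-grade record for all school districts
enroll <- get_education_data(
  level = "school-districts",
  source = "ccd",
  topic = "enrollment",
  subtopic = list("grade")
)

# Aggregate locally to get total enrollment by grade and year
enroll_by_grade <- enroll %>%
  group_by(year, grade) %>%
  summarize(enrollment = sum(enrollment, na.rm = TRUE))
```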
On a standard laptop, this code took almost two hours to run, pulling the nearly seven million observations needed to generate the summary statistics. Though the ability to pull millions of records is invaluable for some users, such as researchers who want access to the full raw data, it can be too time-consuming for those looking to get quick insights from the data.
With the summary endpoint functionality, the user can instead get the same information from a single URL call, such as:
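https://educationdata.urban.org/api/v1/school-districts/ccd/enrollment/summaries?var=enrollment&stat=sum&by=grade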
This call ran in only seven seconds, generating the same result in a fraction of the time. The summary endpoints remove the burden of data collection and analysis for those who may not have the time, money, or expertise to create these summary statistics on their own.
We have also updated the Education Data R and Stata packages to connect to the summary endpoints. For instance, the example above can be called directly from the R package:
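```r
# Sketch of the equivalent call through the R package; the summary
# helper shown here and its argument names may differ slightly
# across package versions.
library(educationdata)

enroll_by_grade <- get_education_data_summary(
  level = "school-districts",
  source = "ccd",
  topic = "enrollment",
  stat = "sum",
  var = "enrollment",
  by = "grade"
)
```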
And similarly for the Stata package:
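```stata
* Sketch of the equivalent call through the Stata package; the exact
* syntax for summary requests may differ slightly across versions.
educationdata using "summary school-districts ccd enrollment", ///
    stat(sum) var(enrollment) by(grade)
```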
Example usage
The general form of an API call is as follows:
`educationdata.urban.org/api/v1/{section}/{source}/{topic}/summaries?var={var}&stat={stat}&by={by}`
where
· `{section}` is one of `college-university`, `school-districts`, or `schools`.
· `{source}` is the data source, as in `ccd`, `crdc`, `ipeds`, `edfacts`, `saipe`, or `scorecard`.
· `{topic}` is the endpoint, as in `enrollment`, `ap-exams`, and so on.
· `{var}` is the variable to be summarized. It must be a numeric, nonfilter variable.
· `{stat}` is the statistic to be calculated. Valid options are `sum`, `count`, `avg`, `min`, `max`, `variance`, `stddev`, and `median`.
· `{by}` is the variable(s) to group results by. Multiple variables can be given as a comma-separated string, as in `by=fips,race`. All queries are also automatically grouped by `year` by default.
A simple call will look like:
https://educationdata.urban.org/api/v1/schools/ccd/enrollment/summaries?var=enrollment&stat=sum&by=fips
This will return the sum of `enrollment` by `fips` code and `year` for the `schools/ccd/enrollment` endpoint.
Where applicable, tables have been prejoined against their associated `directory` file, providing additional options for `by` and `filter` arguments. For example, we can call:
https://educationdata.urban.org/api/v1/schools/ccd/enrollment/summaries?var=enrollment&by=school_level&stat=sum
even though `school_level` is only available in the `schools/ccd/directory` endpoint and not the `schools/ccd/enrollment` endpoint.
Additional filters can be added to the API call as additional query string parameters in the form `{filter}={value}`. Multiple values can be given as a comma-separated string. For example:
https://educationdata.urban.org/api/v1/schools/ccd/enrollment/summaries?var=enrollment&stat=sum&by=sex&fips=6&race=1,2
will return the sum of enrollment by sex and year for the `schools/ccd/enrollment` endpoint, where `fips` is 6 and `race` is either 1 or 2.
How we built it
The summary endpoint functionality is built upon the Cloudera Data Platform Public Cloud, deployed using AWS compute and storage resources. In particular, we use a Cloudera Data Hub cluster running Apache Impala, a massively parallel processing SQL engine. Data Hub clusters can be easily scaled up or down, depending on our usage needs.
Impala is optimized to work especially well with the Apache Parquet binary columnar storage format. To get started, the data engineering team built a pipeline for converting the original CSV data files into Parquet format, which compresses the data while making them more efficient for retrieval and analysis by big data systems. For example, nearly 50 gigabytes of CSV files became less than 10 gigabytes in Parquet format. Impala can also directly query data stored on the Amazon Simple Storage Service (Amazon S3), a cost-effective cloud storage resource. This means we can store the data that power the summary endpoints for less than $3 a year.
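As a rough sketch of a single conversion step, a Python pipeline using pandas with the pyarrow engine might look like the following (the file names here are hypothetical, and the resulting file would then be uploaded to S3):

```python
# Sketch of one CSV-to-Parquet conversion step; file names are
# hypothetical, and the output would subsequently be uploaded to S3.
import pandas as pd

df = pd.read_csv("enrollment.csv")
df.to_parquet(
    "enrollment.parquet",
    engine="pyarrow",        # Parquet writer
    compression="snappy",    # columnar compression Impala reads natively
    index=False,
)
```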
Once the Parquet data have been created, we use AWS Lambda, a serverless computing platform, to trigger a Python function that automates the creation of the database tables in Impala with the help of the impyla package. At this stage, we also run a COMPUTE STATS statement, which helps Impala optimize query performance.
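For illustration, that table-creation step might look something like the sketch below, using impyla’s DB-API interface; the host, table schema, and S3 location are hypothetical:

```python
# Sketch of automated table creation in Impala via impyla; the host,
# table schema, and S3 location are hypothetical.
from impala.dbapi import connect

conn = connect(host="impala.example.internal", port=21050)
cursor = conn.cursor()

# Register the Parquet files on S3 as an external Impala table
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS enrollment (
        year INT,
        fips INT,
        grade INT,
        enrollment BIGINT
    )
    STORED AS PARQUET
    LOCATION 's3a://example-bucket/enrollment/'
""")

# Gather table and column statistics for the query planner
cursor.execute("COMPUTE STATS enrollment")
```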
After all tables have been created, we run a suite of tests to ensure the summary endpoints are working as expected and that the results are consistent with what a user would see if they ran a similar analysis using the raw data endpoints. We again use Python along with the impyla package to check that the table shapes and column summary statistics are the same across both the Impala deployment and the MySQL database that powers the original data endpoints.
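One such check might compare row counts and column totals across the two systems, along the lines of this sketch (the hosts, credentials, and table are hypothetical):

```python
# Sketch of a consistency test between Impala and MySQL; hosts,
# credentials, and the table are hypothetical.
from impala.dbapi import connect as impala_connect
import pymysql

QUERY = "SELECT COUNT(*), SUM(enrollment) FROM enrollment"

impala_cursor = impala_connect(host="impala.example.internal", port=21050).cursor()
impala_cursor.execute(QUERY)
impala_result = tuple(float(x) for x in impala_cursor.fetchone())

mysql_cursor = pymysql.connect(
    host="mysql.example.internal",
    user="reader",
    password="example-password",
    database="educationdata",
).cursor()
mysql_cursor.execute(QUERY)
mysql_result = tuple(float(x) for x in mysql_cursor.fetchone())

assert impala_result == mysql_result, "Impala and MySQL results diverge"
```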
When a user goes to a summary endpoint URL, the request is routed to Amazon API Gateway, an AWS service that handles the creation, management, and security of APIs. API Gateway then invokes another AWS Lambda Python function that parses the requested URL into a properly formatted SQL query. This query is then run against the Impala database. Finally, the Lambda function retrieves the results of the query and passes them back to API Gateway, which returns them to the user.
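A simplified sketch of such a handler, assuming API Gateway’s Lambda proxy integration and a hypothetical table (the production function also validates column names and resolves the table from the URL path):

```python
# Sketch of the query-building Lambda handler; the Impala host and
# table are hypothetical, and validation is heavily simplified.
import json
from impala.dbapi import connect

VALID_STATS = {"sum", "count", "avg", "min", "max", "variance", "stddev", "median"}

def lambda_handler(event, context):
    params = event["queryStringParameters"]
    var, stat = params["var"], params["stat"]
    if stat not in VALID_STATS:
        return {"statusCode": 400, "body": "invalid stat"}

    # All queries are grouped by year, plus any user-supplied by variables
    group_cols = ["year"] + [c for c in params.get("by", "").split(",") if c]

    # stat is used directly here; in practice some stats (e.g., median)
    # map onto Impala-specific aggregate functions
    sql = (
        f"SELECT {', '.join(group_cols)}, {stat}({var}) AS {var}_{stat} "
        f"FROM enrollment "  # in practice, resolved from the URL path
        f"GROUP BY {', '.join(group_cols)}"
    )

    cursor = connect(host="impala.example.internal", port=21050).cursor()
    cursor.execute(sql)
    rows = cursor.fetchall()
    return {"statusCode": 200, "body": json.dumps({"results": rows})}
```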
Finally, we use Cloudera Workload Manager to monitor the performance of the summary endpoints. With Workload Manager, we get valuable insight into which endpoints are queried most often and how long those queries take to complete. This helps us understand potential bottlenecks and optimize the system’s performance to better serve our user base.
Looking ahead
We have truly appreciated how much we have learned from our collaborators at the Cloudera Foundation and are excited to continue building out the data engineering capabilities at the Urban Institute with their helpful guidance.
We are excited to see how our users take advantage of the summary endpoints to leverage data as they never have before!