How We’re Tackling Performance Bottlenecks in the Education Data Portal API
For more than five years, the Education Data Portal has been a one-stop shop for all major national datasets on schools, districts, and colleges. During this time, we’ve developed a number of tools and features to make these data increasingly accessible to a range of audiences. Recently, we’ve worked to address one of the most common pain points users have reported: the time it takes to pull large amounts of data.
This lag time typically becomes noticeable when pulling gigabytes of data, which isn’t uncommon for portal users. Pulling 10 years of K–12 student enrollment data disaggregated by grade, sex, and race, for example, requires downloading more than eight gigabytes of data from the portal. Similarly, requests for detailed data on college awards by race and sex across institutions and years can quickly grow to dozens of gigabytes or hundreds of millions of records.
Performance bottlenecks have also become more apparent as users have started using the portal in new and innovative ways — most notably, building dashboards and interactive tools to put actionable insights into the hands of stakeholders. For a researcher, waiting a couple of minutes for R or Stata code to run is often acceptable; for a user looking at a dashboard, even a few seconds of latency can be a major source of frustration.
To improve API response times, we focused on three areas: caching, database, and logging optimization. By targeting these areas, we were able to reduce latency without increasing our infrastructure costs.
Caching optimization
The Education Data Portal uses a common architecture pattern for modern web applications. The API layer leverages the Django REST framework, and a Redis cache sits in front of a MySQL database to serve common requests from memory. Cached requests are returned almost instantaneously, while requests that query the database can be much slower, especially when the relevant database table isn’t well indexed for the specific set of requested filters.
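The flow described above is a standard cache-aside pattern. The sketch below illustrates it in a few lines of Python; a plain dict stands in for Redis, `run_database_query` stands in for the MySQL layer, and all names are illustrative rather than the portal's actual code.

```python
# Minimal cache-aside sketch: a dict stands in for Redis, and
# run_database_query stands in for the (much slower) MySQL layer.
# All names here are illustrative, not the portal's actual code.

cache = {}

def run_database_query(url):
    # Placeholder for the slow database lookup.
    return {"url": url, "records": []}

def handle_request(url):
    # Serve from memory when possible; otherwise query the
    # database and store the response for subsequent requests.
    if url in cache:
        return cache[url]
    response = run_database_query(url)
    cache[url] = response
    return response
```

The first request for a given URL pays the database cost; repeats of that URL are served from memory, which is why cache hit rates matter so much for tail latency.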
Django’s built-in caching middleware generates keys by hashing the requested URL. Because each key maps to a unique URL, identical responses can be stored multiple times under separate keys because of differences in URLs that don’t affect the actual data returned. The following URLs, for example, return identical information on historically Black colleges and universities in Washington, DC, but would be cached separately because the user entered their desired parameters in a different order:
· https://educationdata.urban.org/api/v1/college-university/ipeds/directory/2018/?fips=11&hbcu=1
· https://educationdata.urban.org/api/v1/college-university/ipeds/directory/2018/?hbcu=1&fips=11
Similarly, the mode parameter that identifies where API requests come from (e.g., the R or Stata packages) would cause these URLs to be cached separately despite returning identical data to the user:
· https://educationdata.urban.org/api/v1/college-university/ipeds/directory/2018
· https://educationdata.urban.org/api/v1/college-university/ipeds/directory/2018?mode=R
· https://educationdata.urban.org/api/v1/college-university/ipeds/directory/2018?mode=stata
To prevent this duplication, we created subclasses that inherit from Django’s built-in middleware classes to preprocess the requested URL before creating the key. In these subclasses, we pass the request through a simple utility function to strip the mode parameter and sort the remaining parameters. As a result, if a request with the same set of relevant query string parameters is already cached, the API will serve the request from memory, even if the order of those parameters differs or if the requests come from different modes. With this update, more than 90 percent of API requests are served from memory each month (up from around 75 percent previously) without increases to our infrastructure costs.
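A utility of the kind described above can be written with the standard library alone. The version below is a simplified stand-in, not the portal's actual code: it strips an ignored parameter (here, `mode`) and sorts the rest, so that equivalent URLs map to a single cache key. In a Django deployment it could be called from subclasses of the built-in cache middleware before the key is computed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_url(url, ignored_params=("mode",)):
    """Drop query parameters that don't affect the response (e.g.,
    mode) and sort the rest so equivalent URLs share one cache key.

    A simplified stand-in for the utility described in the text;
    the portal's actual implementation may differ.
    """
    parts = urlsplit(url)
    params = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in ignored_params
    )
    return urlunsplit(parts._replace(query=urlencode(params)))
```

With this normalization, the two historically-Black-colleges URLs above (and their `mode=R` and `mode=stata` variants) all reduce to the same string, and therefore the same cache entry.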
Database optimization
To speed up requests that query the database, we also used standard database optimization strategies. Typically, we create indexes for a database table when adding a new API endpoint to the portal by predicting how users will make requests. If we anticipate users will frequently request school enrollment data by sex and race for a particular state, for example, we would add a composite index to the relevant database table based on the sex, race, and state predicates in the SQL WHERE clause.
In some cases, however, users make requests that our initial set of indexes wasn't designed to serve. Using tools like Amazon's Performance Insights and MySQL's slow query log, we pinpointed common database queries that take a long time to execute and prioritized them for optimization. By analyzing the syntax of these SQL statements and checking their execution plans with MySQL's EXPLAIN statement, we were able to identify useful indexes and add them to the relevant tables.
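This index-tuning workflow generalizes across database engines. The sketch below uses Python's built-in SQLite (rather than the portal's MySQL) so it runs anywhere, with a hypothetical enrollment table; SQLite's EXPLAIN QUERY PLAN plays the role that MySQL's EXPLAIN plays in the text.

```python
import sqlite3

# Self-contained sketch of composite-index tuning, using SQLite in
# place of the portal's MySQL and a hypothetical enrollment table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enrollment (
        year INTEGER, fips INTEGER, grade INTEGER,
        sex INTEGER, race INTEGER, enrollment INTEGER
    )
""")

# A composite index matching the predicates users filter on most.
conn.execute(
    "CREATE INDEX idx_sex_race_fips ON enrollment (sex, race, fips)"
)

# EXPLAIN QUERY PLAN (SQLite's analogue of MySQL's EXPLAIN) shows
# whether the query is served via the index or a full table scan.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT SUM(enrollment) FROM enrollment
    WHERE sex = 1 AND race = 2 AND fips = 11
""").fetchall()
print(plan)
```

The plan output names `idx_sex_race_fips`, confirming the filter is answered through the index; dropping the index and rerunning shows a table scan instead.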
Recently, we added hundreds of new indexes affecting dozens of endpoints, including the school enrollment, district assessment, and college awards datasets. Based on benchmarks, many common data requests are now more than 10 times faster as a result of this database optimization. Over the coming months and years, we plan to continue looking for patterns in slow SQL queries and adding useful indexes where applicable.
Logging optimization
To better understand usage patterns, we log basic information about each API request. This archive informs how we prioritize our data release schedule and allows us to catalog the aggregate number of users and requests made through the portal. By writing a custom logging module, we were able to capture this information more efficiently to make every request more performant.
Initially, we used an external Django utility to log information about requests back to our database. This utility, however, slowed down requests by capturing more information than we needed. It also wrote the full response for each request back to the database, which became particularly expensive for large requests. By using Django profiling tools, we could see that for many requests, the majority of the response time was spent logging information back to the database, rather than fetching data records and returning them to the user.
By writing custom logging middleware that captures only the minimum amount of information we need to track usage patterns, we’re now able to continue to catalog portal history without slowing down requests.
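A minimal request logger in Django's middleware style might look like the sketch below. This is a hypothetical illustration, not the portal's actual module: it records only the method, path, status code, and elapsed time, and never touches the response body.

```python
import logging
import time

logger = logging.getLogger("api.requests")

class RequestLogMiddleware:
    """Hypothetical minimal request logger in Django's middleware
    style: record only method, path, status, and elapsed time,
    never the response body. A sketch, not the portal's code.
    """

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        # One small log line per request; nothing is written back
        # to the database on the request path.
        logger.info(
            "%s %s -> %s in %.3fs",
            request.method,
            request.path,
            response.status_code,
            time.monotonic() - start,
        )
        return response
```

Because the middleware only formats a short log line, its overhead is roughly constant per request, rather than scaling with the size of the response the way logging full response bodies does.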
What’s next
As a result of these performance improvements, approximately 95 percent of data requests are now returned to users within 30 seconds, 90 percent within 10 seconds, and 75 percent within 2 seconds. (We define a "data request" as a set of HTTP requests from a given mode, base endpoint, and IP address where the difference between chronological pairs of requests is less than three minutes — meaning "data requests" often span hundreds or thousands of individual HTTP requests.)
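That definition of a "data request" amounts to a simple sessionization rule, which can be restated as code. The function below is an illustrative reimplementation of the definition, not the portal's analytics pipeline: a new data request starts whenever the gap between consecutive HTTP requests reaches three minutes.

```python
from datetime import datetime, timedelta

def count_data_requests(timestamps, gap=timedelta(minutes=3)):
    """Group one mode/endpoint/IP's HTTP request timestamps into
    "data requests": consecutive requests less than `gap` apart
    belong to the same data request. Illustrative reimplementation
    of the definition in the text.
    """
    count = 0
    previous = None
    for ts in sorted(timestamps):
        # A gap of three minutes or more starts a new data request.
        if previous is None or ts - previous >= gap:
            count += 1
        previous = ts
    return count
```

For example, two requests 30 seconds apart followed by one 10 minutes later count as two data requests, even though three HTTP requests were made.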
Although this improvement demonstrates progress toward tackling bottlenecks, our efforts to monitor and increase performance remain ongoing. We love hearing about the innovative tools our users are building to derive insights using data from the portal, and we’re committed to ensuring that our API meets the needs of our growing userbase for years to come.