
The Importance of Education Data that Empower People to Drive Change

At the Urban Institute, we believe the best policy decisions are grounded in facts. We envision a future in which more policymakers, researchers, and advocates have access to the data they need to make informed decisions and work toward policies and programs that better serve all students.

To fulfill that vision, in April 2018, the Urban Institute first launched the Education Data Portal as a freely available, one-stop shop for accessing cleaned, harmonized federal education data and metadata. Since then, supported by the Overdeck Family Foundation, the Bill & Melinda Gates Foundation, Arnold Ventures, and Amazon Web Services, we’ve added dozens of new datasets, Stata and R packages, a data explorer, and a school-level demographics tool, and we continue to update the data to keep the portal fresh and relevant.

Over the past few months, COVID-19 has radically shifted the education landscape, from early education through college, in ways that will affect students and transform school systems for years to come. Though it will be a few years before federal data collections presented in the portal reflect the pandemic’s impact, we can use data from the past few years to understand which schools are most likely to be affected and how different policy solutions might play out.

It is now more critical than ever that we make trusted data quickly and intuitively accessible on any device so a broader set of stakeholders can use this rich education data resource to help mitigate the pandemic’s negative effects on educational outcomes.

Announcing our collaboration with the Cloudera Foundation

Today, we’re excited to announce a collaboration with the Cloudera Foundation, which will allow us to accelerate progress toward our vision: free and open access to high-quality education data for a diverse set of stakeholders pushing for change. Through financial, technical, and software support, the foundation and its top-tier engineering experts will enable our team of researchers and data scientists to improve our process and build out new capabilities that will benefit changemakers and the technology-for-good field at large, including the following:

· building out summary endpoints for, and web applications on, the Education Data Portal that allow people to easily explore data aggregated at the school district, congressional district, state, or other geographic levels

· providing a full, transparent, visual and code-based view of our data processing pipeline, from ingestion of the data from its original source to the final portal-ready version, in addition to a transparent data policy, which we hope will enable all users to trust the data’s quality and usefulness for driving change

· giving back by writing about how we built the infrastructure for the portal here at Data@Urban and sharing our methods, code, and findings to allow others to take advantage of lessons learned

· automating and improving the efficiency of our current data pipelines and API to reduce processing time, increase data availability, and deliver data faster for a seamless user experience
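As a rough illustration of what a summary endpoint could return, here is a minimal sketch of aggregating school-level enrollment to a broader geographic level. The field names and records are hypothetical, not the portal's actual schema.

```python
from collections import defaultdict

def summarize_enrollment(rows, level="district_id"):
    """Aggregate school-level enrollment records to a broader
    geographic level (e.g., school district or state).

    `rows` is a list of dicts; `level` is the key to group by.
    Field names here are illustrative, not the portal's schema.
    """
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row["enrollment"]
    # One summary record per geography, sorted for stable output
    return [{"geography": g, "enrollment": n} for g, n in sorted(totals.items())]

# Example: three schools in two hypothetical districts
schools = [
    {"district_id": "D1", "state": "CA", "enrollment": 500},
    {"district_id": "D1", "state": "CA", "enrollment": 300},
    {"district_id": "D2", "state": "CA", "enrollment": 450},
]
print(summarize_enrollment(schools))
# [{'geography': 'D1', 'enrollment': 800}, {'geography': 'D2', 'enrollment': 450}]
```

A production endpoint would run this aggregation in the database or a big data engine rather than in application code, but the shape of the response is the same.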

We are very grateful to have found a collaborator that shares our big vision and excitement and brings a wealth of technical expertise to help us deliver on this vision in ways we could not have achieved alone.

Where we are now

Currently, our data system works well for a range of use cases. Data are processed by our education data experts in Stata, and metadata are curated in Excel. Data updates are triggered by our data science team using Python scripts that write files to S3 and to our RDS MySQL database in the cloud, where indexes are automatically added as part of the upload process. Metadata are manually uploaded by our education data team to S3, where quality checks are automatically triggered to test the metadata for consistency with the data.
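To make the automatic indexing step concrete, here is a sketch of the kind of helper an upload script might use to generate index statements for the columns an endpoint filters on. The table and column names are hypothetical, not our actual schema.

```python
def index_statements(table, columns):
    """Build CREATE INDEX statements for the columns an endpoint
    filters on, so indexes can be added as part of each upload.
    Table and column names here are hypothetical."""
    return [
        f"CREATE INDEX idx_{table}_{col} ON {table} ({col});"
        for col in columns
    ]

# Columns commonly used to filter a hypothetical directory table
stmts = index_statements("school_directory", ["year", "state_fips", "district_id"])
for s in stmts:
    print(s)
```

In practice, the upload script would execute these statements against the MySQL database after loading each table.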

The full stack is set up by our internal DevOps (a combination of software development and IT operations) and web application development teams, and the web team coordinates all technical aspects of the API and database systems with the DevOps and data science teams.

Data quality checks are manually triggered by our data science team and run in parallel by invoking a separate lambda function for each check. Failed checks are automatically reported over email to the appropriate user, who resolves the issue and passes the data back to the data science team, which uploads the data and reruns the suite of quality checks. R and Stata package checks are manually run by our data science team. All of this runs in a staging environment, and once all checks have passed, the data are copied over to the production environment that all of our users see.
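The fan-out pattern above can be sketched locally with a thread pool standing in for the per-check Lambda functions. The checks and data here are hypothetical; they only illustrate the one-worker-per-check structure.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical quality checks; in production each would run in its
# own AWS Lambda function rather than a local thread.
def check_no_negative_enrollment(rows):
    return all(r["enrollment"] >= 0 for r in rows)

def check_required_fields(rows):
    return all("school_id" in r and "year" in r for r in rows)

def run_checks(rows, checks):
    """Fan out one worker per check, mirroring the one-Lambda-per-check
    pattern, and collect named pass/fail results."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(fn, rows) for name, fn in checks.items()}
        return {name: f.result() for name, f in futures.items()}

rows = [{"school_id": "S1", "year": 2018, "enrollment": 120}]
results = run_checks(rows, {
    "no_negative_enrollment": check_no_negative_enrollment,
    "required_fields": check_required_fields,
})
# In production, each name in `failed` would trigger an email
# to the analyst responsible for that dataset.
failed = [name for name, ok in results.items() if not ok]
print(failed)  # []
```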

The system works very well; response times for API requests average less than a second, and even extremely large requests, such as one for all the schools in California, complete in just a few seconds.

Where we are going

And yet we see a number of areas where we can improve. In collaboration with the Cloudera Foundation and Cloudera engineers, we plan to explore the following:

· Running a summary of enrollment over time by school district on our current infrastructure takes minutes to complete, and under high load, it can take more than an hour. Based on our discussions with the Cloudera Foundation, we’re excited to explore storing data in the efficient Apache Parquet format on Amazon Web Services S3 and querying them with big data services (such as Apache Impala) on a Cloudera-powered big data cluster, which we think will allow us to summarize data in seconds without significantly increasing costs.

· We’ve reduced the time we spend on manual quality checks from two weeks to a few hours over the past two years, but we could cut that time further. In addition, much of our data processing for new years of data happens manually, even when nothing has changed. Working with Cloudera Foundation experts, we plan to explore professional data pipeline software (such as Apache NiFi) and data quality frameworks (such as Great Expectations), which could allow us to automate more of our data processing, easily see where things are going well and where data might need further investigation, and skip reprocessing when nothing has changed.

· Code for processing the data is not currently public, and changes to the data are visible only through our release notes, our newsletter, and comparisons of previously stored data with current data. However, adopting data pipeline technology with a graphical interface, such as Apache NiFi, and automating the pipeline under version control would allow us to provide completely transparent change detection and versioning, increasing trust in our data as they change over time.
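The declarative style of checks that a framework like Great Expectations provides can be sketched in plain Python. This is only to illustrate the pattern of named expectations producing structured pass/fail results; the function names, fields, and data here are ours, not the framework's API.

```python
def expect_no_nulls(rows, column):
    """Expectation: `column` is never missing or None."""
    bad = [r for r in rows if r.get(column) is None]
    return {"expectation": f"{column} not null",
            "success": not bad, "unexpected_count": len(bad)}

def expect_values_between(rows, column, low, high):
    """Expectation: every value in `column` falls within [low, high]."""
    bad = [r for r in rows if not (low <= r[column] <= high)]
    return {"expectation": f"{column} between {low} and {high}",
            "success": not bad, "unexpected_count": len(bad)}

# Hypothetical new year of data arriving in the pipeline
rows = [{"school_id": "S1", "enrollment": 120},
        {"school_id": "S2", "enrollment": None}]
valid = [r for r in rows if r["enrollment"] is not None]

report = [
    expect_no_nulls(rows, "school_id"),
    expect_no_nulls(rows, "enrollment"),
    expect_values_between(valid, "enrollment", 0, 100_000),
]
for result in report:
    print(result)
```

A real framework adds much more, such as persisted expectation suites, data documentation, and integration with pipeline tools, but the core idea is exactly this: checks as declarative, named assertions whose results can be reviewed and versioned.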

Looking ahead

As experts in education data and policy-focused data science, we’re excited to learn from and work with the talented engineers at the Cloudera Foundation and Cloudera to make this vision a reality. Enabling key decisionmakers to process hundreds of millions of rows of data from their phone, tablet, or laptop in seconds is a daunting task. It will require production-grade data engineering skills to build robust, scalable data pipelines, testing and production systems that catch bugs before they go live, appropriate big data infrastructure to process and return user requests in a timely manner, and infrastructure support that ties all these pieces together.

If all goes well, our key users will be able to leverage data as never before, and we will have built a powerful set of tools that democratize big data and provide critical information to education policymakers in the years ahead.

On top of the benefits to the field, our collaboration with the Cloudera Foundation will allow Urban to learn and grow in the data engineering, data operations, and big data space. It is our sincere hope that this collaboration will encourage similar work in other domains, such as financial well-being, housing, and climate change, both inside and outside of Urban, especially as these fields need better data to understand the pandemic’s long-term effects.

We believe this project, along with others supported by the Cloudera Foundation, such as work at AidData and the Terre des hommes foundation, has the potential to provide the next generation of infrastructure, connecting data experts and their high-quality data and analytics capabilities to the people who need those data most to drive change in their communities. We hope you will be on the lookout for future Data@Urban posts and announcements to learn more as we dive into this work.

-Graham MacDonald

Want to learn more? Sign up for the Data@Urban newsletter.

Data@Urban is a place to explore the code, data, products, and processes that bring Urban Institute research to life.