The Urban Institute Education Data Portal offers open, free access to education data from a number of data sources in one place, with the goal of making it easier for researchers, practitioners, and policymakers to generate rigorous, accurate, and actionable insights to improve outcomes for students.
As more and more people have begun to rely on the portal for their work — we now have well over 1,000 unique active users every month — we decided to take a closer look at our public data policy to ensure that it aligned with best practices from leading data providers in the field.
In this blog post, I’ll walk you through where we started, the background research we did, the top data providers we relied upon for examples (with links), the conversations we had, and where we ultimately settled. We hope this helps other organizations pursuing similar activities, just as those that came before us helped to inform our process. (For those who just want to see the data policy, click here.)
Before I walk you through the steps, a little about the team. The Education Data Portal is a collaborative effort between education researchers and data experts, data scientists, web developers, communications experts, and data viz professionals. The core management and decision-making group is led by Matt Chingos, Vice President of the Education Data and Policy Center, and includes Erica Blom (education Research Associate), Ben Chartoff (Director of Data Visualization), Dave D’Orio (Director of Web Application Development), Alex Tilsley (Senior Manager, Communications and Outreach), Irene Ngun (Senior Project Manager), and myself (representing the Data Science team). The best projects are a collaboration among talented individuals in many fields, and this effort is no exception.
How we started
When we initially discussed the data policy, the group felt strongly, and came to agreement around, two primary goals:
1) The data should be publicly available and free to use… except for commercial entities. The group felt that we should place only the lightest restrictions on the portal (a maximum of 250 requests per second) and otherwise allow people to use it freely. However, at the time we did not feel that it was appropriate for private entities to build on work intended for the public good. We did not choose an off-the-shelf, standard data license; instead, the team compromised on a custom license tailored to this use case, keeping it short (two paragraphs) and easy to understand.
2) People should be encouraged to cite our data appropriately. In our discussions, we felt it was important to show the value of the portal to the community through citations, so we could ensure its continued existence and understand where it was (and wasn’t) working well for people.
Our background research
A number of factors led us to reconsider this data policy in 2020. As I mentioned previously, more and more people were relying on the data portal. Other Urban projects — such as Urban’s Data Catalog — began to adopt open license standards (we default to the Open Data Commons Attribution License). And finally, funders began moving toward more open data standards and data sharing policies and explicitly asking us about our data stance. As a result, we began an effort to explore best practices in the field to inform an update of the Education Data Portal’s public data policy.
We scanned the field online and through Urban’s extensive network of partners and friends to put together a list of the existing data policies specifically for similar institutions providing data to a broad audience for the public good. We compiled them in one place, and you can find the links to the ones we found most relevant and useful for our discussion at the bottom of this post. In summary, we found a wide variety of data policies and access requirements and discussed them all as a team to come to the best solution for the Education Data Portal.
Our conversations and debate
Based on this background research and our needs, I drafted an initial set of recommendations and sent them to the team for review, and then we met to iron out the differences that emerged in edits and comments. In this section, I’ll walk briefly through our primary agreements and points of discussion during the process.
· Standard license: We all agreed we needed to update our license to follow an open, permissive license already established in the field, both to build on existing best practices and legal protections and to provide a license users could better understand.
· Liability limitation: We agreed that we needed additional language that protected Urban from claims, damages, and losses related to data use and reuse if we were to use a more permissive standard (see “Choice of license”, below).
· Different citation options: Based on user feedback, we decided that two citation examples and recommendations for use were better than one. We created a shorter, more compact citation style for ease of use with data visualizations to go with our more traditional, researcher-focused citation.
· Notification of use: We agreed that we did not want to add language requiring users to notify us when they used the data, a requirement that we do not have the time to enforce and that we believed might deter well-intentioned people from using the data. Instead, we decided to strongly encourage reporting and included language demonstrating how doing so could add value for the user. In addition, our choice of license model, described in the following section, requires citation, which helps us track use of the data.
Points of discussion and where we settled
· Choice of license: We debated the right standard license model for the data for some time and ultimately reached a compromise that, while not everyone’s first choice, got us all to an agreement. We settled on the Open Data Commons Attribution License (ODC-By) v1.0, which allows for pretty much any use of the data by any party, commercial or non-commercial, as long as they appropriately cite us. Some on the team favored a completely permissive license without citation requirements, while others favored a stricter version of the license that continued our policy of forbidding commercial use. In the end, we decided that attribution was valuable for demonstrating the portal’s impact and that permitting commercial users would only bolster our users’ advocacy efforts to keep the portal sustainable in the long term.
· Request to ask permission for redistribution: We briefly discussed whether to require users to contact us if they wanted to redistribute the data in whole or in part. However, we decided, per our choice of license conversation, that additional users of the data, in any form, would be helpful in showing impact and adding to the group of people advocating for the portal’s long-term sustainability.
· Requirement to register: Some strongly advised us to collect information from users so that we could more easily show impact, which would mean denying access to anyone who had not registered. Sticking with the theme of increased use being good for impact and our long-term prospects, we decided to continue to leave the portal barrier-free, until such time as large amounts of traffic, intense use, or high infrastructure costs make such gatekeeping worthwhile.
· Blocking use for extremely high volume users: We currently set a rate limit for the portal at 250 requests per second, which, as of this writing, no one has exceeded. While some argued we should make this explicit in our policy, others made the case that no other example sites included this in their language and we could take the action to block users regardless of whether they were forewarned. In the end, we decided that, given our lack of any incidents, we would pass on including this language and revisit only if this becomes an issue in the future.
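For readers who script against the portal, a client-side throttle is one way to stay comfortably under the 250 requests-per-second ceiling mentioned above. The sketch below is illustrative only: the `RequestThrottle` class and its parameters are hypothetical names, not part of the portal or any Urban library, and it makes no claim about how the portal enforces its limit on the server side.

```python
import time


class RequestThrottle:
    """Hypothetical client-side limiter: spaces calls at least
    1 / max_per_second seconds apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock          # injectable so the class can be tested
        self._sleep = sleep
        self._next_allowed = None    # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay
```

A script would create one throttle, for example `RequestThrottle(100)` to leave plenty of headroom below the limit, and call `wait()` immediately before each request it sends.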
I hope this provided an interesting view into our data policy update process, and that it can help you or your organization in your journey to specifying your own policy for publicly available data. If you have any feedback, questions, or comments, please reach out to the team at email@example.com. And if you use Education Data Portal data in your work, please be sure to let us know!
Addendum: Public Data Policy Guidance Summaries and Links for Selected Top Public Data Providers
IPUMS and NHGIS
· Citation: https://usa.ipums.org/usa/cite.shtml
· Requires registration to access and for people to affirmatively agree to legal terms: https://uma.pop.umn.edu/usa/user/new?return_url=https%3A%2F%2Fusa.ipums.org%2Fusa-action%2Fmenu
COVID Tracking Project
· Terms and conditions apply a Creative Commons non-commercial license that allows users to modify and use the data, including for research, with attribution and license information: https://covidtracking.com/terms-and-conditions. They have a way to submit a notice if you think the site has taken copyrighted information. They also have language on non-endorsement of links.
Western Pennsylvania Regional Data Center
· Requires clicking accept on Data Use Agreement before using data, with definitions at the top: https://data.wprdc.org/dataset
Data Driven Detroit
National Preservation Database
· They provide detailed data update notes: https://preservationdatabase.org/documentation/data-notes/
· Separate users by commercial use, and plan to share when registering: https://preservationdatabase.org/register-as-a-new-user/
· Specific information on use and prohibitions also when registering: https://nhpd.preservationdatabase.org/Account/UnpaidUserRegistration
Longitudinal Census Database
· Requires e-mail address, prohibits commercial use, cannot be redistributed without authorization, with a recommended reference: https://s4.ad.brown.edu/Projects/Diversity/Researcher/LTDBDPDload/Default.aspx
Eviction Lab
· Have a custom request and data merge request: https://evictionlab.org/data-request/
· Accessing data requires e-mail: https://evictionlab.org/get-the-data/
· Citation and sharing requirements on data page, along with changelog and changes page: https://data-downloads.evictionlab.org/
Open Data SF
· Summarize their thought process in choosing a license for their data: https://datasf.org/blog/data-license-liberation-day/. Recommend having a license but not necessarily requiring citation, as it can be hard for data users to keep track and can be a lot of text.
NOLA Data Center
· Require citing the original source and not republishing data tables or entire web pages without permission; provide an e-mail contact: https://www.datacenterresearch.org/copyright-acceptable-use/
Educational Opportunity Project (edopportunity.org)
· Can’t use for profit; need to contact them (they provide an e-mail) to get approval to do so. Can’t use data to identify individuals, includes a liability-limiting statement, requires agreeing to cite and provides a citation, grants a “right to access”, and makes users enter an e-mail to accept the data use agreement: https://edopportunity.org/get-the-data/
*An earlier version of this story said that some Urban staff favored forbidding non-commercial use of the data. The sentence has been corrected to state that staff favored forbidding commercial use of the data (corrected 1/29/21).