Choosing a Geocoder for the Urban Institute
*Google Maps and HERE pricing and free tier limits were updated to reflect up-to-date data (corrected 3/12/20).
Here at the Urban Institute, researchers often need to geocode addresses; that is, take input text, such as postal addresses, and turn it into latitude and longitude locations on the Earth’s surface. Geocoding is important because it quantifies unstructured or semistructured address data and allows researchers to perform aggregated spatial analyses and create maps.
Urban hasn’t had a standardized, reproducible geocoding process. Researchers have been geocoding their data using a wide range of tools, such as the US Census Bureau’s geocoding API and ArcGIS Online, and even by typing addresses by hand into Google Maps. More importantly, we didn’t know which of these geocoders was the fastest, most accurate, or most cost efficient for our needs.
To fill this gap, the Tech and Data team decided to research available services and build a centralized geocoding service for Urban researchers. Although we considered building our own in-house geocoding service using open source tools and address data (such as Pelias and OpenStreetMap), we decided instead to use an existing third-party geocoding service, as we found it quicker to set up and more accurate out of the box.
After surveying Urban researchers about the tools they use and the features they need, we assembled a list of potential geocoders and evaluated them based on terms of use restrictions, cost, accuracy, and speed.
Terms of Use
One important but often underestimated aspect of geocoding is what’s allowed in the terms of use (TOU) for the geocoding service. Researchers have unique requirements for geocoding services, including the ability to do the following:
· locally store geocode results
· use the results in further spatial analysis
· create derivative works, such as reports, using the geocodes
· publish the latitudes and longitudes publicly
· overlay the latitudes and longitudes on third-party basemaps
Unfortunately, the TOU and acceptable use policies of many of the most popular geocoders explicitly prohibit one or more of these activities. Both Mapbox and Google prohibit prefetching, storing, or bulk downloading content. Google Maps also prohibits creating content from its data, including using “latitude/longitude values from the Places API as an input for point-in-polygon analysis,” which is a common data analysis method.
The only way to get around these restrictive terms of use is to purchase licenses to the content from third-party vendors. In our research and outreach, we’ve found this to be prohibitively expensive, as it’s mostly geared toward large enterprise clients.
Below is a listing of some of the popular geocoding services, with links to their TOUs and whether they allow locally storing the results and overlaying the results on third-party basemaps. These are just some of the many requirements your organization may have.
In short, deciphering whether research-specific use cases were permissible was challenging, as the TOUs of most geocoding services are geared toward the needs of developers and business users, not researchers. When in doubt, we suggest reaching out directly to the geocoding service and confirming you are not violating their terms of use. After review with our legal team, we decided that Google Maps, Mapbox, MapQuest, and LocationIQ had terms of use that were too restrictive for our organization and had to exclude them from our list of potential geocoding services.
Cost
One of our primary concerns for choosing a geocoder was cost. Almost all the geocoders we analyzed had different pricing schemes, making an apples-to-apples comparison difficult. Geocoders like Google Maps and ArcGIS Online have a pay-as-you-go model where you are charged for each address you submit for geocoding. Others like HERE offer 250,000 free geocodes per month, as well as monthly subscriptions for higher capacity. In addition, ArcGIS StreetMap has an annual subscription for unlimited geocodes, and the Census Bureau’s geocoder is completely free.
All that said, below are the publicly available prices for geocoding 10,000 addresses from a sample of popular geocoders. Keep in mind, the true costs are more complicated because most services have varied, often more expensive enterprise pricing schemes for permanent storage of geocodes, as well as daily limits, volume discounts, free tiers, or required software licenses. For example, you will need an ArcGIS license to use any of the ArcGIS geocoders. Nevertheless, this should be a good starting point for cost considerations:
Ultimately, the most cost-efficient geocoder depends on how many addresses you will be geocoding each year and within what timeframe you need the results. Assuming you’ll be geocoding fewer than 250,000 records a month, both Bing and HERE offer generous free tiers that could meet your needs. If you have higher geocoding needs or need a quicker turnaround, it might make more sense to look at the other paid options. Of course, cost alone doesn’t tell the full story.
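To make the free-tier arithmetic concrete, here is a minimal sketch of how a monthly bill could be estimated under a simple "free allowance plus per-1,000 price" plan. The 250,000 free geocodes per month comes from the discussion above; the plan names and per-1,000 prices are hypothetical placeholders, not real vendor prices, and real plans add daily limits, volume discounts, and license fees that this ignores.

```python
def monthly_cost(n_addresses, free_per_month, price_per_1000):
    """Estimate one month's cost under a simple free-tier-plus-usage plan."""
    # Only addresses beyond the free allowance are billed.
    billable = max(0, n_addresses - free_per_month)
    return billable / 1000 * price_per_1000


# Hypothetical plans: the prices below are placeholders for illustration only.
plans = {
    "free_tier_plan": {"free_per_month": 250_000, "price_per_1000": 1.00},
    "pay_as_you_go": {"free_per_month": 0, "price_per_1000": 4.00},
}

for name, plan in plans.items():
    for volume in (10_000, 300_000):
        cost = monthly_cost(volume, **plan)
        print(f"{name}: {volume:,} addresses -> ${cost:,.2f}")
```

Plugging in the published numbers for the services you are considering, at the volumes you actually expect, is a quick way to see where a free tier stops being free.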
Accuracy
In the real world, address data is almost always messy and riddled with typos. So the accuracy of a geocoding service depends on a geocoder’s ability to fix misspelled or misformatted addresses and return accurate latitudes and longitudes. To test the accuracy and text correction capabilities of the geocoders on a level playing field, we created a dataset of 40 randomly chosen addresses (39 in the United States and 1 in Canada) and their latitudes and longitudes, which we confirmed by manually scrolling to the center of each building as seen on Google Maps. These manually inspected latitudes and longitudes were the baseline “ground truth” we compared all the geocoders against.
We then created two more versions of each address: a slightly misspelled version (difficulty level 2) and a version with misspelled and missing information (difficulty level 3), for a total of 120 addresses. Below is an example of one of the addresses under each of the three difficulty levels. You can download the full test address dataset from the Urban Data Catalog here.
We then submitted these 120 addresses to each of the geocoding services we were considering and measured the accuracy of the results. We defined an “accurate” result as a geocode that was within 100 meters of the Google Maps baseline. “Accuracy” here is defined as accurate relative to our manual Google Maps check, which may not represent actual accuracy but was a sensible starting point for our purposes.
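The 100-meter check described above can be sketched in a few lines of Python. This is an illustration rather than our actual test code: it assumes the haversine great-circle formula on a spherical Earth, and the function names are hypothetical. A geocoder that returns no match is represented as `None` and counted as inaccurate, which is also an assumption.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_accurate(result, baseline, threshold_m=100.0):
    """True if a geocoded (lat, lon) falls within threshold_m of the baseline."""
    if result is None:  # assumption: no match counts as a miss
        return False
    return haversine_m(*result, *baseline) <= threshold_m


def accuracy_rate(results, baselines, threshold_m=100.0):
    """Share of results within threshold_m of their paired baseline point."""
    hits = sum(is_accurate(r, b, threshold_m) for r, b in zip(results, baselines))
    return hits / len(baselines)
```

Computing `accuracy_rate` separately for each difficulty level gives one number per geocoder per level, which makes any drop-off on messier addresses easy to see.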
Unfortunately, our legal team determined that we cannot publicly share the exact results, or the exact list of geocoders we tested for accuracy, as most of the terms of use either specifically prohibit sharing this information or include language that strongly discourages defamation of any kind. Because our data are only one potential test and may not fully describe the accuracy of a geocoder for every use case, our efforts could be construed as violating these terms. And to be clear, the list of geocoders presented in the tables above does not represent the list of geocoders we tested internally.
We understand some readers’ frustration and share it. We would love to have been able to share this information with you here. If you want to learn more about our testing process, please reach out, and we’ll be more than happy to talk.
At a high level, however, we found that ArcGIS StreetMap Premium and ArcGIS Online performed very well on our specific tests, with almost no drop-off in accuracy at higher difficulty levels, in which addresses contained significant data entry errors. But this is still a very small sample set of addresses and is by no means a substitute for a full test at another organization, so other geocoders could perform better on a different test dataset that may be more compatible with other organizations’ use cases. However, we feel fairly confident that the geocoders that did well on our test dataset have dependable text correction capabilities and comprehensive address databases for our use cases here at Urban.
Finally, we also saw that all geocoders performed best with difficulty level 1 addresses, reminding us that the greatest increases in geocoding accuracy come from cleaning your addresses thoroughly before submitting them for geocoding.
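As a small illustration of that kind of cleaning, the sketch below normalizes whitespace and comma spacing and standardizes case before an address is sent to a geocoder. The function name and the specific rules are assumptions chosen for the example; real cleaning usually also involves fixing misspellings and filling in missing fields, which simple rules like these cannot do.

```python
import re


def clean_address(raw):
    """Apply light, rule-based cleanup to a single address string."""
    text = re.sub(r"\s+", " ", raw).strip()   # collapse runs of whitespace
    text = re.sub(r"\s*,[\s,]*", ", ", text)  # fix comma spacing, drop repeated commas
    text = text.strip(" ,.")                  # trim stray punctuation at either end
    return text.upper()                       # use one consistent case


print(clean_address("  123  Main St. ,, Springfield ,IL   62701 , "))
```

Running the same function over every row of an input file before geocoding keeps the cleanup consistent and reproducible across projects.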
Speed
Speed was also an important factor to consider because we sometimes have projects that require millions of addresses to be geocoded quickly. To compare the speed of the geocoders, we recorded how long the geocoding process took for a toy dataset of 1,000 addresses. These tests were complicated by the fact that each of the geocoding services offered different ways to submit addresses for geocoding.
Most offered some kind of online API, some supported batch file job submissions, and a few had file upload interfaces. Wherever possible, we used the APIs to submit geocoding jobs and took advantage of the ability to submit batch jobs. Again, we can’t share the exact results of our tests because of TOU restrictions, and your mileage may vary based on your specific use case. But overall, we found that ArcGIS Online and ArcGIS StreetMap were very snappy for our specific use cases.
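A timing test like the one described in this section can be sketched as follows. This is a hedged illustration, not our test harness: `time_geocoder` and `fake_geocode_batch` are hypothetical names, the batch size is arbitrary, and the fake function stands in for whatever API or batch call a real service provides.

```python
import time


def time_geocoder(geocode_batch, addresses, batch_size=100):
    """Geocode addresses in batches and return (results, elapsed_seconds)."""
    results = []
    start = time.perf_counter()
    for i in range(0, len(addresses), batch_size):
        # Submit one slice of addresses per call, as a batch endpoint would expect.
        results.extend(geocode_batch(addresses[i:i + batch_size]))
    elapsed = time.perf_counter() - start
    return results, elapsed


def fake_geocode_batch(batch):
    """Placeholder for a real service call: one dummy (lat, lon) per address."""
    return [(0.0, 0.0) for _ in batch]


addresses = [f"{n} Example St" for n in range(1000)]  # toy dataset of 1,000 addresses
results, elapsed = time_geocoder(fake_geocode_batch, addresses, batch_size=250)
print(f"Geocoded {len(results)} addresses in {elapsed:.4f} seconds")
```

Because network conditions and rate limits vary, repeating a test like this several times and comparing the typical run is more informative than a single measurement.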
Confidentiality
One other concern we had was how suitable our selected geocoder was for confidential data. Researchers at Urban sometimes work with confidential or sensitive address information, and sending these data to a third-party geocoding service over the internet presents lots of potential security concerns.
In addition, when we took a closer look at the terms of use for many of the commercial geocoding APIs, there were even more concerns about data privacy. The Bing Maps terms of use, for example, give Microsoft the right to store geocoding requests and results and transmit these data to third parties. That simply wasn’t an acceptable risk for some of the sensitive data we work with.
After discussion with our institutional review board, we decided we preferred an offline geocoding service that we could set up on our servers behind our firewalls to ensure respondent confidentiality for any study. Of all the geocoders we tested, the only one that fit this condition was ArcGIS StreetMap Premium, which consists of an offline address database you load into ArcMap/ArcGIS Pro.
Which geocoder won?
Because Esri’s StreetMap Premium geocoder was the only one that could fulfill our confidential geocoding needs, that was our first choice. We determined it would be much easier to set up one system, especially one as accurate as Esri’s, as opposed to setting up dual systems: one for nonconfidential data and another for confidential data.
We contacted Esri and received a discounted quote for an annual subscription to StreetMap Premium because we are a part of Esri’s Nonprofit Organization Program and already have an Enterprise ArcGIS license. That, combined with how well StreetMap Premium scored on our speed and accuracy tests above, convinced us this was the best geocoder for our needs.
The one disadvantage of the StreetMap Premium geocoder is that it requires some knowledge of ArcGIS, as you need to load your data into ArcMap/ArcGIS Pro before geocoding. To avoid burdening researchers with this hurdle, we created a Python script that uses the arcpy package to geocode addresses automatically given an input CSV file. We then deployed the Python script as a web application on AWS (Amazon Web Services).
Now, all researchers have to do is upload a CSV file with addresses to a website, and a few minutes (or hours, for very large jobs) later, they receive their geocoded results via email. We plan on writing a future Data@Urban post that lays out exactly how we built this system with AWS CodeStar, EC2 instances, and arcpy. But in the meantime, we’re happy to share the scripts and templates upon request!
Overall, we found StreetMap Premium to be the right mix of accurate, quick, cost efficient, and confidential for our particular use case. There are also no service limits, so we can geocode as many addresses as our servers can handle. It also helped that we already had Enterprise ArcGIS accounts and could easily integrate the geocoder into our systems.
Although the geocoding requirements for your organization may vary significantly from ours, we hope this blog post makes choosing a geocoding service a slightly less scary prospect. And in case you need it, you can download the cost, terms of use, and address test data in this blog from the Urban Data Catalog here.