SiteMonitor: A Tool for Responsible Web Scraping
“Web scraping” — using a program to collect data from a website — raises a number of issues for both researchers collecting data and the sites that host the data. The technique gives researchers access to data that would otherwise be too difficult or even impossible to collect. For example, Urban Institute researchers recently scraped and analyzed Washington, DC, court records to examine the share of court cases that result in a conviction.
When data are made publicly available but in ways that are not easy to obtain (such as when they are stored in multiple, dense text boxes on a website), what responsibilities do potential web scrapers have? At Urban, we have developed a tool called SiteMonitor to help researchers automate their data collection efforts responsibly.
Check before you scrape
Before you scrape, get to know the website you’re collecting the data from. There are different ways online data collection can be stymied. Sometimes, restrictions are explicitly stated by the hosting site in their terms of service. Here is a portion of the terms of use from the real estate website Zillow:
“Prohibited Use:
You agree not to […] conduct automated queries (including screen and database scraping, spiders, robots, crawlers, bypassing ‘captcha’ or similar precautions, and any other automated activity with the purpose of obtaining information from the Services) on the Services”
Data providers can also create robots.txt files, which tell automated tools, such as search engine crawlers, which areas of a site to avoid. These permissions, among others, must be checked before any automated collection begins. But even once there is implicit permission to collect data, there is still a responsibility to do so without negatively affecting the site itself. This is why we created a new tool to help researchers responsibly automate their data collection.
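As an illustration of the robots.txt check, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the user agent name, and the example.org URLs are all made up for this example; in practice you would point the parser at the real file with set_url() and read(). Passing this check does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Crawl-delay: 5
"""

# Hypothetical user agent name for our scraper.
USER_AGENT = "my-research-bot"


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether each URL may be fetched by our user agent.
for url in ("https://example.org/data/records.html",
            "https://example.org/search/results"):
    print(url, "allowed:", parser.can_fetch(USER_AGENT, url))

# Some sites also request a minimum delay between requests.
print("requested crawl delay:", parser.crawl_delay(USER_AGENT))
```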
SiteMonitor
We created SiteMonitor for the Criminal Background Checks and Access to Jobs report. The research team collected millions of records from the District of Columbia Courts database to study the ability of people with criminal records to return to the workforce after prison. Our goal was to allow the web collection program to be responsive to the host site itself, slowing down the program’s queries when necessary to protect the website but speeding them up when possible to reduce the data collection time.
SiteMonitor does not perform the searches for you; its only role is to dynamically determine a reasonable delay between the searches. This means it can be used alongside any Python-based web searching library you prefer, including the popular Requests and Selenium libraries. The program works by first establishing a baseline response time with a “burn-in” period, when the program creates only a small delay between searches. Though all the specific values are adjustable, the default process is as follows:
1. Use the first 100 searches to calculate the mean and standard deviation of response times.
2. If the rolling mean of the last 25 searches exceeds the mean established in step 1 by more than two standard deviations on 20 occasions, add five seconds to the delay between searches.
3. If 20 searches in a row have a rolling mean below the mean from step 1, decrease the delay by five seconds.
In other words, we search the website and calculate whether the search is slowing the site down. We adjust our search speed to find an optimal interval to conduct the full data extract. Our future searches are compared with this initial set of observations.
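The default process described above can be sketched in a few lines of Python. This is an illustrative re-implementation, not SiteMonitor's actual code: the class name, method name, and parameter names are hypothetical, the standard deviation is assumed to be the population standard deviation, and "20 times" in step 2 is read as a cumulative count rather than a consecutive run.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveDelay:
    """Illustrative sketch of the default process; not SiteMonitor's code."""

    def __init__(self, burn_in=100, window=25, choke_point=2.0,
                 trigger=20, step=5.0, burn_in_delay=1.0):
        self.burn_in = burn_in
        self.choke_point = choke_point
        self.trigger = trigger
        self.step = step
        self.delay = burn_in_delay          # delay used during burn-in
        self.baseline = []                  # burn-in response times
        self.recent = deque(maxlen=window)  # rolling window of response times
        self.base_mean = None
        self.threshold = None
        self.slow_count = 0                 # times the rolling mean was too high
        self.fast_streak = 0                # consecutive rolling means below baseline

    def record(self, response_time):
        """Log one response time (seconds) and return the delay to use next."""
        self.recent.append(response_time)

        # Step 1: collect the baseline, then compute its mean and threshold.
        if self.base_mean is None:
            self.baseline.append(response_time)
            if len(self.baseline) == self.burn_in:
                self.base_mean = mean(self.baseline)
                spread = pstdev(self.baseline)
                self.threshold = self.base_mean + self.choke_point * spread
                self.delay = 0.0
            return self.delay

        rolling = mean(self.recent)
        if rolling > self.threshold:
            # Step 2: count slow readings and back off once enough accumulate.
            self.slow_count += 1
            self.fast_streak = 0
            if self.slow_count >= self.trigger:
                self.delay += self.step
                self.slow_count = 0
        elif rolling < self.base_mean:
            # Step 3: speed up after a run of readings below the baseline mean.
            self.fast_streak += 1
            if self.fast_streak >= self.trigger:
                self.delay = max(0.0, self.delay - self.step)
                self.fast_streak = 0
        else:
            self.fast_streak = 0
        return self.delay


# Small made-up run with shortened settings so the behavior is easy to follow.
monitor = AdaptiveDelay(burn_in=4, window=2, trigger=3, step=5.0)
for seconds in [0.9, 1.1, 0.9, 1.1, 3.0, 3.0, 3.0]:
    print(seconds, "->", monitor.record(seconds))
```

In a real scraper, the value returned by record() would be passed to time.sleep() before the next request.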
All of these parameters are customizable, and SiteMonitor allows us to specify different response categories at the start. For example, we may want to track the response time of a website’s landing page and its search query. We can track these and other metrics and then display summary statistics for the user to help control the speed and length of the search.
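To make the idea of separate response categories concrete, here is a small sketch that times different kinds of actions under their own labels and then summarizes them. It does not use SiteMonitor's interface: the timed() helper, the category names, and the stand-in actions (short sleeps instead of real page loads) are assumptions for illustration only.

```python
import time
from collections import defaultdict
from statistics import mean

# Response times in seconds, grouped by a category label.
timings = defaultdict(list)


def timed(category, action):
    """Run action(), record how long it took under category, return its result."""
    start = time.perf_counter()
    result = action()
    timings[category].append(time.perf_counter() - start)
    return result


# Stand-ins for real requests; replace with calls to your scraping library.
def load_landing_page():
    time.sleep(0.01)
    return "landing page"


def run_search_query():
    time.sleep(0.02)
    return "search results"


for _ in range(3):
    timed("landing", load_landing_page)
    timed("search", run_search_query)

# Summary statistics per category.
for category, values in timings.items():
    print(f"{category}: n={len(values)}, mean={mean(values):.3f}s, max={max(values):.3f}s")
```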
How to use SiteMonitor: An example
We use SiteMonitor and the Python Requests library to perform 85 requests to the Urban Institute Education Data Portal schools directory file. To use Urban’s SiteMonitor, clone the GitHub repository or download and unzip it (the GitHub repository also includes more detailed instructions, a description of the parameters available for customization, and further examples). Ensure the site_monitor.py file is in the same working directory as your main Python program (in this case, sitemonitor_ed_demo.py) or include the path to SiteMonitor with your Python code as follows:
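One common way to make a module outside the working directory importable is to append its folder to sys.path before importing. The sketch below assumes the repository was cloned to a folder named SiteMonitor in your home directory; that path is hypothetical and should be replaced with your own.

```python
import sys
from pathlib import Path

# Hypothetical location of the cloned repository; replace with your own path.
SITEMONITOR_DIR = Path.home() / "SiteMonitor"

# Add the folder to the module search path so that Python can find
# site_monitor.py when your program imports it.
if str(SITEMONITOR_DIR) not in sys.path:
    sys.path.append(str(SITEMONITOR_DIR))
```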
If you don’t already have them, install the NumPy and Matplotlib libraries. These are SiteMonitor’s only dependencies outside of base Python, so you will need both to ensure SiteMonitor functions properly.
We can now start up and initialize SiteMonitor.
In this example, we leave SiteMonitor’s defaults in place, and only change the burn-in, which is the number of requests that SiteMonitor will use to initially define “normal” response times on the site (you can see the full list of options available on the GitHub page). During the burn-in period, SiteMonitor inserts a 10-second gap between each request to ensure that the web scraper is not the cause of any changes in latency during this benchmarking period. You can change the length of the burn-in period (in number of requests) when initializing SiteMonitor. You might want to increase the burn-in period when there’s significant variation in wait times to ensure you get a reliable baseline measure of site responsiveness.
After the burn-in period — in this case, the first 20 requests — SiteMonitor calculates a reasonable threshold for maximum response times and cuts the gap between requests to zero. The base assumption in initially removing the delays between requests is that in most scenarios, the web scraper is not affecting site latency, which we have found to be the case in most of our tests.
In this case, we define “reasonable” as two standard deviations above the mean response time calculated in the burn-in period. If the application experiences consistent response times above this threshold, SiteMonitor will start adding increasing delays until response times level off, at which point it believes the application is not affecting site latency. The “reasonable” threshold of two times the standard deviation can be configured using the choke_point parameter, where choke_point=2 is the default.
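The threshold arithmetic can be written out as a short function. This is a sketch of the formula as described in the text, not SiteMonitor's code: the function name and the sample response times are made up, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev


def response_threshold(burn_in_times, choke_point=2):
    """Mean of the burn-in response times plus choke_point standard deviations."""
    return mean(burn_in_times) + choke_point * pstdev(burn_in_times)


# Made-up burn-in response times, in seconds.
burn_in_times = [1.0, 2.0, 3.0, 4.0, 5.0]

# A larger choke_point tolerates slower responses before throttling begins.
for choke_point in (1, 2, 3):
    limit = response_threshold(burn_in_times, choke_point)
    print(f"choke_point={choke_point}: threshold={limit:.2f}s")
```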
In our case, the program never hit our threshold: the mean of the response times in the burn-in period plus two times their standard deviation. At the end of the program, you have the option to display the graph of response times, with the burn-in threshold in dashed grey and the acceptable response time threshold in dashed red. Our program did not affect the response time of the site, even with no delay between requests, so SiteMonitor never had to “throttle” our requests in order to be kind to our application programming interface. If response times had risen above the dashed red line, SiteMonitor would have inserted increasing delays, but in this case, our web scraping was not adversely affecting the website.
Automated data collection is an important tool for researchers, contributing to novel projects that would not otherwise be possible. The potential for a program to negatively affect a target site through a heavy, unmonitored load of requests, however, is a legitimate concern for the hosting site. The SiteMonitor tool can help by ensuring your searches are responsive to the burden on the host site on a search-by-search basis.