How Web Scraping Powered Our Analysis of Work-Based Learning Opportunities in Community Colleges
Community colleges play a vital role in building America’s workforce, helping students realize their career goals and providing them with the skills to meet employer needs. With career-focused curricula and diverse student populations, community colleges also offer an ideal setting for work-based learning, which consists of opportunities such as internships, apprenticeships, and practicums that focus on career preparation and training in supervised, work-like settings. However, little work has been done to measure the prevalence of work-based learning programs in community colleges, as no centralized dataset of community college course descriptions exists to answer which schools offer work-based learning programs or what types of programs exist.
In part one of this two-part blog series, we walk through our web scraping project to gather course information and build toward a comprehensive set of course descriptions. Web scraping, or the use of programming to extract text or data from a website, offers the potential for harmonizing text from thousands of course pages across hundreds of college websites. For our pilot, we used three common Python packages, each with its own advantages and limitations: Scrapy, Selenium, and BeautifulSoup.
Throughout this project, we were mindful of adhering to responsible web scraping practices. By repeatedly making calls to the same domain, we ran the risk of overwhelming the website with requests. Urban researchers have previously provided guidance on responsible web scraping, including checking a website's robots.txt file at its base domain (e.g., www.urban.org/robots.txt) for permissible scraping behavior, which we followed. They also created SiteMonitor, a tool we used to determine whether our requests were slowing a website down.
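As an illustration of this check, Python's built-in urllib.robotparser module can read a robots.txt file and report whether a given page may be fetched. The URLs below are placeholders, not pages from our project:

```python
# Minimal sketch of checking robots.txt before scraping; URLs are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser("https://www.urban.org/robots.txt")
rp.read()  # download and parse the robots.txt file

target_url = "https://www.urban.org/some-page"
if rp.can_fetch("*", target_url):  # "*" checks rules that apply to any user agent
    print("robots.txt permits scraping this page")
else:
    print("robots.txt disallows scraping this page; skip it")
```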
Below, we discuss our application of each Python package, key takeaways from the analysis, and steps we took to ensure our work was undertaken responsibly. All of the code can be viewed in our public GitHub repository.
Scrapy
Scrapy is a powerful and flexible web scraping framework designed for large-scale web scraping projects. It was a logical choice for our pilot because we needed to comb through hundreds of websites and subdomains where course descriptions might live.
With Scrapy, users can create “spiders” to perform web crawling, or the discovery of URLs from which they want to scrape or extract data. Spiders can follow links and navigate through websites, making it easier to scrape a large amount of data. Spiders also have built-in data processing capabilities, which enable users to clean, validate, and process scraped data before storing it in formats like CSV, JSON, and XML. Scrapy can handle multiple requests in parallel, an advantage for scraping large and/or many websites. Scrapy users should note that spiders can be configured to adhere to the scraping permissions outlined in a robots.txt file, keeping their use in line with responsible practices.
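To give a sense of what a spider looks like, here is a minimal, hypothetical example (not the spider from our repository) that starts from a placeholder college homepage, obeys robots.txt, and follows links whose URLs suggest a course catalog:

```python
import scrapy

class CourseCatalogSpider(scrapy.Spider):
    """Hypothetical spider that follows links suggesting a course catalog."""
    name = "course_catalog"
    start_urls = ["https://www.example-college.edu/"]  # placeholder start URL
    custom_settings = {
        "ROBOTSTXT_OBEY": True,  # respect the site's robots.txt permissions
        "DOWNLOAD_DELAY": 1.0,   # throttle requests to avoid overwhelming the site
    }

    def parse(self, response):
        # Store the page URL and title as a scraped item
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Follow links whose URLs hint at catalog or course description pages
        for link in response.css("a::attr(href)").getall():
            if "catalog" in link.lower() or "course" in link.lower():
                yield response.follow(link, callback=self.parse)
```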
In the early stages of this project, we tried to use Scrapy to concurrently traverse the websites of all community colleges in the National Center for Education Statistics' Integrated Postsecondary Education Data System (IPEDS) and scrape all pages containing the keywords "catalog" or "course description." We encountered two issues with this approach. First, differences in site navigation structure posed a web crawling issue, as finding where course descriptions were located would have required studying each individual website. Second, differences in the structures of the course description pages themselves posed a web scraping issue that would have required us to build a custom scraper to extract course information from nearly every site.
The web crawling complexity could have been mitigated by a list of root URLs, at or near our target pages, from which the spider could begin. Without this list, we had to rely on the keyword approach discussed above to find where course descriptions were located, and this method didn't yield the right set of pages for every site. Unfortunately, the scale of our research question was too large to contend with issues in both site navigation and text extraction. Because it's generally easier to scrape text from pages on the same website, we decided to focus our efforts on one state's website instead.
Although Scrapy didn’t fit our use case, we recommend it as a fast, powerful tool for projects where websites follow a similar structure, the list of pages to be scraped is already known, or there are relatively few domains to crawl but a large number of pages to scrape within those domains.
Selenium
For our narrowed focus, we chose to analyze Florida because the Florida Department of Education (FLDOE) provides a database of all postsecondary courses offered by public vocational-technical centers, community colleges, and universities, which was precisely the information we were looking for on a smaller scale. Although the information was public facing and readily accessible via FLDOE’s click-based interface, there was no option to download the text data we needed about each course in one batch.
Here, we turned to Selenium, which excels at collecting information from websites with dynamic elements, such as dropdown menus and clickable buttons. In this way, it can mimic and automate how a person would traverse a web page to find the desired information.
From FLDOE’s website, we selected an institution that was also in our IPEDS list of community colleges. Next, we loaded all the community college courses offered by that institution, navigated to each course page, and collected its HTML text. These steps represent one iteration of our “loop,” which we repeated for each course pertaining to an institution and for each institution in our list. The GIF below illustrates one pass through this loop, up until we arrive at the course description to be scraped.
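In code, one iteration of this loop looks roughly like the sketch below; the URL, element locators, and institution list are hypothetical placeholders rather than the values from our scraper. In practice, each step also needs to wait for the page to finish loading, which is where Selenium's wait utilities (discussed next) come in.

```python
# Rough sketch of the scraping loop; the URL and element locators are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

institutions = ["Example State College"]  # placeholder for our IPEDS-matched list
driver = webdriver.Chrome()
course_pages = {}

for institution in institutions:
    driver.get("https://example-course-database.org")  # placeholder URL
    # Choose the institution from a dropdown and run the search (hypothetical IDs)
    Select(driver.find_element(By.ID, "institution-dropdown")).select_by_visible_text(institution)
    driver.find_element(By.ID, "search-button").click()

    # Collect each course link on the results page, then visit it and save its HTML
    links = [a.get_attribute("href") for a in driver.find_elements(By.CSS_SELECTOR, "a.course-link")]
    for link in links:
        driver.get(link)
        course_pages[link] = driver.page_source

driver.quit()
```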
One important part of the Selenium workflow that we made full use of was the seemingly simple concept of waiting. Just as an internet user navigating a website needs to wait for certain options to appear before selecting or clicking on them, we had to build the same behavior into our Selenium code. We could have accomplished this task with a function like Python’s time.sleep(), which pauses the program for a fixed, hard-coded amount of time between actions. Although this approach was useful in certain instances, load times could vary widely when we repeatedly made calls to a webpage, so no single pause length fit every situation. And if a certain element took only a few seconds to load, a conservative 30-second wait would add hours to the total runtime needed for the thousands of courses.
Fortunately, Selenium includes the concept of an explicit wait, which allowed us to wait precisely until a specific condition was met, such as a particular button becoming clickable. Although Selenium scrapers must handle elements unique to a particular website’s quirks, the functions we wrote to click a button or select a dropdown option, combined with the logic to execute those functions only after explicit waits, translated widely and effectively. The code below sketches the kinds of functions we reused throughout our program. You can see how we called these functions in practice and what dependencies they have in our public repository.
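The sketch below shows the wait-then-act pattern those helpers follow; the element IDs and the 30-second timeout are illustrative placeholders rather than values from our code.

```python
# Helper functions illustrating explicit waits; locators and timeouts are placeholders.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC

def click_button(driver, button_id, timeout=30):
    """Wait until the button is clickable, then click it."""
    button = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.ID, button_id))
    )
    button.click()

def select_dropdown_option(driver, dropdown_id, option_text, timeout=30):
    """Wait until the dropdown is present, then choose an option by its visible text."""
    dropdown = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, dropdown_id))
    )
    Select(dropdown).select_by_visible_text(option_text)
```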
To confirm that our Selenium process wasn’t disrupting the FLDOE website, we relied on SiteMonitor. We also ran the Selenium scraper only during off-peak hours (e.g., evenings and weekends) where possible to minimize disruption. We added error-handling parameters to the code that made it trivial to pick up where we left off, which was crucial not only for interrupting the scraper when we wanted to but also for rerunning the script when the site took too long to load. These parameters also ensured that once we scraped a course page, we never had to scrape it again, reducing both the runtime and the strain placed on the website. In other words, rather than assuming the entire web scraping operation would run from beginning to end with no issues, we built safeguards into our control flow to account for planned and unplanned interruptions.
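The general shape of that resume-friendly control flow can be sketched as follows; the checkpoint file name, URL list, and scrape_course_page helper are hypothetical stand-ins:

```python
# Sketch of checkpointing so the scraper can resume without re-scraping pages.
import json
import os

CHECKPOINT_FILE = "scraped_courses.json"  # hypothetical checkpoint file
course_urls = ["https://example.org/course/101"]  # placeholder list of course pages

def scrape_course_page(url):
    """Placeholder for the Selenium/BeautifulSoup routine that scrapes one course page."""
    return {"url": url}

# Load any results saved by a previous run so we can pick up where we left off
results = {}
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE) as f:
        results = json.load(f)

for url in course_urls:
    if url in results:
        continue  # already scraped on an earlier run; skip it
    try:
        results[url] = scrape_course_page(url)
    finally:
        # Save progress after every page, even if the scrape fails or is interrupted
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump(results, f)
```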
Beautiful Soup
After using Selenium to navigate to each course, we arrived at a static page with text information about that course, including the course description. To extract text from these static sites, we used BeautifulSoup, a Python package that can pull data from HTML and XML files, including web pages. BeautifulSoup users can parse a web page for specific HTML tags and the information contained within them, such as <h1> for a particular header, <p> for a paragraph, or <table> for tables. By inspecting a particular object of interest on a webpage (which can be done by right-clicking on it then clicking “Inspect” within a web browser), a user can home in on specific tags or text and reliably extract the desired content.
With just a few lines of code, we could extract all the text from each course page and store it as a JSON object with keys (such as “Prerequisites” and “Course Description”) and values (the corresponding text for those keys).
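For illustration, the pattern looks roughly like the snippet below; the HTML is a made-up stand-in for the actual course page markup we scraped:

```python
# Illustrative only: the HTML here is a simplified stand-in for a real course page.
import json
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>ABC 1000: Example Course</h1>
  <p><b>Course Description:</b> An example course description.</p>
  <p><b>Prerequisites:</b> None</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
course = {"Course Title": soup.find("h1").get_text(strip=True)}

# Split each labeled paragraph into a key (e.g., "Prerequisites") and its value
for p in soup.find_all("p"):
    label, _, value = p.get_text(" ", strip=True).partition(":")
    course[label.strip()] = value.strip()

print(json.dumps(course, indent=2))
```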
Leveraging Selenium and BeautifulSoup for the tasks each is better suited to allowed us to flexibly handle both the dynamic and static elements of the website.
Putting it all together
In this exploratory web scraping analysis, we successfully compiled course descriptions for the state of Florida using Selenium and BeautifulSoup. The lack of other readily available, standardized state-level sources prevented us from scaling this methodology up nationwide, but we see this pilot as a promising first step for collecting course description data. In the next part of this series, we will explore how we analyzed this data using R’s quanteda package to learn more about the prevalence of work-based learning in Florida.