Illustration by Rhiannon Newman for The Urban Institute

How Urban Piloted Data Science Techniques to Collect Land-Use Reform Data

Data@Urban
10 min read · May 5, 2020

The Urban Institute’s Metropolitan Housing and Communities Policy Center and the data science team have developed a dataset that allows researchers to systematically study land-use reforms in select US metropolitan areas. A land-use reform is a change to an existing regulation that governs what can be built, where it can be built, and how. These regulations also cover local affordable housing laws, permitting fees, and development review procedures. The database contains information on land-use reform timing, type, and direction of restriction (tightening or loosening), among other variables, that can help researchers understand the effects of land-use reform at a larger scale than was previously possible. This database is different from our National Longitudinal Land-Use Survey, a three-wave survey of land-use officials in local governments conducted in 1994, 2003, and 2019.

The research and data science teams developed this database by applying natural language processing and machine learning techniques on tens of thousands of local news articles to parse the variety of land-use regulation reforms across the country and over time.

News articles from Access World News

Urban purchased access to US newspapers from 1980 to the present from Access World News, a comprehensive resource that includes a variety of news publications worldwide. Working with Access World News, we used information about their newspaper coverage to choose publications in 40 major metro areas with relatively strong news coverage and higher population growth since 1980. These are areas that might have faced pressure to expand housing supply and were home to news outlets that covered land-use changes. Because news coverage was much sparser before 2000, we built the land-use reform database from articles published from 2000 to the present.

After selecting newspapers, we needed to search millions of articles to remove those irrelevant to land use. In consultation with the research team and a group of external experts in land-use research and policy, we chose 60 keywords and their variations that would restrict the results to a reasonable number of articles. Some of the most frequently used words and phrases related to land-use reform included “zoning,” “legalizes,” “vote,” and “neighborhood.” We also combined all policies into a single, large search term and added an “AND” statement to restrict found articles to those that also included language indicating some sort of administrative or legislative action. For example, the keyword combination ((“zoning” OR “land-use”) AND (“single family homes” OR “single family dwellings” OR “single family residences”)) requires that either “zoning” or “land-use” appear in the article along with one of the phrases “single family homes,” “single family dwellings,” or “single family residences.” After applying the selected keywords, we retrieved more than 76,000 news articles related to land use from the full database.
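As a small illustration, a combined search term like the example above can be composed programmatically from keyword lists. The lists below are examples, not the project’s full set of 60 keywords and variations.

```python
# Illustrative sketch: build a boolean search string from keyword lists.
policy_terms = ['"single family homes"', '"single family dwellings"',
                '"single family residences"']
context_terms = ['"zoning"', '"land-use"']

query = "(({}) AND ({}))".format(" OR ".join(context_terms),
                                 " OR ".join(policy_terms))
print(query)
# (("zoning" OR "land-use") AND ("single family homes" OR "single family dwellings" OR "single family residences"))
```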

Manual labeling

To produce a useful dataset of reforms across 40 metro areas and more than 76,000 news articles within the time and budget constraints of this project, we initially relied on a machine learning (ML) algorithm to automatically tag the articles, rather than using a manual, expert-driven process.

The ML algorithm did need a “training set” of expert-labeled data to learn to properly tag the full set of news articles. We trained a team of four manual taggers with backgrounds in housing or land-use policy analysis to tag as many randomly selected news articles as possible (568 in total) within the project’s budget constraints. For each news article, the taggers determined whether it indicated an actual land-use reform. If the article represented a true reform, the taggers would record when and where the reform took place, the geographic extent to which the reform applied, whether the reform was major, whether the reform made the regulation more or less restrictive, and the specific topic of the reform. Topics included single-family housing, mixed residential and nonresidential development, accessory dwelling units and granny flats, and other topics that the research experts determined had significant policy implications. Although labeling a reform as “major” relied to some degree on taggers’ subjective judgment, we trained the taggers on how to differentiate among reforms. Generally, a major reform either significantly affected the whole city or was an extreme departure from existing practices. For example, reforms that disallowed or allowed an entire type of housing for the first time, or that doubled or halved allowable housing height or density, would be considered major.

Natural language processing

After manually tagging a subset of news articles for our training set, we needed to preprocess the text of all news articles before feeding them into our ML algorithms. Because of the data vendor’s requirements, the news article analysis was completed on a remote Access World News server known as the “walled garden” environment. As part of the research team’s agreement with Access World News, no news content could leave the walled garden environment. The Access World News team provided preinstalled Python and Docker utilities that we used to complete the analysis.

· CoreNLP

The research team used Stanford CoreNLP (version 3.9.1), a natural language processing toolkit, to analyze and parse the news text. The Stanford CoreNLP server was provided by Access World News technical staff as a Docker container within the walled garden environment. In Python, we read in each news article from the search results and sent the text to the CoreNLP server. CoreNLP analyzed each article word by word and returned the base forms of the words, their parts of speech, and their named-entity recognition types (whether the words were names of companies, people, dates, and so on). We cleaned the text of the articles based on the CoreNLP results, which included removing punctuation, labeling words by part of speech, and labeling words that were numbers and dates. We wrote Python code to conduct this CoreNLP analysis within the walled garden environment.
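As a rough illustration of that step, the sketch below sends an article’s text to a CoreNLP server over HTTP and collects each token’s lemma, part of speech, and named-entity tag. The server address and helper names are assumptions for this example, not the original implementation.

```python
# Minimal sketch: annotate one article with a Stanford CoreNLP server.
import json
import requests

CORENLP_URL = "http://localhost:9000"  # assumed address of the CoreNLP container
PROPERTIES = {"annotators": "tokenize,ssplit,pos,lemma,ner",
              "outputFormat": "json"}

def annotate_article(text):
    """Send one article to CoreNLP and return a list of token dictionaries."""
    response = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(PROPERTIES)},
        data=text.encode("utf-8"),
        timeout=60,
    )
    response.raise_for_status()
    parsed = response.json()

    tokens = []
    for sentence in parsed["sentences"]:
        for token in sentence["tokens"]:
            tokens.append({
                "word": token["word"],    # original token
                "lemma": token["lemma"],  # base form of the word
                "pos": token["pos"],      # part-of-speech tag
                "ner": token["ner"],      # named-entity type (DATE, ORGANIZATION, ...)
            })
    return tokens

# Example cleaning step: keep lowercased lemmas of alphabetic tokens only.
# tokens = annotate_article(article_text)
# clean_lemmas = [t["lemma"].lower() for t in tokens if t["word"].isalpha()]
```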

· TF-IDF feature extraction

Term frequency-inverse document frequency (TF-IDF), a method often used in textual analysis, evaluates how important a word is to an article in a collection of articles. In our case, we used it to extract important keywords to differentiate various news articles. After parsing the text with CoreNLP, we used TF-IDF to generate an initial list of keywords. We calculated the difference in the TF-IDF statistic for each word between one class label and the remainder of the class labels to choose top keywords as independent variables for our machine learning model.

For example, when choosing keywords to help us predict whether an article referenced a land-use reform, we conducted TF-IDF analysis separately for articles labeled “reform” and “not reform” to create two keyword lists. We then analyzed each keyword from the two lists and counted the number of articles that included the keyword in each group of articles — reform versus not reform — dividing that count by the total number of articles in each group. Next, if a word appeared in both article groups, we ranked the words by frequency difference from highest to lowest.

We took the top 15 percent of the keywords with the highest frequency differences (we chose 15 percent given the trade-off between being able to differentiate the news articles and overfitting), combined them with the words that appeared in only one group of the articles — reform versus not reform — and included them as independent variables in the machine learning model. For each dependent variable (reform or not, reform type), we generated a different list of top keywords to use as independent variables in our model. In addition to keywords, we repeated this same method for parts of speech for each predicted variable to generate a separate list of parts of speech. The independent variables for parts of speech and keywords were then combined and used to predict each dependent variable in our chain of models. We used scikit-learn, one of the most commonly used machine learning libraries in Python, for our machine learning analysis; it also offers easy-to-use tools for TF-IDF analysis. We wrote a Python function based on scikit-learn to perform this TF-IDF keyword selection.
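A simplified sketch of such a function appears below, assuming cleaned article texts and binary labels (1 for reform, 0 for not reform). The function and variable names are illustrative rather than the original implementation.

```python
# Sketch: pick discriminating keywords by comparing the two article groups.
from sklearn.feature_extraction.text import TfidfVectorizer

def select_keywords(texts, labels, top_share=0.15):
    """Rank candidate keywords by how differently they appear in 'reform'
    versus 'not reform' articles and keep the most discriminating ones."""
    reform_docs = [t.lower() for t, y in zip(texts, labels) if y == 1]
    other_docs = [t.lower() for t, y in zip(texts, labels) if y == 0]

    # TF-IDF within each group yields that group's candidate keyword list.
    reform_vocab = set(TfidfVectorizer(stop_words="english").fit(reform_docs).vocabulary_)
    other_vocab = set(TfidfVectorizer(stop_words="english").fit(other_docs).vocabulary_)

    def doc_freq(word, docs):
        # Share of articles in a group containing the word (rough substring check).
        return sum(word in d for d in docs) / len(docs)

    shared = reform_vocab & other_vocab
    only_one_group = reform_vocab ^ other_vocab  # words that appear in just one group

    # Rank shared words by the absolute difference in within-group document frequency.
    diffs = {w: abs(doc_freq(w, reform_docs) - doc_freq(w, other_docs)) for w in shared}
    ranked = sorted(diffs, key=diffs.get, reverse=True)
    top_k = int(len(ranked) * top_share)

    # Keep the top 15 percent of shared words plus the single-group words.
    return set(ranked[:top_k]) | only_one_group
```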

Machine learning

· Creating machine learning features (variables)

Our goal was to use a machine learning model to tag news articles without the need to manually review more than 76,000 articles. In our case, the machine learning model consisted of the tag as the dependent variable — for example, whether the news article indicated a land-use reform (1 if yes, 0 if no) — and hundreds of feature variables (independent variables), which we generated from several sources.

The number of independent variables varies for each dependent variable predicted in our chain of models. We generated independent variables from the following sources (a sketch of how such features might be assembled follows the list):

1) Using the top keywords and parts of speech we generated from the TF-IDF analysis, we created independent variables indicating how frequently each appeared in an article (the number of occurrences divided by the length of the news article in words).

2) We used the list of manually selected keywords from the search query related to specific housing reforms — single-family home, minimum lot size, subdivision rules, and so on — and created dummy variables that take the value of 1 if they appear in the news article and 0 otherwise. This technique helped us to predict the tag of the reform.

3) We also included dummy variables that indicate if certain named entities, which include organization names, dates, location names, and ordinal numbers, appear in the article.

4) Using the part of speech variables from CoreNLP, we calculated the percentage of words in the article that were past-, present-, or future-tense verbs.
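Putting these pieces together, the sketch below shows how such features might be assembled for a single article, reusing the CoreNLP tokens from the earlier step. The keyword lists, entity types, and function names are placeholders rather than the project’s actual ones, and the tense shares are rough proxies based on Penn Treebank tags.

```python
# Sketch: build the four groups of features for one article.
def build_features(tokens, tfidf_keywords, search_keywords, entity_types):
    """Return a dict of feature name -> value for a single article."""
    lemmas = [t["lemma"].lower() for t in tokens]
    n_words = max(len(lemmas), 1)
    features = {}

    # 1) Relative frequency of each TF-IDF-selected keyword.
    for kw in tfidf_keywords:
        features[f"freq_{kw}"] = lemmas.count(kw) / n_words

    # 2) Dummy variables for the manually selected search keywords
    #    (e.g., "single-family home", "minimum lot size").
    text = " ".join(lemmas)
    for kw in search_keywords:
        features[f"has_{kw.replace(' ', '_')}"] = int(kw in text)

    # 3) Dummies for whether certain named-entity types appear at all.
    ner_tags = {t["ner"] for t in tokens}
    for ent in entity_types:  # e.g., ["ORGANIZATION", "DATE", "LOCATION", "ORDINAL"]
        features[f"ner_{ent.lower()}"] = int(ent in ner_tags)

    # 4) Share of words that are past-, present-, or future-tense verbs
    #    (future tense approximated by "will"/"shall" modals).
    pos_tags = [t["pos"] for t in tokens]
    features["pct_past"] = sum(p in ("VBD", "VBN") for p in pos_tags) / n_words
    features["pct_present"] = sum(p in ("VBP", "VBZ", "VBG") for p in pos_tags) / n_words
    features["pct_future"] = sum(t["word"].lower() in ("will", "shall") for t in tokens) / n_words

    return features
```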

· Machine learning models

We randomly divided the 568 news articles into a training set and a test set: 80 percent of the observations went into the training set and 20 percent into the test set. Tests were performed on the training set using 10-fold cross-validation, a procedure in which we split the dataset into 10 random pieces, each with the same number of rows, and generated accuracy scores by iterating through each piece. The first time around, we treated the first piece as a “test” dataset with which we tested the accuracy of our model, and we treated the other 9 pieces as our “training” dataset, from which the machine learning model would learn the patterns in the data and predict the results for the “test” piece. We repeated this procedure so that each of the 10 pieces served once as the “test” piece, then calculated the average accuracy across all 10 runs to estimate out-of-sample accuracy and avoid overfitting. We used Python code built on scikit-learn to conduct the cross-validation and train the models.
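A minimal sketch of that workflow appears below, assuming a feature matrix X and a label vector y built from the 568 tagged articles. The 80/20 split mirrors the text; the random_state values are illustrative choices, not the project’s settings.

```python
# Sketch: 80/20 split plus 10-fold cross-validation on the training set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)

# Estimate out-of-sample accuracy with 10-fold cross-validation.
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
print("Mean cross-validation accuracy:", cv_scores.mean())
```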

For binary predictions where the dependent variable takes the value of 0 or 1 — for example, when we predicted whether a news article was relevant to land-use reform — we used a random forest algorithm, a common machine learning algorithm for classification. A random forest, as its name implies, consists of a large number of decision trees. Each decision tree operates relatively independently of the others and gives its own “yes” or “no” prediction. The random forest model then consolidates the answers from all the trees to produce a final prediction. In this case, we used a random forest model with 2,000 decision trees and the default scikit-learn parameters from version 0.20. For multiclass predictions where the dependent variable can take more than two values — for example, when we predicted which specific type of land-use reform a news article discussed — we used the same random forest algorithm with the same parameters. Both classification accuracy and confusion matrices, tables that report the number of false positives, false negatives, true positives, and true negatives from the prediction results, indicated that our models performed well.
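Continuing from the split in the sketch above, the snippet below shows how fitting the 2,000-tree random forest and checking both measures might look; again, this is an illustrative sketch rather than the project’s exact code.

```python
# Sketch: fit the 2,000-tree forest and evaluate on the held-out test set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

model = RandomForestClassifier(n_estimators=2000, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))

# Rows are true classes, columns are predicted classes; for the binary case
# this reports true/false positives and negatives.
print(confusion_matrix(y_test, y_pred))
```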

· Manually cleaning the data

Ultimately, we determined the machine learning algorithm performed well, but it was not accurate enough to inform policy analysis directly. Because we wanted to predict several outcomes from the articles, even our relatively high 75–91 percent accuracy rates meant the odds that all fields in a given row were accurate were still quite low. For instance, if we predict six fields in a given row, each with a 75 percent accuracy rate (and assume prediction errors are independent), the probability that every field in that row is correct is 0.75⁶, or just 18 percent.

But we determined that the model did work well enough to narrow down the set of articles that we could have experts manually tag to a much smaller number: thousands as opposed to tens of thousands. Our experts manually tagged these remaining articles for a smaller number of metro areas with the best coverage, correcting the machine learning model as necessary or leaving the predictions intact and moving on quickly where it had gotten the predictions right.

Future work and lessons learned

The land-use reform database is currently being used successfully in Urban’s research. Christina Stacy, a senior research associate in the Metropolitan Housing and Communities Policy Center, is studying the effects of reforms on housing supply to better equip government leaders with evidence to help them understand how to address housing supply and affordability issues.

Leveraging natural language processing with machine learning is challenging. CoreNLP, itself a collection of statistical algorithms, is not 100 percent accurate, especially when it comes to differentiating news articles in complicated policy contexts that even human experts may not agree on. In future work, applying more advanced NLP algorithms, such as transfer learning and related deep-learning techniques provided by organizations like fast.ai, could increase accuracy and automate the process more fully, if not completely. We are optimistic that these newer models, which were nascent technologies or simply unavailable at the time we began this work, combined with more hyperparameter tuning and better feature selection, could provide much more accurate predictions in the future.

Using news articles to collect land-use reform information also has its downsides. Coverage for a given metro area is incomplete because news coverage has grown over time and some cities and unincorporated county areas lack dedicated newspapers, both generally and within the Access World News database. Additionally, land-use reforms usually don’t make the news except as notices of public hearings, so news coverage is a relatively weak and inconsistent indicator of reform activity. Creating accurate coverage statistics to better understand this mismatch and how it affects our results will help in future work.

-Vivian (Sihan) Zheng

Want to learn more? Sign up for the Data@Urban newsletter.


Data@Urban

Data@Urban is a place to explore the code, data, products, and processes that bring Urban Institute research to life.