How to Categorize Large Amounts of Text Data
In the Urban Institute’s Connecting Police to Communities project, we’ve been working to identify best practices for law enforcement use of social media by examining tweets from police and the public that document their interactions and behavior on Twitter. We collected millions of tweets containing the words “cop” or “police” to help us determine public sentiment toward police. We also collected more than 500,000 tweets from over 300 law enforcement agencies to determine how police use social media.
Once we collected all the tweets, we needed to categorize them by either sentiment or topic. Traditionally, senior social science researchers would assign one or more research assistants the unenviable task of manually categorizing each tweet. Given the number of tweets we collected, we quickly realized that we didn’t have the time or resources for someone to read and categorize each one, so we decided to develop supervised machine learning models to perform the same task automatically. In this post, we explain how we developed a model to classify text data, along with some troubleshooting tips and the lessons we learned.
What Is Machine Learning?
The rise of online publications, reviews, and social media has created a trove of easily accessible text data for researchers and data scientists to use to study human behavior and social phenomena.
For researchers used to analyzing standardized tabular data, deriving meaning from unstructured and sometimes messy text is difficult. For example, movie studios looking for early audience feedback may want to categorize the topic of tweets or the sentiment of reviews, but manually reading and sorting them would take too long and would be prohibitively expensive. Instead, analysts typically use supervised machine learning techniques to classify each piece of text data in a fraction of the time.
In machine learning, systems learn from and detect patterns in data without being explicitly programmed by a researcher. Although the modeling process resembles fitting traditional regression models, machine learning practitioners typically use different language than social scientists. The biggest distinction is the use of the word “features” to describe the variables in a model.
There are generally two types of machine learning models — unsupervised and supervised. In unsupervised machine learning, no outcome is identified, and the model clusters observations together based on patterns, similar to a factor analysis. In supervised machine learning, the researcher specifies the outcome and uses regression or classification algorithms to predict this outcome variable using the input features. For our text classification tasks, we developed two supervised machine learning models built for classification. One model classified the sentiment of public tweets toward police, and the other classified the topic of law enforcement tweets.
Developing a Text Classification Model
Text classification is a multistep process, and the steps included in your final process depend on the specific problem you are trying to solve. In other words, you may not get the best possible results by simply re-creating our features and models. This is because the performance of any given model depends on its ability to detect patterns in the input text, and different collections of text typically have different patterns. The following steps formed the basis for our text analysis and provide general guidelines for any text classification process.
Step 1: Develop features for the model. In machine learning, features are the independent or explanatory variables used to predict the outcome variable. The features we created for analyzing Twitter data (and the features created for similar analyses) will typically fall into one of two categories: content or metadata.
Content: Many features in the model will be based on the words in the text, but the text often needs some processing before the words can be converted into features. Analyzing text is typically a two-step procedure:
1. Preprocess the text data. Preprocessing standardizes your text as much as possible so that the model can more easily recognize general patterns. For example, you may want to remove “stop words,” common words that may add unnecessary noise to the data and don’t have much meaning on their own — for example, “and,” “in,” “or,” and “the.” Similarly, you can stem or lemmatize words to convert them to their root form (e.g., “walks,” “walked,” and “walking” become “walk”) so that the model recognizes them as the same word. Other preprocessing steps include converting the text to lowercase letters and inserting placeholders for or removing punctuation, numbers, and symbols. Rather than removing numbers, replacing them with a placeholder was helpful for categorizing police tweets by topic, because the presence of numbers helped predict certain categories (e.g., when addresses were included in crime incident updates).
2. Create a document-term matrix. In a document-term matrix, each row represents a unit of text (e.g., a tweet or a review), and each column represents a word used in the collection of text data. Each cell typically records how many times that word appears in that unit of text. For example, the two tweets “Police training downtown tonight” and “Police out here with tasers tonight” would produce a matrix with two rows in which the “police” and “tonight” columns contain a 1 in both rows, while the “training” column contains a 1 only in the first row.
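The two procedures above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code we ran for the project. The stop word list contains only the four examples named above, the placeholder name numplaceholder is hypothetical, and crude_stem is a deliberately simple stand-in for a real stemmer or lemmatizer.

```python
import re
from collections import Counter

# Only the stop words named in the text; real stop word lists are much longer.
STOP_WORDS = {"and", "in", "or", "the"}
# Hypothetical placeholder that marks where a number appeared.
NUMBER_TOKEN = "numplaceholder"


def crude_stem(word):
    """Strip a few common suffixes (a rough stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, replace numbers, drop punctuation and stop words, then stem."""
    text = text.lower()
    text = re.sub(r"\d+", f" {NUMBER_TOKEN} ", text)  # keep a marker for numbers
    text = re.sub(r"[^a-z\s]", " ", text)             # drop punctuation and symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


def document_term_matrix(documents):
    """Return the sorted vocabulary and one row of word counts per document."""
    token_lists = [preprocess(d) for d in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


tweets = ["Police training downtown tonight", "Police out here with tasers tonight"]
vocabulary, matrix = document_term_matrix(tweets)
print(vocabulary)
for row in matrix:
    print(row)
```

Because the sketch stems each word, “training” and “tasers” appear in the vocabulary as “train” and “taser”. In practice, a library vectorizer and a longer stop word list would likely replace these hand-written helpers.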
Metadata: External information about the text often improves the model’s performance as much as the text itself. When working with Twitter data, it may be helpful to include information about the time of day or day of the week the tweet was posted or the number of retweets the post received. You can also extract information from the text using advanced text analytics tools like CoreNLP, which allow the user to extract information about named entities, overall sentiment, and parts of speech. All this information can be made into variables, or predictors, that the model will use to identify patterns among the different categories.
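As a rough sketch of how metadata can become features, the snippet below derives the time of day, the day of the week, and the retweet count from a single tweet record. The record layout, including the created_at and retweet_count keys and the ISO timestamp format, is an assumption made for illustration and may not match the data you collect.

```python
from datetime import datetime


def metadata_features(tweet):
    """Turn one tweet record (a dict, hypothetical layout) into numeric features."""
    posted = datetime.fromisoformat(tweet["created_at"])
    return {
        "hour_of_day": posted.hour,
        "day_of_week": posted.weekday(),           # Monday is 0, Sunday is 6
        "is_weekend": int(posted.weekday() >= 5),  # 1 for Saturday or Sunday
        "retweet_count": tweet["retweet_count"],
    }


# Invented example record, for illustration only.
example = {"created_at": "2017-03-04T21:15:00", "retweet_count": 12}
features = metadata_features(example)
print(features)
```

Each key in the returned dictionary would become one column alongside the word columns of the document-term matrix.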
Step 2: Select a random sample of the data that will be used to develop the model. The more observations included, the better the algorithm will be able to identify patterns in the data. In our experience, analyzing text data often requires a large amount of computing power, so we recommend you include only as much data as your system can support.
Step 3: Manually determine the outcome of interest for each observation in the sample dataset if your outcome measure is not already available. We were interested in both the sentiment of public tweets about police and the topics law enforcement agencies discuss on Twitter. For the sentiment classification task, we manually coded 4,050 public tweets about police into four sentiment categories: positive, negative, neutral, and not applicable. For the second question, we coded about 800 tweets into five topics: crime or incident information, department or event information, traffic or weather updates, person identification, and other. Although it helps to set some guidelines for what is included in each category, sometimes (as with sentiment) subjectivity is unavoidable. If possible, have more than one person review the same tweets to ensure the coding is consistent with your guidelines.
[Example tweets omitted. The original post showed example tweets for these categories: negative sentiment, positive sentiment, crime or incident information, department or event information, and person identification or request for information.]
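One way to check whether two coders are applying the guidelines consistently is to compare their labels for the same tweets. The sketch below computes simple percent agreement and Cohen's kappa, a standard statistic that adjusts agreement for chance. Kappa is our suggestion here rather than something the project reports, and the labels are invented for illustration.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of items that both coders labeled identically."""
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Agreement between two coders, adjusted for agreement expected by chance."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    # Chance agreement: probability both coders pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels from two hypothetical coders for the same six tweets.
coder_a = ["negative", "neutral", "neutral", "positive", "neutral", "negative"]
coder_b = ["negative", "neutral", "negative", "positive", "neutral", "neutral"]
agreement = percent_agreement(coder_a, coder_b)
kappa = cohens_kappa(coder_a, coder_b)
print(round(agreement, 3), round(kappa, 3))
```

A low value on either measure suggests the category guidelines need tightening before more tweets are coded.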
Step 4: Split the data sample into a training set and a test set. Developing and testing your models on separate datasets reduces the chance of picking a model that works well only on a specific set of text and does not generalize to the broader collection. As is typical in machine learning tasks, we used a randomly selected subset of the manually coded data as a training set to develop our models and then used the remainder of the data to test the performance of those models “out of sample.” There is no formal rule on what proportion of data should be in each set, but a common divide is 70 to 80 percent in the training set and 20 to 30 percent in the test set. Our training set included 80 percent of the observations, while the test set had 20 percent. Ideally, if the data are split randomly, the outcomes will be balanced across these sets. For example, if 60 percent of the tweets in the training set are negative, the test set should have a similar share of negative tweets.
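A random 80/20 split, followed by a check on how the outcomes are distributed, might look like the sketch below. The function names, the seed, and the labeled sample are invented for illustration. The sample is built so that 60 percent of the tweets are negative, mirroring the example above.

```python
import random
from collections import Counter


def train_test_split(records, test_share=0.2, seed=42):
    """Shuffle the records reproducibly and hold out test_share of them."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def class_shares(records):
    """Share of each label among (text, label) records."""
    counts = Counter(label for _, label in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Invented sample: 100 coded tweets, 60 percent negative and 40 percent neutral.
coded = [(f"tweet {i}", "negative" if i % 5 < 3 else "neutral") for i in range(100)]
train, test = train_test_split(coded)
print(len(train), len(test))
print(class_shares(train))
print(class_shares(test))
```

With a small sample, a purely random split can leave the two sets with noticeably different shares, so it is worth printing the shares as shown. A stratified split is a common alternative when a category is rare.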
Step 5: Develop multiple models using different algorithms on the training set. There are many machine learning algorithms available to test. Depending on the text classification task, some algorithms may perform better than others, so it is worth testing several. Packages like caret in R and scikit-learn in Python make this process straightforward. For both of our classification tasks, we tried several algorithms, including random forest classification, logistic regression, k-nearest neighbors, and naïve Bayes. The final model in both tasks ended up using gradient boosting, but other algorithms may work better for you. Each model typically has hyperparameters (values that are set before the model is run and place limits on the model structure) that you can adjust to improve the model’s accuracy based on your data. For example, you may want to test whether having different numbers of trees in your random forest model improves performance.
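The sketch below shows one way to compare several algorithms and tune a hyperparameter with scikit-learn. It is not the project's code. The twelve short texts and their two labels are invented purely so the example runs, and the specific settings (three cross-validation folds, 10 versus 50 trees) are arbitrary assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Invented toy data for illustration only; not tweets from the project.
texts = [
    "road closed after crash on main street",
    "heavy traffic on the bridge this morning",
    "icy roads tonight please drive slowly",
    "lane closure on highway due to flooding",
    "traffic lights out at the downtown intersection",
    "expect delays on the highway after a crash",
    "join us for coffee with a cop on saturday",
    "community meeting at the station tonight",
    "our officers visited the elementary school today",
    "come meet the new chief at the open house",
    "register now for the citizens police academy",
    "thank you to everyone who joined our charity run",
]
labels = ["traffic"] * 6 + ["event"] * 6

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": MultinomialNB(),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Mean cross-validated accuracy for each algorithm on the same features.
scores = {}
for name, model in candidates.items():
    pipeline = make_pipeline(CountVectorizer(), model)
    scores[name] = cross_val_score(pipeline, texts, labels, cv=3).mean()
    print(f"{name}: {scores[name]:.2f}")

# Tune one hyperparameter: the number of trees in the random forest.
search = GridSearchCV(
    make_pipeline(CountVectorizer(), RandomForestClassifier(random_state=0)),
    param_grid={"randomforestclassifier__n_estimators": [10, 50]},
    cv=3,
)
search.fit(texts, labels)
print(search.best_params_)
```

Putting the vectorizer inside the pipeline means the vocabulary is rebuilt from the training folds only, so no information from the held-out fold leaks into the features. With a sample this small the scores mean little. The point is the workflow.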
Step 6: Pick the top-performing model and test it on the test set. Many performance metrics, including overall accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve, can be used to assess model performance. If the model performs about the same on the test set as it did on the training set (in other words, if the results generalize), we select the model as our final production model. If the model performs far worse on the test set, it likely picked up on patterns specific to the training set, a problem called “overfitting.” In this case, a new model needs to be developed, and we provide some troubleshooting suggestions in the next section.
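Three of the metrics named above can be computed directly from the true and predicted labels, as in this sketch. With more than two categories, sensitivity and specificity are computed here for one category at a time against all the others, which is an assumption about how you would want to report them. The labels are invented.

```python
def accuracy(y_true, y_pred):
    """Share of observations whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred, positive):
    """Share of actual `positive` cases that the model labeled `positive`."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions) / len(predictions)


def specificity(y_true, y_pred, positive):
    """Share of all other cases that the model did not label `positive`."""
    predictions = [p for t, p in zip(y_true, y_pred) if t != positive]
    return sum(p != positive for p in predictions) / len(predictions)


# Invented true and predicted labels for six tweets.
y_true = ["negative", "negative", "positive", "neutral", "negative", "positive"]
y_pred = ["negative", "positive", "positive", "neutral", "neutral", "positive"]
print(accuracy(y_true, y_pred))
print(sensitivity(y_true, y_pred, "negative"))
print(specificity(y_true, y_pred, "negative"))
```

Computing the same metrics on both the training set and the test set makes a large gap between the two easy to spot.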
Step 7: Predict the outcomes of the remaining observations using the final model. Run the final model on the rest of the dataset, which has no manual labels (in our case, hundreds of thousands of tweets for topic classification and millions for sentiment classification), to predict the classification of every tweet.
Troubleshooting Model Development
In our task, as in many modeling tasks, there is no single strict cutoff for an acceptable level of accuracy. If the outcome being predicted is complex or the text data are extremely noisy, an accuracy level that is only slightly better than random chance might still be a major accomplishment. To save time and frustration, it’s helpful to specify an acceptable performance level at the outset of model development or troubleshooting. As reference points, Stanford University’s sentiment classifier of movie reviews has an accuracy of 81 percent, and a sentiment classification of tweets about the 2012 US presidential election achieved an accuracy of 59 percent.
Because tweets tend to be short and noisy, and because we were predicting four possible outcomes rather than two, we decided we would accept a model if its accuracy exceeded the accuracy we would get by classifying every tweet as the most common category. In our case, the most common sentiment category was neutral, which accounted for 59 percent of tweets, so that baseline was 59 percent. Our final model achieved an overall accuracy of 63 percent, an improvement of 4 percentage points over the baseline.
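The baseline described here is simple to compute: find the most common category and measure its share of the coded sample. In the sketch below, the 59 percent neutral share and the 63 percent model accuracy come from our results, while the split among the other three categories is invented to fill out the example.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# 59 percent neutral matches the text; the remaining shares are invented.
labels = (
    ["neutral"] * 59
    + ["negative"] * 30
    + ["not applicable"] * 7
    + ["positive"] * 4
)
baseline_label, baseline_accuracy = majority_baseline(labels)
model_accuracy = 0.63  # overall accuracy of the final model, from the text

print(baseline_label, baseline_accuracy)
print("model beats baseline:", model_accuracy > baseline_accuracy)
```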
If the accuracy of the final model is insufficient, you may be able to improve the performance with the following set of steps, which we used to improve our model’s accuracy.
Step 1: Make the training and test sets bigger by manually coding additional observations. More observations provide the algorithm more opportunities to detect patterns in the data, especially if certain outcomes appear infrequently. In our sentiment classification task, positive tweets were rare, representing fewer than 5 percent of all tweets. To improve the classification of positive tweets, we manually coded a larger sample of positive tweets so that the model could detect stronger patterns.
Step 2: Create more features. Text data are rich with information that can be transformed into features, and this is where researchers can get creative. After reading many tweets about the police, we detected several patterns in “not applicable” tweets that we transformed into features. We discovered when the word “cop” was used as a verb — “anybody wanna cop these for me?” — the tweet was usually not referencing the police. To account for this, we included a variable in our model indicating whether “cop” was used as a verb, using CoreNLP to identify the part of speech of each word in the tweet. In general, the researcher’s expertise in a subject and observations about the data are important to developing features that help accurately predict the outcome.
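The “cop” feature can be sketched as a small function that takes part-of-speech-tagged tokens and returns an indicator. The sketch assumes the tagger emits Penn Treebank tags, in which every verb tag begins with “VB”. The tagged tokens below were written by hand for illustration and are not CoreNLP output.

```python
def cop_used_as_verb(tagged_tokens):
    """Return True if the word "cop" appears with a verb tag (VB, VBD, VBZ, ...)."""
    return any(
        word.lower() == "cop" and tag.startswith("VB")
        for word, tag in tagged_tokens
    )


# Hand-written (word, tag) pairs; a real pipeline would get these from a tagger.
verb_tweet = [
    ("anybody", "NN"), ("wanna", "VBP"), ("cop", "VB"),
    ("these", "DT"), ("for", "IN"), ("me", "PRP"),
]
noun_tweet = [
    ("the", "DT"), ("cop", "NN"), ("waved", "VBD"), ("at", "IN"), ("us", "PRP"),
]

print(int(cop_used_as_verb(verb_tweet)))
print(int(cop_used_as_verb(noun_tweet)))
```

The resulting 0 or 1 becomes one more column in the feature set, next to the word counts and metadata.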
Step 3: Try other algorithm types or specifications. No single algorithm works best for every problem, so try several algorithms to find the one that works best. Support vector machines, naïve Bayes, random forests, and gradient boosting are often used for machine learning and text classification tasks, and they are good first steps for any text classification analysis. We also recommend adjusting each model’s hyperparameters to optimize performance.
Lessons Learned
We learned three key lessons for running large-scale machine learning and natural language processing analyses.
Lesson 1: Stop somewhere. Machine learning is faster than manually coding millions of observations, but it too takes time. Preprocessing text data requires many steps, especially for less structured text, such as tweets. Feature creation can continue as long as a researcher has new ideas for variables that might predict the outcome. Selecting and calibrating a final model can go on indefinitely, and memory and computing limits can present constraints. We had to stop somewhere and complete our work, but learning and mastering these methods saves time in subsequent tasks.
Lesson 2: Expert knowledge is key. There is no one-size-fits-all approach to text classification. The preprocessing, feature development, and model decisions you make will vary depending on the problem you’re trying to solve. This is because the performance of your final model depends on its ability to detect the patterns in your text, and different collections of text have different patterns. Knowing your subject matter and thinking creatively are crucial for developing a well-performing model.
Lesson 3: You can’t be perfect. The final model will not be perfect. Text data, especially tweets, are messy, and deriving meaning from sentence fragments or short bits of commentary can be difficult. Classifying something subjective like sentiment can be even more challenging. A tweet might be sarcastically expressing a negative opinion, but in the absence of knowledge about that user or the context of the tweet, a researcher or a model might classify the same text as positive sentiment.
Despite these limitations, we have found that text classification is helpful and cost-efficient for identifying general themes and patterns in text. We hope these tips help social scientists take a more direct route to unlocking the value of text as data to inform policy.