R Is the Best Programming Language for Innovation at Urban
I’ve placed my bet on R as the future of research innovation at the Urban Institute.
Here at Urban, we analyze data to create innovative, timely, and impactful research on social and economic policy, and programming is at the core of this data analysis. Recently, because of the fantastic efforts of our R Users Group, R surpassed SAS to become the second-most-popular programming language among our most active programmers at Urban, behind only Stata.
As the leader of the Data Science team, I don’t write this lightly, though I’m not above a GIF to lighten the mood. It took a lot of thought, good advice from trusted colleagues across disciplines, and failed experiments to get here.
But if Urban is to continue delivering top-quality social science research and transforming how we deliver evidence to changemakers and the public to elevate the policy debate, we need the best tools at our disposal.
Though I believe that researchers should always choose the right programming language for the right job (a phrase I borrowed from Urban’s research programming director, Jessica Kelly), I’ve concluded that R is the programming tool researchers most need in their toolkits.
Why is R the best tool for research innovation at Urban?
For more on how I think about transforming social science and innovation, please see my posts on data science and exciting developments.
Of course, there are other ways R enables innovation, and many of the benefits listed below stem from the incredible rate of package development in R rather than from the programming language itself. Still, the list below represents the most popular ways R and its package ecosystem have provided immediate benefits to Stata and SAS users at Urban.
- Data visualization. R allows our researchers to quickly create customizable visualizations in the Urban style, and they can easily build effective chart types, such as small multiples and charts with custom annotation (see the first sketch after this list). Although possible in SAS and Stata, these visualizations typically take more work — though Urban has developed programs to make chart production in the Urban style much easier in those languages, too. In a great example, longtime Stata user and data viz expert Jon Schwabish recently learned R over two days to expand his knowledge of programmatic visualization.
- Big data and big processing efficiency. Every week, the Technology and Data Science teams field questions from Stata and SAS users who want to speed up their programs. Whether the task involves big data or a large array of regression analyses, those languages have their limits. We always try to find solutions within the language first (such as redirecting to the Stata temp directory or moving users to our more powerful AWS instances), but the most transformative change is to use R and, specifically, the incredible array of packages that connect R to external services and extend its capabilities. We can use R and Apache Spark to process terabytes of data in minutes. We can use R and our elastic AWS environments to parallelize a 30-hour job down to 30 minutes (see the parallelization sketch after this list). In Stata, we can realistically reduce that 30-hour job to 5 to 10 hours at best. And SAS often requires an additional license — more money — to improve performance.
- Geospatial analysis. The suite of geospatial visualization and analysis tools in R is head and shoulders above what SAS and Stata offer and is often easier and more reproducible than ArcGIS. Urban researchers often switch to R solely to use urbnmapr, our mapping package, to create maps of the US in an easily reproducible way (see the mapping sketch after this list). And complex geospatial analyses that would otherwise have to move out of Stata or SAS, into ArcGIS or QGIS, and back into Stata or SAS again can all be integrated into a single script, reducing the potential for error and increasing efficiency.
- Accessing and analyzing nontraditional data. Stata has trouble accessing APIs, and although SAS’s recent updates finally make it possible, it is not always simple. Most web scraping, or automating the collection of data from the web, is not really feasible in SAS or Stata. And although Stata and SAS have some limited capability to process text, R and Python offer far more flexibility for most tasks. A team of Urban researchers proficient in Stata, R, and Python chose R and Python for a recent text analysis and machine learning project in part because of Stata’s text analysis and big data limitations. APIs also make it easy to download data from many sources, such as the US Census Bureau and our Education Data Portal, and to conduct reproducible research from the very first step: collecting the data from the original source (see the API sketch after this list).
- In general, we’ve found learning R to be a relatively smooth transition for SAS and Stata users. You can do all of these tasks in Python, and we are big fans of Python on the Data Science team. But in conducting trainings for both R and Python, we’ve found that parts of R are very easy for existing Stata and SAS users to pick up. RStudio makes R a much smoother transition from Stata and SAS, whereas Python does not yet have a similar environment with such universal uptake and support. And although many research assistants come to Urban with some R skills, few have a background in Python.
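To make the data visualization point concrete, here is a minimal small multiples sketch using ggplot2 and its built-in mpg data. The Urban styling itself would come from our internal theme package, which I leave out here, so theme_minimal() is just a stand-in.

```r
# A minimal small multiples sketch with ggplot2's built-in mpg data.
# Urban's house style would normally come from an internal theme package;
# theme_minimal() stands in for it here.
library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~class) +  # one small panel per vehicle class
  labs(
    title = "Highway fuel economy by engine size",
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon"
  ) +
  theme_minimal()
```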
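For the parallelization point, here is a rough sketch of spreading a large batch of regressions across cores with base R’s parallel package. The data frame (analysis_data) and variable names are hypothetical stand-ins, and sparklyr plays the equivalent role when the bottleneck is data size rather than computation.

```r
# A sketch of running many regressions in parallel with base R's parallel
# package. The data frame (analysis_data) and variable names are hypothetical.
library(parallel)

outcomes <- c("earnings", "employment", "hours")  # one model per outcome

fit_one <- function(outcome) {
  f <- reformulate(c("age", "education", "state"), response = outcome)
  lm(f, data = analysis_data)
}

# mclapply() forks across cores on Linux and macOS; on Windows, use
# parLapply() with a cluster from makeCluster() instead.
fits <- mclapply(outcomes, fit_one, mc.cores = detectCores() - 1)
```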
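For the mapping point, here is a sketch in the spirit of the urbnmapr README, assuming the package’s bundled states data frame with its long, lat, and group columns; the constant grey fill stands in for real data.

```r
# A minimal US map sketch with urbnmapr's bundled state boundaries.
# The constant grey fill stands in for real data.
library(ggplot2)
library(urbnmapr)

ggplot(states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey", color = "white", size = 0.25) +
  # coord_map() needs the mapproj package installed for the Albers projection
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
  theme_void()
```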
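And for the API point, here is a sketch of pulling data from a public API directly into an R session with httr and jsonlite. The Census endpoint and variable code below are illustrative; check the Census Bureau’s API documentation for current values.

```r
# A sketch of downloading data from a public API with httr and jsonlite.
# The endpoint and variable code are illustrative.
library(httr)
library(jsonlite)

resp <- GET(
  "https://api.census.gov/data/2019/acs/acs5",
  query = list(get = "NAME,B01001_001E", `for` = "state:*")
)
stop_for_status(resp)

# The response is a JSON array whose first row holds the column names
raw <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
acs <- as.data.frame(raw[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(acs) <- raw[1, ]
head(acs)
```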
Reasons why you shouldn’t switch to R
R is great, but it’s not the best language for everything. The following are arguments I’ve heard from people promoting R, both at Urban and externally, that wind up sounding like counterarguments to our typical SAS and Stata users. Because they turn off potential R converts, and because they’re an important part of the decision to add R to a research toolkit, I’d like to address them here.
- It’s free. True, and this is a very effective argument for people who see the budgets, like me — I mean, have you seen your organization’s SAS bill? But at many large organizations like Urban, SAS and Stata licenses are covered by the institution, so researchers never see those costs directly. And free comes with its own cost — although we can call SAS or Stata support, there is no similar option for R (though our R Users Group has worked hard to fill that role here at Urban).
- It’s the hot new language everyone is using, so you should too. What’s that about your old codebase again? As my colleague Jon Schwabish pointed out, the move toward R in universities may simply be the next step in the evolution of statistical programming — from Fortran to C to SAS to Stata and now to R and Python. As a result, finding young SAS programmers is proving more difficult for researchers here at Urban. But will today’s R or Python code still run in five years, when the principal investigator (PI) needs to rerun it? And will there be another hot new language in five years, leaving the PI struggling to find research assistants to refactor the original code?
- You need to switch to R completely to see major benefits. A full switch does pay off for many researchers, but simply adding R to an existing Stata or SAS workflow provides major benefits on its own. Tools such as ggplot2, the geospatial packages, and reproducible fact sheets often deliver a lot of value at a far lower cost than refactoring old code and learning an entirely new language from end to end.
- R is better for reproducibility. There are a lot of tools for reproducibility, especially in Stata — such as dyndoc and markstat. And reproducibility is much more about documentation, data management, preregistration, version control, and practices that exist outside of any programming language. Although R makes connecting to these tools easier — and I would argue that R Markdown is more powerful than most alternatives (see the sketch after this list) — this is not a large and immediate benefit that our researchers see. The one caveat is that, because we’ve built such a strong internal ecosystem of R Markdown packages in the Urban style, this sentiment has begun to change.
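To give a sense of why R Markdown has started to change that sentiment, here is a minimal, generic skeleton of the kind of document our internal packages build on; the title and the chunk contents are placeholders, and the Urban-style templates themselves are not shown.

````markdown
---
title: "State fact sheet (placeholder)"
output: html_document
---

Text, code, and charts live in one document that renders to the final report.

```{r unemployment-chart, echo=FALSE}
# The built-in economics data stands in for real analysis output
library(ggplot2)
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(x = NULL, y = "Unemployed (thousands)")
```
````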
I personally started programming in Stata and SAS
Though my preferred language is Python, I started as a SAS and Stata programmer.
I was trained in Stata as an undergraduate economics major and picked up VBA for Excel along the way as part of my dabbling in modern portfolio theory and finance. When I got to Urban, I programmed daily in SAS and Stata and became quite proficient at both — I know my way around `egen` and `preserve` as well as I do around a `DATA` step or `PROC SQL` command.
During that time, I began learning both Python and R — R for data visualization and Python for simulations and web scraping. Since then, I’ve learned a lot more, from text analysis to machine learning to reproducible workflows. So I’ve gone through the same progression many of our researchers face and have some experience making the transition.
Now, my Data Science team programs across all of these languages (and more), but when we get to choose, we typically use Python or R. For our recent Education Data Portal, we produced both an R package and a Stata package, writing more than 1,000 lines of Mata so that Stata users can easily access our API. We’ve also released a responsible web scraping tool in Python and supported the analysis of joined Home Mortgage Disclosure Act and American Community Survey data in SAS.
Why do I mention this? Because I strongly believe that…
…R is not the best language for everything
But I believe it is the right language for many researchers doing work at the intersection of data science and social science here at Urban.
We support projects in all languages and choose the right language for the job. Often that’s SAS, Python, Stata or Mata, or something else, like Fortran, and we have no problem with that. But I have found that R is both the easiest language to learn and the quickest and most effective path to getting researchers to adopt new tools — such as text analysis, big data, geospatial analysis, new visualizations, and new analysis techniques.
As SAS and Stata innovate in response to competition from R, my opinion could change. In any case, as the pace of innovation accelerates, enabling researchers to access new tools and techniques will help Urban produce new, timelier, better, and transformative data and analysis that will better inform policymaking, elevate the debate, and help social and economic policies achieve their highest potential.