Illustration by Alysheia Shaw-Dansby

How We’re Bringing R to Serverless Cloud Computing

Data@Urban
Jan 11, 2024

Over the past several years, two technologies have undergirded many of the most innovative research projects and tools developed by the Urban Institute: the R programming language and Amazon Web Services (AWS) Lambda. Despite their respective prevalence, R and Lambda have largely existed in isolation because Lambda does not natively support R. To address this barrier and allow researchers to leverage the capabilities of serverless cloud computing, we’ve adopted a simple framework for running R code using AWS Lambda. In this post, we describe our framework and share how we’ve applied it across projects.

What are R and AWS Lambda?

R is a programming language designed for statistical computing that has quickly become ubiquitous among researchers and data scientists within and beyond Urban. In previous Data@Urban posts, Aaron Williams discussed efforts to build an R community at Urban, and Graham MacDonald predicted that R would be the future of research innovation at Urban. Since then, this prediction has been borne out: Urban staff have used R to build websites measuring upward mobility across communities, parameterized reports sharing technical results with local government officials, fact pages reporting state fiscal briefs and the effects of COVID-19 on state budgets, dashboards exploring food system resiliency, and more.

AWS Lambda is an event-driven, serverless cloud computing service that lets users run code without having to spend time managing costly and unwieldy servers. Lambda enables Urban’s developers to quickly build and deploy scalable applications cost-effectively. Because the service uses a pay-per-use model, Urban pays only for the number of requests and the compute time its functions actually use, not for idle time as with traditional server-based applications. AWS also includes one million Lambda requests and 400,000 GB-seconds of compute time per month in its free tier, which covers Urban’s current use across all ongoing projects.
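To make the free-tier figures concrete, here is a rough back-of-the-envelope sketch. Lambda measures compute as memory (in GB) multiplied by execution time (in seconds), so a month of usage can be compared against the two allowances quoted above. The workload numbers in the usage note are made up for illustration and are not Urban's actual usage; the function names are also our own.

```python
from typing import Tuple

# Monthly free-tier allowances quoted in the paragraph above.
FREE_REQUESTS_PER_MONTH = 1_000_000
FREE_GB_SECONDS_PER_MONTH = 400_000


def monthly_usage(invocations: int, avg_duration_s: float, memory_mb: int) -> Tuple[int, float]:
    """Return (requests, GB-seconds) for a month of identical invocations."""
    # GB-seconds = number of invocations x seconds per invocation x memory in GB.
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    return invocations, gb_seconds


def within_free_tier(invocations: int, avg_duration_s: float, memory_mb: int) -> bool:
    """True when both the request count and the compute time fit the allowances."""
    requests, gb_seconds = monthly_usage(invocations, avg_duration_s, memory_mb)
    return (
        requests <= FREE_REQUESTS_PER_MONTH
        and gb_seconds <= FREE_GB_SECONDS_PER_MONTH
    )
```

For example, a hypothetical workload of 200,000 invocations averaging 2 seconds at 512 MB uses 200,000 GB-seconds and fits, while the same number of invocations averaging 5 seconds at 1,024 MB uses 1,000,000 GB-seconds and does not.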

AWS Lambda is essential to several tools that Urban has developed, including the Education Data Portal summary endpoints, where API requests trigger a Lambda function to put aggregated statistics at the fingertips of users within seconds. Lambda also powers the Spatial Equity Data Tool, where users can upload data files to trigger a Lambda function that computes estimates of demographic disparities. Urban has also used Lambda to automate data quality checks, run microsimulation models, expand access to confidential data, provide researchers with on-demand access to cloud computing resources, and more.

To date, Lambda natively supports only a handful of programming languages: Java, Go, PowerShell, Node.js, C#, Python, and Ruby. As a result, Urban researchers have typically used Lambda through Python, which poses a barrier to staff who are more comfortable using R or to ongoing projects with existing codebases written in R.

How can you integrate R with AWS Lambda?

At a high level, we’ve built a custom Lambda runtime (or execution environment) with both R and Python installed and used the rpy2 package to bridge the two languages. We walk through this process below, and you can find a full demo in this GitHub repository that uses the AWS Serverless Application Model (SAM) to define all AWS resources as code.

In 2020, Amazon announced container image support for Lambda, which lets users package custom runtimes as Docker images. For our use case, we create a Dockerfile, starting with the AWS Lambda base image for Python 3.10. In the Dockerfile, we then install R along with any R and Python dependencies. In our demo, the R dependencies are the aws.s3 and dplyr packages. These are installed using precompiled binaries from Posit Public Package Manager, which drastically reduces installation times. Through a standard requirements.txt file, the Dockerfile also installs Python dependencies, most importantly rpy2, an interface to R embedded in Python processes. The Dockerfile then copies the application code and dependencies to the Lambda task root directory.

Once this Dockerfile is defined, we can build the image and deploy it to Amazon Elastic Container Registry (ECR). At that point, it will be available as a runtime for Lambda functions.

In our demo, an R script called utils.R contains a single function, parity, that takes an integer as input and returns a string indicating whether the integer is even or odd.

Then, the Lambda handler, written in Python, leverages the rpy2 r instance to read the utils.R file, pass the Lambda event payload into the parity function, and return the output.
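The handler described above can be sketched roughly as follows. This is a minimal illustration and not the code from our repository: the `number` payload key, the `load_r_function` and `make_handler` helper names, and the shape of the return value are all assumptions, and the R function shown in the comment is a guess at what utils.R might contain. The rpy2 calls used here (`robjects.r`, `robjects.globalenv`) are part of rpy2's high-level interface.

```python
from typing import Any, Callable, Dict

# Assumed contents of utils.R (illustrative only):
#   parity <- function(number) {
#     if (number %% 2 == 0) "even" else "odd"
#   }


def load_r_function(script_path: str, name: str) -> Callable[[int], str]:
    """Source an R script with rpy2 and wrap one of its functions for Python."""
    # Imported lazily so this module can still be imported where R is absent.
    import rpy2.robjects as robjects

    robjects.r["source"](script_path)      # evaluate the script in R's global env
    r_function = robjects.globalenv[name]  # look up the R function by name

    def call(number: int) -> str:
        result = r_function(number)  # rpy2 converts the int to an R vector
        return str(result[0])        # first element of the character vector

    return call


def make_handler(parity: Callable[[int], str]) -> Callable[[Dict[str, Any], Any], Dict[str, str]]:
    """Build a Lambda handler that passes the event payload to `parity`."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
        number = int(event["number"])  # "number" is an assumed payload key
        return {"parity": parity(number)}

    return handler


# Inside the container image, the handler would be wired to R once per cold start:
# handler = make_handler(load_r_function("utils.R", "parity"))
```

Separating the R loading step from the handler logic is a design choice made for this sketch: it lets the handler be exercised locally with a plain Python stand-in for `parity`, without needing R installed.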

By swapping in a different R script and adjusting the relevant function arguments in this example, R programmers can package virtually any R code into a Lambda function and gain the benefits of serverless cloud computing without overhauling their existing workflows or codebases.

We also experimented with the lambdr R package (and include syntax for using this package with the SAM framework in our GitHub repository), but we ultimately found the framework described above to be more flexible and robust for our use cases, which often involve using various invocation types and leveraging Python’s rich ecosystem of AWS integrations and developer tools alongside R. The example parity R function is borrowed from the lambdr documentation.

How have we applied this framework across projects?

We initially developed this framework while building an updated validation server that implements a computationally intensive differential privacy algorithm to expand access to confidential data. Improving upon an earlier prototype, the updated tool allows users to submit R scripts with tabular and regression analyses and receive results that are slightly altered to automatically preserve privacy. After a user submits their R script to the system, hundreds or thousands of Lambda functions are simultaneously triggered to implement the algorithm, which requires running the R script hundreds of thousands of times on subsets of the data. Through the framework described in the previous section, this tool leverages the scalability and efficiency of Lambda while still letting researchers develop their analyses using R.
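The fan-out pattern described above can be illustrated with a small sketch. This is not the validation server's actual implementation: the payload keys (`first_run`, `last_run`), the batch sizes, and the function names are hypothetical. The only AWS call used is the Lambda `invoke` operation with `InvocationType="Event"`, which requests an asynchronous invocation.

```python
import json
from typing import Any, Dict, Iterator


def plan_batches(total_runs: int, runs_per_invocation: int) -> Iterator[Dict[str, int]]:
    """Yield one payload per invocation, each covering a contiguous range of runs."""
    if runs_per_invocation < 1:
        raise ValueError("runs_per_invocation must be at least 1")
    for first in range(0, total_runs, runs_per_invocation):
        last = min(first + runs_per_invocation, total_runs) - 1
        yield {"first_run": first, "last_run": last}


def fan_out(lambda_client: Any, function_name: str,
            total_runs: int, runs_per_invocation: int) -> int:
    """Asynchronously invoke the function once per batch; return the batch count."""
    sent = 0
    for batch in plan_batches(total_runs, runs_per_invocation):
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous: do not wait for the result
            Payload=json.dumps(batch).encode("utf-8"),
        )
        sent += 1
    return sent
```

With boto3, `lambda_client` would be `boto3.client("lambda")`. Under these assumptions, splitting 100,000 runs into batches of 100 would trigger 1,000 invocations, each of which would then run the R script on its own range of runs.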

We’ve also applied this framework to help automate data collection tasks. Each month, Urban research assistants manually check various government websites to see if the relevant agency released new data files. If so, they run R scripts to download the new data, perform simple transformations, and generate summary tables. Now, with our R to Lambda framework, a set of Lambda functions is automatically triggered each month to run these R scripts and send the researchers an email verifying that they ran successfully, with links to the data files and summary tables stored on S3, AWS’s object storage. This process required minimal changes to the research team’s existing code, and additional R scripts can be incorporated with just a few updates to configuration files. In the future, the research team can also extend their R scripts to add data quality checks, data visualizations, analysis, modeling, and more, all without having to learn new programming languages or cloud frameworks.

We believe that bringing R to serverless computing allows more programmers to benefit from advances in cloud computing, and we’re excited about the growing number of opportunities we’re seeing for research innovation at the intersection of these technologies.

-Erika Tyagi

Want to learn more? Sign up for the Data@Urban newsletter.