In a partnership between Urban Institute researchers and the Tech & Data team, we have started moving some of Urban’s many microsimulation models into the cloud. Instead of limiting us to a few alternative policy scenarios, cloud computing enables us to run more models faster, longer, and with more data. But the transition from the standard desktop approach to the cloud was not without its challenges.
In a previous post, we discussed the importance of scaling microsimulation models and described the cloud-based system we created to run the Urban-Brookings Tax Policy Center’s (TPC) microsimulation model thousands of times. In this post, we dig into the challenges we faced during that work.
For more details on how we implemented the full cloud workflow, see the technical paper.
Challenge 1: Creating stability and being cost-effective with compute resources
One of the most appealing advantages of the cloud is how easy it is to change the type of instance being used; users can select from a whole list of options, from nano (very small) to 2xl (very large). The team went through several phases of dialing in the compute instances used as a result of bottlenecks and the speed of the TPC model itself.
When we first implemented the cloud-based approach, the model would fail seemingly at random. Because we were running many copies of the model simultaneously, every copy was trying to read the same input files at the same time. In theory, this is fine, as read access doesn’t change the file contents, but, in practice, doing this many times caused instability. Looking deeper, we discovered that multiple reads of the input data files were sometimes causing file locks. To remedy the bottlenecks, we simplified the code structures that call the TPC model and included a copy of the input file in each Docker container, but this also meant that we needed a more powerful, and more expensive, compute option.
When it comes to the speed of the model code itself, one of our major goals was to minimize code changes in the model as we moved from dozens to thousands of model runs. To that end, we settled on a maximum number of runs per day while examining the trade-offs in cost. We opted for slightly more powerful compute options with enough memory and bandwidth to read input and run the TPC model, rather than optimizing the model code. Were we to further optimize the model itself, we could likely use smaller instance types and reduce compute costs. Yet this could make the TPC model more difficult to maintain, as it would introduce more complicated code than the straightforward Fortran currently used, and modifying the code could end up costing more than we would save in compute resources.
Challenge 2: Defining the model options
To define all the options and levers in a single model run, a researcher must supply a large set of parameters. For the TPC model, these are specified in a comma-separated file that the researcher fills in manually: the researcher opens the file in Excel, modifies the parameters, saves it, and then supplies it to the model program. We found that doing this even 20 times by hand was inefficient and error prone.
One of our first steps, therefore, was to create code that would programmatically generate a set of parameter files. Instead of manual edits, a researcher simply writes a small loop of code in Python to generate the set of parameters they require. And by generating parameter files through this process, it’s easier to see which parameters have been changed when reviewing results and when checking that all the required changes have been made.
A small example of this `ParamUpdater` code in a GitHub repository and the code snippet below illustrate how this might look. The snippet shows a simplified example of code a researcher might write to create a set of modified parameter files. In this case, the code increments the value of the standard deduction (the parameter `STANDARD`) in steps of 100, from its original value up to 2,000. This code is extremely short and readable; all the code that handles reading and writing to a CSV or database is abstracted away from the portion a researcher would write and is called automatically via functions inside the `ParamUpdater` module, in the call to `params.write_modified` at line 11 in the example below.
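Independent of the actual `ParamUpdater` code, the general idea of looping over a parameter and writing one file per value can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the two-column `name,value` file layout, the `generate_runs` helper, the `OTHER` parameter, the starting value of 1,500, and the `example_run` prefix are all hypothetical, and the `write_modified` function here only borrows its name from the text; it is not the `ParamUpdater` API.

```python
import csv
import tempfile
from pathlib import Path


def write_modified(base_rows, changes, out_path):
    """Write a copy of the base parameters with `changes` applied (hypothetical helper)."""
    unknown = set(changes) - {row["name"] for row in base_rows}
    if unknown:
        # Fail loudly on a misspelled parameter instead of silently ignoring it.
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "value"])
        writer.writeheader()
        for row in base_rows:
            value = changes.get(row["name"], row["value"])
            writer.writerow({"name": row["name"], "value": value})


def generate_runs(base_rows, name, start, stop, step, out_dir, prefix):
    """Create one parameter file per value of `name`, from start to stop inclusive."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, value in enumerate(range(start, stop + 1, step)):
        path = out_dir / f"{prefix}_{i:04d}.csv"
        write_modified(base_rows, {name: value}, path)
        paths.append(path)
    return paths


# Made-up base parameters for illustration; a real parameter file has many more rows.
BASE = [{"name": "STANDARD", "value": 1500}, {"name": "OTHER", "value": 7}]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = generate_runs(BASE, "STANDARD", 1500, 2000, 100, tmp, "example_run")
        print(len(files), "parameter files written")
```

Keeping the file handling inside `write_modified` means the part a researcher edits is only the final call, which is the property the text describes.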
Two other important things happen at this stage. First, the associated parameter files are saved to a cloud-based location, and the underlying `ParamUpdater` code keeps track of the changes made to each parameter file (relative to the provided base parameter file). Second, these changes are added to a database table. Both happen systematically, without the need for a researcher to write specialized code. A description of how we use the resulting table follows.
Challenge 3: Creating analysis datasets
The final step of model analysis is to evaluate the model’s outcomes. Here, we faced two simultaneous challenges: developing code that would create the summary results and keeping those results linked to the policy input parameters.
To address the first challenge, creating summarized results, we moved the summary code out of the model itself, where it had historically resided, to a separate Python routine. Here, the Tech & Data team worked with the research team to ensure the code was efficient and the outputs were accurate. You might notice that this broke our rule of not changing the model code, but both teams felt that creating this new code base was a relatively light lift and would result in a significant gain to the research team. Using Python for this task meant we could take advantage of multiple cores on the computers handling the processing, significantly reducing processing time.
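One way to spread per-run summaries across cores in Python is a process pool from the standard library. The sketch below is illustrative only and is not the routine described above: the `summarize_run` and `summarize_all` names, the `(weight, tax)` record layout, and the choice of a weighted average as the summary measure are all assumptions.

```python
from concurrent.futures import ProcessPoolExecutor


def summarize_run(item):
    """Compute a weighted average tax for one run.

    `item` is (run_id, records), where records is a list of (weight, tax) pairs.
    """
    run_id, records = item
    total_weight = sum(w for w, _ in records)
    average_tax = sum(w * t for w, t in records) / total_weight
    return run_id, average_tax


def summarize_all(runs, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Summarize every run, distributing runs across the pool's workers."""
    with executor_cls(max_workers=max_workers) as pool:
        # Each run is independent, so the runs can be mapped in parallel.
        return dict(pool.map(summarize_run, runs.items()))
```

Calling `summarize_all(runs)` with a dictionary that maps run names to record lists returns one summary value per run. Because each run is summarized independently, the work parallelizes cleanly; on platforms that start worker processes by spawning, the call should sit under an `if __name__ == "__main__":` guard.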
The second challenge, keeping results linked to the parameter file options, is addressed in three ways. First, we maintain naming conventions on the sets of model runs so they can be easily organized and retrieved. In the `ParamUpdater`, we add a prefix to the group of runs. For example, a group of runs might be called `tcja_ctc_run_1000`. Second, we built a database to store information about the runs and indicate which parameters were changed. Finally, summary results are calculated for each simulation. In the end, researchers receive a single dataset that contains all the information about each run. These data were then used to build the visualizations in our new web feature.
There are three steps to creating and completing these database tables:
Step 1: Create the table and insert values where parameters are different
The `Inputs` table is created in the `ParamUpdater` step, and values are inserted any time a value is changed from the base table. In this example, we have changed only the value of `StandardDeduction` in each run.
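The step above can be sketched with an in-memory SQLite database. This is a minimal sketch, not the project's actual schema or database: the `inputs` table layout, the column names `run_id`, `input_variable`, and `input_value`, and the helper names are assumptions, and the parameter values are made up.

```python
import sqlite3


def diff_params(base, modified):
    """Return {name: new_value} for parameters whose value differs from the base."""
    return {k: v for k, v in modified.items() if base.get(k) != v}


def record_changes(conn, run_id, base, modified):
    """Insert one row per changed parameter into a hypothetical inputs table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inputs "
        "(run_id TEXT, input_variable TEXT, input_value REAL)"
    )
    changes = diff_params(base, modified)
    conn.executemany(
        "INSERT INTO inputs VALUES (?, ?, ?)",
        [(run_id, name, value) for name, value in changes.items()],
    )
    conn.commit()
    return changes


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    base = {"StandardDeduction": 1500, "OtherParameter": 7}  # made-up values
    record_changes(conn, "example_run_0001", base, {**base, "StandardDeduction": 1600})
    print(conn.execute("SELECT * FROM inputs").fetchall())
```

Storing only the differences keeps the table small even when each parameter file has many rows, and it makes the changed parameters for any run a simple query.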
Step 2: Upon simulation completion, calculate summary measures
The information in the `Inputs` table is used to create the column(s) based on `Input Variable` and `Input Value`. A `Results` table is then created, and summary measures such as `AverageTax` are calculated and saved.
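The reshaping described in this step, from one row per changed parameter to one row per run, might look like the following sketch. The `build_results` function and the tuple and dictionary layouts are hypothetical, and the numbers in the demo are made up rather than model output.

```python
def build_results(input_rows, summaries):
    """Combine long-format input rows with per-run summaries, one row per run.

    input_rows: iterable of (run_id, input_variable, input_value)
    summaries:  {run_id: {"AverageTax": value, ...}}
    """
    results = {run_id: {"run_id": run_id} for run_id in summaries}
    for run_id, variable, value in input_rows:
        if run_id in results:
            # Each distinct input variable becomes its own column.
            results[run_id][variable] = value
    for run_id, measures in summaries.items():
        results[run_id].update(measures)
    return [results[run_id] for run_id in sorted(results)]


if __name__ == "__main__":
    # Illustrative values only.
    inputs = [
        ("example_run_0001", "StandardDeduction", 1600),
        ("example_run_0002", "StandardDeduction", 1700),
    ]
    summaries = {
        "example_run_0001": {"AverageTax": 100.0},
        "example_run_0002": {"AverageTax": 90.0},
    }
    for row in build_results(inputs, summaries):
        print(row)
```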
Step 3: Return analysis dataframe
The `Results` table is returned to the researcher for download. This enables the researcher to easily analyze thousands of runs, as each run is simply one row in the final table. If any additional detail is needed, the researcher can go back to the microdata or revise the summary output measures.
We learned many lessons throughout this work, but perhaps the most important is to remember the project’s objectives. Here, our guiding principle was to keep the model code stable, and our secondary priority was to make the analysis easy to use. These priorities guided our decisions when selecting compute resources and designing the analysis step. By understanding how the results would be created and used, we were able to design a system that is effective, meets the researcher’s needs, and is easy to use.