What is Data Science @ the Urban Institute?
What would you say… you do here?
— Bob Slydell, Office Space
As chief data scientist at the Urban Institute, I get asked this question a lot, from both internal and external partners. It typically takes a few forms:
What exactly is a data scientist? What is it you do? Can you give me some concrete examples?
These are great questions. I think because the term data scientist is so new, and encompasses so much, data science has come to mean so very little to many people. For example, even with years in the field, I still struggle to comprehend the meaning of the data science Venn diagram.
And I’m not alone. In 2013, Forbes acknowledged that “There is more or less a consensus about the lack of consensus regarding a clear definition of data science,” and more recent articles tend to sing the same tune. When writing this blog post, for example, I ran a simple Google search for the data science diagram, which recommended three very different alternative searches: “data engineer,” “data analyst,” and “unicorn.”
Though many have made credible attempts, we are no closer to a clear definition across the field. In this post, I do not attempt to solve this issue by suggesting a clear definition of data science or data scientists in general. What I will do, however, is describe how I define data science at the Urban Institute, how our team works, and what it is we say… we do here.
Isn’t Everyone at Urban a Data Scientist?
Although there’s no real consensus on what it means to be a data scientist, when describing my role at Urban, I find it helpful to first describe Urban itself because I believe it’s different from many organizations with data science teams.
Urban is currently home to approximately 500 people, about 300 of whom are researchers working across different policy areas, ranging from justice to health to taxes and so on. With so many data-savvy researchers, you could say that most Urban staff are data scientists. Most researchers check the boxes of domain knowledge, programming skills, and advanced statistical training that are common on other organizations’ data science teams.
Then What is Data Science at Urban?
So what does it mean to be a data scientist in an organization full of folks who could be data scientists, but who go by the more traditional title of “researcher”? I define data science a bit differently (in a purposely circular fashion), as follows:
Data science (n.), Urban Institute definition: Skill sets at the intersection of computer science, programming, technology, and statistics that have traditionally been outside of the standard set of training for many social science researchers.
Currently, this list of skills includes technologies and techniques such as:
- Leveraging cloud computing
- Big data analysis (e.g., Apache Spark)
- Unsupervised learning, such as cluster analysis
- Fuzzy matching techniques
- Task automation, such as reproducible fact sheets in R
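To make one item on this list concrete, here is a minimal sketch of fuzzy matching, the problem of linking records whose names are spelled slightly differently. It uses only Python's standard-library difflib and is an illustration, not a description of the tools our team uses. The function names, the sample organization names, and the 0.8 threshold are all hypothetical choices for this example.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query: str, candidates: list[str], threshold: float = 0.8):
    """Return (candidate, score) for the closest candidate.

    The candidate is None when nothing scores at or above the threshold.
    The threshold is an assumption and should be tuned on real data.
    """
    if not candidates:
        return None, 0.0
    score, match = max((similarity(query, c), c) for c in candidates)
    return (match, score) if score >= threshold else (None, score)


if __name__ == "__main__":
    # Hypothetical records from two datasets that spell names differently.
    known = ["Urban Institute", "Brookings Institution"]
    print(best_match("urban  institute!", known))
    print(best_match("Completely Different Org", known))
```

In practice, the normalization rules and the threshold matter more than the scoring function, and both usually need to be checked by hand against a sample of known matches.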
Clever readers have no doubt noticed that by defining data science as a set of skills most researchers don’t possess, data science here at Urban must by definition change over time as researchers absorb some data science skills. This is absolutely the case, and it is, in fact, one of the goals of our data science team.
We want to do one thing really well — to help researchers become more effective when working with data. Becoming more effective with data means generating more timely insights, exploring completely new datasets in new ways, revolutionizing simulation modeling, and so on. We want researchers to learn new data science skills so they can improve the quantity and quality of evidence necessary to create better policy, strengthen communities, and improve people’s lives.
OK, but you don’t do all that data science work yourself, do you?
No. Interest in data science at Urban is increasing rapidly, and thankfully, I have a growing team that helps me apply these skills institute-wide.
At Urban, the Data Science team I lead sits separately from the research centers, in the Office of Technology and Data Science (the department formerly known as IT). We partner with researchers across the Urban Institute’s many policy centers on various research projects, using the methods and techniques listed in the previous section. Not every member of the Data Science team needs to have all of the skills listed, but coming in with at least a couple of them, and learning the rest on the job over time, is crucial.
Because it’s difficult for one person to keep up with cutting-edge developments in all of these fields, team members generally have complementary skill sets. As team leader, I must be familiar enough with all of these skills to talk about them with researchers and external partners, though, fortunately for me, I don’t always have to be the best at them. I just have to understand them and how they might be applied to solve the problem at hand.
Generally speaking, some of the most important foundational skills for data scientists at Urban include the ability to code in at least one programming language (typically Python, R, or both), good organizational skills, clear and constant communication, and the ability to master new concepts and techniques quickly.
Depending on the project, my team advises, teaches, programs, and/or reviews results in partnership with the research teams. We also build cutting-edge tools in the cloud that simplify researchers’ access to new tools or techniques, such as Spark for Social Science, single-compute big data processes, and natural language processing systems.
Whenever possible, we disseminate these tools to teams outside Urban by providing introductory blog posts and making code and tools open-source. We believe these technologies, methods, and tools have incredible potential to transform social science research, but I’ve written and spoken about them elsewhere, so I won’t go into further detail here.
What are Some Concrete Examples?
At a high level, this definition of data science, Urban’s Data Science team, and the list of the types of projects we work on usually make some sense to people. But I typically get these follow-up questions:
So what does this look like in practice? What are some examples?
Besides the projects I’ve linked to above (which I recommend you look at!), here are three additional examples of projects the data science team is working on, or has worked on recently:
1) Using natural language processing and machine learning on news articles to identify major local reforms: As part of an ongoing project, researchers in Urban’s Metropolitan Housing and Communities Policy Center want to estimate the impact of zoning reform on housing supply across a number of metropolitan areas in the US. Getting this historical data from the thousands of municipalities in these metro areas is virtually impossible, so we are using data from over 2,000 local news sources to identify local reforms. Applying natural language processing and machine learning, we can tag articles that mention major reforms and add relevant metadata, such as whether the article mentions parking, height limits, or other characteristics. Using these methods, we hope to generate a first-of-its-kind dataset that will unlock researchers’ ability to answer a number of crucial but as yet unexplored questions about the effects of zoning reforms.
2) Scraping court records to inform criminal background check policy: Researchers in our Justice Policy Center wanted to produce an estimate of the number of people in Washington, DC, who might have a record appear in their criminal background check. The only issue was, they didn’t have the data they needed. The Data Science team worked with the research team to collect data from the DC Superior Court’s online search tool, which allowed the researchers to produce a first-of-its-kind statistic about criminal background checks in DC.
3) Using big data systems to quickly analyze large administrative data: I love this example, as it received a number of audible “Wow!” responses from the audience when I told them how Apache Spark can take a 500-hour single-threaded programming job and speed it up to 10 minutes in a large, optimized cluster (to be fair, I’ve presented this concept at places like the Administrative Data Research Facilities conference, which typically include like-minded groups of fellow data enthusiasts). Using our internal Spark for Social Science system, I have taught and provided example code to Housing Finance Policy Center researchers, who regularly process, merge, and extract large US property-level datasets that are hundreds of gigabytes in size in just minutes.
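For readers who want a feel for the article tagging in the first example, here is a deliberately simplified sketch. It is not our pipeline: it swaps the natural language processing and machine learning models for plain keyword patterns, just to show what tagging an article with reform-related metadata can look like. The pattern lists, the function name, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical keyword patterns standing in for a trained model.
REFORM_PATTERN = re.compile(
    r"\b(rezon\w*|zoning (reform|change|overhaul|amendment)s?)\b", re.IGNORECASE
)
TOPIC_PATTERNS = {
    "parking": re.compile(r"\bparking\b", re.IGNORECASE),
    "height_limits": re.compile(r"\bheight (limit|restriction|cap)s?\b", re.IGNORECASE),
}


def tag_article(text: str) -> dict:
    """Flag whether an article mentions a zoning reform and list the topics it mentions."""
    return {
        "mentions_reform": bool(REFORM_PATTERN.search(text)),
        "topics": sorted(
            topic for topic, pattern in TOPIC_PATTERNS.items() if pattern.search(text)
        ),
    }


if __name__ == "__main__":
    # Made-up sentence, not taken from any real news source.
    sample = (
        "City council approves zoning reform that eliminates parking "
        "minimums and raises height limits downtown."
    )
    print(tag_article(sample))
```

A keyword approach like this will miss paraphrases and flag irrelevant articles, which is one reason a project of this kind turns to machine learning instead.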
We have worked on, and are working on, many more projects than just these three, too many to fit into a short blog post! However, I hope these give you a flavor of the work we do and the exciting possibilities the field is just starting to discover at the intersection of data science and public policy.
Stay in touch: Data Science @ Urban is constantly evolving
We find the constantly evolving nature of data science at Urban exciting. It means that the Data Science team is essential to Urban’s core mission, and that we are constantly innovating, learning new skills, and adding to (and subtracting from) our definition of data science over time. As that definition changes, we hope you’ll tune in to Data@Urban to learn about our latest methods, technologies, and processes, and learn along with us.