Three Ways to Link Disparate Data Sources to Better Understand People’s Financial Well-Being
A person’s financial well-being is nuanced, encompassing many different metrics and situations. A single dataset rarely paints a complete picture of people’s financial lives. Therefore, building a holistic understanding of financial well-being often requires linking data from several disparate sources.
But making robust connections between discrete datasets is rarely straightforward. Inconsistencies in the populations of interest, survey sampling, timing of data collection, units of analysis, and variable definitions complicate the smooth linkage of datasets. To address these challenges and ultimately inform the development of policies, products, and services that improve people’s financial lives, the Urban Institute launched the Financial Well-Being Data Hub, which leverages the power of linked data across disparate data sources.
Here, we explain three ways Urban researchers have linked or supplemented disparate datasets through the Financial Well-Being Data Hub to unlock new insights about households’ financial lives. The examples show how these strategies have been tailored to answer specific research questions and can serve as guideposts to applying the methodologies in your own work.
1. Individual linkage: Unique IDs
Generally considered the gold standard of data linkage, unique identifiers shared across sources offer a straightforward way to join datasets. These identifiers take the form of alphanumeric sequences that replace an individual’s personally identifying information. Unique IDs offer the highest level of data-linkage quality and versatility and should be used whenever they are available.
However, different data sources generally don’t collect information from the same individuals, so one-to-one matching with unique IDs is often impossible. When data can be linked by unique IDs, researchers have an opportunity to explore more-granular questions and understand how policies and other factors affect individuals over time.
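As a rough illustration of linkage by unique ID, the sketch below joins two small tables on a shared pseudonymous identifier and reports how many records matched. The field names (person_id, debt_in_collections, afs_loans), the records, and the helper functions are all hypothetical and are not drawn from any Urban dataset or codebase.

```python
# Hypothetical records keyed by a provider-issued pseudonymous ID.
credit_records = [
    {"person_id": "a1f3", "debt_in_collections": 1200},
    {"person_id": "b7c9", "debt_in_collections": 0},
    {"person_id": "d4e2", "debt_in_collections": 450},
]
afs_records = [
    {"person_id": "a1f3", "afs_loans": 2},
    {"person_id": "d4e2", "afs_loans": 1},
    {"person_id": "z9y8", "afs_loans": 3},
]


def index_unique(rows, key):
    """Index rows by key, failing loudly if the key is not unique."""
    index = {}
    for row in rows:
        if row[key] in index:
            raise ValueError(f"duplicate {key}: {row[key]}")
        index[row[key]] = row
    return index


def link_by_id(left, right, key):
    """One-to-one inner join of two lists of dicts on a shared ID."""
    left_index = index_unique(left, key)
    right_index = index_unique(right, key)
    return [
        {**left_index[k], **right_index[k]}
        for k in left_index
        if k in right_index
    ]


linked = link_by_id(credit_records, afs_records, "person_id")

# Share of credit records that found a partner in the AFS data.
match_rate = len(linked) / len(credit_records)
```

Checking that the ID is unique in both tables before joining guards against silently duplicating rows, and the match rate is a quick indicator of how much of each source survives the link.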
Research Spotlight: Impacts of the Illinois Regulation on Consumer Credit and Debt
In Urban’s 2023 report on the effects of the Illinois Predatory Loan Prevention Act (PLPA), researchers Kassandra Martinchek and Noah Johnson used data linked by unique IDs to assess how state-level policies on alternative financial service (AFS) loan APRs (annual percentage rates) shape consumer credit health. The Illinois PLPA set a cap of 36 percent on APRs for specific types of consumer credit products under $40,000 offered by AFS providers, like payday and auto-title lenders. Because fewer than 5 percent of consumers nationwide use AFS providers, estimates for all Illinois residents would obscure the policy’s true impact, so the study needed to isolate how these policies affect the people who use AFS providers. Such analysis required a dataset that could identify these individuals and the financial outcomes affected by the PLPA policy change.
Urban researchers negotiated access to and used unique IDs created by the data providers to link two datasets: credit bureau data purchased from one of the three major reporting agencies and data aggregated from AFS providers. By linking these datasets using unique IDs, researchers found that the Illinois PLPA was associated with a short-term increase in debt in collections among people who used AFS providers, but the increase dissipated one year after implementation. This finding suggests some consumers may have struggled with their finances without access to specific AFS products in the short term but found alternate means of support in the long term.
2. Aggregated linkage: Key variables
When it is unrealistic or impossible to link data using unique identifiers, some level of aggregation or anonymization can be useful. Variables like school district, occupation type, and educational attainment can link disparate datasets in a similar fashion to unique IDs. This method offers a simple way to link datasets that share variables but comes with drawbacks. Observations cannot be made at more-granular levels than the shared variable (e.g., if datasets are linked by school district, student-level observations cannot be made), and comparisons are often made between similar types of individuals rather than the same individual over time.
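A minimal sketch of this approach, assuming two hypothetical person-level surveys that share an occupation code but no individual IDs: each source is first aggregated to the occupation level, and the aggregates are then joined on the code. The codes and values are made up, and the averages are unweighted for simplicity (a real survey analysis would typically apply survey weights).

```python
from collections import defaultdict

# Hypothetical person-level rows from two different surveys; no shared IDs.
survey_a = [  # (occupation_code, hourly_wage)
    ("4720", 14.0), ("4720", 16.0), ("3255", 38.0), ("3255", 42.0),
]
survey_b = [  # (occupation_code, has_health_insurance)
    ("4720", False), ("4720", True), ("4720", False),
    ("3255", True), ("3255", True),
]


def mean_by_key(rows):
    """Unweighted mean of the value column within each key."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for key, value in rows:
        totals[key][0] += float(value)
        totals[key][1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


wage_by_occ = mean_by_key(survey_a)
insured_by_occ = mean_by_key(survey_b)

# Join the two aggregated tables on the shared key variable.
linked = {
    occ: {"mean_wage": wage_by_occ[occ], "share_insured": insured_by_occ[occ]}
    for occ in wage_by_occ.keys() & insured_by_occ.keys()
}
```

The linked table has one row per occupation rather than per person, which is the trade-off described above: nothing finer than the shared variable can be observed.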
Research Spotlight: Black Women and Vulnerable Work
In a report on Black women and vulnerable work, Urban researchers Ofronama Biu and Afia Adu-Gyamfi linked data using occupational codes to understand whether Black women were under- or overrepresented in occupations with worse hours, more scheduling unpredictability, worse benefits, and lower pay. To measure vulnerability, researchers identified variables in the American Community Survey, the Current Population Survey: Contingent Workers Supplement, and the Current Population Survey: Annual Social and Economic Supplement. They then aggregated data across occupational codes, race, and gender. (Occupational codes classify jobs into a standard set of highly specific categories. A full breakdown of the occupations analyzed can be found in the technical appendix of this report [PDF].)
The researchers merged the three aggregated datasets to analyze the vulnerability characteristics (e.g., wages, health insurance, part-time status) from different sources together. After removing groupings with limited sample sizes, the team measured Black women’s occupational crowding in 308 occupations, finding that — irrespective of educational attainment — Black women were underrepresented in high-paying occupations compared with white men, white women, and Black men.
3. Statistical linkage: Predictive and probabilistic techniques
If unique or aggregate linkages are unavailable or ineffective for a particular research question, then more-complex approaches are required. Here, we explore one example of statistical linkage.
When two datasets share many of the same variables but one lacks data subgroups (e.g., race or geography) and the other lacks key variables to analyze across those subgroups, one dataset can train a statistical model and impute new variables onto the other dataset. In this method, the model is first trained to identify relationships between the key variables in the first dataset and any shared predictor variables across the datasets. Then, the trained model can make predictions for the values of key variables by subgroup in the second dataset. Through this process, the advantages of one dataset can be maintained while adding essential information from the other dataset.
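The train-then-impute idea can be sketched as follows, under heavy simplifying assumptions: a single shared predictor (income), a one-variable least-squares line standing in for whatever model a real project would use, and tiny made-up datasets. The names donor, recipient, and fit_line are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical "donor" survey: has the key variable (net worth)
# but no subgroup or local-geography information.
donor = [  # (income, net_worth)
    (20_000, 10_000), (40_000, 50_000), (60_000, 90_000), (80_000, 130_000),
]
# Hypothetical "recipient" survey: has the subgroup (area) but no net worth.
recipient = [  # (area, income)
    ("area_1", 30_000), ("area_1", 50_000), ("area_1", 70_000),
    ("area_2", 20_000), ("area_2", 40_000),
]


def fit_line(pairs):
    """Ordinary least squares with one predictor: y = slope * x + intercept."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in pairs)
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return slope, y_bar - slope * x_bar


# Step 1: learn the predictor-to-outcome relationship on the donor data.
slope, intercept = fit_line(donor)

# Step 2: impute the missing variable for every recipient record.
imputed_by_area = defaultdict(list)
for area, income in recipient:
    imputed_by_area[area].append(slope * income + intercept)

# Step 3: summarize the imputed values within each subgroup.
median_by_area = {area: median(vals) for area, vals in imputed_by_area.items()}
```

Only variables that are defined comparably in both datasets can serve as predictors, and imputed values carry model error, so subgroup summaries built this way are estimates rather than observations.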
Research Spotlight: Financial Health and Wealth Dashboard
Public data on American families’ net worth and liquid assets are rarely available at geographies below the state level, even though these variables are essential to understanding families’ resiliency to financial shocks and ability to build wealth for the future. Urban’s Financial Health and Wealth Dashboard addresses this gap by providing estimates on median household net worth and the percentage of households with at least $2,000 in emergency savings at smaller geographic levels.
To create these estimates, Urban researchers Aaron Williams, Mingli Zhong, and Alexander Carther used the Survey of Income and Program Participation (SIPP) to impute information onto the American Community Survey (ACS). The ACS microdata includes more than 1 million individual-level observations at granular geographies but lacks data on net worth and liquid assets, while the SIPP contains detailed data on individual net worth and liquid assets but does not include geographic identifiers below the state level.
To link data from these sources, the team identified wealth and socioeconomic predictors (like income, home value, and employment status) present in both the ACS and SIPP. These predictors were then used to train machine learning models on the SIPP data and generate estimates for net worth and liquid assets at the household level in the ACS. These estimates were aggregated to the public use microdata area level, a geographic unit defined by the US Census Bureau and only present in the ACS. As a result, practitioners and researchers could target policies to improve households’ financial well-being using two imputed variables that benefited from the SIPP’s detailed financial information and the ACS’s more-granular geographic data.
Further connections
Linking datasets using the methodologies described above has allowed Urban researchers to unlock novel insights and inform solutions to improve households’ financial well-being. We plan to continue this work through the Financial Well-Being Data Hub by pursuing opportunities to fill gaps in public datasets with private data shared by financial service providers. We will also continue to update the Financial Well-Being Data Catalog with new variables and sources available in public datasets to help other researchers understand households’ financial lives.
Further, our colleagues on Urban’s Technology and Data Science team are pioneering sophisticated methods like probabilistic record linkage to explore linking granular data without the need for unique identifiers. By pairing expert domain knowledge with new tools like Splink, a Python package developed to harmonize large, limited datasets, Urban researchers are exploring creative new ways to tackle data linkage. These strategies and the ambition of our research questions are continuously evolving, and we invite you to reach out to our team (njohnson@urban.org, jaxelrod@urban.org, and tgaron@urban.org) with any questions or feedback.