Toward a 21st Century National Data Infrastructure: Managing Privacy and Confidentiality Risks with Blended Data (2024)

Suggested Citation: "2 Technical Approaches to Managing Risk When Sharing Blended Data." National Academies of Sciences, Engineering, and Medicine. 2024. Toward a 21st Century National Data Infrastructure: Managing Privacy and Confidentiality Risks with Blended Data. Washington, DC: The National Academies Press. doi: 10.17226/27335.

2

Technical Approaches to Managing Risk When Sharing Blended Data

The data-dissemination policies of many statistical agencies have relied on technical approaches to statistical disclosure limitation such as top-coding, aggregation, sub-sampling, controlled rounding, data swapping, and cell suppression (Federal Committee on Statistical Methodology, 2005, 2022) in combination with Federal Statistical Research Data Centers (FSRDCs) and the release of microdata samples and tabulations. It is unclear whether such data-dissemination policies are up to the task in a data infrastructure with blending at its core. The desire to publish many detailed statistics and summaries, coupled with advances in hardware and algorithms, compounds the challenges of disclosure limitation in blended data products. It is now recognized that all published results, whether aggregated tabulations, regression coefficients, or microdata, contribute to disclosure risks. In theory, there is an absolute limit to the number of statistics agencies can publish before the confidential data underlying those released statistics are at risk of reconstruction (Dinur & Nissim, 2003). As deanonymization and reconstruction algorithms have improved at rapid rates, this risk has moved beyond the hypothetical (Cohen & Nissim, 2018; Dick et al., 2023; Kumar et al., 2007; Narayanan & Shmatikov, 2006). In one prominent example, the Census Bureau showed that published 2010 Census summary files could be reconstructed into microdata, resulting in reidentifications obtained by matching reconstructed microdata to external files (Long, 2020; State of Alabama, et al., v. United States Department of Commerce, et al., 2021).

Disclosure is not a “yes/no” concept but rather a sliding scale. Every additional statistic computed from a confidential dataset leaks some amount
of confidential information. Given enough statistics, the cumulative effect results in a disclosure. This cumulative effect is known as composition (Fluitt et al., 2019; Ganta et al., 2008), and it is particularly concerning for blended data comprising multiple sources. Examples of ingredient data sources that could be used repeatedly for data blending, as well as for stand-alone data products, include records from the Social Security Administration for identity validation (Wagner & Lane, 2014), unemployment insurance data for information about wages and employment in establishments (Department of Labor, n.d.), Internal Revenue Service (IRS) tax data (Internal Revenue Code, 1986; Statistics of Income, 2023), and commercial data including Equifax data (Cordell, 2023), as discussed below.

DISCLOSURE RISKS CAN BE MAGNIFIED WITH BLENDED DATA

Privacy and confidentiality risks are present at every stage of the blended data lifecycle. For example, consider a canonical setting in which a group of organizations in the public and private sectors would like to pool or link their data, and then create and disseminate data products based on those blended data. Pooling or linking data raises concerns about privacy of the inputs—that is, data elements from the ingredient files that feed into the blending procedure (Varia, 2023). Federal agencies are generally prohibited by law from sharing data in a form that could be used to identify respondents or entities (E-Government Act, 2002; Foundations for Evidence-Based Policymaking Act of 2018, 2019; Internal Revenue Code, 1986; Privacy Act, 1974; Title 13—Census, 1954). Meanwhile, private-sector organizations may be wary of sharing their data with federal agencies (National Academies of Sciences, Engineering, and Medicine, 2023c). Additionally, the blending process sometimes involves sharing identifiers, such as individuals’ tax identification numbers, names, or addresses, that may present disclosure risks, as well as sharing potentially sensitive information, like medical records, income, or other personal data, that may result in disclosure harms. Furthermore, ingredient data may have been obtained from data subjects with promises regarding the purposes for which the data would be used; these promises need to propagate to blended data derived from ingredient data.

There is also the issue of privacy of the outputs—that is, the end products produced from blended data (Varia, 2023). Publicly disseminated data products created from blended data may be used to (partially or fully) reconstruct individuals’ records from ingredient data files. By their nature, subsets of blended data are available in multiple data sources, which creates additional mechanisms for adversaries to mount record-linkage attacks and provides more information for reconstructions.

Data holders can require that their data users not attempt disclosures via blended data products (e.g., via a policy control). However, ingredient
files for blended data often have multiple data holders, any of which may release products from their ingredient data files without consideration for past or future blending activities, possibly through the use of a stand-alone disclosure-avoidance strategy. Indeed, ingredient data products valuable for one blending project may naturally be valuable enough to be used in other data-blending projects.

Such compositions make it difficult both to manage disclosure risks stemming from multiple outputs and to provide confidentiality. Agencies can no longer estimate disclosure risks by analyzing their data-blending scenarios in isolation (as in classical reidentification studies); the result could be an underestimate. For example, the combined outputs might facilitate reconstruction attacks or record-linkage attacks that are more effective than those based on only a subset of the data. It is also unrealistic to expect agencies to fully analyze every use of the underlying datasets. In a future data infrastructure, one possible approach to this problem is for a coordinating mechanism—like, for example, the Confidentiality and Data Access Committee (CDAC) of the Federal Committee on Statistical Methodology, which has historically helped facilitate the exchange of disclosure-limitation methodologies among agencies—to ensure that agencies use compatible disclosure-avoidance mechanisms, so that disclosure risks increase in a predictable manner as datasets are reused. Such coordination could go beyond the current exchange of practices and trainings offered through the Data Protection Toolkit (Federal Committee on Statistical Methodology, 2022) to develop and implement a disclosure review board leadership team. In this scenario, the panel believes it would be necessary for the Office of Management and Budget (OMB) to delegate this authority, derived from the Paperwork Reduction Act (1995) and the Foundations for Evidence-Based Policymaking Act of 2018 (2019; hereafter, Evidence Act), to CDAC, subject to OMB’s review and approval.

Each of these problems (privacy of ingredient files, privacy of output files, and composition) can potentially be addressed through a mix of technical and policy approaches. Below, we review existing technical approaches and comment on their suitability for various data-blending tasks. We also comment on data enclaves, which, although considered a policy control in this report, the panel views as a worthy strategy for providing access to blended data products.

TECHNICAL APPROACHES FOR MANAGING RISKS IN BLENDED DATA

Specific aspects of the data-blending task determine which, if any, technical approaches can help to limit disclosure risks generated by blending. Here, we consider secure multiparty computation (SMC), synthetic data
with verification/validation servers, classical statistical disclosure-limitation methods, and formal privacy. These techniques are currently deployable to varying degrees, but not all can support privacy and confidentiality protection with blended data on their own. They are not mutually exclusive and can form the basis of tiered access, especially in conjunction with data enclaves. In the panel’s view, tiered access offers a promising overall approach for providing access to blended data (see Chapter 3).

We do not thoroughly review each technical approach, as many authors have done so previously (e.g., Cramer & Damgård, 2015; Drechsler, 2011; Duncan et al., 2011; Dwork & Roth, 2014; Willenborg & de Waal, 2001). Instead, we focus on their applications for blended data tasks.

SMC

SMC is a set of techniques1 that permit multiple parties to blend and analyze data without allowing unencrypted records to leave any party’s computer server. For example, with SMC, each party in a group can provide their contribution—for example, the number of people with a certain sensitive condition—to a total of all contributing parties’ data, without revealing their individual data records (or even their individual contributions) to the other parties. In the ideal case, each party learns only the final output of a computation and nothing else about the underlying ingredient data files.2
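
To make this concrete, the sketch below (Python, with illustrative values; production systems use vetted cryptographic protocols rather than this simplification) shows additive secret sharing, a building block of many SMC summation protocols: each party splits its private count into random shares that individually reveal nothing, and only the recombined total is learned.

    # Minimal sketch of additive secret sharing for private summation.
    # All names and parameters are illustrative.
    import secrets

    Q = 2**61 - 1  # public prime modulus; all arithmetic is mod Q

    def share(value, n_parties):
        """Split `value` into n_parties random shares that sum to value mod Q."""
        shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
        shares.append((value - sum(shares)) % Q)
        return shares

    def reconstruct(shares):
        """Recover a secret (or a sum of secrets) by adding shares mod Q."""
        return sum(shares) % Q

    # Three agencies each hold a private count of a sensitive condition.
    counts = [120, 45, 300]
    all_shares = [share(c, 3) for c in counts]
    # Party i adds up the i-th share received from every agency...
    partial_sums = [sum(s[i] for s in all_shares) % Q for i in range(3)]
    # ...and only the combined total (465) is revealed, not any single count.
    assert reconstruct(partial_sums) == sum(counts)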

As the privacy provided by SMC is limited by the computation being performed, SMC is ideal for data-blending scenarios in which agencies do not wish to share their microdata but instead wish to perform some specific computation on the combined records from the ingredient data files. SMC can be used to conduct specific analyses that would otherwise require producing a record-level blended data product that is shared across agencies. Only the results of such analyses are revealed, and the disclosure risks are reduced.

For example, SMC can be used to compute summary statistics over multiple databases. Varia describes the Boston Wage Gap project, whereby participating employers used SMC to share their workforce employment data (aggregated by sex, race/ethnicity, and occupation) with the Boston Women’s Workforce Council in a way that enabled the Council to compute only summary statistics describing occupational data for the city as a whole (Varia, 2023).3

___________________

1 Related methods include data virtualization and data visitation (see Glossary of Selected Terms).

2 In practice, a small amount of additional information may be leaked (Evans et al., 2018).

3 See Kamm and Laud (2015), Qin et al. (2019), and Zhao et al. (2019) for other examples.

Private set intersection (PSI) is a type of SMC that can be used to share information needed for linking records across ingredient data files (Freedman et al., 2004). Techniques for PSI are relatively mature and are in commercial use (Walker et al., 2019). Existing systems can process millions or billions of records at a time (Raghuraman & Rindal, 2022). PSI is a version of privacy-preserving record linkage (PPRL)4 with the stronger security guarantees that come with SMC (Gkoulalas-Divanis et al., 2021). In another common approach to PPRL, participating agencies “hash”5 exact identifiers using a one-way function, keyed separately for each approved project. This prevents relinking in unauthorized ways. The accuracy of the analysis produced using PSI or other PPRL techniques can be compared to analysis results obtained without using PPRL (e.g., clear text matching or probabilistic record linkage with indirect identifiers). In one such examination, the National Center for Health Statistics concluded that PPRL performed the linkage nearly as effectively as a standard record-linkage algorithm (Mirel, 2023; Mirel et al., 2022).
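
The hashing step described above can be illustrated with a short sketch (Python; the key, field names, and normalization are hypothetical simplifications of production PPRL systems). A secret key held by the linkage authority and rotated for each approved project keeps the codes from being recomputed or relinked by outsiders; unkeyed hashes of low-entropy inputs such as ID numbers could be reversed by brute force.

    # Sketch of keyed, one-way hashing of exact identifiers for PPRL.
    # The key and field handling are illustrative simplifications.
    import hashlib
    import hmac

    PROJECT_KEY = b"rotated-secret-for-this-approved-project"

    def hash_identifier(name, dob, ssn):
        """Map exact identifiers to a fixed-length code that is hard to invert."""
        message = f"{name.strip().lower()}|{dob}|{ssn}".encode()
        return hmac.new(PROJECT_KEY, message, hashlib.sha256).hexdigest()

    # Two agencies apply the same keyed hash and then match on the codes alone.
    code_a = hash_identifier("Jane Doe", "1980-01-31", "123-45-6789")
    code_b = hash_identifier(" JANE DOE", "1980-01-31", "123-45-6789")
    assert code_a == code_b  # identical identifiers link; raw values never leave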

SMC has some limitations for blending tasks. First, computations performed via SMC can be much slower than computations performed directly on the underlying data (Varia, 2023). Current techniques may not be operationally practical for very large datasets, depending on the complexity of the task being performed (Varia, 2023). Second, SMC techniques currently work best when the computations involved are linear (i.e., involving addition and multiplication), which excludes some objectives of blended data analysis (Varia, 2023). Third, SMC requires prespecification of the analysis and assumes the prespecification is valid for the data; SMC does not easily allow for model diagnostics and data exploration. Ideally, data formats are standardized with common fields to facilitate these computations (Varia, 2023), but this is not always the case. Finally, since the output computed with SMC is the same as the output computed from the blended data directly, the result still needs to undergo disclosure review or be protected with statistical disclosure-limitation technology.

___________________

4 PPRL is a class of techniques for linking records across two or more databases that use secure protocols to avoid directly sharing records’ identifiers (Hall & Fienberg, 2010, p. 277).

5 Hashing converts names, addresses, and dates of birth into unique encrypted codes that protect the original values. These encrypted codes, or groups of encrypted codes, can then be linked. It is important to employ vetted and secure hashing techniques that cannot be easily reversed, especially for small inputs like ID numbers.

Synthetic Data with Validation/Verification

When blended data come from multiple data holders, agencies seeking to create public-use microdata files may determine it necessary to perturb nearly all of the variables in the blended data file. Otherwise, adversaries
with access to ingredient data may be able to match on unperturbed values. Synthetic data can be employed in such cases. Synthetic data are generated “[…] to create a dataset that is similar to the original [sensitive] data, but where some or all of the resulting data elements are generated and do not map to actual individuals” (Garfinkel, 2015, p. 8). For example, the Census Bureau, IRS, and Social Security Administration released a synthetic public-use microdata file generated from the Survey of Income and Program Participation blended with data on individuals’ benefits (Abowd et al., 2006). Synthetic data, suitably constructed, offer the potential to preserve distributional features in blended data while reducing disclosure risks (Reiter, 2005b). In particular, the synthesis process can create records that do not correspond to actual individuals, which makes record-linkage attacks difficult and potentially nonsensical. However, synthetic data are not immune to disclosure risks. For example, models used to create synthetic data may generate specific records only when certain combinations of values are in the confidential data, which may present disclosure risks (Reiter et al., 2014). In practice, disclosure risks for synthetic data have been assessed on an ad hoc basis; additional work is needed to quantify these risks, especially for fully synthetic data. To this end, methods are emerging for generating synthetic data with formal privacy guarantees (Garfinkel et al., 2018).
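
As a simplified illustration of the synthesis idea (not the method used for the Survey of Income and Program Participation product), the sketch below fits a simple model to confidential data and then samples artificial records from it; real syntheses use far richer models and are followed by disclosure risk checks.

    # Sketch of fully synthetic data generation: fit a model to confidential
    # data, then sample artificial records. Data and model are illustrative.
    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Stand-in for a confidential blended file: log income and age.
    confidential = rng.multivariate_normal(
        mean=[10.5, 45.0], cov=[[0.25, 1.0], [1.0, 150.0]], size=5000)

    # Modeling step: estimate the joint distribution from the confidential data.
    mu_hat = confidential.mean(axis=0)
    sigma_hat = np.cov(confidential, rowvar=False)

    # Synthesis step: draw records that preserve means and covariances but
    # do not correspond to actual individuals.
    synthetic = rng.multivariate_normal(mu_hat, sigma_hat, size=5000)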

Several statistical agencies are coupling the use of synthetic data with validation or verification servers (Abowd et al., 2006; Barrientos et al., 2018; Kinney et al., 2011). These were discussed, for example, in a 2017 report from the Commission on Evidence-Based Policymaking (CEP), in the Advisory Committee on Data for Evidence Building (ACDEB) Year 2 Report (Advisory Committee on Data for Evidence Building, 2022), and in a recent study by the National Academies (2023b). With a validation server, the secondary data analyst requests that the agency run their statistical analysis on the confidential data, and the agency provides outputs from the analysis, with disclosure protection, back to the analyst. With a verification server, the agency instead provides disclosure-protected measures of the similarity of confidential and synthetic data results (e.g., how much confidence intervals from the confidential and synthetic data analyses overlap) back to the analyst.
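
As one concrete example of a verification measure, the sketch below computes a confidence-interval overlap score in the spirit of Karr et al. (2006); the intervals are hypothetical, and a verification server would release only a disclosure-protected version of the score.

    # Sketch of an interval-overlap verification measure: how much do the
    # confidence intervals from confidential and synthetic analyses agree?
    def interval_overlap(conf_ci, synth_ci):
        """Return a score in [0, 1]; 1 means the intervals coincide."""
        lo = max(conf_ci[0], synth_ci[0])
        hi = min(conf_ci[1], synth_ci[1])
        if hi <= lo:
            return 0.0  # disjoint intervals: synthetic results are misleading
        overlap = hi - lo
        return 0.5 * (overlap / (conf_ci[1] - conf_ci[0])
                      + overlap / (synth_ci[1] - synth_ci[0]))

    print(interval_overlap((1.2, 2.0), (1.4, 2.3)))  # about 0.71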

As of this writing, validation and verification services have been provided manually by data holder agencies, in that agency staff run analyses supplied by the data user on the confidential data. With high-use blended data applications, such services could become costly to sustain without additional resources for the data holder agencies. One potential approach is to develop automated validation or verification systems. A challenge here, as with all query systems, is that the release of validation or verification measures leaks information about confidential data (McClure & Reiter,
2012; Reiter et al., 2009). Research on how to provide disclosure-protected versions of validation or verification measures is ongoing (Barrientos et al., 2021, 2023; National Academies, 2023b; Taylor et al., 2021).

Blended data provide an additional challenge when agencies are not permitted to place data on servers outside of their control. In this case, it may be possible to use SMC to perform some verification or validation computations across servers belonging to separate agencies, with the guarantee that data do not leave servers unencrypted (and are not decrypted elsewhere). Additional research is needed to investigate and develop this potential application of SMC. Regardless, SMC still needs to be combined with other disclosure-avoidance methods to ensure that results do not leak confidential information.

Reconstruction attacks in theory and in practice have revealed that all data products, including synthetic data, leak information that can be used for disclosure attacks (Cohen & Nissim, 2018; Dick et al., 2023; Dinur & Nissim, 2003; State of Alabama, et al., v. United States Department of Commerce, et al., 2021). This supports the view that confidentiality protections arise from the way data products are produced rather than the way they are formatted or presented (Dwork et al., 2006).

Naïve Applications of Classical Statistical Disclosure Limitation

Statistical disclosure limitation has two components: a quantification of disclosure risks and a mechanism for perturbing data to lower the disclosure risk measure. Some methods, like k-anonymity (Samarati & Sweeney, 1998), inextricably link the mechanism to the disclosure risk measure: the parameter k, in this example. Other methods, like cell suppression and noise addition, provide more flexibility. For instance, cell suppression allows one to define a sensitive table cell, and then perform suppression so that the value of that cell cannot be inferred precisely from other tables (Adam & Worthmann, 1989; Cox, 1980). It is worth emphasizing that any measure of disclosure risks is interpretable as such only relative to an underlying threat model. Statistical disclosure-limitation methods may not prevent significant disclosure risks when the underlying threat models are not aligned with reality.
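
As a small illustration of how one such risk measure is operationalized, the sketch below checks k-anonymity over a set of quasi-identifiers (field names, records, and the choice of k are hypothetical):

    # Sketch of a k-anonymity check: every combination of quasi-identifier
    # values must occur at least k times in the file.
    from collections import Counter

    def satisfies_k_anonymity(records, quasi_identifiers, k):
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    rows = [
        {"zip": "27701", "age_band": "40-49", "sex": "F", "income": 62000},
        {"zip": "27701", "age_band": "40-49", "sex": "F", "income": 58000},
        {"zip": "27705", "age_band": "30-39", "sex": "M", "income": 54000},
    ]
    # The lone 27705/30-39/M record is a sample unique, so k = 2 fails.
    print(satisfies_k_anonymity(rows, ["zip", "age_band", "sex"], k=2))  # False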

Classical statistical disclosure-limitation methods (e.g., aggregation, data swapping, cell suppression, recoding) are generally not well suited on their own for managing disclosure risks with blended data. Applied at low intensity (e.g., a small swap rate or minimal aggregation), they may not be adequately protective. For example, suppose 1% of individuals have their data swapped; this means 99% of records are not subject to swapping. Because the variables are spread over several ingredient data files, keeping sets of variables together risks linkage attacks from adversaries who have
access to records from one or more ingredient files.6 Similarly, blended data records may be unique across sets of variables in the ingredient files, or unique in the combined data, even after aggregation—and therefore at risk. On the other hand, classical methods applied at high intensity could result in blended data with limited and perhaps inadequate usefulness for some purposes. For example, if agencies swap a high fraction of values independently across sets of variables, they destroy the correlational structure between the swapped and unswapped variables, which could defeat the initial purpose of combining variables from multiple databases.
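
A small simulation illustrates the point (the data and swap rates are hypothetical): the more values of one variable are swapped independently of another, the more the correlation that motivated the blending is attenuated.

    # Simulation: independent swapping attenuates cross-variable correlation.
    import numpy as np

    rng = np.random.default_rng(seed=7)
    x = rng.normal(size=10000)
    y = 0.8 * x + rng.normal(scale=0.6, size=10000)  # corr(x, y) is about 0.8

    def swap(values, rate, rng):
        """Exchange values between a fraction `rate` of randomly chosen records."""
        out = values.copy()
        n_pairs = int(rate * len(values) / 2)
        idx = rng.choice(len(values), size=2 * n_pairs, replace=False)
        out[idx[:n_pairs]], out[idx[n_pairs:]] = \
            values[idx[n_pairs:]], values[idx[:n_pairs]]
        return out

    for rate in (0.01, 0.25, 1.00):
        print(rate, round(np.corrcoef(x, swap(y, rate, rng))[0, 1], 2))
    # The correlation decays toward zero as the independent swap rate grows.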

Classical disclosure-limitation methods can also be ineffective because of composition. The measures of disclosure risks used in conjunction with those methods generally do not account for the combined leakage of confidential information across multiple blending scenarios involving the same underlying datasets (Fluitt et al., 2019; Ganta et al., 2008). In other words, if each blending application has its disclosure risks quantified separately, the joint disclosure risks cannot be predicted from the collection of individual analyses.

___________________

6 Similar risks can apply when all the variables are in a single data product. Blended data can imply that multiple parties have access to parts of the data, which could increase the potential for successful disclosure attacks.

Disclosure Limitation Using Formal Privacy

Formal privacy definitions, such as pure differential privacy (Dwork et al., 2006), concentrated differential privacy (Bun & Steinke, 2016), Gaussian differential privacy (Dong et al., 2022), and Rényi differential privacy (Mironov, 2017), define confidentiality guarantees mathematically. In contrast to classical disclosure-limitation methods, the threat models underlying most variants of differential privacy are general: they place no limits on the resources or outside knowledge available to the adversary. One then seeks to design protection methods that can be proven to satisfy the mathematical definition (Dwork & Roth, 2014). It is worth emphasizing that differential privacy and its variants are criteria for defining confidentiality, not the methods for perturbing data themselves. Colloquial uses of the term differential privacy often conflate the way the definition is achieved with the definition itself (Reiter, 2019).
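
For reference, pure differential privacy requires that for every pair of datasets D and D' differing in one record, and every set S of possible outputs, a randomized mechanism M satisfy (Dwork et al., 2006)

    \Pr[M(D) \in S] \le e^{\varepsilon} \Pr[M(D') \in S],

where smaller values of the privacy-loss parameter ε give stronger protection.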

An appealing feature of formal privacy definitions is that they can account for combined privacy leakage, in particular when the various blending scenarios all use formal privacy (Dwork & Roth, 2014). Another benefit of formal privacy is that the code for the associated mechanisms can be published without compromising confidentiality protections. This is important for achieving transparency.
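
In the simplest case (basic sequential composition), releasing k outputs that satisfy ε1-, ..., εk-differential privacy on the same data yields a combined release satisfying ε-differential privacy with (Dwork & Roth, 2014)

    \varepsilon = \varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_k,

so an agency can track a total privacy-loss budget across products; refined variants of differential privacy provide tighter accounting.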

Formal privacy techniques are available for many, but not all, purposes. They are particularly well suited for producing cross-tabulations and contingency tables (Abowd et al., 2022). Additional research is needed to produce differentially private algorithms that offer low error under reasonable privacy guarantees for inferences based on survey-weighted data and that account for imputation due to missingness and editing.
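
A minimal sketch of one widely used formally private building block, the Laplace mechanism for counts, appears below (the count and ε are illustrative). Adding or removing one person changes a count by at most 1 (sensitivity 1), so Laplace noise with scale 1/ε yields ε-differential privacy (Dwork et al., 2006).

    # Sketch of the Laplace mechanism for a differentially private count.
    # The count and epsilon are illustrative.
    import numpy as np

    rng = np.random.default_rng()

    def dp_count(true_count, epsilon):
        """Release a count with Laplace noise of scale 1/epsilon (sensitivity 1)."""
        return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

    print(dp_count(1234, epsilon=0.5))  # noisy count; standard deviation about 2.8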

One criticism often levied at differential privacy and its variants is that with a finite privacy-loss budget, no new analysis results can be released once the budget is exhausted (Hotz et al., 2022). This is inherent to any data-release strategy, whether using differential privacy or not: given enough accurate statistics, adversaries can compromise confidentiality—for example, via reconstruction attacks (Dinur & Nissim, 2003). The privacy-loss budget is also a policy decision; the agency can always decide to increase the budget if the additional data usefulness generated warrants the additional disclosure risks. However, setting privacy-loss budgets is a complex endeavor with relatively few genuine examples on government databases that will be repeatedly queried and put to many uses, including blending. It is unclear, for example, what the implications for disclosure risks are when blended data satisfy formal privacy guarantees but individual parties release products derived from their ingredient data that do not satisfy these guarantees; for example, the composition properties of formal privacy may no longer apply. Much research is needed to realize the benefits of formal privacy for a future data infrastructure based on blended data.

Data Enclaves

Agencies can also take advantage of data enclaves, both physical and virtual, to share confidential blended data. A data enclave is a physical or virtual segregated space that “restricts the export of the original data and instead accepts queries from qualified researchers, runs the queries on the de-identified data, and responds with results” (Garfinkel et al., 2023, p. 33). Examples include secure locations in FSRDCs (U.S. Census Bureau, 2023) and virtual servers hosted by the National Opinion Research Center (2023). Data enclaves may allow researchers to perform analyses directly on confidential or lightly redacted data, which is quite useful in terms of accuracy, although potentially burdensome to access. As with SMC, statistical results obtained in enclaves need to undergo (often manual) checks for potential disclosures. Tables, regression coefficients, and other results produced inside a data enclave generally still require some form of disclosure limitation before those results can leave the data enclave. Indeed, this is current practice inside FSRDCs, which tend to rely on ad hoc rules for reducing disclosure risks from outputs.

Query auditing (Nabar et al., 2008) is an important feature of the controlled environment in a data enclave. Auditing tracks what a researcher accesses and checks access patterns for potential disclosure of confidential information. Auditing can be used for retroactive detection of potential disclosures or for real-time monitoring to discourage bad behavior and reduce disclosure risks by denying a researcher’s queries. Denials themselves may come with some disclosure risks when the denial of a query depends on confidential information (Kenthapadi et al., 2013). Nonetheless, logging user actions and auditing are recommended even for the most trusted users.

Quantifying the disclosure risks of an analysis performed inside a data enclave is an open problem, as it requires detecting whether an exploratory analysis was unduly (often unintentionally) influenced by a small number of records (Dwork & Ullman, 2018).

QUANTIFYING AND MANAGING DISCLOSURE RISK/USEFULNESS TRADE-OFFS

No nontrivial technical approach guarantees zero risks to privacy and confidentiality, and every nontrivial technical approach has implications for data usefulness. In general, increasing privacy and confidentiality protection requires decreasing the quality of results from data analyses. This trade-off has long been recognized among practitioners of statistical disclosure-limitation techniques. It underlies many agencies’ decision-making processes when selecting technical and policy approaches and is a recommended practice for data dissemination (Advisory Committee on Data for Evidence Building, 2022; Commission on Evidence-Based Policymaking, 2017). Indeed, ACDEB’s Year 2 Report explicitly called for disclosure risk/usefulness trade-off evaluations (Advisory Committee on Data for Evidence Building, 2022). The panel emphasizes that data equity is an important consideration and an integral part of data usefulness (National Academies, 2023a).

Blended data are also subject to disclosure risk/usefulness trade-offs, but the number of decision points is augmented. Agencies need to make decisions about trade-offs when procuring ingredient data files, performing blending, and disseminating blended data products (statistics and microdata). Choices imply risk tolerance and usefulness requirements. For example, to manage disclosure risks in a blended data product, an agency may need to link records from multiple sources, using coarse variables like demographic characteristics (i.e., not social security numbers or even names) to satisfy confidentiality requirements. If the linking variables differ in quality across demographic groups, the confidentiality restrictions could result in linkages that are less accurate for some groups than for others. As another example, confidentiality concerns may lead an agency to perturb collected data using a statistical disclosure-limitation method. In many
cases, as with additive noise, cell suppression, and other methods that target protection of small populations and population uniques (i.e., persons who have unique combinations of characteristics in the population), distortions due to confidentiality have lower relative error for large populations than for small populations, raising concerns about usefulness and equity.
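
A back-of-the-envelope calculation (with hypothetical numbers) illustrates the disparity. Additive noise with standard deviation σ = 2.8 barely perturbs a count of 10,000, since 2.8/10,000 ≈ 0.03 percent relative error; the same noise applied to a count of 25 for a small population yields 2.8/25 ≈ 11 percent relative error, a difference of several hundredfold in accuracy for the smaller group.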

Conclusion 2-1: Trade-offs in disclosure risks, disclosure harms, and data usefulness are unavoidable and are central considerations when planning data-release strategies, particularly for blended data. Effective technical approaches to manage disclosure risks prioritize the usefulness of some analyses over others.

Quantifying Disclosure Risks

Determination of acceptable risk is a policy decision, not a technical matter, but quantifying risk is a technical problem—and a challenging one, both for blended and nonblended data products. Naïve applications of classical methods for quantifying disclosure risks, like those reviewed above, generally presume adversaries that have limited computational resources and limited information about data subjects. Neither of these presumptions may be safe for blended data. As these data come from multiple sources, many people—some of whom may find motivation to attempt disclosures, especially when blended data augment their available information with sensitive values—have access to detailed subsets of data subjects’ records. In addition, computational resources and algorithms can increasingly operate inexpensively on massive data files, making computational limitations less defensible than they once were. Furthermore, as a measure of disclosure risks, reidentification itself can be insufficient. As examples, attribute disclosure can occur even if the exact record belonging to an individual cannot be identified (Machanavajjhala et al., 2007); and deep-learning models potentially reveal significant amounts of sensitive information contained in their training data, but this would not be called a “reidentification” (Carlini et al., 2019).

Research has identified several scientific principles that measures of disclosure risks would ideally satisfy. These include transparency, postprocessing invariance, and composition (Dwork et al., 2006; Jarmin et al., 2023; Kifer & Lin, 2012; McSherry, 2009). Transparency, also known as Kerckhoffs’s principle, is the requirement that implementation details of a disclosure-avoidance mechanism, along with its privacy parameters, be made public. Postprocessing invariance is the requirement that making those details public does not increase disclosure risks, because the disclosure risk management strategy should have already accounted for public knowledge of that strategy. Methods that do not account for an
adversary’s knowledge of the strategy can sometimes be reverse engineered for this reason (Cohen, 2022; Wong et al., 2007). Additionally, transparency and postprocessing invariance are highly beneficial for usefulness, as they allow data analysts to account for the disclosure-limitation treatments in inferences—for example, by incorporating the distribution used for perturbation in a measurement error modeling framework.

The property of composition is especially appealing for blended data. Composition refers to the cumulative disclosure risks that result from multiple releases of confidential information, even when disclosure-avoidance methods are applied. Generally, it is impossible to analyze all current and planned uses of datasets to quantify disclosure risks. Instead, it is more practical to analyze each blending scenario separately, and then aggregate disclosure risks across all the blending scenarios. This is not possible with reidentification studies and classical measures of disclosure risks—each individual analysis of a blending scenario may conclude that the scenario is safe, but the overall risk may be high (Fluitt et al., 2019; Ganta et al., 2008).

To date, the primary definition of disclosure risks that satisfies composition, transparency, and postprocessing invariance is differential privacy (and its variants). As such, differential privacy offers advantages for quantifying disclosure risks. As noted previously, not all blended data tasks have algorithms that achieve formal privacy, at least with agreeable disclosure risk/usefulness trade-offs. In such cases, one approach is for agencies to use the best scientific methods available7 given their resources to create data products that meet the data-blending purpose. For example, SMC could be used to compute distributed statistics, or synthetic data could be used to create public-use microdata files coupled with policies that provide approved users with access to confidential data. To quantify risks in such situations, current disclosure risk quantification technology forces agencies back into the situation of modeling specific adversaries—for example, performing reidentification studies assuming adversaries’ knowledge sets. With blended data, it is prudent for agencies to model adversaries that have access to all the information in ingredient data files that have been or ultimately might be (to an agency’s best prediction) released in a data product. Additional research is needed to advance techniques for disclosure risk assessment in such settings.

___________________

7 A good example is the Opportunity Atlas, which used a differential privacy–inspired approach in lieu of cell suppression (Chetty & Friedman, 2019).

Even when formally private approaches are available and used, classical disclosure-assessment methods, like reidentification studies, still can be useful. These assessments can provide information on the effects of disclosure-limitation treatments (e.g., comparisons of risks before and after application of protection methods) that decision makers can use to help interpret the
effectiveness of alternative disclosure-limitation methods. These assessments can also provide insight on risks for small populations or other high-risk clusters of data, which can help address data-equity concerns. For these reasons, some experts see classical risk assessment methods as remaining in the statistical disclosure-limitation toolbox going forward (Shlomo, 2019; Slavković & Seeman, 2023).

Blended data present additional practical challenges to agencies seeking to account for composition in their data releases. First, as suggested previously, composition may make it difficult to track privacy leakage—for example, when agencies that contribute ingredient data files do not adhere to formal privacy for all their data products or do not disclose data products they hold or plan to release. Second, decision makers need to determine the amount of privacy leakage they are willing to tolerate for blended data—both for the specific product and for future products that use ingredient data—and establish processes for deciding whether their privacy-loss budgets need to be increased as additional statistics or data products are released. Much research is needed to help agencies deal with these issues in transparent, interpretable, and defensible ways.

Recognizing that no single measure of disclosure risks is appropriate in all contexts, one line of emerging research seeks to formally reason about the alignment of technical approaches with confidentiality requirements, without fully characterizing those requirements. Instead, such research proposes technical requirements that are either necessary or sufficient (but not both) for meeting particular confidentiality requirements. Lower bounds for disclosure risks are necessary conditions for confidentiality: they can be used to show that a technique may fail to meet the confidentiality requirement. For instance, Cohen and Nissim (2020) and Altman et al. (2021) argue that k-anonymity does not meet European anonymization requirements. Sufficient conditions for confidentiality are upper bounds for disclosure risks: they can be used to show that a technique meets a confidentiality requirement across contexts. For instance, Nissim et al. (2017) argue that differential privacy achieves the confidentiality required by the Family Educational Rights and Privacy Act. This work demonstrates that formally defined and checkable technical requirements can be used in concert with flexible, context-dependent analyses (Nissim, 2023).

Assessment of Harm

Assessing harm from potential disclosure is a difficult task. Harm can arise when an adversary can determine how a specific individual differs from the “typical” person (Machanavajjhala & Kifer, 2015). However, the specific groups targeted for harm can change over time. Harm can arise when sensitive information is revealed, such as a person’s health status,
financial resources, or even personal opinions, that might be used for ill-intentioned purposes. However, information sensitivity is in the eye of the beholder: the sensitivity of information about an individual is both context and time dependent, and it can be difficult to predict in advance (Nissenbaum, 2010). Sometimes harm is restricted to consequences that follow from disclosures of information that is true. Even decades ago, some researchers argued that if the respondent is not in the sample, then no harm can be done (Bethlehem et al., 1990), whereas others argued that untrue disclosures can be as damaging as true disclosures, as in the case of incorrect reidentification of criminal records (Ladd, 1989).

In the case of blended data, agencies have the best chance of assessing potential harm by consulting with stakeholders about the various ingredient data files, especially representatives of data subjects, legal experts, and domain experts. This could be done, for example, with advisory boards, as suggested by ACDEB (Advisory Committee on Data for Evidence Building, 2022).

Altman et al. (2015) provided one characterization of harm based on the impact a disclosure could have on a person’s life, with a scale ranging from “life threatening” to “negligible.” Such judgments are necessarily coarse, but they can guide disclosure-protection plans. For example, for blended data for which disclosures can cause life-threatening harm, it is essential to take extra care to reduce disclosure risks as much as possible via technical approaches, and agencies need to add policy controls (e.g., significant penalties for misuse) to disincentivize bad actors (National Research Council, 2005). On the other hand, for blended data for which disclosures cause negligible harm, agencies may be willing to weigh data usefulness more highly and thus accept greater disclosure risks. Chapter 4 illustrates a model framework for these decisions.

Quantifying Usefulness

Many of the measures of and considerations for data usefulness for blended data mirror those described in Chapter 1 for general datasets. Determining the primary purpose of blended data is a key step in defining usefulness measures. For example, suppose blended data are intended to inform policy decisions via specific statistical estimates. If so, usefulness measures can focus on how those specific results might be affected by the disclosure-limitation treatments, such as comparisons of sign, magnitude, effect sizes, or confidence intervals (Karr et al., 2006). Policymakers can then view reductions in accuracy against the degree of confidentiality protection, to arrive at acceptable compromises. When possible, it is advisable to incorporate the confidentiality-protection mechanisms in statistical inferences. For example, if the released estimate, say a percentage or an average,
is perturbed by additive noise with a published distribution, measurement-error models can be used to obtain principled inferences (Li & Reiter, 2022); comparisons for usefulness evaluations can utilize these inferences too. Finally, when assessing the impacts of confidentiality-protection procedures, it is important to recognize that other errors (e.g., linkage error, measurement error, imputation for nonresponse or editing, or incomplete population coverage) also can have negative impacts on the accuracy of results. Research is needed to develop inferential procedures that account for uncertainties introduced by the multiple sources of error.
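
As a simple illustration of the measurement-error adjustment mentioned above (a special case, assuming independent additive noise): if the released estimate is \tilde{\theta} = \hat{\theta} + e, where the noise e has mean zero and published variance \sigma^2, then \tilde{\theta} remains unbiased and its variance is

    \operatorname{Var}(\tilde{\theta}) = \operatorname{Var}(\hat{\theta}) + \sigma^2,

so confidence intervals can be widened accordingly rather than treating the perturbed value as exact.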

Defining usefulness is particularly challenging when blended data are intended to be released as research datasets, which are expected to be widely used to address many questions. As such, there may not be a set of key estimates to which accuracy evaluations can be pegged. This is the case, for example, for the creation of the synthetic Longitudinal Business Database (LBD; Kinney et al., 2011). The synthetic LBD has dozens of variables (comprising several annual measurements over more than 30 years) measured on millions of U.S. establishments. It is used for analyzing job growth, creation, and destruction; measuring productivity; and identifying overall trends in employment and payroll in the private sector. To assess the quality and limitations of the synthetic data, the Census Bureau developed a suite of representative analyses—specified by a team of domain experts who were not involved in developing the synthetic data—and compared results from the synthetic and original (confidential) data, mainly on percent differences in point estimates (Kinney et al., 2011). One potential risk is that synthetic data may be tuned too closely to the analyses used in these intermediate evaluations. With current technologies, this may be inevitable in any disclosure-limitation procedure for creating research datasets. Additional research is needed on multidimensional measures of data usefulness.

We note that verification and validation servers, along with tiered access, represent one path forward to allow users to obtain information on the quality of their particular analyses, as measured by differences in confidential and synthetic data-analysis results (National Academies, 2023b). These methods can reduce the pressure on synthetic data models to capture all analyses faithfully. As noted earlier, for validation and verification servers to be maximally useful, research and engineering are needed to develop automated, highly flexible versions of these services, while also managing disclosure risks when these services are used many times for the same blended data product.

The usefulness of a potential blended data product is affected by the types of analyses it permits. For example, if the goal of a blended data exercise is to enable evaluation of an observed intervention, the exercise may have low usefulness if ingredient data files do not include important
confounding variables. Indeed, absence of such variables may not make the blending exercise worth the disclosure risks. Stakeholder input can be helpful for identifying priority variables that need to be part of blended data, ideally before any actions to obtain ingredient data files take place.

Data equity is also a key component when evaluating data usefulness. Blending data can increase the coverage of the population (National Academies, 2023a). However, some of the procedures necessary to protect confidentiality may have disparate impacts on certain groups (Bowen, 2021; Bowen & Snoke, 2023; Garfinkel, 2015; Garfinkel et al., 2023), creating data inequities and reducing data usefulness. Without specific investigations of the effects of confidentiality-protection procedures on data for various subsets of individuals—including effects on data availability and any data linkages, as well as the effects of statistical disclosure limitation on analyses and data quality for subsets of individuals—agencies could miss impacts of confidentiality-protection strategies on data quality. For example, when linking records across datasets, some procedures may (unwisely) skip records with missing attribute values in blended data. Attribute values are rarely missing at random: marginalized members of society may be more likely to have missing fields. For example, a person experiencing homelessness does not have an address, and a nonbinary individual may be more likely to leave a gender field blank. In consequence, blended data may underrepresent some groups. Agencies can look for such issues by comparing distributions of demographic variables in blended data with those in ingredient data files.
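
Such a comparison can be automated; the sketch below (with illustrative categories, counts, and review threshold) computes the total variation distance between an age marginal in an ingredient file and in the blended file, flagging large gaps for review.

    # Sketch: compare a demographic marginal in ingredient vs. blended data.
    # Categories, counts, and the review threshold are illustrative.
    def shares(counts):
        total = sum(counts.values())
        return {group: n / total for group, n in counts.items()}

    def total_variation(p, q):
        """Half the L1 distance between two categorical distributions."""
        groups = set(p) | set(q)
        return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in groups)

    ingredient = shares({"18-34": 310, "35-64": 520, "65+": 170})
    blended = shares({"18-34": 280, "35-64": 540, "65+": 140})
    tv = total_variation(ingredient, blended)
    print(f"TV distance: {tv:.3f}")  # flag for review if above, say, 0.02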

ENGENDERING TRUST THROUGH TRANSPARENCY AND COMMUNICATION

CEP called for “[t]he Congress and the President [… to] enact legislation to codify relevant portions of Office of Management and Budget Statistical Policy Directive 1 to protect public trust by ensuring that data acquired under a pledge of confidentiality are kept confidential and used exclusively for statistical purposes” (Commission on Evidence-Based Policymaking, 2017, p. 47). This was reflected in the Evidence Act, which underscored the responsibilities for statistical agencies regarding public trust (Foundations for Evidence-Based Policymaking Act of 2018, 2019). To implement these Evidence Act provisions, ACDEB opined that “OMB should apply practical principles for promoting public trust and should empower all federal agencies to participate in this process” (Advisory Committee on Data for Evidence Building, 2021, p. 9). Because no data-release method offering nontrivial usefulness has zero disclosure risks, care needs to be taken when releasing data; proper release, enforced by coherent policies and procedures, is advisable (Sweeney, 1997).

Trust in confidentiality protections and in data quality is a desirable outcome. In the panel’s view, such trust can be engendered only when an agency emphasizes transparency and communication throughout the development phase of disclosure-limitation technology. Such transparency enables advocates for confidentiality, data quality, and underrepresented groups to engage in public scientific discussions about disclosure risk/usefulness trade-offs. For example, for the disclosure-avoidance system used in the 2020 Census redistricting data,8 some analyses of confidentiality protections and data quality suggest that the quality of the final product is improved over versions from the development phase (Hollowell, 2022; Kenny et al., 2023; Li et al., 2023). These improvements were facilitated by the Census Bureau’s efforts to seek feedback from stakeholders.

___________________

8 See P.L. 94-171.

Effective communication with stakeholders can be challenging for data holder agencies. Although most data holder agencies engage with stakeholders in some way, some may not leverage this engagement as a strategy to support the transparency and usefulness of their data products (The Data Foundation and the Center for Open Data Enterprise, 2023, p. 4). Best practices in stakeholder engagement have been described in detail by others (The Data Foundation and the Center for Open Data Enterprise, 2023). Indeed, some researchers have suggested that agencies need to use such strategies to overcome “statistical illusions and epistemic disconnects” (boyd & Sarathy, 2022, p. 17) that surround data products, notably that released data are the “objective ground truth” (boyd & Sarathy, 2022, p. 28). In other words, agencies need to help data users understand that the data reflect various sources of measurement error and manipulations (e.g., editing and imputation), even without the application of confidentiality-protection procedures.

Discussions regarding disclosure risk/usefulness trade-offs in blended data products can be especially complex given the range of stakeholders and legal requirements pertinent for particular ingredient files. Additionally, agency communication strategies may need to be reviewed and updated periodically to reflect improved understanding of disclosure risk/usefulness trade-offs. In the panel’s view, when managing disclosure risk/usefulness trade-offs in blended data, effective communication strategies are centered on several goals, which are delineated in the sections that follow.

End-to-End Data-Quality Studies

Many factors affect data quality, from the coverage of the sampling frame all the way through to the disclosure-limitation system. Examples of end-to-end quality studies include postenumeration surveys (U.S. Census
Bureau, 2022b) to measure undercount and other coverage errors, as well as to report on data-cleaning metrics such as the pattern of missing data and subsequent imputation performed; and simulations to determine sampling and nonsampling errors. Other possibilities include summary metrics on the number of statistical edits performed to correct responses with obvious errors (Schafer & Bell, 2021). External researchers can also study the effects of data on policy and relative magnitudes of various error sources (Steed et al., 2022). If researchers are encouraged to perform such analyses, external assessment can inform discussions about disclosure risk/usefulness trade-offs.

In the panel’s view, external studies should be promoted to raise awareness of data-quality issues and to set the stage for meaningful discussions of the effects of disclosure-limitation technology on data usefulness. Procedures used to protect confidentiality may have minimal impact on the quality of blended data for their primary purposes, or they may significantly degrade data quality. For example, when the goal of a blended data product is to derive national- or state-level estimates (i.e., estimates derived from many data points), the effects of noise introduced for disclosure limitation may be swamped by the effects of nonsampling error, and these latter errors may affect policymaking or decisions more than the disclosure treatments do (Steed et al., 2022). On the other hand, when the goal of blended data is to characterize very small geographic regions or populations, such noise could substantially distort estimates (Kenny et al., 2023; Li et al., 2023). Knowledge of such effects, obtained by engaging stakeholders in the development process when feasible, can help agencies increase the usefulness of blended data products. For example, if accuracy for small geographic regions is essential to the usefulness of blended data, agencies may decide to adjust the noise distribution and accept higher confidentiality risks in exchange for usefulness, or they may decide to make greater use of policy approaches (e.g., data enclaves).

Early and Continual Engagement with Stakeholders and the Public

Three major types of stakeholders are usefulness advocates (e.g., policymakers, researchers), privacy advocates, and advocates for marginalized groups. Stakeholders can raise concerns about the type of information most vital to their interests and the harms that may result from disclosure. The earlier stakeholder groups can be engaged, the greater the opportunity to refine disclosure-avoidance technologies. Early engagement is especially important in the case of blended data for at least two reasons. First, as described in the preceding sections, use of privacy-protection methods on blended data is less well understood across stakeholders (Hunter Childs et al., 2020). Second, a signature feature of blended data is that they enable
reuse of data for purposes other than those specified at the time of collection. This feature counters the (often) context-driven determination of acceptability to potential respondents (Hawes, 2023).

In the panel’s view, stakeholder input should be taken into account, and the agency’s response should be communicated back to stakeholders. Communication efforts could include reports on the issues raised and on the creation of data metrics (U.S. Census Bureau, 2022c), with explanations of how the underlying concerns were measured. Simplified communications, with fewer technical details, could be prepared for the public. Elaborate confidentiality assurances may unintentionally signal that the requested information is more sensitive, incriminating, or otherwise potentially harmful than is actually the case (Singer et al., 1992). As described above, to the extent possible, it is advisable that communications regarding blended data also consider the context and purposes of the blending activity. Of course, care needs to be taken to avoid placing an undue burden on stakeholder communities, especially data subjects themselves, when seeking feedback.

External Analysis of Work in Progress

Ideally, technology is tested as it is developed, and disclosure-avoidance methods are no exception. To move forward, agencies could create intermediate data products that allow external researchers to monitor the progress of developing disclosure-limitation technologies, to identify data-quality issues, and to evaluate whether issues are adequately addressed in subsequent phases of development. Agencies themselves can use intermediate data products to explore disclosure risk/usefulness trade-offs.

Based on the experience of its members, the panel offers several suggestions for creating intermediate products. First, intermediate products need clear descriptions and labels to avoid confusing stakeholders about their purpose. For example, names such as “Demonstration Data Products” (Abowd & Velkoff, 2020) could be viewed by some stakeholders as indicating a finalized technology rather than a work in progress. Second, the version number and date need to be prominently included in the name of a blended data product, and external researchers need to be encouraged to refer explicitly to the specific data product analyzed—a process known as data citation (Martone, 2014). This can preempt situations in which researchers mistakenly assume that flaws in earlier data products are flaws in the final product (Lane et al., 2023). Third, ideally, agencies can base intermediate data products on historical datasets with fewer disclosure-risk concerns, or they can make such datasets available only to approved researchers in strictly controlled settings, to minimize disclosure risks to current data subjects. Finally, ideally, the source code of the disclosure-avoidance system would be made public, as the Census Bureau did for the disclosure-avoidance system used for the 2020 decennial census files. Public scrutiny can help detect problems that might otherwise be found only by chance (Alexander et al., 2010).
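As a purely hypothetical illustration of the versioning and data-citation suggestion above, the sketch below stamps an intermediate product with its name, version, release date, and a content checksum so that external analyses can cite the exact file examined; the naming scheme and helper function are assumptions, not an agency standard:

```python
# Hypothetical labeling helper for intermediate data products; the naming
# scheme is an illustrative assumption, not an agency convention.
import hashlib
from datetime import date

def labeled_release(product: str, version: str, path: str) -> str:
    """Return a citable identifier embedding product name, version,
    release date, and a checksum of the released file's contents."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:12]
    return f"{product}_v{version}_{date.today().isoformat()}_sha256-{digest}"

# Example (with a hypothetical file):
#   labeled_release("demo-tables", "0.3", "demo_tables.csv")
#   -> 'demo-tables_v0.3_2025-01-15_sha256-1a2b3c4d5e6f'
```

An identifier of this kind lets a researcher's data citation distinguish an early demonstration file from the final product, directly addressing the confusion the panel warns about.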

Conclusion 2-2: Effective communication with data holders and data users can help agencies understand and better manage disclosure risk/usefulness trade-offs.

Next Chapter: 3 Policy Approaches to Managing Risks When Sharing Blended Data