This chapter provides a broad overview of the disclosure limitation approaches that have been suggested in the literature on privacy protection. The panel also notes which ones are currently used for the Survey of Income and Program Participation (SIPP). These approaches are not mutually exclusive but rather may be used in combination. The panel divides them into two broad classes. One class involves modifying the data so that they can be disseminated widely while still protecting confidentiality; this typically relies on statistical modeling so that the modified sample data preserve important population-level signals and relationships and continue to support valid inferences. The second class imposes restrictions on how the data are accessed and on how outputs are approved for release. Typically, there is a tradeoff between these two classes of approaches: data may be modified so that they are considered safe from disclosure and can be widely disseminated, or controlled access may be provided to the data in ways that are judged safe from disclosure. There may also be intermediate levels that combine approaches from the two classes, in which some access controls are relaxed while some data modifications are also imposed.
A major consideration, whether modifying the data or offering multiple modes of access, is the impact of such decisions on data usability. Usability includes not only the ability to obtain valid inferences across a wide range of applications but also issues such as equity in access. Maintaining usability is a theme that appears throughout this report because there is a tradeoff between data privacy protection and data usability, and neither can be considered in isolation from the other. The panel discusses usability in depth in Chapter 9 after the various disclosure avoidance approaches have been discussed.
Because the appropriate disclosure avoidance system depends on the mode of access, the panel first lists the possible modes of access. The value of creating multiple tiers of access is motivated by three factors: the risk elements mentioned in Chapter 3 (e.g., longitudinal structure and integrated administrative data), the balancing of risk reduction with data usability, and satisfying existing users while also broadening the user base. The panel uses the term access somewhat liberally in this report to refer to the entire process of working with and releasing the data: getting approval for access to the data, the means used to work with the data, verifying and validating the data, and the process of publicly releasing the data. All of these steps are important and interrelated, and different tiers can be expected to impose different conditions on them.
The most common mode of access is through the provision of a public-use file, in which a record is released for each individual or household, containing a set of survey-reported variables. A public-use file is typically provided without any check on the identity of the user, although some form of registration for its use can be required. Users typically download these files to their own computing environments for use. While this mode has the widest accessibility, it also carries the greatest threat of disclosure if too many (or highly disclosive) variables are released on each record. However, there are ways of further protecting confidentiality even with public-use files. One way is to require data users to accept a data use agreement. For example, the National Center for Education Statistics posts the notice reproduced in Box 4-1 when users seek to access certain of its public-use files.
An agreement of the type shown in Box 4-1 involves no formal approval process that a user or the Census Bureau must go through, and the file is available for use immediately once the user clicks to accept the terms. Given the sensitive nature of much of the SIPP data, the panel does not view this approach as providing adequate protection and at a minimum would like to see some type of registration process so that the name and institution of the data user are identified. A registration process would not be fully adequate in the long term, but it might be part of a temporizing approach while the Census Bureau develops more elaborate precautions. As a variation on this approach, the Census Bureau could also require users to register online before downloading the file, which would further emphasize the importance of maintaining confidentiality.
A second mode is through the provision of synthetic data for variables that are not included in the public-use file. Synthetic variables are constructed using statistical or machine learning models that emulate the
distributional and relational properties of the original variables. These synthetic variables can be appended to the data that have been made accessible under the first mode. Sometimes synthetic variables are created only for a subset of respondents rather than for all respondents. An extended version is to create a fully synthetic dataset of all variables for all respondents in the SIPP survey and then release it as a public-use file. That is, the original data are completely restricted and never given to the user. The panel discusses this last mode in Chapter 6 and notes there that users usually have the option to query the data custodian for statistics on the original data to ensure that the statistics they obtain on the synthetic data are close to those
on the original data. In this mode, the data custodian controls both the construction of the synthetic data and the nature of the statistics released in response to such a user query.
A third mode is secure online data access. The Census Bureau offers one type of such access through the Federal Statistical Research Data Centers (FSRDCs; see Chapter 2). FSRDCs offer a highly restrictive form of such access, while Chapter 5 presents a much more general (and accessible) type of data access. In this latter model, users must be licensed, but they have online access to selected SIPP variables that are deemed too sensitive to be released, or that are highly coarsened, in the public-use file. Access requirements or proposal approval processes are set to limit disclosure, but these can offer wider access than is currently offered through FSRDCs, whose use requires considerable processing time, a difficult approval process, and added expenses. Access through FSRDCs may still be needed for SIPP data that are commingled with administrative data, which carry other legal restrictions on data access.
A fourth mode is query-based access, which can be made available to the general public without any restriction or screening of users. A specific type of query-based access provides users the ability to request a table of statistics, using a table builder constructed by the data custodian. This mode is sometimes used by the Census Bureau to allow the general public to obtain national summary statistics on demographic and economic variables. As with public-use files, use of this mode is typically unrestricted, but it allows the data custodian (i.e., the government agency) to restrict the statistics released in order to provide sufficient privacy protection, and it may include the addition of noise to the estimates as a way of protecting privacy. The panel discusses table builders in general and for SIPP in Chapter 7.
Currently, SIPP is primarily provided through the first mode, a public-use file. Users in the general public, without restriction or registration, can download the data from the SIPP website to their own computing environments. The disclosure limitation techniques described below and in Chapter 2 are applied to the dataset before it is posted on the website. However, the Census Bureau also provides data through FSRDCs, which provide greater access to the original data but with different disclosure restrictions (see Chapter 2). For several years the Census Bureau also provided a synthetic data file in which administrative earnings were matched to a subset of survey variables (as the panel discussed in Chapter 2 and will discuss further in Chapter 6). As of September 30, 2022, access to the synthetic data has been temporarily shut down until a new server location is established, leaving FSRDCs as the only currently available alternative to the public-use files.
The classical, or traditional, approaches in statistical disclosure limitation (SDL) to render sensitive data confidential include these five: (1) top-coding, where all values above (or below) a given threshold are set to a fixed value; (2) recoding, where variables are coarsened into a smaller number of categories; (3) variable suppression, where a variable is deleted from the dataset (e.g., when smaller geographical areas are not included); (4) data swapping, in which selected values are swapped between similar respondents; and (5) noise infusion, in which random noise (e.g., a random increment) is added to selected sensitive variables (Hundepool et al., 2012). These methods historically have been most commonly applied to data released as public-use files. As discussed in Chapter 2, SIPP currently uses these types of methods for disclosure avoidance (except for data swapping). Some benefits of these approaches are that they are relatively easy to implement (e.g., without requiring new technology or software) and that the Census Bureau and other statistical agencies have a long track record of experience in using them. They have long been considered sufficient to protect privacy (with the proviso that data are never completely safe from disclosure).
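To make these operations concrete, the sketch below applies top-coding, recoding, variable suppression, and noise infusion to a toy file (a minimal illustration in Python with numpy and pandas; the variable names, thresholds, and noise level are hypothetical and are not those used for SIPP). Data swapping, item (4), is sketched later in this chapter.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=12345)

    # Toy microdata; columns and values are purely illustrative.
    df = pd.DataFrame({
        "age": [23, 41, 67, 85, 92],
        "income": [18_000, 55_000, 240_000, 73_000, 410_000],
        "county": ["A", "B", "B", "C", "A"],
    })

    # (1) Top-coding: set all incomes above a threshold to the threshold.
    top = 200_000
    df["income_topcoded"] = df["income"].clip(upper=top)

    # (2) Recoding: coarsen age into broad categories.
    df["age_group"] = pd.cut(df["age"], bins=[0, 34, 64, np.inf],
                             labels=["<35", "35-64", "65+"])

    # (3) Variable suppression: drop the detailed geography entirely.
    df = df.drop(columns=["county"])

    # (5) Noise infusion: add zero-mean random noise to a sensitive variable.
    noise_sd = 5_000
    df["income_noised"] = df["income_topcoded"] + rng.normal(0, noise_sd, len(df))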
A problem with these traditional SDL approaches is that it is not easy to evaluate how protective they are using the concepts and approaches described in Chapter 3, and it is sometimes unclear whether the disclosure risk is below a tolerable risk threshold after an SDL technique is applied. Additionally, the impact of these approaches on data utility may not always be clear or fully assessed. In principle, it is possible to conduct multiple reidentification exercises to assess the likelihood of disclosure for a given traditional SDL treatment, but this is costly and is not completely reliable unless a formal risk metric is computed. Consequently, the choice and intensity of the specific traditional SDL methods used tend to be ad hoc, reflecting the data custodian's best guess about how large the risk is. In addition, when applied to longitudinal datasets, these traditional methods are most often imposed wave by wave, ignoring the risk arising from the use of multiple waves of data and hence from the longitudinal nature of the data. (In fact, SIPP collects longitudinal data within each wave, since much of the data have monthly values.)
A method of disclosure avoidance that can be assessed with formal privacy measures is the general method of noise injection, or perturbation, particularly for the sensitive variables. In general, perturbation approaches to protecting microdata apply a controlled stochastic procedure to replace a subset of the original variable values with perturbed values, with the aim of introducing just enough noise or uncertainty into the microdata to reduce the disclosure risk to an acceptable level. Perturbation also can be applied to aggregate statistics and their generation
procedures, such as table builders (see Chapter 7). How perturbation methods control the change from the original values depends on whether the variable is continuous or categorical. For continuous variables, perturbation methods generally take the form of generating noise from a given distribution with a mean of zero and a predetermined variance and adding it to (or multiplying it with) the original values of the variable. For categorical variables, perturbation methods take the form of changing values of the variable to other values according to a probability transition matrix and the outcome of a random draw. If perturbation methods are applied to the indirect identifiers, they should be applied prior to calibrating the survey weights.
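As an illustration, the sketch below (Python; all parameter values are hypothetical) adds zero-mean noise to a continuous variable and perturbs a categorical variable by drawing a replacement category from each row of a probability transition matrix, in the spirit of the post-randomization method (PRAM).

    import numpy as np

    rng = np.random.default_rng(seed=2024)

    # Continuous variable: add zero-mean noise with a chosen variance.
    income = np.array([18_000., 55_000., 240_000., 73_000.])
    noise_sd = 4_000.0
    income_perturbed = income + rng.normal(loc=0.0, scale=noise_sd, size=income.shape)

    # Categorical variable: perturb using a probability transition matrix.
    # Row i gives the probabilities of reporting each category when the
    # true category is i; large diagonal entries keep most values unchanged.
    categories = np.array(["renter", "owner", "other"])
    P = np.array([
        [0.90, 0.05, 0.05],
        [0.05, 0.90, 0.05],
        [0.10, 0.10, 0.80],
    ])
    tenure = np.array([0, 1, 1, 2])  # indices into `categories`
    tenure_perturbed = np.array(
        [rng.choice(len(categories), p=P[t]) for t in tenure]
    )
    print(categories[tenure_perturbed])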
One approach to noise injection or perturbation is to generate synthetic data. The basic idea of synthetic data approaches is to randomly generate data from a statistical or generative model so that the aggregate measures and relationships in the original data are well preserved. Rubin (1993) first proposed the idea of creating fully synthetic data based on the multiple imputation framework. For example, Shlomo and De Waal (2008) use a multivariate or sequential approach to maintain multivariate associations during data synthesis. Yu et al. (2022) generate synthetic data using semi-parametric and non-parametric models. Besides full synthesis, there is also partial synthesis (Little, 1993), which selects a subset of variables, either quasi-identifiers or sensitive variables or both, to be synthesized. Although both fully synthetic data and partially synthetic data are generated under the multiple imputation framework, the variance estimation approaches differ (Reiter, 2003, 2005a). It should be noted that the disclosure risk is higher for partially synthetic data, since true values remain in the dataset, though disclosure risk is not completely removed even for fully synthetic data. Reiter and Mitra (2009) describe methods to assess the remaining disclosure risk in partially synthetic datasets. Like fully synthetic data, partially synthetic data must satisfy standards on both data utility and disclosure risk. The synthpop R package (Nowok et al., 2016) on the CRAN network and IVEware (Raghunathan et al., 2016) offer several ways to employ synthetic data methods. The panel discusses synthetic data further in Chapter 6, where it notes that one of the key questions about synthetic data is whether they can faithfully capture all the relationships in the original data that users are interested in.1
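The sequential-modeling idea can be sketched as follows (Python with scikit-learn; the variables, models, and synthesis order are illustrative choices rather than recommendations, and a production implementation such as synthpop would use more refined models and diagnostics). Each variable is synthesized in turn from a model fit to the original data, conditioning on the variables synthesized before it.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(seed=7)

    # Toy original data; in practice this would be the confidential file.
    n = 500
    orig = pd.DataFrame({
        "age": rng.integers(18, 90, n),
        "employed": rng.integers(0, 2, n),
        "income": 1_000 * rng.integers(5, 200, n).astype(float),
    })

    synth = pd.DataFrame(index=range(n))

    # Step 1: synthesize age by resampling its marginal distribution.
    synth["age"] = rng.choice(orig["age"].to_numpy(), size=n, replace=True)

    # Step 2: synthesize employment status from a model given age.
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(orig[["age"]], orig["employed"])
    probs = clf.predict_proba(synth[["age"]])
    synth["employed"] = [rng.choice(clf.classes_, p=row) for row in probs]

    # Step 3: synthesize income from a regression given age and employment,
    # adding residual noise so the synthetic values are draws, not predictions.
    reg = LinearRegression().fit(orig[["age", "employed"]], orig["income"])
    resid_sd = np.std(orig["income"] - reg.predict(orig[["age", "employed"]]))
    synth["income"] = (reg.predict(synth[["age", "employed"]])
                       + rng.normal(0, resid_sd, n))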
Other perturbation methods include swapping (Dalenius & Reiss, 1982; Fienberg & McIntyre, 2005). Swapping has been used in some public-use microdata released by government agencies as a means to reduce disclosure risk by introducing “uncertainty” into the microdata. It is designed to switch the values of certain fields in one record with the values of the corresponding fields of another record, while controlling the swap pair within strata (“boundary” variables) related to the survey outcome. This creates uncertainty by creating false positive matches that are indistinguishable from true positive matches by a data intruder. Some argue that while an intruder can match the data using known key characteristics or an external file, the intruder can never know for sure whether the matched data are true when the perturbation rate and the variables affected are kept secret, as they typically are. Often geographies are swapped, on the belief that many of the survey target variables are approximately independent of geography conditional on the variables used to define the swapping strata.
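A minimal sketch of data swapping within strata (Python; the swap rate, the stratum definition, and the swapped field are hypothetical): randomly paired records that agree on the stratum variables exchange their geography codes.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=42)

    def swap_within_strata(df, swap_var, strata, rate):
        """Swap `swap_var` between randomly paired records within each stratum."""
        out = df.copy()
        for _, idx in out.groupby(strata).indices.items():
            idx = rng.permutation(idx)
            n_pairs = int(len(idx) * rate / 2)
            for k in range(n_pairs):
                i, j = idx[2 * k], idx[2 * k + 1]
                out.loc[out.index[i], swap_var], out.loc[out.index[j], swap_var] = (
                    df.loc[df.index[j], swap_var], df.loc[df.index[i], swap_var])
        return out

    # Toy file: swap county codes between households of the same size.
    df = pd.DataFrame({
        "hh_size": [1, 1, 1, 1, 2, 2, 2, 2],
        "county":  ["A", "B", "C", "D", "A", "B", "C", "D"],
    })
    swapped = swap_within_strata(df, swap_var="county",
                                 strata=["hh_size"], rate=0.5)
    print(swapped)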
In general, a combination of classical disclosure limitation approaches is applied. Often, deterministic perturbation methods are applied to the identifiers and random perturbation methods are applied to the sensitive variables. With the movement toward transparency, which would call for the parameters of the data swapping (e.g., perturbation rates and swapping variables) to be released to the public, swapping might need to be evaluated further for both data protection and data utility.
Output may need to be reviewed for accuracy or for disclosure risk. When synthetic data are used, it is possible that statistical results from the synthetic data will not be the same as those from the original data; in the case of SIPP, synthetic data users are given the option to have Census Bureau personnel replicate the analysis on what is called the Gold Standard File to determine if the results have changed. In addition, agencies may require that output be reviewed by specialists or by a Disclosure Review Board to determine what can be publicly released.
Differential privacy is an alternative framework that proposes a specific metric for evaluating disclosure risk (see Chapter 3), but it is also commonly associated with a specific method that involves noise injection, making a statistical release “indistinguishable” from one in which any one individual’s data are removed from the dataset.2 This approach carefully injects random noise into statistical computations so as to ensure that the effects of each individual’s data are obscured. This leads to a privacy-utility tradeoff, whereby more noise provides stronger privacy protections but less accurate statistics, and less noise provides more accurate statistics but weaker privacy protections. A large research literature has developed around mathematically proving that appropriately perturbed statistical methods satisfy the requirement of differential privacy with a specified level of privacy loss.
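To illustrate the basic mechanism, the sketch below (Python; the query and the privacy-loss parameter are illustrative) releases a count with Laplace noise whose scale is calibrated to the count's sensitivity, the standard construction for satisfying epsilon-differential privacy.

    import numpy as np

    rng = np.random.default_rng(seed=3)

    def dp_count(values, predicate, epsilon):
        """Release a noisy count satisfying epsilon-differential privacy.

        Adding or removing one individual changes a count by at most 1
        (sensitivity = 1), so Laplace noise with scale 1/epsilon suffices.
        """
        true_count = sum(predicate(v) for v in values)
        return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

    incomes = [18_000, 55_000, 240_000, 73_000, 410_000, 35_000]
    # Smaller epsilon: more noise, stronger protection; larger epsilon: less noise.
    for eps in (0.1, 1.0):
        noisy = dp_count(incomes, lambda x: x > 100_000, epsilon=eps)
        print(f"epsilon={eps}: noisy count of incomes over $100,000 = {noisy:.1f}")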
The initial focus of the differential privacy literature was on descriptive statistics (Blum et al., 2005; Dwork et al., 2006), but over time the literature turned to more ambitious tasks such as synthetic data generation (e.g., Blum et al., 2013; Bowen & Liu, 2020; Snoke & Slavković, 2018; Tao et al., 2022; Vietri et al., 2022), machine learning (e.g., Abadi et al., 2016;
Chaudhuri et al., 2011; Kasiviswanathan et al., 2011), and statistical inference (e.g., Duchi et al., 2018; Kamath & Ullman, 2020; Vu & Slavković, 2009).
___________________
2 Absolute and relative disclosure risk are alternative metrics; see Chapter 3. Noise injection can also be applied under those metrics.
Here the panel briefly mentions some of the areas of differential privacy research and practice that are most salient for SIPP, some of which are covered in more detail in later chapters.
Managing the privacy-loss budget. Mathematical theorems about reconstruction attacks and membership inference attacks (see Chapter 3) indicate that there is an inherent tradeoff between privacy, usability, and the amount of statistical information released: if too many statistics about the same set of individuals are published with too much accuracy, then the data will necessarily become vulnerable to reconstruction or membership inference attacks (see Dwork, Smith et al., 2017). The composition theorems of differential privacy allow one to measure and control the accumulation of risk as more statistics are published. A common practice is to preset an overall “budget” for the maximum privacy loss one is willing to incur, and then manage the releases so that the budget is never exceeded. Alternatively, each extra release over the initial budget has to be evaluated for whether the benefit of the additional release warrants increasing the budget and hence the disclosure risk. This is particularly salient for SIPP, as many data releases are made over time on the same panel. Managing the privacy-loss budget exposes difficult choices about which statistics are the most important to include with high accuracy and which can be omitted or released at a coarser level.
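A simple illustration of this kind of accounting (Python; the total budget, the statistics, and the allocation are hypothetical): under basic sequential composition the privacy losses of individual releases add up, so a preset total budget can be divided among planned releases, with more of the budget, and hence less noise, allotted to the statistics judged most important.

    total_budget = 1.0  # maximum privacy loss (epsilon) the custodian will accept

    # Hypothetical allocation: more of the budget goes to the higher-priority statistic.
    allocation = {
        "count of households in poverty": 0.6,
        "count of program participants": 0.3,
    }
    assert sum(allocation.values()) <= total_budget  # never exceed the preset budget

    # Each count has sensitivity 1, so its Laplace noise scale is 1 / epsilon_i.
    for name, eps_i in allocation.items():
        print(f"{name}: epsilon={eps_i}, Laplace noise scale={1.0 / eps_i:.2f}")

    remaining = total_budget - sum(allocation.values())
    print(f"Budget remaining for any later, unplanned release: {remaining:.2f}")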
Differential privacy for longitudinal data. The simplest way to manage the privacy-loss budget for longitudinal data is to perform each release independently and use the composition theorems for differential privacy to bound the accumulated privacy loss. However, it is possible for the privacy-utility tradeoff to be improved by carefully correlating the releases (i.e., using past releases to help construct new ones), especially for statistics that are not expected to change dramatically between releases. So far, the differential privacy literature has only developed methods for tracking simple statistics (e.g., counts) over time (e.g., Chan et al., 2011; Dwork et al., 2010; Jain et al., 2022), and more research along these lines would be beneficial for SIPP.
Differential privacy for complex survey designs. A common intuition is that using a random sample when collecting or releasing data aids in privacy protection. For differential privacy, there are theorems demonstrating that the privacy loss is indeed reduced when a differentially private algorithm is applied to a simple random sample (Balle et al., 2018; Beimel et al., 2014; Kasiviswanathan et al., 2011). However, for complex sampling designs, the situation is more subtle, and in some cases the differential privacy protections may even degrade (Bun et al., 2022). Thus, more research is needed to understand the impact of the SIPP sampling design on differential privacy, or how the sampling design can be modified to interact better with privacy.
Differential privacy for relational and joined data. The most standard model and implementation of differential privacy applies to performing releases based on a tabular dataset, where each record corresponds to one individual’s data (so the opt-out counterfactual amounts to removing one record from the dataset). However, for more complex data (such as from SIPP) that also encode relationships between individuals or between individuals and other units that need privacy protection (like households), it is better to use software that is designed to express and provide privacy protections on relational databases. There is a growing research literature and set of software tools for this setting (see the survey in Near & He, 2021). Some such tools can also support joins between different tables and control the impact of the joins on the privacy loss; this functionality is important for releases that involve joining SIPP data with administrative data.
Differential privacy synthetic data. There are striking theoretical results showing that, in principle, differential privacy can support very rich synthetic data generation. In particular, it is possible to produce differentially private synthetic data that support accurate answers to exponentially many more queries than would be possible if each query were answered independently with differential privacy (Blum et al., 2013; Hardt & Rothblum, 2010). There has been a substantial body of work aimed at making these differentially private methods more practical, but research challenges remain in tackling high-dimensional complex-structured data like SIPP (see Chapter 7).
Differential privacy query systems. In addition to being used to perform statistical releases (tables, synthetic data, model parameters, etc.), differential privacy can also be used in an interactive query system, through which analysts can issue custom queries (e.g., regressions or models they want to fit) and receive differentially private results. Such systems have an advantage in that the agency holding the data does not need to anticipate what analyses data users will want to perform. However, query systems do raise additional challenges. The most notable is managing the privacy-loss budget. To ensure a given level of privacy protection, the system would track the accumulated privacy loss as each query is issued and stop answering queries once the budget is expended. For this reason, differential privacy query systems may only be practical in more controlled environments, where the number of queries per analyst can be limited and there can be assurance (e.g., through user agreements) that multiple analysts do not collude to exceed their individual limits. In the context of SIPP, a differential privacy query system could be deployed using secure online data access (see Chapter 5) or in a verification and validation server to accompany a synthetic data release (see Chapter 6), or it could be transformed into a noninteractive flexible table builder (see Chapter 7).
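A skeletal version of such a system might look like the following (Python; the class, its parameters, and the clipping bounds are illustrative assumptions, not an existing Census Bureau tool). The server tracks the privacy loss accumulated across queries under basic composition and refuses to answer once the preset budget is exhausted.

    import numpy as np

    class DPQueryServer:
        """Answer mean queries under a preset privacy-loss budget.

        Illustrative only: values are clipped to [lo, hi] so the sensitivity
        of the mean is (hi - lo) / n, and basic composition is used to add up
        the epsilon spent per query.
        """

        def __init__(self, data, lo, hi, total_budget):
            self.data = np.clip(np.asarray(data, dtype=float), lo, hi)
            self.sensitivity = (hi - lo) / len(self.data)
            self.remaining = total_budget
            self.rng = np.random.default_rng()

        def noisy_mean(self, epsilon):
            if epsilon <= 0 or epsilon > self.remaining:
                raise RuntimeError("Privacy-loss budget exhausted or invalid epsilon.")
            self.remaining -= epsilon
            noise = self.rng.laplace(0.0, self.sensitivity / epsilon)
            return self.data.mean() + noise

    server = DPQueryServer(data=[18_000, 55_000, 240_000, 73_000],
                           lo=0, hi=500_000, total_budget=1.0)
    print(server.noisy_mean(epsilon=0.5))   # succeeds
    print(server.noisy_mean(epsilon=0.5))   # succeeds; budget is now exhausted
    # A further query with any positive epsilon would now raise RuntimeError.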
Several alternative methods for increasing privacy protection and reducing disclosure risk are available. However, it is important to consider usability when adopting methods for SIPP data. Usability can be characterized along three dimensions: how accurate results are in a basic statistical sense (accuracy), whether the data can be used to answer significant questions (feasibility), and how accessible the data are to users (accessibility). It is important to assess the various approaches for reducing disclosure risk along all three of these dimensions.
While most people have an intuitive idea of what usability is, namely, the ability to use a dataset and conduct the kinds of analyses on it that they want, the concept is actually fairly complex and requires careful consideration. For example, for the purposes of deciding what disclosure avoidance methods to use for SIPP, it is important to quantify usability to the maximum degree possible. This is because all disclosure avoidance methods, including those discussed in Chapters 4–8, will reduce the usability of the data in at least one dimension and possibly more. Because of this, it is important, when considering any disclosure avoidance system, to assess how much, and in what ways, it reduces usability. With knowledge of the exact nature of this privacy-utility tradeoff, a decision can be made on whether the reduction in disclosure risk associated with any particular disclosure avoidance system is worth the expected reduction in usability.
The panel considered three dimensions when evaluating the impact of disclosure avoidance methods on usability:
The first dimension of usability, “accuracy,” refers to the accuracy of parameter estimates generated from the data to answer specific scientific questions. It is perhaps the narrowest and most traditional definition of usability, and it builds on basic concepts of statistics. This dimension is affected by the degree to which, and the way in which, noise is added to the data, the way the data are coarsened, and/or the way synthetic data are produced. Accuracy has implications for the believability of estimates generated from the data.
The second dimension of usability, “feasibility,” has multiple aspects. One is the ability of analysts to use the data to investigate the range of scientific questions that the data were designed to answer, which is a much broader concept than accuracy per se. Feasibility is affected
by the particular variables made available in data releases, the granularity of such measures (such as whether state identifiers are provided), and the facility with which users can manipulate and analyze the data. But feasibility also includes the ability of users to obtain accurate and valid estimates from the protected files with easily available software. This has consequences for the range and types of substantive questions that may be answered with the data. A final aspect of feasibility, related to the second, is the requirement that users be informed of the nature of any disclosure avoidance systems that have been implemented so that they can take measures to conduct valid estimations and reach valid inferences from the data.
Finally, the third dimension of usability, “accessibility,” refers to how difficult it is for potential data users to access the data and generate results. Accessibility may be reduced by policies that make it more difficult or time-consuming for users to access the data, such as background checks, prolonged proposal requirements, “seat fees” to conduct research in an FSRDC, and requirements to conduct research in person in an FSRDC. Accessibility is also affected by whether data users need specialized training or software to generate results from the data, which could reduce the number and types of users who can analyze SIPP data. Still another aspect of accessibility involves the disclosure review process. A major advantage of a public-use file is that once a file has been approved for release, no additional clearance process is required of either the Census Bureau or data users. To the extent that some data users may need to shift their work to other tiers of access, they will benefit from having a clear and transparent process for data review: knowing what limitations will be placed on the data release, how the disclosure review package should be prepared, and what time constraints will be involved. Overall, reductions in accessibility will likely increase inequality in data access. As accessibility decreases, potential SIPP users with the least resources (in terms of time, money, institutional support, and expertise) will be the most likely to stop using SIPP or be deterred from using it altogether. Details about the usability of the various methods suggested in this report are discussed in Chapter 9.
Conclusion 4-1: Multiple tools for disclosure avoidance are available, requiring differing levels of sophistication among data users and imposing differing levels of control.
Conclusion 4-2: To the extent that changes to the database are not sufficient to protect confidentiality while also providing reliable statistics, providing multiple tiers of access can be a solution.
Conclusion 4-3: For datasets with large numbers of variables, like the Survey of Income and Program Participation, simply adding noise to each sensitive variable prior to public release may bias analytical results, with unknown magnitude and direction. Analyses that correct for the added noise may require information about the noise distribution as well as special statistical expertise.
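A small simulation illustrates the kind of bias this conclusion refers to (Python, under classical measurement-error assumptions with hypothetical parameter values): independent noise added to an explanatory variable attenuates the estimated regression slope toward zero by a factor of roughly var(x) / (var(x) + var(noise)).

    import numpy as np

    rng = np.random.default_rng(seed=11)
    n = 100_000

    x = rng.normal(0.0, 1.0, n)            # true sensitive regressor
    y = 2.0 * x + rng.normal(0.0, 1.0, n)  # outcome with true slope 2.0

    x_noisy = x + rng.normal(0.0, 1.0, n)  # noise-infused version released to users

    slope_true = np.cov(x, y)[0, 1] / np.var(x)
    slope_noisy = np.cov(x_noisy, y)[0, 1] / np.var(x_noisy)
    print(slope_true)   # close to 2.0
    print(slope_noisy)  # close to 1.0 = 2.0 * var(x) / (var(x) + var(noise))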
Conclusion 4-4: Users will benefit by having information on disclosure limitation methods applied to the data so they will better understand how to interpret results and make valid inferences with the data.
Recommendation 4-1: The Census Bureau should modernize and strengthen disclosure limitation methods applied to the Survey of Income and Program Participation to satisfy standards on both data utility and disclosure risk while moving toward methodological transparency. In particular, the Census Bureau should continually assess how the science and technology for privacy are advancing and consider implementing new tools when advantageous.
Recommendation 4-2: Based on the results from the disclosure risk assessment exercises recommended in Chapter 3, the Census Bureau should evaluate what disclosure avoidance methods are needed to achieve acceptable levels of disclosure risk for each mode of access.
Depending on what data may be available to an intruder, respondents are never completely safe from exposure, and the only completely safe option is not to release the data at all. However, the panel views SIPP data as extremely valuable and views the release of the data as providing important public benefits. Statistical agencies historically have sought a compromise: providing strong protections of confidentiality while recognizing that some risk of disclosure remains. There is no consensus on how to define an acceptable level of risk. Even when the level of risk has been defined through formal privacy, that level has sometimes changed.3 The acceptable level of risk may vary depending on the potential uses of the data, the sensitivity of the data, and the potential harm resulting from disclosure.
Recommendation 4-3: The Census Bureau should evaluate using additional modes of access to provide products with more disclosure protection than public-use files offer but less onerous access requirements than Federal Statistical Research Data Centers impose.
___________________
3 For example, see https://www.census.gov/newsroom/press-releases/2021/2020-census-key-parameters.html