Suggested Citation: "10 Conclusions and Recommendations." National Academies of Sciences, Engineering, and Medicine. 2024. A Roadmap for Disclosure Avoidance in the Survey of Income and Program Participation. Washington, DC: The National Academies Press. doi: 10.17226/27169.

10

Conclusions and Recommendations

While personal data were once scattered in paper forms across multiple locations, now large amounts of data can be harvested online and combined, making it possible to develop powerful databases that can be highly comprehensive. Several commercial organizations make a practice of collecting and selling such information, and data obtained by federal agencies are also aggregated and used for a variety of other reasons, including research and criminal activities. In this context, those responsible for performing surveys face increasing challenges in protecting the identity of survey respondents and maintaining the confidentiality of their data while allowing the data to be used for their intended purposes.

THE SURVEY OF INCOME AND PROGRAM PARTICIPATION (SIPP) AND THE RISK OF DISCLOSURE

The nature of SIPP data greatly complicates approaches to disclosure avoidance. For example, SIPP collects much more data than the decennial census, with approximately 2,900 variables released. Some SIPP data are highly personal (such as financial information), and the data are also highly detailed, including continuous data among the financial measures. Furthermore, the data do not just describe individuals but link each individual as part of a larger household, and they are longitudinal as well, with four years of monthly data. The measurement of substantial changes, such as a change in jobs or loss of a job, a change in marital status, or a change in the number of children, could provide key information that would be helpful in identifying a respondent.

A recent re-identification study conducted by the Census Bureau provides encouraging findings about the level of disclosure risk that SIPP poses. However, the panel views that study as too limited to provide a definitive measure of disclosure risk. It focused only on two federal databases (Internal Revenue Service [IRS] and Social Security data); though those are among the best databases, they lack information contained in commercial and state databases that also would be useful for identifying SIPP respondents. Furthermore, the study took only a limited look at how the concatenation of data from multiple people in the same household might influence disclosure risk, and it took a limited view of how measures of change over time might affect disclosure risk. Thus, there is not yet an accurate assessment of the level of risk in the release of SIPP data.
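
As a rough illustration of why added detail such as measures of change can raise disclosure risk, the sketch below counts how many records are unique on a set of quasi-identifiers. Sample uniqueness is only a crude proxy for re-identification risk and is not the method used in the Census Bureau study; the function name, the variables, and the toy records are all hypothetical.

```python
from collections import Counter


def sample_unique_share(records, quasi_identifiers):
    """Return the share of records whose quasi-identifier combination is unique."""
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys) if keys else 0.0


# Hypothetical toy records; these are not SIPP data.
records = [
    {"state": "OH", "age": 34, "job_change": True},
    {"state": "OH", "age": 34, "job_change": False},
    {"state": "OH", "age": 34, "job_change": False},
    {"state": "TX", "age": 51, "job_change": False},
]

# Share of unique records using cross-sectional variables only.
cross_sectional = sample_unique_share(records, ["state", "age"])

# Share of unique records once a measure of change is added.
with_change = sample_unique_share(records, ["state", "age", "job_change"])
```

In this toy example, adding the change indicator doubles the share of unique records. A fuller assessment would also need to consider external databases, household structure, and probabilistic matching.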

Recommendation 3-1: The Census Bureau needs to establish a systematic and ongoing program for assessing the disclosure risk of Survey of Income and Program Participation data products, especially microdata public-use files. This program should include both empirical and analytical approaches to assessing the risks of re-identification attacks, comparisons with a variety of external databases, probabilistic models, and the risks present in both cross-sectional and longitudinal data. All forms of data releases should be evaluated.

Recommendation 4-2: Based on the results of the disclosure risk assessment exercises recommended in Chapter 3, the Census Bureau should evaluate what disclosure avoidance methods are needed to achieve acceptable levels of disclosure risk for each mode of access.

SIPP USE AND USABILITY

SIPP is widely used, including by academics, policy researchers, and multiple departments within the federal government. Its use spans multiple disciplines, including economics, political science, sociology, anthropology, and demography, and multiple policy areas, such as employment, disability, health care, and retirement. A Scopus search identified 711 journal articles using SIPP data, although this is likely to be an underestimate because Scopus checks only the title and abstract, not the full text. A Google Scholar search (which is a full-text search) produced more than 20,000 results, though this is likely to be an overcount because Google Scholar counts different versions of an article separately, and texts may mention SIPP without using SIPP data.

To assess usability, the panel chose the following key components: enabling researchers to make valid and reliable inferences, having sufficiently comprehensive data to support the desired analyses, providing access in a way that is reasonably timely and does not impose hardship, and limiting the costs or difficulties of using the data. Built within these elements is also the concept of equity: if data access depends on what resources are available to researchers, then there is a lack of equity in access. Providing access through a public-use file comes the closest to meeting these criteria and appears overwhelmingly to be the most widely used means of access to SIPP data. In the panel’s opinion, diminishing that capacity would both undercut valuable research and lessen the value of SIPP to researchers and policymakers, possibly threatening the survival of SIPP as an ongoing survey. By contrast, the primary other means of accessing the bulk of the data is through Federal Statistical Research Data Centers (FSRDCs),¹ and obtaining access through FSRDCs involves the need to get appropriate approvals, financial costs for using the facilities (even for online use), and a background check. The Census Bureau recently expanded virtual access to Title 13 data at FSRDCs, helping to lower the logistical barriers, but the approval process still remains time-consuming and difficult.

As discussed in Chapter 1, previous National Academies of Sciences, Engineering, and Medicine reports also stated the importance of releasing a public-use file containing microdata. See Recommendations 5 (National Research Council, 1989), 6-3 (National Research Council, 1993), 3-6 (National Research Council, 2009), and 9-1 (National Academies of Sciences, Engineering, and Medicine, 2018).

Recommendation 9-3: The Census Bureau should seek to continue providing public-use files for Survey of Income and Program Participation users, assuming that appropriate disclosure avoidance techniques can be adopted and disclosive and sensitive variables are treated for disclosure avoidance. The variables to be included should depend on the results of disclosure risk studies.

The panel also considered how best to maintain usability if later re-identification studies indicate that the public-use file presents too much disclosure risk. In the panel’s opinion, one part of that answer is to continue producing a public-use data file, even if some data must be removed or otherwise altered to protect respondents’ confidentiality. For those researchers who need more data than can be provided in the public-use file, the panel concluded it is important to provide one or more alternative modes of access to the data. Those modes are discussed later in this chapter, but following are the recommendations relating to usability criteria that should be applied to those alternative tiers.

Recommendation 9-1: The Census Bureau should conduct regular assessments of the validity and reliability of estimates generated from Survey of Income and Program Participation (SIPP) data products treated for disclosure protection and communicate the results to SIPP users.

___________________

¹ Some SIPP data are also available through the Synthetic Beta file, but it has only a limited number of variables.

Recommendation 9-2: When considering which access modes to prioritize, the Census Bureau should evaluate feasibility for the most common Survey of Income and Program Participation (SIPP) uses and those that exploit the unique characteristics of SIPP and could not be obtained from other datasets.

Recommendation 9-4: Given the differences in user needs and approaches, the Census Bureau should offer multiple tiers of access, with approaches for confidentiality protection applied.

Noting that the SIPP public-use files are difficult for many to use, the panel also considered whether a new tool, a table generator, should be developed. Such a tool ultimately would not meet the needs of most current SIPP data users, who tend to perform complex analyses. However, if needed measures are dropped from the public-use file for disclosure avoidance purposes, a table generator may provide a useful tool for preliminary planning and for preparing research proposals, providing easier access than secure online data access (SODA). Additionally, given the complexity of working with the SIPP public-use file, the availability of a table generator might create a new audience for the data and thus increase the total number of SIPP users. The Census Bureau’s past experience with the American FactFinder system and an abandoned longitudinal tool for SIPP, along with its current experience with query systems for the Decennial Census and American Community Survey, may be helpful in creating a table generator for SIPP. However, the complex structure of SIPP data, combined with the highly customized programming often performed by SIPP data users, makes the development of a table generator more difficult than would normally be the case.

Recommendation 7-1: The Census Bureau should assess the demand for an initial flexible table generator as a simple tool, with a specific purpose, that is designed to gauge value and provide direction for further development, and proceed with the development of one if there appears to be sufficient demand.

CURRENT SIPP APPROACHES TO DISCLOSURE AVOIDANCE

SIPP has two primary approaches to protecting respondents’ confidentiality. The first, labeled as traditional or classical disclosure limitation, involves modifying the data by suppressing some variables; top-coding, bottom-coding, or recoding data to combine relatively unique responses with responses that are more common; and adding noise to the data (in SIPP’s case by modifying the birth month). (Another traditional approach is to swap particular data values among selected respondents; however, swapping is not currently used by SIPP.) The second approach to protecting confidentiality in SIPP is to control researchers’ access to the data by offering the original (unmodified) data in a controlled environment (FSRDCs), in which both the researchers and the research topic must be approved and analytical results cannot be removed from the secure environment without approval by the Census Bureau’s Disclosure Review Board. A third approach, offering online access to a synthetic data file, is now used as a means to allow the blending of administrative data with selected data from SIPP, but this could be generalized as a tool for disclosure avoidance for SIPP data as well.
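
The traditional techniques named above can be sketched in a few lines. This is a minimal illustration only: the thresholds, the band width, and the one-month shift are hypothetical choices, and the actual rules SIPP applies are not described here.

```python
import random


def top_code(value, cap):
    """Replace any value above the cap with the cap itself."""
    return min(value, cap)


def bottom_code(value, floor):
    """Replace any value below the floor with the floor itself."""
    return max(value, floor)


def recode_age(age, width=5):
    """Collapse single years of age into bands, for example 37 -> '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb_birth_month(month, rng, max_shift=1):
    """Shift a birth month (1-12) by a random amount, wrapping around the year."""
    shift = rng.randint(-max_shift, max_shift)
    return (month - 1 + shift) % 12 + 1


rng = random.Random(0)  # fixed seed so the illustration is repeatable
masked_income = top_code(250_000, cap=150_000)   # hypothetical cap
masked_loss = bottom_code(-5_000, floor=0)       # hypothetical floor
age_band = recode_age(37)
noisy_month = perturb_birth_month(12, rng)
```

Each function trades detail for protection: the analyst loses the tail of the income distribution, exact ages, and exact birth months, which is why the choice of thresholds matters for usability.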

In the view of the panel, this largely dichotomous choice between a public-use file and FSRDCs is inadequate for the following reasons.

The public-use file already has important limitations. For example, it suppresses data on country of birth and immigration status, and it seems likely that more extensive re-identification studies would find disclosure risks both from highly granular data (such as state identifiers) and the provision of data on changes over time and on household combinations; addressing these risks will limit the value of the public-use data further. Given current trends, the amount of data available from external sources that might be matched with SIPP data can be expected to increase, creating new disclosure risks that are not present today and resulting in new limitations on the public-use file.
FSRDCs are difficult and costly to use. For that reason, they would not be a practical option for many researchers. Furthermore, if a large number of current public-use file users were to shift to FSRDCs, this would be likely to create a burden on the Census Bureau in terms of reviewing the data for public release, along with possibly requiring changes to the infrastructure for making the data available.

Protecting SIPP data from disclosure while supporting current levels of research will require creating one or more alternative approaches that provide more access to the data than will be available in the public-use file but are less onerous than FSRDCs (which still will be important for protecting the most sensitive data and for supporting the blending of data from other sources).

The remaining sections of this chapter provide a roadmap for future actions.

TIERS OF ACCESS AS A TOOL FOR PROTECTING PRIVACY

In the panel’s view, a solution can be found by providing additional tiers of access. There are a variety of models for how data are being safely released in restricted-use environments without requiring FSRDCs. The Census Bureau’s SIPP Synthetic Beta file is one example, though its primary purpose is to blend in administrative data, not to expand the data available from SIPP (which in fact are highly limited). The National Center for Education Statistics, Bureau of Labor Statistics, U.S. Department of Agriculture, National Center for Health Statistics, and National Center for Science and Engineering Statistics all are releasing at least some restricted-use data in ways that are less onerous than the use of FSRDCs. Their approaches require signed user agreements on how the data will be used; they also establish restrictions on where and how the data may be stored and accessed and on what data may be released. Tiered access is a way of both limiting access (not giving researchers more data than they need) and promoting access (providing data that are too disclosive to be in a public-use file, but in a more controlled environment). Creating two additional tiers would be valuable.

A remote table generator would provide users with ways of producing their own tables. At the same time, it would limit what data could be output (e.g., cells with too few observations) and add noise to protect confidentiality. Most current users would probably not be completely satisfied with a remote table generator because they often engage in complex analyses that such a generator would not support. However, they may still find a table generator useful for preparing research proposals. Additionally, the availability of a remote table generator may result in new SIPP data users among those who find the current files too difficult to work with. Potentially, the table generator could evolve into a remote analysis system with more capabilities, such as regression analysis.
SODA would provide a way of making data available in a controlled environment. This would be useful for those situations in which data are not available (or not at sufficient granularity) in the public-use file. It could be designed to prevent the printing of output, requiring that data users first go through a disclosure review process. It also typically involves user agreements in which users accept limitations on what data can be released and acknowledge penalties for the improper release of data. Note that this is different from the online access that the Census Bureau currently provides through FSRDCs. As compared with FSRDCs, SODA would offer simpler, faster, and less costly processes for applying for access, getting approvals for access, and conducting disclosure review for public data release.
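
The output controls described for a remote table generator can be illustrated with a small sketch. It assumes a two-way frequency table, a hypothetical minimum cell size of 5, and small integer noise; none of these parameters come from the Census Bureau, and a real system would also need to guard against complementary disclosure and repeated-query attacks.

```python
import random
from collections import Counter


def protected_table(records, row_var, col_var, min_cell=5, noise=1, seed=None):
    """Build a two-way count table, suppressing small cells and perturbing the rest."""
    rng = random.Random(seed)
    counts = Counter((rec[row_var], rec[col_var]) for rec in records)
    table = {}
    for cell, n in counts.items():
        if n < min_cell:
            table[cell] = None  # suppressed: too few observations to publish
        else:
            # Add small integer noise; never publish a negative count.
            table[cell] = max(0, n + rng.randint(-noise, noise))
    return table


# Hypothetical toy records; these are not SIPP data.
records = (
    [{"employment": "employed", "program": "SNAP"}] * 6
    + [{"employment": "unemployed", "program": "SNAP"}] * 2
)
table = protected_table(records, "employment", "program", seed=0)
```

Here the cell with two observations is withheld, while the cell with six observations is published with a count of 5, 6, or 7.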

As discussed in Chapter 1, the creation of a secure remote access that does not require visiting an FSRDC was recommended in a previous report: Recommendation 3-8 in National Research Council (2009).

Recommendation 9-4: Given the differences in user needs and approaches, the Census Bureau should offer multiple tiers of access, with approaches for confidentiality protection applied.

Recommendation 4-1: The Census Bureau should modernize and strengthen disclosure limitation methods applied to the Survey of Income and Program Participation to satisfy standards on both data utility and disclosure risk while moving toward methodological transparency. In particular, the Census Bureau should continually assess how the science and technology for privacy are advancing and consider implementing new tools when advantageous.

Recommendation 4-3: The Census Bureau should evaluate using additional modes of access to provide products with more disclosure protection than public-use files offer but less onerous access requirements than Federal Statistical Research Data Centers impose.

Recommendation 5-1: If disclosure risk assessment studies find that the current public-use file does not provide adequate disclosure avoidance, the Census Bureau should consider secure online data access as a mode likely to support both access and security.

In the short term, SODA seems to offer the most practical alternative; it would require a change in the infrastructure but not special software or database development. Furthermore, in its proposed budget for Fiscal Year 2024, the Census Bureau is already proposing to “implement a robust virtual access program,”² and outside organizations with experience doing so for other federal agencies could help to provide the needed infrastructure.

___________________

² https://www.commerce.gov/sites/default/files/2023-03/Census-FY2024-Congressional-Budget-Submission.pdf

ADDITIONAL TOOLS TO SUPPORT DISCLOSURE AVOIDANCE

This section addresses four specific tools/approaches for disclosure avoidance: synthetic data, differential privacy, small area estimation (SAE), and disclosure review procedures.

Synthetic Data

Synthetic data consist of surrogate data that have been created through statistical modeling to preserve the statistical properties of the original data. When synthetic data are well designed, users can perform analyses on the synthetic data in the same way they had been using the original data, making valid inferences while protecting confidentiality. Synthetic data are also a useful tool for blending other data with SIPP, providing a means of benefitting from external data while keeping the external data separate. Data synthesis is a valuable tool with considerable potential to make SIPP data both accessible and safe from disclosure. However, important limitations should be noted:

There is no off-the-shelf software currently available for creating a synthetic data file with the size and complexity of SIPP data. Thus, it is only practical to synthesize a select number of variables—namely, those that are most sensitive or that are based on blended data.
It is possible that the synthetic models will simulate the data so accurately that the resulting data may contain disclosure risks. Thus, a synthetic data file should itself be examined through tools such as re-identification studies.
The validity of inferences from synthetic data depends on the modeling process used to create these data and how well that process supports a particular type of analysis. Thus, data users need guidance on when to use synthetic data and may need ways to validate their findings based on the original data.
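
To make the idea of partial synthesis concrete, the sketch below replaces one sensitive variable (income) with draws from a simple linear model of income on age, leaving the other variable untouched. The model, the variable names, and the toy values are assumptions for illustration; a production synthesizer would use far richer models and would typically also propagate uncertainty in the model parameters.

```python
import random
import statistics


def synthesize_income(ages, incomes, seed=None):
    """Replace incomes with draws from a least-squares line of income on age."""
    rng = random.Random(seed)
    mean_x = statistics.fmean(ages)
    mean_y = statistics.fmean(incomes)
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, incomes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Spread of the observed incomes around the fitted line.
    residuals = [y - (intercept + slope * x) for x, y in zip(ages, incomes)]
    sigma = statistics.pstdev(residuals)
    # Each synthetic income is the fitted value plus fresh random noise.
    return [intercept + slope * x + rng.gauss(0, sigma) for x in ages]


# Hypothetical toy values; these are not SIPP data.
ages = [25, 32, 41, 48, 55, 63]
incomes = [31_000, 42_000, 58_000, 61_000, 75_000, 70_000]
synthetic_incomes = synthesize_income(ages, incomes, seed=1)
```

The sketch also shows both limitations noted above. Any relationship the model omits (here, anything other than a linear age effect) is lost in the synthetic values, and if the model fits almost perfectly, the residual spread shrinks toward zero and the synthetic values come close to reproducing the originals.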

Recommendation 6-2: The Census Bureau should develop a systematic protocol and a set of metrics for assessing and measuring the validity of inferences constructed from synthetic data.

Recommendation 6-3: To support the use of synthetic data, the Census Bureau should develop a routine automated system to perform user-submitted analyses on the internal file and send the results back to the user after they have been treated for disclosure avoidance.

Recommendation 6-4: The Census Bureau should consider creating partially synthetic datasets to allow analyses of sensitive or disclosive Survey of Income and Program Participation survey data variables that are not on or may need to be dropped from the public-use file. These synthetic variables might be added to the public-use file in place of the original variables.

Differential Privacy

Differential privacy is both a metric for measuring disclosure risk and a framework for limiting disclosure risk under this metric. With differential privacy, disclosure limitation is achieved by injecting noise into statistical computations in a way that obscures the effect of each individual data point while still allowing statistical signals about the population to come through. While differential privacy has a rich and rapidly advancing academic literature and a number of large-scale deployments in industry and government (most notably for disclosure avoidance in the 2020 U.S. Decennial Census), the statistical tools have not yet advanced to the point that they could create a differentially private synthetic dataset for the size and complexity of SIPP. In the short run, differential privacy may be most useful as a component of a table generator; in the long term, one might work toward the creation of a differentially private public-use data file.
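
One standard way to inject such noise is the Laplace mechanism, sketched below for a single count query of the kind a table generator might answer. The privacy-loss parameter `epsilon`, the function names, and the toy records are assumptions for illustration, and this is not a description of any Census Bureau system. A deployed system would also need to track the privacy budget across queries and use a noise generator hardened against floating-point weaknesses.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Return a noisy count that satisfies epsilon-differential privacy."""
    true_count = sum(1 for rec in records if predicate(rec))
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records; these are not SIPP data.
records = [{"receives_benefit": i % 4 == 0} for i in range(200)]
rng = random.Random(0)
noisy = dp_count(records, lambda rec: rec["receives_benefit"], epsilon=1.0, rng=rng)
```

Smaller values of `epsilon` give stronger protection but noisier answers, which is the utility trade-off that users would need to account for when computing standard errors.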

Recommendation 4-1: The Census Bureau should modernize and strengthen disclosure limitation methods applied to the Survey of Income and Program Participation to satisfy standards on both data utility and disclosure risk while moving toward methodological transparency. In particular, the Census Bureau should continually assess how the science and technology for privacy are advancing and consider implementing new tools when advantageous.

Small Area Estimation

Geography-based variables present a particular challenge because of their high granularity and thus potential disclosiveness, combined with their substantial analytic value for policy studies, such as comparisons of policies across states. SAE is a model-based approach to combining survey data with other information to create more accurate, and less individually identifiable, estimates for small areas (typically small geographic areas). Chapter 8 presents three levels of data availability depending on what users may need from such data.
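
One common form of SAE is a composite estimator that shrinks a noisy direct survey estimate toward a model-based prediction, in the style of an area-level model. The sketch below assumes that form for illustration; it does not represent the model the Census Bureau would adopt, and the numbers are hypothetical.

```python
def composite_estimate(direct, direct_var, model_pred, model_var):
    """Combine a direct survey estimate with a model prediction for one area.

    The weight on the direct estimate is model_var / (model_var + direct_var),
    so areas with noisy direct estimates lean more heavily on the model.
    """
    weight = model_var / (model_var + direct_var)
    return weight * direct + (1 - weight) * model_pred


# Hypothetical small area: a noisy direct poverty-rate estimate of 0.20
# and a model prediction of 0.12 based on auxiliary information.
estimate = composite_estimate(
    direct=0.20, direct_var=0.01, model_pred=0.12, model_var=0.0025
)
```

Because the published figure blends information across areas and sources, it depends less on the handful of respondents in any one small area than the direct estimate does.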

Recommendation 8-1: The Census Bureau should continue to pursue the development of a small area estimation program to meet the needs of Survey of Income and Program Participation users for geography-based analysis that preserves confidentiality and limits disclosure risk.

Disclosure Review Board

The Census Bureau has a Disclosure Review Board that reviews all data products based on confidential survey data to make sure data are not released until they satisfy all agency-confidentiality requirements. This is the final step in confidentiality protection. This can be a labor-intensive process, and the efforts to continue to protect SIPP data as confidentiality research progresses, developing new data products and new data-protection approaches, are likely to make this effort more burdensome and time consuming. There are potential ways for limiting this burden:

The Census Bureau might design a table generator that automatically implements disclosure protections to protect the data so that they can be released.
The Census Bureau may consider the relatively quick review procedures that have been developed by some external organizations.

The Census Bureau also encounters a burden when seeking to validate the accuracy of analyses based on synthetic data, and that burden has the potential to increase if some synthetic data are added to the public-use file. Again, there are potential ways for limiting this burden:

Potentially, the Census Bureau could provide sufficient guidelines on the validity of synthetic data in various situations so that some categories of investigation would not need validation against the original data.
During the validation process (when used), the Census Bureau may be able to confirm that the synthetic data fall within acceptable parameters, without replacing the synthetic data estimates with estimates from the original data.

TITLE 13 AND ITS REQUIREMENTS FOR DISCLOSURE AVOIDANCE

One ambiguity with regard to creating SODA is the role of Title 13 in establishing conditions for access. Here it is important to distinguish between Title 13 and the regulations that are used to implement Title 13. Title 13 itself is relatively ambiguous, prohibiting the release of identifying information and requiring that those who are given access to confidential information must be sworn to observe the limitations it specifies. The regulations that interpret Title 13 impose far stricter requirements than seem to be required by Title 13 itself—greater than those often used by federal agencies for protecting confidentiality.

Furthermore, the Census Bureau now faces a different context from the period when the regulations interpreting Title 13 were first formulated, with both the Information Quality Act and the Evidence Act making access to data an important priority. This new context is relevant when interpreting Title 13. The panel believes that it is possible to create tiered data access that is consistent with Title 13 but that accounts for differing levels of sensitivity for different data and that promotes evidence building as an appropriate goal for obtaining access. Such a reinterpretation might simplify what criteria both people and projects must meet to obtain access, what procedures are used to examine whether people and projects meet those criteria, and what levels of data protection are required depending on the specific data involved.

Recommendation 9-5: The Census Bureau should modernize its interpretation of Title 13 consistent with changes in technology, policy guidance, and legislation (i.e., the Evidence Act and the Information Quality Act). Doing this should enable the development and operationalization of tiered data access in which (a) the types and levels of protection and access vary across tiers; (b) the requirement of benefit is redefined to include evidence building, the productive use of Census Bureau–developed data, and other statistical purposes; and (c) the types of individuals who are eligible to access the data in these tiers are broadened.

THE BENEFITS OF ENHANCED COMMUNICATION

As discussed in Chapter 2, SIPP data files are complex files to work with, and data users often have expressed difficulty in working with them. Data users’ needs for clear communication have been noted frequently in previous National Academies’ Committee on National Statistics (CNSTAT) reports, as listed in Chapter 1. The Census Bureau’s general communications concerning SIPP are outside of the charge given to the panel and thus are not discussed here. However, the use of disclosure avoidance techniques magnifies users’ needs for clear and comprehensive training for the following reasons.

First, both data users and the general public have shown little awareness of the Census Bureau’s efforts to protect confidentiality. Some users of the decennial census data appear not to realize that the Census Bureau has long been altering data to protect confidentiality, leading to complaints after the adoption of differential privacy that would also have applied to earlier decennial censuses (boyd & Sarathy, 2022). A recent survey of economists shows that researchers often are unfamiliar with current approaches; for example, 54 percent had never heard of the concept of differential privacy (Williams et al., 2023). The general public is unlikely to be interested in the details of disclosure avoidance, but potential respondents will wish to know that the responses they give will be kept confidential. To avoid misperceptions of disclosure avoidance techniques used by SIPP, it is important to document both past and new procedures.

Conclusion 10-1: Many current users do not understand privacy protection procedures and how they affect the reliability of their statistical output or the methodological procedures necessary to make valid inference.

Second, to the extent that new approaches are taken, data users will need information on how to analyze and interpret the data.³ Differential privacy, if used, requires different approaches for computing standard errors, and data users will need both training and tools to perform the appropriate calculations. Most disclosure avoidance methods will require users to use new software to obtain unbiased and reliable estimates. Depending on the complexity of the disclosure avoidance method, this software could be difficult to use and not readily available. Most of the existing software “packages,” which are used by the vast majority of users of SIPP, will no longer be usable, yet it is unrealistic to think that all users will start programming new statistical procedures in Fortran, R, Python, MATLAB, or whatever their favorite programming language is. In some very simple cases, where noise infusion is minimal or where synthetic data are produced using simple models, software is easily available (e.g., to use multiple imputation or correction for errors-in-variables). However, for the vast majority of methods that are likely to be used, the necessary programming will be much more complex, and data users will need to be trained in the use of these methods.

As discussed in Chapter 6, users of synthetic data need to have communicated to them the nature of the statistical generating model, including which relationships the model captures and which it cannot, to enhance their understanding of what relationships are likely to be accurate and reliable and which are not. Also, users will need to be trained on appropriate methods for analyzing synthetic data to construct valid confidence intervals.
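
As a rough illustration of what such training would cover, the sketch below applies the combining rules commonly used when an agency releases several partially synthetic copies of a file: the analyst runs the same analysis on each copy and then pools the results. The function name and the example numbers are hypothetical. The variance formula shown applies to partially synthetic data only; fully synthetic data and multiple imputation for missing data use different formulas, and the normal-approximation interval is a simplification.

```python
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances):
    """Pool results from m partially synthetic datasets.

    estimates: point estimate of the same quantity from each synthetic dataset
    variances: the estimated variance of that point estimate in each dataset
    Returns (pooled estimate, total variance), where
    total variance = average within-dataset variance + between-dataset variance / m.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two datasets and one variance per estimate")
    q_bar = mean(estimates)          # pooled point estimate
    b = variance(estimates)          # between-dataset variance (m - 1 denominator)
    u_bar = mean(variances)          # average within-dataset variance
    return q_bar, u_bar + b / m


# Hypothetical results from three synthetic copies of the same file.
estimate, total_var = combine_partially_synthetic([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
half_width = 1.96 * total_var ** 0.5   # large-sample normal approximation (assumption)
interval = (estimate - half_width, estimate + half_width)
```

Analyzing a single synthetic copy as if it were the original data would omit the between-dataset term and understate uncertainty.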

Third, if different levels of data are offered through different tiers of access, then data users need information to determine which tiers will be most useful for their analysis. For example, the accuracy of estimates based on synthetic data depends on how well the models used for synthesizing the data align with data users’ planned analyses. Thus, data users need information about the models that were used and the implications for their analyses.

Communication should not run in only one direction, that is, only from the Census Bureau to data users. When the panel began its work, the Census Bureau had no regular process in place for determining the needs and satisfaction of SIPP data users other than providing an email contact. For that reason, the panel conducted its own data collection. Since that time, the SIPP program held a two-day data conference in February 2023, part of which was devoted to allowing time for SIPP data users to ask questions and express their needs. The panel commends the Census Bureau for holding the conference and particularly for allowing for user dialogue.

___________________

³ The need for users to be able to conduct analyses that recognize and incorporate the statistical disclosure limitation (SDL) methods embedded in the data applies to traditional SDL methods as well. It is not clear whether users fully recognize the methods that the Census Bureau currently uses.

Communication is also important at all stages of data preparation and release. Understanding data users’ needs will help the Census Bureau to know how to prepare access modes and tools, and how best to prepare documentation that will support data users in making use of protected data. In turn, when SIPP data users are offered transparency and consultation regarding disclosure avoidance approaches, they will be better able to plan and conduct their research while mitigating tensions over disclosure avoidance approaches.

The importance of communication has been a longstanding concern, as noted in previous National Academies reports. See Recommendations 6-4 (National Research Council, 1993), 9-3 (National Academies, 2018), and 9-5 (National Academies, 2018).

Recommendation 3-2: The Census Bureau should find ways to communicate to researchers and the general public the findings of its assessments of re-identification attacks and disclosure risk so that data users will understand the need for disclosure avoidance techniques.

Recommendation 10-1: The Survey of Income and Program Participation (SIPP) program/Census Bureau should work to enhance communications and engagement with SIPP users. The Census Bureau should maintain regular contact with SIPP users through user groups, periodic surveys, and regular workshops. The Census Bureau should also solicit user input and use it in its ongoing planning for the survey.

One way of systematically seeking user feedback is through the creation of a technical advisory group. For example, the Census Bureau might request that the American Statistical Association (ASA) create a committee to partner with it; this would be a simpler and less cumbersome approach than creating a group that would fall under the auspices of the Federal Advisory Committee Act. Given that privacy-preserving techniques are rapidly evolving, such a group could also be helpful in providing guidance on new potential approaches or risks and in explaining the need for disclosure avoidance to data users.

While the Census Bureau currently has no active SIPP technical users group with which it regularly consults, at various times it has engaged with stakeholders about the quality and future of SIPP. The 1993 CNSTAT report (National Research Council, 1993) documents a few technical advisory committees: the Office of Management and Budget interagency SIPP committee, the Social Science Research Council committee, ASA’s Survey Research Methods (SRM) Working Group on Technical Aspects of SIPP, and the Association of Public Data Users SIPP Users group. During the late 1990s and early 2000s, the Census Bureau engaged SIPP users to draft SIPP Quality Profiles and User Guides.⁴ During the re-engineering of SIPP in the mid-2000s, the Census Bureau continually engaged with stakeholders.⁵ In conjunction with the release of the 2009 CNSTAT SIPP report, the Census Bureau reconstituted the ASA/SRM Working Group on Technical Aspects of SIPP, which provided it with technical assistance on the re-engineering until the new 2014 CNSTAT panel was created. In recent years, the Census Bureau has sponsored user trainings, workshops, and conferences.⁶

Recommendation 10-2: The Survey of Income and Program Participation program/Census Bureau should consider creating a standing technical consulting group that can provide guidance, accountability, and justification on disclosure avoidance approaches while considering the needs of data users.

Finally, the Census Bureau is not the only federal agency dealing with privacy issues. The Census Bureau sometimes has taken the lead in developing new approaches, as when it provided a synthetic version of SIPP data and when it adopted differential privacy for the 2020 decennial census, but establishing a close working relationship with other federal agencies could help in creating a systematic body of research on disclosure avoidance as well as possibly creating a consensus on best approaches. The National Secure Data Service and America’s Datahub Consortium could be helpful in this regard. While the Census Bureau has the greatest funding of all the principal federal statistical agencies, it is far from alone. Other agencies are also dealing with statistical disclosure issues, the sharing of data, and the sponsorship of research in disclosure avoidance. Collaborating with these other agencies could help in setting federal statistical standards, creating mechanisms for data access, and sponsoring research on disclosure avoidance.

LINKING SIPP DATA WITH ALTERNATIVE DATA SOURCES

The utility of SIPP as a program may be enhanced by linking the survey data with administrative data sources. While the Census Bureau has some freedom in how to interpret Title 13, the issue is more complicated when data from other agencies are considered. The restrictions placed on IRS data are especially tight, and the Census Bureau must respect the limitations placed on it by the IRS and other agencies. Thus, data merges have been restricted to data access through FSRDCs, the creation and use of SIPP Synthetic Beta, and internal use of such databases for SIPP error checking.

___________________

⁴ https://www.census.gov/library/working-papers/1998/demo/SEHSD-WP1998-11.html

⁵ https://www.census.gov/programs-surveys/sipp/about/re-engineered-sipp.html

⁶ For the 2023 conference, see https://www.census.gov/programs-surveys/sipp/events/2023-sipp-conference.html

Nevertheless, examples where researchers might wish to link SIPP with alternative data sources include these:

To improve data quality, such as when administrative data are used to correct respondent errors or omissions;
To supplement SIPP with data that otherwise would not be available on SIPP; and
Conversely, to supplement external data with data available on SIPP.

Depending on what linking data are available and what requirements are placed on the data by the organizations providing them, the linking might involve either directly inserting data taken from external sources or using the external data to impute data or correct for response errors (e.g., using statistical models). Currently, to support the linking of SIPP (or other survey) data with administrative data, the Census Bureau uses the name, date of birth, sex, and address to find a Social Security number, which it then replaces with a Protected Identification Key (PIK).⁷ This process produces a match rate of approximately 90 percent with SIPP data, while administrative record variables are imputed when the PIK is missing.
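
The general shape of such a linkage step can be sketched as follows. This is a toy illustration, not the Census Bureau's actual procedure: it uses exact matching on normalized fields (production systems typically rely on probabilistic matching), and the field names, the reference table, and the key values are all invented for the example. It shows a survey record being matched on name, date of birth, sex, and address, its direct identifiers being dropped in favor of a protected key, and the match rate being computed.

```python
ID_FIELDS = ("name", "dob", "sex", "address")  # hypothetical field names


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())


def link_key(record: dict) -> tuple:
    """Build the lookup key from the identifying fields."""
    return tuple(normalize(record[field]) for field in ID_FIELDS)


def assign_keys(survey_records: list, reference: dict) -> list:
    """Replace direct identifiers with a protected key (None when no match is found)."""
    linked = []
    for record in survey_records:
        kept = {k: v for k, v in record.items() if k not in ID_FIELDS}
        kept["pik"] = reference.get(link_key(record))
        linked.append(kept)
    return linked


def match_rate(linked: list) -> float:
    """Share of survey records that received a protected key."""
    if not linked:
        return 0.0
    return sum(r["pik"] is not None for r in linked) / len(linked)


# Invented reference table and survey records.
reference = {("ana diaz", "1980-02-01", "f", "1 elm st"): "K0001"}
survey = [
    {"name": "Ana  Diaz", "dob": "1980-02-01", "sex": "F", "address": "1 Elm St", "income": 52000},
    {"name": "Lee Wong", "dob": "1975-07-15", "sex": "M", "address": "9 Oak Ave", "income": 61000},
]
linked = assign_keys(survey, reference)
rate = match_rate(linked)
```

Records left with no key are the ones for which, as described above, administrative variables would have to be imputed.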

As suggested by the examples in Chapter 2, there are many potential uses of linked data, and there are many potential sources for linking. Direct linkages are the most powerful, but they also raise the greatest privacy concerns. Since SIPP uses statistical sampling and generally does not intentionally overlap its samples with other surveys, linkages work best when the external data comprise a census rather than a sample. Some examples include tax data, Social Security, unemployment insurance, Medicare, student loans, participation in the Supplemental Nutrition Assistance Program (available for selected states), and other databases on state-provided assistance.

If data users wish to link the data themselves using the PIK, currently they would need to work through an internal Census Bureau project or a joint statistical project and then pay for the linking process. Potentially, depending on the nature of the data sharing agreement for accessing the data to be linked, the Census Bureau might create a way in which data linkages could also be provided through SODA. Sometimes, however, there may not be adequate data for linking individuals across databases, the providers of the external data may set conditions on how the data may be used, or there may be other privacy concerns, requiring the use of imputed or synthetic data rather than inserting administrative data directly.

___________________

⁷ B. Gurrentz, Model-based imputation & administrative records in SIPP processing, presentation to panel on June 30, 2022.

However, a different approach, which would enhance SIPP usability and perhaps increase the number of SIPP users, is for the Census Bureau (or its designee) to perform the linkages and provide a synthetic data file. This is what the Census Bureau currently does with SIPP Synthetic Beta, and the approach might be generalized to make more data accessible. Such data could even be included in the public-use file or, alternatively, be provided through SODA. Given the time and resources required to create such data, it would be important to first prioritize which variables would be most useful to researchers and policymakers, and only then to synthesize just those variables and augment the SIPP survey data. In generating these synthetic variables, it is important to maintain some key variables in the SIPP survey data. The importance of linking SIPP with other data has also been noted in previous National Academies reports, as discussed in Chapter 1. See Recommendations 4 (National Research Council, 1989), 3-5 and 3-7 (National Research Council, 2009), and 5-2 (National Academies, 2018).

FUTURE RESEARCH TO ADDRESS CURRENT KNOWLEDGE GAPS

Disclosure avoidance is a complex and rapidly changing field, as illustrated by the ever-expanding scientific literature on differential privacy. The Census Bureau is already conducting and funding research on disclosure avoidance, and continued research is vital. In that respect, the Census Bureau should work with researchers outside the federal government to get their perspectives and benefit from their capabilities.

Recommendation 6-1: The Census Bureau should support research that develops new approaches and techniques for generating synthetic data and assessing data utility and disclosure risk, with or without differential privacy or other privacy guarantees, and should fund work that creates and updates tools and systems that evaluate synthetic data, disclosure risk, and usability and collect data users’ feedback.

Recommendation 10-2: The Survey of Income and Program Participation program/Census Bureau should consider creating a standing technical consulting group that can provide guidance, accountability, and justification on disclosure avoidance approaches while considering the needs of data users.

Recommendation 3-3: The Census Bureau should find ways to partner with and involve external researchers and other experts in its risk assessment research program. Such involvement will expand the Census Bureau’s capacity and will allow it to tap into state-of-the-art developments in the area of disclosure risk assessment.

CENSUS BUREAU RESOURCES FOR ADDRESSING DISCLOSURE AVOIDANCE

The panel recognizes that applying the full panoply of tools discussed in this report will take time, resources, and research. There is currently no off-the-shelf tool for creating either a satisfactory synthetic data file or a SIPP public-use file with differential privacy, especially given the large number of variables and the presence of longitudinal data. The Census Bureau has already experimented with a table generator for longitudinal SIPP data and decided it was too difficult to maintain. SIPP data will be available in the Microdata Access Tool soon, according to the Census Bureau, though it will only include pre-2014 SIPP data.⁸ Creating a full implementation of such tools at the current time would at best greatly delay the release of SIPP data and, at worst, may not yet be doable. The challenges do not stop with the creation of a database, either, but include how it is maintained. The validity of inferences from synthetic data varies depending on what models were used to generate the synthetic data and how well the data support a particular planned data analysis, meaning that validation against the original data often will be needed. It is to be hoped that the Census Bureau will be able to publish data that will indicate which kinds of analyses are most reliable and which are most problematic, helping to reduce the number of times validation is needed. Still, validation of synthetic data findings is already burdensome for Census Bureau staff, and it would become impractical if the number of synthetic data users, and the range of analyses they conduct, were to increase greatly. The Census Bureau might find that the best option is to partner with another organization, one that has the capacity for supporting SODA.

Conclusion 10-2: The Census Bureau might consider using outside contractors or grantees to provide some of the infrastructure required for developing and sustaining a table generator/remote analysis system, producing synthetic data, providing validation, providing secure online data access, and/or performing disclosure review.

A ROADMAP FOR ADDRESSING DISCLOSURE AVOIDANCE IN SIPP

Conducting the research and developing the tools discussed here will take substantial time, while the timely release of SIPP data remains important. Therefore, the panel recommends phasing in additional protections and access over time. The Census Bureau should continue offering a public-use file, omitting items that are determined to have too high a disclosure risk or replacing them with synthetic estimates. While a long-term solution may be the production of a fully synthetic public-use file, in the short run the most practical solution is probably to produce a somewhat less comprehensive public-use file, combined with a way to provide restricted access to additional SIPP data without imposing all the requirements that currently apply to FSRDC access.

___________________

⁸ Email from Holly Fee, Census Bureau, July 31, 2023.

The Census Bureau might also explore even shorter-term changes as a way of temporizing, given that it will take time to implement additional tiers of access and given the potential for disclosure risks identified in this report. One approach is to strengthen traditional SDL methods through more coarsening of the data, additional top-codes, additional geography suppression, and similar methods. Based on the Census Bureau’s re-identification exercise and on the panel’s analysis of data granularity, age and geography might pose the greatest threats; these are the two most readily available measures on external datasets with high granularity. For example, age might be recoded into a largely categorical measure, while still retaining key age thresholds for program eligibility, and state identifiers might be replaced with measures of geographic region. However, these increased protections come at the cost of reduced data utility, and any adoption of increased SDL methods should seek to minimize the loss of that utility. Another option that has been adopted in a number of European countries is a registration process for those who download the public-use file, requiring users to give identifying information and state the intended purpose of the data analysis. Alternatively or in addition, users of public-use SIPP files could be required to sign a user agreement that requires them to agree to specific uses of the data and to not disclose any data that reasonably seem confidential. User agreements are common for restricted-use files, but they can also be applied to public-use files; for example, note how the National Center for Education Statistics imposes some conditions on the use of public-use files, as discussed in Chapter 4. These options do not represent a permanent solution, and it would still be possible for a sophisticated intruder to attempt re-identification. Still, they would enhance current protections and make the general user community aware of these issues and sensitive to the problems. It is important that any user agreements or enforcement actions be implemented very carefully so as not to have a chilling effect on privacy-related and other types of research.
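
The coarsening described in this paragraph (age bands that keep eligibility thresholds, regions in place of states, and top-coding) can be sketched as below. Everything specific in the sketch is an assumption made for illustration: the cutpoints (chosen to keep ages such as 18, 62, and 65 as band boundaries), the four-state excerpt of the region lookup, the field names, and the income cap. None of these reflects actual SIPP processing rules.

```python
import bisect

# Hypothetical band boundaries; 18, 62, and 65 are kept because they are
# common program-eligibility ages.
AGE_CUTPOINTS = [15, 18, 25, 35, 45, 55, 62, 65, 75]

# Small excerpt of a state-to-region lookup, for illustration only.
STATE_TO_REGION = {"NY": "Northeast", "IL": "Midwest", "TX": "South", "CA": "West"}


def age_band(age: int) -> str:
    """Recode single years of age into a band such as '62-64' or '75+'."""
    i = bisect.bisect_right(AGE_CUTPOINTS, age)
    lower = 0 if i == 0 else AGE_CUTPOINTS[i - 1]
    if i == len(AGE_CUTPOINTS):
        return f"{lower}+"
    return f"{lower}-{AGE_CUTPOINTS[i] - 1}"


def top_code(value: float, cap: float) -> float:
    """Replace values above the cap with the cap itself."""
    return min(value, cap)


def coarsen(record: dict, income_cap: float) -> dict:
    """Return a coarsened copy of one record; the input is left unchanged."""
    return {
        "age_band": age_band(record["age"]),
        "region": STATE_TO_REGION.get(record["state"], "Other/unknown"),
        "income": top_code(record["income"], income_cap),
    }


protected = coarsen({"age": 64, "state": "CA", "income": 250000}, income_cap=150000)
```

The utility cost noted above is visible even in this toy case: an analysis needing exact age, state-level variation, or the upper tail of income could no longer be run on the coarsened record.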

Multiple National Academies reports have noted the importance of the timely release of data. Given the current state of research and technology for tools such as synthetic data and differential privacy, the Census Bureau should continue to release public-use files using current technology (which may include a limited number of synthetic variables) and with appropriate protections of confidentiality (which may require moving some data to be available through SODA). See Chapter 1 for recommendations in previous reports, namely Recommendations 5 (National Research Council, 1989), 3-6 (National Research Council, 2009; also see Conclusion 4-7), and 9-1 (National Academies, 2018).

Considering which steps might be taken quickly and which will require more time, the following roadmap suggests milestones for implementing the changes recommended in this report.

Milestone 1

Write a risk assessment plan, including plans on approaches to conduct risk assessments; when, how often, and how to conduct them; and who should conduct them, including support from external researchers/experts.
Write a communication plan for disclosure risk results, which may include plans for seminars for data users, providing example text for written communication.
Assess the likely use of a simple and flexible table generator using the public-use file, and whether it would broaden the user base.
Begin planning for establishing SODA, including an evaluation of what is needed to satisfy Title 13, the Evidence Act, and the Information Quality Act.

Milestone 2

Implement an ongoing risk assessment program for SIPP, one that includes consideration of its longitudinal structure, and determine what changes may be needed to the public-use file to protect confidentiality.
Continue publishing the SIPP public-use file, modified as necessary based on the ongoing risk assessment, including, on an experimental basis, partial synthesis of a limited set of variables deemed too disclosive.
Depending on the results of the assessment conducted under Milestone 1, prepare the design of a flexible table generator.
Plan for and begin developing SODA to provide access to data/synthetic estimates that are omitted in the public-use files.
Consider how to work with external partners to deal with issues such as risk assessment, hosting SODA, and performing disclosure review.
Begin developing geography-level model-based estimates and evaluate them using data utility metrics; consider a differentially private framework.
Investigate multitiered modes of access, including the public-use file, synthetic data, SAE, SODA, and a remote analysis platform.
Coordinate with other federal agencies, including the Social Security Administration and the IRS, to determine what data can be shared (including synthetic data) and through what means.

Milestone 3

Continue refining the SIPP public-use file, as necessary, based on the ongoing risk assessment. (Note: this should be a continuing activity throughout the release of public-use files.)
Implement a flexible table generator that uses the differential privacy framework if sufficient demand for it has been established.
Explore the feasibility, design, and usefulness of a partial synthetic file of a larger set of variables, including the possible incorporation of formal privacy, and evaluate its accuracy with data utility metrics.
Implement SODA.

Milestone 4

Obtain feedback from data users on the experimental partially synthetic dataset if one has been deemed useful.
Expand the flexible table generator to become a remote analysis platform based on the public-use file, possibly including regression modeling, if work on it in Milestone 3 has continued.

Milestone 5

Add differential privacy or some other formal privacy framework to the remote analysis platform, infusing noise under the framework and laying the foundation for automated verification and validation servers.

Milestone 6

Publicly release a partially synthetic file with an automated verification and validation server.
Given the creation of a verification and validation server, decide whether to adjust the remote analysis platform so the partial synthetic file can be used to provide consistency between data products, or else continue with the formal privacy version.
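
The noise infusion mentioned under Milestones 3 and 5 can be illustrated, in its most basic form, by the Laplace mechanism applied to a table of counts. The sketch below is an assumption-laden toy rather than a design for the table generator: it handles unweighted counts only, assumes each respondent falls in exactly one cell and that neighboring datasets differ by adding or removing one respondent (so the sensitivity is 1), and ignores survey weights, longitudinal structure, and the accounting of privacy loss across repeated queries. The function names and example table are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_table(counts: dict, epsilon: float, rng: random.Random) -> dict:
    """Add independent Laplace noise with scale 1 / epsilon to every cell count.

    Assumption: sensitivity is 1 (unweighted counts, one cell per respondent).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    return {cell: n + laplace_noise(scale, rng) for cell, n in counts.items()}


# Hypothetical table of unweighted counts released at epsilon = 1.0.
rng = random.Random(2024)
table = {"employed": 120, "unemployed": 15, "not in labor force": 65}
released = noisy_table(table, epsilon=1.0, rng=rng)
```

Smaller values of epsilon give stronger protection and noisier cells, which is the privacy and utility tradeoff that any production system would need to set deliberately.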