While personal data were once scattered in paper forms across multiple locations, now large amounts of data can be harvested online and combined, making it possible to develop powerful databases that can be highly comprehensive. Several commercial organizations make a practice of collecting and selling such information, and data obtained by federal agencies are also aggregated and used for a variety of other reasons, including research and criminal activities. In this context, those responsible for performing surveys face increasing challenges in protecting the identity of survey respondents and maintaining the confidentiality of their data while allowing the data to be used for their intended purposes.
The nature of SIPP data greatly complicates approaches to disclosure avoidance. For example, SIPP collects much more data than the decennial census, with approximately 2,900 variables released. Some SIPP data are highly personal (such as financial information), and the data are also highly detailed, including continuous data among the financial measures. Furthermore, the data do not just describe individuals but link each individual as part of a larger household, and they are longitudinal as well, with four years of monthly data. The measurement of substantial changes, such as a change in jobs or loss of a job, a change in marital status, or a change in the number of children, could provide key information that would be helpful in identifying a respondent.
A recent re-identification study conducted by the Census Bureau provides encouraging findings about the level of disclosure risk that SIPP poses. However, the panel views that study as too limited to provide a definitive measure of disclosure risk. It focused only on two federal databases (Internal Revenue Service [IRS] and Social Security data); though those are among the best databases, they lack information contained in commercial and state databases that also would be useful for identifying SIPP respondents. Furthermore, the study took only a limited look at how the concatenation of data from multiple people in the same household might influence disclosure risk, and it took a limited view of how measures of change over time might affect disclosure risk. Thus, there is not yet an accurate assessment of the level of risk in the release of SIPP data.
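The concern that household composition and measures of change can raise disclosure risk can be illustrated with a toy calculation. The sketch below is not the Census Bureau's method and is not a re-identification attack; it only counts how many records have a unique combination of values on a chosen set of variables. The records, variable names, and the `uniqueness_rate` helper are all hypothetical.

```python
from collections import Counter


def uniqueness_rate(records, quasi_identifiers):
    """Return the share of records whose combination of values on the
    given quasi-identifiers appears exactly once in the file."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)


# Hypothetical toy records; real SIPP files have thousands of variables.
records = [
    {"state": "OH", "age": 34, "hh_size": 4, "job_change": True},
    {"state": "OH", "age": 34, "hh_size": 4, "job_change": False},
    {"state": "OH", "age": 61, "hh_size": 2, "job_change": False},
    {"state": "TX", "age": 34, "hh_size": 4, "job_change": False},
]

# Cross-sectional variables only.
cross_sectional = uniqueness_rate(records, ["state", "age", "hh_size"])

# Adding one longitudinal change indicator.
with_change = uniqueness_rate(
    records, ["state", "age", "hh_size", "job_change"]
)

print(cross_sectional, with_change)
```

In this toy file, adding a single change indicator makes every record unique. Uniqueness within a sample overstates actual risk, since an intruder also needs an external file containing the same variables, but it suggests why longitudinal and household-level detail deserve separate attention in a risk assessment.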
Recommendation 3-1: The Census Bureau needs to establish a systematic and ongoing program for assessing the disclosure risk of Survey of Income and Program Participation data products, especially microdata public-use files. This program should include both empirical and analytical approaches to assessing the risks of re-identification attacks, comparisons with a variety of external databases, probabilistic models, and the risks present in both cross-sectional and longitudinal data. All forms of data releases should be evaluated.
Recommendation 4-2: Based on the results of the disclosure risk assessment exercises recommended in Chapter 3, the Census Bureau should evaluate what disclosure avoidance methods are needed to achieve acceptable levels of disclosure risk for each mode of access.
SIPP is widely used, including by academics, policy researchers, and multiple departments within the federal government. Its use spans multiple disciplines, including economics, political science, sociology, anthropology, and demography, and multiple policy areas, such as employment, disability, health care, and retirement. A Scopus search identified 711 journal articles using SIPP data, although this is likely to be an underestimate because Scopus checks only the title and abstract, not the full text. A Google Scholar search (which is a full-text search) produced more than 20,000 results, though this is likely to be an overcount because Google Scholar counts different versions of an article separately, and texts may mention SIPP without using SIPP data.
To assess usability, the panel chose the following key components: enabling researchers to make valid and reliable inferences, having sufficiently comprehensive data to support the desired analyses, providing access in a way that is reasonably timely and does not impose hardship, and limiting the costs or difficulties of using the data. Built within these elements is also the concept of equity: if data access depends on what resources are available to researchers, then there is a lack of equity in access. Providing access through a public-use file comes the closest to meeting these criteria and appears overwhelmingly to be the most widely used means of access to SIPP data. In the panel’s opinion, diminishing that capacity would both undercut valuable research and lessen the value of SIPP to researchers and policymakers, possibly threatening the survival of SIPP as an ongoing survey. By contrast, the primary other means of accessing the bulk of the data is through Federal Statistical Research Data Centers (FSRDCs),1 and obtaining access through FSRDCs requires appropriate approvals, payment for use of the facilities (even for online use), and a background check. The Census Bureau recently expanded virtual access to Title 13 data at FSRDCs, helping to lower the logistical barriers, but the approval process remains time-consuming and difficult.
As discussed in Chapter 1, previous National Academies of Sciences, Engineering, and Medicine reports also stated the importance of releasing a public-use file containing microdata. See Recommendations 5 (National Research Council, 1989), 6-3 (National Research Council, 1993), 3-6 (National Research Council, 2009), and 9-1 (National Academies of Sciences, Engineering, and Medicine, 2018).
Recommendation 9-3: The Census Bureau should seek to continue providing public-use files for Survey of Income and Program Participation users, assuming that appropriate disclosure avoidance techniques can be adopted and disclosive and sensitive variables are treated for disclosure avoidance. The variables to be included should depend on the results of disclosure risk studies.
The panel also considered how best to maintain usability if later re-identification studies indicate that the public-use file presents too much disclosure risk. In the panel’s opinion, one part of that answer is to continue producing a public-use data file, even if some data must be removed or otherwise altered to protect respondents’ confidentiality. For those researchers who need more data than can be provided in the public-use file, the panel concluded it is important to provide one or more alternative modes of access to the data. Those modes are discussed later in this chapter, but following are the recommendations relating to usability criteria that should be applied to those alternative tiers.
Recommendation 9-1: The Census Bureau should conduct regular assessments of the validity and reliability of estimates generated from Survey of Income and Program Participation (SIPP) data products treated for disclosure protection and communicate the results to SIPP users.
___________________
1 Some SIPP data are also available through the Synthetic Beta file, but it has only a limited number of variables.
Recommendation 9-2: When considering which access modes to prioritize, the Census Bureau should evaluate feasibility for the most common Survey of Income and Program Participation (SIPP) uses and those that exploit the unique characteristics of SIPP and could not be obtained from other datasets.
Recommendation 9-4: Given the differences in user needs and approaches, the Census Bureau should offer multiple tiers of access, with approaches for confidentiality protection applied.
Noting that the SIPP public-use files are difficult for many to use, the panel also considered whether a new tool, a table generator, should be developed. Such a tool ultimately would not meet the needs of most current SIPP data users, who tend to perform complex analyses. However, if needed measures are dropped from the public-use file for disclosure avoidance purposes, a table generator may provide a useful tool for preliminary planning and for preparing research proposals, providing easier access than secure online data access (SODA). Additionally, given the complexity of working with the SIPP public-use file, the availability of a table generator might create a new audience for the data and thus increase the total number of SIPP users. The Census Bureau’s past experience with the American FactFinder system and an abandoned longitudinal tool for SIPP, along with its current experience with query systems for the Decennial Census and American Community Survey, may be helpful in creating a table generator for SIPP. However, the complex structure of SIPP data, combined with the highly customized programming often performed by SIPP data users, makes the development of a table generator more difficult than would normally be the case.
Recommendation 7-1: The Census Bureau should assess the demand for an initial flexible table generator as a simple tool, with a specific purpose, that is designed to gauge value and provide direction for further development, and proceed with the development of one if there appears to be sufficient demand.
SIPP has two primary approaches to protecting respondents’ confidentiality. The first, labeled as traditional or classical disclosure limitation, involves modifying the data by suppressing some variables; top-coding, bottom-coding, or recoding data to combine relatively unique responses with responses that are more common; and adding noise to the data (in SIPP’s case by modifying the birth month). (Another traditional approach is to swap particular data values among selected respondents; however, swapping is not currently used by SIPP.) The second approach to protecting confidentiality in SIPP is to control researchers’ access to the data by offering the original (unmodified) data in a controlled environment (FSRDCs), in which both the researchers and the research topic must be approved and analytical results cannot be removed from the secure environment without approval by the Census Bureau’s Disclosure Review Board. A third approach, offering online access to a synthetic data file, is now used as a means to allow the blending of administrative data with selected data from SIPP, but this could be generalized as a tool for disclosure avoidance for SIPP data as well.
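As a rough illustration of the traditional techniques named above, the following sketch shows top-coding, bottom-coding, and recoding on made-up values. The thresholds, the five-year age bands, and the function names are assumptions chosen for the example; they are not the cutoffs or rules that SIPP actually applies.

```python
def top_bottom_code(values, lower, upper):
    """Clamp each value to [lower, upper] so that extreme (and therefore
    potentially identifying) values are reported at the threshold."""
    return [min(max(v, lower), upper) for v in values]


def recode_age(age, width=5):
    """Recode an exact age into a band (e.g. 37 -> '35-39') so that
    rare exact values are combined with more common ones."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical annual incomes, including one loss and one very high value.
incomes = [-5000, 32000, 58000, 1250000]
coded_incomes = top_bottom_code(incomes, lower=0, upper=150000)

ages = [37, 40, 89]
age_bands = [recode_age(a) for a in ages]

print(coded_incomes)
print(age_bands)
```

Both operations reduce the information available to an intruder, but they also change the distribution that analysts see: means and variances computed from top-coded values are biased unless the analysis accounts for the censoring.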
In the view of the panel, this largely dichotomous choice between a public-use file and FSRDCs is inadequate for the following reasons.
Protecting SIPP data from disclosure while supporting current levels of research will require creating one or more alternative approaches that provide more access to the data than will be available in the public-use file but are less onerous than FSRDCs (which still will be important for protecting the most sensitive data and for supporting the blending of data from other sources).
The remaining sections of this chapter provide a roadmap for future actions.
In the panel’s view, a solution can be found by providing additional tiers of access. There are a variety of models for how data are being safely released in restricted-use environments without requiring FSRDCs. The Census Bureau’s SIPP Synthetic Beta file is one example, though its primary purpose is to blend in administrative data, not to expand the data available from SIPP (which in fact are highly limited). The National Center for Education Statistics, Bureau of Labor Statistics, U.S. Department of Agriculture, National Center for Health Statistics, and National Center for Science and Engineering Statistics all are releasing at least some restricted-use data in ways that are less onerous than the use of FSRDCs. Their approaches require signed user agreements on how the data will be used; they also establish restrictions on where and how the data may be stored and accessed and on what data may be released. Tiered access is a way of both limiting access (not giving researchers more data than they need) and promoting access (providing data that are too disclosive to be in a public-use file, but in a more controlled environment). Creating two additional tiers would be valuable.
As discussed in Chapter 1, the creation of a secure remote access that does not require visiting an FSRDC was recommended in a previous report: Recommendation 3-8 in National Research Council (2009).
Recommendation 9-4: Given the differences in user needs and approaches, the Census Bureau should offer multiple tiers of access, with approaches for confidentiality protection applied.
Recommendation 4-1: The Census Bureau should modernize and strengthen disclosure limitation methods applied to the Survey of Income and Program Participation to satisfy standards on both data utility and disclosure risk while moving toward methodological transparency. In particular, the Census Bureau should continually assess how the science and technology for privacy are advancing and consider implementing new tools when advantageous.
Recommendation 4-3: The Census Bureau should evaluate using additional modes of access to provide products with more disclosure protection than public-use files offer but less onerous access requirements than Federal Statistical Research Data Centers impose.
Recommendation 5-1: If disclosure risk assessment studies find that the current public-use file does not provide adequate disclosure avoidance, the Census Bureau should consider secure online data access as a mode likely to support both access and security.
In the short term, SODA seems to offer the most practical alternative; it would require a change in the infrastructure but not special software or database development. Furthermore, in its proposed budget for Fiscal Year 2024 the Census Bureau is already proposing to “implement a robust virtual access program,”2 and there also are outside organizations that could help to provide the infrastructure needed, having done so for other federal agencies.
This section addresses four specific tools/approaches for disclosure avoidance: synthetic data, differential privacy, small area estimation (SAE), and disclosure review procedures.
Synthetic data consist of surrogate data that have been created through statistical modeling to preserve the statistical properties of the original data. When synthetic data are well designed, users can perform analyses on the synthetic data in the same way they had been using the original data, making valid inferences while protecting confidentiality. Synthetic data are also a useful tool for blending other data with SIPP, providing a means of benefitting from external data while keeping the external data separate. Data synthesis is a valuable tool with considerable potential to make SIPP data both accessible and safe from disclosure. However, important limitations should be noted:
___________________
2 https://www.commerce.gov/sites/default/files/2023-03/Census-FY2024-Congressional-Budget-Submission.pdf
Recommendation 6-2: The Census Bureau should develop a systematic protocol and a set of metrics for assessing and measuring the validity of inferences constructed from synthetic data.
Recommendation 6-3: To support the use of synthetic data, the Census Bureau should develop a routine automated system to perform user-submitted analyses on the internal file and send the results back to the user after they have been treated for disclosure avoidance.
Recommendation 6-4: The Census Bureau should consider creating partially synthetic datasets to allow analyses of sensitive or disclosive Survey of Income and Program Participation survey data variables that are not on or may need to be dropped from the public-use file. These synthetic variables might be added to the public-use file in place of the original variables.
Differential privacy is both a metric for measuring disclosure risk and a framework for limiting disclosure risk under this metric. With differential privacy, disclosure limitation is achieved by injecting noise into statistical computations in a way that obscures the effect of each individual data point while still allowing statistical signals about the population to come through. While differential privacy has a rich and rapidly advancing academic literature and a number of large-scale deployments in industry and government (most notably for disclosure avoidance in the 2020 U.S. Decennial Census), the statistical tools have not yet advanced to the point that they could create a differentially private synthetic dataset for the size and complexity of SIPP. In the short run, differential privacy may be most useful as a component of a table generator; in the long term, one might work toward the creation of a differentially private public-use data file.
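To make the idea of noise injection concrete, the sketch below applies the textbook Laplace mechanism to a single table cell count. It is an illustration under stated assumptions, not a description of any Census Bureau system: the function names, the cell count of 100, and the privacy-loss values (epsilon) are hypothetical, and production systems typically use discrete noise distributions and careful budget accounting rather than this floating-point version.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two independent
    exponential draws, each with mean equal to `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return a noisy count using the Laplace mechanism.

    Adding or removing one respondent changes a count by at most 1,
    so the sensitivity defaults to 1. Smaller epsilon means more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed so the illustration is repeatable

# Hypothetical cell count from a table generator query.
true_count = 100
strong_privacy = [dp_count(true_count, 0.1, rng) for _ in range(20000)]
weak_privacy = [dp_count(true_count, 1.0, rng) for _ in range(20000)]


def mean_abs_error(draws):
    return sum(abs(d - true_count) for d in draws) / len(draws)


print(mean_abs_error(strong_privacy), mean_abs_error(weak_privacy))
```

The expected absolute error equals the noise scale (sensitivity divided by epsilon), so each individual's contribution is masked while averages over many respondents remain informative. Every released statistic consumes part of the privacy budget, which is one reason a table generator with a bounded set of queries is an easier target than a full microdata file.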
Recommendation 4-1: The Census Bureau should modernize and strengthen disclosure limitation methods applied to the Survey of Income and Program Participation to satisfy standards on both data utility and disclosure risk while moving toward methodological transparency. In particular, the Census Bureau should continually assess how the science and technology for privacy are advancing and consider implementing new tools when advantageous.
Geography-based variables present a particular challenge because of their high granularity and thus potential disclosiveness, combined with their substantial analytic value for policy studies such as through comparisons of policies across states. SAE is a model-based approach to combining survey data with other information to create more accurate, and less individually identifiable, estimates for small areas (typically small geographic areas). Chapter 8 presents three levels of data availability depending on what users may need from such data.
Recommendation 8-1: The Census Bureau should continue to pursue the development of a small area estimation program to meet the needs of Survey of Income and Program Participation users for geography-based analysis that preserves confidentiality and limits disclosure risk.
The Census Bureau has a Disclosure Review Board that reviews all data products based on confidential survey data to make sure data are not released until they satisfy all agency confidentiality requirements. This is the final step in confidentiality protection. This can be a labor-intensive process, and the efforts to continue to protect SIPP data as confidentiality research progresses, developing new data products and new data-protection approaches, are likely to make this effort more burdensome and time-consuming. There are potential ways for limiting this burden:
The Census Bureau also encounters a burden when seeking to validate the accuracy of analyses based on synthetic data, and that burden has the potential to increase if some synthetic data are added to the public-use file. Again, there are potential ways for limiting this burden:
One ambiguity with regard to creating SODA is the role of Title 13 in establishing conditions for access. Here it is important to distinguish between Title 13 and the regulations that are used to implement Title 13. Title 13 itself is relatively ambiguous, prohibiting the release of identifying information and requiring that those who are given access to confidential information must be sworn to observe the limitations it specifies. The regulations that interpret Title 13 impose far stricter requirements than seem to be required by Title 13 itself—greater than those often used by federal agencies for protecting confidentiality.
Furthermore, the Census Bureau now faces a different context from the period when the regulations interpreting Title 13 were first formulated, with both the Information Quality Act and the Evidence Act making access to data an important priority. This new context is relevant when interpreting Title 13. The panel believes that it is possible to create tiered data access that is consistent with Title 13 but that accounts for differing levels of sensitivity for different data and that promotes evidence building as an appropriate goal for obtaining access. Such a reinterpretation might simplify what criteria both people and projects must meet to obtain access, what procedures are used to examine whether people and projects meet those criteria, and what levels of data protection are required depending on the specific data involved.
Recommendation 9-5: The Census Bureau should modernize its interpretation of Title 13 consistent with changes in technology, policy guidance, and legislation (i.e., the Evidence Act and the Information Quality Act). Doing this should enable the development and operationalization of tiered data access in which (a) the types and levels of protection and access vary across tiers; (b) the requirement of benefit is redefined to include evidence building, the productive use of Census Bureau–developed data, and other statistical purposes; and (c) the types of individuals who are eligible to access the data in these tiers are broadened.
As discussed in Chapter 2, SIPP data files are complex files to work with, and data users often have expressed difficulty in working with them. Data users’ needs for clear communication have been noted frequently in previous National Academies’ Committee on National Statistics (CNSTAT) reports, as listed in Chapter 1. The Census Bureau’s general communications concerning SIPP are outside of the charge given to the panel and thus are not discussed here. However, the use of disclosure avoidance techniques magnifies users’ needs for clear and comprehensive training for the following reasons.
First, both data users and the general public have shown little awareness of the Census Bureau’s efforts to protect confidentiality. Some users of the decennial census data appear not to realize that the Census Bureau has long been altering data to protect confidentiality, leading to complaints after the adoption of differential privacy that would have applied to earlier decennial censuses as well (boyd & Sarathy, 2022). A recent survey of economists shows that researchers often are unfamiliar with current approaches; for example, 54 percent had never heard of the concept of differential privacy (Williams et al., 2023). The general public is unlikely to be interested in the details of disclosure avoidance, but potential respondents will wish to know that the responses they give will be kept confidential. To avoid misperceptions of disclosure avoidance techniques used by SIPP, it is important to document both past and new procedures.
Conclusion 10-1: Many current users do not understand privacy protection procedures and how they affect the reliability of their statistical output or the methodological procedures necessary to make valid inference.
Second, to the extent that new approaches are taken, data users will need information on how to analyze and interpret the data.3 Differential privacy, if used, requires different approaches for computing standard errors, and data users will need both training and tools to perform the appropriate calculations. Most disclosure avoidance methods will require users to use new software to obtain unbiased and reliable estimates. Depending on the complexity of the disclosure avoidance method, this software could be difficult to use and not readily available. Most of the existing software “packages,” which are used by the vast majority of users of SIPP, will no longer be usable, yet it is unrealistic to think that all users will start programming new statistical procedures in Fortran, R, Python, Matlab, or whatever their favorite programming language is. In some very simple cases, where noise infusion is minimal or where synthetic data are produced using simple models, software is easily available (e.g., to use multiple imputation or to correct for errors-in-variables). However, for the vast majority of methods that are likely to be used, the necessary programming will be much more complex, and data users will need to be trained in the use of these methods.
As discussed in Chapter 6, users of synthetic data need to have communicated to them the nature of the statistical generating model, including which relationships the model captures and which it cannot, to enhance their understanding of what relationships are likely to be accurate and reliable and which are not. Also, users will need to be trained on appropriate methods for analyzing synthetic data to construct valid confidence intervals.
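One example of what such training would need to cover is how to combine results across several synthetic copies of a file. The sketch below uses the combining rule commonly given in the statistical literature for partially synthetic data, in which the total variance is the average within-copy variance plus the between-copy variance divided by the number of copies. It is offered as an assumption-laden illustration: the source does not state which rule applies to any particular SIPP product, the function name and input values are hypothetical, and the interval uses a normal approximation where the literature uses a t reference distribution.

```python
from math import sqrt
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances, z=1.96):
    """Combine point estimates and their variances from m synthetic copies.

    total variance = mean within-copy variance + between-copy variance / m
    Returns (point estimate, total variance, approximate 95% interval).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two copies, with one variance each")
    q_bar = mean(estimates)      # combined point estimate
    b = variance(estimates)      # between-copy variance (divisor m - 1)
    u_bar = mean(variances)      # average within-copy variance
    total = u_bar + b / m
    half_width = z * sqrt(total)
    return q_bar, total, (q_bar - half_width, q_bar + half_width)


# Hypothetical estimates of the same quantity from three synthetic copies.
estimates = [10.0, 12.0, 11.0]
variances = [1.0, 1.0, 1.0]
q_bar, total, interval = combine_partially_synthetic(estimates, variances)
print(q_bar, total, interval)
```

Analyzing a single synthetic copy as if it were the original data ignores the between-copy term and therefore understates uncertainty. Fully synthetic data use a different variance formula, so users need to know which kind of file they have been given.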
Third, if different levels of data are offered through different tiers of access, then data users need information to determine which tiers will be most useful for their analysis. For example, the accuracy of estimates based on synthetic data depends on how well the models used for synthesizing the data align with data users’ planned analyses. Thus, data users need information about the models that were used and the implications for their analyses.
Communication should not be in only one direction (that is, only from the Census Bureau to data users). When the panel began its work, the Census Bureau had no regular process in place for determining the needs and satisfaction of SIPP data users other than providing an email contact. For that reason, the panel conducted its own data collection. Since that time, SIPP conducted a two-day data conference in February 2023, part of which was devoted to allowing time for SIPP data users to ask questions and express their needs. The panel commends the Census Bureau for holding the conference and particularly for allowing for user dialogue.
___________________
3 The need for users to be able to conduct analyses that recognize and incorporate the statistical disclosure limitation (SDL) methods embedded in the data applies to traditional SDL methods as well. It is not clear whether users fully recognize the methods that the Census Bureau currently uses.
Communication is also important at all stages of data preparation and release. Understanding data users’ needs will help the Census Bureau to know how to prepare access modes and tools, and how best to prepare documentation that will support data users in making use of protected data. In turn, when SIPP data users are offered transparency and consultation regarding disclosure avoidance approaches, they will be better able to plan and conduct their research while mitigating tensions over disclosure avoidance approaches.
The importance of communication has been a longstanding concern, as noted in previous National Academies reports. See Recommendations 6-4 (National Research Council, 1993), 9-3 (National Academies, 2018), and 9-5 (National Academies, 2018).
Recommendation 3-2: The Census Bureau should find ways to communicate to researchers and the general public the findings of its assessments of re-identification attacks and disclosure risk so that data users will understand the need for disclosure avoidance techniques.
Recommendation 10-1: The Survey of Income and Program Participation (SIPP) program/Census Bureau should work to enhance communications and engagement with SIPP users. The Census Bureau should maintain regular contact with SIPP users through user groups, periodic surveys, and regular workshops. The Census Bureau should also solicit user input and use it in its ongoing planning for the survey.
One way of systematically seeking user feedback is through the creation of a technical advisory group. For example, the Census Bureau might request that the American Statistical Association (ASA) create a committee to partner with it; this would be a simpler and less cumbersome approach than creating a group that would fall under the auspices of the Federal Advisory Committee Act. Given that privacy-preserving techniques are rapidly evolving, such a group could also be helpful in providing guidance on new potential approaches or risks and in explaining the need for disclosure avoidance to data users.
While the Census Bureau currently has no active SIPP technical users group with which it regularly consults, at various times it has engaged with stakeholders about the quality and future of SIPP. The 1993 CNSTAT report (National Research Council, 1993) documents a few technical advisory committees: the Office of Management and Budget interagency SIPP committee, the Social Science Research Council committee, ASA’s Survey Research Methods (SRM) Working Group on Technical Aspects of SIPP, and the Association of Public Data Users SIPP Users group. During the late 1990s and early 2000s, the Census Bureau engaged SIPP users to draft SIPP Quality Profiles and User Guides.4 During the re-engineering of SIPP in the mid-2000s, the Census Bureau continually engaged with stakeholders.5 In conjunction with the release of the 2009 CNSTAT SIPP report, the Census Bureau reconstituted the ASA/SRM Working Group on Technical Aspects of SIPP, which provided it with technical assistance on the re-engineering until the new 2014 CNSTAT panel was created. In recent years, the Census Bureau has sponsored user trainings, workshops, and conferences.6
Recommendation 10-2: The Survey of Income and Program Participation program/Census Bureau should consider creating a standing technical consulting group that can provide guidance, accountability, and justification on disclosure avoidance approaches while considering the needs of data users.
Finally, the Census Bureau is not the only federal agency dealing with privacy issues. The Census Bureau sometimes has taken the lead in developing new approaches, as when it provided a synthetic version of SIPP data and when it adopted differential privacy for the 2020 decennial census, but establishing a close working relationship with other federal agencies could help in creating a systematic body of research on disclosure avoidance as well as possibly creating a consensus on best approaches. The National Secure Data Service and America’s Datahub Consortium could be helpful in this regard. While the Census Bureau has the greatest funding of all the principal federal statistical agencies, it is far from alone. Other agencies are also dealing with statistical disclosure issues, the sharing of data, and the sponsorship of research in disclosure avoidance. Collaborating with these other agencies could help in setting federal statistical standards, creating mechanisms for data access, and sponsoring research on disclosure avoidance.
The utility of SIPP as a program may be enhanced by linking the survey data with administrative data sources. While the Census Bureau has some freedom in how to interpret Title 13, the issue is more complicated when data from other agencies are considered. The restrictions placed on IRS data are especially tight, and the Census Bureau must respect the limitations placed on it by the IRS and other agencies. Thus, data merges have been restricted to data access through FSRDCs, the creation and use of SIPP Synthetic Beta, and internal use of such databases for SIPP error checking. Nevertheless, examples where researchers might wish to link SIPP with alternative data sources include these:
___________________
4 https://www.census.gov/library/working-papers/1998/demo/SEHSD-WP1998-11.html
5 https://www.census.gov/programs-surveys/sipp/about/re-engineered-sipp.html
6 For the 2023 conference, see https://www.census.gov/programs-surveys/sipp/events/2023-sipp-conference.html
Depending on what linking data are available and what requirements are placed on the data by the organizations providing them, the linking might involve either directly inserting data taken from external sources or using the external data to impute data or correct for response errors (e.g., using statistical models). Currently, to support the linking of SIPP (or other survey) data with administrative data, the Census Bureau uses the name, date of birth, sex, and address to find a Social Security number, which it then replaces with a Protected Identification Key (PIK).7 This process produces a match rate of approximately 90 percent with SIPP data, while administrative record variables are imputed when the PIK is missing.
As suggested by the examples in Chapter 2, there are many potential uses of linked data, and there are many potential sources for linking. Direct linkages are the most powerful, but they also raise the greatest privacy concerns. Since SIPP uses statistical sampling and generally does not intentionally overlap its samples with other surveys, linkages work best when the external data comprise a census rather than a sample. Some examples include tax data, Social Security, unemployment insurance, Medicare, student loans, participation in the Supplemental Nutrition Assistance Program (available for selected states), and other databases on state-provided assistance.
If data users wish to link the data themselves using the PIK, currently they would need to work through an internal Census Bureau project or a joint statistical project and then pay for the linking process. Potentially, depending on the nature of the data sharing agreement for accessing the data to be linked, the Census Bureau might create a way in which data linkages could also be provided through SODA. Sometimes, however, there may not be adequate data for linking individuals across databases, the providers of the external data may set conditions on how the data may be used, or there may be other privacy concerns, requiring the use of imputed or synthetic data rather than inserting administrative data directly.
___________________
7 B. Gurrentz, Model-based imputation & administrative records in SIPP processing, presentation to panel on June 30, 2022.
However, a different approach, which would enhance SIPP usability and perhaps increase the number of SIPP users, is for the Census Bureau (or its designee) to perform the linkages and provide a synthetic data file. This is what the Census Bureau currently does with SIPP Synthetic Beta, and the approach might be generalized to make more data accessible. Such data could even be included in the public-use file or, alternatively, be provided through SODA. Given the time and resources required to create such data, it would be important to first prioritize which variables would be most useful to researchers and policymakers, and only then to synthesize just those variables and augment the SIPP survey data. In generating these synthetic variables, it is important to maintain some key variables in the SIPP survey data. The importance of linking SIPP with other data has also been noted in previous National Academies reports, as discussed in Chapter 1. See Recommendations 4 (National Research Council, 1989), 3-5 and 3-7 (National Research Council, 2009), and 5-2 (National Academies, 2018).
Disclosure avoidance is a complex and rapidly changing field, as illustrated by the ever-expanding scientific literature on differential privacy. The Census Bureau is already conducting and funding research on disclosure avoidance, and continued research is vital. In that respect, the Census Bureau should work with researchers outside the federal government to get their perspectives and benefit from their capabilities.
Recommendation 6-1: The Census Bureau should support research that develops new approaches and techniques for generating synthetic data and assessing data utility and disclosure risk, with or without differential privacy or other privacy guarantees, and should fund work that creates and updates tools and systems that evaluate synthetic data, disclosure risk, and usability and collect data users’ feedback.
Recommendation 10-2: The Survey of Income and Program Participation program/Census Bureau should consider creating a standing technical consulting group that can provide guidance, accountability, and justification on disclosure avoidance approaches while considering the needs of data users.
Recommendation 3-3: The Census Bureau should find ways to partner with and involve external researchers and other experts in its risk assessment research program. Such involvement will expand the Census Bureau’s capacity and will allow it to tap into state-of-the-art developments in the area of disclosure risk assessment.
The panel recognizes that applying the full panoply of tools discussed in this report will take time, resources, and research. There is currently no off-the-shelf tool for creating either a satisfactory synthetic data file or a SIPP public-use file with differential privacy, especially given the large number of variables and the presence of longitudinal data. The Census Bureau has already experimented with a table generator for longitudinal SIPP data and decided it was too difficult to maintain. SIPP data will be available in the Microdata Access Tool soon, according to the Census Bureau, though the tool will include only pre-2014 SIPP data.8 Creating a full implementation of such tools at the current time would at best greatly delay the release of SIPP data and, at worst, may not yet be doable. The challenges do not stop with the creation of a database, either, but include how it is maintained. The validity of inferences from synthetic data varies depending on what models were used to generate the synthetic data and how well the data support a particular planned data analysis, meaning that validation against the original data often will be needed. It is to be hoped that the Census Bureau will be able to publish data that will indicate which kinds of analyses are most reliable and which are most problematic, helping to reduce the number of times validation is needed. Still, validation of synthetic data findings is already burdensome for Census Bureau staff, and it would become impractical if the number of synthetic data users and the types of analyses they conduct were to increase greatly. The Census Bureau might find that the best option is to partner with another organization, one that has the capacity for supporting SODA.
Conclusion 10-2: The Census Bureau might consider using outside contractors or grantees to provide some of the infrastructure required for developing and sustaining a table generator/remote analysis system, producing synthetic data, providing validation, providing secure online data access, and/or performing disclosure review.
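The validation step discussed above, in which a finding from synthetic data is checked against the original data, can be illustrated with one commonly used utility measure: the overlap between confidence intervals computed on the two files. The following is a minimal sketch under stated assumptions. The function names are hypothetical, the estimate is a simple mean with a normal-approximation interval, and the report does not say that the Census Bureau uses this particular measure.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci(values, z=1.96):
    """Normal-approximation confidence interval for a mean (needs at least two values)."""
    center = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return (center - half_width, center + half_width)


def interval_overlap(a, b):
    """Average share of each interval covered by their intersection.

    Returns 1.0 for identical intervals and 0.0 for disjoint ones.
    """
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    if high <= low:
        return 0.0
    shared = high - low
    return 0.5 * (shared / (a[1] - a[0]) + shared / (b[1] - b[0]))


def validate_mean(confidential_values, synthetic_values):
    """Compare the same estimate on both files and report the interval overlap."""
    return interval_overlap(mean_ci(confidential_values), mean_ci(synthetic_values))
```

A high overlap for a given estimate suggests that the synthetic file supports that analysis well, while a low overlap flags an analysis that would still need validation against the confidential data.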
___________________
8 Email from Holly Fee, Census Bureau, July 31, 2023.
Conducting the research and developing the tools discussed here will take substantial time, while the timely release of SIPP data remains important. Therefore, the panel recommends phasing in additional protections and access over time. The Census Bureau should continue offering a public-use file, omitting items that are determined to have too high a disclosure risk or replacing them with synthetic estimates. While a long-term solution may be the production of a fully synthetic public-use file, in the short run the most practical solution is probably to produce a somewhat less comprehensive public-use file, combined with a way to provide restricted access to additional SIPP data without imposing all the requirements that currently apply to FSRDC access.
The Census Bureau might also explore even shorter-term changes as a way of temporizing, given that it will take time to implement additional tiers of access and given the potential for disclosure risks identified in this report. One approach is to strengthen traditional SDL methods through more coarsening of the data, additional top-codes, additional geography suppression, and similar methods. Based on the Census Bureau’s re-identification exercise and on the panel’s analysis of data granularity, age and geography might pose the greatest threats; these are the two most readily available measures on external datasets with high granularity. For example, age might be recoded into a largely categorical measure, while still retaining key age thresholds for program eligibility, and state identifiers might be replaced with measures of geographic region. However, these increased protections come at the cost of reduced data utility, and any adoption of increased SDL methods should seek to minimize the loss of that utility. Another option that has been adopted in a number of European countries is a registration process for those who download the public-use file, requiring users to give identifying information and state the intended purpose of the data analysis. Alternatively or in addition, users of public-use SIPP files could be required to sign a user agreement that requires them to agree to specific uses of the data and not to disclose any data that reasonably seem confidential. User agreements are common for restricted-use files, but they can also be applied to public-use files; for example, note how the National Center for Education Statistics imposes some conditions on the use of public-use files, as discussed in Chapter 4. These options do not represent a permanent solution, and it would still be possible for a sophisticated intruder to attempt re-identification. Still, they would enhance current protections and make the general user community aware of these issues and sensitive to the problems. It is important that any user agreements or enforcement actions be implemented very carefully so as not to have a chilling effect on privacy-related and other types of research.
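The coarsening options mentioned above (recoding age into categories that keep program-eligibility thresholds, replacing state with region, and top-coding) can be sketched as follows. The specific cut points, the income cap, the field names, and the partial state-to-region table are hypothetical values chosen for illustration; they are not SIPP specifications.

```python
from bisect import bisect_right

# Hypothetical cut points; 18, 62, and 65 stand in for program-eligibility thresholds.
AGE_CUTS = [18, 25, 35, 45, 55, 62, 65, 75]

# Partial, illustrative mapping of states to Census regions.
STATE_TO_REGION = {"NY": "Northeast", "IL": "Midwest", "TX": "South", "CA": "West"}


def age_band(age):
    """Recode a single year of age into a band bounded by the cut points."""
    i = bisect_right(AGE_CUTS, age)
    if i == 0:
        return f"0-{AGE_CUTS[0] - 1}"
    if i == len(AGE_CUTS):
        return f"{AGE_CUTS[-1]}+"
    return f"{AGE_CUTS[i - 1]}-{AGE_CUTS[i] - 1}"


def top_code(value, cap):
    """Replace any value above the cap with the cap itself."""
    return min(value, cap)


def coarsen(record, income_cap=150_000):
    """Return a less granular copy of a record; the cap is an assumed value."""
    return {
        "age_band": age_band(record["age"]),
        "region": STATE_TO_REGION.get(record["state"], "Unknown"),
        "income": top_code(record["income"], income_cap),
    }
```

Because the cut points include the eligibility thresholds, an analyst can still tell whether a respondent falls on either side of each threshold even though exact age is no longer released, which is the trade-off between protection and utility that the text describes.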
Multiple National Academies reports have noted the importance of the timely release of data. Given the current state of research and technology for tools such as synthetic data and differential privacy, the Census Bureau should continue to release public-use files using current technology (which may include a limited number of synthetic variables) and with appropriate protections of confidentiality (which may require moving some data to be available through SODA). See Chapter 1 for recommendations in previous reports, namely Recommendations 5 (National Research Council, 1989), 3-6 (National Research Council, 2009; also see Conclusion 4-7), and 9-1 (National Academies, 2018).
Considering which steps might be taken quickly and which will require more time, following is a roadmap with suggested milestones for implementing the changes recommended in this report.