This chapter first provides an overview of the way disclosure risk is measured, followed by a look at the specific disclosure-risk challenges that the Survey of Income and Program Participation (SIPP) faces.
Disclosure risk can be defined in multiple ways, but the most common types include reconstruction, re-identification, and membership inference. Disclosure risk is a major concern in data privacy and confidentiality when it comes to sharing SIPP data for research or public-use purposes. If it is not managed properly, sensitive information, such as income, health insurance status, and fertility history, could be revealed, with significant consequences for the individuals involved.
Reconstruction. First proposed by Dinur and Nissim (2003), this type of disclosure risk occurs when an intruder takes all of the published statistics and solves a large system of equations or an optimization problem to reconstruct the underlying microdata as well as possible. Intuitively, if the published statistics are sufficiently accurate and outnumber the unknowns in the underlying microdata, reconstruction is very likely to be effective, because the attacker is in effect solving a system with more equations than unknowns. Once the underlying microdata are reconstructed, re-identification attacks can be applied.
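As a toy sketch of this idea (entirely hypothetical attributes and published counts, not drawn from any actual release), the following Python snippet enumerates every candidate dataset of four binary records and keeps only those consistent with a handful of published counts. Because the counts here overdetermine the records, exactly one multiset of records survives:

```python
from itertools import product

# Hypothetical published statistics for a toy dataset of n = 4 people,
# each described by two binary attributes: (age >= 65, income = high).
n = 4
published = {
    "total": 4,
    "n_old": 1,        # respondents with age >= 65
    "n_high": 2,       # respondents with high income
    "n_old_high": 1,   # respondents with age >= 65 AND high income
}

def stats(dataset):
    """Compute the published statistics for a candidate dataset."""
    return {
        "total": len(dataset),
        "n_old": sum(old for old, high in dataset),
        "n_high": sum(high for old, high in dataset),
        "n_old_high": sum(old and high for old, high in dataset),
    }

record_types = list(product([0, 1], repeat=2))  # all possible records

# Keep every candidate dataset (as a sorted multiset) that reproduces
# all of the published statistics exactly.
candidates = {
    tuple(sorted(ds))
    for ds in product(record_types, repeat=n)
    if stats(ds) == published
}
print(candidates)  # one multiset: the microdata are fully reconstructed
```

Real reconstruction attacks replace this brute-force search with integer programming or SAT solvers over far larger tables, but the principle is the same: enough accurate statistics pin down the underlying microdata.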
Re-identification. When microdata (datasets where each record corresponds to an individual data subject) are released, the most immediate threat is re-identification, which can occur when an attacker can link a specific individual to one of the released data records. This is typically done by matching the released record on a set of characteristics that make the individual unique and are known to the attacker through auxiliary datasets or through direct knowledge of the specific individual. For example, combinations of indirect identifiers, such as state of residence, age, race/ethnicity, sex, educational attainment, and employment status, could be known to the intruder and be unique in both the sample and the population. Occupation is also collected at a highly granular level in SIPP; corresponding general external databases lack the same level of granularity, but the highly granular SIPP data still may be useful in identifying individuals. The risk is particularly acute when data are highly granular and some combinations are rare; for example, Asian males are relatively uncommon among flight attendants (Ruggles et al., 2023), so adding just a few more characteristics may uniquely identify such a person. By re-identifying an individual, an attacker learns other previously unknown characteristics about that individual. A large research literature shows that re-identification risk is pervasive in datasets containing many attributes per individual, and that virtually any attribute can serve as a potential identifier (Narayanan & Shmatikov, 2010). There are also computational and statistical methods for profiling individuals in data collected and released over time, which increases the risk for longitudinal studies such as SIPP (Tournier & de Montjoye, 2022). See below for more on how re-identification risk can be estimated.
Membership inference. Yet another kind of attack on aggregate statistics is a membership inference attack, whereby an attacker uses published statistics and data about a particular individual to determine with high statistical confidence whether that individual was in the dataset used to compute the statistics. The possibility of membership inference attacks demonstrated by Homer et al. (2008) led the National Institutes of Health to restrict access to genomic summaries in the Database of Genotypes and Phenotypes (Couzin, 2008). Membership inference is a concern because sometimes being in a dataset, or a particular subset of a dataset, itself reveals sensitive information, such as when the dataset consists of individuals diagnosed with a particular disease (a “case group”). More recently, membership inference attacks have been developed against machine learning models (e.g., Carlini et al., 2022; Shokri et al., 2017). With black box access to a machine learning model, an attacker may be able to determine, with high statistical confidence, whether an individual was in the dataset used to train the model.
Methods for performing all these kinds of attacks are becoming increasingly sophisticated with time.
A different kind of disclosure risk appears when a respondent reveals that they completed the survey. This allows a more direct search for a person in the sample with particular characteristics; it potentially removes the uncertainty, otherwise introduced by statistical sampling, as to whether a person with a unique combination of characteristics is actually that particular respondent; and it negates the privacy-loss reduction that sampling provides under differential privacy. Such self-disclosure can have the effect of negating the disclosure avoidance protections established by the Census Bureau. Surveys have sometimes encouraged people to acknowledge their participation as a way of encouraging others to participate; in this case it may be better for the Census Bureau to advise people of the risks of self-disclosure.
Disclosure risk can be measured using a probabilistic disclosure risk measure (Duncan et al., 2011; Duncan & Lambert, 1986, 1989; Lambert, 1993; McClure & Reiter, 2012), that is, the probability that an individual’s sensitive information is revealed or inferred, or that an individual is re-identified based on attributes and patterns in a dataset, through the use of statistical, machine learning, or computational techniques. The disclosure risk can be quantified using either an absolute or a relative scale. Absolute disclosure risk measures how likely it is that the intruder learns the sensitive information about a respondent in a dataset from some released information, given the intruder’s prior knowledge (see examples given in Chapter 2). Relative disclosure risk measures the intruder’s relative gain in ability to identify the respondent, beyond the intruder’s prior knowledge. Though both measures are useful in assessing risk, the relative disclosure risk can be controlled by the data producer, whose only decision is whether or not to release a particular dataset. The absolute disclosure risk, on the other hand, provides additional information on the level of risk after release: for example, it shows the data releaser whether the risk after release is one in 100 versus one in one million.
When microdata or individual-level data are released and the intruder’s goal is to re-identify a target individual in the released data, probabilistic re-identification risk can be measured as the probability of correctly identifying a “sample unique” (i.e., an individual in the sample data who has a unique set of attribute values) given a set of variables considered as quasi-identifiers. Probabilistic disclosure risk measures, which are often estimated through statistical modeling, are interpretable, and one can estimate the risk for each individual and every sensitive attribute in a dataset. Based on probabilistic disclosure risk assessment results, individuals at high risk of disclosure can be identified and corresponding disclosure limitation methods can be developed to reduce the risk.
Assessments of both types of disclosure risks, absolute and relative, require making assumptions about the intruder’s prior knowledge and
specification of the intruder’s model for learning an individual’s sensitive information (the re-identification attack method discussed below necessarily makes just such assumptions). This implies that the probabilistic disclosure risk measurements would vary by the assumptions and the intruder model specifications. In addition, it cannot be guaranteed that probabilistic disclosure risk can be controlled at a prespecified level a priori, given a dataset when designing or implementing a disclosure limitation procedure; rather, the risk is assessed in a post-hoc manner before and after a disclosure limitation procedure is applied. In summary, probabilistic disclosure risk depends on the prior knowledge of data intruders, the type and characteristics of the dataset from which information will be released, and the data releasing mechanism itself.
Differential privacy is a broad framework for privacy protection, as formulated in 2006 (Dwork, 2006; Dwork, McSherry, et al., 2006, 2017). This topic has since been the subject of intensive research and has been used by both large technology companies (Abadi et al., 2016; Differential Privacy Team, 2017; Ding et al., 2017; Erlingsson et al., 2014; Messing et al., 2020; Wilson et al., 2020; Yousefpour et al., 2021) and by the Census Bureau for the 2020 decennial census and other data releases (Abowd et al., 2022; Foote et al., 2019; Machanavajjhala et al., 2008). Within this framework, researchers can conceptualize a relative measure of disclosure risk, whereby instead of comparing an intruder’s knowledge before and after a data release, they compare the intruder’s knowledge after the data release to a counterfactual scenario in which the release is performed with any one individual’s data removed. Intuitively, every individual’s privacy is protected because what is released is indistinguishable from what would have been released if that individual had opted out from the survey or study. For a disclosure avoidance method or mechanism to satisfy differential privacy, this “indistinguishability property” must hold for every potential dataset and for every individual represented in the dataset. This requirement is achieved by carefully injecting small amounts of random noise into data and/or statistical computations to obscure the effect of each individual data point while still allowing a statistical signal about the population to come through. This leads to a privacy-utility tradeoff (e.g., see Slavković & Seeman, 2023, for a general discussion), whereby more noise provides stronger privacy protections but less accurate statistics, and less noise provides more accurate statistics but weaker privacy protections. 
The level of privacy protection is measured in differential privacy (and its variants) by one or more privacy-loss parameters, often denoted by the Greek letter epsilon (ε), which bounds the relative increase in risk compared to the counterfactual opt-out scenario.
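For reference, the standard formal statement (in the unbounded, remove-one-individual sense described above) is that a randomized mechanism $M$ satisfies $\varepsilon$-differential privacy if, for every pair of datasets $D$ and $D'$ differing by the removal of one individual’s record and every set $S$ of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S].
```

Smaller values of $\varepsilon$ force the two output distributions closer together, making the released data nearly indistinguishable from the counterfactual in which the individual opted out.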
A number of variants of differential privacy are commonly used, including in the 2020 decennial census (Abowd et al., 2022). One common variant, known as “bounded differential privacy,” requires indistinguishability when any one individual’s data are changed, as opposed to being removed as in the original notion described above (which is sometimes called “unbounded differential privacy” for contrast). Bounded differential privacy is suitable when the number of observations is publicly known, and under this assumption it can allow for some modest improvements in utility. Unbounded differential privacy has the benefit of not requiring this assumption and, in particular, also applies when analyzing subsets of a dataset, whose size may be unknown even if the overall size is known. Gong and Meng (2020) show that using bounded differential privacy makes it easier to formally compare probabilistic disclosure risk measures with differential privacy. Another axis along which differential privacy variants differ is how indistinguishability or privacy loss is quantitatively measured. The basic version, where privacy loss is typically denoted by a single parameter ε, is called “pure” differential privacy, with other variants known as “approximate,” “concentrated,” “Rényi,” and “Gaussian” differential privacy. Interpretations of these variants can be found in Kifer et al. (2022), and many other variants are described in Desfontaines and Pejó (2020).
One important feature that all the variants of differential privacy share is transparency. They do not allow the privacy protections to rely on the secrecy of the noise infusion mechanism used to protect the data. Thus, the exact mechanism, including the parameters of all noise distributions used, can be made available to the public, as was done for the 2020 decennial census (U.S. Census Bureau, 2021).1 (This transparency is in contrast to some traditional approaches to disclosure limitation, such as swapping, where the swap rate is often kept as a secret parameter. Such secrecy is an intentional part of protecting confidentiality, but it also leaves the data user with a limited ability to adjust for the impact on data analysis of changes made to the data.) Transparency allows for public input into the design of the algorithm, such as what form of statistical utility or types of analysis it is optimized for, and allows data users to estimate the impact of the noise infusion on the accuracy and uncertainty of their analysis.
One can design general randomized mechanisms that satisfy differential privacy guarantees, independent of local data or data intruders’ prior knowledge or models, and then apply those mechanisms to release statistics calculated from datasets so that the privacy loss from them is automatically
___________________
1 DAS_2020_Redistricting_Production_Code, https://github.com/uscensusbureau/DAS_2020_Redistricting_Production_Code
controlled at a prespecified level. By contrast, statistical disclosure limitation (SDL) procedures can provide a certain level of protection, but it is only quantifiable after they are applied. Probabilistic disclosure risk can be assessed before and after a disclosure limitation technique is applied, including a differential privacy mechanism, to evaluate the effectiveness of the procedure in a post-hoc manner.
It is important to recognize that different disclosure avoidance mechanisms, including noise infusion to attain the differential privacy metric (see Chapter 4), may differ in their limitation of the alternative disclosure risks. For example, Abowd and Vilhuber (2008) examine the relation between differential privacy and probabilistic disclosure risk in the setting of releasing multinomial data. Gong and Meng (2020) use the ε-differential privacy mechanism, which imposes the same direct bound on both the differential privacy disclosure risk measure and the relative probabilistic disclosure risk defined in McClure and Reiter (2012) but only bounds the absolute probabilistic disclosure risk measure in proportion to the disclosure risk in the absence of the released data (i.e., the prior risk). These differences are assessed by McClure and Reiter (2012) in the context of releasing fully synthetic microlevel data (see Chapter 4 for a definition of synthetic data) that satisfies ε-differential privacy at different levels of ε and under different assumptions about what information is available to potential intruders.
Based on their simulations, McClure and Reiter (2012) note several findings about the relationship between the use of differential privacy mechanisms to bound absolute disclosure risk and their use to measure relative probabilistic disclosure risk. First, these two measures of risk do not necessarily provide the same answer to the level of risk. Their simulations show that relative risk can be low while absolute risk remains high. The simulations also demonstrate that absolute risk may be quite low, in which case the data steward can allow a higher level of relative risk as long as absolute risk stays below a chosen level. Second, the authors argue that data stewards may wish to evaluate disclosure risk over a wide range of possible levels of intruder information to determine the worst-case scenarios for absolute disclosure risk. Third, given their findings, they propose a strategy for using both measures of disclosure risk, combining the assumption of strong prior knowledge by intruders (which underlies the differential privacy approach) with the interpretability and data-specific relevance of a fuller range of statistical risk measures.
Finally, the relevance of different measures of disclosure risk noted above may differ depending on the objectives of data curators, such as the Census Bureau. This matter is discussed in Hotz et al. (2022). For example, Hotz et al. note that schemes like that proposed in Abowd and Schmutte (2019)—in which a data curator seeks to design and calibrate formal privacy disclosure limitation mechanisms that trade privacy protection for
the benefits of alternative uses of the data—require assessing the absolute disclosure risks. By contrast, differential privacy mechanisms, which embody a different privacy-utility tradeoff, rely on the differential privacy measure of disclosure risk.
It is challenging to develop satisfactory definitions and measures of disclosure risk that both achieve the desired privacy protection and also allow for useful statistical releases. Here the panel enumerates some of those challenges.
Disclosure risk can be either computed analytically or estimated empirically. Analytically, measuring disclosure risk means using a statistical analysis or proof to bound the disclosure risk measures described above. For absolute and relative disclosure risk, this requires carefully modeling the data distribution, the intruder’s prior knowledge, and the inferences one seeks to prevent. Assessing differential privacy requires much less modeling, as it is primarily a matter of determining the unit of privacy protection, such as individuals, households, or firms, and then determining how those units are represented in the data; the privacy loss can then generally be bounded using a mathematical proof.
Empirically measuring disclosure risk means carrying out experiments using existing attack methods to see how vulnerable a given dataset and disclosure avoidance mechanism are to those attacks. The results obtained, while informative, are only lower bounds on risk. That is, if the experimental attacks succeed, then one knows that the disclosure avoidance mechanism is insufficient; but if they fail, it remains possible that an attacker using different auxiliary data sources or attack methodology could still violate confidentiality. For this reason, it is important that the experiments be as comprehensive as possible, drawing on a wide range of auxiliary data sources (e.g., federal administrative data, state databases, and commercial databases) and using the most sophisticated attack methods known.
When measuring disclosure risk empirically or using statistical modeling, it is important to explicitly consider different intruder models. The auxiliary data sources, attack methodology, and level of tolerable disclosure risk may differ widely depending, for example, on whether the attacker is a foreign intelligence agency trying to manipulate the U.S. electorate, a commercial data aggregator trying to acquire data to monetize, a domestic abuser trying to learn about their victim’s behaviors, or a researcher evaluating the effectiveness of a government program under the supervision of an Institutional Review Board. For each mode of access an agency provides, the kinds of adversaries who could gain access may differ. Assumptions about the intruder are key considerations in every measurement of disclosure risk.
This section provides an overview of attack scenarios and disclosure risk measures for measuring the risk of re-identification. Disclosure risk assessment methodologies have been developed for using external data, for example through record linkage, and also through the use of probabilistic modeling without the availability of external data.
A matching exercise that uses common variables in an attempt to link the records in one file with the same cases in an external file can be used to quantify re-identification risk. For example, such an exercise might try to link the planned public-use file, along with sufficient information to determine if a match was correct, with public administrative data or commercially available databases. If exact matching does not work well, probabilistic record linkage (Fellegi & Sunter, 1969; Winkler, 2006) can replace exact matching to quantify the likelihood of successfully linking a case. Exact matching might not work well if common variables are defined slightly differently, if data points are collected at different points in time in the two files, or due to measurement or sampling error, among other things. The records and variables with high re-identification risk can be targeted for perturbation in the risk mitigation process. The percentage of correct matches or the percentage of records whose best match is the true record can be reported as the risk measure.
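The matching exercise can be sketched as follows (hypothetical records and variable names; the id field stands in for the ground truth that an evaluator, but not a real intruder, would hold for scoring):

```python
# Exact matching on quasi-identifiers between a released file and an
# external file, scoring the share of records correctly re-identified.
released = [
    {"id": 1, "state": "MD", "age": 34, "sex": "F", "occ": "nurse"},
    {"id": 2, "state": "MD", "age": 34, "sex": "F", "occ": "teacher"},
    {"id": 3, "state": "VA", "age": 51, "sex": "M", "occ": "pilot"},
]
external = [  # e.g., a commercial database covering part of the population
    {"id": 3, "state": "VA", "age": 51, "sex": "M", "occ": "pilot"},
    {"id": 1, "state": "MD", "age": 34, "sex": "F", "occ": "nurse"},
]
keys = ("state", "age", "sex", "occ")

# Index the external file by its quasi-identifier combination.
index = {tuple(r[k] for k in keys): r["id"] for r in external}

# A link counts as a correct re-identification only if the linked
# record is truly the same person.
hits = sum(1 for r in released
           if index.get(tuple(r[k] for k in keys)) == r["id"])
rate = hits / len(released)
print(f"{hits} of {len(released)} records re-identified ({rate:.0%})")
```

Probabilistic record linkage generalizes this sketch by scoring near-matches instead of requiring exact agreement on every key variable.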
As an example of this setting, early work by Yancey et al. (2002) and Domingo-Ferrer et al. (2001) used probabilistic and distance-based record linkage to scope out the risk of re-identification in public-use files. The linkages were carried out between the public-use file and the original confidential microdata, and a quantitative measure, such as the distance between the public-use file record and the correct record, was used to quantify disclosure risk. This approach generally overestimates disclosure risk, since it assumes that intruders have access to the original confidential data. In addition, it does not take into account the protection afforded by sampling, which is a benefit when survey inclusion status is unknown. The approach has limitations: it depends on the choice of quasi-identifiers, and it assesses only the specific strategies of probabilistic and distance-based record linkage. However, it is a good way to measure risk in the case of a “nosy household member” scenario—that is, a situation where someone may know the true and reported values of indirect identifiers in the sample.
More recent work on measuring the risk of re-identification has been based on intruder penetration testing and simulating attack scenarios based
on freely available public information, such as Facebook, Google Maps, and consumer/lifestyle databases that can be purchased. This approach requires a deep understanding of the data environment and of what information in the public domain could enable an attack on the public-use file through linkages. See Tudor et al. (2014) for an example of an application.
Another approach for measuring the risk of re-identification is based on the concept of the “special unique” developed in Elliot et al. (2005) and implemented in the Special Uniques Detection Algorithm (SUDA) software. The concept rests on the assumption that a record that is a sample unique on a coarser, less detailed set of variables is riskier than one that is unique only on a finer, more detailed set. Extensive empirical work has shown that special uniques are more likely to be population uniques and are correlated with the reciprocal of the population equivalence classes on a set of key variables K. The SUDA software classifies special uniques according to the size and number of subsets needed to obtain the minimal sample uniques. Examining all possible subsets of the key variables K for special uniques is computationally expensive. In addition, this approach does not take into account the protection afforded by the sampling, but it is relevant to determining the risk in the case of a “nosy household member” scenario, as mentioned above.
Another quantitative measure of re-identification risk, proposed in Skinner and Elliot (2002), is the Data Intrusion Simulation measure. In contrast to the typical scenario of matching a record from the confidential data to the population, this approach uses a “fishing” scenario: the intruder has information about a data subject in the population, explores whether that data subject is in the confidential data, and, if so, obtains the probability that the identification is correct, given that the record is unique on a set of quasi-identifying variables. Suppose the quasi-identifiers are used to create a cross-classified table and the sample uniques in that table are identified, and suppose that the population sizes in the corresponding sample-unique cells are also known or estimated. The Data Intrusion Simulation measure is then defined as the ratio of the number of sample-unique cells to the total population of the corresponding cells. For more technical details, see Appendix A.
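In notation introduced here for concreteness (not necessarily that of Skinner and Elliot), let $f_c$ and $F_c$ be the sample and population counts in cell $c$ of the cross-classified table. The measure just described can then be written as

```latex
\mathrm{DIS} \;=\; \frac{\bigl|\{\, c : f_c = 1 \,\}\bigr|}{\sum_{c\,:\,f_c = 1} F_c},
```

i.e., the number of sample-unique cells divided by the total population falling in those cells.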
The “fishing” scenario is generally not sufficiently relevant for statistical agencies that need to assess the consequences of releasing a whole public-use data file (Cohen, 2022). Therefore, much literature has since been devoted to estimating more relevant disclosure risk measures based on the attack scenario of an intruder having access to the confidential microdata and wanting to make an identification based on an external data source containing information about the population. Two main disclosure risk measures2 are as follows:
The panel notes that the probabilistic modeling approach accounts for the protection afforded by the sampling. Its limitations are that the set of quasi-identifiers cannot be too large, and a number of iterations based on different sets of key variables are often needed to assess the risk of re-identification. For data released in the form of households and individuals, two different models are applied, one at the household level and one at the individual level, and the results of the risk assessments are merged.
In the discussion so far, the focus of disclosure risk has been on estimating the re-identification risk from the released data. Somewhat surprisingly, even releases of what appear to be simply aggregate statistics (e.g., contingency tables) are also subject to attack. Experiments with reconstruction attacks motivated the Census Bureau to adopt differential privacy for the 2020 decennial census (Abowd, 2021), and reconstruction attacks have also been demonstrated to be effective against commercial privacy software (Cohen & Nissim, 2020).
As attack scenarios have become more advanced, there is value in applying more perturbative methods to the outputs. In recent years, following the extensive work carried out at the Census Bureau, more attention has been directed at the risk of reconstruction than at the risk of re-identification. Even releasing data in aggregate form carries a disclosure risk of reconstruction. For example, tabular data are susceptible to attacks by “table differencing” (i.e., when one takes the differences between two
___________________
2 These are defined mathematically in Appendix A.
similar tables to generate estimates for a small subpopulation) and “table linking” (i.e., when one strings together information from several tables in order to form a microdata record on a singleton).
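Table differencing can be illustrated with a pair of hypothetical published counts (invented numbers, not SIPP tabulations):

```python
# Two published tables that differ only in an age cutoff. Subtracting
# them isolates a one-person subpopulation (25-year-olds in MD).
count_by_state_age_25_plus = {"MD": 1204, "VA": 987}
count_by_state_age_26_plus = {"MD": 1203, "VA": 987}

diff = {
    state: count_by_state_age_25_plus[state] - count_by_state_age_26_plus[state]
    for state in count_by_state_age_25_plus
}
print(diff)  # a single individual in MD is isolated by the difference
```

Any other statistic published under both cutoffs (mean income, say) can then be differenced the same way to learn that one person’s value.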
The database reconstruction theorem (Dinur & Nissim, 2003) has made data producers cautious about the number of estimates they publish relative to the number of data values in the confidential microdata. For example, if a very large number of tables is produced from a small dataset, then it is likely that most, if not all, of the dataset can be reconstructed from the published tables. Such reconstruction attacks are a great risk for outputs based on tabular data. A reconstruction attack is one that emulates an intruder using the full set of tabular products to reconstruct the underlying confidential microdata. The threat comes from an understanding of the constraints within a series or set of aggregate releases generated from confidential microdata. When non-perturbative disclosure limitation methods, such as cell suppression, are used, reconstruction algorithms can easily be built as a system of equations to be solved. Garfinkel et al. (2019) provide details of a reconstruction attack as it relates to census products. Even when one limits the releases of aggregate data, for example by publishing partial tables or rounded statistics such as underlying rates in the tabular data, others can easily reconstruct the relevant tables by relying on the same ideas, again solving a system of equations (for a few simple examples, see Fienberg & Slavković, 2005; Slavković & Fienberg, 2004). As mentioned, using perturbative methods and preset disclosure limitation rules in a remote table builder can mitigate the risk of reconstruction (see Chapter 7 for additional discussion of table builders).
Furthermore, attack scenarios have become so sophisticated that disclosure risk checks are important even for perturbed or synthetic output data. For example, membership inference attacks can detect which data were used to create the synthetic or perturbed output by observing the output alone, even without access to the synthetic data model’s parameters. Another attack scenario, the memory-based attack, targets data resident in the system’s memory, which is less detectable than direct attacks on file directories (Daily, 2022; Francis, 2022; Jarmin, 2021; McKenna, 2019).
As described in Chapter 2, SIPP has characteristics that both help and hinder the task of disclosure avoidance. The use of statistical sampling is largely helpful, making it more difficult to determine whether individuals are unique within the population, though including the variables that define the sampling strata groups similar respondents in ways that create potential risks. SIPP’s complex sample design, including oversampling
of primary sampling units with higher concentrations of low-income households, would lessen the protection provided by sampling to some degree. The use of data imputation for item nonresponse, sometimes for a large number of cases (again, see Chapter 2), in effect introduces synthetic data even into the public-use file. Several factors complicate the task of disclosure avoidance, including the large number of variables containing potentially identifying information, the high granularity of the data, the presence of data not simply on individuals but on households, and the capacity to use longitudinal data to measure change over time.
While the disclosure risks associated with some of these features have been assessed in other data releases, both by the Census Bureau and by other data providers, the existing literature does not provide much guidance on the disclosure risks associated with combinations of these features, such as longitudinal data on households, nor on the particular combination of features found in SIPP. There have been two previous incidents in which disclosure avoidance became an important issue for SIPP. The first occurred in 1996, when data on adolescents in a special self-administered section could not be publicly released because they could easily be linked to the household data, allowing parents to identify their children’s responses; the lack of a public release disappointed many who had planned to use the data. The second involved a husband overhearing his wife’s previous responses, which were read back as part of what had been standard procedures to shorten the interviews, and thereby learning that his wife had previously been married. As a result of these two incidents, the Census Bureau modified its confidentiality statement, explaining that confidentiality between members of the same household cannot be protected (McKenna, 2019).
The ability to identify unique respondents for the purpose of matching depends on the number of variables available, the granularity or coarseness of the data, and the degree to which particular values are relatively rare (e.g., a record listing someone as Native American narrows the search far more than one listing someone as White). All data, when used in combination, are potentially disclosive, but highly granular data simplify the task of intruders seeking to identify respondents.
SIPP data are highly granular, as illustrated by the following statistics from the SIPP 2020 file for primary respondents in month 12.3 Dollar amounts are rounded to varying degrees. For example, the variable
___________________
3 This section contains custom tabulations provided by the Census Bureau to support the panel’s analysis, all based on the public-use file.
TJSCHKVAL, the total share of a joint shared checking account, appears to be unrounded below $1,000, rounded to the nearest $10 for values from $1,000 up to $10,000, rounded to the nearest $100 for values of $10,000 or more, and top-coded at $85,300; with 136 observations resulting in 473 unique values, 253 of those values are associated with only a single person. The occupation data take 522 values, and the state identifiers take 52 values. Three standard variables alone (occupation by race by sex) are sufficient to produce 613 singletons in the 2020 panel, and adding state of residence as a fourth variable produces 5,627 singletons.
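The tiered rounding pattern just described can be made concrete with a short sketch. The function below is purely illustrative: the name `round_and_topcode` and the exact cutoffs are assumptions inferred from the pattern observed in the public-use file, not a published Census Bureau rule.

```python
def round_and_topcode(value, topcode=85300):
    """Illustrative tiered rounding: unrounded below $1,000, nearest $10
    from $1,000 up to $10,000, nearest $100 at $10,000 and above, and
    capped at the top-coded value. Cutoffs are inferred, not official."""
    if value >= topcode:
        return topcode
    if value >= 10000:
        return round(value / 100) * 100
    if value >= 1000:
        return round(value / 10) * 10
    return value

print(round_and_topcode(1234), round_and_topcode(12340), round_and_topcode(90000))
# -> 1230 12300 85300
```

Even under such rounding, the coarsening is mild enough that many reported values remain unique, which is why so many singletons survive in the released file.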
Because SIPP is a sample survey, however, each observation carries a weight, and these weights range from 387 to 39,951 people. Thus, it is not clear how many of the singletons within SIPP represent actual singletons in the population. One could potentially use a different source with more observations, and thus generally smaller weights, such as the American Community Survey, to estimate the size of a subgroup more precisely, but these estimates would still be weighted. The true test of uniqueness is matching against the most comprehensive databases that might be available to data intruders.
The granularity of the data increases further when one considers their multidimensional nature. The above frequencies are based on seeking a match for a single individual, but SIPP data also include information about the other people in that person’s household, as well as changes over time. If an intruder has data about these other features, then even a relatively small number of variables applied across a full household can often identify that household uniquely. Table 3-1 illustrates how the ability to identify a respondent increases as these extra dimensions are considered. For simplicity, looking only at Wave 1 households in the 2020 SIPP file, there were 8,004 households in total. When looking at the primary respondent only, there were 440 different combinations of sex/race/ethnicity/state, and 99 of those combinations described only a single household in the file. Because SIPP is a sample-based survey, these combinations are unlikely to represent unique households from a truly national perspective. (As noted above, the smallest weight is 387.) Still, as the number of household members being matched increases, the number and percentage of unique households increase greatly. When the characteristics of the first five adults in the household are considered, 15.8 percent of households have unique characteristics based only on sex, race, ethnicity, and state. If two more variables are added (occupation and age), again allowing for up to five adults, then 93.9 percent of the households (7,513 of 8,004) have unique combinations of characteristics. In fact, almost the same percentage (93.6%) can be uniquely identified based on only two adults and the same characteristics. Again, without matching these characteristics against a national database (such as the decennial census or some commercial databases), one cannot
TABLE 3-1 Percentage of Wave 1 Households with Unique Combinations of Characteristics, by Number of Adults Considered
| Characteristics matched | Primary member only | Two adults | Three adults | Four adults | Five adults |
|---|---|---|---|---|---|
| Sex/race/ethnicity/state | 1.2 | 7.4 | 11.6 | 14.8 | 15.8 |
| Add occupation | 40.0 | 63.6 | 66.7 | 67.3 | 67.4 |
| Add age | 77.8 | 93.6 | 93.8 | 93.8 | 93.9 |
| Add number of children | 86.5 | 93.9 | 94.0 | 94.0 | 94.0 |
| Add change over time in number of children | 94.0 | 94.0 | 94.0 | 94.0 | 94.0 |
NOTE: Data are for Wave 1 of SIPP 2020. Change over time is measured within 2020.
SOURCE: Census Bureau, custom tabulations.
precisely calculate the disclosure risk to any household. However, with 7,513 households uniquely identified, it seems likely that some households can be uniquely identified in a national database, using only a small number of all of the variables available in SIPP for matching. Adding in measures of change over time within the year 2020 increases the disclosure risk even more. This analysis demonstrates how the dimensionality of the data contained in SIPP is a key consideration when evaluating disclosure risk.
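The kind of household-level uniqueness counting behind Table 3-1 can be sketched as follows. The records and coded values are invented toy data, and `household_keys` is a hypothetical helper, not the Census Bureau's actual tabulation code.

```python
from collections import Counter

# Toy adult records: (household_id, sex, race, ethnicity, state).
# All values are invented stand-ins, not actual SIPP codes.
adults = [
    (1, "F", "W", "N", "OH"), (1, "M", "W", "N", "OH"),
    (2, "F", "B", "N", "TX"), (2, "M", "W", "H", "TX"),
    (3, "F", "B", "N", "TX"), (3, "M", "W", "H", "TX"),
    (4, "M", "A", "N", "CA"),
]

def household_keys(records, max_adults):
    """Build one matching key per household from the characteristics
    of up to `max_adults` members (order-independent)."""
    by_hh = {}
    for hid, *chars in records:
        by_hh.setdefault(hid, []).append(tuple(chars))
    return {hid: tuple(sorted(members)[:max_adults])
            for hid, members in by_hh.items()}

keys = household_keys(adults, max_adults=2)
counts = Counter(keys.values())
singletons = sorted(h for h, k in keys.items() if counts[k] == 1)
print(singletons)  # -> [1, 4]: households 2 and 3 share a key, so neither is unique
```

Extending the key with more variables (occupation, age, number of children) reproduces the qualitative pattern in the tables: more variables per member, and more members per key, yield more singletons.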
One way to further examine the likely rarity of the above unique combinations of characteristics is to examine their frequency across both waves included in the 2020 SIPP file.4 If a set of characteristics appears only once across multiple waves, then its frequency is likely to be less than the minimum weight of 387. Frequencies across all waves are shown in Table 3-2.5 This drops some combinations that are no longer unique across all waves, while adding some that appeared only in a different wave. The percentage of cells with unique combinations of characteristics is reduced relative to Table 3-1, but the same basic pattern emerges: the likelihood of unique combinations increases with the number of characteristics and with the number of adults. Once age has been added and the two primary adults are included, about 87.5 percent of the cells are unique, with little gained by adding more characteristics or more adults in the same household.
___________________
4 A SIPP data collection year may include up to four panels, each with a different wave. However, there was no 2017 SIPP, and the 2019 SIPP was limited to a single wave. Thus, the 2020 SIPP had Wave 1 from the 2020 panel and Wave 3 from the 2018 panel.
5 SIPP 2020 contained data only for Waves 1 and 3, reflecting the 2018 and 2020 panels. SIPP was not conducted in 2017, so there is no Wave 4, and the 2019 SIPP panel was discontinued after a single year.
TABLE 3-2 Percentage of Households with Unique Combinations of Characteristics Across Waves 1 and 3, by Number of Adults Considered
| Characteristics matched | Primary member only | Two adults | Three adults | Four adults | Five adults |
|---|---|---|---|---|---|
| Sex/race/ethnicity/state | 0.4 | 4.5 | 8.5 | 11.0 | 12.0 |
| Add occupation | 28.8 | 56.9 | 60.9 | 61.5 | 61.7 |
| Add age | 64.3 | 87.5 | 88.0 | 88.1 | 88.1 |
| Add number of children | 74.6 | 88.7 | 89.1 | 89.1 | 89.1 |
| Add change over time in number of children | 89.2 | 89.2 | 89.2 | 89.2 | 89.2 |
NOTE: Data are for Waves 1 and 3 (combined) of SIPP 2020. Change over time is measured within 2020.
SOURCE: Census Bureau, custom tabulations.
Given that many databases may not contain occupation codes, particularly at the level collected in SIPP, Table 3-3 provides a similar breakdown that omits occupation and instead includes the highest level of education attained.6 Adding the level of education makes 59.2 percent of the households unique even when considering only the primary respondent, and over 90 percent of the households are unique if two or more adults are considered. Again, given that the smallest weight was 387, unique combinations of characteristics within SIPP may not be unique within the population. But if someone knew that a particular person had participated in SIPP, it is highly likely that person could be identified based on only a small number of variables. Determining which of these combinations are unique in the entire population would require examining an alternative database.7
One factor not discussed in the overview of SIPP in Chapter 2 is the degree to which variables provided in SIPP are also available in other
___________________
6 SIPP collects considerably more detail on the education level than would appear in most databases (e.g., with categories such as “less than 1st grade”). To better simulate what might be available on external databases for matching purposes, the categories were collapsed to less than high school diploma, high school graduate, some college, associate’s degree, bachelor’s degree, master’s degree, and professional or doctorate degree.
7 An alternative approach, which may not be feasible, is to computationally consider the space of all possible tables that satisfy some of the percentages and other statistics that are reported, akin to linear or integer programming.
TABLE 3-3 Percentage of Households with Unique Combinations of Characteristics, with Education in Place of Occupation, by Number of Adults Considered
| Characteristics matched | Primary member only | Two adults | Three adults | Four adults | Five adults |
|---|---|---|---|---|---|
| Sex/race/ethnicity/state | 0.4 | 4.5 | 8.5 | 11.0 | 12.0 |
| Add birth year | 21.4 | 72.3 | 74.9 | 75.2 | 75.3 |
| Add highest level of education | 59.2 | 91.4 | 92.0 | 92.0 | 92.1 |
| Add number of children | 77.8 | 92.6 | 93.0 | 93.1 | 93.1 |
| Add change over time in number of children | 93.1 | 93.1 | 93.1 | 93.1 | 93.1 |
NOTE: Data are for Waves 1 and 3 (combined) of SIPP 2020. Change over time is measured within 2020.
SOURCE: Census Bureau, custom tabulations.
databases for potential matching. Table 3-4 displays some of the relevant variables available in three major commercial databases, though the list is incomplete (e.g., Acxiom says that it measures 10,000 attributes). Other potential databases that might be used for matching include federal and state databases. Collectively, these databases make considerable data available to a data intruder for matching and identifying SIPP respondents, as well as for gauging the likely uniqueness of various combinations of attributes.
To help gauge the level and nature of the disclosure risks associated with the SIPP Public Use Microdata Files, the Census Bureau completed an initial re-identification risk study for the 2014 SIPP panel, referred to as the “Re-id Study,” which statistically matched tax filers in the SIPP panel data against four years of Internal Revenue Service (IRS) data.8 In doing
___________________
8 A. Dajani, S. Clark, & P. Singer, SIPP 2014 Panel Re-identification (Re-id) Study Findings and Recommendations, presentation to the Committee on National Statistics Panel on A Roadmap for Disclosure Avoidance in the Survey of Income and Program Participation (SIPP), March 21, 2023. This briefing provided a summary of the re-identification study; the full report had not yet been released at the time of this writing. The write-up in this section is based on that briefing.
TABLE 3-4 Key Areas in Which Three Commercial Databases Have Data That Correspond to SIPP Data
| Broad category | Data Axle | Equifax | Acxiom |
|---|---|---|---|
| Demographics | | | |
| Basic | | | |
| Date of birth | X | X | |
| Sex | X | X | X |
| Ethnicity | X | X | |
| Race | X | X | |
| Geographic data | X | X | X |
| Educational attainment | X | X | |
| Family and household relationships | | | |
| Number of adults in household | X | | |
| Number of children in household | X | X | |
| Age range | X | | |
| Gender | X | | |
| Marital status | X | X | |
| Households likely to buy items for children | X | | |
| Family wireless plan subscribers | X | | |
| Marriage/divorce | X | | |
| Birth of children | X | | |
| Household size | X | | |
| Number & ages of children | X | | |
| Adult living relatives | X | | |
| Nativity, citizenship, and parent nativity | | | |
| Country of origin | X | | |
| Residence | | | |
| New homeowners | X | | |
| New movers | X | | |
| Moves | X | | |
| Assets, employment and earnings | | | |
| Assets | | | |
| Home value | X | X | |
| Home sales price | X | X | |
| Years at residence | X | | |
| Home age | X | | |
| Mortgage date | X | | |
| Single family home | X | | |
| Multi-family home | X | | |
| Boat owner | X | | |
| Aircraft owner | X | | |
| Mortgage amount | X | | |
| Loan-to-value percentage | X | | |
| Invested assets | | | |
| Estimated household assets | X | | |
| Credit behaviors and outstanding balances | X | X | |
| Student borrowers | | | |
| Net worth | X | X | |
| Homeowner status | X | | |
| Home purchase | X | | |
| Auto purchase | X | | |
| Employment and Earnings | | | |
| Household income | X | X | X |
| Expendable income | X | | |
| Occupation | X | X | |
| Health care utilization and medical expenditures | | | |
| Health-related measures | X | | |
| Insurance behavior | | | |
NOTE: These firms do not publish comprehensive lists of their variables, and they may have data comparable to SIPP that are not listed here. Thus, this table may understate the degree to which the databases have information that might be matched against SIPP. The firms also have a substantial amount of data that do not directly correspond to SIPP variables and are not shown here. Jeremy Groen of Data Axle completed the first column. The other two firms were requested to complete the table but did not, and the X’s are estimates based on reviewing the firms’ websites. It is not clear from the websites what level of detail is collected and how comparable the data are to what is collected in SIPP. Thus, these columns may contain errors.
SOURCES: Data from www.dataaxle.com, www.equifax.com, and www.acxiom.com; and materials provided by Collins Warren at Data Axle.
so, high-risk areas in the data (records or groups of records, in conjunction with particular variables or combinations of variables) can potentially be identified and explored further in a more concentrated risk assessment, which in turn informs the choice of data treatment solutions. In this manner, re-identification studies are an iterative process.
The Census Bureau’s Re-id Study is an example of an intruder attack at one point in time with one realization of the sample/population pair of files. It assumes that the intruder does not know which members of the population are included in the sample but does know information relating to a certain set of indirect identifiers. One external “population” file was used, the linked Social Security Administration (SSA)/IRS file; because that file is not publicly available, its use is considered a conservative choice. Prior to conducting the study, the data used in the statistical matching were harmonized to have consistent definitions and categories across databases. The extent of the recoding of categorical variables and the degree of skewness in continuous variables both have a bearing on the risk results.
The Re-id Study assessed the disclosure risk from combinations of 36 indirect identifying variables that were used together in a record linkage process. The study also took into account the longitudinal data structure by including four years of data for the same household records. Movements of households were addressed, to a limited extent, over the length of the panel. In addition, the study took into account that SIPP data are from a sample, as well as oversampling or differential sampling rates, by linking about 28,000 records from the SIPP 2014 panel to over 90 million records from a linked SSA/IRS data file that contained IRS data from 2013 to 2016 and SSA Numident for 2013. Linking in this way made possible many potential correct matches (or ties) on the 36 variables, where the true match may or may not be part of that group of potential correct matches.
The statistical matching in the Re-id Study used two distance measures, the Taxicab (L1) and the Euclidean (L2) metrics, computed for each pair of observations across the SIPP panel data and the SSA/IRS file within strata based on state by metro/non-metro status. For each of the four waves, the variables used were mover status, age, race, sex, presence of a spouse in the household, number of children, and six sources of income. Of these, only age does not change unpredictably over time, and it indeed proved highly identifying. The results were classified into suspected matches using undisclosed thresholds, and re-identification rates were computed as the ratio of suspected matches to the total record count. The “true” matches were identified using a third internal file (referred to as the Internal Use File). Knowing the true matches allowed two more rates to be computed: (1) the confirmed rate, the number confirmed over the total record count in the stratum; and (2) the conditional rate, the number confirmed over the number suspected. The Census Bureau uses the latter as a measure of precision.
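The core of such a matching step can be sketched as follows. This is a minimal illustration with invented data, not the Re-id Study's actual procedure; the real study used 36 harmonized identifiers, stratification by state and metro status, and undisclosed thresholds.

```python
import numpy as np

# Stand-in "population" file and a few noisy survey versions of its rows.
rng = np.random.default_rng(0)
external = rng.integers(0, 10, size=(1000, 6)).astype(float)
sipp = external[rng.choice(1000, size=5, replace=False)] + rng.normal(0, 0.5, (5, 6))

def closest(record, pool, ord):
    """Index and distance of the pool record nearest to `record`
    under the given metric (ord=1: Taxicab, ord=2: Euclidean)."""
    dists = np.linalg.norm(pool - record, ord=ord, axis=1)
    return int(np.argmin(dists)), float(dists.min())

for rec in sipp:
    i1, d1 = closest(rec, external, ord=1)   # Taxicab metric
    i2, d2 = closest(rec, external, ord=2)   # Euclidean metric
    # In the study, a pair whose distance fell below an (undisclosed)
    # threshold would be flagged as a suspected match.
```

Because the external file contains many near-duplicate records, the nearest neighbor may be a tie or a wrong record, which is why the study distinguishes suspected, confirmed, and conditional rates.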
Overall, the findings of the Re-id Study were that the risks of re-identification in the 2014 panel of the SIPP Public Use Microdata Files were “better than expected.” For state by metro/non-metro area strata, the study found that 72.2 percent of strata had zero confirmed re-identifications using the Taxicab metric and 13.9 percent had zero confirmed re-identifications using the Euclidean metric. Mover households in SIPP were re-identified at rates similar to those of households that did not move over the course of the panel, although more states had zero confirmed re-identifications for nonmover
households than for those that moved. Housing units in metro areas were re-identified at lower rates than nonmetro units. The study also found that no clear patterns emerged in rates of confirmed re-identifications across the various “linking variables” used to link with the external databases, such as age, race, sex, presence of spouse, number of children, and categorical versions of alternative sources of income, across the panel’s four waves. Finally, the Re-id Study found that age was the most disclosive variable among those used, both individually and with other linking variables. Based on this preliminary study, the Census Bureau concluded that no changes were needed to the public-use data file. (However, some limitations of this preliminary study are discussed in the following section.)
In this section, the panel offers a set of recommendations to the Census Bureau suggesting what elements should be included in an ongoing program for assessing the disclosure risks associated with both current and possible ways of releasing SIPP data, including public release files (public-use files) of microdata. The panel begins with an assessment of the Census Bureau’s recent re-identification study of the 2014 SIPP panel, a summary of which was presented to the panel for this report. The panel then offers its recommendations for the future program the Census Bureau should establish to assess the disclosure risk of SIPP data products.
The panel was impressed with the work that went into the Re-id Study. Furthermore, the authors of the Re-id Study were careful to note many caveats, limitations, and assumptions, some of which were noted above. Nonetheless, the panel had several reactions to the study that bear on the future efforts of the Census Bureau to assess the disclosure risks of SIPP data products, especially its public-use files.
First, the panel notes the lack of an acknowledgment in the study that many variables in SIPP are imputed. In particular, a large portion of the income data is imputed. One might hypothesize that data that are imputed have less predictive power and thus may produce more incorrect matches than would an analysis based only on actual survey responses. Still, if the imputed value comes reasonably close, it may support finding a match (certainly much more so than if one were working with a missing value). A data intruder might seek to focus only on non-imputed data to get better matches. Whether that strategy would work is not clear; a study by Krenzke et al. (2014) showed that the use of non-imputed data can lead
to overestimation of disclosure risk, while including imputed data corrects for that overestimation. Typically, for measures such as income one does not look for an exact match but rather at the closeness of the values in the examined databases, and an imputed value might still come sufficiently close (plus, respondents’ self-reports are also subject to error). The use of imputation also provides deniability to the respondent (e.g., “I didn’t say that; that is an imputed value”). The public release of imputation flags has multiple implications: (a) it may affect whether a respondent has deniability, and (b) it changes what strategies are available to a data intruder when seeking to identify a respondent. Ultimately, whether or not imputation has been performed, a re-identification study provides a test of the extent of disclosure risks, but it could also be used to explore whether a data intruder might benefit by focusing on actual reported values. As a side note, synthetic data are a type of imputed data and can contain disclosure risks just as imputed data can (Hotz et al., 2022). Again, one depends on tools such as re-identification studies to confirm the safety of the data.
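The intruder strategy mentioned above, attending only to non-imputed values, can be illustrated with a small sketch. The data, the flag layout, and the helper `masked_distance` are all hypothetical.

```python
import numpy as np

record    = np.array([52.0, 41000.0, 2.0])   # age, income, number of children
flags     = np.array([False, True, False])   # income was imputed (invented flags)
candidate = np.array([52.0, 63000.0, 2.0])   # external-file record

def masked_distance(a, b, imputed):
    """Taxicab distance computed only over fields not flagged as imputed."""
    keep = ~imputed
    return float(np.abs(a[keep] - b[keep]).sum())

print(masked_distance(record, candidate, flags))  # -> 0.0: imputed income is ignored
```

This is exactly why the public release of imputation flags changes the intruder's options: without the flags, the large income discrepancy would push this candidate far down the match list.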
Second, data intruders may show tremendous creativity and persistence when seeking to identify respondents, while the Census Bureau’s analysis did not explore the full range of options that are available to data intruders. The panel received only a summary of the Census Bureau’s findings, and only data that had been approved by the Census Bureau’s Disclosure Review Board could be released, but following are some additional types of analysis that might have been performed.
___________________
9 When there are exactly five observations deemed as being the closest, then the false positive rate is either 80 percent or 100 percent, depending on whether one of the five is a correct match. Lower false positive rates can appear when there are ties. For example, suppose that three observations are tied as being the fourth, fifth, and sixth closest. The Census Bureau’s procedure would be to exclude those three, leaving only those that are first, second, or third, resulting in a false positive rate of either 67 percent or 100 percent. However, in such cases, a lower threshold could also succeed in including a correct match. Thus, in general the larger the threshold is, the higher the false positive rate will be, even though increasing the threshold will also increase the chances of including the correct match.
___________________
10 Federal Committee on Statistical Methodology, https://nces.ed.gov/fcsm/dpt/content/3-2-2-1
11 The degree of risk or ability to match a person will depend on the database being used. A person who does not pay taxes will not be on the IRS database but may be in other databases (including of those receiving Social Security benefits and in databases maintained by credit agencies). IRS data are not designed to consider households with multiple filers, though perhaps they could be aggregated by address to create more comprehensive measures of households.
___________________
12 Email from Holly Fee, Census Bureau, May 16, 2023.
13 https://www.taxpolicycenter.org/model-estimates/tax-units-with-zero-or-negative-federal-individual-income-tax-oct-2022/t22-0131. A tax unit is defined as “an individual, or a married couple, that files a tax return or would file a tax return if their income were high enough, along with all dependents of that individual or married couple.” https://www.taxpolicycenter.org/resources/brief-description-tax-model. This is different from a SIPP household, which may consist of multiple tax units.
The panel recognizes that the Census Bureau faced multiple constraints, including what datasets were readily available, the time and resources available for the analysis, and the limits placed by the Census Bureau’s Disclosure Review Board on what could be released to the panel or to the public. Most of the reporting focused on the precision results, which are not very helpful in identifying those who are at risk of disclosure; rather, one wishes to know which subgroups faced disclosure risks so that those risks can be addressed. The panel suggests that using classification trees in lieu of logistic regression would allow a deeper dive into the results by locating “pockets” of high risk. In doing so, risks arising from combinations of variables become more apparent than when results are examined one variable at a time.
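As a stand-in for a full classification-tree analysis (for which one would typically use, e.g., scikit-learn's DecisionTreeClassifier), the sketch below finds the single split that best isolates confirmed re-identifications, that is, a first high-risk "pocket." All data are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(200, 3))  # coded linking variables (simulated)
y = (X[:, 0] == 4).astype(int)         # confirmed re-identification indicator

def best_split(X, y):
    """Return the (column, value) pocket with the largest excess risk."""
    base = y.mean()
    best = (-1.0, 0, 0)
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j]):
            lift = y[X[:, j] == v].mean() - base   # pocket risk minus overall risk
            if lift > best[0]:
                best = (lift, j, int(v))
    return best[1], best[2]

print(best_split(X, y))  # -> (0, 4): the pocket where risk is concentrated
```

A real tree would recurse on each pocket, surfacing interactions among linking variables that one-variable-at-a-time summaries miss.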
More generally, the panel concluded that more can be learned from the Re-id Study than was contained in the presentation provided. Such information and detail would be needed for the panel and others to draw strong conclusions about the findings of this study and its adequacy. Furthermore, as
___________________
14 https://news.prudential.com/58-young-adults-are-still-living-at-home-impacting-their-parents-path-to-retirement.htm#:~:text=More%20than%20half%20(58%25),are%20living%20with%20their%20parents
15 This latter set of statistics is an approximation based on the following: SIPP data show 32,647 individuals who filed federal income taxes. Of those, 9,855 filed a joint return, and of those, 8,555 were in a household that filed two or more returns. To avoid double-counting, those in households with two or more returns were halved, leaving approximately 32,647 - 4,277 = 28,370 unique tax returns from 18,170 SIPP households. Thus, about 10,000 filers would have been excluded as belonging to households in which someone else was being counted.
discussed below, future studies of the re-identification risks of SIPP public-use files would benefit from additional analyses that were not contained in the Re-id Study. For example, using other external files and other key identifying variables, the statistical models mentioned in this chapter can be used to gauge risk and to identify, in a time-efficient first pass of the data, which subgroups (e.g., Supplemental Nutrition Assistance Program participants) are at highest risk. Attention can then focus on those subgroups by obtaining a better data source for them and conducting another matching study. It is important to maintain a continued focus on variables that are identifying and susceptible to change over time. Other improvements to consider include using preference weights in the linkage process to give priority to some variables over others, trying different recoding scenarios, and varying the matching thresholds to see the effect of each on measured risk.
Nonetheless, the panel returns to its earlier observation that this study was a very good start, for which it wishes to commend the Census Bureau staff who completed it. The panel’s recommendations below are intended to build on this start and to help the Census Bureau develop a disclosure risk and re-identification program for SIPP products.
Specific to SIPP, whose public-use files contain longitudinal data, it is important to include a time dimension in the key variables. Li et al. (2021), for example, applied a log-linear modeling approach (Skinner & Shlomo, 2008) to the Survey of Doctoral Recipients public-use files to demonstrate a way to measure re-identification risk while incorporating the longitudinal nature of the data. To assess the re-identification risk due to the longitudinal data, the cross-sectional risk assessment model was rerun with the addition of binary change indicators derived from variables that were more susceptible to changing values over the years: academic position, number of children, federal government support, and year of highest degree. Marital status is another variable with an important time element to take into account. In another example, El Emam et al. (2011) assessed longitudinal risk due to demographics (age, gender, and postal codes) among Montreal residents for data gathered over 11 years by estimating the proportion of individuals who are unique in the population.
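Deriving binary change indicators of the kind used by Li et al. can be sketched simply: for each record, flag whether a time-varying variable ever changed across waves. The data structure and variable names below are invented for illustration.

```python
# Per-person wave data: {person_id: {wave: {variable: value}}}.
# Values and variable names are invented, not actual survey codes.
waves = {
    101: {1: {"marital": "M", "n_children": 2},
          2: {"marital": "M", "n_children": 3}},
    102: {1: {"marital": "S", "n_children": 0},
          2: {"marital": "S", "n_children": 0}},
}

def change_indicators(person_waves, variables):
    """Return {variable: 1 if its value ever changed across waves, else 0}."""
    out = {}
    for var in variables:
        values = [person_waves[w][var] for w in sorted(person_waves)]
        out[var] = int(len(set(values)) > 1)
    return out

print(change_indicators(waves[101], ["marital", "n_children"]))
# -> {'marital': 0, 'n_children': 1}
```

The indicators then enter the risk model as additional key variables, so that two records identical in a cross-section but differing in their change histories are no longer treated as interchangeable.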
The panel notes that although the probabilistic modeling approach can be quite complex, it is a one-time study so long as the design of the
survey remains unchanged. However, if a new sample design is introduced, or new variables or new data products are introduced, then new risk assessments are appropriate. Each data product initially would be assessed for re-identification risk. To start, a re-identification risk assessment can be conducted using the measures discussed above on the current SIPP public-use file to identify subgroups of greatest risk, while incorporating the longitudinal structure and linkable sources of data.
At the same time, it is appropriate to update the re-identification studies for SIPP as external data sources, especially commercially available ones, change in both coverage and content. For example, because databases containing names and addresses of individuals have recently been used to construct longitudinal data on households, including annual changes in income and in household composition and structure, assessing the re-identification risk along these dimensions of SIPP’s longitudinal data requires monitoring the development of external databases and their capabilities.16
Finally, the Census Bureau has restricted access to the existing SIPP Synthetic Beta program, which synthesizes SIPP data linked with SSA earnings and benefit data (see Chapter 2), to approved research projects, and any output produced with the linked Gold Standard File data has been subject to disclosure review. The Census Bureau might nonetheless consider including these data in its disclosure risk assessment program. Assessing the SIPP Synthetic Beta data would provide a formal assessment of the disclosure risks of such data, which in turn could provide useful input into expanded use of synthetic methods for either public release files or files that can be accessed in modes other than through Federal Statistical Research Data Centers, as discussed in Chapter 6. Such assessments may also provide guidance for other modes of access to SIPP linked with these and other administrative records, such as through a remote flexible table generator, as discussed in Chapter 7.
Conclusion 3-1: Risk assessment may be quantified on either an absolute or relative scale, and differential privacy can be viewed as a type of relative risk measure.
Conclusion 3-2: Several measures of disclosure risk have been developed and assessed in the literature, and they differ with respect to their ease of measurement and their appropriateness for addressing the tradeoff between protecting data privacy and ensuring data usability.
___________________
16 A. Ziff, J. V. Hotz, & E. Wiemers, The association between the volatility of income and life expectancy in the U.S. (under review); made available by one of the authors.
Conclusion 3-3: While analytical approaches to risk assessment can be useful, the Survey of Income and Program Participation’s complex and longitudinal data structure makes it important to conduct empirical assessments using resources that would realistically be available to intruders, various intruder models, and potential data release strategies. Intruder models range from intruders who know a great deal about a specific target to those using databases to fish for someone to attack.
Conclusion 3-4: Widely available sources potentially used by intruders include federal administrative data, commercial data, and state program data. It is important to consider all these potential sources, also allowing for changes over time as new data become available and for the potential use of longitudinal data.
Conclusion 3-5: Survey of Income and Program Participation (SIPP) data are highly granular and may be subject to intruder attacks by matching. Potential identifying variables include age, state, country of birth, occupation, household income, home value and mortgage amount, program participation, demographic characteristics, household composition, and change over time in income, program participation, and household composition. The exact list may vary depending on what is collected through SIPP and what is available through other sources, both of which may change over time.
Conclusion 3-6: Current knowledge about Survey of Income and Program Participation (SIPP) disclosure risks is limited. The Census Bureau’s recent re-identification study for SIPP is helpful in this regard but still incomplete, not fully accounting for the range of content contained in SIPP and the inclusion of longitudinal data.
Conclusion 3-7: Analyses of disclosure risk conducted by outside researchers may provide valuable perspectives and can be important tools for guiding the improvement of disclosure limitation for Survey of Income and Program Participation.
Recommendation 3-1: The Census Bureau needs to establish a systematic and ongoing program of assessing disclosure risk for Survey of Income and Program Participation data products, especially microdata public-use files. This program should include both empirical and analytic approaches to assessing the risks of re-identification attacks, comparisons with a variety of external databases, probabilistic models, and the risks present in both cross-sectional and longitudinal data. All forms of data releases should be evaluated.
This program should include (a) conducting re-identification attacks by matching SIPP records against a range of external databases (e.g., federal government databases within the Census Bureau, state and local social program databases, commercial databases such as InfoUSA/Data Axle USA and Experian); (b) the use of probabilistic models to assess probabilities of re-identification for households/individuals with unique combinations of data; and (c) assessment of the disclosure risks associated with within-wave cross-section data and the longitudinal data across waves. If the results of these exercises reveal a set of sensitive variables that are at high risk of disclosure, it will be important to investigate the alternative disclosure limitation methods discussed in the rest of this report, in each case weighing the risk reduction against reductions in data usability. Each of those alternative SDL methods—including differential privacy mechanisms, partial or full synthetic methods, traditional SDL methods (all discussed in the next chapter), and other approaches—should be made the subject of disclosure risk assessments to gauge the reduction in risk that each achieves. These assessments should be considered for both existing and proposed SIPP data products.
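Part (b) of the recommendation can be sketched with a deliberately simple probabilistic model: if a released record matches F candidates in an external database on its quasi-identifiers, an intruder who picks one candidate at random succeeds with probability 1/F. The field names and data below are illustrative assumptions, not actual SIPP content or Census Bureau methodology, which would use far richer models.

```python
from collections import Counter

# Hypothetical "external" database keyed on (state, age, occupation)
external = [
    ("CA", 34, "flight attendant"),
    ("CA", 34, "flight attendant"),
    ("TX", 51, "plumber"),
    ("NY", 29, "nurse"),
]
ext_counts = Counter(external)

def reid_probability(record_key, ext_counts):
    """Naive matching model: 1 / (number of external candidates sharing
    the key); 0 if no external candidate matches at all."""
    f = ext_counts.get(record_key, 0)
    return 0.0 if f == 0 else 1.0 / f

released = [("CA", 34, "flight attendant"),
            ("NY", 29, "nurse"),
            ("WA", 40, "chef")]
risks = [reid_probability(r, ext_counts) for r in released]
print(risks)  # [0.5, 1.0, 0.0]
```

A record that is unique in the external source (here, the New York nurse) carries the maximum risk under this model; more realistic models would also account for sampling, reporting error, and mismatched granularity between the files.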
The panel recommends that the Census Bureau establish a “baseline” assessment of these risks to guide its choices among disclosure avoidance methodologies and alternative products, including microdata releases; among data available through Table Generator and Remote Analysis platforms (discussed in Chapter 8); and possibly among data available through secure online data access (discussed in Chapter 6). Furthermore, the panel notes that this disclosure risk assessment system for SIPP may require periodic re-assessments of disclosure risk to keep up with the development of new or more accessible external databases and/or new or more challenging intruder attack models.
The panel thinks it is important that the Census Bureau find ways to release to users and the public the findings of its assessments of disclosure risk. Thus, the panel makes the following recommendation:
Recommendation 3-2: The Census Bureau should find ways to communicate to researchers and the general public the findings of its assessments of re-identification attacks and disclosure risk so that data users will understand the need for disclosure avoidance techniques.
Finally, while the Census Bureau has the primary obligation for conducting the disclosure risk assessments of SIPP public release data, the panel thinks it is important that the Census Bureau find ways to involve others in improving its understanding of these risks. As discussed in this chapter, the academic research community has developed a variety of sophisticated methodologies for re-identification and privacy attacks. Performing only
internal risk assessments using standardized strategies can lead to a vast underestimation of risk that does not represent what a creative and resourceful intruder might be able to do. Thus, the Census Bureau would greatly benefit from finding ways to involve external experts in its risk assessment program. This leads the panel to make the following recommendation:
Recommendation 3-3: The Census Bureau should find ways to partner with and involve external researchers and other experts in its risk assessment research program. Such involvement will expand the Census Bureau’s capacity and allow it to tap into state-of-the-art developments in the area of disclosure risk assessment.
When appropriate protections are in place, the involvement of external researchers in re-identification studies can be important for making progress in the SDL of SIPP data products and more broadly. Privacy protection specialists can help in identifying potential attack strategies that intruders might use and state-of-the-art methods for protecting the data, and SIPP data users can guide the Census Bureau on what data are needed and how the data might be accessed. Additionally, the involvement of data users would help them to feel that they are partners rather than people whose work is being undercut by disclosure avoidance procedures that they do not value. Non-experts such as college students might also be helpful in risk assessment, perhaps bringing a different kind of lateral thinking.