A Roadmap for Disclosure Avoidance in the Survey of Income and Program Participation (2024)

Chapter: Appendix A: Technical Details on Measuring Disclosure Risk

Suggested Citation: "Appendix A: Technical Details on Measuring Disclosure Risk." National Academies of Sciences, Engineering, and Medicine. 2024. A Roadmap for Disclosure Avoidance in the Survey of Income and Program Participation. Washington, DC: The National Academies Press. doi: 10.17226/27169.

Appendix A

Technical Details on Measuring Disclosure Risk

Let Fk denote the population count and fk the sample count in cell k, where the cells are defined by cross-classifying the K key variables. The set of sample uniques is defined as SU = {k: fk = 1}; these are the high-risk records in the public-use file because they have the potential to be population uniques. The DIS measure (Skinner & Elliot, 2002) is defined as

$$\mathrm{DIS} = \frac{\sum_k I(f_k = 1)}{\sum_k I(f_k = 1)\,F_k},$$

where I is the indicator function. The DIS measure is the number of sample uniques divided by the sum of the population sizes in the cells belonging to SU. The advantage of the DIS measure is that it can easily be estimated without the need for any probabilistic modeling. The estimate for the DIS is

$$\widehat{\mathrm{DIS}} = \frac{\pi n_1}{\pi n_1 + 2(1 - \pi)\,n_2},$$

where π is the sampling fraction, n1 = ∑k I(fk = 1) is the number of sample uniques, and n2 = ∑k I(fk = 2) is the number of sample doubles.
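The DIS estimate can be computed directly from the sample. The following is a minimal sketch, assuming each sample record is represented by a tuple of its key-variable values; the function name `estimate_dis` and the example records are hypothetical and are not taken from the report.

```python
from collections import Counter


def estimate_dis(keys, pi):
    """Estimate DIS from sample key combinations and a sampling fraction pi.

    keys: iterable of hashable key-variable combinations, one per record.
    pi:   sampling fraction, assumed common to all cells.
    """
    cell_counts = Counter(keys)  # f_k for every non-empty sample cell k
    n1 = sum(1 for f in cell_counts.values() if f == 1)  # sample uniques
    n2 = sum(1 for f in cell_counts.values() if f == 2)  # sample doubles
    denom = pi * n1 + 2 * (1 - pi) * n2
    if denom == 0:
        return float("nan")  # undefined when the denominator vanishes
    return pi * n1 / denom


# Hypothetical sample: two uniques, one double, one cell of size three.
sample_keys = [
    ("F", "30-39", "CA"), ("F", "30-39", "CA"),
    ("M", "40-49", "TX"),
    ("M", "50-59", "NY"),
    ("F", "20-29", "WA"), ("F", "20-29", "WA"), ("F", "20-29", "WA"),
]
dis_hat = estimate_dis(sample_keys, pi=0.1)
```

Here n1 = 2 and n2 = 1, so the estimate is 0.1 × 2 / (0.1 × 2 + 2 × 0.9 × 1) = 0.1.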

As mentioned in Chapter 3, the “fishing” scenario is generally not sufficient for statistical agencies, because they need to assess the consequences of releasing a public-use file. A more relevant disclosure risk measure is therefore based on the attack scenario, in which an intruder has access to the confidential microdata and wants to make an identification based on an external data source containing information about the population. Two main disclosure risk measures are

  1. expected number of sample uniques that are population uniques: τ1 = ∑k I(fk = 1, Fk = 1), and
  2. expected number of correct matches for sample uniques assuming a random assignment within cell k.

For example, if a sample unique matches to three individuals in the population, the match probability for that sample unique would be 1/3. Aggregating all match probabilities over the sample uniques leads to τ2 = ∑k I(fk = 1) 1/Fk.
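When the population counts are known, as in a simulation study, both measures can be computed exactly. The sketch below assumes the sample and population cell counts are held in dictionaries keyed by cell; the function name `tau_measures` and the counts are hypothetical.

```python
def tau_measures(sample_counts, population_counts):
    """Return (tau1, tau2) given sample counts f_k and population counts F_k.

    tau1 counts sample uniques that are also population uniques.
    tau2 sums the match probabilities 1/F_k over the sample uniques.
    """
    tau1 = 0
    tau2 = 0.0
    for cell, f in sample_counts.items():
        if f != 1:
            continue  # only sample uniques contribute
        F = population_counts[cell]
        tau1 += int(F == 1)
        tau2 += 1.0 / F
    return tau1, tau2


# Hypothetical counts: cells "a", "b", and "d" are sample uniques.
f_counts = {"a": 1, "b": 1, "c": 2, "d": 1}
F_counts = {"a": 1, "b": 3, "c": 10, "d": 4}
tau1, tau2 = tau_measures(f_counts, F_counts)
```

In this example only cell "a" is a population unique, so τ1 = 1, and τ2 = 1 + 1/3 + 1/4 = 19/12.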

Shlomo and Skinner (2022) and Shlomo (2010) (and references therein) provide an overview of the literature on (a) measuring the risks of re-identification and (b) using probabilistic modeling to estimate the population sizes in cell k.

Using a probabilistic model, the risk measures are estimated by

$$\hat{\tau}_1 = \sum_k I(f_k = 1)\,\hat{P}(F_k = 1 \mid f_k = 1) \quad \text{and} \quad \hat{\tau}_2 = \sum_k I(f_k = 1)\,\hat{E}(1/F_k \mid f_k = 1).$$

Skinner and Holmes (1998) and Elamir and Skinner (2006) propose a Poisson distribution and a log-linear model to estimate the disclosure risk measures. In this model, they assume that Fk ~ Pois(λk) for each cell k. A sample is drawn by Poisson or Bernoulli sampling with a sampling fraction πk in cell k: fk | Fk ~ Bin(Fk, πk). It follows that

$$f_k \sim \mathrm{Pois}(\pi_k \lambda_k) \quad \text{and} \quad F_k \mid f_k \sim f_k + \mathrm{Pois}(\lambda_k(1 - \pi_k)),$$

where the population cell counts Fk are assumed independent given the sample cell counts fk.
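The two distributional results can be checked numerically for a single cell: the sample count is Poisson with mean πλ, and the unsampled remainder Fk − fk, given fk, is Poisson with mean λ(1 − π). The sketch below sums the joint distribution over a truncated range of population counts. The values of `LAM` and `PI` and the truncation point `F_MAX` are arbitrary choices made for illustration.

```python
from math import comb, exp, factorial


def pois_pmf(x, mean):
    """Poisson probability P(X = x) for the given mean."""
    return exp(-mean) * mean**x / factorial(x)


def binom_pmf(x, n, p):
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


LAM, PI = 2.5, 0.2   # hypothetical lambda_k and pi_k for one cell
F_MAX = 60           # truncation point; the Poisson tail beyond it is negligible


def marginal_f(f):
    """P(f_k = f), obtained by summing the joint pmf over F_k."""
    return sum(pois_pmf(F, LAM) * binom_pmf(f, F, PI) for F in range(f, F_MAX))


def conditional_F(F, f):
    """P(F_k = F | f_k = f) by Bayes' rule."""
    return pois_pmf(F, LAM) * binom_pmf(f, F, PI) / marginal_f(f)


# Largest deviation from Pois(pi * lambda) over f = 0..4.
marginal_error = max(
    abs(marginal_f(f) - pois_pmf(f, PI * LAM)) for f in range(5)
)
# Largest deviation of F_k - f_k | f_k from Pois(lambda * (1 - pi)).
conditional_error = max(
    abs(conditional_F(f + j, f) - pois_pmf(j, LAM * (1 - PI)))
    for f in range(3)
    for j in range(6)
)
```

Both errors should be at the level of floating-point rounding, which is consistent with the stated distributions.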

The parameters $\lambda_k$ are estimated using log-linear modeling. The sample frequencies $f_k$ are independent and Poisson distributed with mean $\mu_k = \pi_k \lambda_k$. A log-linear model for the $\mu_k$ is expressed as $\log(\mu_k) = x_k'\beta$, where $x_k$ is a design vector that denotes the main effects and interactions of the model for the key variables. The maximum likelihood estimator $\hat{\beta}$ is obtained by solving the score equations $\sum_k (f_k - \exp(x_k'\beta))\,x_k = 0$.
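One way to solve score equations of the form ∑k (fk − μk) xk = 0 with μk = exp(xk′β) is Newton-Raphson iteration. The sketch below is an illustration under assumptions, not the method used in the cited papers: it uses a hypothetical 2 × 2 table with a main-effects design, a common sampling fraction, and a small hand-written linear solver so that it depends only on the standard library.

```python
from math import exp, log


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def dot(x, beta):
    return sum(xj * bj for xj, bj in zip(x, beta))


def fit_loglinear(X, f, iters=50, tol=1e-10):
    """Newton-Raphson for sum_k (f_k - exp(x_k' beta)) x_k = 0.

    Assumes the first column of X is an intercept and mean(f) > 0.
    """
    p = len(X[0])
    beta = [log(sum(f) / len(f))] + [0.0] * (p - 1)  # start at the mean count
    for _ in range(iters):
        mu = [exp(dot(x, beta)) for x in X]
        score = [sum((fk - mk) * x[j] for x, fk, mk in zip(X, f, mu))
                 for j in range(p)]
        info = [[sum(mk * x[i] * x[j] for x, mk in zip(X, mu))
                 for j in range(p)] for i in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Hypothetical 2 x 2 table: design rows are (intercept, A, B) for each cell.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
f = [28, 12, 17, 3]
pi = 0.1  # assumed common sampling fraction
beta_hat = fit_loglinear(X, f)
mu_hat = [exp(dot(x, beta_hat)) for x in X]
lam_hat = [m / pi for m in mu_hat]  # lambda_k = mu_k / pi_k
```

For a main-effects model the fitted means reproduce the one-way margins, so here the fitted μk equal (row total × column total) / n, that is 30, 10, 15, and 5, and the corresponding λk are 300, 100, 150, and 50.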

The fitted values are then calculated as $\hat{\mu}_k = \exp(x_k'\hat{\beta})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$. Individual disclosure risk measures for cell $k$ are

$$P(F_k = 1 \mid f_k = 1) = \exp(-\lambda_k(1 - \pi_k))$$

and

$$E(1/F_k \mid f_k = 1) = \frac{1 - \exp(-\lambda_k(1 - \pi_k))}{\lambda_k(1 - \pi_k)}.$$
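Under the Poisson model, both per-cell measures depend only on ν = λk(1 − πk), the mean of the unsampled part of the cell: the probability that a sample unique is a population unique is exp(−ν), and the expected match probability is (1 − exp(−ν))/ν. The sketch below evaluates them and sums them over the sample uniques. The function names and the input values are hypothetical.

```python
from math import exp, expm1


def prob_population_unique(lam, pi):
    """P(F_k = 1 | f_k = 1) = exp(-lambda_k * (1 - pi_k))."""
    return exp(-lam * (1 - pi))


def expected_match_prob(lam, pi):
    """E(1/F_k | f_k = 1) = (1 - exp(-nu)) / nu with nu = lambda_k * (1 - pi_k)."""
    nu = lam * (1 - pi)
    if nu == 0:
        return 1.0  # limiting value as nu -> 0
    return -expm1(-nu) / nu  # expm1 keeps precision for small nu


def estimate_taus(sample_counts, lam_hat, pi):
    """Sum the per-cell measures over sample uniques (cells with f_k = 1)."""
    uniques = [k for k, f in sample_counts.items() if f == 1]
    tau1_hat = sum(prob_population_unique(lam_hat[k], pi) for k in uniques)
    tau2_hat = sum(expected_match_prob(lam_hat[k], pi) for k in uniques)
    return tau1_hat, tau2_hat


# Hypothetical inputs: two sample uniques with fitted lambda values.
f_counts = {"a": 1, "b": 3, "c": 1}
lam_fit = {"a": 2.0, "b": 40.0, "c": 10.0}
tau1_hat, tau2_hat = estimate_taus(f_counts, lam_fit, pi=0.5)
```

With π = 0.5, cell "a" has ν = 1 and cell "c" has ν = 5, so the sparser cell "a" dominates both sums.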

Plugging $\hat{\lambda}_k$ in for $\lambda_k$ in the formulas above leads to the estimates $\hat{P}(F_k = 1 \mid f_k = 1)$ and $\hat{E}(1/F_k \mid f_k = 1)$, and then to $\hat{\tau}_1$ and $\hat{\tau}_2$. Skinner and Shlomo (2008) develop a method for selecting the main effects and interactions of the log-linear model based on estimating and (approximately) minimizing the bias of the risk estimates $\hat{\tau}_1$ and $\hat{\tau}_2$. Many extensions and similar approaches for estimating the risk of re-identification based on probabilistic modeling are mentioned in Shlomo and Skinner (2022). In addition, the R package SDCNway, available on CRAN, carries out the probabilistic modeling for estimating the risk of re-identification.
