A Roadmap for Disclosure Avoidance in the Survey of Income and Program Participation (2024)

Chapter: Appendix A: Technical Details on Measuring Disclosure Risk

Suggested Citation: "Appendix A: Technical Details on Measuring Disclosure Risk." National Academies of Sciences, Engineering, and Medicine. 2024. A Roadmap for Disclosure Avoidance in the Survey of Income and Program Participation. Washington, DC: The National Academies Press. doi: 10.17226/27169.

Appendix A

Technical Details on Measuring Disclosure Risk

Let F_k denote the population count and f_k the sample count in cell k, where the cells are defined by cross-classifying the K key variables. The set of sample uniques is defined as SU = {k: f_k = 1}; these are the high-risk records in the public-use file that have the potential to be population uniques. The DIS measure (Skinner & Elliot, 2002) is defined as

$\mathrm{DIS} = \frac{\sum_{k} I (f_{k} = 1)}{\sum_{k} I (f_{k} = 1) F_{k}},$

where I is the indicator function. The DIS measure is the number of sample uniques divided by the sum of the population sizes of the cells in SU. The advantage of the DIS measure is that it can be estimated easily, without any probabilistic modeling. The estimate of the DIS is

$\widehat{\mathrm{DIS}} = \frac{\pi n_{1}}{\pi n_{1} + 2 (1 - \pi) n_{2}},$

where π is the sampling fraction, n₁ = ∑_k I(f_k = 1) is the number of sample uniques, and n₂ = ∑_k I(f_k = 2) is the number of sample doubles.
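The estimator above needs only the sample cell counts and the sampling fraction, which the following minimal Python sketch illustrates. The function name `estimate_dis`, the representation of records as tuples of key-variable values, and the toy data are assumptions made for illustration; they do not come from the report.

```python
from collections import Counter

def estimate_dis(sample_keys, pi):
    """Estimate DIS from the key-variable combinations seen in a sample.

    sample_keys: iterable of hashable tuples, one per sample record, holding
                 that record's values on the key variables (hypothetical layout).
    pi:          sampling fraction, assumed equal for all records, 0 < pi <= 1.
    """
    cell_counts = Counter(sample_keys)  # f_k for every non-empty cell k
    n1 = sum(1 for f in cell_counts.values() if f == 1)  # sample uniques
    n2 = sum(1 for f in cell_counts.values() if f == 2)  # sample doubles
    denom = pi * n1 + 2 * (1 - pi) * n2
    if denom == 0:
        return float("nan")  # estimate is undefined for this sample
    return pi * n1 / denom

# Toy sample: two sample uniques, two sample doubles, one cell of size three.
toy_sample = [
    ("F", "30-39", "NY"),
    ("M", "30-39", "NY"), ("M", "30-39", "NY"),
    ("F", "40-49", "CA"),
    ("M", "50-59", "TX"), ("M", "50-59", "TX"),
    ("M", "20-29", "TX"), ("M", "20-29", "TX"), ("M", "20-29", "TX"),
]
dis_hat = estimate_dis(toy_sample, pi=0.1)
```

With a full census (π = 1) every sample unique is a population unique, and the estimate equals 1 whenever at least one sample unique exists.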

As mentioned in Chapter 3, the “fishing” scenario is generally not sufficient for statistical agencies, because they need to assess the consequences of releasing a public-use file. A more relevant disclosure risk measure is therefore based on the attack scenario, in which an intruder has access to the confidential microdata and wants to make an identification based on an external data source containing information about the population. The two main disclosure risk measures are

the expected number of sample uniques that are population uniques: τ₁ = ∑_k I(f_k = 1, F_k = 1), and
the expected number of correct matches for sample uniques, assuming a random assignment within cell k.

For example, if a sample unique matches to three individuals in the population, the match probability for that sample unique would be 1/3. Aggregating all match probabilities over the sample uniques leads to τ₂ = ∑_k I(f_k = 1) 1/F_k.
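When the population counts F_k are known (as they would be in a test against a census or a synthetic population), both measures can be computed directly. The sketch below is illustrative only: the function name `tau_measures`, the dictionary layout, and the toy counts are assumptions, not part of the report.

```python
def tau_measures(sample_counts, population_counts):
    """Return (tau1, tau2) for known population cell counts.

    sample_counts:     dict mapping cell -> f_k (sample count)
    population_counts: dict mapping cell -> F_k (population count, F_k >= f_k)
    """
    tau1 = 0    # sample uniques that are also population uniques
    tau2 = 0.0  # sum of match probabilities 1/F_k over sample uniques
    for cell, f in sample_counts.items():
        if f != 1:
            continue  # only sample uniques contribute to either measure
        F = population_counts[cell]
        if F == 1:
            tau1 += 1
        tau2 += 1.0 / F
    return tau1, tau2

# Toy counts: cells "a", "b", and "d" are sample uniques.
toy_f = {"a": 1, "b": 1, "c": 2, "d": 1}
toy_F = {"a": 1, "b": 3, "c": 10, "d": 2, "e": 5}
tau1, tau2 = tau_measures(toy_f, toy_F)
```

In this toy example cell "b" is the case described above: a sample unique that matches three individuals in the population, contributing 1/3 to τ₂.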

Shlomo and Skinner (2022) and Shlomo (2010) (and references therein) provide an overview of the literature on (a) measuring the risks of re-identification and (b) using probabilistic modeling to estimate the population sizes in cell k.

Using a probabilistic model, the risk measures are estimated by

$\hat{\tau}_{1} = \sum_{k} I (f_{k} = 1) \hat{P} (F_{k} = 1 \mid f_{k} = 1)$ and $\hat{\tau}_{2} = \sum_{k} I (f_{k} = 1) \hat{E} (1 / F_{k} \mid f_{k} = 1) .$

Skinner and Holmes (1998) and Elamir and Skinner (2006) propose a Poisson distribution and a log-linear model to estimate the disclosure risk measures. In this model, they assume that F_k ~ Pois(λ_k) for each cell k. A sample is drawn by Poisson or Bernoulli sampling with a sampling fraction π_k in cell k: f_k | F_k ~ Bin(F_k, π_k). It follows that

f_k ~ Pois(π_k λ_k) and F_k | f_k ~ f_k + Pois(λ_k(1 − π_k)),

where the population cell counts F_k are assumed independent given the sample cell counts f_k.

The parameters λ_k are estimated using log-linear modeling. The sample frequencies f_k are independent and Poisson-distributed with mean μ_k = π_k λ_k. A log-linear model for the μ_k is expressed as $\log (\mu_{k}) = x_{k}^{\prime} \beta$, where x_k is a design vector that denotes the main effects and interactions of the model for the key variables. The maximum likelihood estimator $\hat{\beta}$ is obtained by solving the score equations: $\sum_{k} (f_{k} - \exp (x_{k}^{\prime} \beta)) x_{k} = 0$.
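In general the score equations are solved iteratively, but one special case has a closed form and makes the idea concrete. For two key variables and a main-effects-only (independence) model, the score equations force the fitted row and column totals to equal the observed ones, giving fitted means equal to (row total × column total) / n. The sketch below assumes that special case and a common sampling fraction π; the function name `fit_independence` and the toy table are illustrative assumptions, not the model used in the report.

```python
def fit_independence(table, pi):
    """Fit a main-effects-only Poisson log-linear model to a two-way table.

    table: list of rows of sample counts f_ij for two key variables
           (must contain at least one record).
    pi:    common sampling fraction (an assumption of this sketch).
    Returns (mu_hat, lam_hat) as nested lists with the shape of `table`.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Closed-form maximum likelihood fit under independence.
    mu_hat = [[r * c / n for c in col_totals] for r in row_totals]
    # Since mu_k = pi * lambda_k, dividing by pi recovers the population-level rate.
    lam_hat = [[m / pi for m in row] for row in mu_hat]
    return mu_hat, lam_hat

# Toy 2 x 3 table of sample counts with two sample uniques.
toy_table = [[1, 4, 5],
             [0, 1, 9]]
mu_hat, lam_hat = fit_independence(toy_table, pi=0.05)
```

Models with interaction terms generally have no closed form and are fitted with iterative methods such as Newton-Raphson or iterative proportional fitting.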

The fitted values are then calculated as $\hat{\mu}_{k} = \exp (x_{k}^{\prime} \hat{\beta})$ and $\hat{\lambda}_{k} = \hat{\mu}_{k} / \pi_{k}$. Individual disclosure risk measures for cell k are

P(F_k = 1|f_k = 1) = exp(−λ_k(1 − π_k))

E(1/F_k|f_k = 1) = (1 − exp(−λ_k(1 − π_k)))/(λ_k(1 − π_k)).
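Both quantities follow from the fact that, under the Poisson model, the number of unsampled population units in a cell is Poisson-distributed with mean m = λ_k(1 − π_k). A sample unique is a population unique when that number is zero, which has probability exp(−m), and the expected match probability is E[1/(1 + X)] = (1 − exp(−m))/m for X ~ Pois(m). The Python sketch below evaluates these per-cell measures and sums them over sample uniques. All function names and inputs are hypothetical, and a common sampling fraction is assumed.

```python
import math

def prob_pop_unique(lam, pi):
    """P(F_k = 1 | f_k = 1): no unsampled units fall in the cell."""
    return math.exp(-lam * (1 - pi))

def expected_inv_pop_size(lam, pi):
    """E(1/F_k | f_k = 1) under the Poisson model."""
    m = lam * (1 - pi)
    if m == 0:
        return 1.0  # limiting value: the cell holds only the sampled unit
    return -math.expm1(-m) / m  # (1 - exp(-m)) / m, computed stably for small m

def estimate_taus(sample_counts, lam_hat, pi):
    """Sum the per-cell measures over sample uniques.

    sample_counts: sequence of f_k values, one per cell
    lam_hat:       sequence of fitted lambda_k values, in the same cell order
    pi:            common sampling fraction (an assumption of this sketch)
    """
    tau1_hat = 0.0
    tau2_hat = 0.0
    for f, lam in zip(sample_counts, lam_hat):
        if f == 1:
            tau1_hat += prob_pop_unique(lam, pi)
            tau2_hat += expected_inv_pop_size(lam, pi)
    return tau1_hat, tau2_hat

# Toy input: the first and third cells are sample uniques.
tau1_hat, tau2_hat = estimate_taus([1, 2, 1], [2.0, 50.0, 4.0], pi=0.5)
```

Because 1/F_k is at least as large as the indicator that F_k = 1, the second estimate is never smaller than the first.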

Plugging $\hat{\lambda}_{k}$ for λ_k in the formulas above leads to the estimates $\hat{P}$(F_k = 1|f_k = 1) and $\hat{E}$(1/F_k|f_k = 1) and then to $\hat{\tau}_{1}$ and $\hat{\tau}_{2}$. Skinner and Shlomo (2008) develop a method for selecting the main effects and interactions of the log-linear model based on estimating and (approximately) minimizing the bias of the risk estimates $\hat{\tau}_{1}$ and $\hat{\tau}_{2}$. Many extensions and similar approaches for estimating the risk of re-identification based on probabilistic modeling have been proposed; these are mentioned in Shlomo and Skinner (2022). In addition, the R package SDCNway, available on CRAN, carries out the probabilistic modeling for estimating the risk of re-identification.