Let $F_k$ denote the population size and $f_k$ the sample size in cell $k$, where the cells are defined by cross-classifying the $K$ key variables. The set of sample uniques is $SU = \{k : f_k = 1\}$; these are the high-risk records in the public-use file because they have the potential to be population uniques. The DIS measure (Skinner & Elliot, 2002) is defined as

$$\mathrm{DIS} = \frac{\sum_k I(f_k = 1)}{\sum_k I(f_k = 1)\, F_k},$$
where $I$ is the indicator function. The DIS measure is the number of sample uniques divided by the sum of the population sizes of the cells in $SU$. Its advantage is that it can be estimated easily, without any probabilistic modeling. The estimate of the DIS is

$$\widehat{\mathrm{DIS}} = \frac{\pi n_1}{\pi n_1 + 2(1 - \pi)\, n_2},$$
where $\pi$ is the sampling fraction, $n_1 = \sum_k I(f_k = 1)$ is the number of sample uniques, and $n_2 = \sum_k I(f_k = 2)$ is the number of sample doubles.
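As a rough illustration, the DIS estimate can be computed from the sample alone. The sketch below assumes the estimator takes the Skinner and Elliot (2002) form $\pi n_1 / (\pi n_1 + 2(1 - \pi) n_2)$; the function name `estimate_dis`, the input format (one tuple of key-variable values per record), and the toy records are hypothetical.

```python
from collections import Counter


def estimate_dis(sample_keys, pi):
    """Estimate DIS from sample cell counts only.

    sample_keys: one hashable combination of key-variable values per
                 sample record (hypothetical input format).
    pi:          sampling fraction, 0 < pi <= 1.
    """
    cell_counts = Counter(sample_keys)  # f_k for every non-empty cell
    n1 = sum(1 for f in cell_counts.values() if f == 1)  # sample uniques
    n2 = sum(1 for f in cell_counts.values() if f == 2)  # sample doubles
    denominator = pi * n1 + 2 * (1 - pi) * n2
    if denominator == 0:
        return float("nan")  # no sample uniques or doubles: estimate undefined
    return pi * n1 / denominator


# Toy sample: two uniques, one double, and one cell of size three.
records = [
    ("F", "30-39", "north"),
    ("M", "40-49", "south"),
    ("F", "50-59", "east"), ("F", "50-59", "east"),
    ("M", "20-29", "west"), ("M", "20-29", "west"), ("M", "20-29", "west"),
]
dis_hat = estimate_dis(records, pi=0.1)
print(dis_hat)
```

Only the counts of uniques and doubles enter the estimate, which is why no model for the population cell sizes is needed.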
As mentioned in Chapter 3, the “fishing” scenario is generally not sufficient for statistical agencies, which need to assess the consequences of releasing a public-use file. A more relevant disclosure risk measure is therefore based on the attack scenario, in which an intruder has access to the confidential microdata and wants to make an identification based on an external data source containing information about the population. Two main disclosure risk measures are the number of sample uniques that are also population uniques,

$$\tau_1 = \sum_k I(f_k = 1, F_k = 1),$$

and the expected number of correct matches for the sample uniques,

$$\tau_2 = \sum_k I(f_k = 1)\, \frac{1}{F_k}.$$
For example, if a sample unique matches three individuals in the population, the match probability for that sample unique is 1/3. Aggregating the match probabilities over all sample uniques leads to $\tau_2 = \sum_k I(f_k = 1)\, 1/F_k$.
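When the population counts are known (as they are to the agency in a simulation or evaluation study), both measures can be computed directly. The sketch below takes $\tau_1$ to be the number of sample uniques that are also population uniques and $\tau_2$ to be the sum of $1/F_k$ over the sample uniques; the cell labels and counts are invented for illustration.

```python
from fractions import Fraction


def risk_measures(sample_counts, population_counts):
    """Return (tau1, tau2) for cells indexed by the same keys.

    tau1: number of sample uniques that are population uniques.
    tau2: sum of match probabilities 1/F_k over the sample uniques.
    """
    tau1 = 0
    tau2 = Fraction(0)
    for cell, f in sample_counts.items():
        if f == 1:  # only sample uniques contribute
            F = population_counts[cell]
            tau1 += int(F == 1)
            tau2 += Fraction(1, F)
    return tau1, tau2


# Hypothetical cells: "b" is the sample unique that matches three people.
sample_counts = {"a": 1, "b": 1, "c": 2, "d": 1}
population_counts = {"a": 1, "b": 3, "c": 10, "d": 2}
tau1, tau2 = risk_measures(sample_counts, population_counts)
print(tau1, tau2)
```

Exact fractions are used so that the contribution of each sample unique (1, 1/3, and 1/2 here) remains visible in the total.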
Shlomo and Skinner (2022) and Shlomo (2010) (and references therein) provide an overview of the literature on (a) measuring the risks of re-identification and (b) using probabilistic modeling to estimate the population sizes in cell k.
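Using a probabilistic model, the risk measures are estimated by

$$\hat{\tau}_1 = \sum_k I(f_k = 1)\, \hat{P}(F_k = 1 \mid f_k = 1)$$

and

$$\hat{\tau}_2 = \sum_k I(f_k = 1)\, \hat{E}(1/F_k \mid f_k = 1).$$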
Skinner and Holmes (1998) and Elamir and Skinner (2006) propose a Poisson distribution and a log-linear model to estimate the disclosure risk measures. In this model, they assume that $F_k \sim \mathrm{Pois}(\lambda_k)$ for each cell $k$. A sample is drawn by Poisson or Bernoulli sampling with sampling fraction $\pi_k$ in cell $k$, so that $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$. It follows that
$$f_k \sim \mathrm{Pois}(\pi_k \lambda_k) \quad \text{and} \quad F_k - f_k \mid f_k \sim \mathrm{Pois}(\lambda_k (1 - \pi_k)),$$
where the population cell counts $F_k$ are assumed independent given the sample cell counts $f_k$.
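Both distributional results can be checked numerically by combining the Poisson model for $F_k$ with the binomial sampling step. The sketch below does this for a single cell, comparing the marginal of $f_k$ with $\mathrm{Pois}(\pi_k \lambda_k)$ and the conditional distribution of the unsampled count $F_k - f_k$ with $\mathrm{Pois}(\lambda_k(1 - \pi_k))$. The values of `LAM` and `PI` and the truncation point `F_MAX` are arbitrary choices for illustration.

```python
import math


def pois_pmf(x, mean):
    """Poisson probability mass function."""
    return math.exp(-mean) * mean**x / math.factorial(x)


def binom_pmf(x, n, p):
    """Binomial probability mass function."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


LAM, PI = 4.0, 0.25  # illustrative values, not taken from the text
F_MAX = 80           # truncation point for the sum over F_k


def marginal_f(f):
    """P(f_k = f), summing the joint distribution over F_k."""
    return sum(pois_pmf(F, LAM) * binom_pmf(f, F, PI) for F in range(f, F_MAX))


def conditional_excess(j, f):
    """P(F_k - f_k = j | f_k = f), from the joint distribution."""
    return pois_pmf(f + j, LAM) * binom_pmf(f, f + j, PI) / marginal_f(f)


# Largest absolute deviation from the claimed Poisson distributions.
marginal_gap = max(abs(marginal_f(f) - pois_pmf(f, PI * LAM)) for f in range(6))
conditional_gap = max(
    abs(conditional_excess(j, 1) - pois_pmf(j, LAM * (1 - PI))) for j in range(10)
)
print(marginal_gap, conditional_gap)  # both should be at rounding-error level
```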
The parameters $\lambda_k$ are estimated using log-linear modeling. The sample frequencies $f_k$ are independent and Poisson-distributed with mean $\mu_k = \pi_k \lambda_k$. A log-linear model for the $\mu_k$ is expressed as $\log \mu_k = \log \pi_k + x_k'\beta$ (equivalently, $\log \lambda_k = x_k'\beta$), where $x_k$ is a design vector that denotes the main effects and interactions of the model for the key variables. The maximum likelihood estimator $\hat{\beta}$ is obtained by solving the score equations

$$\sum_k \left(f_k - \pi_k \exp(x_k'\beta)\right) x_k = 0.$$
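The score equations have no closed-form solution in general, but they can be solved iteratively. The sketch below uses plain Newton-Raphson under the assumption that the model is $\mu_k = \pi_k \exp(x_k'\beta)$; it is not the procedure used by any particular package. The helper names, the starting value, and the 2×2 toy table with a main-effects (independence) design are all hypothetical.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_loglinear(f, X, pi, max_iter=50, tol=1e-10):
    """Newton-Raphson for sum_k (f_k - pi_k exp(x_k' beta)) x_k = 0.

    Assumes the first column of X is the intercept.
    """
    p = len(X[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(f) / sum(pi))  # crude start: common lambda for all cells
    for _ in range(max_iter):
        mu = [pk * math.exp(sum(x * b for x, b in zip(xk, beta)))
              for xk, pk in zip(X, pi)]
        score = [sum((fk - mk) * xk[j] for fk, mk, xk in zip(f, mu, X))
                 for j in range(p)]
        info = [[sum(mk * xk[i] * xk[j] for mk, xk in zip(mu, X))
                 for j in range(p)] for i in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Toy 2x2 table; design columns are intercept, effect of A, effect of B.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
f = [10, 20, 30, 40]
pi = [0.1, 0.1, 0.1, 0.1]
beta_hat = fit_loglinear(f, X, pi)
lam_hat = [math.exp(sum(x * b for x, b in zip(xk, beta_hat))) for xk in X]
mu_hat = [pk * lk for pk, lk in zip(pi, lam_hat)]
print([round(m, 4) for m in mu_hat])
```

For this main-effects design the fitted sample means should reproduce the familiar independence fit (row total times column total divided by the sample size), which gives a simple check on the iteration.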
The fitted values are then calculated by $\hat{\lambda}_k = \exp(x_k'\hat{\beta})$ and $\hat{\mu}_k = \pi_k \hat{\lambda}_k$. Individual disclosure risk measures for cell $k$ are
$$P(F_k = 1 \mid f_k = 1) = \exp(-\lambda_k (1 - \pi_k))$$

$$E(1/F_k \mid f_k = 1) = \frac{1 - \exp(-\lambda_k (1 - \pi_k))}{\lambda_k (1 - \pi_k)}.$$
Plugging $\hat{\lambda}_k$ for $\lambda_k$ in the formulas above leads to the estimates $\hat{P}(F_k = 1 \mid f_k = 1)$ and $\hat{E}(1/F_k \mid f_k = 1)$, and then to $\hat{\tau}_1$ and $\hat{\tau}_2$. Skinner and Shlomo (2008) develop a method for selecting the main effects and interactions of the log-linear model based on estimating and (approximately) minimizing the bias of the risk estimates $\hat{\tau}_1$ and $\hat{\tau}_2$. Many extensions and similar approaches for estimating the risk of re-identification through probabilistic modeling have been proposed; these are mentioned in Shlomo and Skinner (2022). In addition, the R package SDCNway, available on CRAN, carries out the probabilistic modeling for estimating the risk of re-identification.
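The plug-in step can be sketched in a few lines. The code below assumes the per-cell risks are $\exp(-\lambda_k(1 - \pi_k))$ and $(1 - \exp(-\lambda_k(1 - \pi_k))) / (\lambda_k(1 - \pi_k))$, which follow from $F_k - 1$ given $f_k = 1$ being Poisson with mean $\lambda_k(1 - \pi_k)$. The function names and the fitted values are hypothetical, and the code is independent of the SDCNway package.

```python
import math


def prob_population_unique(lam, pi):
    """P(F_k = 1 | f_k = 1) = exp(-lam * (1 - pi))."""
    return math.exp(-lam * (1.0 - pi))


def expected_match_probability(lam, pi):
    """E(1/F_k | f_k = 1) = (1 - exp(-m)) / m with m = lam * (1 - pi)."""
    m = lam * (1.0 - pi)
    if m == 0.0:
        return 1.0  # limit as m -> 0: the sample unique is a population unique
    return -math.expm1(-m) / m  # expm1 keeps precision when m is small


def estimate_tau(sample_counts, lam_hat, pi):
    """Sum the per-cell risks over the sample uniques: (tau1_hat, tau2_hat)."""
    tau1 = tau2 = 0.0
    for f, lam, p in zip(sample_counts, lam_hat, pi):
        if f == 1:
            tau1 += prob_population_unique(lam, p)
            tau2 += expected_match_probability(lam, p)
    return tau1, tau2


# Hypothetical fitted values for four cells; cells 0 and 2 are sample uniques.
sample_counts = [1, 3, 1, 0]
lam_hat = [2.0, 30.0, 10.0, 0.5]
pi = [0.5, 0.5, 0.5, 0.5]
tau1_hat, tau2_hat = estimate_tau(sample_counts, lam_hat, pi)
print(tau1_hat, tau2_hat)
```

Because $1/F_k \geq I(F_k = 1)$ for every cell, the second estimate is never smaller than the first.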