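Denote the number of synthetic datasets by $m$, the estimate of the parameter of interest $\theta$ in the $j$-th synthetic set by $\hat\theta^{(j)}$, and the within-set variability of $\hat\theta^{(j)}$ by $w^{(j)}$, for $j = 1, \dots, m$ (Liu, 2022). Let $W = m^{-1}\sum_{j=1}^{m} w^{(j)}$ (average within-set variability) and $B = (m-1)^{-1}\sum_{j=1}^{m} (\hat\theta^{(j)} - \bar\theta)^2$ (between-set variability), where $\bar\theta = m^{-1}\sum_{j=1}^{m} \hat\theta^{(j)}$.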
Fully synthetic data: The final estimate of $\theta$ over the $m$ synthetic sets is $\bar\theta = m^{-1}\sum_{j=1}^{m} \hat\theta^{(j)}$, and its estimated variability is $T = (1 + m^{-1})B - W$. Hypothesis testing and confidence interval construction are based on the asymptotic assumption $T^{-1/2}(\bar\theta - \theta) \sim N(0, 1)$.
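The fully synthetic combining rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `combine_fully_synthetic` and its arguments are hypothetical, the interval assumes a normal approximation, and the guard for a non-positive $T$ is an added assumption (the subtraction of $W$ means the estimator is not guaranteed to be positive for small $m$).

```python
from statistics import NormalDist, mean, variance


def combine_fully_synthetic(estimates, within_variances, level=0.95):
    """Combine per-set estimates from m fully synthetic datasets.

    estimates[j] is the estimate of theta from set j and
    within_variances[j] is its within-set variance w(j).
    Returns (theta_bar, T, ci), where ci is None if T <= 0.
    """
    m = len(estimates)
    if m < 2 or len(within_variances) != m:
        raise ValueError("need m >= 2 estimates and one variance per estimate")

    theta_bar = mean(estimates)   # final point estimate
    W = mean(within_variances)    # average within-set variability
    B = variance(estimates)       # between-set variability (divisor m - 1)
    T = (1 + 1 / m) * B - W       # variance estimate for fully synthetic data

    if T <= 0:
        # Assumption: report no interval rather than patch a non-positive T.
        return theta_bar, T, None

    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal quantile (assumed)
    half_width = z * T ** 0.5
    return theta_bar, T, (theta_bar - half_width, theta_bar + half_width)
```

For example, `combine_fully_synthetic([1.0, 2.0, 3.0], [0.2, 0.2, 0.2])` gives $\bar\theta = 2$, $B = 1$, $W = 0.2$, and hence $T = (4/3)(1) - 0.2 \approx 1.133$.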
Partially synthetic data with or without differential privacy (DP): The final estimate of $\theta$ over the $m$ synthetic sets is $\bar\theta = m^{-1}\sum_{j=1}^{m} \hat\theta^{(j)}$, and the variance estimator is $T = W + m^{-1}B$. Hypothesis testing and confidence interval construction are based on the asymptotic assumption $T^{-1/2}(\bar\theta - \theta) \sim t_\nu$, where the degrees of freedom are $\nu = (m-1)(1 + mW/B)^2$. Although the inferential approach based on multiple synthetic datasets is the same with or without DP, the between-set variance component $B$ captures different sources of variability in the two cases. For DP data synthesis, $B$ additionally includes the variability introduced by the randomized mechanisms used to achieve the DP guarantee, which is absent without DP.
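The partially synthetic rule differs from the fully synthetic one only in how $W$ and $B$ are combined. The sketch below computes $\bar\theta$, $T = W + m^{-1}B$, and the degrees of freedom. The function name `combine_partially_synthetic` is hypothetical, and the degrees-of-freedom formula $\nu = (m-1)(1 + mW/B)^2$ is the commonly used one for partially synthetic data, taken here as an assumption. The same code applies with or without DP, since any extra variability from the DP mechanism enters through $B$ computed from the released estimates.

```python
import math
from statistics import mean, variance


def combine_partially_synthetic(estimates, within_variances):
    """Combine per-set estimates from m partially synthetic datasets.

    Returns (theta_bar, T, nu): the point estimate, its variance
    estimate, and the degrees of freedom of the reference t distribution.
    """
    m = len(estimates)
    if m < 2 or len(within_variances) != m:
        raise ValueError("need m >= 2 estimates and one variance per estimate")

    theta_bar = mean(estimates)   # final point estimate
    W = mean(within_variances)    # average within-set variability
    B = variance(estimates)       # between-set variability (divisor m - 1)
    T = W + B / m                 # variance estimate, always non-negative

    # Assumed formula: nu = (m - 1) * (1 + m * W / B)^2.
    # If all estimates coincide (B == 0), the t reference tends to a normal.
    nu = math.inf if B == 0 else (m - 1) * (1 + m * W / B) ** 2
    return theta_bar, T, nu
```

With the estimates `[1.0, 2.0, 3.0]` and within-set variances `[0.2, 0.2, 0.2]`, this gives $T = 0.2 + 1/3 \approx 0.533$ and $\nu = 2(1 + 0.6)^2 = 5.12$. A confidence interval would then use the $t_\nu$ quantile, for instance from a statistics library of the reader's choice.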