Facing growing demands for public-use files with higher levels of granularity, as well as demands to expand the user base, many statistical agencies are exploring remote flexible table generators and potential extensions that allow remote analyses through a direct web link. With public-use file data already released, the aim is to provide internet platforms where more restricted data may be safely used. Table generators differ from the secure online data access (SODA) discussed in Chapter 5 in that they do not provide direct access to the microdata or allow fully customized interaction with the data; instead, they provide a tool for generating tables or some types of analyses (with constraints). Based on the literature review provided in Chapter 9, current Survey of Income and Program Participation (SIPP) users tend to perform complex analyses that could not readily be supported by a table generator at this time, though an automated verification and validation server (per Recommendation 6-3) could serve as the basis for a remote analysis tool of this type.
However, the complex structure of SIPP data is itself a barrier to SIPP data use, and a simplified table generator might broaden the use of SIPP to a larger audience. For example, policymakers may want simple descriptive statistics of program participants, or researchers may want an initial look at the data before pursuing more complex analyses through SODA or through a Federal Statistical Research Data Center (FSRDC). An initial product can be a simple tool with a simple interface, a specific purpose, and a clear statement of the value it provides. Keeping track of users and the variables used in queries will help the Census Bureau understand key uses and guide future development. If the simple tool proves valuable, it can be expanded by surveying users about their interests. By starting small and being clear about the tool’s purpose, value, and contents, an agency can “test the waters” on the usefulness of the tool.
Internet platforms that allow users to retrieve automatic outputs require applying perturbation methods to the outputs prior to their release. The panel envisions that the underlying microdata for this tool would contain the variables from the public-use file and a select set of variables that are not released in the public-use file. In contrast to releasing synthetic versions of these additional variables, here the perturbation is applied only to the outputs requested by users; hence, the data retain more utility. The development of a remote table generator and analysis platform would serve the wider research and policy analyst community and could help provide evidence and inform subsequent grant or FSRDC applications using SIPP.
There are some working examples of remote flexible table generators, particularly for census outputs that can be used by the public or require a light registration process. Two examples are the European Union Census Hub and the Australian Bureau of Statistics Tablebuilder.1 With more research and development, these flexible table generators could go a step further and provide statistical analyses such as exploratory analyses and regression models (e.g., the Australian Tablebuilder allows the calculation of means, medians, and confidence intervals). Some details of the confidentiality protection in the Australian Tablebuilder are discussed in Thompson et al. (2013). More research would be beneficial on handling complex statistical modeling, such as hierarchical and multilevel modeling, which can accommodate the analyses of SIPP longitudinal data.
___________________
1 For the European Union hub see https://ec.europa.eu/CensusHub2. For the Australian Bureau of Statistics Tablebuilder see https://www.abs.gov.au/websitedbs/censushome.nsf/home/tablebuilder
At its core, the flexible table generator with remote analysis extensions uses large multiway hypercubes (crosstabulation) as input data, thus ensuring minimum building blocks for the tables and models, below which the data cannot be disaggregated. In particular, the generation of two outputs that differ by only one individual would break any potential confidentiality protection. The entry into the platform is through a menu-driven user interface and bespoke software platforms where variables and their (coarsened) categories are predetermined and can be selected as a variable in the table or model. The software includes general rules-of-thumb regarding minimum size thresholds, sparsity thresholds, rows/columns that have only a few categories while the other categories are zero (attribute disclosures), number of dimensions allowed in a table or model, and more. When requesting the table or model, if these rules-of-thumb are not met, the user will receive a message that their request is denied and will be redirected to selecting a reduced number of categories of a variable or alternatively to include a larger sample size by expanding the subgroup selection. There are particular rules-of-thumb for remote analysis servers as well; for example, no single data point can be released. Therefore, all requested scatter plots are produced as sequential box plots, and outliers in the whiskers of box plots are suppressed. In addition, the minimum and maximum values would not be disseminated, and the underlying models may have robustness features added to them to down-weight outliers. Quantiles can be released but with appropriate perturbation added as described below.
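The screening step described above can be illustrated with a minimal sketch. It assumes a two-way table of unweighted counts and checks four of the rules named in the text: number of dimensions, minimum cell size, sparsity, and rows or columns with a single nonzero category. The class and function names, the reason codes, and all threshold values are hypothetical and chosen only for illustration; they do not represent the rules of any agency's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisclosureRules:
    # All thresholds are illustrative assumptions, not agency policy.
    min_cell_count: int = 10      # smallest nonzero count allowed in a cell
    max_zero_share: float = 0.5   # largest allowed share of empty cells
    max_dimensions: int = 4       # largest number of variables in a request


def check_table(table, n_dimensions, rules=DisclosureRules()):
    """Return a list of reason codes; an empty list means the request
    passes screening and can move on to output perturbation.

    `table` is a list of rows, each a list of unweighted cell counts.
    """
    reasons = []
    if n_dimensions > rules.max_dimensions:
        reasons.append("TOO_MANY_DIMENSIONS")

    cells = [count for row in table for count in row]
    if any(0 < count < rules.min_cell_count for count in cells):
        reasons.append("SMALL_CELL")
    if cells and cells.count(0) / len(cells) > rules.max_zero_share:
        reasons.append("TOO_SPARSE")

    # Attribute disclosure: a row or column in which every unit falls
    # into a single category reveals that attribute for the whole group.
    columns = [list(col) for col in zip(*table)]
    for line in table + columns:
        if len(line) > 1 and sum(1 for count in line if count > 0) == 1:
            reasons.append("ATTRIBUTE_DISCLOSURE")
            break
    return reasons
```

In a deployed tool, a non-empty list would trigger the denial message and redirect the user to coarsen categories or widen the subgroup, as described above.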
The use of differential privacy in flexible table generators is well documented; see, for example, Rinott et al. (2018), where the exponential mechanism is utilized based on a Discrete Laplace distribution. Since SIPP is a probability sample and has survey weights, one can adjust for the weighted counts in the tables as shown in Shlomo et al. (2019). See Appendix C for more details.
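As a rough illustration of count perturbation with discrete Laplace noise, the sketch below draws integer noise with probability proportional to exp(-epsilon * |k|), using the fact that the difference of two independent geometric variables has this distribution. It assumes unweighted counts with sensitivity 1 and is not the specific mechanism of Rinott et al. (2018) or the weighted adjustment of Shlomo et al. (2019). The function names are hypothetical, and a production system would need a cryptographically secure random source rather than Python's default generator.

```python
import math
import random


def discrete_laplace(epsilon, rng):
    """Draw integer noise k with P(k) proportional to exp(-epsilon * |k|)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")

    def geometric():
        # Inverse-CDF draw on {0, 1, 2, ...} with P(G >= k) = exp(-epsilon * k).
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        return math.floor(-math.log(u) / epsilon)

    # The difference of two i.i.d. geometric draws is discrete Laplace.
    return geometric() - geometric()


def perturb_counts(counts, epsilon, seed=None):
    """Add independent discrete Laplace noise to each cell count.

    Negative results are clamped to zero, a post-processing step that
    does not weaken the privacy guarantee of the noise itself.
    """
    rng = random.Random(seed)  # illustrative only; not cryptographically secure
    return {
        cell: max(0, n + discrete_laplace(epsilon, rng))
        for cell, n in counts.items()
    }
```

Smaller values of epsilon spread the noise more widely, giving stronger protection at the cost of less accurate counts.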
Though the emphasis here is on a flexible table generator, the same framework may be extended. For statistics computed from continuous variables, such as sums, averages, quantiles, and correlations, one can use the same concept of the microdata keys so that identical requests receive identical perturbations. For more advanced modeling in the remote analysis server, one can add noise to the underlying estimating equations. However, more research is required on how the table generator can be extended for other analyses as well as on actual implementation in a differentially private setting. Implementation may require a registration process before policy analysts could access data more restricted than what would be available to the wider public on the internet. Future development of this strategy would require ongoing collaborations among the academic community and government researchers. SIPP users use sophisticated statistical models and a table generator may not fulfill their needs, but such a system would be useful for exploratory or preliminary analyses before embarking on access requests, for example through SODA or an FSRDC.
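One possible reading of the microdata-key idea is sketched below for a perturbed total. It assumes that each record carries a permanent random key, that the keys of the records contributing to an output are combined into a single cell key, and that the cell key alone determines the noise, so the same group of records receives the same perturbation however the request is phrased. The function names, the key space, and the Gaussian noise are hypothetical placeholders for illustration; this is neither the method documented in Appendix C nor a differentially private mechanism.

```python
import random

KEY_SPACE = 2**32  # illustrative size of the record-key space


def assign_record_keys(n_records, seed):
    """Give every microdata record a permanent pseudo-random key."""
    rng = random.Random(seed)
    return [rng.randrange(KEY_SPACE) for _ in range(n_records)]


def cell_key(record_keys, members):
    """Combine the keys of the contributing records into one cell key.

    The sum is order-independent, so the key depends only on which
    records are in the cell.
    """
    return sum(record_keys[i] for i in set(members)) % KEY_SPACE


def perturbed_total(values, record_keys, members, noise_scale):
    """Return the total over `members` plus noise fixed by the cell key."""
    rng = random.Random(cell_key(record_keys, members))
    # Sum in a fixed order so floating-point results are reproducible.
    true_total = sum(values[i] for i in sorted(set(members)))
    return true_total + rng.gauss(0.0, noise_scale)
```

Because the noise is a function of the cell key only, repeating a request or reaching the same set of records through a different query does not produce fresh noise that could be averaged away.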
It could take years to fully develop these platforms, since it will take time for interface development and other IT solutions. The release of initial products with milestones outlined would help to provide access to restricted data through a work-in-progress tool. As an example, progress in the development of the flexible table generator and, more generally, of the remote analysis platform could follow a sequence like this:
Recommendation 7-1: The Census Bureau should assess the demand for an initial flexible table generator as a simple tool, with a specific purpose, that is designed to gauge value and provide direction for further development, and proceed with the development of one if there appears to be sufficient demand.