The panel felt that it needed to understand how Survey of Income and Program Participation (SIPP) data users access and use the data. Panel members pursued multiple approaches. First, they studied publications based on SIPP data in three separate studies: one based on a Census Bureau bibliography of published studies that used SIPP data, a second based on articles published in the past five years spanning a mixture of methodological approaches, and a third based on the 50 most-cited SIPP-based studies as identified by the Scopus citation index. The results provided in Chapter 5 reflect the papers included in each category, though the details of the assessments may reflect the judgment of the analyst who summarized the papers.
The panel also decided to collect information directly from SIPP data users. However, there is no list of SIPP users to provide a valid sampling frame for a scientific sample survey. The approach that was used and its limitations are described in this appendix.
On September 13, 2022, the National Academies of Sciences, Engineering, and Medicine’s Committee on National Statistics (CNSTAT) published notice of an invitation to SIPP users and potential users to complete a questionnaire concerning their experiences in using SIPP data. The notice was published online, through an e-blast to CNSTAT subscribers, on the listservs of the American Association for Public Opinion Research and the Survey Research Methods Section of the American Statistical Association, to users of SIPP Synthetic Beta, and via #EconTwitter. A follow-up notice was published on September 29, 2022. The last response was received on October 5, 2022.
A total of 65 responses were received, with 41 respondents completing the full questionnaire and 24 completing it partially. Because the notification was published as a general announcement, and neither the number of people who received the announcement nor the total number of SIPP users is known, true response rates cannot be calculated. However, there are 275 known users of the synthetic data, and 18 respondents reported using the synthetic data, which amounts to a response rate of approximately 6 percent among synthetic data users.
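As a simple check on this arithmetic, the short sketch below (written in Python purely for illustration; the variable names are ours, and the counts are those reported above) reproduces the approximate response rate among known synthetic data users:

    # Approximate response rate among known synthetic data users,
    # using the counts reported above.
    known_synthetic_users = 275   # known users of SIPP Synthetic Beta
    synthetic_respondents = 18    # respondents who reported using the synthetic data
    response_rate = synthetic_respondents / known_synthetic_users
    print(f"{response_rate:.1%}")  # 6.5%, i.e., approximately 6 percent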
The call for information was not designed to produce nationally representative results. Thus, these data provide a picture of the experiences of the SIPP users who responded to the call for information, but a truly representative sample might show different proportions in the responses. Typically, those who are most involved and have the strongest opinions are the most likely to respond to surveys, so the results are likely to overrepresent heavy users of SIPP, those who care deeply about whether access might change, and those who had either highly positive or highly negative experiences. The results are likely to underrepresent those who might want only peripheral involvement with SIPP data, such as those who wish only to download tables.
The respondents to the questionnaire were promised that only summary statistics and possibly verbatim quotations would be made publicly available. For this reason, the database containing their responses will not be released.
A total of 42 respondents provided information on their personal situation, with 10 being government employees, nine being faculty members, nine being students, and nine belonging to research organizations (Figure E-1).
Respondents were asked in which ways they used SIPP data, with multiple responses allowed. Overall, 42 reported downloading the public-use file, 18 accessing the synthetic data, 17 downloading tables or reports, and seven accessing the data through a Federal Statistical Research Data Center (FSRDC). An additional eight respondents had not yet used SIPP data.
Looking at what combinations of data sources were used, two reported using all four sources, 25 reported using only one source, and 23 reported using some other combination of sources (Figure E-2).
Faculty members and government employees were the most likely to use downloaded public-use data, while students and government employees were the most likely to use the synthetic data. Only government employees and faculty members used FSRDCs (Figure E-3).
Every listed SIPP module, or topic area, was used by at least three respondents, and six were each used by at least half of those responding (Figure E-4). The most commonly used modules were income, employment and earnings, demographics, family and household, education, and poverty.
Under “Other,” respondents wrote in the following:
As might be inferred from the above statistics, SIPP data users typically used multiple modules, with every respondent using at least two of the listed modules and one using all 26 (Figure E-5).1 More than half used between one and 10 modules, but a package consisting of the top 10 modules would have met the needs of only six of the 41 users. No two respondents used exactly the same set of modules.
___________________
1 Of the four respondents who wrote in additional modules, only the person adding “tax” failed to mark any other modules as having been used.
Most commonly, respondents reported that their analyses were both cross-sectional and longitudinal; the next most common response was longitudinal analysis only (Figure E-6).
Almost all respondents performed tabulations (83%) and statistical modeling (85%). Next to “Other,” respondents offered the following comments:
Of the 42 respondents who had downloaded public-use files, 10 experienced some kind of difficulty, including both problems with the data themselves (not having needed measures or losing important detail) and more general issues reflecting weaknesses in documentation and a lack of clarity about which files to download (Figure E-7). Next to “Other,” respondents wrote in the following:
Eight respondents said they either could not complete a research project or had to significantly change it because of a lack of access to detailed categories in the public data.
Fewer respondents reported using synthetic data than public-use data (18 vs. 42), and fewer reported encountering difficulties (5 vs. 10), but the percentage encountering difficulties was roughly the same (28% vs. 24%; Figures E-7 and E-8). Equal numbers cited difficulties in getting clearance for access, incomplete documentation, and unclear documentation. One person, responding to “Other,” offered the following comment:
The Cornell server and the validation server had different programming software versions. My code would run on the Cornell server but not the Census server, and I couldn’t help fix the problem. This delayed my validation request for a few weeks, but Census staff were super nice and got it figured out.
Only one respondent reported having to stop or change the research approach because of inadequate data, but this may reflect how access to the synthetic data is obtained: researchers must first submit a proposal that the Census Bureau determines to be feasible.
A difference between the public-use data and the synthetic data is that synthetic data users are encouraged to have their results validated against the Gold Standard File. Eleven of the 18 users reported asking for their results to be validated; two said major differences appeared, and five said minor differences appeared. Following are the comments provided by the respondents on this experience:
Only one of the seven FSRDC users reported having difficulties, citing the following issues: a time delay in getting clearance for access, unclear documentation, and file(s) without the measures that were needed.
Describing the impacts of their difficulties in accessing the data, five respondents reported giving up on obtaining access, including three who had difficulty accessing the public-use files (Figure E-9). Four said their research was never completed, and three said there was a time delay in obtaining access. Notably, all three who reported a time delay were users of the public-use file, which can be downloaded directly, so they were not referring to clearance procedures.
Following are the text comments for those who responded “Other”:
Eighteen respondents reported having published a journal article or report as a result of using SIPP data, six reported that the results supported the public policy process, and five reported that the data were used to make projections of program or policy impacts (Figure E-10). Some respondents reported multiple uses: one reported both publishing articles/reports and supporting the policy decision process, one reported both publishing and making projections, and two reported all three. Nineteen respondents reported no results yet from their use of SIPP data.
The 18 respondents who reported publishing articles or reports based on SIPP did so across a wide span of areas, most often on earnings, employment, survey methodology, and sociology (Figure E-11). Of the 17 fields listed on the questionnaire, respondents reported publishing in 14, with each respondent publishing in one to six fields.2 Among the 18 who reported publishing, 16 supplied the number of articles they had published, for a total of 57 articles or reports, a mean of 3.2 per respondent.
Only 7 percent of the respondents said their research needs could largely be met through the availability of a comprehensive set of standardized tables (Figure E-12). By contrast, roughly three-fourths said their research needs would be met only to a little extent (31%) or to no extent (43%). These results are consistent with the finding above that respondents often engaged in statistical modeling rather than depending only on tabulations. However, these results may be biased in the sense that those most heavily involved in using SIPP data, who have the greatest vested interest in access to the data, may have been the most likely to respond. There may be a pool of less committed SIPP users, or of potential future users if the data were more readily accessible, who are not well represented in these results.
___________________
2 The respondent checking “Other” wrote in “NA.”
The respondents were also invited to provide open-ended text comments that they felt would be helpful to the panel. Eighteen respondents provided comments. These are presented verbatim below.