Suggested Citation: "7 Data Availability in Academic Research." National Academy of Sciences. 2024. Toward a New Era of Data Sharing: Summary of the US-UK Scientific Forum on Researcher Access to Data. Washington, DC: The National Academies Press. doi: 10.17226/27520.

7 | DATA AVAILABILITY IN ACADEMIC RESEARCH

THE FINAL PANEL returned to several cross-cutting themes that had emerged earlier during the forum. Three final case studies touched on the tensions between open science and privacy, the repurposing of privately held data, transparency in data provenance and collection methods, and cultural and practical inclinations among academic researchers.

In his introductory remarks, moderator Frank Kelly, emeritus professor of the mathematics of systems at the University of Cambridge, focused on the issue of the allocation of credit for scientific and technological advances involving data. Inventors who made critical contributions to the infrastructure of science—such as John Harrison, the inventor of the chronometer, and Tim Berners-Lee, the inventor of the World Wide Web—were not initially considered scientists and only later received the credit they deserved from the research community, said Kelly. “Some of the problems we have we inflict on ourselves in academia.” He also pointed out that artificial intelligence and machine learning will pose additional challenges to the equitable allocation of credit.

Data in the U.S. Federal Statistical System

Data access falls along a continuum, observed Chris Marcum, senior statistician and senior science policy analyst in the Office of the Chief Statistician of the United States. At one end of the spectrum are data that are completely closed and locked down. At the other end are data that are completely open and available. In between are data assets that require special care to preserve privacy or confidentiality.

Marcum added that open access is not the same as public access: not all public-access data are open data, but all open data, at least from the perspective of the federal government, are public access. “Public access is a broader umbrella by which data can be made available.”

The U.S. federal statistical system collects and holds data that fall throughout the spectrum. Coordinated and overseen by the Office of the Chief Statistician of the United States within the Office of Management and Budget, the system is “the quintessentially federated system,” in Marcum’s words. The federal statistical system includes 16 principal statistical agencies with the mission of generating and disseminating objective, relevant, and timely statistics, along with about 100 other statistical units and programs across the federal government that generate or use statistics in meeting their missions and needs. The information gathered by the system “makes the economy work,” said Marcum. It “includes our economic indicators. It includes the census. It’s core to our democracy.”

The system includes about 1,400 confidential datasets within the standard application process portal that require special protection, Marcum observed. A researcher may need to become a sworn agent of an agency to access these data or may need to visit a brick-and-mortar research data center. In general, the Foundations for Evidence-Based Policymaking Act (the Evidence Act) codifies in law that openness/sharing and confidentiality/security are not opposing principles.

At the time of the forum, a proposed regulation was being discussed that would promulgate four fundamental responsibilities for federal statistical agencies and units:

  1. Produce and disseminate relevant and timely statistical information.
  2. Conduct credible and accurate statistical activities.
  3. Conduct objective statistical activities.
  4. Protect the trust of information providers by ensuring the confidentiality and exclusive statistical use of their responses.

Two additional regulations that seek to reduce barriers to access of confidential statistical data, as required by the Evidence Act, are in development by the Office of the Chief Statistician of the United States. The Presumption of Accessibility for Statistical Agencies and Units regulation would implement the authority for statistical agencies and units to request and access data from across federal agencies, with very few exemptions, for the purposes of evidence building, even if the data are confidential. The Expanding Secure Access to Confidential Information Protection and Statistical Efficiency Act (CIPSEA) Data Assets regulation would improve researcher access to confidential statistical data while protecting the data from improper disclosure by standardizing the levels of protection and access to statistical information. This represents “a key milestone in the federal statistical system,” said Marcum. “For the first time, the recognized statistical agencies and units have coordinated and agreed upon the same application process to access restricted data assets. For a federated system such as ours, that coordination is paramount for increasing access to data.”

Marcum also pointed out, in response to a question, that the 2022 CHIPS and Science Act authorized the National Center for Science and Engineering Statistics at the National Science Foundation to support the National Secure Data Service Demonstration project, which can be improved upon to serve as an eventual national secure data service shared across government.

Preparing for Crises

The International Science Reserve (ISR), which is coordinated by the New York Academy of Sciences, is a network of scientists and scientific institutions that are preparing to respond when the next complex, border-crossing disaster strikes, said Mila Rosenthal, the organization’s executive director. The ISR emerged from the COVID-19 pandemic, as did several important lessons for data access.

One lesson is that scientists around the world wanted to be part of the solution to the pandemic, but many of them did not have a place where they could connect to a solution. “There was a lot of talent and energy that did not get to be part of solving problems,” said Rosenthal. “There weren’t any mechanisms set up for them to connect to each other or to resources or responses.”

Companies such as IBM and Google were among the founders of ISR because they knew they had technical resources that could contribute to crisis solutions based on their experience during the pandemic. One resource they offered during the pandemic, in partnership with the U.S. government and academic participants, was access to high-performance computing capacity for COVID-19-related research projects. IBM, for example, offered access to its suite of data analytics and modeling tools for geospatial temporal mapping.

As of the forum, more than 5,000 researchers had joined the initiative, and the network was growing at 5 to 10 percent per month, with a very broad range of disciplines included. More than half of the countries represented in the network are low- and middle-income countries, which increases trust among people in countries that, in the past, have not seen scientists from their own countries working on international research teams.

The ISR runs crisis scenarios, developed with international experts in those domains, as a way to practice reactions to different types of crises, including wildfires, floods, and food system shocks. “Scientists have been able to reach across their disciplinary borders and across global borders to explore how they would work together and to make those connections in advance.” Such exercises are a way to maintain a “readiness mindset,” said Rosenthal, and also to assess available resources in advance of a crisis.

During the pandemic, researchers tended to use their established networks to call people they already knew. “That’s great,” said Rosenthal, but it meant that the networks were not diverse. Global problems require global solutions and diverse voices representing a wide variety of disciplines, contexts, and countries. Such networks need to be built in advance with pre-established relationships to ensure an inclusive response. Furthermore, researchers are eager to get connected, in part because they recognize that they need input from other disciplines to succeed. As an example, Rosenthal cited a geographer from the University of the West Indies who was able to make hurricane predictions with good accuracy but was unable to convince many people to leave evacuation zones. “Her point to us was [that] this isn’t a weather science problem. This is a social science problem. This needs behavioral sciences [and] communication science. I need to find other people who are working on this, who can help with the kind of data and research that they have [to] apply in these situations.”

In closing, Rosenthal noted that earlier speakers at the forum had raised issues of public benefit, ethics, and morality. But an international human rights framework already exists under which many of these issues are standardized. “Those are standards that have been universally agreed on for 70 years, and they’ve been negotiated very carefully among all 193 countries of the world. [In thinking] about data and data access in relation to a global standard, that’s one that already exists.”

In response to a question about the best way to build relationships that result in strong networks, Rosenthal pointed to the value of “doing things together.” For example, the large-scale crisis scenarios offer a way for scientists to cross borders and work together in a way that has relatively low stakes and is “hopefully a little bit fun.” On that note, moderator Kelly pointed to the value of loose connections, which can be “incredibly important for spinning up capacity quickly.” These loose connections can be developed easily while building trust, which provides resilience to the system as a whole for dealing with crises.

In response to another question about including in networks people who take contrarian views and thereby disrupt consensus formation, Rosenthal noted that part of the answer derives from the power of having a large network so that problems within the network can be overcome. Also, participants in the ISR are self-selecting, and the people who choose to take part have a high level of expertise. Quality control is still needed at the point of implementation, just as with research, but “having the widest variety of views and expertise at least increases the chances that you’re going to get better results.”

In response to a question about the development of the scenarios, Rosenthal observed that they have to have enough scientific content that they constitute genuine practice, but “they also have to be abstracted enough that lots of scientists can see themselves in them.” Thinking through the scenarios is not a compliance requirement but a learning experience. “You want to let [people] off the leash a bit to be able to think, ‘I could do this in a crisis.’”

Understanding Data Systems

In decades past, social scientists developed models without extensive data to use in those models. But by the time Johan Ugander, now associate professor of management science and engineering at Stanford University, was working on his PhD in the early 2010s, enormous amounts of data had become available through people’s use of the internet. A common metaphor, Ugander said, was that “what the telescope did for astronomy and the microscope for biology, the internet was doing for social science, because we could now, though with tremendous privacy consequences, study people everywhere.”

One major problem with this approach, he continued, is that “with enough data, anything is true.” As a result, the field of computational social science has had to mature in recent years. Students are now well trained, tools are much better than they were a decade ago, and the data are even more abundant. “The field has really come into its own.”


Two of Ugander’s interests have been best practices for auditing machine learning algorithms and audits of the broader sociotechnical system. In studying systems, he said, the most important issue is not just having access to the data but also having access to the system. “How do we run large-scale experiments on the system? How do we test questions of causality and variations on algorithms?”

As an example of how systems can be studied from the outside, Ugander described a series of studies on countering the dissemination of misinformation using such mechanisms as abuse detection systems, feed algorithms, professional fact checking, and crowd-based fact checking. X, formerly Twitter, for instance, recently experimented with a system of crowd-based fact checking in which individuals were recruited to label tweets through a consensus process designed to be apolitical. For example, a tweet about a Supreme Court ruling accompanied by a photograph of a flaming car would be labeled with the photo’s actual source and an explanation of where it came from. This is “a very important factual contribution,” he said. In this case, all the code and the data on which the code operated were openly available, as was the entirety of the code’s history. Still, understanding “the effect of adding this to the information ecosystem is a tricky question.”

Ugander drew several lessons from his experiences studying such systems both inside and outside the private sector. First, modern optimization problems are very difficult and rely on many stochastic methods that are hard, if not impossible, to replicate. Second, the value of data for artificial intelligence has led to decreases in data access, which “has been a huge problem for being able to audit the system.” Finally, social media companies are becoming increasingly litigious, even as internal changes within those companies have made working with their data more difficult.

In response to a question about the kinds of data access that might enable more insights into not just data but also systems, Ugander noted how difficult developing such insights can be. Even people hired by social media companies to do research can make mistakes in analyzing the behavior of systems. “It’s hard even for new employees to get analyses, so you really do need inside guidance.” Nevertheless, bit by bit, he said, “we’re learning more.”

Breakout Group Questions

Are there other barriers to data availability that have not been addressed in the forum?

  • Data availability is not the same as data accessibility.
  • Thorough and easy-to-use documentation is needed to use data effectively.
  • Data curation is a factor in future data usability.
  • The availability of many big datasets is limited by national security issues.
  • Data collected under proprietary formats or processed with proprietary algorithms can have limited availability and may have limited value without access to those formats or algorithms.
  • Institutions could be given incentives to lower the barriers to data availability that they erect to reduce the risks of sharing data.
  • Mandates put in place at the beginning of data collection, such as the requirement that the Human Genome Project use part of its funding to support research on the ethical, legal, and social issues associated with genomics, can enhance data availability.
  • Making data more inclusive will require better training, starting before college, of both data producers and data users.

Are the differences in data-sharing culture between disciplines inherently driven by the nature of the disciplines? Are there lessons from interdisciplinary experience?

  • Differences in data sharing are sometimes the result of factors within disciplines, but historical events, such as the decision to make all data from the Human Genome Project available, can have a profound effect on the culture of data sharing within a discipline.
  • A change in the culture resulting in greater professional awards going to those who help to make data available would provide powerful incentives for data sharing.
  • Scientific societies offer one potential avenue to change cultural norms within both disciplines and subdisciplines in which cultural differences affect data availability.
  • Countering the “not invented here” problem is one way to resolve overarching problems.

How important is data provenance, and how should this area evolve?

  • Established norms for data sharing could emphasize and reinforce awareness of the importance of data provenance.
  • Communities of data users need to understand who is collecting the data and how they will use them, and then properly attribute that source.
  • Provenance is important, but so is how data are collected, shared, and attributed to a source.
  • Drawing people together with different backgrounds who would not otherwise have interacted is a valuable way for people to learn new things and generate new insights. Institutional innovations and different ways of communicating and interacting can offer a way to foster these cross-disciplinary interactions.
  • An awareness of the provenance of data is not enough, because aggregating data from various sources also involves appreciation of the attributes of the data. This could be an issue particularly with the use of data by artificial intelligence applications.
  • Existing models of open provenance could be further developed to avoid repeating work that has already been done.