DATA ARE AT THE FOREFRONT of efforts to solve many of today’s greatest problems, including climate change, misinformation and disinformation, the threat of future global pandemics, and the quest by people everywhere to lead better lives. But if researchers are going to use data to help solve these problems, the data need to be available for them to use.
The US-UK Scientific Forum on Researcher Access to Data, held in Washington, DC, on September 12–13, 2023, examined the constellation of issues surrounding researchers’ access to data, best practices and lessons learned from exemplary research disciplines, and new ideas and techniques that could drive research forward. Over time, data have become increasingly voluminous, complex, and heterogeneous. Massive volumes of data are being generated by new devices and methods, and many of these data are not easy to analyze, interpret, or share. Groups that generate data may be reluctant to share them for a variety of professional, personal, financial, regulatory, and statutory reasons.
Data from different domains have varying characteristics and pose distinct challenges for sharing. Accessing, processing, and sharing health data involve major legal, ethical, and technical issues. The nature and diversity of environmental data require new ways of linking data with different structures, including Traditional Ecological Knowledge stewarded by Indigenous Peoples. Privately held data can produce powerful insights into the behaviors of individuals, communities, organizations, systems, and the physical environment—as well as into the interactions among these levels—but the use of these data raises issues of access, consent, privacy, trust, and scale.
Data scientists are developing new tools and approaches that can overcome such barriers. Standardization, automation, and interoperability make it possible to combine data to get the maximum value from individual datasets. Tools such as federated systems, artificial intelligence–driven analytic methods, and secure infrastructure and methods can derive valuable findings from large, complex, and heterogeneous datasets. Standards, shared specifications, and trusted brokers can enable and enhance data sharing. Suites of tools and methods can mix protections according to different use cases.
Although universally agreed-upon standards for data stewardship and sharing would be desirable, various complications can make this goal difficult to achieve. For example, making de-identified health data available generally serves a public good, but this imperative has limits. Broad consent to use data may later result in sensitive uses that were not envisioned when the data were collected and released. People differ in their perceptions of how industry and academia will use data, and the line between these two can be blurred in practice. Such considerations create an imperative to use data responsibly and to ensure that they are protected, since retaining the trust of the public is essential for science to inform public policy and public decision making.
Data institutions, including archives, statistical agencies, data repositories, federated data systems, and data commons, play critical roles in supporting scientific research and enabling access to data. These institutions also rely on a broader ecosystem that includes funders, journals, researchers, and even promotion committees, all of which can raise the standards of data deposition and management. How data are stewarded by institutions, disciplines, groups, and individuals ultimately affects what types of products, services, and insights data can be used to create; what decisions data can inform; and which activities data can support.
Many of the greatest challenges are not technical but cultural. Problems posed by data access therefore involve the tools of the social sciences, including norms, codes of conduct, incentives, disincentives, and governance. Technologically driven solutions to privacy problems can seem appealing, but they often turn out to be years away from implementation. As a result, normative and values-driven frameworks are typically more realistic for the immediate future.
Many institutions do not reward data sharing in career progression, which creates an impediment to sharing. Some fields have developed interesting ways to support data career paths, and other fields could borrow these approaches as incentives for data sharing. Institutions could also be given incentives to lower the barriers to data availability that they erect to reduce their own risks from sharing data.
Mandates to treat data in particular ways are of little practical value unless accompanied by the resources to carry out those mandates, including the resources to ensure compliance. Sharing data requires infrastructure, frameworks, standards, trained people, methods of curation, and metadata generation. With some exceptions, the missions of federal and non-federal funders tend to be short term, pointing to the need for long-term funding of data generation, analysis, and stewardship. Also, long-term stewardship requires a diversity of funding streams so that stewardship does not depend on any one organization.
In addition, the environmental costs of gathering, analyzing, storing, and sharing large volumes of data are substantial and likely to increase. Progress requires educating people and doing more research to understand the carbon footprints of various actions and how these footprints could be reduced.
Bringing together people from different backgrounds who would not otherwise have interacted is a valuable way to expose them to new issues and generate new insights. Learned societies (including the Royal Society and the National Academy of Sciences), governments, data institutions, and individual researchers can all foster cross-disciplinary interactions and enhance researcher access to data.