“Our species is facing one of the most difficult times in our 200,000 years on Earth,” said Arturo Casadevall, professor and chair of the Department of Molecular Microbiology and Immunology at Johns Hopkins Bloomberg School of Public Health, in his introductory comments at the US-UK Scientific Forum on Researcher Access to Data.
“The planet is changing rapidly. We have just been through a pandemic. The power of the individual has risen tremendously: single individuals can wreak havoc in ways that they could not do before. We’re dealing with hate. We’re dealing with massive human migrations. I am an optimist. I think we’re going to get through it, and afterwards is going to be better than the present. But to get there we’re going to need information, we’re going to need data, [and] we’re going to need to make decisions the right way. That is what this meeting will try to address.”
The forum, which was sponsored by the National Academy of Sciences and the Royal Society and held on September 12–13, 2023, in Washington, DC, brought together researchers from many different scientific disciplines to discuss the evolution of researchers’ access to data, best practices and lessons learned from fields on the forefront of data sharing, and data challenges related to pressing societal problems. “There are clearly many differences between the various fields represented amongst us,” said Frank Kelly, emeritus professor of the mathematics of systems at the University of Cambridge, who with Casadevall served as co-chair of the forum. Yet many of the data-related issues encountered in different disciplines are very similar, he observed. “We need to delineate the lessons on data access that are common across fields from the ones that are specific to each field.” In that way, the presentations and discussions at the forum were designed not only to unearth commonalities and best practices among fields but also to generate new ideas that could drive research forward across disciplines.
Throughout his more than 40-year career, Sir Ian Diamond, the United Kingdom’s National Statistician in the Office for National Statistics (ONS), has pursued “statistics for the public good,” he explained during his keynote address at the forum. “This forum is addressing critical questions which, if we are able to answer them properly and imaginatively, can lead in the long term to the use of data properly to benefit humankind in many ways.”
Diamond cited three reasons for the vital importance of data sharing. First, fresh eyes bring more insights to existing datasets. “Collecting data, as anyone who’s ever done it will know, is hard business. It costs real money. It’s a real effort.” Researchers therefore have an obligation to maximize the value of data that have been collected.
Second, the ability to link datasets has enabled new types of analyses “that I could only dream of for the first 40 years of my career.” Linking datasets opens up innovative ways to address complex problems, though linkages become more difficult for sensitive data.
Third, sharing data is the best way to replicate and verify findings, which enhances research integrity and trust in scientific results. “Having a truly open and accessible world of data enables reproducibility and enables data to be not just checked but [also] used more than once.”
As an example of the ways in which linking datasets can provide new insights, Diamond cited several studies conducted during the COVID-19 pandemic. Standardized mortality rates from deaths involving COVID-19 in England differed between men and women and among ethnic groups. These results were obtained by linking data from the census to death certificates, which do not specify ethnicity. In addition, hazard ratios for ethnic groups could be compared and analyzed by linking data on, for example, housing types, socio-demographic characteristics, pre-existing illnesses, and vaccination status. As a particular example, the difference in hazard ratios between men of Black Caribbean heritage and men of other heritages went away after correcting for vaccination rates. “In other words, the real message was to double down on the communications about encouraging men of Black Caribbean heritage to get vaccinated. And the government communication services, in my view, did a fantastic job in doing that. That example shows how data linkage in many ways can start to answer questions, which can impact positively on the lives of many of our fellow citizens.”
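The linkage Diamond described can be illustrated with a minimal sketch. It assumes, purely for illustration, that census and death records share an exact person identifier. The forum summary does not say how ONS matched records, and real linkage typically has to cope with imperfect or missing identifiers. All names and values below are hypothetical, and the rates are crude rather than standardized.

```python
from collections import Counter

# Hypothetical census records: person_id -> ethnic group (illustrative values only).
census = {
    "p1": "group_a",
    "p2": "group_a",
    "p3": "group_b",
    "p4": "group_b",
    "p5": "group_b",
}

# Hypothetical death registrations: an identifier but no ethnicity field.
deaths = ["p2", "p4", "p5", "p9"]  # "p9" has no census match


def link_deaths_to_census(deaths, census):
    """Attach census ethnicity to each death record; collect unmatched ids."""
    linked, unmatched = [], []
    for person_id in deaths:
        group = census.get(person_id)
        if group is None:
            unmatched.append(person_id)
        else:
            linked.append((person_id, group))
    return linked, unmatched


def crude_death_rates(linked, census):
    """Deaths per person in each group (crude, not age-standardized)."""
    population = Counter(census.values())
    death_counts = Counter(group for _, group in linked)
    return {group: death_counts[group] / population[group] for group in population}


linked, unmatched = link_deaths_to_census(deaths, census)
rates = crude_death_rates(linked, census)
print(rates)      # per-group crude death rates
print(unmatched)  # records that could not be linked
```

Tracking the unmatched records separately is a deliberate choice in this sketch: records that fail to link are one way a linked analysis can become biased, so it helps to see how many there are.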
The linkage of different datasets has been possible for decades, though it has become more powerful as more data have become available and as new statistical methods have been developed, Diamond observed. At the same time, data have become more complex. They can include results from quantitative primary data collection, administrative data, data that are “born digital,” and textual data. Very large datasets, complex ownership, and issues of consent can further complicate data analyses and linkage.
Some of the biggest challenges, said Diamond, are not legal or technical but cultural. For example, resistance to data sharing can stem from an inherent conservatism. If people are worried that consent may not cover a proposed use of data, they may refuse to share those data. If people are worried that their sensitive data will be stolen in a cyberattack, they may be unwilling to provide those data to others. “I don’t think that’s necessarily the case, but we need to recognize the cultural issue.”
Such concerns point to the need for easy, straightforward, and safe data sharing that is demonstrably in the public interest, said Diamond. Researchers need to understand thoroughly the topics they are studying and be clear about the hypotheses they are testing or the analyses they are doing. Data and their metadata should be available for secondary analyses.
Under the Secure Research Service established by ONS, the public interest is guarded through a framework known as the “Five Safes”: safe people, safe data, safe outputs, safe settings, and safe projects. Research should be undertaken by people who have the ability to do the research and have been trained in the importance of privacy. Data should be stored in a way that is demonstrably secure. Research outputs should not impact individuals’ privacy. Research should be undertaken in safe settings. Safe projects should be in the public interest and have ethical approval. “The Five Safes [framework is] absolutely central to the way we enable data to be analyzed, and it has been extremely successful.”
The 2017 Digital Economy Act authorized Diamond, as national statistician, to require that data be shared, though this authority does not extend to health data. “That sounds like I could be an enormous autocrat, but it’s absolutely not the case,” he said. “We work very hard and closely with colleagues who are potential data providers to assure them of the security [of their data] and to work with them in a positive way.”
Under the authority granted by the Digital Economy Act, special exceptions during the pandemic made possible analyses and data linking that yielded important insights, especially through the use of privately held data. Data drawn from credit and debit cards revealed how consumers were spending not only during the pandemic but also in the subsequent cost-of-living crisis. Scanner data from shops and supermarkets provided information on how consumers responded to price shocks. Utilities data revealed what kind of accommodations people inhabited. Though the use of private-sector data inevitably raises privacy concerns, “we work very closely and painstakingly with data providers under very strict processes to enable access,” said Diamond. These and other successes have led to calls for health and other data to be held in a small number of trusted and accredited research environments that can interact with each other.
Though the Digital Economy Act removed many legal impediments to the sharing of data, barriers remain. Bureaucracies may limit data-sharing agreements to a small piece of analysis when a much larger set of analyses would be preferable. Some data were not collected with sharing in mind and require extensive “wrangling and engineering,” in Diamond’s words, before they can be shared.
Another major initiative in the United Kingdom is a government multicloud platform called the Integrated Data Service (IDS), which has been designed to bring together ready-to-use data to enable faster and wider collaborative analysis for the public good. Users of the IDS will include government analysts, statisticians, data scientists, researchers, academics, educators, and scientists. All users of the analytical platform will need to achieve accreditation through the Research Accreditation Service in accordance with the Five Safes framework. Data will be organized according to a reference data management framework, with integrated and indexed data assets at scale that will support adoption of a government enterprise data model. Data and users will be preapproved for broader analysis, and trusted research environments will be optimized for government use.
The keys to data usage, said Diamond, are good public engagement and the use of truly secure settings. “We have never been as good at public engagement as we should be,” he said. “Public engagement is not telling people ‘your data is safe with us.’ Public engagement is involving the public in decision-making processes…. When we have done that, it has been very successful.”
The most pressing needs today are for government and academic leadership, simple systems, better understanding, and public engagement, Diamond concluded. “The public deserve to know what is being done with their data and how their lives could be improved through secondary analyses of existing data.” In the United Kingdom, public trust remains high—almost 90 percent of poll respondents say that they trust ONS statistics, and three-quarters of the public believe that statistics are free from interference. However, Diamond observed, “it’s easy to lose that trust. We will be tireless in the UK in working with the public to be transparent in what we do and not in any way ask them to trust us but to demonstrate to the public that we are trustworthy.”