DATA COLLECTED BY private companies often contain useful insights that can help alleviate major societal challenges, including climate change, health care, food security, and disinformation. Accessing these data, however, can be costly, controversial, and unreliable.
In her introduction to the fourth panel of the forum, Gina Neff, professor and executive director of the Minderoo Centre for Technology and Democracy at the University of Cambridge, described the enormous potential of privately held data that were not necessarily generated for research. For example, data generated through interactions with the digital world, sometimes called digital footprint data, arise from engagement with digital systems, devices, and sensors. Sources could include internet and mobile apps, navigation systems, social media, sensors in consumer devices and the environment, and digital transactions. These data can produce powerful insights into the behaviors of individuals, communities, organizations, systems, or the physical environment, as well as the interactions among these levels, Neff said.
Neff touched on a variety of examples. Pharmacy loyalty card data have been used to enable the early detection of diseases such as cancer. Social media data have produced real-time insights about job hiring and economic mobility. Consumer data have helped researchers map food deserts. Commercial satellite and social media data have served as early warning detectors for conflict zones. The data used in these studies were not necessarily designed for those purposes, but they were repurposed to produce novel insights and better guidance for policymakers.
Neff also discussed several challenges associated with the use of privately held data, including issues of sensitive data, consent, trust, data access, and scale. Because data are not gathered for the reasons that interest researchers, data quality is often a challenge, many researchers struggle with the software engineering required to use the data, and the scale of the data often far exceeds that of other research projects. As examples, Neff cited a project that seeks to combat misinformation and disinformation in the European Union through automated monitoring of social and news media with advanced artificial intelligence–based technologies to enhance the work of human fact checkers, and a program to repurpose data gathered by the International Committee of the Red Cross to improve the delivery of services to vulnerable people.
She ended her introductory comments with two challenges. First, the European Union and United Kingdom are creating legislation that will structure new ways for researchers to access social media data. Research using these data is seen as a key element in regulation, but converting research into action hinges on many of the issues posed by working with privately held data. Second, both the capacities and infrastructure for working at scale are needed at the level of the individual researcher as well as at institutional and national levels. For example, she asked, “Are we going to be able to continue having the kinds of research in the academy around responsible artificial intelligence technologies? Or are we going to have to cede those kinds of questions to a handful of companies?”
Human data are constantly being generated in immense quantities, said Henry (Hank) Greely, professor of law and director of the Center for Law and the Biosciences at Stanford University. Some are generated in the public sphere,
such as driver’s license or passport data, while some are generated in the private sphere, such as the information being collected by people’s cellphones. People care deeply about some of these data, but some of those concerns may be misplaced. For example, observed Greely, they care much more about their Social Security numbers being available than about more serious issues, even though Social Security numbers are widely available from previous information leaks.
Data collected about nonhuman entities may also be of great interest to humans. For example, fishermen can have a strong interest in data collected on fish, since those data can directly affect their livelihoods. Some people care deeply about data on the vegetation around their houses, since it can influence whether an insurance company will provide coverage for wildfire damage.
Concerns about data privacy vary among people and cultures, Greely observed. In Scandinavia, disease registries are common because people largely believe that this information should be available as a public good. In Iceland, even income tax returns are publicly accessible, whereas making this information available in the United States “would be an easy spark for a revolution.” Similarly, Greely said that he would be happy to make his genome sequence available, because the odds of it containing useful information are low, but he would not want to make his Google search data public.
People’s attitudes about data privacy also change over time and depend on circumstances. “When my kids learned that their grandmother was on Facebook, suddenly they made a lot of changes to what was accessible on Facebook and what wasn’t,” said Greely. Similarly, people are likely to post different things in their 40s than in their 20s. “People vary and change, so figuring out what people care about is going to be tricky and difficult.”
Non-lawyers often talk about who owns data, but a more important issue is control, Greely explained. When people sign the lengthy consent forms required by information companies, for example, they are giving up some measure of control over how their data are used. But companies also want to maintain good public relations. For example, some genetic genealogy companies have said that they will not provide information to law enforcement agencies, while others have said that they will let their customers opt into such uses or will make decisions on a case-by-case basis. “They all face the same laws, but they made different decisions based on the public relations appearance that they want to create.”
Finally, standardization and accessibility will often be problematic with data. If information from multiple sources is being combined, especially from private sources, “good luck” comparing it, said Greely. As to access, companies have interests in keeping information private, and their lawyers will alert company decision makers to problems or lost opportunities that could occur if data are released. Furthermore, even if lawyers approve the sharing of data with a researcher, they may put limits on what the researcher can do with the data, including sharing them with others.
“Data live in context, and that context needs to travel with the data,” said Cyndi Grossman, senior director of Biogen Digital Health. But the metadata meant to provide context often are not enough to make effective use of the data.
From the perspective of those who contribute the data, de-identified data from electronic medical records may provide an understanding of what patients have experienced. The data can indicate how long it took people to get diagnosed, when they began treatment, and how their treatment worked. But the data do not necessarily provide the context needed to fully understand the reasons behind these findings. To address this gap, Grossman has spent part of her career on patient and community engagement, because “we need systems that help us understand and contextualize the data.” The Patient-Centered Outcomes Research Institute was established specifically to foster such systems, and Grossman was the recipient of an award to create tools and visualizations designed to build
community capacity for engagement. Particularly important, she said, is not only involving patients in governance but also creating a pipeline of people who will continue to participate in that governance.
From the perspective of those who use data, the uses of data and the value ascribed to them reflect the interests of a researcher or community. That ascription of value “ideally should travel with the data,” said Grossman. The sharing of data also reflects a researcher’s or community’s interests, or the interests of the organizations that supported the data collection in the first place, which inevitably brings in additional stakeholders who are not scientists. The interests of these stakeholders, whether lawyers, business development partners, or others, also constitute information about the data, including the rationale for both collection and sharing. For instance, decisions about whether to share data can reflect business considerations that involve much more than the science itself.
The importance of such information points to the need for systems and technologies that can capture this information, said Grossman. Some of this information may be written down—in laboratory notes, for example—but some may have been presented in a data sharing discussion leading to a decision. Without information about how a decision was made, it can be difficult to create a more robust data-sharing ecosystem. Greater transparency, Grossman concluded, can not only clarify these processes but also build trust.
The importance of linking data highlights the role that privately held data can play in producing rich cross-domain datasets that support better decision making, said Uyi Stewart, chief data and technology officer at Data.org. The question then becomes how to ensure that the commercially sensitive information and personally identifiable information contained within private-sector data are not made public.
What is needed is a trusted infrastructure to facilitate the sharing of privately held data, Stewart observed. Current methods of protecting privacy—including masking, hashed values, data aggregation, and encryption—cannot guarantee the anonymity of individuals or designated variables in datasets. Instead, Data.org is working with partners to develop a differential privacy solution to provide a mathematically verifiable proof against re-identification.
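To make the distinction concrete, the sketch below shows the textbook Laplace mechanism, the simplest construction that satisfies differential privacy. It is a generic illustration, not Data.org’s solution; the function name and parameters are hypothetical.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    A counting query has L1 sensitivity 1 (adding or removing one person's
    record changes the count by at most 1), so the noise scale is 1/epsilon.
    Smaller epsilon means stronger privacy and more noise.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: publish how many records match a query without revealing
# whether any particular individual's record is in the dataset.
print(laplace_count(true_count=1042, epsilon=0.5))
```

Unlike masking or hashing, the guarantee here does not depend on what auxiliary data an attacker holds, which is what makes it mathematically verifiable.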
The classic issue with differential privacy is the trade-off between privacy and the statistical utility of the data. Data.org is therefore working with its partners to ensure that the resulting datasets retain enough signal to support decision making while preserving privacy. In particular, it is seeking to provide accurate and actionable signals for research by epidemiological scientists to analyze and model the effects of the COVID-19 pandemic and future pandemics. Working with a financial company, it is combining financial and health data to understand possible correlations between financial transaction activities and public health behavior before, during, and after the pandemic. It is also using the datasets to build probabilistic simulations that can inform public health policies through the mapping of correlations between human mobility and infection patterns.
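As a toy demonstration of that trade-off, reusing the laplace_count sketch above: the expected absolute error of the Laplace mechanism equals sensitivity/epsilon, so relaxing privacy (a larger epsilon) sharpens the released statistics. The numbers below are illustrative, not drawn from Data.org’s datasets.

```python
# Sweep epsilon to see the privacy-utility trade-off on a count of 1,042.
for eps in (0.1, 0.5, 1.0, 5.0):
    errors = [abs(laplace_count(1042, eps) - 1042) for _ in range(10_000)]
    print(f"epsilon={eps:>4}: mean absolute error ~ {sum(errors) / len(errors):.1f}")
# Expected output: mean errors near 10, 2, 1, and 0.2 respectively (= 1/epsilon).
```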
Data.org has organized Epiverse as a global collaborative of interdisciplinary experts developing a distributed, open-source software ecosystem as a public good for pandemic preparedness and, in the future, for a broader range of health and social challenges. Within the Epiverse, it also has established the Privacy-Enhancing Technologies (PETs) for Public Health Challenge in partnership with a private financial transaction firm, Harvard OpenDP, Pontifical Javeriana University in Colombia, and the cloud computing company Amazon Web Services. The goal of the PETs for Public Health Challenge is to create innovative privacy-preserving solutions that will unlock privately held data to enable optimal data decision making for health and social challenges. Data.org will make the baseline differential privacy solution available in a secure cloud environment to competing research and academic teams. The teams will leverage the use cases and challenge questions to develop new, better differential privacy tools that improve the baseline in its privacy–utility trade-off. “We have a quarter of a million dollars that we are putting up for the top winners of this challenge,” Stewart said. The goal is to produce a set of open-source privacy-
enhancing utilities that companies around the world could use to share their privately held data freely across domains.
Data.org is in the social impact sector, Stewart said, which is at the forefront of meeting the many systemic challenges facing the world. But, like datasets in general, these organizations are fragmented, not just in terms of the datasets they generate and use but also in the incentives and funding that motivate their work. “That’s what keeps us at Data.org up at night: how to build that coalition of the willing, or remove the fragmentation, beyond datasets [and] beyond technology, so that we can actually achieve the scale of impact that we all desire for a more equitable world.”
Critical problems “are coming at us from every possible direction,” said Gavin Starks, founder and chief executive officer of Icebreaker One, adding up to an “omni-crisis.” Information will be critical not only to devising solutions to these problems but also to implementing the solutions, he continued. As Donella Meadows wrote in her book Thinking in Systems: A Primer, “Missing information flows are one of the most common causes of systems malfunction. Adding or restoring information can be a powerful intervention, usually much easier and cheaper than rebuilding physical infrastructure.”1
Regulations in Europe and elsewhere require companies to report on their climate impacts. Standards and frameworks, based on scientifically derived targets, have been developed for the required disclosure and reporting. This process requires not just assessment methods and models but also accounting and reporting tools. The resulting information is then used by credit rating agencies and markets to drive investment.
At the core of this process is a huge amount of data on resources, manufacturing, and distribution that needs to be combined with financial information, Starks observed. It is an “incredibly complex picture,” but an analogy helps to point a possible way toward progress. The United Kingdom has a system known as the Open Banking standard that is designed to produce interoperability across the banking sector. Every retail bank in the country must be a member of this open banking implementation, and within this trust framework anyone can share their current account data with any third party. “Open banking has codified the way the web works into a policy instrument for the first time,” said Starks. Any other financial technology company that agrees with the rules can join, and more than 700 have done so. More than 80 countries have copied the system to different degrees, and it has created a $20 billion market that is projected to be more than $100 billion in the next decade.
A trust framework can be applied to all data types, including open data, and is an effective way to implement and automate the adoption of rules for data providers, aggregators, and users, said Starks. It enables assurable peer-to-peer data flow between organizations while verifying that organizations and their data sharing are compliant with the rules. Collaborative efforts among government, industry, academia, and trade bodies define the rules, while the trust framework addresses user needs and impacts, the technical infrastructure, data licensing and other legal issues, engagement and communications, and policy. Licenses codify the rules, while verification and assurance processes test compliance. In addition, the publication of metadata enables searches for open data.
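As one way to picture how such a framework operates in practice, the record below sketches the kind of machine-readable metadata and license terms that might be published so that datasets can be discovered and compliance verified automatically. Every field name here is hypothetical rather than taken from the Open Banking or Icebreaker One specifications.

```python
# Hypothetical metadata record for a dataset shared under a trust framework.
# All field names are illustrative only.
dataset_record = {
    "dataset_id": "org.example/energy-usage-2023",
    "publisher": "Example Energy Ltd",
    "license": "restricted",  # e.g., "open" or "restricted" under the framework's rules
    "permitted_purposes": ["research", "emissions-reporting"],
    "access_endpoint": "https://api.example.org/datasets/energy-usage-2023",
    "assurance": {
        "framework_member": True,    # publisher has signed up to the rules
        "last_compliance_audit": "2023-06-01",
    },
}
```

Publishing records like this is what allows third parties to search for data and verify, before any data flow, that both parties are operating under the same rules.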
This trust framework for data sharing at a market-wide scale addresses issues encountered with many kinds of data, Starks concluded.
___________________
1 Meadows, D.H. 2008. Thinking in systems. White River Junction, VT: Chelsea Green Publishing.
Asked about ways to further the sharing of privately held data, Starks said that government must get involved to encourage both standards and collaboration. New Zealand, for example, was able to include about 70 percent of its market in open banking through the efforts of a trade association, “but it needed government intervention to get everybody else on board.” Even with powerful incentives, fragmentation can persist.
Counterintuitively, one key is trying not to do too much. “When you’re designing any ecosystem, how can each of its component parts do as little as possible?” Starks asked. “This very much echoes the way the web works.” In that case, the data governance framework does not have anything to do with the data. “It has everything to do with getting people together and saying, ‘What rights do participants in this market have over the data they’re trying to share?’,” Starks said.
Grossman similarly highlighted the need to “pay attention to the resources needed, the technologies needed, and the governance structures in place to be able to comfortably share and collaborate. I hope we can solve that problem.”
Organizations may feel that they have control over data, said Greely in response to a question about data ownership, but the people from whom the data were drawn may also feel that the data belong to them. Their ability to enforce that feeling legally may be quite limited, but politically it can be substantial. Data researchers need to deal with both sides—the institutions that use data and the people from whom the data were drawn. For example, rare diseases tend to have corresponding organizations that are very interested in what happens to data about their disease, which are very often drawn from members’ children. Similarly, Indigenous Peoples have very specific interests in the data drawn from their groups. The legal ability to enforce restrictions on the use of such data will differ, “but legal ability isn’t the only important ability to block things,” said Greely. In choosing where and how to access data, “pick individual circumstances where the stars are aligned, and you have a chance that both the humans who care about the data and the institutions that hold the data are going to be willing and able to share the data with you.”
As Starks said,
I try to shy away from using the word data ownership. I talk more about data rights. And I try to avoid talking about data ethics, because that’s an infinite rabbit hole. But the framing of “What can I do with this information for this purpose?” is a useful bounding condition when you’re trying to navigate through the infinity of lawyers, regulations, and rules that are designed to say no because it increases risk. The narrower you can make that use case, the better.
In response to a question about whether data that are initially sensitive may become publicly available at a later date, Grossman drew a contrast between data generated for a clinical trial and data generated in other ways, with the focus having historically been on clinical trial data sharing. Clinical trial data are structured to be accessible, while real-world health data are different. They may be governed by a steering committee, by health care systems, or by the people who contributed the data. Such data can come from many thousands of people and be quite
varied in terms of what they include. “In terms of how you make the data available, my experience has not been that people don’t want to do it, it’s how do you do it well, ethically, and responsibly.” The systems used to share clinical trial data, she added, cannot simply be reused for sharing real-world health data.
A question was asked about the difficulties of sharing data between two countries, or across national boundaries more generally. These difficulties can be political, technical, and sociological, Grossman said. “The challenges, at least when it comes to health data in those two contexts, are formidable, especially if you’re talking about data collected out of a health care system.”
Greely cited three categories of problems. The first is national privacy legislation that may make one country unwilling to share data with another country that does not have similarly strict privacy provisions. The second is when countries restrict data flows for reasons of national interest that are not about privacy. For example, India and China have prohibited the export of human DNA or DNA sequence data for many years because they consider it a valuable resource. The third category is related to national security. For example, the Federal Bureau of Investigation in the United States is concerned about China collecting American human whole genome sequences and genetic data, though “I’m not sure what the FBI thinks China’s going to be able to do with those in ways that threaten U.S. national security.” International law can also be a factor. The Convention on Biological Diversity requires countries to make their biological resources available but mandates a sharing of benefits, and it is now considering extending this provision to DNA sequence information, referred to as digital sequence information. “There are a lot of potential problems when data crosses national lines. It does that all the time, so the problems aren’t always insurmountable, but sometimes they are.”
What other opportunities, challenges, and lessons for working with privately held data do you want to surface?
What helps the research community make progress in researcher access to such data?
Is there anything else that the breakout group wants to respond to from the lightning presentations or in reflecting on this topic and the previous sessions?