DATA INSTITUTIONS ARE organizations or arrangements for facilitating researcher access to data through data stewardship; examples include archives, statistical agencies, data repositories, federated data systems, and data commons. They play critical roles in supporting scientific research and enabling researchers to collaborate and gain access to data.
Organizations that steward data make important decisions about who has access to data, for what purposes, and to whose benefit, said Sir Nigel Shadbolt, principal of Jesus College and professorial research fellow in computer science at the University of Oxford and moderator of the fifth panel at the forum. How data are stewarded ultimately affects what types of products, services, and insights data can be used to create; what decisions data can inform; and which activities data can support. These institutions serve a wide array of functions, including the following:
The data structures managed by institutions “are engineered artifacts,” said Shadbolt. “They don’t come for free, they don’t come without maintenance, and they don’t come without issues around the engineering qualities that are attached to them.” Infrastructure has particular qualities—including that it should be reliable, repeatable, efficient, scalable, transparent, repurposable, revisable, and interoperable—and all these attributes come into play in considering data as infrastructure.
Data.gov, the U.S. government’s open data site, seeks to unleash the power of government open data to help the public, achieve agency missions, drive innovation, and uphold the ideals of an open and transparent government, said Hyon Kim, program director for the site. Launched in May 2009 with 47 datasets, it has since grown to encompass more than 200,000 datasets from 50 federal agencies and 60 state, city, county, and other non-federal organizations. It is administered by the U.S. General Services Administration (GSA), most recently under the 2019 OPEN Government Data Act, which makes GSA’s operation of Data.gov a statutory requirement.
Data.gov is a metadata catalog. Agencies maintain comprehensive inventories of metadata for all their data assets, which Data.gov harvests to maintain a continually updated and comprehensive catalog of datasets. The underlying datasets themselves are downloaded or accessed from the websites of the individual agencies. GSA also maintains a repository of tools and guidance that help agencies comply with the OPEN Government Data Act, with the chief data officer at each federal agency responsible for that agency’s metadata inventory. “What we can tell from the feedback we get is that users range from individuals who have questions about data that affect their circumstances to a lot of researchers, students, and businesses,” said Kim.
At the heart of the program is the federal data catalog (catalog.data.gov). The catalog pages follow a standard metadata schema featuring titles, descriptions, keywords, and download and resource links. Clicking on a source link takes users to the URL from which the metadata were harvested. Metadata are harvested daily to capture anything an agency has added, deleted, or edited. Anything from a federal agency is unlikely to have any restrictions on its use in the United States, Kim observed.
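To make the harvested schema concrete, the sketch below shows how a researcher might query the federal data catalog programmatically. It assumes catalog.data.gov exposes a CKAN-style package_search endpoint returning JSON; the endpoint path and field names shown here are illustrative rather than an official specification, and actual use should follow the site’s published API documentation.

```python
# A minimal sketch of querying the federal data catalog, assuming a
# CKAN-style "package_search" endpoint at catalog.data.gov. The endpoint
# path and field names are assumptions for illustration only.
import json
import urllib.parse
import urllib.request

CATALOG_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"  # assumed endpoint


def search_catalog(query: str, rows: int = 5) -> list[dict]:
    """Return a few harvested metadata records matching the query string."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    with urllib.request.urlopen(f"{CATALOG_SEARCH_URL}?{params}") as response:
        payload = json.load(response)
    return payload["result"]["results"]


if __name__ == "__main__":
    for record in search_catalog("air quality"):
        # Each record carries the harvested metadata: a title, a description,
        # keyword tags, and resource links pointing back to the source agency.
        print(record["title"])
        for resource in record.get("resources", []):
            print("  ", resource.get("format"), resource.get("url"))
```

Because Data.gov holds only metadata, the resource links returned by such a query point back to the individual agencies that host the underlying datasets.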
Most agencies have been in the catalog for years, but smaller agencies are still reaching out and working with GSA to get their metadata into the catalog. “The harvesting of metadata inventories into the catalog is fairly difficult and technically requires a lot of operations maintenance, and we’re always looking at how we can reduce the burden of that to improve performance,” Kim said. Data.gov is also working to update the metadata schema, publish regular metrics, provide additional services to agencies, and do additional outreach and demonstration of impact.
Founded at the University of Michigan in 1962 by 22 universities, the Inter-university Consortium for Political and Social Research (ICPSR) today consists of more than 825 universities, colleges, research institutions, and statistical agencies from around the world. Its focus is social and behavioral science data “very broadly defined,” said Margaret Levenstein, ICPSR’s director. It holds data from more than 19,000 studies comprising a quarter of a million files and serves thousands of users annually. It also maintains a bibliography of data-related literature with 110,000 entries, each an article that has used ICPSR data; each article is then associated with the metadata for the relevant studies. Members of the consortium provide financial resources in return for access to the data. About 10 percent of its data are restricted, almost always to protect confidentiality.
ICPSR creates value for its members in two main ways, said Levenstein. From data that arrive at the consortium in many different states, it creates high-quality curated data in user-friendly formats. “We have a staff of about 30 people who spend all their time making data more usable,” said Levenstein, “and that is not an inexpensive task.” It also offers protection for data using multiple strategies that ensure the safety of research. The creation of high-quality metadata and documentation supports data sharing, as does the creation of digital object identifiers for each of its datasets. Usage metrics show how many people have downloaded a dataset, which is a way to give people credit when their data are used.
The data and metadata are all exportable, with multiple schemas that make them discoverable through Google Dataset Search as well as other sites. ICPSR’s Social Sciences Variables Database allows comparisons of variables across datasets, which furthers the discovery of data. ICPSR also sponsors, with other social science repositories, JEDI (a discussion forum on data sharing for journal editors) and holds a summer program to train social scientists in methods and data stewardship.
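As one illustration of how exported metadata supports discovery, dataset search engines generally index schema.org “Dataset” markup embedded in study pages. The sketch below builds such a record in Python; the field values are hypothetical placeholders and do not represent ICPSR’s actual metadata export or schema choices.

```python
# A minimal sketch of a schema.org "Dataset" record of the kind that dataset
# search engines index. All field values below are illustrative placeholders,
# not ICPSR's actual metadata.
import json


def dataset_record(title: str, description: str, doi: str, keywords: list[str]) -> str:
    """Build a JSON-LD Dataset description suitable for embedding in a study page."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "description": description,
        "identifier": f"https://doi.org/{doi}",  # the DOI minted for the dataset
        "keywords": keywords,
        "license": "https://creativecommons.org/licenses/by/4.0/",  # placeholder license
    }
    return json.dumps(record, indent=2)


if __name__ == "__main__":
    print(dataset_record(
        title="Example Social Survey, 2020",  # hypothetical study
        description="Public-use microdata from a hypothetical national survey.",
        doi="10.0000/example-00001",          # placeholder DOI
        keywords=["survey", "social science", "public opinion"],
    ))
```

Exporting the same underlying metadata in several such schemas is what allows a single curated study to surface in multiple discovery services at once.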
For many years, ICPSR made available only public-use data products. Today it also provides tiered access to restricted data using de-identification of data, encrypted downloads to a safe location, virtual data enclaves (known as trusted research environments in the United Kingdom), and a physical data enclave. In addition to these physical and technical restrictions, it uses restricted data use agreements. “The social controls are absolutely as important as the technical and physical controls,” said Levenstein. Institutional signoff is critical, though time consuming, and anything that can be done to automate and standardize that process across research institutions and funding institutions “would really help researchers get access to restricted data.” The consortium provides training in confidentiality protection and the responsible use of restricted data. It also uses stakeholder review of applications when appropriate, as with datasets collected from Indigenous Peoples that are made available only in specific cases, along with disclosure review (known as output checking in the United Kingdom).
New approaches include a digital credential, linked to established training programs, to expedite access to restricted data. ICPSR is also increasingly playing the role of trusted intermediary to facilitate access to data held elsewhere. For instance, it has built a portal for the U.S. federal statistical system that allows researchers to discover metadata about restricted data resources held by federal agencies. ICPSR does not have access to any of these data, but it provides a single and reasonably uniform process for applying for access.
Levenstein concluded by listing several ongoing challenges to researcher access to data. Technological innovation is necessary, but so is investment in the production of high-quality public-use data products and in broad-based data literacy and empirical reasoning. Sharing data requires resources to cover the cost to researchers to share their data, treatment of data as a first-class research object by academic and scientific societies, and resources to cover the cost of preserving and providing access to data, she said. “This is where the organizations like the National Academies and the Royal Society can play an important role in influencing how academic institutions and research institutions provide rewards to researchers.”
The mission of the World Data System (WDS) is to enhance the capabilities, impact, and sustainability of its members and data services by creating trusted communities of data repositories, strengthening the scientific enterprise throughout the entire life cycle of data and all related components, and advocating for accessible data and transparent and reproducible science. “We want first-class data that create first-class research outputs,” said Meredith Goins, WDS’s executive director.
WDS is one of many affiliated bodies of the International Science Council, which has a history dating back to 1899, when the International Association of Academies was formed to strengthen international science for the benefit of humanity. WDS was established in 2008 to build on the experiences of its predecessor bodies, including the data analysis centers set up to manage the large amounts of data created during the International Polar Year and the International Geophysical Year.
WDS aims to ensure that data and all the iterations and tools that touch data are trusted. There is no cost to joining WDS, but members agree to the WDS Data Sharing Principles and must either already be certified by CoreTrustSeal or, in the case of candidate members, be actively seeking that certification. The sharing principles call for data to be fully and openly shared, made available with a minimum time delay and free of charge, and subject to the least restrictive uses possible, with those who create data assuming responsibility for ensuring their authenticity, quality, integrity, and appropriate citation. CoreTrustSeal resulted from the merger of the WDS regular member certification with the Data Archiving and Networked Services Data Seal of Approval. “We want to make sure that the data you have is curated well and has quality, because integrity matters in all our research,” said Goins.
In 2022, the Subcommittee on Open Science of the National Science and Technology Council released its “Desirable Characteristics of Data Repositories for Federally Funded Research,” which emphasized long-term organizational sustainability. This report called for repositories to have a plan for long-term management of data, including maintaining the integrity, authenticity, and availability of datasets. It also called for contingency plans to ensure that data are available and maintained during and after unforeseen events. “There needs to be a business model and a business plan,” Goins observed. “How are you going to fund yourself? How are you going to
be sustainable past your 3-year grant or 5-year grant or your tenure funding stream?” Technology also needs to be sustainable, with long-term management of data building on a stable technical infrastructure and funding plans.
The requirements of CoreTrustSeal include descriptions and diagrams of governance bodies, groups, and hierarchies, along with timescales for provision and renewal of funding for operational costs and recruitment even if permanent, ongoing funding cannot be perfectly quantified or guaranteed. The requirements establish 16 criteria that need to be met:
WDS has also been working with the United Nations Educational, Scientific and Cultural Organization on its open science initiative, which designates sustainability as a guiding principle to help make open science a reality.
Sylvie Delacroix, professor in law and ethics at Birmingham Law School, asked, “Can researchers’ data needs be met within ecosystem limits?” In the past, this question has been asked about resources like water, but applying it to data still “raises eyebrows,” she said. Yet the parallels run deep.
Sustainability requires preserving a physical environment, but if future generations are to fulfill their needs, the social aspects of environments are just as important. With regard to researchers’ access to data, these social aspects are the practices that underlie and enable data production and data sharing. But challenges arise in maintaining these practices. The legal tools are relatively poor: even in jurisdictions that grant personal data rights, these rights are individual while data are relational. Moreover, personal data rights rely on a consent process that is rarely more than “make believe,” said Delacroix.
The key to unlocking the research potential underlying datasets is the establishment of a long-term, bottom-up, data-sharing infrastructure, Delacroix stated. Such an infrastructure would allow groups of individuals to pool
the rights they have over their data. This requires moving from a system that typically starts with a problem to a system that equips data producers, whether they are scientists or not, with data institutions that give them long-term agency over their data. Because no one has the time or resources to make informed decisions on every single data transaction, a mechanism is needed for collective action and representation.
A data trust is an example of an institution that can act as a robustly independent, trusted intermediary vested with fiduciary responsibilities. A data trust is well suited to the sharing of personal data because of the high level of built-in safeguards it provides. Other organizations, such as data co-ops or collaboratives, fulfill the same basic need, though they may be structured differently. Such organizations enable groups to form over shared aspirations for data sharing and to pick a level of engagement that suits them. This kind of bottom-up data infrastructure would give social, political, and economic voice to a variety of groups that could be leveraged to shape a data-reliant future, Delacroix said.
The advantages of such a system accrue not just to the data producers but also to the people who use the data. They do not have to negotiate consent on an individual and often flimsy basis. They can engage in ongoing dynamic negotiation of incoming flows of live data.
Delacroix concluded by returning to the analogy with water rights. A property-based system of water distribution tends to allow the degradation of an underlying ecosystem and does not ensure the sustainable distribution of water. Data can be polluted—for example, by synthetic data—in the same way that water can be polluted. Just as a human right to water is only possible if reciprocity expectations are imposed on the economic exploitation of rivers, a human right to access and build on culture is only possible if similar reciprocity expectations are respected. None of the generative artificial intelligence tools available today would have been possible without access to high-quality content made available under Creative Commons or open-source licenses. Yet few of those tools respect the reciprocity expectations, without which free access movements become unsustainable.
Dealing with non-pollution obligations will require concerted international efforts, Delacroix said. Furthermore, meeting data ecosystem preservation obligations requires more than top-down regulation. “Just as, today, water law reforms are waking up to the fact that we need to empower river-dependent communities, we also need robust bottom-up empowerment infrastructure if data sharing is to become socially sustainable.”
The first topic that arose during the discussion session was whether the models shared during the presentation session pointed in the direction of convergence toward a single model characterized by federated but distributed systems, hubs for harvesting metadata, and people retaining control over their data. If so, what is blocking movement toward that model?
Goins responded that human nature and policies are obstacles to progress, but the situation could change when both the users and creators of data state that they want change to occur. “We have to listen to our audience.” However, Kim cautioned that convergence is not as simple as it might appear. Even in the process of getting metadata from federal agencies into a central catalog, policy and technology issues can get in the way. Metadata inventories can be very large, and technologies do not always work perfectly. While Data.gov does include state, city, and county datasets if these levels of government make their metadata available in a compatible format, federal policy has specific limitations on providing resources to these levels of government.
Moderator Shadbolt drew attention to a related issue—the impetus provided by emergency situations to collect and centralize data streams. When the emergency abates, the impetus disappears, and data collection can dwindle. Levenstein agreed that the COVID-19 pandemic greatly increased data sharing. But “when South Africa very quickly shared its data when they found omicron there, the response of the international community was to ban people from South Africa from traveling around the world! If we want people to share data, we have to be ethically responsible in how we manage that.” When data are being used for high-level political decisions, or when
organizations are monetizing their access to data, questions arise about who has a voice in those decisions. “It is about control, and it is about access to resources.”
The discussion then turned to the long-term sustainability plans for data resources if funding or other support for those resources is withdrawn. Levenstein observed that ICPSR is part of a coalition of social science data repositories that have made a commitment to preserve each other’s data if an organization were to fail. Because the metadata of the different organizations have not been standardized, taking over another repository’s data would require a lot of work, “but there is a commitment among all those organizations to do that.” ICPSR also keeps copies of its data in five different locations around the United States to ensure physical sustainability, and it hosts a repository for at-risk government data resources to make sure that they do not disappear.
Kim reiterated that Data.gov is a metadata catalog of assets that are held by many federal agencies, but the data are hosted at the agencies. Also, the amount of money needed to run Data.gov is relatively small, and the inclusion of the effort in statute makes its continued funding more secure.
Moderator Shadbolt specifically addressed the issue of relying on universities as long-term repositories of data, noting that data management resourcing within universities can be “quite fragile and ephemeral.” When a university makes a commitment to maintain a data resource, that commitment needs to be “thought about pretty hard.”
On this point, Goins noted that many universities do not meet the requirements for CoreTrustSeal certification. Funding of data repositories from individual projects “is not a long-term view, and we’re in the long-term game so that we can have quality data for our children and our children’s children.” Universities are part of the solution, but governments also must come into play in preserving important data, she said.
Finally, a question was asked about the need to vet researchers and their proposals while facilitating access to data. Levenstein noted that ICPSR actively vets researchers for access to restricted data, but this is more difficult to do in universities, which has caused problems in the past when sensitive data are included in institutional repositories. In Europe, the General Data Protection Regulation makes data controllers legally responsible for vetting processes, but this has not had a big impact on data sharing in European social science repositories, according to Levenstein, because of the exemptions for research in the legislation.
How do we ensure that data institutions are sustainable?
What are the lessons for data institutions when mobilizing data and making them available for researchers in crisis-management contexts such as the pandemic or climate change?