Massive volumes of high-quality data on human health, social activity, the environment, and our impact on the planet are routinely collected, analyzed for insights, and used to guide evidence-based decision making that fuels innovation and progress. But what happens to high-quality scientific data once all the original questions are answered?
Of the millions of scientific articles published in 2020, a record-breaking year for science publishing, fewer than half were published with their data. Open-access repositories for research data such as Dataverse, Figshare, and Digital Commons have been around for almost a decade. Yet the likelihood of a dataset remaining available is still estimated to decline by 17% every year after publication. Why can scientific data be so easily lost when web technology is at the height of cultural adoption, energy consumption, and economic impact?
Scientists have cited several systemic and technical blockers that prevent them from sharing their research data. The most common concern is the fear of losing intellectual ownership and attribution for their work. Why share the expensive data I have painstakingly collected, only for others to publish with it before I can?
“Data Parasites” are a virtual scientific boogeyman for some principal investigators collecting valuable and expensive scientific data. The “Data Parasite” is a pariah label for researchers who publish high volumes of articles using open data collected by other groups. Peer-reviewed publications are the reputation metric for academics, the so-called “currency of the realm.” It is not difficult to grasp how little incentive exists for researchers to share any data that places them at a disadvantage in securing future funding for their own projects.
For researchers who shrug off the adversarial Nash equilibria that dictate the availability of knowledge to the public, several technical barriers remain. Scientific datasets sometimes contain sensitive information and must be painstakingly screened before publishing. Data must also be properly annotated and formatted into schemas that other researchers can understand and apply. Data is not useful without 1) the context in which it was collected, 2) the history of modifications, and 3) a way to verify its integrity.
Even after overcoming these cultural, institutional, and technical challenges, researchers must deal with poor infrastructural support and high cloud-service costs for scientific data repositories. In the best-case scenario, a costly lease for data storage is obtained on university or third-party servers for some limited period of time. These centralized solutions put valuable data at risk in the event of a critical service failure, and this happens more often than we’d like.
Automated pipelines for dataset archival, contribution tracking, and virtuous value feedback loops built on decentralized web technology may provide solutions to some of these challenges. The Filecoin network is a hallmark example of incentive engineering: it subsidizes the cost of storing high-quality data, providing free storage for verified providers of public goods such as library archives and public services. Scientific data archived on Filecoin is content addressed and can be indexed with versioned databases stored in a distributed network to track contributions, preserve history, and ensure the integrity of requested content.
Here we describe an approach to leverage public scientific data aggregators to automate free archival pipelines for massive datasets distributed across several service providers, academic servers, and centralized repositories.
Large data-collecting consortia have generated a deluge of neuroimaging data in their quest to map the brain at multiple scales. Several permissioned repositories have emerged to share data across institutions with authorized contributors. Standards for interoperable data analysis and machine learning workflows have also gained adoption among those working in these semi-closed ecosystems.
DataLad has emerged as an open-source utility to manage datasets distributed across institutional infrastructure. DataLad also maintains a metadata index of published data, providing a starting point for a public records archive to take form for all of neuroscience.
The DataLad team has aggregated an index of over 280 TB of neural data at datasets.datalad.org. This index also contains detailed versioning, notes, and machine-readable data to support automated workflows.
However, despite the ability to define many remote sources for datasets, some data still becomes inaccessible due to faulty servers. DataLad is excellent for indexing datasets, but the hosting servers themselves may become unavailable over time in the absence of maintainers. To address this, we need an automated process for repairing broken links: repair workers that poll for missing files and restore the data proactively.
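The core of such a repair worker can be sketched in a few lines. Everything here is a hypothetical stand-in: the index format, the `probe` function (a real worker would issue HTTP HEAD requests), and the names are ours, not part of DataLad.

```python
# Hypothetical sketch of a repair worker. The index maps each file path to
# its list of registered remote URLs; `probe` stands in for a real
# availability check (e.g. an HTTP HEAD request).

def probe(url, alive_urls):
    # Stand-in availability check; a real worker would contact the server.
    return url in alive_urls

def find_broken(index, alive_urls):
    """Return paths for which no registered remote still serves the file."""
    return [path for path, urls in index.items()
            if not any(probe(u, alive_urls) for u in urls)]

index = {
    "sub-01/anat.nii.gz": ["http://lab-a.edu/f1", "http://mirror.org/f1"],
    "sub-02/anat.nii.gz": ["http://lab-a.edu/f2"],
}
alive = {"http://mirror.org/f1"}  # lab-a.edu has gone unmaintained

# sub-01 survives via the mirror; sub-02 has lost every source and
# should be queued for re-provisioning from an archived replica.
print(find_broken(index, alive))
```

A real worker would run this loop on a schedule and re-provision the broken paths from a long-term archive.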
Decentralized web protocols offer a potential improvement in metadata portability, data availability, and content integrity through the use of:
- content-addressable data - data is indexed by its content instead of its location, solving the problem of broken links provided enough replicas of the data exist
- programmable incentives for data storage - rewards for reliable storage providers, improving the reliability of the network
- provenance tracking - who created the dataset, who the study participants were, and what changes were made and when
- censorship resistance - due to their distributed nature and reliance on widely used protocols
- bandwidth and speed that scale with global adoption - the more peers store the data, the more available it becomes
- data sharing agreements embedded within the data - using encryption and role-based access control to enforce agreements automatically; royalties can also be programmed in
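To make the first point concrete, here is a minimal sketch of content addressing using a plain SHA-256 digest. Real systems like IPFS wrap the digest in a multihash-encoded CID, but the principle is the same: the name is derived from the bytes, not from where they live.

```python
import hashlib

def content_address(data: bytes) -> str:
    """A minimal content address: the SHA-256 digest of the bytes themselves."""
    return hashlib.sha256(data).hexdigest()

blob = b"subject-01 fMRI timeseries"
addr = content_address(blob)

# The address is identical no matter which server holds the bytes,
# and any tampering changes the address, so integrity checks come for free.
assert content_address(blob) == addr
assert content_address(b"tampered bytes") != addr
print(addr[:16])
```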
Efforts to index data in one place (like DataLad) inherit the same availability problems as their original data sources. We need an easy-to-use pipeline to archive data verifiably and redundantly for the long term.
Our key hypothesis: decentralized, verifiable, long-term storage of important scientific data lowers the barriers to doing science.
As a first step, we’ve been working on moving the widely used datasets indexed by the metadata aggregator datasets.datalad.org onto Filecoin storage. In this use case, we demonstrate an automated pipeline that scrapes data from DataLad and archives it for long-term storage on Filecoin, providing one piece of the puzzle towards the sustainability and maintenance of public science goods. The goal of this experiment is to identify bottlenecks in the process and design forward-looking metadata structures.
Filecoin is a decentralized storage protocol that uses incentive mechanisms to obtain reliable, verifiable long-term storage. Storage providers earn block rewards for being dependable, and bad actors get their stake slashed. By creating what is essentially an open marketplace of providers, Filecoin also drives down storage costs for users tremendously.
Filecoin is ideal for long-term archival for the following reasons:
- Provably unique copies of the data - Proof-of-Replication. We can make redundant copies of the same data and get proof that they are actually replicated.
- Verifiable storage throughout the period agreed to in the deal - Proof-of-Spacetime. We can verify that storage providers continue to store the data and don’t drop the files partway through.
- Use of a general graph data structure (IPLD) for files. Combined with the upcoming Filecoin Virtual Machine (smart contracts layered on top of storage), this will allow a variety of ways to access and process the data while lowering costs. For example, we’ll be able to request retrieval of only a subset of files, say, all files that match a query we write.
At a high level, four types of entities are involved in on-ramping datasets from DataLad’s remotes into Filecoin archival:
- Servers - hold the neuroimaging data; these might be at universities or on cloud compute providers
- Our bridge - preprocesses the data, creates auction requests to the Textile Broker, and makes the datasets available to the miners in a suitable form
- Textile Broker - coordinates with miners to create storage deals
- Miners - store the packaged data for a specified length of time in exchange for mining rewards
Textile Deal Flow
Walkthrough of the entire process:
- A dataset is selected from the datalad index and its contents are periodically fetched into a storage droplet hosted on DigitalOcean.
- The first pre-processing step is to catalog the data and clean up any unavailable file pointers. Files might be missing because some of the DataLad remotes have become unavailable (this unavailability, due to servers going unmaintained, is one of the major problems we’re solving). The original version history is preserved and archived along with the dataset.
- In the second pre-processing step, the data needs to be packaged before it is handed off to Filecoin miners. There are several points to consider at this step:
i. Filecoin requires data to be packaged into Content Addressable aRchives (CAR) - these are serialized archives of the IPLD DAG constructed from the dataset. We use the powergate CLI tools for this step.
ii. The dataset needs to be smaller than 32 GB. Anything larger has to be broken up into smaller chunks, each stored separately. This adds a step during retrieval: we maintain metadata on all the pieces of a particular dataset and join them back together when the data is accessed.
- The CAR file is made available for miners via an HTTP server. We set up a basic NGINX server for this.
- An auction request is sent to the broker. The broker then orchestrates storage deals with miners in the network.
- Miners who accept the deals pull the data from our server, perform various checks, and finally store the data on their nodes.
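The chunk-and-manifest bookkeeping from step ii above can be sketched as follows. The chunk size is shrunk to bytes for illustration (standing in for the 32 GB per-deal cap), and the manifest format is our own invention, not a Filecoin or Textile structure.

```python
import hashlib

CHUNK_SIZE = 10  # illustrative stand-in for the 32 GB per-deal cap

def chunk_dataset(data: bytes, chunk_size=CHUNK_SIZE):
    """Split a dataset into deal-sized pieces and record a manifest
    (order + content hash per piece) needed to reassemble it later."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    manifest = [{"index": i, "sha256": hashlib.sha256(c).hexdigest()}
                for i, c in enumerate(chunks)]
    return chunks, manifest

def reassemble(chunks, manifest):
    """Verify each piece against the manifest, then join them in order."""
    for entry, chunk in zip(manifest, chunks):
        assert hashlib.sha256(chunk).hexdigest() == entry["sha256"]
    return b"".join(chunks)

data = bytes(range(35))            # pretend this is a 35 GB dataset
chunks, manifest = chunk_dataset(data)
print(len(chunks))                 # 4 pieces: 10 + 10 + 10 + 5
assert reassemble(chunks, manifest) == data
```

The manifest itself would be archived alongside the pieces, so any retriever can verify and rejoin them.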
The auction is finished when either of two conditions is satisfied:
- The deal deadline expires
- The auction has been accepted and data stored by the specified number of miners in the request
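These two completion conditions reduce to a small predicate that a status poller could evaluate. The function and field names here are hypothetical, not the Textile Broker API.

```python
from datetime import datetime, timedelta

def auction_finished(now, deadline, confirmed_replicas, target_replicas):
    """An auction ends when the deadline passes or when enough miners
    have accepted the deal and stored the data."""
    return now >= deadline or confirmed_replicas >= target_replicas

start = datetime(2021, 10, 1)
deadline = start + timedelta(days=5)

assert not auction_finished(start + timedelta(days=1), deadline, 1, 3)  # still open
assert auction_finished(start + timedelta(days=2), deadline, 3, 3)      # replicas met
assert auction_finished(start + timedelta(days=6), deadline, 0, 3)      # deadline hit
```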
Each step in the preparation process takes some amount of time.
Step 1: Downloading the data. This is the fastest step in our pilot experiments, taking a few minutes at most. Time taken here will increase linearly with dataset size (some of the datasets we’ll be dealing with are tens of terabytes large).
Step 2: Converting the data into a serialized IPLD DAG, the CAR file. This process is linear in the size of the input:
The plot shows the time (y-axis, seconds) taken to convert files of 2, 4, 8, 12, 16, and 32 GB. All tests were run on a laptop (Intel i7-10750H 2.6 GHz, 8 GB RAM).
Extrapolating these numbers to our server running on a droplet, where a 2.05 GB file took 191.47 seconds, a 32 GB file should take about 2989 s, or roughly 50 minutes.
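The extrapolation is a simple linear scaling from the single measured point on the droplet:

```python
measured_gb, measured_s = 2.05, 191.47   # one CAR conversion timed on the droplet

def estimate_seconds(size_gb):
    """Linear extrapolation: CAR packing time grows with input size."""
    return size_gb / measured_gb * measured_s

est = estimate_seconds(32)
print(round(est))          # 2989, i.e. about 50 minutes
```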
Step 3: The next step is to host the CAR file on a server, making it available to miners for the duration of the auction. This step takes the longest of the lot: 5 days is the recommended auction deadline, to give storage providers enough time to pull the data off our server.
The duration of step 3 is capped at 5 days because each auction has a 32 GB cap on data size (miners will pull at most 32 GB per auction). For data larger than 32 GB, we’ll have to chunk it into multiple pieces and auction them off separately, each auction still taking up to 5 days.
We have a grand vision, but foresee several problems to be solved before we get there.
Balancing version control with quality archival - Filecoin archival is best thought of as taking “snapshots” of the current state of a dataset. If the dataset is updated in the future, how do we “patch” these snapshots? If we batch changes, what is the best frequency of re-archival? What kind of logic would we need to re-assemble the version history from a set of snapshots?
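One possible shape for such patching logic, purely illustrative: treat each snapshot as a map from file path to content hash, derive a patch between consecutive snapshots, and archive only the patches between full snapshots. None of this is an existing tool; it is a sketch of the re-assembly logic the question asks about.

```python
def diff(old, new):
    """Derive a patch between two snapshots, each a {path: content-hash} map."""
    return {
        "added":   {k: new[k] for k in new.keys() - old.keys()},
        "removed": sorted(old.keys() - new.keys()),
        "changed": {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]},
    }

def apply_patch(snapshot, patch):
    """Replay a patch on top of an earlier snapshot."""
    out = {k: v for k, v in snapshot.items() if k not in patch["removed"]}
    out.update(patch["added"])
    out.update(patch["changed"])
    return out

v1 = {"sub-01/anat.nii.gz": "aaa", "README": "bbb"}
v2 = {"sub-01/anat.nii.gz": "ccc", "CHANGELOG": "ddd"}

patch = diff(v1, v2)
assert apply_patch(v1, patch) == v2   # history reassembles from snapshot + patches
```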
Getting both low-latency access and redundant, long-term storage - Filecoin is great for storing large amounts of data for the long haul, but its layers of security are at odds with retrieval speed. Every retrieval requires making a new retrieval deal, which takes time.
Exciting research is ongoing on this front as well: Estuary.tech provides combined IPFS and Filecoin storage, automatic re-pinning and deal-making on expiry, and automatic splitting of larger files into smaller chunks for storage.
Another option is layering computation on top of, and co-located with, storage infrastructure. This will afford more fine-grained control over data retrieval, while also reducing costs. This is the idea behind the upcoming Filecoin Virtual Machine - smart contracts running alongside incentivized long term storage.
We’re working towards the goal of a self-sustaining, open science community, complete with democratized funding for new research, highly available and verifiable open datasets, and reputation tracking for participants in the ecosystem. Several pieces of the puzzle are yet to be built before the flywheel of science turns on its own.
- Funding mechanisms for important public goods - for maintaining storage infra, making deals with providers, running nodes
- Rich metadata standards for all kinds of datasets
- Open Data Markets - designed with regulatory compliance and data provenance in mind, and keeping bad actors out
- Standards for science-specific DID profiles - researcher info, affiliations, ORCID
We are currently working on closing the archival loop with tools for retrieval. We already have a working datalad remote for IPFS, and once we bind Filecoin archival together with IPFS retrieval, this work will benefit everyday users of DataLad.
This post was written in collaboration with Shady el Damaty and Yaroslav O. Halchenko