Rich in Data, Poor in Wisdom: Science Needs a Decentralized Data Commons

Photo by Alina Grubnyak

The corpus of scientific data is fragmented, access-controlled, and rapidly growing beyond the capacity of centralized services to maintain. Recent developments in peer-to-peer technology have made it possible to establish a permanent archive of scientific records that is open to all. In this article series, we dive deep into the cutting-edge technology of decentralized file storage networks and offer potential paths forward for a collaborative decentralized science ecosystem. We also introduce Coral, an open-access knowledge commons designed to sustainably capture the value of scientific enterprise in a decentralized cloud services marketplace.

Knowledge: Who Is It For?

The boundaries of knowledge have historically been limited by access to tools for observation and to high-quality data. The ability to make significant leaps forward in our understanding of the natural world once belonged to a privileged few.

Ptolemy had the armillary astrolabe and papyrus to record the Earthly boundaries of human understanding, boundaries that went unchallenged for over a millennium. Galileo had the convex objective lens and parchment to populate our universe with god-like spheres locked in cosmic coordination. Hubble used the power of the Hooker telescope to circumscribe an infinitely expanding horizon for all human knowledge, leaving behind a challenge for subsequent truth-seekers in a universe where anything was possible.


Distributed Knowledge, Anatomical Plate. Source: 1857 JG Heck

Until recently, only those who were part of an exclusive club of academics could obtain access to the instruments and troves of data required to take on outstanding challenges in science. Today, significant advancements in astronomy and physics are made possible by open collaboration and data-sharing practices. The questions are too big, the models too byzantine and recursive, and the engineering challenges too complex for even the most enlightened individual to solve single-handedly. The horizon of our cumulative understanding of the universe only expands today because the doors to high-quality datasets, and the tools to work with them, are open to everyone, everywhere.

Rich in Data, Poor in Wisdom

While the astronomy community has set the standard for collaborative open science practices, many fields are still rooted in the legacy practice of career advancement based on reputation and ego. It is difficult for many to see how we can move beyond such adversarially entrenched academic interests. Yet the challenges facing modern science will inevitably force a cultural revolution, and that paradigm shift is already underway with the emergence of open-access science data commons, journals, and free software. At the same time, a digital explosion of data from scientific observations of our natural world is generating more content than institutional infrastructure can store, maintain, and provide tools to sift through.

Thousands of petabytes of valuable data and observations on human health, economic activity, social dynamics, and the universe and our impact on it are siloed in outdated storage systems. This data is inaccessible to search engines, stored in arcane schemas known only to a few, and likely never to be utilized. An estimated 80% of raw scientific data collected in the 1990s is already lost forever due to deprecated technology and inadequate archival infrastructure (Wiener-Bronner 2013). The odds of locating the dataset underlying a published paper fall by roughly 17% per year after publication (Vines et al. 2014). The practice of deliberately restricting access to scientific data limits society's rate of innovation precisely when we have never had so many problems that require scientific innovation to solve.

Decentralized file storage protocols offer solutions to this failing via content-addressable data, programmable incentives for data storage, provenance tracking, censorship resistance, and bandwidth that scales with global adoption. A peer-to-peer science data commons powered by these features may provide a resilient digital fabric that aligns a decentralized community of discovery around the most critical and challenging problems of today.

A Short History of Peer-to-Peer Content Networks

Peer-to-peer file sharing is as old as the internet. In fact, the predecessor to the internet as we know it, ARPANET, was strictly a peer-to-peer network when it was first booted up in 1969 (Paratii 2017). Resilience to network degradation, high bi-directional bandwidth, information redundancy, aggregation of resources, and an intrinsically participatory nature are all merits that made distributed peer-to-peer networks a first-choice design among early internet architects and engineers. Many iterations of such direct information sharing have appeared in the short history of the internet, some of them improvements, others dead ends.

The emergence of public key cryptography in 1973 marked the beginning of identity protocols and verifiable content through an ingenious key-pair signing system (Cocks 2001). For the first time, users on a network could trust that a piece of information came from a known identity: content signed with a secret key could be verified against a key publicly posted by that identity. Later, Ralph Merkle invented the Merkle tree in 1979 as a way of tracking the provenance of packets of information, paving the way for version control software such as git (Merkle 1987). The synthesis of public key cryptography with Merkle tree data structures would continue to drive innovation, including the emergence of blockchains, distributed computing, and consensus mechanisms that enhance resilience to attack and minimize fragmentation of information in distributed networks.
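
To make the Merkle tree idea concrete, here is a minimal Python sketch, not Merkle's original construction: a single root hash fingerprints a whole collection of data chunks, so any change to any chunk changes the root. This is the property that later systems such as git and blockchains rely on. The chunk contents below are purely illustrative.

```python
# Minimal Merkle root: pairwise-hash leaves upward until one root remains.
# A simplified sketch, not Merkle's original design.
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunks: list[bytes]) -> bytes:
    """Return a single hash that fingerprints every chunk in order."""
    level = [sha256(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2 == 1:      # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

chunks = [b"observation-001", b"observation-002", b"observation-003"]
print(merkle_root(chunks).hex())     # editing any chunk changes this value
```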

One of the most famous examples of distributed networks, Napster, connected peers through a centralized indexing server, and was shut down in 2001 following copyright-infringement lawsuits from Metallica and the recording industry. The introduction of the Distributed Hash Table (DHT) revolutionized the design of peer-to-peer networks, unlocking higher tiers of decentralization and making networks more resilient to content moderation and censorship. DHTs were initially used to help nodes on peer-to-peer networks remember each other's locations, which allowed those networks to scale in a truly decentralized way because they no longer relied on a centralized server the way Napster did. The extremely popular peer-to-peer network BitTorrent was one of the first networks to utilize DHTs.
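
The core idea behind a DHT can be sketched in a few lines, assuming a simplified model in which node IDs and content keys share one hash space and each key is stored on the node "closest" to it, measured here with XOR distance in the style of Kademlia. The node names and content below are illustrative, not any real network.

```python
# Toy sketch of DHT-style key placement: content keys and node IDs live in the
# same hash space, and each key is assigned to the nearest node.
import hashlib

def key(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

nodes = {name: key(name.encode()) for name in ["node-a", "node-b", "node-c"]}

def responsible_node(content: bytes) -> str:
    k = key(content)
    # XOR distance, as used by Kademlia-style DHTs
    return min(nodes, key=lambda name: nodes[name] ^ k)

print(responsible_node(b"galaxy-survey.fits"))  # which peer should hold this item
```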


The Bitcoin Codebase Fingerprint. Source: Amelia Wattenberger

In 2009, Bitcoin entered the scene (Nakamoto 2008). While peer-to-peer networks prior to Bitcoin allowed users to easily and quickly transfer data to each other, they were not engineered to be tamper-proof records of cryptographically verifiable exchanges. Events can only be appended to the Bitcoin ledger if the node submitting the transactions proves that it has done a certain amount of computational work within a short time window. Bitcoin is the first instance of a peer-to-peer network with a single global state, agreed upon by consensus, that defines truth for the purposes of the network: in this case, the transfer of a cryptographic token representing economic value.
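
A stripped-down sketch of the proof-of-work idea described above, assuming an illustrative difficulty and toy block contents rather than Bitcoin's actual parameters: a node searches for a nonce whose hash falls below a target, and anyone can verify the result with a single hash.

```python
# Stripped-down proof-of-work: find a nonce whose hash has the required number
# of leading zero bits. Doing the work is expensive; verifying it is one hash.
import hashlib

def proof_of_work(block: bytes, difficulty_bits: int = 16) -> int:
    target = 2 ** (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

block = b"txs: alice->bob 1 token; prev: <hash of previous block>"
nonce = proof_of_work(block)
print(f"nonce={nonce} proves the work was done")
```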

The concept of cryptographic proof for verifying events in a distributed network paved the way for accelerated innovation in peer-to-peer technology. The InterPlanetary File System (IPFS), a peer-to-peer file sharing protocol, synthesizes key advancements in decentralized computation such as DHTs and Merkle trees with cryptographic proof to provide a base layer for a permanent internet records archive. IPFS makes it possible, for the very first time, for information to truly belong to a web commons with intrinsic resistance to geographical censorship, to attacks on data integrity through content revision, and to bandwidth bottlenecks imposed by centralized service providers.
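
Content addressing, the property that underpins this tamper resistance, can be illustrated with a toy in-memory "network": the address of a record is derived from its bytes, so a peer can verify whatever it receives against the address it asked for. Real IPFS addresses are multihash-encoded CIDs; the plain SHA-256 hex digest here is a simplification.

```python
# Content addressing in miniature: a record's identifier is a hash of its bytes,
# so any peer can check what it receives against what it requested.
import hashlib

store: dict[str, bytes] = {}                     # stand-in for a network of peers

def put(data: bytes) -> str:
    address = hashlib.sha256(data).hexdigest()   # real IPFS uses multihash CIDs
    store[address] = data
    return address

def get(address: str) -> bytes:
    data = store[address]
    assert hashlib.sha256(data).hexdigest() == address, "content was tampered with"
    return data

cid = put(b"EEG session 42, subject 007")
print(cid, get(cid))
```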

Current State of Cloud Storage

The early 2000s saw the emergence of centralized cloud service providers that would become the gatekeepers for content on the internet. Today, the cloud storage market is dominated by a handful of players. Amazon, Microsoft, and Google control over half the market, and Amazon alone controls a third of it, according to a Canalys (2021) estimate. Amazon reached its near-monopoly position by solving critical scalability problems of the early internet, but in doing so it created a new set of problems, all of which stem from centralization: inefficient resource allocation, data fragmentation across isolated repositories, lack of privacy and security, and unnecessarily high costs. Overall, cloud service providers control the terms that govern the data they store, making them arbiters of access to knowledge.


A Taxonomy of Control Schemas Employed by Big Tech Corporations. Source: Manu Cornet

Amazon has recently begun offering enticing data storage deals to scientists to further increase the size and depth of its content moat (Amazon 2018). Analysts speculate that Amazon could increase the value of its services if it can compile massive amounts of high-quality, interoperable datasets from industry, academic, and government researchers (Goldfein and Nguyen 2018). For example, the Allen Brain Observatory has struck an agreement with Amazon to store dozens of terabytes of valuable neuroimaging observations (Allen Brain Institute 2018).

While Amazon offers free storage for uploaded data, egress from its servers often incurs a heavy fee, sometimes trapping data within its expansive computing centers and making Amazon the de facto owner of publicly funded research. Community backlash appears to have budged Amazon enough to consider a waiver on data egress fees of up to 15% of monthly cloud costs for “eligible” research institutions. It appears that Amazon has taken a page from the science publishing industry and identified access to knowledge as another lucrative component of its increasingly sprawling cloud business model. Even so, a counter-current to the trend of centralization is building momentum and stands to disrupt the monolith of control that big tech companies have erected over the last two decades.
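
To make the egress problem concrete, here is a back-of-the-envelope calculation, assuming an illustrative egress price of $0.09 per GB; actual rates vary by provider, region, and volume tier, and the dataset size is hypothetical.

```python
# Back-of-the-envelope cost of moving a dataset out of a centralized cloud.
# The $0.09/GB rate and 50 TB dataset are illustrative assumptions only.
dataset_tb = 50                      # e.g. a mid-sized imaging archive
egress_per_gb = 0.09                 # USD per GB, illustrative
cost = dataset_tb * 1024 * egress_per_gb
print(f"Downloading {dataset_tb} TB once costs roughly ${cost:,.0f}")
```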

Looking Forward to an Open Web

As part of this counter-current, IPFS has led to the emergence of many additional technological innovations powering the decentralized web. In this article series, we cover the major decentralized data storage protocols and discuss their potential to serve as an underlying fabric for a decentralized science data commons. We begin with a deep dive into the history, mechanics, and popular applications behind IPFS.

Join the Decentralized Open Science Movement

Does the idea of a free, open internet of science strike a chord with you? Consider joining the Opscientia community to learn, connect, and collaborate with others building a commons for co-discovery.

Articles in This Series

  1. Decentralized Content Networks for a Permanent Science Data Commons: IPFS
  2. Engineering Incentives for Data Storage as a Commodity: Filecoin
  3. A Permanent Web of Linked Data: Arweave
  4. Peer-to-Peer Storage without a Blockchain: Storj
  5. One of the First Decentralized Cloud Storage Platforms: Sia
  6. The World Computer’s Hard Drive: Swarm
  7. Open, Free, and Automated Pipelines for Permanently Archiving Massive Scientific Datasets
  8. Coral: A Decentralized and Autonomous Knowledge Commons

References

Allen Brain Institute. (2018, August 9). Neuroscience Data Joins the Cloud. Retrieved November 21, 2021, from https://alleninstitute.org/what-we-do/brain-science/news-press/articles/neuroscience-data-joins-cloud

Amazon. (2018, July 12). New AWS Public Datasets Available from Allen Institute for Brain Science, NOAA, Hubble Space Telescope, and Others. Retrieved November 12, 2021.

Canalys. (2021, April 29). Global cloud services market Q1 2021. Retrieved November 27, 2021, from https://www.canalys.com/newsroom/global-cloud-market-Q121

Cocks, C. (2001, December). An identity based encryption scheme based on quadratic residues. In IMA international conference on cryptography and coding (pp. 360-363). Springer, Berlin, Heidelberg.

Goldfein, J., & Nguyen, I. (2018, March 27). Data is not the new oil. TechCrunch. Retrieved November 20, 2021.

Merkle, R. C. (1987, August). A digital signature based on a conventional encryption function. In Conference on the theory and application of cryptographic techniques (pp. 369-378). Springer, Berlin, Heidelberg.

Paratii. (2017, October 25). A Brief History of P2P Content Distribution, in 10 Major Steps. Medium. Retrieved November 20, 2021.

Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system. Decentralized Business Review, 21260.

Vines, T. H., et al. (2014). The availability of research data declines rapidly with article age. Current Biology, 24(1), 94-97.

Wiener-Bronner, D. (2013, December 23). Most Scientific Research Data From the 1990s Is Lost Forever. The Atlantic. Retrieved November 13, 2021.
