Photo by Alina Grubnyak
This article is the first in a series introduced in Rich in Data, Poor in Wisdom: Science Needs a Decentralized Data Commons.
A unified science data commons needs a single unified file system. InterPlanetary File System (IPFS) is a peer-to-peer protocol that enables the “storing and accessing [of] files, websites, applications, and data” (source). Note that IPFS is not intended to be a cloud storage solution; it is only a protocol and a network. Nonetheless, because the protocol underpins some decentralized storage solutions (such as Filecoin), we will outline some of its important features. After discussing why IPFS is a core component of Web3, we review its mechanisms: content addressing, Merkle trees, and distributed hash tables.
IPFS is fully peer-to-peer, thus meeting the primary requirement for a Web3 technology. Anyone can set up a node and participate in the IPFS network. A node can host and retrieve content. Any content a node hosts can be discovered and retrieved by other nodes on the network. Of the graphs below, the IPFS network looks like the third, the distributed network where all nodes have equal authority. By contrast, in centralized and decentralized networks, some nodes have more authority than others.
Source: Paul Baran’s On Distributed Communications: Introduction to Distributed Communications Networks.
IPFS nodes are highly configurable. By default, if a node downloads content from another node, that content is cached so it can be served quickly if another node requests it. Cached content that has not been pinned is not kept forever: it is removed when the node runs garbage collection, which, when enabled, runs periodically (hourly by default). Because IPFS has no incentive system, people and organizations often host only their own data.
IPFS is used by many decentralized applications (dApps) to accomplish the original dream of the web as a truly peer-to-peer network. The music streaming app Audius stores its music using IPFS. The Ceramic protocol, “the smart document protocol for an open dataweb,” uses IPFS. Plenty of new storage services use IPFS (e.g., Textile, OrbitDB, Pinata, Fleek, Space, Estuary, Web3.storage, and NFT.Storage). One can easily spin up a single-page website with IPFS. Many non-fungible token (NFT) smart contracts include links to files stored on IPFS because the immutable content addresses on IPFS work well in the context of the immutable code implemented in dApps.
IPFS uses content addressing instead of the familiar location addressing. In location-based addressing, one must know the location of a file to retrieve it (e.g., /Users/user/Desktop/file or www.wikipedia.org). With content addressing on IPFS, each address is derived from the file’s contents. The derivation is a hash and so looks like a string of random characters (e.g., QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco). One can retrieve a file from the network without knowing its location because IPFS nodes keep track of which nodes are storing which files. This setup is similar to the DOI (Digital Object Identifier) system, which gives a resource a unique identifier and then maintains a record of the resource’s location. Content addressing differs in that a resource’s address is derived from the resource itself and used by the protocol to locate the file, while each DOI is created and managed by an agency.
Generating an address on IPFS for a piece of content takes a few steps. When one adds a piece of content to IPFS, the content is given a content identifier or CID. Each CID has two components: a codec and a multihash. The codec “holds information about how to interpret the data” (source). The multihash has two components: the hash of the content and metadata about which hash function produced the hash and how long the hash is. All in all, a CID is a sequence of bytes laid out as the codec followed by the multihash, where the multihash is in turn the hash-function identifier, the hash length, and the hash itself (version 1 CIDs also begin with a version number).
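To make the byte layout concrete, here is a minimal sketch that assembles a version 1 CID by hand using only Python's standard library. It assumes the sha2-256 hash function (multihash code 0x12), the "raw" codec (code 0x55), and the lowercase base32 text encoding (prefix "b"); the function names are our own. Note that a real IPFS node usually chunks a file and wraps it in additional structure before hashing, so the CID a node reports for the same bytes may differ from the one computed here.

```python
import base64
import hashlib

# Codes from the multiformats tables. Each is below 0x80, so it fits in a
# single byte and we can skip general varint encoding in this sketch.
CID_VERSION = 0x01
RAW_CODEC = 0x55      # "raw": interpret the data as plain bytes
SHA2_256_CODE = 0x12  # identifies the hash function inside the multihash


def multihash_sha256(content: bytes) -> bytes:
    """Return <hash function code><digest length><digest>."""
    digest = hashlib.sha256(content).digest()
    return bytes([SHA2_256_CODE, len(digest)]) + digest


def cid_v1_raw(content: bytes) -> str:
    """Return <version><codec><multihash>, rendered as base32 text."""
    cid_bytes = bytes([CID_VERSION, RAW_CODEC]) + multihash_sha256(content)
    text = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + text  # "b" marks lowercase base32 without padding


cid = cid_v1_raw(b"hello science commons")
print(cid)
```

Because the first four bytes are fixed by these choices, every CID produced this way begins with the same few characters; only the digest portion varies with the content.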
There are a couple of interesting features of IPFS’s content addressing. Every CID is unique to its content: identical content always yields the same CID, and any change to the content yields a different one. As a consequence, addresses are permanent and immutable. Permanent addresses make versioning easy but present difficulties for those wanting to store mutable content.
On a web that uses location addressing, users are blind to whether malicious content is hosted at a given location because the address of a web page says nothing about its content. This highlights the main problem with location addressing: relatively high trust requirements. In a Web2 environment, a user must trust the party who controls the location of a file. For example, I trust that there is no malicious code at www.wikipedia.org; I can navigate to it without risking my computer or sensitive information. That is, I trust the Wikimedia Foundation. This trust requirement makes the web less safe than it needs to be. For example, a bad actor might earn trust and subsequently direct people to a web address that runs malicious code, or the bad actor might hack a trusted domain (such as Wikipedia). Users are thus required to trust centralized authorities.
Content addressing on IPFS reduces the trust requirement. Because addresses derived from the content will reveal differences in content, there are some cases in which we can know a file isn’t malicious. We know a CID corresponds to non-malicious content if we already have the file, have retrieved it before, or know someone who has retrieved it. Again, this only partially solves the trust problem, but it is still superior to the blindness of location addressing.
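The check behind this claim can be sketched in a few lines. The example below is illustrative only: it uses a bare SHA-256 hex digest as a stand-in for a full CID, and the function name and sample data are made up.

```python
import hashlib


def matches_address(content: bytes, expected_digest: str) -> bool:
    """Re-hash the received bytes and compare with the address we asked for."""
    return hashlib.sha256(content).hexdigest() == expected_digest


original = b"dataset v1: 42 samples"
address = hashlib.sha256(original).hexdigest()  # stand-in for a CID

tampered = b"dataset v1: 43 samples"
print(matches_address(original, address))  # True
print(matches_address(tampered, address))  # False
```

The check tells us only that the bytes are the ones the address names. Whether those bytes are safe is still something we must learn from having seen the file before or from someone we trust.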
A smaller problem of location addressing is unnecessary duplication. For example, if the same photo appears in two blog posts on two different blogs, the photo is likely stored twice, once for each blog, and has two different addresses. This is often inefficient, especially because the blogs might be hosted on the same physical servers even if they have different domain names. Duplication is not always wasteful, however: it is beneficial if the file is in high demand or if the file is stored across a geographically diverse network. In IPFS, this beneficial kind of duplication is the default behavior, while the unnecessary kind is avoided. Plus, each file has only one address.
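The two-blogs example can be sketched with a toy content-addressed store. This is a simplified illustration, not how an IPFS node is implemented: the `ContentStore` class is hypothetical, and a SHA-256 hex digest stands in for a CID.

```python
import hashlib


class ContentStore:
    """A toy store that files every blob under the hash of its bytes."""

    def __init__(self):
        self._blobs = {}

    def add(self, content: bytes) -> str:
        address = hashlib.sha256(content).hexdigest()
        # Identical bytes map to the same address, so a repeat add is a no-op.
        self._blobs.setdefault(address, content)
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
photo = b"pretend these are the bytes of one photo"

address_from_blog_a = store.add(photo)
address_from_blog_b = store.add(photo)

print(address_from_blog_a == address_from_blog_b)  # True
print(len(store))  # 1
```

Both blogs end up with the same address, and the store keeps a single copy. Copies held by other nodes across the network would be the beneficial kind of duplication.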
IPFS uses Merkle trees to link directories, files, and pieces of files together. A Merkle tree is a tree data structure in which each node’s ID is a hash of the node’s contents. The graph below represents a Merkle tree.
There are three significant benefits of using Merkle trees to link content: verifiability, distribution, and deduplication. One can verify that a certain piece of content corresponds to a certain CID by simply hashing the content. “This offers both permanence … and protection against malicious manipulation” (source). The distribution benefit is that any Merkle tree, including any subtree of a Merkle tree, can be retrieved on IPFS. This makes content, directories, and datasets more modular: one can retrieve a whole dataset, just half the dataset, or half of a dataset from one peer and the other half from another peer. Deduplication removes the need to store the same content more than once. For example, if two distinct datasets have one image in common, this image only needs to be stored once. Content addressing and Merkle trees allow us to split files and directories into their smallest parts, store the smallest parts only once, and reconstruct the content as needed.
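These three benefits can be seen in a small sketch. The code below builds a toy Merkle tree with three levels (chunks, files, datasets) in a plain dictionary. Everything here is an assumption made for illustration: the helper names, the tiny chunk size, the newline-separated link format, and the use of SHA-256 hex digests as node IDs. IPFS's actual encoding of links is different.

```python
import hashlib


def put(store: dict, data: bytes) -> str:
    """Store a node under the hash of its contents and return that ID."""
    node_id = hashlib.sha256(data).hexdigest()
    store[node_id] = data
    return node_id


def add_file(store: dict, content: bytes, chunk_size: int = 4) -> str:
    """Store each chunk, then a file node that lists the chunk IDs."""
    chunk_ids = [
        put(store, content[i:i + chunk_size])
        for i in range(0, len(content), chunk_size)
    ]
    return put(store, "\n".join(chunk_ids).encode("ascii"))


def add_dataset(store: dict, files: dict) -> str:
    """Store each file, then a dataset node that lists 'name file_id' lines."""
    entries = [
        f"{name} {add_file(store, data)}" for name, data in sorted(files.items())
    ]
    return put(store, "\n".join(entries).encode("utf-8"))


def read_file(store: dict, file_id: str) -> bytes:
    """Follow a file node's links and reassemble the chunks in order."""
    listing = store[file_id].decode("ascii")
    chunk_ids = listing.split("\n") if listing else []
    return b"".join(store[chunk_id] for chunk_id in chunk_ids)


store = {}
image = b"shared-image-bytes"

dataset_1 = add_dataset(store, {"image.png": image, "notes.txt": b"first dataset"})
blocks_after_first = len(store)
dataset_2 = add_dataset(store, {"image.png": image, "notes.txt": b"second dataset"})

# The image is a subtree of both datasets and can be fetched on its own.
image_id = add_file(store, image)
print(read_file(store, image_id) == image)  # True
print(dataset_1 != dataset_2)               # True
```

Adding the second dataset creates new nodes only for the changed notes file and the new dataset root; the image's chunks and file node are reused. Changing any chunk would change its ID, which changes the file node that lists it, which in turn changes the dataset root.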
If the address doesn’t specify a file’s location, how does the network know where to get a file? IPFS uses a distributed hash table (DHT) to store information about which nodes are storing which files. First, what is a DHT? A hash table (HT) is a data structure that stores key-value pairs, where each key is used to find the location of its value. A DHT is a hash table that is stored across a network of devices. The graph below represents a simple hash table.
The IPFS DHT stores three kinds of records: provider records (which map data identifiers to peers who host the content), IPNS records (which map IPNS keys to IPNS records), and peer records (which map peerIDs to multiaddresses which locate the peers).
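As a rough illustration of a hash table "stored across a network of devices," the sketch below splits key-value pairs among several nodes. The rule for choosing the responsible node (smallest XOR distance between the hashed key and the hashed node name) is borrowed from Kademlia-style DHTs and is an assumption here, as are the class and node names. A real DHT has no global view of all nodes; lookups hop from peer to peer, and records are replicated.

```python
import hashlib


def to_int(label: str) -> int:
    """Hash a label into a large integer so labels can be compared by distance."""
    return int.from_bytes(hashlib.sha256(label.encode("utf-8")).digest(), "big")


class ToyDHT:
    """A hash table whose entries are split among several named nodes."""

    def __init__(self, node_names):
        # Each node holds its own small, ordinary hash table.
        self.tables = {name: {} for name in node_names}

    def responsible_node(self, key: str) -> str:
        key_int = to_int(key)
        # Assumed rule: the node whose hashed name is closest (by XOR) owns the key.
        return min(self.tables, key=lambda name: to_int(name) ^ key_int)

    def put(self, key: str, value) -> None:
        self.tables[self.responsible_node(key)][key] = value

    def get(self, key: str):
        return self.tables[self.responsible_node(key)].get(key)


dht = ToyDHT(["node-1", "node-2", "node-3"])
dht.put("content-a", ["peer-A"])
dht.put("content-b", ["peer-B", "peer-C"])

print(dht.get("content-a"))  # ['peer-A']
print(dht.get("unknown"))    # None
```

No single node holds the whole table, yet anyone who knows the key can work out which node to ask.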
On IPFS, there are three steps in retrieving a file: discovery, routing, and exchange. First (discovery), query the provider records in the DHT, using the content’s multihash as the key, to see which peers are hosting the content. Second (routing), query the peer records in the DHT to figure out where those peers are. Third (exchange), request the desired content from those peers by sending them a wantlist; this wantlist is a list of blocks, where “a block [is] a single unit of data, identified by its key (hash)” (source).
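The three steps can be walked through with in-memory stand-ins for the network. In this sketch the provider records, peer records, and peers' block stores are plain dictionaries; the peer IDs, addresses, and sample block are made up, and a SHA-256 hex digest stands in for the multihash key. To tie the steps back to content addressing, one peer is deliberately set up to return altered data.

```python
import hashlib


def block_key(data: bytes) -> str:
    """Stand-in for a block's multihash key."""
    return hashlib.sha256(data).hexdigest()


block = b"temperature,12.5\n"
key = block_key(block)

# Made-up records of the kinds the DHT stores.
provider_records = {key: ["peer-A", "peer-B"]}
peer_records = {
    "peer-A": "/ip4/192.0.2.10/tcp/4001",
    "peer-B": "/ip4/192.0.2.11/tcp/4001",
}
# What each peer would send back when asked for a key.
peer_blockstores = {
    "/ip4/192.0.2.10/tcp/4001": {key: b"temperature,99.9\n"},  # altered copy
    "/ip4/192.0.2.11/tcp/4001": {key: block},
}


def retrieve(wantlist):
    """Fetch every block on the wantlist that some provider can supply."""
    received = {}
    for wanted in wantlist:
        for peer_id in provider_records.get(wanted, []):        # 1. discovery
            address = peer_records.get(peer_id)                 # 2. routing
            if address is None:
                continue
            data = peer_blockstores.get(address, {}).get(wanted)  # 3. exchange
            # Accept the block only if it hashes to the key we asked for.
            if data is not None and block_key(data) == wanted:
                received[wanted] = data
                break
    return received


result = retrieve([key])
print(result[key] == block)  # True
```

The altered copy from the first peer is rejected because it does not hash to the requested key, so the block is taken from the second peer instead.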
- “IPFS - Content Addressed, Versioned, P2P File System.” Juan Benet. Available at https://ipfs.io/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf.
- IPFS website. Available at https://ipfs.io.
- ProtoSchool by Protocol Labs. Available at https://proto.school.
The brilliance of IPFS is its integration of technologies and data structures. Its design truly allows it to serve as the interplanetary file system. Content addressing enables permanence and security. Merkle trees enable directories and modular file storage. DHTs help the network stay connected. IPFS doesn’t, however, have the cryptoeconomic incentives that define blockchain systems. In the next article in this series, we will explore the decentralized storage network Filecoin. This network is a natural evolution of IPFS: it is built on top of the technologies that power IPFS and adds a blockchain with engineered incentives to encourage behavior that secures and grows the network.
Does the idea of a free, open internet of science strike a chord with you? Consider joining the Opscientia community to learn, connect, and collaborate with others building a commons for co-discovery.
Articles in This Series
- Decentralized Content Networks for a Permanent Science Data Commons: IPFS
- Engineering Incentives for Data Storage as a Commodity: Filecoin
- A Permanent Web of Linked Data: Arweave
- Peer-to-Peer Storage without a Blockchain: Storj
- One of the First Decentralized Cloud Storage Platforms: Sia
- The World Computer’s Hard Drive: Swarm
- Open, Free, and Automated Pipelines for Permanently Archiving Massive Scientific Datasets
- Coral: A Decentralized and Autonomous Knowledge Commons
Benet, J. (n.d.). IPFS - Content Addressed, Versioned, P2P File System. Retrieved November 20, 2021, from https://ipfs.io/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf
Distributed Hash Tables (DHTs) | IPFS Docs. (2021, February 21). IPFS Docs. Retrieved October 31, 2021.
DWeb Tutorial | Content Addressing on the Decentralized Web (Lesson 2) | ProtoSchool. (n.d.). ProtoSchool. Retrieved October 30, 2021.
Host a single-page website on IPFS | IPFS Docs. (2021, August 24). IPFS Docs. Retrieved November 20, 2021.
How IPFS Works | IPFS Docs. (2021, June 22). IPFS Docs. Retrieved October 30, 2021.
Immutability | IPFS Docs. (n.d.). IPFS Docs. Retrieved October 24, 2021.
IPLD Tutorial | Merkle DAGs: Structuring Data for the Distributed Web (Lesson 5) | ProtoSchool. (n.d.). ProtoSchool. Retrieved October 30, 2021.
Multiformats Tutorial | Anatomy of a CID | ProtoSchool. (n.d.). ProtoSchool. Retrieved October 28, 2021.
Rumburg, R., Sethi, S., & Nagaraj, H. (2020). Audius: A Decentralized Protocol for Audio Content. https://whitepaper.audius.co/AudiusWhitepaper.pdf
What is IPFS? | IPFS Docs. (2021, June 22). IPFS Docs. Retrieved October 24, 2021.
Work with blocks | IPFS Docs. (2021, February 21). IPFS Docs. Retrieved October 30, 2021.