Opscientia Community Spotlight: Google Summer of Code

Google Summer of Code, or GSoC, is a competitive program that funds select university students to work for 8-10 weeks on an open source project during their summer break. Students choose from a diverse collection of organisations, and submit a proposal for a project they’d like to work on.

Opscientia founder Shady El Damaty was a GSoC mentor through the International Neuroinformatics Coordinating Facility (INCF). Shady posted project ideas on the Neurostars forum, and Kinshuk applied for one of them through the Google fellowship program, which supported his work on the project.

Kinshuk is a senior-year undergraduate studying computer science and has been developing open-source software since high school. Below is his account of entering the world of Web 3.0 through the GSoC program and Opscientia.


Stumbling on Opscientia

In the idle hours of March, I was looking for something to do over the summer. I started searching for interesting GSoC projects, combing through the various participating organisations. I was very excited to find that INCF had a project (from Opscientia) that mentioned Ethereum.

I immediately reached out on the Neurostars forum and then joined the Opscientia Discord, which only had 4-5 members at the time :scream:. As I read more about the vision and interacted with Shady, Opscientia’s founder, I grew more convinced that the time is ripe for a revolution in how science is done.

Ideas that excite me

Opscientia is addressing many pain points that scientists routinely face. People in academia regularly encounter frictions due to data access issues, such as IP conflicts, misinterpreted regulations, or simply a lack of infrastructure.

There is a deluge of new data, but most of it remains siloed in central servers which are slow to access. Outsiders like citizen scientists and hobbyists who are not part of the academic system are totally left out. Siloed data also makes research labs hesitant to collaborate and creates research bubbles that are difficult for outsiders to enter.

Opscientia aims to unlock access to data silos, create incentive structures for effective collaboration, and democratise research funding. These ambitious ideas are backed up by solid user research and development by the team.

My experience so far

Web3 is well known for giving newcomers an excellent experience: hands-on development, novel ideas, and, most of all, the amazingly open connections formed among contributors.

I can confirm — I’ve had a delightful experience working on a project with the potential to have a huge impact on the world by interacting and developing ideas with a brilliant peer group. I honestly cannot recommend it enough to anyone interested in GSoC, Opscientia, or Web3 in general.


My GSoC project

My project with Opscientia focused on two important pieces of the ecosystem — storage of large datasets remotely with version control using DataLad, and a scientific Data Wallet for accessing data from decentralised storage.

Motivation

Modern science depends heavily on empirical data, which has led to a dramatic increase in the creation of data resources. The past few decades have seen major improvements in the ability to share these data resources for collaboration. But, as mentioned earlier, accessibility is still far from ideal: data remains siloed in disconnected warehouses with incompatible interfaces.

Interoperability is still just a dream in the minds of most scientists. Even when the data becomes available, there is usually no version control to inform users about data updates. This leads to a lot of friction for researchers and peer-reviewers, and does not allow citizen scientists and hobbyists to take part.

Traditional storage providers have several inherent issues:

  • They are monolithic and expensive solutions that may act as a single point of failure
  • This monolithic property allows easy censoring (location-based or otherwise)
  • Many simultaneous downloads can hit bandwidth limits, making downloads of large datasets time-consuming, expensive, and ultimately impractical

The InterPlanetary File System (IPFS) is an emerging open-source protocol for peer-to-peer sharing of datasets. On IPFS, multiple peers can serve pieces of a dataset simultaneously, offering greater aggregate bandwidth. Content is addressed by its hash, so the data behind an address is immutable and easily verified by checking the hash of the Merkle root. This enables provenance tracking and verification of authenticity, ultimately making IPFS a tool against censorship and for open access.
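
To make the idea of verifying data against a Merkle root concrete, here is a minimal sketch in Python. It is a simplified binary Merkle tree over SHA-256, not the actual Merkle DAG or CID encoding that IPFS uses, and the function names are made up for illustration.

```python
import hashlib
from typing import Sequence


def sha256(data: bytes) -> bytes:
    """Return the raw SHA-256 digest of some bytes."""
    return hashlib.sha256(data).digest()


def merkle_root(chunks: Sequence[bytes]) -> str:
    """Return the hex Merkle root of a sequence of data chunks.

    Simplified illustration only: IPFS builds a Merkle DAG with its
    own chunking and encoding, which this sketch does not reproduce.
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    level = [sha256(chunk) for chunk in chunks]
    while len(level) > 1:
        # Duplicate the last hash when a level has an odd number of nodes.
        if len(level) % 2 == 1:
            level.append(level[-1])
        # Hash each adjacent pair to build the next level up.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


def verify(chunks: Sequence[bytes], expected_root: str) -> bool:
    """Check that the chunks hash to the root that was published."""
    return merkle_root(chunks) == expected_root
```

Changing any single chunk changes the root, so a downloader who trusts only the root hash can detect a tampered or corrupted dataset regardless of which peers supplied the pieces.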

Why is DataLad important?

The DataLad utility enables data sharing and collaboration by providing uniform access to available data, independent of storage providers or authentication schemes. It pairs this with reliable versioning using git-annex, which allows obtaining file contents upon request.

IPFS remote for git-annex and DataLad

In order to use IPFS for git-annex storage, we need a script for a special remote that git-annex can communicate with. There was a pre-existing bash script for this, but there were problems getting it to run: files added to an annex on one machine did not show up in the same repository on another machine. As part of my GSoC work, I investigated this issue.

After trying out many different angles, the solution turned out to be relatively simple. It was an issue with the way GitHub and other git-based services prioritise branches. Git-annex makes several branches to keep track of metadata, versions, and annex locations. When the repository is cloned on another machine, the branch checked out by default is the wrong one, and switching to the correct branch manually fixes the bug. Thanks to the wonderful git-annex forum members who pointed me in the right direction!
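
As a rough illustration of the fix, the sketch below picks the branch to check out after cloning. It assumes the wrong default was one of git-annex's bookkeeping branches (the `git-annex` branch or a `synced/<name>` branch), which the post does not spell out. The helper name and the preferred branch names are hypothetical.

```python
def pick_working_branch(branches, preferred=("main", "master")):
    """Pick the branch to check out after cloning an annex repository.

    Assumption: the wrong default was a git-annex bookkeeping branch.
    """
    # git-annex stores its metadata in the `git-annex` branch and uses
    # `synced/<name>` branches for syncing; neither is the branch a
    # user normally works on.
    candidates = [
        name for name in branches
        if name != "git-annex" and not name.startswith("synced/")
    ]
    # Prefer a conventional branch name if one is present.
    for name in preferred:
        if name in candidates:
            return name
    if candidates:
        return candidates[0]
    raise ValueError("no working branch found")
```

With the branch name in hand, running `git checkout <branch>` in the fresh clone, or changing the default branch in the hosting service's settings, applies the manual fix described above.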

Once the special remote for git-annex works, integration with DataLad is simple. We just need to specify the remote to be used before we want to upload to IPFS.
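
A sketch of what specifying the remote could look like follows. It only builds the command lines and does not run them. It assumes the special remote script is installed on the PATH as `git-annex-remote-ipfs`, that the remote is named `ipfs`, and that `datalad push --to` is the upload step; the helper name is hypothetical, and the exact commands used in the project may differ.

```python
def ipfs_remote_commands(remote_name="ipfs"):
    """Build the commands to register and use an IPFS special remote.

    Assumptions: the external special remote program is available as
    `git-annex-remote-ipfs`, and `remote_name` is free to choose.
    """
    # Register the external special remote with git-annex (done once).
    init = [
        "git", "annex", "initremote", remote_name,
        "type=external", "externaltype=ipfs", "encryption=none",
    ]
    # Ask DataLad to push annexed content to that remote.
    push = ["datalad", "push", "--to", remote_name]
    return [init, push]


if __name__ == "__main__":
    for command in ipfs_remote_commands():
        print(" ".join(command))
```

Each list can be passed to `subprocess.run` inside a DataLad dataset once a local IPFS node is running.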

Opsci Data Wallet

In my original GSoC proposal, I had outlined an application for storage of large neuroimaging datasets. As the weeks passed, the entire Opscientia team iterated on the idea, and we decided to build a prototype so that we could gather feedback from peers.

The Data Wallet that we ended up implementing is a web application that has the following features:

  • Simple user login using an Ethereum wallet (MetaMask)
  • Drag-and-drop file upload to IPFS storage, through the Textile.io Buckets API
  • Scientific kudos using POAP (Proof Of Attendance Protocol)

As we implemented this simple Data Wallet and gathered more user data, we refined our idea of the larger ecosystem of Decentralised Science (DeSci) tools, of which the Data Wallet is an integral component.


Future of the project

Extending DataLad for better user experience (UX)

In its current form, DataLad has several UX limitations. If you are a seasoned Linux user or have programming experience, using DataLad via the command line interface (CLI) feels natural. But we cannot make that assumption for the majority of scientists who can make use of the DeSci stack. To use DataLad with IPFS in the current state, you need to have a local IPFS node running since the remote script uses the IPFS CLI command.

Given a choice between a technically sophisticated but difficult-to-use tool and a simple, extremely easy-to-use web-native platform, most people will pick the latter. We need a way to lower the barrier to entry and reduce technical friction in the onboarding process.

The long-term aim is to make a composable building block that is easy to integrate with the larger Web3 ecosystem. For example, we would like to integrate DataLad's provenance tracking features with proof-of-publication and with data marketplaces.

DataLad Gateway Service

DataLad is easy to use for people familiar with command line tools, but we need modifications to make it part of our composable building-blocks-of-science pipeline. Providing a gateway which supports API access to DataLad functions will enable remote access as well as usability for non-Linux users. This is akin to how you don’t need a local IPFS node to access content on IPFS (you can access it via public gateways like Infura, the IPFS community gateway, etc.).
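
The IPFS side of this analogy can be sketched in a few lines: public gateways expose content at a path of the form `/ipfs/<CID>`, so fetching a file only requires building a URL. The helper name below is hypothetical, `ipfs.io` is used as one example of a public gateway, and any CID passed in is a placeholder rather than a real dataset.

```python
from urllib.parse import quote


def gateway_url(cid, path="", gateway="https://ipfs.io"):
    """Build a path-style gateway URL for a piece of IPFS content.

    `cid` is the content identifier; `path` is an optional file path
    inside a directory that was added to IPFS.
    """
    base = gateway.rstrip("/")
    url = f"{base}/ipfs/{cid}"
    if path:
        # Percent-encode the path but keep "/" as a separator.
        url += "/" + quote(path.lstrip("/"))
    return url
```

A DataLad gateway would play the same role for dataset operations: the user sends a plain HTTP request, and the service runs the corresponding DataLad command on their behalf.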

DataLad has support for extensions, which can be used to create a REST API. Being able to run DataLad through an API will help improve discoverability and access through integration with Coral, Opscientia’s fork of Ocean Protocol.


Closing thoughts

We want to make the UX of the science metaverse as intuitive as possible in order to automate the most mundane parts of science and free up space for scientists to ‘do science.’ We still need to refine our tokenomics for managing and propagating scientific reputation, and to work out how to manage data access through data marketplaces. These are some open problems we are working on, and we invite you to collaborate with us through our Discord to help build the future of science!

After the fellowship

Since I am still in college, I will continue working as a Fellow with Opscientia to help refine ideas and build out the ecosystem. I will also write more about our findings as we continue to build upon our research and development.

– Kinshuk, aka PengKin
