Ocean Protocol: Data Exchanges and Next-Gen Intellectual Property

Noah J McGuire
19 min read · Jul 9, 2023


In the current information age, it is echoed far and wide that “data is the new oil”. Many internet and tech companies of the early 2000s did not initially orient their business models toward this untapped “well” of value, usually offering products and services (for example, Facebook’s social platform or Amazon’s online bookstore) that harvested only the most essential user information. This was not out of disinterest; at the time, the resources required to derive actionable user insights simply couldn’t be employed at scale.

However, with the advancement of machine learning and artificial intelligence techniques, developments in computing, and finally the attractive economics of targeted ads and individualized product offerings, an increasing number of companies — both Silicon Valley startups and old-world manufacturing and finance corporations — have realized that they will fall behind if they ignore data-based opportunity. This principle applies well beyond “human” data (i.e. social, preferential, and even biometric), extending into precision engineering, supply chains, and all other technical fields, mostly by way of IoT — the “Internet of Things”. In essence, IoT is a phenomenon driven by the virtual representation of mechanical systems: anything from a home thermostat or a vehicle’s electrical response system to the electrical grid of an entire city can serve as a source of IoT metadata for directed analysis.

This flood of information across all domains of tech has fed increasingly data-hungry artificial intelligence and machine learning (henceforth, AI/ML) systems, which unearth nuggets of statistical gold — effectively pattern recognition across massive swaths of otherwise esoteric information.

But this flood is not without concerns voiced throughout academia, industry, and even pop-culture — one of the most pressing being how these insights will be ethically stewarded by the companies that obtain them, falling under the broad category of ethical AI and/or AI safety. But a derivative, perhaps more actionable set of concerns runs in parallel: how much access and control should these companies have over this “harvested” data? Where should we draw the legal boundaries concerning property rights of this data, the models used, and the insights derived? Furthermore, is it possible to share models or co-create models across corporate boundaries, tapping one another’s “data wells” without compromising security and sovereignty?

Complicated questions, certainly, but without such questions the best minds would never embark on the journey to an answer. Though there is still a need for legal and ethical discussion about precisely who ought to own which segment of these pipelines, the infrastructure to support data marketplaces and “tokenized” property has been developed over the past ten years and continues to develop at breakneck speed.

The DLT Ownership Paradigm

Starting in 2009 with the Bitcoin blockchain, we have for the first time seen widespread use of a fundamentally different form of online accounting and ownership, one that does not require a third-party custodian (like a bank) in order to use sophisticated financial tools. As the oft-quoted (and paraphrased) adage from Andreas Antonopoulos goes: “Not your [cryptographic] keys, not your crypto.”

Sparing the reader the details of modern cryptography, this effectively means that your assets are truly yours and yours alone, unless you personally decide to exchange, transfer, or otherwise delegate the value under your cryptographic control to another entity. Nothing short of a failure of the baseline protocol can take that control away. And thanks to the distributed/decentralized nature of the “agreement” (consensus) mechanism, such a failure is very unlikely, whether because of extreme computational difficulty or the economic infeasibility of corrupting the consensus threshold.
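To make this concrete, here is a minimal sketch of key-based ownership in Python using the eth_account library. The message and amounts are made up for illustration; real wallets add far more key-management machinery than this.

```python
# A minimal sketch of key-based ownership with eth_account (illustrative only).
from eth_account import Account
from eth_account.messages import encode_defunct

# Only the holder of this private key can produce valid signatures for this address.
acct = Account.create()
message = encode_defunct(text="Transfer 10 OCEAN to 0xabc...")  # hypothetical intent
signed = Account.sign_message(message, private_key=acct.key)

# Anyone can verify the signature and recover the signer's address;
# no custodian is needed to prove who authorized the action.
recovered = Account.recover_message(message, signature=signed.signature)
assert recovered == acct.address
```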

Since the creation of Bitcoin, we’ve seen the expansion of this third generation of the internet (Web3): the development of numerous blockchains that facilitate running complex, automated code, cryptographic advances that make transactions more secure and private, and a swath of applications that make Web3 more accessible to end users. Among these developments are tools for representing assets besides currency, with projects already approaching tokenized real estate, art, and yes, even data, by bringing distributed storage and a gradient of fungibility (i.e. the ability to fractionalize property rights) to “abstract” assets.

In this review, we’ll walk through existing and potential applications of the blockchain-based data marketplace, and some of the underlying engineering that makes them possible. In particular, we will examine Ocean Protocol, a data marketplace making these possibilities a reality. First, we’ll break down the vision and design of Ocean Protocol; then the fundamentals of how the data marketplace operates; then the tools available to users and builders and their relationship to future applications. Finally, we’ll walk through an example of their secure-computing tech stack, and conclude with a vision of a paradigmatic shift in the stewardship of AI/ML systems, intellectual property (henceforth, IP), and self-sovereign data.

Enter Ocean Protocol

Ocean Protocol is a data marketplace that facilitates data and AI/ML model exchange and provides custom monetization of IP in these verticals, with the ultimate aim of democratizing AI training and data services. Ocean started as a pivot from Ascribe.io, Bruce Pon and Trent McConaghy’s initial attempt to answer IP inefficiencies using blockchain technology. Upon recognizing that the infrastructure was not yet in place to support such a system, they built BigchainDB, a blockchain-based database, and subsequently Ocean Protocol (henceforth, Ocean), to answer the question of how IP should be represented and utilized in the age of data science.

Currently live on the Ethereum mainnet, Polygon mainnet, Binance Smart Chain, Energy Web Chain, and Moonriver (as well as a variety of testnets), Ocean is at home on a variety of layer-one chains, each serving a different community, with likely more to be added in the future!

Designing Ocean

The core of most DLT projects is a token and a framework through which that token flows, with the token’s circulation “snowballing” the growth of the ecosystem itself. In the case of Ocean, the $OCEAN token is used to incentivize a virtuous cycle of ecosystem growth and utility accumulation. As described in Trent McConaghy’s blog post concerning the “Web3 Sustainability Loop”, the primary vehicles by which token utility and value are created in the ecosystem are 1) OceanDAO, which serves as a subjective promise of future value added by the stewards and builders in the Ocean community, and 2) Data Farming, which serves as an objective optimizer of ecosystem function — i.e. the volume of data consumption and dataset curation.

Coming from the same article, we have a diagram that articulates the big picture flow of $OCEAN as the ecosystem develops:

Source: https://blog.oceanprotocol.com/the-web3-sustainability-loop-b2a4097a36e

In short, the $OCEAN token is partially used by the network as rewards for users who use their own $OCEAN to curate (or “stake” — more on this shortly) datasets, and as disbursements from the Ocean Protocol Foundation treasury for community grants and OceanDAO distributions. The bottom line: this loop increases the utility of $OCEAN, thereby increasing its scarcity and price, while allocating a fraction of the network revenue (a toy split calculation follows the list below) to be:

  1. Burnt, decreasing supply…
  2. Allocated back to OceanDAO for community stewardship…
  3. Used as rewards for those using $OCEAN to curate datasets (in addition to existing network rewards).
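To illustrate the shape of this loop, here is a toy allocation in Python. The split fractions are assumptions chosen for readability, not Ocean’s actual tokenomic parameters.

```python
# A toy model of the revenue loop described above. The fractions are purely
# illustrative assumptions, not Ocean's real parameters.

def allocate_network_revenue(revenue_ocean: float,
                             burn_fraction: float = 0.05,
                             dao_fraction: float = 0.50,
                             curator_fraction: float = 0.45) -> dict:
    """Split marketplace revenue (in $OCEAN) into burn, DAO, and curator rewards."""
    assert abs(burn_fraction + dao_fraction + curator_fraction - 1.0) < 1e-9
    return {
        "burned": revenue_ocean * burn_fraction,               # reduces supply
        "ocean_dao": revenue_ocean * dao_fraction,             # community stewardship
        "curator_rewards": revenue_ocean * curator_fraction,   # stakers/curators
    }

print(allocate_network_revenue(1_000.0))
```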

As a tokenomic exercise in framework verification, the team at Ocean has even run trials using TokenSpice, an EVM-based agent simulator developed in-house, to verify parameters for rewards, burns, and curation that maintain their “Web3 Sustainability Loop”. I highly recommend anyone curious about token engineering check it out — it’s an incredibly powerful tool!

Marketplace Dynamics — Data DeFi

Most applications built on blockchain technology fall under the category of decentralized finance (DeFi) tools, and a marketplace built for data services is no different. The beating heart of Ocean is the data marketplace, whose tools help users manage the end-to-end flow of data monetization and control: publishing datasets, curating their own and others’ datasets, searching for and using other datasets, and finally purchasing and “consuming” data, either by running computation over that data or by gaining “raw” access to it.

In doing all of this, Ocean provides a means to easily monetize datasets and AI/ML models, offering a simple on/off-ramp for data assets in Web3 as well as a framework for launching one’s own data marketplace in a specific vertical, all while maintaining the blockchain-native tenets of decentralization, censorship resistance, privacy, and auditability. To meet these aims, Ocean uses its own token, $OCEAN, denominated against a highly customizable representation of datasets and services: Datatokens — more on their format in the next section.

Fundamentally, access to each dataset/service found on an Ocean marketplace is purchasable in the form of a specific Datatoken. However, pricing Datatokens can be a difficult task, and prices will naturally vary with the relevance of the dataset/service being offered. Drawing on the wisdom of traditional financial markets, Ocean allows marketplace builders to use a variety of pricing methods, including order books, auctions, and market makers. Most promising of these three is the market-maker approach, wherein some third party provides liquidity to the pairing in exchange for fees on each trade, ensuring that people have access to Datatokens (assuming the market maker has sufficient liquidity) without needing to wait for an auction to end or for an order book to find a corresponding offer.

But doesn’t the use of this third party undo the tenets of decentralization in such a marketplace? Not quite — because Ocean is an application built on a blockchain, these market makers are fully automated (hence their common acronym “AMM”), with the owner of the dataset/service defining the parameters of the “bonding curve” that defines how the exchange rate changes in relation to supply and demand. These functions can be linear, quadratic, or even made to change dynamically with supply and demand.
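As a rough illustration of what such bonding curves look like, here is a short Python sketch with made-up coefficients; the actual curve parameters are set by the publisher.

```python
# Illustrative bonding curves: price as a function of circulating Datatoken
# supply. Shapes and coefficients are assumptions for demonstration only.

def linear_price(supply: float, base: float = 1.0, slope: float = 0.01) -> float:
    """Price rises linearly as more Datatokens circulate."""
    return base + slope * supply

def quadratic_price(supply: float, base: float = 1.0, k: float = 1e-4) -> float:
    """Price rises with the square of supply, rewarding earlier buyers more."""
    return base + k * supply ** 2

for s in (0, 100, 1_000):
    print(s, linear_price(s), quadratic_price(s))
```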

Ocean AMMs are structurally similar to those that run at the core of decentralized exchanges (DEXs) like Uniswap, but at a much smaller scale. Because AMMs operate on user-delegated liquidity, any user holding that specific Datatoken and $OCEAN can contribute a paired amount of each to the pool, subsequently earning a percentage of the fees accrued by the AMM as it swaps other users’ $OCEAN for Datatokens. This process of staking one’s liquidity on a certain dataset/service doubles as a means of curation, representing a signal of the quality of the Datatoken offered and thereby helping to find its true price in the marketplace.
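The mechanics can be sketched with a minimal constant-product pool in Python. The 0.3% fee and the reserve numbers are illustrative assumptions, not Ocean’s real pool configuration.

```python
# A minimal constant-product AMM sketch, showing how swap fees accrue to the
# pool and hence to liquidity providers. All parameters are illustrative.

class DatatokenPool:
    def __init__(self, ocean_reserve: float, datatoken_reserve: float, fee: float = 0.003):
        self.ocean = ocean_reserve
        self.datatoken = datatoken_reserve
        self.fee = fee
        self.fees_collected = 0.0  # accrues to liquidity providers

    def buy_datatokens(self, ocean_in: float) -> float:
        """Swap $OCEAN for Datatokens along the x * y = k invariant."""
        fee_amount = ocean_in * self.fee
        self.fees_collected += fee_amount
        effective_in = ocean_in - fee_amount
        k = self.ocean * self.datatoken
        new_ocean = self.ocean + effective_in
        datatokens_out = self.datatoken - k / new_ocean
        self.ocean, self.datatoken = new_ocean, k / new_ocean
        return datatokens_out

pool = DatatokenPool(ocean_reserve=10_000, datatoken_reserve=100)
print(pool.buy_datatokens(500), pool.fees_collected)
```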

These dynamics help keep healthy Datatokens afloat and relevant in the marketplace, while paying out fees to the liquidity providers that curate the market. But it doesn’t stop there — these are only the fundamentals of single-asset publishing and consumption. The real power of Ocean emerges when we further compose these data assets with code to manage rights and governance, allowing wonderful new organizations and tools to emerge. These outgrowths begin to answer the question of data ownership in the modern day, and establish the baseline for Ocean’s vision and ingenuity.

Datatokens and Data NFTs: Customizable Representations of Access

If you’ve been tapped into the Ethereum ecosystem long enough, it should not be surprising that the two types of data assets are represented with the ERC20 and ERC721 standards. ERC20 defines the standard for a fungible token, and ERC721 for a non-fungible token (NFT). To simplify further conversation, I will call these ERC20 tokens “Datatokens” and the ERC721 tokens “Data NFTs”.

In boilerplate terms, Data NFTs are used to represent a unique asset, and Datatokens are used to access data services. Because most data services are meant to be widely available, a Datatoken is usually used for broader access. However, Data NFTs are still valuable for explicitly representing ownership of the right to create additional access licenses, i.e. to mint additional Datatokens. This is an implicit feature of Ocean V3, which allows the publisher to automatically mint additional Datatokens. As we explore more complex permission schemes for baskets of Datatokens or for specific verticals and use-cases, however, Data NFTs come to play an important role.
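A conceptual sketch of this relationship might look like the following; the class and method names are hypothetical, not Ocean’s actual contract interfaces.

```python
# A toy model of a Data NFT (unique base IP) and the Datatokens minted against it.

class Datatoken:
    """Fungible access licenses (ERC20-like) tied to one Data NFT."""
    def __init__(self, parent_nft: "DataNFT", supply: int):
        self.parent_nft = parent_nft
        self.supply = supply

class DataNFT:
    """Represents the base IP of one published dataset/service (ERC721-like)."""
    def __init__(self, owner: str, metadata_uri: str):
        self.owner = owner
        self.metadata_uri = metadata_uri

    def mint_datatokens(self, caller: str, amount: int) -> Datatoken:
        # Only the Data NFT owner holds the right to create new access licenses.
        if caller != self.owner:
            raise PermissionError("only the Data NFT owner may mint Datatokens")
        return Datatoken(parent_nft=self, supply=amount)

nft = DataNFT(owner="alice", metadata_uri="ipfs://example-metadata")
access_tokens = nft.mint_datatokens(caller="alice", amount=100)
print(access_tokens.supply)
```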

Datatokens are unique to some data (or data service) being offered on the marketplace, and allow the publisher to express custom types of access. A common variation is a Datatoken tied to the timespan of access: holding the token could grant access to the service/data one time only, perpetually, or within a set range or schedule. Datatokens are also tunable to point at the specific storage locale of some static data, be it a Web2 cloud service like AWS or a Web3 distributed file system like IPFS. In the case of accessing data streams, the Datatoken must permit access to that stream either by Web2 API or Web3 oracle.

This custom access can also restrict certain groups from accessing a Datatoken. In a real-world context, a publisher of sensitive information would gate the data so that only those with certain credentials — for example, proof of being part of a given research or medical team — have access to it. These credentials raise the question of how identification is represented on chain, and though much work is being done to find the ideal form of decentralized ID (DID), that topic is beyond the scope of this review.
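As a small illustration of how such gating might be expressed, here is a hypothetical access check combining a purchased time window with a credential allow-list; the field names are assumptions, not Ocean’s actual DDO or credential schema.

```python
# Illustrative access check: time-bound access plus a credential allow-list.
import time

def can_consume(holder: str, now: float, access_start: float, access_end: float,
                allowed_addresses: set) -> bool:
    """Grant access only inside the purchased window and only to credentialed holders."""
    in_window = access_start <= now <= access_end
    credentialed = holder in allowed_addresses
    return in_window and credentialed

now = time.time()
print(can_consume("0xResearcherA", now,
                  access_start=now - 3_600, access_end=now + 86_400,
                  allowed_addresses={"0xResearcherA", "0xClinicianB"}))
```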

To take access and customization a step further, the ERC998 standard facilitates something called a Composable Datatoken, which I will refer to as a “CDT”. Composability by this standard means that publishers are able to combine access control for a variety of ERC20, ERC721, and ERC998 tokens into one unified token schema. Other protocols, for example Set Protocol, may be used to create a CDT for use on Ocean, but the core functionality remains the same: using a CDT schema, one is able to “package” any variety of ERC20, ERC721, and even other ERC998-type tokens into a basket of assets with its own hierarchical control.

These CDTs enable representations of (a minimal sketch follows the list):

  • Streams of data by packaging specific intervals into one data token. (Note: this is different from the above method of API access to a data stream because the data is not live)
  • Data from a variety of sources, for example a hundred distinct datasets from different IoT devices that ultimately represent a larger complex system, like the totality of a smart home or smart city.
  • A basket of data, where each individual dataset has relatively small value, but in combination, the composed access to the whole basket is more valuable due to a trend identified by your masterful data science skills.
  • Data indexes, allowing for investment in datastreams unique to different sectors or fields, acting to funnel capital into the data-architecture of that sector instead of a higher level of investment like company stock.
  • “Frames” of data, which allow access to certain subsets of a data stream, but not its totality. As an extension, this form of composition can create scaling access to any percentage of that data.
  • Formats of data reliability and reputation, allowing a user to append some metadata for a specific dataset to better represent its quality. In the future, this is one pathway for data-derivatives markets such as data use insurance.
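The hierarchical idea behind these baskets can be sketched as follows; the class and method names are hypothetical and greatly simplified relative to the ERC998 standard.

```python
# A minimal sketch of hierarchical composition: holding the parent "basket"
# token grants access to every child asset it packages.

class ComposableDatatoken:
    def __init__(self, name: str):
        self.name = name
        self.children = []  # child Datatokens, Data NFTs, or other CDTs

    def add(self, asset) -> None:
        self.children.append(asset)

    def resolve_access(self) -> list:
        """Flatten the hierarchy into the full set of assets the holder can consume."""
        assets = []
        for child in self.children:
            if isinstance(child, ComposableDatatoken):
                assets.extend(child.resolve_access())
            else:
                assets.append(child)
        return assets

smart_home = ComposableDatatoken("smart-home-bundle")
smart_home.add("thermostat-datatoken")
smart_home.add("energy-meter-datatoken")
city = ComposableDatatoken("smart-city-index")
city.add(smart_home)
print(city.resolve_access())
```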

These are only a few examples of where CDTs come into play in Ocean’s data marketplace; in the future, we will see more emerge to better transport capital and ownership in industries reliant on AI/ML.

Though not explicitly a CDT, there is another variant that has recently been used to represent fractional IP in the decentralized science space: the “ReFungible” Datatoken. This variant uses an ERC20 (fungible) wrapper to represent fractions of some dataset/service’s base IP, i.e. its ERC721 Data NFT, with a bonding curve allowing a user to sell portions of the rights to the set/service in fractional amounts. All the same, this variant could be combined with the hierarchy of CDTs to make ownership of AI/ML pipelines more fluid and democratic.
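A rough sketch of that fractionalization pattern, with an invented wrapper class and a simple linear curve standing in for whatever pricing a real deployment would use:

```python
# Illustrative "ReFungible" fractionalization: a wrapper locks the Data NFT and
# sells fractional shares of its rights along a linear bonding curve.

class FractionalIPWrapper:
    def __init__(self, data_nft: str, total_shares: int,
                 base_price: float = 1.0, slope: float = 0.05):
        self.data_nft = data_nft          # the locked base IP
        self.total_shares = total_shares
        self.sold = 0
        self.base_price = base_price
        self.slope = slope

    def price_of_next_share(self) -> float:
        # Each additional share sold costs slightly more.
        return self.base_price + self.slope * self.sold

    def buy_share(self, buyer: str) -> float:
        if self.sold >= self.total_shares:
            raise ValueError("all fractional shares have been sold")
        price = self.price_of_next_share()
        self.sold += 1
        return price  # in practice, settled in $OCEAN via the marketplace

wrapper = FractionalIPWrapper(data_nft="genomics-data-nft", total_shares=1_000)
print(wrapper.buy_share("0xLabCollective"))
```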

Balancing Risk and Reward in the Data Economy

As data becomes more valuable, individuals and organizations have been forced to balance the risks and rewards of moving data online. As a hospital moves clinical trial data from one medical or scientific institution to another, each move runs the risk of a breach, yet each move is necessary to bring that data closer to a positive end result: the discovery of a new, life-saving treatment. Institutions that deal with data, be it social, mechanical, medical, or otherwise, have the most to gain from this transfer, improving and training new AI/ML models to better solve a specific problem.

But the other side of the coin is perilous to the individual and the organization alike. What happens if your cost of medical insurance rises because of some discovery correlating your medical profile to an increased risk of a terminal illness? What happens if a massive corporate dataset for a breakthrough drug trial or mechanical device is compromised and made available to competitors?

These risks are weighed against the benefits each time data moves off-premises or is made available to an outside service provider. With the increasing capacity for data monetization, there is an increasing need for solutions that strike a compromise between security and access — a compromise that Ocean, in recognition of this challenge, has set out to provide.

An Ace in the Hole: Compute to Data

Ocean provides a comprehensive marketplace for publishing, permissioning, and exchanging datasets and services, and with this in mind we can explore what I consider the keystone tool of Ocean Protocol: its privacy-preserving Compute-to-Data (C2D) framework.

Keeping data private while enabling monetization is the name of the game, and Ocean’s C2D approach most resembles federated learning with blockchain-based access controls. Federated learning alone allows a centralized orchestrator to bring a randomly initialized neural network (or other model) to private data silos, run computations to calculate weight updates, then send those updates back to be incorporated into the model, all without the orchestrator ever viewing the data. Where Ocean differs with C2D is that the compute orchestrator is determined by the owner of the data, bringing further control to that owner. This orchestrator could be the data publisher themselves, or another trusted compute provider on the marketplace.
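To ground the federated-learning pattern that C2D resembles, here is a toy federated-averaging round in plain NumPy. It has no relation to Ocean’s actual Operator-Engine internals; it only shows that weights travel to the data and only updates travel back.

```python
# A toy federated-averaging round: local gradient steps inside each data silo,
# averaged by the orchestrator. The raw data never leaves the silo.
import numpy as np

def local_update(weights: np.ndarray, X: np.ndarray, y: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One gradient step of linear regression computed inside the data silo."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

rng = np.random.default_rng(0)
global_weights = np.zeros(3)
silos = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]

# Orchestrator: send weights out, average the returned updates.
updates = [local_update(global_weights, X, y) for X, y in silos]
global_weights = np.mean(updates, axis=0)
print(global_weights)
```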

From the Ocean Protocol C2D docs, the architecture can be visualized in the flow diagram below:

Source: https://docs.oceanprotocol.com/building-with-ocean/compute-to-data

The C2D architecture runs using the ‘Operator-Service’ provided by Ocean, which is in charge of managing the workflow and executing requests, and the ‘Operator-Engine’, which is in charge of orchestrating the compute infrastructure using Kubernetes as a backend, where each compute job runs in an isolated Kubernetes Pod. A thorough explanation of this process can be found on Ocean’s Medium page here, but I will summarize the key steps below (with a simplified sketch after the list), using Alice as a data publisher (and compute provider) and Bob as a data consumer who wants to access Alice’s data:

  1. Alice wishes to offer compute services over her data, and sets her choice of compute infrastructure using the Operator-Service and Operator-Engine, choosing either locally owned compute infrastructure or a trusted compute provider. In this case, Alice sets up her own compute system.
  2. Bob finds Alice’s dataset on the marketplace and believes it could help to train his prediction model for weather. Ocean’s Brizo (an automated proxy for the data publisher/compute provider) then confirms validations for Datatoken payment, access, and signing of a service agreement.
  3. Bob then publishes his algorithm to Ocean, which is given a decentralized ID (DID), and offers payment for the compute service, which is kept in escrow, only to be sent to Alice when the computation is complete.
  4. When all of the necessary checks are completed by Brizo, the Operator-Service is instructed to initiate the computation over the data via their algoDID and dataDID.
  5. The Operator-Service then runs its own checks on the inputs and, upon success, passes them to the Operator-Engine to begin computation using Alice’s pipeline and Bob’s algorithm.
  6. Once the computation is complete, the resulting logs and model are published to an AWS S3 bucket, whose URL is shared with Bob by Brizo.
  7. If Bob is not satisfied with the result, Bob can continue to run the computation until compute access expires either with the same or a different algorithm.

*Note*: a data consumer can only run one active job per compute service. If Alice were to offer multiple compute services, jobs could be run in tandem.
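Putting those steps together, here is a highly simplified, self-contained simulation of the flow. Every class and function is hypothetical pseudocode standing in for Brizo, the Operator-Service, and the Operator-Engine, not their real APIs.

```python
# A simplified, self-contained simulation of the C2D flow described above.

class ComputeJob:
    def __init__(self, algo_did: str, data_did: str, payment: float):
        self.algo_did = algo_did
        self.data_did = data_did
        self.payment = payment      # held in escrow until the job completes
        self.paid_out = False

def brizo_validate(job: ComputeJob, allowed_algos: set) -> bool:
    """Stand-in for Brizo's checks: payment present, algorithm permitted."""
    return job.payment > 0 and job.algo_did in allowed_algos

def operator_engine_run(job: ComputeJob) -> str:
    """Stand-in for the Kubernetes-backed compute; returns a results URL."""
    return f"https://example-bucket.s3.amazonaws.com/results/{job.algo_did}"

job = ComputeJob(algo_did="did:op:bob-weather-model",
                 data_did="did:op:alice-sensor-data",
                 payment=10.0)
if brizo_validate(job, allowed_algos={"did:op:bob-weather-model"}):
    results_url = operator_engine_run(job)
    job.paid_out = True             # escrow released to Alice only on completion
    print(results_url)
```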

This is the essential flow of C2D, but there are other variants of private computing that can be integrated with the C2D framework, particularly those used to derive statistical insight rather than to train an AI/ML model. For example, simple aggregating functions that describe the distribution of the data for the sake of analytics, or the generation of synthetic data based on the published dataset for the consumer to download and use directly. Other methods of anonymizing data are described at length in Trent McConaghy’s blog post comparing privacy-preserving techniques. Critically, Ocean C2D is not limited to any one privacy-preserving technique, and can integrate a variety of methods used for (decentralized) federated learning developed by OpenMined and others.
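As a tiny example of the “insight without raw access” idea, the sketch below returns only a noised aggregate in the spirit of differential privacy; it is a generic illustration, not a description of any specific Ocean integration.

```python
# The consumer receives only a noised aggregate, never the individual records.
import numpy as np

def noisy_mean(private_values: np.ndarray, epsilon: float = 1.0, value_range: float = 1.0) -> float:
    """Return the mean plus Laplace noise scaled to the query's sensitivity."""
    sensitivity = value_range / len(private_values)
    noise = np.random.default_rng().laplace(scale=sensitivity / epsilon)
    return float(private_values.mean() + noise)

# Hypothetical on-premise data that never leaves the publisher's infrastructure.
blood_pressure_deltas = np.random.default_rng(1).uniform(0, 1, size=500)
print(noisy_mean(blood_pressure_deltas))
```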

To summarize: C2D facilitates learning and other analytic computation where the data never leaves the owner’s premises while remote computation is executed. Challenges still exist, such as intentionally malicious injection attacks hidden in a complicated algorithm that may be difficult to catch by eye. In response, Trent discusses in the same article the potential for community curation with “skin in the game” (staking) to help weed out bad algorithms and elevate good ones. Of course, self-inspection is always recommended, but other, very simple algorithms like logistic regression or aggregating functions are easier to trust.

As a final note on C2D, marketplaces can set any number of requirements about how C2D is to be used and over which datasets. Similarly, marketplaces can define which compute resources and algorithms are allowed on their marketplace.

A Veritable “Ocean” of Relevant Application

By providing a service that decentralizes and secures the infrastructure for data and AI/ML exchange, Ocean supplies a critical piece in the puzzle of unified data science in health, engineering, research, personal data protection, and more. Most domains of technology today are deeply reliant on data, and with the tools offered by Ocean they can finally capitalize on opportunities previously locked behind private, secure data silos. Peering into the future, there are myriad applications for the tools offered by Ocean, applicable both to our personal lives and to the professional sectors operating around us.

In the context of industry, a system that builds on Ocean to orchestrate, permission, and administer data sharing and secure C2D learning could spark a Cambrian explosion of next-gen, highly trained AI/ML models at enterprise scale. Because composable Datatokens provide such flexibility in permissioning data, governance of AI/ML use at (inter)national scales can be rigorously tested in regulatory sandboxes, with full auditability on account of the blockchain record. This experimentation would be done with the aim of developing optimal governance and innovation frameworks for the data economy: best practices for secure and private AI/ML, new financial opportunities and channels, and standards for sovereign data.

One recent example of Ocean being used in an industrial enterprise to create new standards is within the Mobility Open Blockchain Initiative (MOBI), which facilitates the sharing of autonomous vehicle (AV) driving data to improve safety models for a global coalition of AV manufacturers!

Ocean helps to universalize access to data science, allowing global talent and previously untapped “data wells” to contribute to AI/ML tools which lack the “full picture” of what they are attempting to model. This can be facilitated by way of hackathons and bounties created by individuals and organizations, turning AI/ML research and development into a sort of gig economy, but with self-sovereign, fine-grained control over the rights to that labor.

At the level of the individual, if an app or browser extension were made to access your browsing and activity metadata and transform it into a personal basket of data, you could create your own personal Datatoken for some facet of your online life. This, in combination with the composition of Datatokens, opens up the possibility of a personal data market, one that over time could come to uproot the data harvesting so prevalent on social platforms today. Decentralized social media platforms like Mastodon are already attempting this, and could serve as a testbed for a social data market in the future. Though a lofty aim, I anticipate legal conversations will be had concerning these rights, and that, depending on jurisdiction, they will give individuals greater sovereignty over the data they produce.

Abstracting away from the individual and toward the collective, groups could create a “Data DAO”: a DAO (decentralized autonomous organization — essentially an LLC on a public blockchain) that manages some set of data assets unique to it. Effectively, these Data DAOs would function as data unions (for example, Delta-DAO), with examples already cropping up for personal data control and for data management in healthcare, neuroscience, and ecological/GIS projects. To take the idea of the Data DAO one step further, it has been suggested that DAOs will come to house forms of artificial intelligence as they operate in the future, and Ocean could provide those AI-based DAOs a means of accessing data in an automated manner, as well as an interface for people to steward the algorithm’s aim. That aim could be profit-driven in the realm of data DeFi, research-based in the development of some new model/algorithm, or entirely philanthropic/creative, for example using a DAO to cybernetically represent a forest and stewarding its pilot algorithm for maximal regenerative aim (inspired by the Terra0 experiment — super cool).

These are just a few examples of how Ocean is applied and could be applied in the future. In the coming years, with data as a new asset class, data DeFi has the opportunity to open up many more domains and verticals to the AI/ML revolution. At this point, secure and decentralized data exchange is still a nascent tech stack, but Ocean has moved to answer the question of decentralized infrastructure in AI/ML safety, development, and monetization, all of which will help answer the big questions of ownership in the data economy. At the risk of sounding overly “solar-punk”, I believe these developments, and the data-economy as a whole, will let high-tech, regenerative efforts flourish in the years to come — something I think we can all look forward to.

A Conclusion

Data is becoming mission critical for massive entities and sovereign citizens alike, and the value at stake with that data continues to grow. All of these entities, whether they know it yet or not, could benefit from the security and financial opportunity of data monetization. With the tools available from Ocean Protocol, anyone can set up such a data marketplace with custom rules, compute provisions, and IP schema, allowing data-based organizations to flourish. Though artificial intelligence is developing at a breakneck rate, blockchain-based control systems will help us safely experiment with these new technologies, a shift that could ultimately set us on a course for true data sovereignty, more comprehensive models of our environment, and financial opportunities in areas we could not otherwise have dreamed of.

Data is in fact the new oil — but with blockchain-based data marketplaces, the concept of the oil baron may be changing. New structures for governing data-scientific IP can create a fairer and more self-sovereign data economy, where each person’s data is bound purely to their choice of monetization and allocation. The data barons of the present will not go quietly, but technical advances will necessitate legal reform, and by that time, awareness of one’s personal data price tag may finally shift the tides in this ocean of value back toward its originators.

Social Media

Feel free to connect with me at the below locales for more content on the development of this burgeoning space — and don’t hesitate to reach out or collaborate!

LinkedIn: https://www.linkedin.com/in/noah-m-aa3b43107/

Twitter: https://twitter.com/FireMcGuireJ
