RFC-003: Content Addressability
Start Date: 2021-11-20
Summary
This RFC specifies how Datasets are uniquely identified across the ODF network, decoupling identity from symbolic aliases.
Motivation
Datasets are currently identified using their symbolic names. These names are then used, for example, when defining inputs of a derivative dataset or when referring to a dataset in a repository.
Two parties, however, can independently create datasets with the same name, making collisions likely. Allowing users to rename datasets to resolve collisions would inevitably break the link between a derivative dataset and its inputs.
There needs to be a way to uniquely identify a dataset on the network, with this identifier being immutable throughout the dataset’s lifetime and resolvable to find the location(s) of data.
Guide-level explanation
Per the rationale section below, we have established that:
- A hash of the last Metadata Block of the dataset can be sufficient to download an entire (subset of a) dataset from a content-addressable storage
- A named reference that points to the last Metadata Block in the chain is needed for identifying dataset as a whole
This RFC therefore suggests some tweaks to the MetadataBlock schema to align it with content-addressability (see reference section).
Additionally, it will introduce a globally unique dataset identifier that can be used to refer to a dataset as a whole. This identifier will follow the W3C DID Identity Scheme. It will be created by hashing a public key of an ed25519 key pair, practically guaranteeing its uniqueness.
Symbolic names will become aliases for such identities, meaning that:
- The same dataset can have different names in different repositories (e.g. mirrors) while still sharing the same identity
- Inputs of derivative datasets will be unaffected by renaming
Conforming to the DID specification will allow us in the future to expand related mechanisms like proof of control and ownership.
Reference-level explanation
To make MetadataBlock content-addressable we will remove blockHash from it, avoiding the chicken-and-egg problem of a block containing its own hash. The metadata hashing procedure will be updated accordingly.
We will also expand the use of the multihash format proposed in RFC-002 to all hashes, removing the dedicated sha3-256 schema format.
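Since a multihash prefixes every digest with a code identifying the algorithm, sha3-256 digests remain usable without a dedicated schema format. A minimal sketch of computing such a self-describing hash for a serialized block (the 0x16 code is taken from the public multicodec table; the block payload is a made-up example):

```python
import hashlib

# Multicodec code for sha3-256 (0x16 in the public multicodec table).
SHA3_256_CODE = 0x16

def multihash_sha3_256(data: bytes) -> bytes:
    """Return a self-describing multihash: <varint code><varint length><digest>."""
    digest = hashlib.sha3_256(data).digest()
    # Both 0x16 and 32 fit in a single varint byte, so no varint loop is needed.
    return bytes([SHA3_256_CODE, len(digest)]) + digest

# Hash a serialized Metadata Block (with blockHash removed) to obtain the
# identifier that other blocks and named references use to point at it.
block_bytes = b'{"prevBlockHash": null, "systemTime": "2021-11-20T00:00:00Z"}'
block_id = multihash_sha3_256(block_bytes)
assert block_id[:2] == bytes([0x16, 32]) and len(block_id) == 34
```

A reader can always tell which algorithm produced the digest from the first byte, so the hashing scheme can be upgraded later without changing the schema.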
A new dataset-name format will be introduced for dataset names (symbolic aliases), in the same manner as the existing dataset-id format.
The existing dataset-id schema format will be changed to expect a DID, following these rules:
- ODF datasets will use a custom DID method: did:odf:...
- The method-specific identifier will replicate the structure of the did:key method as a self-describing and upgradeable way to store dataset identity represented by a public key.
Dataset identity will be stored in the new seed field in the MetadataBlock schema. The seed will be present (only) in the first block of every dataset.
Dataset creation procedure will involve:
- Generating a new cryptographic key pair (defaulting to the ed25519 algorithm)
- Prefixing it with an appropriate multicodec identifier (like ed25519-pub)
- Storing this data in the first Metadata Block’s seed field.
When representing a dataset ID as a string, the DID format did:odf:&lt;multibase&gt; will be used, where the binary data uses the multibase format with base58btc encoding, just like in the did:key method.
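The encoding steps above can be sketched in Python. This is an illustrative sketch, not the normative procedure: the base58btc alphabet and the 0xed multicodec code for ed25519-pub come from the public multibase/multicodec registries, and a random 32-byte string stands in for a real public key:

```python
import os

# Base58btc alphabet as used by multibase (and by the did:key method).
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58btc_encode(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = BASE58_ALPHABET[r] + out
    # Leading zero bytes are preserved as '1' characters.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out

# The ed25519-pub multicodec code (0xed) varint-encodes to these two bytes.
ED25519_PUB_PREFIX = bytes([0xED, 0x01])

def dataset_did(public_key: bytes) -> str:
    # 'z' is the multibase prefix denoting base58btc encoding.
    return "did:odf:z" + base58btc_encode(ED25519_PUB_PREFIX + public_key)

# Stand-in for a freshly generated ed25519 public key (32 random bytes);
# a real implementation would derive it from an actual key pair.
did = dataset_did(os.urandom(32))
assert did.startswith("did:odf:z")
```

Because the multicodec prefix and multibase marker are part of the identifier, both the key algorithm and the text encoding can evolve without breaking existing IDs.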
The DatasetSource::Derivative schema will be updated so that inputs specify:
- id - the unique identity of a dataset
- name - a symbolic name of an input, to be used in queries only.
With the separation of Dataset IDs from Names we will update our PEG grammar to:
DatasetRefLocal = DatasetID / DatasetName
DatasetRefRemote = DatasetID / RemoteDatasetName
DatasetRefAny = DatasetRefRemote / DatasetRefLocal
RemoteDatasetName = RepositoryName "/" (AccountName "/")? DatasetName
AccountName = Subdomain
RepositoryName = Hostname
DatasetName = Hostname
DatasetID = "did:odf:" Multibase
Hostname = Subdomain ("." Subdomain)*
Subdomain = [a-zA-Z0-9]+ ("-" [a-zA-Z0-9]+)*
Multibase = [a-zA-Z0-9+/=]+
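The grammar above can be translated almost mechanically into regular expressions for validation. A sketch (a real implementation would use a proper PEG parser; the sample refs are made up):

```python
import re

# Direct translation of the PEG rules into Python regexes.
SUBDOMAIN = r"[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*"
HOSTNAME = rf"{SUBDOMAIN}(\.{SUBDOMAIN})*"
DATASET_ID = r"did:odf:[a-zA-Z0-9+/=]+"
# RemoteDatasetName = RepositoryName "/" (AccountName "/")? DatasetName
REMOTE_NAME = rf"{HOSTNAME}/({SUBDOMAIN}/)?{HOSTNAME}"

def is_dataset_ref_any(s: str) -> bool:
    """DatasetRefAny = DatasetID / RemoteDatasetName / DatasetName."""
    return any(
        re.fullmatch(p, s) is not None
        for p in (DATASET_ID, REMOTE_NAME, HOSTNAME)
    )

assert is_dataset_ref_any("did:odf:z6MkExample")
assert is_dataset_ref_any("repo.example.com/alice/my-dataset")
assert is_dataset_ref_any("my-dataset")
assert not is_dataset_ref_any("-bad-name")
```

Since a DatasetID always begins with the fixed did:odf: prefix while names never contain a colon, the two reference kinds can be distinguished without ambiguity.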
Drawbacks
- We are adding more fields to MetadataBlock, which is already anemic - this will be addressed separately in a follow-up RFC.
- A seemingly unavoidable break in abstraction layers exists where a named reference (a higher-level concept) is used within derivative dataset inputs in metadata (a lower-level concept). This is, however, similar to an HTML page that contains a relative URL of another page.
Rationale and alternatives
Identity of static data in content-addressable systems
The most widespread form of resource identity in decentralized systems today is content addressability. Git, Docker & OCI image registries, DHTs, Blockchain, IPFS - resources in these systems are uniquely identified by (hashes of) their content.
This approach allows creating an identity for a resource without any central authority and without risk of collisions: if hashes collide, the resources are in fact identical, leading to a very natural de-duplication of data in the system.
This form of identity, however, is applicable only to static data. When you share a file via IPFS, its identity is a hash of the file’s contents. When you modify and share the file again, you get a new identity.
Such form of identity is already perfectly suited for many components of ODF:
- Data part files
- Checkpoints
- Metadata blocks
If we align ODF’s hashing with a content-addressable system like IPFS, we get a cool effect:
- A hash of a data part file stored in a Metadata Block could be used directly to find and download that data file from IPFS.
- The same goes for the previous metadata block identified by its hash - you could “walk” the metadata chain stored in IPFS the same way you do on a local drive.
An entire ODF dataset can be stored inside IPFS, as its structure maps seamlessly onto content-addressable storage.
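The chain-walking idea can be illustrated with a toy content-addressable store, where a dict stands in for IPFS and the block layout is a simplified stand-in for the real MetadataBlock schema:

```python
import hashlib
import json

# A toy content-addressable store: each block references its predecessor by
# hash, so the hash of the last block is enough to reach the entire history.
store = {}

def put_block(prev_hash, payload):
    block = {"prevBlockHash": prev_hash, "payload": payload}
    # The block's identifier is the hash of its serialized content.
    h = hashlib.sha3_256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    store[h] = block
    return h

def walk_chain(head_hash):
    """Yield blocks from newest to oldest, following prevBlockHash links."""
    h = head_hash
    while h is not None:
        block = store[h]
        yield block
        h = block["prevBlockHash"]

h0 = put_block(None, "seed")
h1 = put_block(h0, "add data (slice 1)")
head = put_block(h1, "add data (slice 2)")
payloads = [b["payload"] for b in walk_chain(head)]
assert payloads == ["add data (slice 2)", "add data (slice 1)", "seed"]
```

Note that only the head hash needs to be communicated; everything else is fetched by hash, which is exactly why a single named pointer per dataset suffices.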
Identity of dynamic data
Unlike their individual components, which are immutable, ODF datasets themselves are dynamic.
In a content-addressable system a dataset can be thought of as a “pointer” to the last Metadata Block in the chain. Every time a new block is appended, the dataset is updated to “point” to the new block. How to represent such a “pointer” is often referred to as the “naming” problem.
This idea of named pointers is similar to:
- References (e.g. HEAD), branches, and tags in git
- Image names in Docker / OCI
- The IPNS naming service in IPFS
In the case of Docker / OCI, naming is centralized - the first person to create an organization and push an image with a certain name into the registry wins.
IPNS takes a much more decentralized and conflict-free approach:
- For every “name” a new key pair is generated
- A hash of the public key serves as the name of the pointer
- The private key is used to sign the value, to later prove ownership of it
- The owner writes an entry into the DHT with the signed value pointing to the hash of an IPFS resource
- The entry can be updated when needed
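The steps above can be sketched in miniature. This is not how IPNS is actually implemented: a symmetric HMAC key stands in for the ed25519 key pair (the Python stdlib has no asymmetric crypto), so here the same key both signs and verifies, whereas a real registry verifies with the public key embedded in the name:

```python
import hashlib
import hmac
import os

key = os.urandom(32)                       # stand-in for the key pair
name = hashlib.sha3_256(key).hexdigest()   # the "name" derived from the key

dht = {}                                   # a dict stands in for the DHT

def publish(value: str):
    # Sign the value so the record's owner can be verified later.
    sig = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    dht[name] = (value, sig)

def resolve() -> str:
    value, sig = dht[name]
    expected = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    # Reject records whose signature does not match the key behind the name.
    assert hmac.compare_digest(sig, expected), "signature mismatch"
    return value

publish("hash-of-block-v1")
publish("hash-of-block-v2")                # the owner updates the pointer
assert resolve() == "hash-of-block-v2"
```

The key property is that the name itself commits to the key that may update it, so no central registrar is needed to arbitrate ownership.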
Note that this is just one implementation. Alternative naming schemes using services like DNS and Blockchain exist, but they all have these parts in common:
- A single global-scale system that performs the name resolution (e.g. DNS)
- A mechanism to prove the ownership over a record (e.g. DNS registrar)
Out of this commonality the W3C Decentralized Identifiers (DIDs) specification has emerged, providing a common naming scheme, a data model for the objects that names resolve into, and mechanisms for proving control over a decentralized identity. IPNS, blockchains, etc. act as specific implementations of the “verifiable data registries” conforming to the DID spec.
Prior art
- Content addressability in Git, Docker / OCI
- IPFS
- W3C Decentralized Identifiers (DIDs) Specification
- DID Specification Registries
- DID Method Rubric
- Use of DIDs in the Ocean Protocol
- IPID DID Method
These are covered in the rationale above for better flow.
Unresolved questions
Future possibilities
A key pair generated during creation of the dataset identity can in the future be used to implement access control and proof-of-control schemes described in the W3C DID Core specification.
The IPLD format looks similar to what ODF’s language-independent schemas are trying to accomplish, so we might consider using it in the future for better interoperability.