Start Date: 2022-04-19

RFC Status

Spec PR

Summary

A minimalistic read-only transfer protocol that can be used to explore and transfer datasets between repositories.

Motivation

ODF needs a bare-minimum transfer protocol for synchronizing datasets between two repositories. The protocol needs to specify how, knowing the URL of a dataset, one can discover (“walk”) the dataset’s Metadata Chain and find all associated objects like refs, checkpoints, and data files.

The protocol is aimed at maximal interoperability and ease of repository setup, and is less concerned with transfer efficiency. More advanced protocols supporting efficient traversal of metadata, highly parallel downloads, and data querying will be proposed separately.

Introducing this protocol will allow people to easily operate basic read-only ODF repositories by serving a directory under an HTTP server, or sync ODF datasets from IPFS.

Guide-level explanation

The protocol proposed here is inspired by Git’s “Dumb Protocol”, which allows cloning a Git repository over many application-level (L7) protocols (like HTTP and FTP) and works over any block- or file-based storage that allows Unix path-like names for objects.

The only operation a server needs to support is reading an object by key. The protocol purposefully does not rely on any advanced features such as the ability to list a directory.

Reference-level explanation

The following section describing the protocol will be introduced into the spec:

Simple Transfer Protocol is a bare-minimum read-only protocol used for synchronizing a dataset between repositories. It requires no ODF-specific logic on the server side and can be easily implemented, for example, by serving a dataset directory under an HTTP server.

To describe the protocol we will use the HTTP GET {object-key} notation below, but note that this protocol can be implemented on top of any block- or file-based protocol that supports Unix path-like object keys. A minimal client sketch is given after the list of steps.

  1. The process begins with GET /refs/head to obtain the hash of the last Metadata Block
  2. The “metadata walking” process then proceeds with GET /blocks/{blockHash}, continuing to follow the prevBlockHash links
  3. Data part files can be downloaded using the DataSlice::physicalHash links with GET /data/{physicalHash}
  4. Checkpoints are similarly downloaded using the Checkpoint::physicalHash links with GET /checkpoints/{physicalHash}
  5. The process continues until reaching the first (seed) block of the dataset or another termination condition (e.g. reaching a block that has already been synced previously)
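
The following is a minimal client sketch of this walk, assuming an HTTP repository. The decode_block() helper and the attribute names on the decoded block are hypothetical placeholders: the actual block encoding and field names are defined elsewhere in the spec and are out of scope for this RFC.

import urllib.request

def get(base_url: str, key: str) -> bytes:
    # The only operation required from the server: read an object by key.
    with urllib.request.urlopen(f"{base_url}/{key}") as resp:
        return resp.read()

def sync(base_url: str, decode_block, have_block=lambda h: False):
    # 1. Start from the reference to the last metadata block
    block_hash = get(base_url, "refs/head").decode().strip()

    # 2. Walk the chain backwards following the prevBlockHash links
    while block_hash and not have_block(block_hash):
        block = decode_block(get(base_url, f"blocks/{block_hash}"))

        # 3. Fetch data part files referenced by DataSlice::physicalHash
        for h in block.data_physical_hashes:
            get(base_url, f"data/{h}")

        # 4. Fetch checkpoints referenced by Checkpoint::physicalHash
        for h in block.checkpoint_physical_hashes:
            get(base_url, f"checkpoints/{h}")

        # 5. Stop at the seed block (no prevBlockHash) or a block we already have
        block_hash = block.prev_block_hash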

Additionally, the dataset identity grammar will be updated to allow referencing remote datasets by URL:

DatasetRefRemote = DatasetID / RemoteDatasetName / Url

Url = Scheme "://" [^\n]+
Scheme = [a-z0-9]+ ("+" [a-z0-9]+)*
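
For illustration only, the Url and Scheme rules above translate to a regular expression roughly as follows; the hostnames and dataset names in the examples are hypothetical.

import re

# Rough regex equivalent of: Url = Scheme "://" [^\n]+ ; Scheme = [a-z0-9]+ ("+" [a-z0-9]+)*
URL_RE = re.compile(r"^[a-z0-9]+(\+[a-z0-9]+)*://[^\n]+$")

assert URL_RE.match("https://repo.example.com/my-dataset")
assert URL_RE.match("s3+https://bucket.example.com/my-dataset")  # compound scheme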

The DataSlice and Checkpoint objects will be extended to carry the size of the files to provide size hints when transferring files between repositories in a streaming fashion.
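
As a rough illustration (the authoritative schemas live in the spec, and other existing fields are omitted here), the extension could look like this, with the new size field carrying the file size in bytes so a receiver can pre-allocate or report progress while streaming:

from dataclasses import dataclass

@dataclass
class DataSlice:
    physical_hash: str   # used as the /data/{physicalHash} object key
    size: int            # new: size of the data part file in bytes

@dataclass
class Checkpoint:
    physical_hash: str   # used as the /checkpoints/{physicalHash} object key
    size: int            # new: size of the checkpoint file in bytes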

Drawbacks

  • Walking the metadata chain one object at a time limits parallelism, but, as previously stated, performance is not the goal here, and some parallelism is still possible when downloading data and checkpoint files.

Rationale and alternatives

Prior art

Unresolved questions

  • Git’s traversal begins with the info/refs file that is updated by the git update-server-info command. Having this file as a starting point allows the client to discover all branches and tags in the repo. For the time being, while the ODF branching model remains underspecified, we will begin traversal with the /refs/head path.
  • This protocol is scoped to a single dataset; how multi-dataset repositories and mirrors will function is out of scope for this RFC.

Future possibilities

  • This is a necessary step toward pulling datasets from IPFS and other storage systems that provide an HTTP gateway.
  • In the future, we can specify protocols that allow more efficient and highly parallel transfer at the expense of slightly more logic on the server side.