# RFC-007: Simple Transfer Protocol
Start Date: 2022-04-19
## Summary
A minimalistic read-only transfer protocol that can be used to explore and transfer datasets between repositories.
## Motivation
ODF needs a bare-minimum transfer protocol for synchronizing datasets between two repositories. The protocol needs to specify how, knowing the URL of a dataset, one can discover (“walk”) the dataset’s Metadata Chain and find all associated objects like refs, checkpoints, and data files.
The protocol aims for maximal interoperability and ease of repository setup, and is less concerned with transfer efficiency. More advanced protocols supporting efficient traversal of metadata, highly parallel downloads, and data querying will be proposed separately.
Introducing this protocol will allow people to easily operate basic read-only ODF repositories by serving a directory under an HTTP server, or sync ODF datasets from IPFS.
## Guide-level explanation
The protocol proposed here is inspired by Git’s “Dumb Protocol”, which allows you to clone a Git repository over many application-level (L7) protocols (like HTTP and FTP) and works over any block- or file-based storage that allows Unix path-like names for objects.
The only operation a server needs to support is reading an object by key. The protocol purposefully does not rely on any advanced features such as the ability to list a directory.
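To make this concrete, below is a minimal sketch of that single capability as a storage abstraction. The `ObjectStore` trait, the `FileStore` backend, and the `./my-dataset` path are illustrative assumptions rather than spec definitions; an HTTP backend could satisfy the same trait by issuing `GET {base_url}/{key}`.

```rust
use std::path::PathBuf;

/// The single capability a Simple Transfer Protocol backend must provide:
/// return the bytes stored under a Unix path-like key (or None if absent).
/// Listing directories, writing, and other operations are intentionally not required.
trait ObjectStore {
    fn get(&self, key: &str) -> std::io::Result<Option<Vec<u8>>>;
}

/// A dataset directory on the local filesystem is the simplest possible backend.
struct FileStore {
    root: PathBuf,
}

impl ObjectStore for FileStore {
    fn get(&self, key: &str) -> std::io::Result<Option<Vec<u8>>> {
        let path = self.root.join(key.trim_start_matches('/'));
        match std::fs::read(&path) {
            Ok(bytes) => Ok(Some(bytes)),
            Err(e) if e.kind() == std::io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }
}

fn main() -> std::io::Result<()> {
    // `./my-dataset` is a hypothetical local dataset directory.
    let store = FileStore { root: PathBuf::from("./my-dataset") };
    if let Some(bytes) = store.get("refs/head")? {
        println!("head block hash: {}", String::from_utf8_lossy(&bytes).trim());
    }
    Ok(())
}
```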
## Reference-level explanation
The following section describing the protocol will be introduced into the spec:
Simple Transfer Protocol is a bare-minimum read-only protocol used for synchronizing a dataset between repositories. It requires no ODF-specific logic on the server side and can be easily implemented, for example, by serving a dataset directory under an HTTP server.
To describe the protocol we will use the HTTP `GET {object-key}` notation below, but note that this protocol can be implemented on top of any block- or file-based protocol that supports Unix path-like object keys.
- The process begins with `GET /refs/head` to get the hash of the last Metadata Block
- The “metadata walking” process starts with `GET /blocks/{blockHash}` and continues following the `prevBlockHash` links
- Data part files can be downloaded by using `DataSlice::physicalHash` links with `GET /data/{physicalHash}`
- Checkpoints are similarly downloaded using `Checkpoint::physicalHash` links with `GET /checkpoints/{physicalHash}`
- The process continues until reaching the first (seed) block of the dataset or another termination condition (e.g. reaching a block that has already been synced previously); a client-side sketch of this walk follows the list
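The walk described above can be expressed as a short client-side sketch. It runs against a dataset served as a plain directory for simplicity (an HTTP client would issue the same `GET`s against a URL prefix), assumes the `refs/head` object stores the head hash as text, and uses `prev_block_hash` as a placeholder for whatever Metadata Block decoding a real implementation performs.

```rust
use std::path::Path;

/// Expected object layout under the dataset root (directory or URL prefix):
///   refs/head                  -> hash of the last Metadata Block
///   blocks/{blockHash}         -> Metadata Block
///   data/{physicalHash}        -> data part file
///   checkpoints/{physicalHash} -> checkpoint file
fn read_object(root: &Path, key: &str) -> std::io::Result<Vec<u8>> {
    // For an HTTP repository this would be `GET {base_url}/{key}` instead.
    std::fs::read(root.join(key))
}

/// Placeholder assumption: a real client decodes the Metadata Block according
/// to the ODF schemas and returns its `prevBlockHash`, if any. This stub
/// returns None so the sketch stays runnable and stops after the head block.
fn prev_block_hash(_block: &[u8]) -> Option<String> {
    None
}

fn sync(root: &Path) -> std::io::Result<()> {
    // 1. Start from the reference to the last Metadata Block.
    let head = String::from_utf8_lossy(&read_object(root, "refs/head")?)
        .trim()
        .to_string();

    // 2. Walk the chain backwards via `prevBlockHash` until the seed block
    //    (or until hitting a block that has already been synced previously).
    let mut next = Some(head);
    while let Some(hash) = next {
        let block = read_object(root, &format!("blocks/{hash}"))?;

        // 3. For blocks referencing data slices or checkpoints, fetch those too,
        //    e.g. `read_object(root, "data/{physicalHash}")` and
        //    `read_object(root, "checkpoints/{physicalHash}")`.

        next = prev_block_hash(&block);
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // `./my-dataset` is the same hypothetical local dataset directory as above.
    sync(Path::new("./my-dataset"))
}
```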
Additionally, the dataset identity grammar will be updated to allow referencing remote datasets by URL:
```
DatasetRefRemote = DatasetID / RemoteDatasetName / Url
Url = Scheme "://" [^\n]+
Scheme = [a-z0-9]+ ("+" [a-z0-9]+)*
```
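As an illustration only, a small helper that checks a reference against the `Url` and `Scheme` rules above might look like this (the function name and the example URLs are assumptions, not spec text):

```rust
/// Checks a dataset reference against `Url = Scheme "://" [^\n]+` where
/// `Scheme = [a-z0-9]+ ("+" [a-z0-9]+)*`.
fn is_remote_url(reference: &str) -> bool {
    let Some((scheme, rest)) = reference.split_once("://") else {
        return false;
    };
    let scheme_ok = scheme.split('+').all(|part| {
        !part.is_empty()
            && part
                .chars()
                .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit())
    });
    scheme_ok && !rest.is_empty() && !rest.contains('\n')
}

fn main() {
    assert!(is_remote_url("https://example.org/my-dataset")); // simple scheme
    assert!(is_remote_url("ipfs+http://example.org/ipns/my-dataset")); // compound scheme
    assert!(!is_remote_url("my-repo/my-dataset")); // not a URL reference
    println!("grammar checks passed");
}
```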
The `DataSlice` and `Checkpoint` objects will be extended to carry the `size` of the files to provide size hints when transferring files between repositories in a streaming fashion.
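For example, a client streaming a data file could use the hint roughly as sketched below (the struct is a simplified stand-in for the actual schema, carrying only the fields mentioned above):

```rust
/// Simplified stand-in for the extended object: `size` carries the expected
/// byte length of the file identified by `physical_hash`.
struct DataSlice {
    physical_hash: String,
    size: u64,
}

/// With the size known up front, a streaming client can preallocate its buffer,
/// report transfer progress, and verify the received length afterwards.
fn plan_transfer(slice: &DataSlice) -> Vec<u8> {
    println!(
        "fetching data/{} ({} bytes expected)",
        slice.physical_hash, slice.size
    );
    Vec::with_capacity(slice.size as usize)
}

fn main() {
    let slice = DataSlice {
        physical_hash: "example-physical-hash".to_string(), // hypothetical hash value
        size: 1_048_576,
    };
    let buf = plan_transfer(&slice);
    assert!(buf.capacity() >= 1_048_576);
}
```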
## Drawbacks
- Walking the metadata chain one object at a time limits parallelism, but, as previously stated, performance is not the goal here, and some parallelism is still possible when downloading data and checkpoint files.
## Rationale and alternatives
## Prior art
## Unresolved questions
- Git’s traversal begins with the `info/refs` file, which is updated by the `git update-server-info` command. Having this file as a starting point allows the client to discover all branches and tags in the repo. For the time being, while the ODF branching model remains underspecified, we will begin traversal with the `/refs/head` path.
- This protocol is scoped to a single dataset; how multi-dataset repositories and mirrors will function is left out of scope of this RFC.
## Future possibilities
- This is a necessary step for implementing pulling of datasets from IPFS and other storage systems that provide an HTTP gateway.
- In the future, we can specify protocols that allow more efficient and highly parallel transfer at the expense of having slightly more logic on the server side.