Start Date: 2022-04-19

RFC Status

Spec PR

Summary

This RFC proposes to no longer allow arbitrary files and directory structures as engine checkpoints and to require that a checkpoint always be a single file.

Motivation

Currently, an engine can produce a checkpoint that includes an arbitrary structure of files and directories. This presents a few problems:

  1. Metadata blocks need to refer to checkpoints by hash, but there is no standard approach to computing the hash of a directory. We would need to create a stable directory hashing algorithm ourselves, and this complexity would spread to all implementations.

  2. Allowing arbitrary file structures is also a security concern, e.g. we would need to make sure engines don’t create unexpected symlinks.

  3. When downloading a dataset from a repository, many transfer protocols don’t have a standard way to list a directory. One such example is HTTP: there is no standard content format for GET on a directory, and most web servers will return a styled HTML page. Similarly to how Git can clone a repo using the “dumb protocol”, we’d like to be able to walk and download an entire dataset, and this means having a fixed directory structure and only referencing files.
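To illustrate the first point, hashing a single-file checkpoint needs no custom algorithm: the file is simply streamed through a hash function. The sketch below is only an illustration; the function name `checkpoint_hash` is hypothetical, and SHA3-256 is an assumption made for the example, not necessarily the hash or encoding the specification mandates.

```python
import hashlib
from pathlib import Path


def checkpoint_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a single-file checkpoint.

    Hypothetical helper: the hash function (SHA3-256) is an assumption
    chosen for illustration only.
    """
    digest = hashlib.sha3_256()
    with path.open("rb") as f:
        # Read in fixed-size chunks so large checkpoints are not
        # loaded into memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A directory has no equivalent one-liner: any directory hash would have to define entry ordering, name encoding, and the handling of permissions and links, which is the complexity this RFC aims to avoid.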

Guide-level explanation

Reference-level explanation

The specification will be updated to no longer refer to checkpoints as opaque directories.

The temporary ExecuteQueryRequest schema that relies on file mounting will be updated.

Drawbacks

  • Engines that produce multiple files as checkpoints will need extra tar / untar steps.
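As a rough sketch of what these extra steps could look like, an engine adapter might pack its checkpoint directory into one tar file after a query and unpack it before the next one. The names `pack_checkpoint` and `unpack_checkpoint` are hypothetical and not part of any ODF API. The unpack step also rejects symlinks and paths that escape the target directory, in line with the security concern raised in the motivation.

```python
import tarfile
from pathlib import Path


def pack_checkpoint(src_dir: Path, dest_file: Path) -> None:
    """Pack a multi-file checkpoint directory into a single tar file."""
    with tarfile.open(dest_file, "w") as tar:
        # Sort entries so the member order does not depend on the filesystem.
        for path in sorted(src_dir.rglob("*")):
            name = path.relative_to(src_dir).as_posix()
            tar.add(path, arcname=name, recursive=False)


def unpack_checkpoint(src_file: Path, dest_dir: Path) -> None:
    """Unpack a single-file checkpoint, allowing only plain files and dirs."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(src_file, "r") as tar:
        members = tar.getmembers()
        for member in members:
            member_path = Path(member.name)
            # Refuse symlinks, hard links, devices and other special entries.
            if not (member.isfile() or member.isdir()):
                raise ValueError(f"unsupported entry type: {member.name}")
            # Refuse entries that would be written outside dest_dir.
            if member_path.is_absolute() or ".." in member_path.parts:
                raise ValueError(f"unsafe path: {member.name}")
        tar.extractall(dest_dir, members=members)
```

Note that a tar archive records file metadata such as modification times, so packing the same directory twice does not necessarily yield byte-identical archives. This does not matter as long as the hash is computed over the resulting file rather than recomputed from the unpacked contents.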

Rationale and alternatives

  • Since ODF implements its own content-addressable storage, we could support “tree” structures in it just like Git does. This would, however, come at a high cost in complexity and is not justified at the current stage.

Prior art

Unresolved questions

  • What’s the most efficient checkpoint management scheme for a long-term solution?
    • Can it be mmap’ed files, as in our goal for data slices?

Future possibilities

  • This is a necessary step for implementing a sync protocol for pulling datasets from IPFS and other storage systems that provide an HTTP gateway.