RFC-006: Store Checkpoints as Files
Start Date: 2022-04-19
Summary
This RFC proposes to no longer allow arbitrary files and directory structures as engine checkpoints and to require that a checkpoint always be a single file.
Motivation
Currently, an engine can produce a checkpoint that contains an arbitrary structure of files and directories. This presents a few problems:
Metadata blocks need to refer to checkpoints by hash, but there is no standard approach to computing a hash of a directory. We would need to create a stable directory-hashing algorithm ourselves, and this complexity would spread to all implementations.
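For comparison, hashing a single-file checkpoint is trivial to do in a stable, implementation-independent way - the file is simply streamed through a digest. A minimal sketch (SHA3-256 and the `sha3` crate are illustrative choices, not something this RFC prescribes):

```rust
use sha3::{Digest, Sha3_256};
use std::{fs::File, io::Read, path::Path};

/// Computes the hash of a single-file checkpoint by streaming its contents
/// through a digest (SHA3-256 here purely for illustration).
fn hash_checkpoint_file(path: &Path) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    let mut hasher = Sha3_256::new();
    let mut buf = [0u8; 8192];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize().to_vec())
}
```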
Allowing arbitrary file structures is also a security concern - e.g. we need to make sure engines don't create malicious symlinks.
When downloading a dataset from a repository, many transfer protocols don't have a standard way to list a directory. HTTP is one example: there is no standard content format for a `GET` request on a directory - most web servers will return styled HTML. Similarly to how Git can clone a repository using the "dumb protocol", we'd like to be able to walk and download an entire dataset, and this means having a fixed directory structure and referencing only files.
Guide-level explanation
Reference-level explanation
The specification will be updated to no longer refer to checkpoints as opaque directories.
The temporary `ExecuteQueryRequest` schema that relies on file mounting will be updated.
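For illustration only - the field names below are hypothetical and the actual schema change is defined by the specification - the update amounts to replacing mounted checkpoint directories with single checkpoint file paths:

```rust
use std::path::PathBuf;

/// Hypothetical "before" shape of the checkpoint-related fields.
struct ExecuteQueryRequestBefore {
    /// Directory with the previous checkpoint mounted into the engine container
    prev_checkpoint_dir: Option<PathBuf>,
    /// Directory where the engine writes the new checkpoint
    new_checkpoint_dir: PathBuf,
}

/// Hypothetical "after" shape once checkpoints are single files.
struct ExecuteQueryRequestAfter {
    /// Path of the previous single-file checkpoint, if any
    prev_checkpoint_path: Option<PathBuf>,
    /// Path where the engine must write the new checkpoint as one file
    new_checkpoint_path: PathBuf,
}
```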
Drawbacks
- Engines that produce multiple files as checkpoints will need extra `tar`/`untar` steps, as sketched below
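A minimal sketch of such a wrapper, assuming the Rust `tar` crate (the packaging format is left to engine implementations and is not prescribed by this RFC):

```rust
use std::fs::File;
use std::path::Path;

/// Packs a multi-file checkpoint directory into a single tarball.
fn pack_checkpoint(dir: &Path, tarball: &Path) -> std::io::Result<()> {
    let file = File::create(tarball)?;
    let mut builder = tar::Builder::new(file);
    builder.append_dir_all(".", dir)?;
    builder.finish()
}

/// Restores the directory structure from a single-file checkpoint.
fn unpack_checkpoint(tarball: &Path, dir: &Path) -> std::io::Result<()> {
    let file = File::open(tarball)?;
    let mut archive = tar::Archive::new(file);
    archive.unpack(dir)
}
```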
Rationale and alternatives
- Since ODF implements its own content-addressable storage, we could support "tree" structures in it just like Git does. However, this would come at a high cost in complexity and is not justified at the current stage.
Prior art
- Git’s bare repository format and “dumb sync protocol” don’t rely on directory listings
Unresolved questions
- What’s the most efficient checkpoint management scheme as a long-term solution?
- Could checkpoints be mmap’ed files, as is our goal for data slices?
Future possibilities
- This is a necessary step towards implementing a sync protocol for pulling datasets from IPFS and other storage systems that provide an HTTP gateway.