Start Date: 2022-04-19
Summary
This RFC proposes to no longer allow arbitrary files and directory structures as engine checkpoints and to require that a checkpoint always be a single file.
Motivation
Currently, an engine can produce a checkpoint that includes an arbitrary structure of files and directories. This presents a few problems:
- Metadata blocks need to refer to checkpoints by hash, but there is no standard approach to computing a hash of a directory. We would need to create a stable directory hashing algorithm ourselves, and this complexity would spread to all implementations.
- Allowing arbitrary file structures is also a security concern, e.g. we would need to make sure engines don’t create unexpected symlinks.
- When downloading a dataset from a repository, many transfer protocols don’t have a standard way to list a directory. One such example is HTTP: there is no standard content format for GET on a directory, and most web servers will return a styled HTML page. Similarly to how Git can clone a repo using the “dumb protocol”, we’d like to be able to walk and download an entire dataset, and this means having a fixed directory structure and only referencing files.
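To illustrate the first point, hashing a single-file checkpoint needs no custom algorithm: the file's bytes are fed to a standard hash function. The sketch below is only an illustration. The function name `hash_checkpoint_file` is hypothetical, and SHA3-256 is an assumed choice of hash function, not something this RFC prescribes.

```python
import hashlib
from pathlib import Path


def hash_checkpoint_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a single checkpoint file.

    SHA3-256 is an assumption made for this sketch; any standard
    hash function would be used the same way.
    """
    digest = hashlib.sha3_256()
    with path.open("rb") as f:
        # Read in chunks so a large checkpoint does not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A directory has no equivalent one-liner: an implementation would first have to agree on entry ordering, path encoding, and which metadata (permissions, timestamps, symlinks) participates in the hash.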
Guide-level explanation
Reference-level explanation
The specification will be updated to no longer refer to checkpoints as opaque directories. The temporary ExecuteQueryRequest schema that relies on file mounting will be updated.
Drawbacks
- Engines that produce multiple files as checkpoints will need extra tar/untar steps.
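The extra step could look roughly like the following sketch, which uses Python's standard `tarfile` module. The function names `pack_checkpoint` and `unpack_checkpoint` are hypothetical and not part of any ODF or engine API. The sketch assumes the engine wants a reproducible archive, so it sorts entries and uses fixed metadata (zero mtime, default owner and mode), and it rejects symlinks and other non-regular entries in line with the security concern raised in the motivation.

```python
import shutil
import tarfile
from pathlib import Path


def pack_checkpoint(src_dir: Path, dest_tar: Path) -> None:
    """Pack all regular files under src_dir into one uncompressed tar file."""
    entries = sorted(
        src_dir.rglob("*"), key=lambda p: p.relative_to(src_dir).as_posix()
    )
    with tarfile.open(dest_tar, "w") as tar:
        for path in entries:
            if path.is_symlink():
                raise ValueError(f"symlinks are not allowed: {path}")
            if not path.is_file():
                continue  # directories are implied by the file paths
            # A fresh TarInfo has mtime=0, uid=gid=0 and mode=0o644, so the
            # archive bytes depend only on file names and contents.
            info = tarfile.TarInfo(name=path.relative_to(src_dir).as_posix())
            info.size = path.stat().st_size
            with path.open("rb") as f:
                tar.addfile(info, f)


def unpack_checkpoint(src_tar: Path, dest_dir: Path) -> None:
    """Unpack a checkpoint archive, refusing anything but regular files."""
    dest_root = dest_dir.resolve()
    with tarfile.open(src_tar, "r:") as tar:
        for member in tar:
            if not member.isfile():
                raise ValueError(f"unexpected non-regular entry: {member.name}")
            target = (dest_root / member.name).resolve()
            # Reject absolute paths and "../" escapes from the destination.
            if dest_root not in target.parents:
                raise ValueError(f"entry escapes destination: {member.name}")
            target.parent.mkdir(parents=True, exist_ok=True)
            src = tar.extractfile(member)
            with src, target.open("wb") as out:
                shutil.copyfileobj(src, out)
```

An engine would call `pack_checkpoint` after writing its state and `unpack_checkpoint` before restoring it, so that only the single archive file crosses the engine boundary.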
Rationale and alternatives
- Since ODF implements its own content-addressable storage, we could support “tree” structures in it just like Git does. This would, however, come at a high cost in complexity, which is not justified at the current stage.
Prior art
- Git’s bare repository format and “dumb sync protocol” don’t rely on directory listing
Unresolved questions
- What’s the most efficient checkpoint management scheme for a long-term solution?
- Can it be mmap’ed files as in our goal for data slices?
Future possibilities
- This is a necessary step for implementing a sync protocol for pulling datasets from IPFS and other storage systems that provide an HTTP gateway.