# RFC-006: Store Checkpoints as Files

export const YouTubeList = ({id}) => {
  const src = `https://www.youtube.com/embed/videoseries?list=${id}`;
  return <iframe className="w-full aspect-video rounded-xl" src={src} allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen></iframe>;
};

export const YouTube = ({id, width}) => {
  const src = `https://www.youtube.com/embed/${id}`;
  return <iframe className="w-full aspect-video rounded-xl" src={src} allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen width={width}></iframe>;
};

export const Schema = ({t, id}) => {
  const anchor = id ? id : t.toLowerCase().replace(/\s+/g, "-");
  const link = `/odf/schemas#${anchor}`;
  return <a class="schema-object" href={link}>{t}</a>;
};

export const Term = ({t, id}) => {
  const anchor = id ? id : t.toLowerCase().replace(/\s+/g, "-");
  const link = `/general/glossary#${anchor}`;
  return <a class="glossary-term" href={link}>{t}</a>;
};

export const Diagram = ({src, alt}) => {
  return <div style={{
    display: "flex",
    "flex-direction": "column",
    "align-items": "center"
  }}>
    <img src={src} alt={alt} style={{
    background: "#dddddd",
    "margin-bottom": 0
  }} />
    <span>{alt}</span>
  </div>;
};

**Start Date**: 2022-04-19

[![RFC Status](https://img.shields.io/github/issues/detail/state/kamu-data/open-data-fabric/24?label=RFC%20Status)](https://github.com/kamu-data/open-data-fabric/issues/24)

[![Spec PR](https://img.shields.io/github/pulls/detail/state/kamu-data/open-data-fabric/25?label=Spec%20PR)](https://github.com/kamu-data/open-data-fabric/pull/25)

## Summary

This RFC proposes to no longer allow arbitrary file and directory structures as engine checkpoints, and to require that a checkpoint always be a single file.

## Motivation

Currently, an engine can produce a checkpoint that includes an arbitrary structure of files and directories. This presents a few problems:

1. Metadata blocks need to refer to checkpoints by hash, but there is no standard approach to computing the hash of a directory. We would need to create a stable directory hashing algorithm ourselves, and this complexity would spread to all implementations.

2. Allowing arbitrary file structures is also a security concern, e.g. we would need to make sure that engines don't create unexpected symlinks.

3. When downloading a dataset from a repository, many transfer protocols don't have a standard way to list a directory. One such example is HTTP: there is no standard content format for a `GET` on a directory, and most web servers will return a styled HTML page. Similarly to how Git can clone a repo using the ["dumb protocol"](https://git-scm.com/book/en/v2/Git-Internals-Transfer-Protocols), we'd like to be able to walk and download an entire dataset, and this means having a fixed directory structure and only referencing files.
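
To make the first point concrete, the sketch below contrasts hashing a single file with hashing a directory. It is illustrative only: `hash_file` and `hash_dir` are hypothetical helpers, SHA3-256 is picked just for the example, and the directory scheme shown is one of many possible conventions, not something this RFC or the specification defines.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path) -> str:
    """Hash a single file: the only convention needed is the hash function."""
    h = hashlib.sha3_256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_dir(root: Path) -> str:
    """One hypothetical directory hashing scheme (an assumption, not a standard).

    Every step below is a decision that all implementations would have to
    agree on: entry ordering, path encoding, and how names and contents are
    framed. Empty directories, symlinks, and permissions are ignored here,
    which is yet another set of decisions.
    """
    h = hashlib.sha3_256()
    # Decision: order entries by their relative POSIX-style path.
    rel_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    for rel in rel_paths:
        name = rel.encode("utf-8")  # Decision: paths are encoded as UTF-8.
        # Decision: length-prefix the name so entry boundaries are unambiguous.
        h.update(len(name).to_bytes(8, "big"))
        h.update(name)
        h.update(bytes.fromhex(hash_file(root / rel)))
    return h.hexdigest()
```

With single-file checkpoints only something like `hash_file` is needed, and none of the conventions baked into `hash_dir` have to be standardized.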

## Guide-level explanation

## Reference-level explanation

The specification will be updated to no longer refer to checkpoints as opaque directories.

The temporary `ExecuteQueryRequest` schema that relies on file mounting will be updated.

## Drawbacks

* Engines that produce multiple files as checkpoints will need extra `tar` / `untar` steps
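
As a rough illustration of what these extra steps could look like, here is a minimal sketch using Python's standard `tarfile` module. The function names `pack_checkpoint` and `unpack_checkpoint` and the file names in the demo are hypothetical, and the choice to reject symlinks and paths that escape the target directory is an assumption motivated by the security concern above, not a requirement stated in this RFC.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def pack_checkpoint(src_dir: Path, dest_file: Path) -> None:
    """Pack a multi-file checkpoint directory into a single uncompressed tar file."""
    with tarfile.open(dest_file, "w") as tar:
        # Sorted order keeps the member order independent of the
        # filesystem's directory listing order.
        for path in sorted(src_dir.rglob("*")):
            # Assumption: only regular files and directories are allowed.
            if path.is_symlink() or not (path.is_file() or path.is_dir()):
                raise ValueError(f"unsupported checkpoint entry: {path}")
            arcname = path.relative_to(src_dir).as_posix()
            tar.add(path, arcname=arcname, recursive=False)


def unpack_checkpoint(src_file: Path, dest_dir: Path) -> None:
    """Restore a checkpoint directory from a single tar file."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    root = dest_dir.resolve()
    with tarfile.open(src_file, "r") as tar:
        for member in tar:
            target = (root / member.name).resolve()
            # Refuse entries that would land outside the destination directory.
            if target != root and root not in target.parents:
                raise ValueError(f"entry escapes checkpoint directory: {member.name}")
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
            elif member.isfile():
                target.parent.mkdir(parents=True, exist_ok=True)
                source = tar.extractfile(member)
                with source, target.open("wb") as out:
                    shutil.copyfileobj(source, out)
            else:
                # Symlinks, hard links, devices, etc. are rejected.
                raise ValueError(f"unsupported archive entry: {member.name}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        state = work / "engine-state"  # hypothetical multi-file checkpoint
        (state / "shards").mkdir(parents=True)
        (state / "manifest.json").write_text("{}")
        (state / "shards" / "0.bin").write_bytes(b"\x00\x01")

        pack_checkpoint(state, work / "checkpoint.tar")
        unpack_checkpoint(work / "checkpoint.tar", work / "restored")
        restored = work / "restored"
        print(sorted(p.relative_to(restored).as_posix() for p in restored.rglob("*")))
```

The resulting `checkpoint.tar` is the single file that would be hashed and referenced from metadata; the engine sees the unpacked directory again after `unpack_checkpoint`.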

## Rationale and alternatives

* Since ODF implements its own content-addressable storage, we could support "tree" structures in it just like Git does. This would, however, come at a high cost in complexity and is not justified at the current stage.

## Prior art

* Git's bare repository format and ["dumb sync protocol"](https://git-scm.com/book/en/v2/Git-Internals-Transfer-Protocols) don't rely on directory listing

## Unresolved questions

* What's the most efficient checkpoint management scheme for a long-term solution?
  * Can it be mmap'ed files, as in our goal for data slices?

## Future possibilities

* This is a necessary step for implementing a sync protocol for pulling datasets from IPFS and other storage systems that provide an HTTP gateway.
