Issue:

PR:

Start Date: 2025-06-14

Published Date: 2025-08-25

Authors:

Compatibility:

  • Backwards-compatible
  • Forwards-compatible

Summary

This RFC introduces an experimental way to link large binary objects such as images, videos, and documents to a dataset.

Motivation

While the Parquet format supports large binary columns, it is sometimes inefficient to keep binary data within the structured records of a dataset. Reasons may include:

  • Downloading the core dataset much faster, so it can be queried and filtered before deciding which of the large binaries to download
  • Storing large binaries in a separate storage from the dataset while maintaining tamper-proof properties (e.g. fast storage for dataset, glacier/tape for infrequently accessed binaries)
  • Resumable uploads / downloads of large binaries
  • Streaming binary data directly from storage without proxying data through a server (e.g. for image, video, and document previews).

NOTE: For streaming, while it is technically possible to obtain the byte range of a binary value within a Parquet file and use the Range: bytes=start-end header to access it, many preview systems accept only a URL and don’t support streaming within a specified byte range, or the value in Parquet may lie in the middle of a compressed chunk, requiring special handling.

Guide-level explanation

Considering the above, we decided to provide the ability to attach external binaries to ODF datasets.

In order to preserve the tamper-proof properties of ODF datasets we refer to external binaries only by their hashes. This ensures that a file cannot be modified without a corresponding change in the dataset, and that the dataset’s head block hash acts as a Merkle root of the dataset and its linked objects.

The links (hashes) of the external binaries will be specified in data columns (see rejected alternatives). The metadata chain will only contain summary information that allows counting the linked objects and calculating their total size.

To mark a column that stores links to external objects we will introduce a new ObjectLink logical type, with links initially expressed in the multihash format. During this early stage we don’t have enough confidence in this approach to make ObjectLink and Multihash part of the core ODF schema, so we will first introduce them via the extra attributes mechanism.

Initially we will only support linked objects that are stored alongside the data (embedded) in a dataset, but we may expand this in the future towards sharing objects between multiple datasets and using separate storage.

Reference-level explanation

Example of defining a column using the extended ObjectLink type:

- name: content_hash
  type:
    kind: String
  extra:
    opendatafabric.org/type:
      kind: ObjectLink
      linkType:
        kind: Multihash

Linked object data will be stored in the /data/<multihash> section of the dataset, together with the data chunks.
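
For illustration, the /data section of a dataset with linked objects could then contain a mix of entries like the following (the hashes are hypothetical placeholders):

/data
  f1620<...>   # Parquet data chunk
  f1620<...>   # Parquet data chunk
  f1620<...>   # linked binary object (e.g. an image)
  f1620<...>   # linked binary object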

Example schema of a dataset that links to externally-stored MRI images:

fields:
- name: offset
  type:
    kind: UInt64
- name: op
  type:
    kind: UInt8
- name: system_time
  type:
    kind: Timestamp
    unit: Millisecond
    timezone: UTC
- name: event_time
  type:
    kind: Timestamp
    unit: Millisecond
    timezone: UTC
- name: mri_content_hash
  type:
    kind: String
  extra:
    opendatafabric.org/type:  # [1]
      kind: ObjectLink  # [2]
      linkType:
        kind: Multihash  # [3]
- name: patient_id
  type:
    kind: String
  extra:
    opendatafabric.org/type:
      kind: Did  # [4]

[1] New experimental attribute opendatafabric.org/type that defines an extended set of field types.
[2] New extended logical type ObjectLink signifies that the value references an external object. The mandatory linkType property defines the type of the link.
[3] New extended logical type Multihash signifies a String in a self-describing multihash format.
[4] The same extended type mechanism is used for the Did type.
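
For illustration, a single record conforming to this schema might look as follows (all values are hypothetical placeholders, with the multihash and DID truncated):

offset: 2166
op: 0
system_time: 2022-08-12T03:41:37Z
event_time: 2022-08-11T18:22:05Z
mri_content_hash: f1620<...>  # multihash of the image object stored under /data
patient_id: did:<...>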

The AddData event will be extended with linkedObjects summary:

event:
  kind: AddData
  prevOffset: 2165
  newData:
    logicalHash: f9680c001200456651f4be2a8c1a2404c89c6356ce875969d69d93333f5ae97155ff996c4a7
    physicalHash: f162014678f832e823c1429eae7d178e0af1957291372b15171414976c04223026581
    offsetInterval:
      start: 2166
      end: 6644
    size: 554005
  newWatermark: 2022-08-12T03:41:37Z
  extra:
    opendatafabric.org/linkedObjects:  # [5]
      numObjectsNaive: 1123  # [6]
      sizeNaive: 100500  # [6]

[5] New opendatafabric.org/linkedObjects custom attribute contains a summary section that makes it possible to determine, from metadata alone, how many external objects were associated with a certain AddData event and what their total size is, without querying the individual Parquet data chunks.
[6] The object count and total size are labeled “naive” as they do NOT account for possible duplicates (see notes).

Impact on existing functionality

To properly support this extension, implementations will need to:

  • Allow upload/download of opaque binary data in the /data section of the dataset
  • Update the ingest procedure to ensure referential integrity, i.e. that hashes appearing in such columns have corresponding data objects uploaded
  • Update compaction procedures to combine linked object summaries (see the sketch after this list)
  • Update smart and simple transfer protocols with an additional stage that downloads/uploads the external data chunks.
  • Update dataset data size summaries to separately mention the number of linked objects and their total size.
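
As a rough sketch of what combining the summaries during compaction could look like (assuming plain addition of the naive counters; the exact rules are left to implementations), two AddData events carrying:

opendatafabric.org/linkedObjects:
  numObjectsNaive: 1123
  sizeNaive: 100500

opendatafabric.org/linkedObjects:
  numObjectsNaive: 200
  sizeNaive: 5000

could be compacted into a single event carrying:

opendatafabric.org/linkedObjects:
  numObjectsNaive: 1323
  sizeNaive: 105500

Because the counters are naive, plain addition remains consistent even when the two data chunks reference some of the same objects.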

Compatibility

The change is backwards and forwards compatible in the sense that introducing the new logical type and extending the AddData event will not break old clients - they will be able to work with updated datasets, although they will not transport the external objects along with them.

Drawbacks

Rationale and alternatives

Rejected: Linking binaries via metadata chain

The initial idea was to introduce a special event or to extend AddData to allow linking binaries. This would make it possible to identify which binaries are linked to the dataset using only the metadata chain.

This however introduces many problems:

  • Duplication of information between metadata and data chunks
  • Having two “sources of truth”, which opens up an avenue for dangling references
  • Issue of compact storage - metadata was not meant to store a lot of information, but this opens up the possibility of it containing millions of linked object entries.

We rejected this path in favor of storing links only in data chunks, even though it also has drawbacks:

  • We now need to read the data chunks in order to see all linked objects, which complicates the process of scanning

Prior art

Unresolved questions

Linked objects in derivative datasets

It’s not clear how linked objects should be treated in derivative datasets. We may want to at least support plain propagation of ObjectLink columns. In this case we likely want to reuse the data objects from the input datasets instead of duplicating them in the derivative dataset. This entails cross-dataset content links, which may complicate size computation and garbage collection.

Future possibilities

External storage support

While the initial implementation aims to store large files as part of the dataset itself, we can extend this mechanism in future to support external storages for large objects.

Ideally the choice of storage could vary and be selected on a per-node basis, e.g. one node may keep all data together, while another keeps objects in a separate, cheaper bucket. This means that the configuration of where linked objects are stored should be separate from the ObjectLink column definition.
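
Purely as an illustration of such per-node configuration (the keys and structure below are hypothetical and not part of any current specification), a node could map the two kinds of content to different storages:

storage:
  datasets: s3://fast-bucket/          # metadata chain and Parquet data chunks
  linkedObjects: s3://archive-bucket/  # large, infrequently accessed binaries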

The IPFS case is tricky, as a CID is not a plain hash of the data - it can vary due to slicing settings. We likely want to store references that represent the content hash independently of the physical file layout of the object, which will require an additional layer of mapping between hash references and CIDs.
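
A node that pins linked objects to IPFS might therefore maintain a mapping along the following lines (purely illustrative; the values are truncated placeholders and the location and format of such a mapping are not defined here):

# content hash reference (as stored in the dataset) -> CID of the pinned object
f1620<...>: bafy<...>
f1620<...>: bafy<...>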

Accounting for duplicates in size calculation

We chose to populate metadata with sizeNaive as we don’t want to require deduplication of references. Deduplication across the whole dataset would involve complex queries spanning multiple data chunks, which is impractical during the ingest stage.

We cannot rely on checking whether the object store previously contained the referenced object during upload, as this would require complex accumulation of state during uploads and passing it to ingest. It would still likely be incorrect in the presence of different merge strategies, retractions, compactions, and purging of history.

Even if we performed full deduplication within a dataset during every ingest, additional deduplication may happen at the storage level depending on the storage used by a node (e.g. across multiple datasets), so the size calculation may still over-estimate the footprint.

We therefore decide to keep “naive” size counting as a very rough estimate and leave it up to individual node implementations to perform exact size calculations, e.g. in a heavy background process, if desired.