Issue Spec PR

Start Date: 2023-10-25

Authors:

Compatibility:

  • [x] Backwards-compatible
  • [ ] Forwards-compatible

Summary

Introduces a new metadata event SetDataSchema that specifies the Arrow schema of the Data Slices added to the dataset following this event.

Motivation

We currently don’t store the schema of Data Slice files in metadata.

The closest thing we have is the read-stage schemas defined in the SetPollingSource event, but those schemas apply only to external data as it is being read, can be reshaped by the preprocess queries, and also get extended with system columns.

This creates several problems:

  • One cannot understand the data schema from metadata alone and needs to inspect the last Parquet file, which can be costly
  • The data schema is effectively undefined until the first data file is written
  • It’s not possible to pre-define the dataset schema before writing data into it (which poses a problem for “push”-style ingestion, see RFC-011)

Guide-level explanation

A new metadata event SetDataSchema is introduced that will contain the full schema of the Data Slice files written following it.

Multiple SetDataSchema events can appear in the metadata chain, allowing for schema evolution.

The SetDataSchema event will not normally be specified in Dataset Snapshots; instead it will be generated by ODF implementations, which will infer the final schema based on (see the example chain below):

  • The first root ingestion or derivative transformation to take place
  • The column names defined in SetVocab events
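
For illustration, a metadata chain could then look as follows (the event ordering reflects the intended behavior; the surrounding events are merely an example):

1. Seed
2. SetPollingSource
3. SetVocab
4. SetDataSchema   <- inferred by the implementation during the first ingestion
5. AddData
6. AddData
7. SetDataSchema   <- schema evolution: subsequent Data Slices use the new schema
8. AddData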

Reference-level explanation

SetDataSchema event schema:

{
  "$id": "http://open-data-fabric.github.com/schemas/SetDataSchema",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "description": "Specifies the complete schema of Data Slices added to the Dataset following this event.",
  "type": "object",
  "additionalProperties": false,
  "required": [
    "schema"
  ],
  "properties": {
    "schema": {
      "type": "object",
      "format": "flatbuffers",
      "description": "Apache Arrow schema encoded in its native flatbuffers representation."
    }
  }
}

Note the introduction of the new format flatbuffers. It treats the data as a plain byte array, while libraries can add special accessors that provide typed access to the underlying object - in this case the Arrow Schema type. Due to the nature of flatbuffers this can be done without extra allocations.
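
For illustration, here is a minimal sketch of encoding and decoding such a schema using the Rust arrow crate (the crate choice and the example columns are assumptions for illustration, not part of the spec):

use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::ipc::convert::{schema_to_fb, try_schema_from_flatbuffer_bytes};

fn main() {
    // A hypothetical dataset schema, including ODF system columns
    let schema = Schema::new(vec![
        Field::new("offset", DataType::Int64, false),
        Field::new(
            "system_time",
            DataType::Timestamp(TimeUnit::Millisecond, Some("UTC".into())),
            false,
        ),
        Field::new(
            "event_time",
            DataType::Timestamp(TimeUnit::Millisecond, Some("UTC".into())),
            false,
        ),
        Field::new("city", DataType::Utf8, false),
        Field::new("population", DataType::UInt64, false),
    ]);

    // Encode into Arrow's native flatbuffers representation - these bytes
    // are what the `schema` property of SetDataSchema would carry
    let fb = schema_to_fb(&schema);
    let bytes: &[u8] = fb.finished_data();

    // Decode back; flatbuffers accessors read directly from the byte array,
    // so the schema can be inspected without extra allocations
    let decoded = try_schema_from_flatbuffer_bytes(bytes).expect("valid Arrow schema bytes");
    assert_eq!(schema, decoded);
}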

Compatibility

This change is backwards-compatible, as existing implementations can continue to fall back to reading the schema from Data Slice files when the SetDataSchema event is not found.

This change is forwards-incompatible, as older implementations will fail to parse the new core event (we still lack a mechanism for marking some events as “OK to disregard”).

Drawbacks

  • More complexity in metadata

Alternatives

Event order rules

It was debated whether to require the existence of SetDataSchema prior to the definition of any ingest sources or transformations.

This would require:

  • Either making users specify the schema explicitly - deemed too complex and impractical.
  • Or having implementations infer the schema from the defined sources - deemed something that would slow down dataset creation from snapshots, since inferring the schema requires instantiating an Engine.

A compromise was reached: SetDataSchema is required before any Data Slices are added, but it is allowed to follow the SetPollingSource and SetTransform events, from which it will be inferred during the first processing.

Schema format

We chose to use the Arrow schema instead of the Parquet schema because:

  • Arrow is our cross-system data interchange format
  • Arrow schema is more expressive and can represent types more accurately (e.g. timezones)
  • We care more about the logical schema than about how data appears on disk, and this will allow us in the future to support other on-disk formats as long as their schemas can be mapped to Arrow.

Schema representation

Arrow Schema currently does not have a standard human-readable representation - it is specified only by its flatbuffers schema.

We therefore store the schema as a flatbuffer-in-flatbuffer in the metadata chain, which gives us optimal encoding, and leave it up to implementations to present it to users in a human-readable format of their choice.

For now, we don't expect users to ever have to specify the schema manually, but if needed we could explore flatbuffers' own standard way of serializing data to JSON.
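
As one example of such a presentation, an implementation could decode the stored bytes and print a simple column listing (again a sketch assuming the Rust arrow crate; the output format is entirely up to the implementation):

use arrow::error::ArrowError;
use arrow::ipc::convert::try_schema_from_flatbuffer_bytes;

// Render a decoded Arrow schema as one "name: type (nullability)" line per
// column - just one possible human-readable presentation
fn print_schema(bytes: &[u8]) -> Result<(), ArrowError> {
    let schema = try_schema_from_flatbuffer_bytes(bytes)?;
    for field in schema.fields() {
        println!(
            "{}: {:?} ({})",
            field.name(),
            field.data_type(),
            if field.is_nullable() { "nullable" } else { "required" },
        );
    }
    Ok(())
}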


Prior art

N/A

Unresolved questions

N/A

Future possibilities

N/A