RFC-010: Data Schema in Metadata
Start Date: 2023-10-25
- [x] Backwards-compatible
Introduces a new metadata event `SetDataSchema` that specifies the Arrow schema of the Data Slices added to the dataset following this event.
We currently don’t store the schema of Data Slice files in metadata. The closest thing we have is the read-stage schemas defined in the `SetPollingSource` event, but those schemas are only applied to external data when reading it, can be reshaped by the preprocess queries, and also get extended with system columns.
This creates several problems:
- One cannot understand the data schema from metadata alone and needs to inspect the last Parquet file, which can be costly
- The data schema is effectively undefined until the first data file is written
- It’s not possible to pre-define the dataset schema before writing data into it (which poses a problem for “push”-style ingestion, see RFC-011)
A new metadata event `SetDataSchema` is introduced that contains the full schema of the Data Slice files written following it.
`SetDataSchema` events can appear multiple times in the metadata chain, allowing for schema evolution.
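As a sketch of how schema evolution plays out, the chain below is modeled as an ordered list of `(event_type, payload)` tuples with simplified string payloads (this representation is an assumption for illustration, not the actual ODF encoding):

```python
# Each data slice is governed by the closest SetDataSchema event
# preceding it in the metadata chain.
chain = [
    ("SetDataSchema", "schema-v1"),
    ("AddData", "slice-1"),        # written under schema-v1
    ("SetDataSchema", "schema-v2"),
    ("AddData", "slice-2"),        # written under schema-v2
]

def schema_for_slice(chain, slice_payload):
    """Return the schema announced by the latest SetDataSchema
    event that precedes the given data slice."""
    current = None
    for kind, payload in chain:
        if kind == "SetDataSchema":
            current = payload
        elif kind == "AddData" and payload == slice_payload:
            return current
    raise KeyError(slice_payload)

print(schema_for_slice(chain, "slice-1"))  # prints "schema-v1"
print(schema_for_slice(chain, "slice-2"))  # prints "schema-v2"
```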
The `SetDataSchema` event will not normally be specified in Dataset Snapshots, but will instead be generated by ODF implementations by inferring the final schema based on:
- First root ingestion / derivative transformation to take place
- Column names defined in the dataset vocabulary

`SetDataSchema` event schema:
- Event description: “Specifies the complete schema of Data Slices added to the Dataset following this event.”
- Schema field description: “Apache Arrow schema encoded in its native flatbuffers representation.”
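Assembled into a single definition, the event's JSON Schema plausibly looks like the sketch below (the `schema` property name and its `string`/`flatbuffers` typing are assumptions based on the descriptions above):

```json
{
  "description": "Specifies the complete schema of Data Slices added to the Dataset following this event.",
  "type": "object",
  "properties": {
    "schema": {
      "type": "string",
      "format": "flatbuffers",
      "description": "Apache Arrow schema encoded in its native flatbuffers representation."
    }
  }
}
```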
Note the introduction of the new format `flatbuffers`. It will handle the data as a plain byte array. Libraries will be able to add special accessors that provide typed access to the underlying object, in this case the Arrow `Schema` type. Due to the nature of flatbuffers this can be done without extra allocations.
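The byte-array-plus-typed-accessor pattern can be sketched in Python. Here `decode_arrow_schema` is a hypothetical stand-in that decodes JSON; a real implementation would wrap the buffer in the generated flatbuffers Arrow `Schema` accessor instead:

```python
import json
from functools import cached_property

def decode_arrow_schema(raw: bytes) -> dict:
    # Hypothetical stand-in: a real implementation would wrap `raw` in the
    # flatbuffers-generated Schema accessor and read fields directly out of
    # the buffer, with no extra allocations. JSON keeps the sketch runnable.
    return json.loads(raw)

class SetDataSchema:
    """Metadata event that stores the schema as an opaque byte array."""
    def __init__(self, schema_bytes: bytes):
        self.schema = schema_bytes  # serialized into the chain as-is

    @cached_property
    def schema_typed(self) -> dict:
        # Typed view over the raw buffer, decoded lazily on first access.
        return decode_arrow_schema(self.schema)

event = SetDataSchema(b'{"fields": [{"name": "offset", "type": "int64"}]}')
print(event.schema_typed["fields"][0]["name"])  # prints "offset"
```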
This change is backwards-compatible, as existing implementations can continue to fall back to reading the schema from Data Slice files if the `SetDataSchema` event is not found.
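The fallback can be sketched as a newest-to-oldest scan of the chain (the tuple-based event representation and the `read_data_file_schema` callback are assumptions for illustration):

```python
def resolve_schema(chain, read_data_file_schema):
    """Prefer the most recent SetDataSchema event; if none exists (a
    pre-RFC-010 dataset), fall back to the costly route of reading the
    schema out of the last data file."""
    last_file = None
    for kind, payload in reversed(chain):
        if kind == "SetDataSchema":
            return payload
        if kind == "AddData" and last_file is None:
            last_file = payload
    return read_data_file_schema(last_file) if last_file else None

# A dataset written before this RFC has no SetDataSchema event:
old_chain = [("SetPollingSource", "src"), ("AddData", "slice-1")]
print(resolve_schema(old_chain, lambda f: f"schema-of-{f}"))
# prints "schema-of-slice-1"
```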
This change is forwards-incompatible, as older implementations will fail to parse the new core event (we still lack a mechanism to mark some events as “OK to disregard”).
Drawbacks:
- More complexity in metadata
Event order rules
It was debated whether to require the existence of `SetDataSchema` prior to the definition of any ingest sources or transformations.
This would require:
- Either making users specify the schema explicitly - deemed too complex and impractical.
- Or having implementations infer the schema from the defined sources - deemed likely to slow down dataset creation from snapshots, since inferring the schema would require instantiating an Engine.
A compromise was reached: `SetDataSchema` is required before any Data Slices are added, but it may follow the `SetTransform` events from which it will be inferred during the first processing.
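Under these rules a minimal order validator might look like the following (the chain is modeled as a plain list of event-type names; purely illustrative):

```python
def validate_event_order(chain):
    """Enforce the compromise: SetDataSchema may come after SetTransform
    (the schema is inferred during the first processing), but must
    precede any event that adds Data Slices."""
    schema_defined = False
    for kind in chain:
        if kind == "SetDataSchema":
            schema_defined = True
        elif kind == "AddData" and not schema_defined:
            raise ValueError("Data Slice added before SetDataSchema")

validate_event_order(["SetTransform", "SetDataSchema", "AddData"])  # ok
try:
    validate_event_order(["SetTransform", "AddData"])
except ValueError as e:
    print(e)  # prints "Data Slice added before SetDataSchema"
```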
We chose to use Arrow schema instead of Parquet because:
- Arrow is our cross-system data interchange format
- Arrow schema is more expressive and can represent types more accurately (e.g. timezones)
- We care more about the logical schema than about how data appears on disk, and this will allow us in future to support other on-disk formats as long as their schemas can be mapped to Arrow.
Arrow Schema currently does not have a standard human-readable representation - it is specified only by its flatbuffers IDL.
We therefore store the schema as a flatbuffer-in-flatbuffer in the metadata chain, which gives us optimal encoding, and leave it up to implementations to present it to users in a human-readable format of their choice.

For now, we don’t expect users to ever have to specify the schema manually, but if needed we could explore flatbuffers’ own standard way of serializing data to JSON.