RFC-010: Data Schema in Metadata
Start Date: 2023-10-25
Authors:
Compatibility:
- [x] Backwards-compatible
- [ ] Forwards-compatible
Summary
Introduces a new metadata event SetDataSchema that specifies the Arrow schema of the Data Slices added to the dataset following this event.
Motivation
We currently don’t store the schema of Data Slice files in metadata.
The closest thing we have is the read-stage schemas defined in the SetPollingSource event, but those schemas are only applied to external data when reading it, can be reshaped by the preprocess queries, and also get extended with system columns.
This creates several problems:
- One cannot understand the data schema from metadata alone and needs to inspect the last Parquet file, which can be costly
- The data schema is effectively undefined until the first data file is written
- It’s not possible to pre-define the dataset schema before writing data into it (which poses a problem for “push”-style ingestion, see RFC-011)
Guide-level explanation
A new metadata event SetDataSchema is introduced that will contain the full schema of the Data Slice files written following it. Multiple SetDataSchema events can appear in the metadata chain, allowing for schema evolution.
The SetDataSchema event will not normally be specified in Dataset Snapshots; instead, ODF implementations will generate it by inferring the final schema (a sketch of this inference follows the list below) based on:
- The first root ingestion or derivative transformation to take place
- The column names defined in SetVocab events
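For illustration only, a minimal sketch in Rust (using the arrow-schema crate) of what this inference could look like; the DatasetVocabulary type, the column names, and their types are hypothetical stand-ins for what SetVocab events provide, not part of this RFC:

use arrow_schema::{DataType, Field, Schema, TimeUnit};

/// Hypothetical stand-in for the column names carried by SetVocab events.
struct DatasetVocabulary {
    offset_column: String,
    system_time_column: String,
    event_time_column: String,
}

/// Sketch: combines the schema produced by the first root ingestion or
/// derivative transformation with the system columns named in SetVocab
/// to obtain the final schema recorded in the SetDataSchema event.
fn infer_final_schema(engine_output: &Schema, vocab: &DatasetVocabulary) -> Schema {
    let mut fields = vec![
        Field::new(&vocab.offset_column, DataType::Int64, false),
        Field::new(
            &vocab.system_time_column,
            DataType::Timestamp(TimeUnit::Millisecond, Some("UTC".into())),
            false,
        ),
        Field::new(
            &vocab.event_time_column,
            DataType::Timestamp(TimeUnit::Millisecond, Some("UTC".into())),
            false,
        ),
    ];
    fields.extend(engine_output.fields().iter().map(|f| f.as_ref().clone()));
    Schema::new(fields)
}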
Reference-level explanation
SetDataSchema event schema:
{
  "$id": "http://open-data-fabric.github.com/schemas/SetDataSchema",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "description": "Specifies the complete schema of Data Slices added to the Dataset following this event.",
  "type": "object",
  "additionalProperties": false,
  "required": [
    "schema"
  ],
  "properties": {
    "schema": {
      "type": "object",
      "format": "flatbuffers",
      "description": "Apache Arrow schema encoded in its native flatbuffers representation."
    }
  }
}
Note the introduction of the new format flatbuffers. It handles the data as a plain byte array. Libraries will be able to add special accessors that provide typed access to the underlying object, in this case the Arrow Schema type. Due to the nature of flatbuffers this can be done without extra allocations.
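For illustration, a minimal sketch in Rust of how an implementation might pair the opaque byte array with such a typed accessor. It assumes the arrow-ipc crate's convert helpers; the struct shape is hypothetical and not normative:

use arrow_ipc::convert::try_schema_from_flatbuffer_bytes;
use arrow_schema::{ArrowError, Schema};

/// As seen by generic metadata code: the schema is just an opaque byte array.
pub struct SetDataSchema {
    pub schema: Vec<u8>, // Arrow Schema in its native flatbuffers encoding
}

impl SetDataSchema {
    /// Typed accessor that interprets the bytes as an Arrow Schema.
    /// Note: this convenience conversion builds an owned Schema; fully
    /// zero-copy access is possible by walking the generated flatbuffers
    /// types directly.
    pub fn schema_as_arrow(&self) -> Result<Schema, ArrowError> {
        try_schema_from_flatbuffer_bytes(&self.schema)
    }
}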
Compatibility
This change is backwards-compatible, as existing implementations can continue to fall back to reading the schema from Data Slice files when the SetDataSchema event is not found.
This change is forwards-incompatible, as older implementations will fail to parse the new core event (we still lack a mechanism to mark some events as “OK to disregard”).
Drawbacks
- More complexity in metadata
Alternatives
Event order rules
It was debated whether to require the existence of SetDataSchema prior to the definition of any ingest sources or transformations.
This would require:
- Either making users specify the schema explicitly, which was deemed too complex and impractical
- Or having implementations infer the schema from the defined sources, which was deemed likely to slow down dataset creation from snapshots, since inferring the schema would require instantiating an Engine
A compromise was reached: SetDataSchema is required before any Data Slices are added, but it is allowed to follow the SetPollingSource and SetTransform events, from which it will be inferred during the first processing.
Schema format
We chose to use the Arrow schema instead of Parquet's because:
- Arrow is our cross-system data interchange format
- Arrow schema is more expressive and can represent types more accurately (e.g. timezones)
- We care more about the logical schema than about how the data appears on disk; this will allow us in the future to support other on-disk formats as long as their schemas can be mapped to Arrow
Schema representation
Arrow Schema currently does not have a standard human-readable representation; it is specified only by its flatbuffers schema.
We therefore store the schema as a flatbuffer-in-flatbuffer in the metadata chain, which gives us an optimal encoding (see the round-trip sketch below), and leave it up to implementations to present the schema to users in a human-readable format of their choice.
For now, we don’t expect users to ever have to specify the schema manually, but if needed we could explore flatbuffers’ own standard way of serializing data to JSON.
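A minimal round-trip sketch in Rust, assuming the arrow-ipc crate's convert helpers (recent versions may prefer an encoder API) and hypothetical example fields, showing how the stored bytes could be produced and read back:

use arrow_ipc::convert::{schema_to_fb, try_schema_from_flatbuffer_bytes};
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn main() -> Result<(), ArrowError> {
    let schema = Schema::new(vec![
        Field::new("city", DataType::Utf8, false),
        Field::new("population", DataType::UInt64, false),
    ]);

    // Encode into the native flatbuffers representation: these are the
    // bytes that would be embedded into the SetDataSchema event.
    let fb = schema_to_fb(&schema);
    let bytes = fb.finished_data();

    // Decode back, e.g. when presenting the schema to a user.
    let decoded = try_schema_from_flatbuffer_bytes(bytes)?;
    assert_eq!(decoded, schema);
    Ok(())
}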
See also:
- https://github.com/apache/arrow/issues/25078
- https://stackoverflow.com/questions/48215929/can-i-serialize-dserialize-flatbuffers-to-from-json
Prior art
N/A
Unresolved questions
N/A
Future possibilities
N/A