RFC-016: ODF Schema Format
Start Date: 2025-06-14
Published Date: 2025-08-25
Authors:
Compatibility:
- Backwards-compatible
- Forwards-compatible
Summary
This RFC introduces a new human-friendly, extensible schema format for ODF datasets. It replaces the use of Arrow flatbuffer schemas in the SetDataSchema event with an explicitly defined logical schema that can include rich metadata, such as column descriptions, logical types, and custom annotations. The new format balances human readability with future extensibility, making it easier to evolve schema definitions and metadata together, and adding support for custom attributes / annotations.
Motivation
We need the ability to attach extra information to data columns (such as descriptions, ontology, logical types) and to the dataset as a whole (such as archetypes, visualization options, etc.).
We also lack the ability to predefine the schema of a dataset in its manifest. We currently only have control over the “read” schema and the queries/transformations - the final schema of the dataset always ends up inferred. The ability to define the schema is a prerequisite for attaching any extra information to it.
Guide-level explanation
Currently in ODF the schema of the actual dataset is never defined in DatasetSnapshot manifests - it is inferred upon first processing and written in the SetDataSchema event using the Arrow schema flatbuffer format.
In order for us to enrich the schema with extra information we have two choices:
- Attach extra information to columns via external annotations (rejected)
- Provide the ability to explicitly define the schema as part of the manifest and make this definition extensible.
We pick the latter option as it:
- Avoids duplicating column names in several places
- Makes it more convenient to see everything related to a column in one place
- Ensures that the schema evolves along with its annotations - there is no risk of schema and annotations diverging
To explicitly define the schema we need a format that can be expressed within our manifests.
The first obvious candidate was Apache Arrow, which we already use. The only well-defined layout for an Arrow schema currently is its flatbuffers schema. The prior art section links a few tickets that discuss a common JSON representation, from which we draw inspiration.
We chose not to use Arrow schema and define our own schema format instead, because:
- The purpose of our schema is different - we want it to describe logical types and semantics of column values, rather than physical layout of data in files or in memory
- We want schema to be simple, expressive, and easy to define manually in a manifest
- We want to provide extensibility that Arrow schema lacks
Reference-level explanation
Flatbuffer schema
The old SetDataSchema flatbuffer definition:
table SetDataSchema {
  // Apache Arrow schema encoded in its native flatbuffers representation.
  schema: [ubyte];
}
Will be modified as follows:
table SetDataSchema {
  // DEPRECATED: Apache Arrow schema encoded in its native flatbuffers representation.
  raw_arrow_schema: [ubyte] (id: 0);

  // Defines the logical schema of the data files that follow this event. Will become a required field after migration.
  schema: DataSchema (id: 1);
}
This format is compatible with the old event format because flatbuffers only care about field IDs and not their names, so we are free to rename the old schema field to raw_arrow_schema and introduce the new schema field. Explicit field IDs are added as a reminder about the ongoing schema migration.
Manifest Schema Format
- kind: SetDataSchema
  schema: # [1]
    fields:
      - name: offset
        type:
          kind: UInt64
      - name: op
        type:
          kind: UInt8
      - name: system_time
        type:
          kind: Timestamp
          unit: Millisecond
          timezone: UTC
      - name: event_time
        type:
          kind: Timestamp
          unit: Millisecond
          timezone: UTC
        extra: # [2]
          a.com/a: foo
          b.com/x: bar
      - name: mri_content_hash
        type:
          kind: String
        extra: # [2]
          opendatafabric.org/description: References the MRI scan data linked to the dataset by its hash # [3]
          opendatafabric.org/type: # [4]
            kind: Multihash
      - name: subject
        description: Information about the subject
        type:
          kind: Struct
          fields: # [5]
            - name: id
              type:
                kind: String
              description: Subject's unique identity
              extra: # [2]
                opendatafabric.org/type: # [4]
                  kind: Did
            - name: gender
              type:
                kind: Option # [6]
                inner:
                  kind: String
              description: Subject's gender
      - name: area_codes
        description: List of body area codes covered by this MRI scan
        type:
          kind: List
          itemType:
            kind: String
        extra: # [2]
          c.com/z: baz
[1] The new schema field replaces the old schema field, now containing the ODF schema. Note that the schema includes the system columns like offset, op, and system_time.
[2] The extra field is a container for custom attributes (see the format explanation below). Extra attributes will be allowed on the schema level too and work the same as on the field (column) level.
[3] We introduce the opendatafabric.org/description attribute to contain the field (column) description.
[4] The opendatafabric.org/type attribute here is used only as an example of a possible advanced attribute used to extend logical types beyond the core schema - see RFC-017 for more details.
[5] Types work recursively for struct fields.
[6] The nullability of fields will be represented by the special Option type (unlike in Arrow, where nullability is a property of the field).
Extra Attributes Format
Custom attributes are a key-value map.
Keys in the map must be in the format <domain>/<attribute-name>. The domain prefix is used to disambiguate and avoid name collisions between different extensions.
An attribute value can be a string or any JSON value, including nested objects.
In Flatbuffers it will be represented as a string containing JSON, and in YAML it is written natively.
Logical Types vs. Encoding
Further, we decide that the SetDataSchema event will be restricted to carrying logical type information, without encoding details. Currently it is stuck in between - capturing Arrow encoding details without capturing Parquet encoding. This means we cannot use it as a reliable source of information to enforce similar encoding across all data chunks. At the same time, by capturing Arrow encoding details we impose the encoding selection on the read phase.
Example issue: Old datasets created with earlier versions of Datafusion have Utf8 and Binary types in their schemas. Later versions of Datafusion have switched to using Utf8View and BinaryView to achieve zero-copy reads. But because we use the schema from metadata when reading files for querying (to avoid an expensive schema unification step), the encoding details from the old schema seep into the query phase and invalidate the view optimization, forcing Datafusion to use the old types that imply contiguous buffer encoding.
This separation ensures that:
- Logical types remain stable across data chunk formats and engines.
- Encoding strategies (compression, dictionary usage, encoding hints) can evolve independently, both at file level and in-memory level.
- Producers and consumers can negotiate encoding during runtime while relying on consistent logical schema contracts.
Encoding Hints
The custom attributes mechanism can be used when a dataset author wants to provide hints about the optimal file and in-memory representation of certain columns. Such hints are left out of scope of this RFC.
ODF Schema from/to Arrow Conversion
It is desirable for us to be able to convert between ODF and Arrow schemas back and forth without information loss. Since ODF focuses on logical types and Arrow on in-memory layout, we will again use the custom attributes mechanism to preserve encoding information in the ODF schema. The following encoding hints were identified at this initial stage:
Arrow type: LargeUtf8
ODF schema:
type:
  kind: String
extra:
  arrow.apache.org/bufferEncoding:
    kind: Contiguous
    offsetBitWidth: 64
Arrow type: Utf8View (analogously for BinaryView, ListView, LargeListView)
ODF schema:
type:
  kind: String
extra:
  arrow.apache.org/bufferEncoding:
    kind: View
    offsetBitWidth: 32
Arrow type: Date64 (millisecond-based representation)
ODF schema:
type:
  kind: Date
extra:
  arrow.apache.org/dateEncoding:
    unit: Millisecond
Arrow type: Decimal128 (analogously for Decimal256)
ODF schema:
type:
  kind: Decimal
extra:
  arrow.apache.org/decimalEncoding:
    bitWidth: 128
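Putting these together, a sketch of how a single column could round-trip (the column name here is illustrative): an Arrow field description: LargeUtf8 would convert to the following ODF field, and converting back would restore LargeUtf8 from the buffer-encoding hint:

- name: description
  type:
    kind: String
  extra:
    arrow.apache.org/bufferEncoding:
      kind: Contiguous
      offsetBitWidth: 64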
Schema in APIs
We decide that when a data schema is returned in APIs it should carry only logical type information and not the encoding, unless the encoding describes the data that this particular API actually transmits.
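As a purely illustrative sketch of this rule, a metadata-only API could return the column from the previous example with just its logical type, while an API that actually streams Arrow buffers for it could keep the encoding hint:

# Returned by a metadata-only API:
- name: description
  type:
    kind: String

# Returned by an API transmitting Arrow-encoded data:
- name: description
  type:
    kind: String
  extra:
    arrow.apache.org/bufferEncoding:
      kind: Contiguous
      offsetBitWidth: 64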
Compatibility
Flatbuffer level:
- Flatbuffer schema is made backwards-compatible.
- New datasets will be created with only the new schema field populated, leaving the raw_arrow_schema field empty, as at the current stage of evolution we can sacrifice forwards-compatibility
- Full removal of raw_arrow_schema will be considered in later versions
Manifest level:
- YAML manifests are backwards-compatible, as the SetDataSchema event never appeared in them before
- When the new schema starts appearing in dataset manifests, only the new tooling will be able to handle it
Field optionality during schema upgrades
Field optionality (nullability) deserves a special note. Arrow currently has very poor control over the nullability of fields, and implementations have had to ignore nullability when checking schemas for compatibility. With explicit ODF schemas we would like to pave the path towards enforcing field optionality and making schemas that differ in optionality be considered incompatible. To achieve this we special-case the metadata validation logic: the first time a schema is converted from raw_arrow_schema to the ODF schema, the schema equivalence check should ignore differences in field optionality. This will allow dataset authors to define the schemas they actually expect, rather than propagate the often incorrect nullability inference from Arrow.
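For example (a sketch of the intended migration behavior, using the Option type from note [6]): a nullable Arrow field gender: Utf8 would be inferred as an Option-wrapped string, but during the one-time conversion from raw_arrow_schema the author could declare it as non-optional and the equivalence check would still pass:

# Inferred from a nullable Arrow field:
- name: gender
  type:
    kind: Option
    inner:
      kind: String

# Author-declared schema, accepted during the one-time migration
# despite differing in optionality:
- name: gender
  type:
    kind: String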
Drawbacks
- It requires changing the SetDataSchema event to store ODF schemas instead of nested Arrow Flatbuffers (see RFC-010)
- At present users would have to define the schema for system fields (offset, op, system_time, event_time), which is verbose and error-prone. We believe that this extra burden can be resolved by providing better automatic schema inference in the tooling.
- This approach is all-or-nothing: if a user wants to annotate just a few columns, they will need to specify the entire schema. We believe that this extra burden can likewise be resolved by providing better automatic schema inference in the tooling.
Rationale and alternatives
Rejected: External annotations
Idea: We could continue using the SetDataSchema event to store the serialized schema, while using some other event to add annotations to the schema and columns.
Example:
- kind: SetInfo
  description: Open dataset of MRI scans performed by the X medical lab at Y university
  keywords:
    - MRI
    - Healthcare
  columns:
    - name: event_time
      description: Time when the MRI scan was taken
      a.com/a: foo
    - name: content_hash
      type: Utf8
      description: References the MRI scan data linked to the dataset by its hash
      logicalType:
        kind: ObjectLink
        inner:
          kind: Multihash
    - name: subject
      type: Struct
      fields:
        - name: id
          logicalType: DID
          description: Subject identifier
        - name: gender
          description: Subject's gender
          a.com/a: foo
          b.com/b:
            $value: baz
            mustUnderstand: true
Reasoning:
- External annotations allow annotating only the desired parts of the schema without necessarily declaring all columns
- This, however, introduces a chance of them going out of sync and the need for validation
- We already saw the need for the ability to define the output schema upfront in datasets, and we need the human-friendly representation regardless
Pros:
- Annotation can be partial
- Keeps annotation process separate from schema changes
Cons:
- Duplication if user has to both define a schema and annotate it
- Introduces two potentially conflicting sources of truth
- Will need to validate that annotations refer to columns that exist
- Need to handle potential desync when schema evolves
- Does not address the user's inability to define the schema manually
Rejected: Compatibility with Arrow Flatbuffer schema
Idea:
- Create ODF schemas for the Arrow Schema that match the existing flatbuffer-serialized schema in the SetDataSchema event
- Introduce new extension fields for annotations as part of the Arrow schema object, offsetting their serialization tags to allow for Arrow schema evolution without interfering with our extensions
- The new schema would be compatible with Arrow both in its flatbuffers and flatbuffer-JSON serialized forms
Reasoning: Binary compatibility is difficult to achieve and brittle:
- Due to flatbuffer design, nested: [ubyte] where the data is a separately-serialized Nested table is serialized differently from nested: Nested, making it impossible to easily match the old binary layout
- Due to Arrow flatbuffers still actively evolving and sometimes breaking compatibility, we would have to closely monitor the upstream changes
- And due to flatbuffers requiring serialization IDs to be sequential, offsetting our extensions would require ugly dummy fields to pad the extensions sufficiently far from the base schema
While flatbuffer-JSON is the first candidate for an official “human-friendly” representation, the schema itself is frequently overly verbose and cumbersome to use due to multiple design issues. These issues seem severe enough that the Rust implementation has already defined types that significantly differ from the Arrow flatbuffer layout.
A few prominent design issues:
- The children field appears on the Field table, instead of being part of a Type::Struct variant
- A single-item children: [Field] is used to define the type of elements in a List, instead of the item type being part of a Type::List variant
- Because of this multi-purpose use, the name and nullable fields are also made optional
- Integer types in the Arrow flatbuffer schema are structured as Int { bitWidth: 64, is_signed: false }, which is much more verbose than UInt64
- There are multiple factors that contribute to a combinatorial explosion of types:
  - Large* types only differ by using 64-bit offset encoding
  - View* types (prefix or “german-style” strings) are fundamentally an encoding concern, not a logical type
  - RunEndEncoded is similarly an encoding concern orthogonal to the type
  - Even the type Utf8 mixes the fact that it’s semantically a “unicode string” with how it is represented
Prior works like the Vortex format have also pointed out Arrow’s lack of distinction between logical and physical types.
We could introduce some hacks to make schemas easier to define manually, but this would still mean that the verbose low-level structure is what users would see in our APIs.
We anticipate many more evolutions of the schema:
- Supporting different column-wise compression schemes
- Supporting file, column, and page-wise encryption
Prior art
- Arrow Flatbuffers Schema
- Arrow issues related to standard JSON representation
- Logical vs. Physical data types in Vortex
- Arrow View types introduction blog post
- Apache Iceberg Schema
- JSON-LD
Unresolved questions
System Columns
The SetDataSchema event was meant to represent the full dataset schema, including the system columns.
Requiring system columns when defining schema manually adds a lot of boilerplate to manifest-based schema definitions.
It’s not yet clear what the best way to resolve this issue is:
- Should system fields like offset, op, system_time, and event_time be auto-populated?
- What about event_time, which often comes from ingested data, can vary in type, and may be renamed?
Logical Types Evolution
It’s not yet fully clear what approach we should take when it comes to expanding the set of logical types:
- Will types like Multihash and Did ever make it into the core schema?
- When should this happen - when every implementation and every engine supports them via custom attributes?
- Can we incorporate new types into the core as long as they provide a fallback, e.g. allowing Multihash and Did to fall back to the String type?
- Will the semantics of new logical types be governed by the ODF schema?
Future possibilities
ODF schema in ingest sources
Currently our push and polling source definitions rely on a DDL-style schema. For consistency we should allow defining source schemas in the ODF format as well. This is left out of scope of this RFC.
Compressed enum format
Basic types in new schema will be quite verbose to spell out, e.g.:
- name: offset
  type:
    kind: UInt64
We could allow enum variants to be shortened to just a string when they don’t specify any additional fields, shortening the above to:
- name: offset
  type: UInt64
This would be similar to the @value contracted/expanded syntax in JSON-LD.
NOTE: The compressed enum format should be an input-only feature, i.e. it would only be applied when reading YAML manifests. The full expanded form would be used in flatbuffers.
Replacing SetVocabulary event
The SetVocabulary event can potentially be replaced with annotations on system columns (e.g. opendatafabric.org/systemColumn: offset). This could be a way to allow users to skip defining system columns in the schema - unless they are specified and tagged explicitly, the tooling can insert them automatically.
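A minimal sketch of such an annotation, assuming the opendatafabric.org/systemColumn attribute mentioned above (its exact value format is not defined by this RFC):

- name: offset
  type:
    kind: UInt64
  extra:
    opendatafabric.org/systemColumn: offset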
JSON-LD extensibility
We want the set of schema and column annotations to be fully extensible. Core ODF annotations will be a part of the Flatbuffer schema and will be efficient to store, but the rest can essentially be treated as additionalProperties: true and stored as arbitrary data. Using JSON-LD’s expansion and compaction mechanisms we can blur the difference between the two for library users.
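A rough sketch of what compaction could look like for manifest authors, assuming a hypothetical JSON-LD-style context that maps short names onto attribute domains (no such context is specified by this RFC):

# Expanded form stored in metadata:
extra:
  opendatafabric.org/description: References the MRI scan data linked to the dataset by its hash

# Possible compacted form presented to users:
extra:
  description: References the MRI scan data linked to the dataset by its hash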