RFC-004: Metadata Extensibility
Start Date: 2020-12-29
Summary
This RFC proposes a new MetadataBlock schema that makes metadata less ambiguous and easier to extend.
Motivation
The MetadataBlock schema today consists of multiple optional fields that appear in metadata under different conditions. For example, the outputSlice field is present when a transformation resulted in new data being added to the dataset, while outputWatermark tells us that a transformation advanced the watermark of a dataset and can be present even when no new output data was produced. So when you want to process only the metadata blocks that correspond to transformations, you are left wondering what the “definitive sign” of a transformation taking place is: the presence of outputWatermark, outputSlice, or inputSlices?
These are clear signs of “anemic data” - the very problem ODF was designed to prevent in data, in contrast to approaches like Change Data Capture. As the set of features ODF supports grows, such an anemic, ever-growing schema will become unsustainable.
This RFC will explore how to transition metadata into a descriptive event-based format that can be easily extended by different parties to implement higher-level features.
Guide-level explanation
This RFC proposes to replace the anemic fields of the MetadataBlock schema with an extensible set of MetadataEvents, with every MetadataBlock containing a single MetadataEvent of a certain type. Each MetadataEvent will correspond to a certain transaction on the dataset (e.g. executing a query, specifying a polling source, updating the transformation query). Every event will contain only the fields that are relevant to that transaction.
A similar transformation will be done to the DatasetSnapshot schema. Instead of directly representing the first MetadataBlock it will contain an array of MetadataEvents that bring the dataset into a desired state. This means that in most cases a dataset newly-created from a DatasetSnapshot will have more than one block.
For example, creating a dataset from the following snapshot:
kind: DatasetSnapshot
version: 1
content:
  name: ca.bccdc.covid19.case-details
  kind: root
  metadata:
    - kind: setPollingSource
      fetch: ...
      read: ...
      merge: ...
    - kind: setVocab
      eventTimeColumn: reported_date
will produce a metadata chain with three blocks:
- Seed
- SetPollingSource
- SetVocab
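
The expansion of a snapshot into a chain can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MetadataBlock` class, the `block_hash` helper, the JSON-based hashing, and the fields placed into the `seed` event are hypothetical and are not defined by this RFC. The elided `fetch`, `read`, and `merge` values are kept as placeholders.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MetadataBlock:
    """One block of the chain; it carries exactly one event."""
    prev_block_hash: Optional[str]  # None only for the first (Seed) block
    event: dict[str, Any]

    def block_hash(self) -> str:
        # Illustrative only: hash a canonical JSON rendering of the block.
        payload = json.dumps(
            {"prev": self.prev_block_hash, "event": self.event},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def blocks_from_snapshot(snapshot: dict[str, Any]) -> list[MetadataBlock]:
    content = snapshot["content"]
    # Seed always comes first; its exact fields are an assumption here.
    events = [{"kind": "seed", "name": content["name"], "datasetKind": content["kind"]}]
    # Every event listed in the snapshot becomes its own block.
    events.extend(content["metadata"])

    chain: list[MetadataBlock] = []
    prev: Optional[str] = None
    for event in events:
        block = MetadataBlock(prev_block_hash=prev, event=event)
        chain.append(block)
        prev = block.block_hash()
    return chain


snapshot = {
    "kind": "DatasetSnapshot",
    "version": 1,
    "content": {
        "name": "ca.bccdc.covid19.case-details",
        "kind": "root",
        "metadata": [
            {"kind": "setPollingSource", "fetch": "...", "read": "...", "merge": "..."},
            {"kind": "setVocab", "eventTimeColumn": "reported_date"},
        ],
    },
}

chain = blocks_from_snapshot(snapshot)
print([block.event["kind"] for block in chain])
# ['seed', 'setPollingSource', 'setVocab']
```

Two events in the snapshot plus the implicit Seed give the three blocks listed above.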
Reference-level explanation
Core events:
| Event Type | Description |
|---|---|
| Seed | Contains identity information of a dataset and always appears as the first metadata block. |
| AddData | Signifies that data was added into the root dataset. |
| SetWatermark | Signifies that watermark of the dataset has been advanced. |
| SetTransform | Defines transformation of the derivative dataset. |
| ExecuteQuery | Signifies data processing step on the derivative dataset. |
Extension events:
| Event Type | Description |
|---|---|
| SetPollingSource | (Optional extension) Defines how externally-hosted data can be ingested into the root dataset. |
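
To make the event model more concrete, here is a minimal Python sketch of the events above as a tagged union. The event names come from the tables; every field name (`dataset_name`, `output_records`, `output_watermark`, `query`) is a hypothetical placeholder, since this RFC does not fix the fields of each event. The point it illustrates is that the type of the event, rather than the presence of some optional field, becomes the definitive sign of what happened.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Seed:
    dataset_name: str  # hypothetical field


@dataclass(frozen=True)
class AddData:
    output_records: int  # hypothetical field
    output_watermark: Optional[str] = None  # hypothetical field


@dataclass(frozen=True)
class SetWatermark:
    output_watermark: str  # hypothetical field


@dataclass(frozen=True)
class SetTransform:
    query: str  # hypothetical field


@dataclass(frozen=True)
class ExecuteQuery:
    output_records: int  # hypothetical field
    output_watermark: Optional[str] = None  # hypothetical field


@dataclass(frozen=True)
class SetPollingSource:
    fetch: str
    read: str
    merge: str


MetadataEvent = Union[
    Seed, AddData, SetWatermark, SetTransform, ExecuteQuery, SetPollingSource
]


def transformation_steps(events: list[MetadataEvent]) -> list[ExecuteQuery]:
    # The event type alone identifies a transformation step.
    return [e for e in events if isinstance(e, ExecuteQuery)]


events: list[MetadataEvent] = [
    Seed(dataset_name="example.derivative"),
    SetTransform(query="..."),
    ExecuteQuery(output_records=10, output_watermark="2020-12-01T00:00:00Z"),
    # No new data, only the watermark moved: still unambiguously a transformation.
    ExecuteQuery(output_records=0, output_watermark="2020-12-02T00:00:00Z"),
]

print(len(transformation_steps(events)))  # 2
```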
Drawbacks
- Makes DatasetSnapshot schema a little more verbose
- There will be some events that contain a similar set of fields (e.g. AddData and ExecuteQuery both contain data and watermark info) but will now need to be processed separately. This is a small price to pay for a robust domain model though.
Rationale and alternatives
- As the dataset creation example shows, it is no longer possible to move a dataset into a desired state with just one metadata block. Does this hurt consistency?
  - One considered alternative was to allow multiple MetadataEvents per MetadataBlock - this would ensure per-block consistency but at the expense of higher complexity.
  - It was decided that this complexity is not warranted. Just like with git, where code is not guaranteed to compile when looking at it on a per-commit basis, overall consistency can be achieved as long as a group of commits that constitutes a consistent state is pushed atomically.
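
The atomic-push argument can be sketched as follows. This is a simplified in-memory illustration, not a description of any existing implementation: the `Chain` class, the `push` method, and the consistency rule in `validate` (a chain starts with Seed and contains SetPollingSource) are hypothetical and chosen only to show that validation applies to the group as a whole rather than to each block.

```python
from dataclasses import dataclass, field


class InconsistentChainError(Exception):
    pass


def validate(events: list[str]) -> None:
    # Hypothetical rule, for illustration only: a consistent chain
    # starts with Seed and defines a polling source.
    if not events or events[0] != "Seed" or "SetPollingSource" not in events:
        raise InconsistentChainError(f"inconsistent chain: {events}")


@dataclass
class Chain:
    events: list[str] = field(default_factory=list)

    def push(self, group: list[str]) -> None:
        # Stage the whole group, validate the resulting state once,
        # and only then make it visible. On failure nothing is applied.
        candidate = self.events + group
        validate(candidate)
        self.events = candidate


chain = Chain()

try:
    chain.push(["Seed"])  # on its own this block is not a consistent state
except InconsistentChainError:
    pass
print(chain.events)  # []

chain.push(["Seed", "SetPollingSource", "SetVocab"])
print(chain.events)  # ['Seed', 'SetPollingSource', 'SetVocab']
```

Readers of the chain only ever observe the state before or after a whole group is applied, which is the same guarantee git gives for a pushed set of commits.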
Prior art
Unresolved questions
- Forward-compatibility requires a way to differentiate between events that the coordinator has to understand and events that can be safely ignored if the coordinator does not support them.
  - Currently, we don’t have a good way to achieve this without running into too many issues with flatbuffers.
  - We postpone this issue until we get better clarity on the long-term serialization format - we’re getting more indications that flatbuffers is not the best way forward and may consider alternatives. This also ties in with considering IPLD.
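
To illustrate the shape of the problem (not to propose a solution, which this RFC leaves open), here is a hypothetical Python sketch in which each event carries a `mustUnderstand` flag. The flag, the `KNOWN_KINDS` set, the `setLicense` and `setEncryption` event kinds, and the `apply_events` function are all assumptions made for this example, and the sketch ignores the serialization concerns mentioned above.

```python
from typing import Any

# Event kinds this (hypothetical) coordinator version knows how to process.
KNOWN_KINDS = {"seed", "addData", "setWatermark", "setTransform", "executeQuery", "setVocab"}


class UnsupportedEventError(Exception):
    pass


def apply_events(events: list[dict[str, Any]]) -> list[str]:
    """Returns the kinds of events that were processed."""
    applied: list[str] = []
    for event in events:
        kind = event["kind"]
        if kind in KNOWN_KINDS:
            applied.append(kind)
        elif event.get("mustUnderstand", True):
            # Unknown events are treated as critical unless marked otherwise.
            raise UnsupportedEventError(f"cannot safely ignore event: {kind}")
        # Otherwise the event is unknown but explicitly ignorable: skip it.
    return applied


events = [
    {"kind": "seed"},
    {"kind": "setLicense", "mustUnderstand": False},  # hypothetical extension event
    {"kind": "setVocab"},
]
print(apply_events(events))  # ['seed', 'setVocab']

try:
    apply_events([{"kind": "seed"}, {"kind": "setEncryption", "mustUnderstand": True}])
except UnsupportedEventError as err:
    print(err)  # cannot safely ignore event: setEncryption
```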
Future possibilities
- This RFC opens up the path for introducing many new MetadataEvent types such as for:
  - Storing description / readme
  - Linking to additional content (e.g. notebooks)
  - Embedding a license