RFC-009: Ingestion Source State
Start Date: 2023-04-28
Compatibility: Backwards-compatible, forwards-incompatible
Summary
Introduces a sourceState
field into AddData
event to store the state of the generic data source from which the data is being added, allowing for incremental ingestion.
Motivation
ODF implementations often need to store the state of the data source from which data is ingested to avoid expensive recomputations.
For example when using SetPollingSource
source state can be in the form of:
- ETag of Last-Modified header for HTTP sources
- filename of the last file matched by glob pattern
- or a height of the block of indexed blockchain.
Currently, we don’t have anywhere to store such data.
Guide-level explanation
When ingesting data, an ODF implementation will be able to attach opaque state data to the AddData
event in order to resume ingestion most efficiently on the next iteration. The source state data will allow differentiating the kind of state that is preserved (similarly to MIME type) and will specify the identity of the source.
Reference-level explanation
New SourceState
metadata fragment will be introduced.
The schema of AddData
metadata event will:
- be extended with an optional
sourceState
field - have
outputData
field made optional (as it’s possible for ingest to produce no new data but update source state, watermark, or checkpoint)
Two predefined source state kinds will be added:
odf/etag
- for state identifiers similar toETag
HTTP headerodf/last-modified
- for RFC3338 timestamps with meaning similar toLast-Modified
HTTP header
One predefined source ID will be added:
odf/polling
- referring to the source specified in theSetPollingSource
metadata event
Both fields will be plain strings and not enums allowing different implementations to define their own extensions.
It should be safe to ignore the source state that an implementation does not understand.
Example SourceState
:
sourceState:
kind: odf/etag
source: odf/polling
value: W/"95c5fde3918cba7c33eaac7ff9d02b22"
Compatibility
This change will be fully backwards-compatible. It will be forwards-incompatible due to making of AddData::outputData
field optional.
Drawbacks
- More complexity in metadata
Alternatives
- Store source state as part of the ingest checkpoint:
- Would require storing source state alongside engine checkpoints within one file (e.g. by
tar
‘ing them together) - Would complicate and slows down engine checkpoint extraction
- Would only work for a single source
- Would require storing source state alongside engine checkpoints within one file (e.g. by
Prior art
N/A
Unresolved questions
N/A
Future possibilities
- The source identity (
source
field) should allow us to support multiple concurrent ingestion sources if needed