Start Date: 2023-04-28
Summary
Introduces a sourceState field into the AddData event to store the state of the generic data source from which the data is being added, allowing for incremental ingestion.
Motivation
ODF implementations often need to store the state of the data source from which data is ingested to avoid expensive recomputations. For example, when using SetPollingSource, the source state can take the form of:
- the ETag or Last-Modified header for HTTP sources
- the filename of the last file matched by a glob pattern
- or the height of the last indexed block of a blockchain.
Guide-level explanation
When ingesting data, an ODF implementation will be able to attach opaque state data to the AddData event in order to resume ingestion most efficiently on the next iteration. The source state data will identify the kind of state that is preserved (similarly to a MIME type) and will specify the identity of the source it belongs to.
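The resume behavior described above can be sketched as a single polling iteration that skips work when the stored state matches what the source currently reports. This is a minimal illustration, not the reference implementation: the Python field names (kind, source, value, output_data, source_state) and the poll_once function are assumptions made for the sketch, and fetching is stubbed out as a callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SourceState:
    # Field names are assumptions made for this sketch.
    kind: str    # what the value means, e.g. "odf/etag"
    source: str  # which source the state belongs to, e.g. "odf/polling"
    value: str   # opaque state payload


@dataclass(frozen=True)
class AddData:
    output_data: Optional[bytes]         # None when nothing new was ingested
    source_state: Optional[SourceState]  # state to resume from next time


def poll_once(
    prev: Optional[SourceState],
    current_etag: str,
    fetch: Callable[[], bytes],
) -> Optional[AddData]:
    """Run one ingest iteration, skipping the fetch when the ETag is unchanged."""
    if prev is not None and prev.kind == "odf/etag" and prev.value == current_etag:
        return None  # source unchanged: nothing to fetch or record
    data = fetch()
    state = SourceState(kind="odf/etag", source="odf/polling", value=current_etag)
    return AddData(output_data=data, source_state=state)
```

On the next iteration, the caller would pass the source_state of the most recent AddData event back in as prev, so an unchanged source costs only the ETag comparison.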
Reference-level explanation
A new SourceState metadata fragment will be introduced.
The schema of the AddData metadata event will:
- be extended with an optional sourceState field
- have the outputData field made optional (as it’s possible for ingest to produce no new data but update source state, watermark, or checkpoint)
Defined values for the kind of state:
- odf/etag - for state identifiers similar to the ETag HTTP header
- odf/last-modified - for RFC 3339 timestamps with meaning similar to the Last-Modified HTTP header
Defined values for the source identity (source field):
- odf/polling - referring to the source specified in the SetPollingSource metadata event
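As an illustration of how the two state kinds might be interpreted, the sketch below decides whether a source has changed since the stored state was recorded. The function names are hypothetical, and the comparison rules (inequality for ETags, a strictly newer timestamp for Last-Modified) are assumptions of this sketch rather than requirements stated here.

```python
from datetime import datetime


def parse_rfc3339(ts: str) -> datetime:
    """Parse common RFC 3339 timestamps such as 2023-04-28T10:00:00Z."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    if ts.endswith(("Z", "z")):
        ts = ts[:-1] + "+00:00"
    return datetime.fromisoformat(ts)


def source_changed(kind: str, stored: str, observed: str) -> bool:
    """Return True when the observed value indicates new data at the source."""
    if kind == "odf/etag":
        # ETags are opaque identifiers: any difference means a change.
        return stored != observed
    if kind == "odf/last-modified":
        # Timestamps are compared as instants, so differing offsets are fine.
        return parse_rfc3339(observed) > parse_rfc3339(stored)
    raise ValueError(f"unsupported state kind: {kind}")
```

Rejecting unknown kinds explicitly keeps an implementation from silently misreading state written by a newer or different ingest mechanism.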
SourceState:
Compatibility
This change will be fully backwards-compatible. It will be forwards-incompatible because the AddData::outputData field is made optional.
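To illustrate the forwards-incompatibility, the sketch below contrasts a reader written before this change, which assumes outputData is always present, with one that tolerates its absence. Events are modeled as plain dictionaries purely for illustration; the function names and the contents of the sourceState value are hypothetical.

```python
from typing import Any, Mapping, Optional


def read_output_strict(event: Mapping[str, Any]) -> Any:
    """Reader written before this change: assumes outputData always exists."""
    return event["outputData"]  # raises KeyError on state-only events


def read_output_tolerant(event: Mapping[str, Any]) -> Optional[Any]:
    """Reader updated for this change: a missing outputData yields None."""
    return event.get("outputData")


# Hypothetical event that only advances the source state and adds no data.
# The keys inside sourceState are placeholders, not the actual schema.
state_only_event = {
    "sourceState": {"kind": "odf/etag", "source": "odf/polling", "value": "v2"},
}
```

The strict reader still handles every event produced before the change, which is why existing data remains readable, but it fails on a state-only event written by a newer implementation.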
Drawbacks
- More complexity in metadata
Alternatives
- Store source state as part of the ingest checkpoint:
  - Would require storing source state alongside engine checkpoints within one file (e.g. by tar’ing them together)
  - Would complicate and slow down engine checkpoint extraction
  - Would only work for a single source
Prior art
N/A
Unresolved questions
N/A
Future possibilities
- The source identity (source field) should allow us to support multiple concurrent ingestion sources if needed