Reference
Manifests
Manifest
An object that wraps the metadata resources providing versioning and type identification. All root-level resources are wrapped with a manifest when serialized to disk.
Property | Type | Required | Format | Description |
---|---|---|---|---|
kind | string | ✔️ | multicodec | Type of the resource. |
version | integer | ✔️ | | Major version number of the resource contained in this manifest. It provides the mechanism for introducing compatibility breaking changes. |
content | string | ✔️ | flatbuffers | Resource data. |
DatasetSnapshot
Represents a projection of the dataset metadata at a single point in time. This type is typically used for defining new datasets and changing the existing ones.
Property | Type | Required | Format | Description |
---|---|---|---|---|
name | string | ✔️ | dataset-alias | Alias of the dataset. |
kind | DatasetKind | ✔️ | | Type of the dataset. |
metadata | array( MetadataEvent ) | ✔️ | | An array of metadata events that will be used to populate the chain. Here you can define polling and push sources, set licenses, add attachments etc. |
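Below is an illustrative sketch of a DatasetSnapshot wrapped in a Manifest as it could appear when serialized. A YAML rendering is assumed here for readability; the canonical on-disk encoding may differ, and the dataset name, description, and keywords are hypothetical.

```yaml
# Hypothetical example: a root dataset defined via a DatasetSnapshot,
# wrapped in a Manifest (kind + version + content).
kind: DatasetSnapshot
version: 1
content:
  name: example.cities        # dataset-alias (made-up)
  kind: Root
  metadata:
    - kind: SetInfo
      description: Population statistics for a set of cities.  # hypothetical
      keywords:
        - demographics
        - cities
```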
MetadataBlock
An individual block in the metadata chain that captures the history of modifications of a dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
systemTime | string | ✔️ | date-time | System time when this block was written. |
prevBlockHash | string | | multihash | Hash sum of the preceding block. |
sequenceNumber | integer | ✔️ | uint64 | Block sequence number, starting from zero at the seed block. |
event | MetadataEvent | ✔️ | | Event data. |
Metadata Events
MetadataEvent
Represents a transaction that occurred on a dataset.
Union Type | Description |
---|---|
AddData | Indicates that data has been ingested into a root dataset. |
ExecuteTransform | Indicates that derivative transformation has been performed. |
Seed | Establishes the identity of the dataset. Always the first metadata event in the chain. |
SetPollingSource | Contains information on how externally-hosted data can be ingested into the root dataset. |
SetTransform | Defines a transformation that produces data in a derivative dataset. |
SetVocab | Lets you manipulate names of the system columns to avoid conflicts. |
SetAttachments | Associates a set of files with this dataset. |
SetInfo | Provides basic human-readable information about a dataset. |
SetLicense | Defines a license that applies to this dataset. |
SetDataSchema | Specifies the complete schema of Data Slices added to the Dataset following this event. |
AddPushSource | Describes how to ingest data into a root dataset from a certain logical source. |
DisablePushSource | Disables the previously defined source. |
DisablePollingSource | Disables the previously defined polling source. |
AddData
Indicates that data has been ingested into a root dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
prevCheckpoint | string | | multihash | Hash of the checkpoint file used to restore ingestion state, if any. |
prevOffset | integer | | uint64 | Last offset of the previous data slice, if any. Must be equal to the last non-empty `newData.offsetInterval.end`. |
newData | DataSlice | | | Describes output data written during this transaction, if any. |
newCheckpoint | Checkpoint | | | Describes checkpoint written during this transaction, if any. If an engine operation resulted in no updates to the checkpoint, but checkpoint is still relevant for subsequent runs - a hash of the previous checkpoint should be specified. |
newWatermark | string | | date-time | Last watermark of the output data stream, if any. Initial blocks may not have watermarks, but once watermark is set - all subsequent blocks should either carry the same watermark or specify a new (greater) one. Thus, watermarks are monotonically non-decreasing. |
newSourceState | SourceState | | | The state of the source the data was added from to allow fast resuming. If the state did not change but is still relevant for subsequent runs it should be carried, i.e. only the last state per source is considered when resuming. |
AddPushSource
Describes how to ingest data into a root dataset from a certain logical source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
sourceName | string | ✔️ | | Identifies the source within this dataset. |
read | ReadStep | ✔️ | | Defines how data is read into structured format. |
preprocess | Transform | | | Pre-processing query that shapes the data. |
merge | MergeStrategy | ✔️ | | Determines how newly-ingested data should be merged with existing history. |
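An illustrative YAML sketch of an AddPushSource event follows; the source name, schema columns, and choice of NdJson reader and Append merge strategy are assumptions made for the example.

```yaml
# Hypothetical push source: clients push newline-delimited JSON records
# that are appended to the dataset without deduplication.
kind: AddPushSource
sourceName: default           # made-up source name
read:
  kind: NdJson
  schema:
    - id BIGINT
    - reported_at TIMESTAMP
    - value STRING
merge:
  kind: Append
```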
DisablePollingSource
Disables the previously defined polling source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
DisablePushSource
Disables the previously defined source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
sourceName | string | ✔️ | | Identifies the source to be disabled. |
ExecuteTransform
Indicates that derivative transformation has been performed.
Property | Type | Required | Format | Description |
---|---|---|---|---|
queryInputs | array( ExecuteTransformInput ) | ✔️ | | Defines inputs used in this transaction. Slices corresponding to every input dataset must be present. |
prevCheckpoint | string | | multihash | Hash of the checkpoint file used to restore transformation state, if any. |
prevOffset | integer | | uint64 | Last offset of the previous data slice, if any. Must be equal to the last non-empty `newData.offsetInterval.end`. |
newData | DataSlice | | | Describes output data written during this transaction, if any. |
newCheckpoint | Checkpoint | | | Describes checkpoint written during this transaction, if any. If an engine operation resulted in no updates to the checkpoint, but checkpoint is still relevant for subsequent runs - a hash of the previous checkpoint should be specified. |
newWatermark | string | | date-time | Last watermark of the output data stream, if any. Initial blocks may not have watermarks, but once watermark is set - all subsequent blocks should either carry the same watermark or specify a new (greater) one. Thus, watermarks are monotonically non-decreasing. |
Seed
Establishes the identity of the dataset. Always the first metadata event in the chain.
Property | Type | Required | Format | Description |
---|---|---|---|---|
datasetId | string | ✔️ | dataset-id | Unique identity of the dataset. |
datasetKind | DatasetKind | ✔️ | | Type of the dataset. |
SetAttachments
Associates a set of files with this dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
attachments | Attachments | ✔️ | | One of the supported attachment sources. |
SetDataSchema
Specifies the complete schema of Data Slices added to the Dataset following this event.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | string | ✔️ | flatbuffers | Apache Arrow schema encoded in its native flatbuffers representation. |
SetInfo
Provides basic human-readable information about a dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
description | string | | | Brief single-sentence summary of a dataset. |
keywords | array(string) | | | Keywords, search terms, or tags used to describe the dataset. |
SetLicense
Defines a license that applies to this dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
shortName | string | ✔️ | | Abbreviated name of the license. |
name | string | ✔️ | | Full name of the license. |
spdxId | string | | | License identifier from the SPDX License List. |
websiteUrl | string | ✔️ | url | |
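For example, a SetLicense event could look like the YAML sketch below; the specific license is only illustrative, and a YAML rendering of the event is assumed.

```yaml
# Illustrative license declaration
kind: SetLicense
shortName: CC-BY-4.0
name: Creative Commons Attribution 4.0 International
spdxId: CC-BY-4.0
websiteUrl: https://creativecommons.org/licenses/by/4.0/
```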
SetPollingSource
Contains information on how externally-hosted data can be ingested into the root dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
fetch | FetchStep | ✔️ | | Determines where data is sourced from. |
prepare | array( PrepStep ) | | | Defines how raw data is prepared before reading. |
read | ReadStep | ✔️ | | Defines how data is read into structured format. |
preprocess | Transform | | | Pre-processing query that shapes the data. |
merge | MergeStrategy | ✔️ | | Determines how newly-ingested data should be merged with existing history. |
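The YAML sketch below shows one plausible SetPollingSource configuration: fetching a CSV file by URL, reading it with a header row, and merging it as a ledger keyed by an id column. The URL, schema, and primary key are hypothetical.

```yaml
# Hypothetical polling source: periodically fetch a CSV file over HTTP
kind: SetPollingSource
fetch:
  kind: Url
  url: https://example.org/data/cities.csv   # made-up URL
read:
  kind: Csv
  header: true
  schema:
    - id BIGINT
    - name STRING
    - population BIGINT
merge:
  kind: Ledger
  primaryKey:
    - id
```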
SetTransform
Defines a transformation that produces data in a derivative dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
inputs | array( TransformInput ) | ✔️ | | Datasets that will be used as sources. |
transform | Transform | ✔️ | | Transformation that will be applied to produce new data. |
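An illustrative YAML sketch of a SetTransform event for a derivative dataset; the input reference, alias, engine identifier, and query are assumptions made for the example.

```yaml
# Hypothetical derivative transformation over a single input
kind: SetTransform
inputs:
  - datasetRef: example.cities   # made-up input reference
    alias: cities
transform:
  kind: Sql
  engine: datafusion             # illustrative engine identifier
  query: |
    SELECT id, name, population
    FROM cities
    WHERE population > 100000
```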
SetVocab
Lets you manipulate names of the system columns to avoid conflicts.
Property | Type | Required | Format | Description |
---|---|---|---|---|
offsetColumn | string | | | Name of the offset column. |
operationTypeColumn | string | | | Name of the operation type column. |
systemTimeColumn | string | | | Name of the system time column. |
eventTimeColumn | string | | | Name of the event time column. |
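For instance, a dataset whose source data already uses a column named `offset` could remap the system columns, as in this hypothetical YAML sketch (column names are made up).

```yaml
# Hypothetical vocabulary override to avoid column name clashes
kind: SetVocab
offsetColumn: record_offset    # made-up column name
eventTimeColumn: reported_at   # made-up column name
```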
Engine Protocol
RawQueryRequest
Sent by the coordinator to an engine to perform a query on raw input data, usually as part of the ingest preprocessing step.
Property | Type | Required | Format | Description |
---|---|---|---|---|
inputDataPaths | array(string) | ✔️ | | Paths to input data files to perform query over. Must all have identical schema. |
transform | Transform | ✔️ | | Transformation that will be applied to produce new data. |
outputDataPath | string | ✔️ | path | Path where query result will be written. |
RawQueryResponse
Sent by an engine to the coordinator when performing the raw query operation.
Union Type | Description |
---|---|
RawQueryResponse::Progress | |
RawQueryResponse::Success | |
RawQueryResponse::InvalidQuery | |
RawQueryResponse::InternalError |
RawQueryResponse::Progress
Property | Type | Required | Format | Description |
---|---|---|---|---|
RawQueryResponse::Success
Property | Type | Required | Format | Description |
---|---|---|---|---|
numRecords | integer | ✔️ | uint64 | Number of records produced by the query |
RawQueryResponse::InvalidQuery
Property | Type | Required | Format | Description |
---|---|---|---|---|
message | string | ✔️ | | Explanation of an error |
RawQueryResponse::InternalError
Property | Type | Required | Format | Description |
---|---|---|---|---|
message | string | ✔️ | | Brief description of an error |
backtrace | string | | | Details of an error (e.g. a backtrace) |
TransformRequest
Sent by the coordinator to an engine to perform the next step of data transformation
Property | Type | Required | Format | Description |
---|---|---|---|---|
datasetId | string | ✔️ | dataset-id | Unique identifier of the output dataset. |
datasetAlias | string | ✔️ | dataset-alias | Alias of the output dataset, for logging purposes only. |
systemTime | string | ✔️ | date-time | System time to use for new records. |
vocab | DatasetVocabulary | ✔️ | | |
transform | Transform | ✔️ | | Transformation that will be applied to produce new data. |
queryInputs | array( TransformRequestInput ) | ✔️ | | Defines inputs used in this transaction. Slices corresponding to every input dataset must be present. |
nextOffset | integer | ✔️ | uint64 | Starting offset to use for new data records. |
prevCheckpointPath | string | | path | TODO: This will be removed when coordinator will be speaking to engines purely through Arrow. |
newCheckpointPath | string | ✔️ | path | TODO: This will be removed when coordinator will be speaking to engines purely through Arrow. |
newDataPath | string | ✔️ | path | TODO: This will be removed when coordinator will be speaking to engines purely through Arrow. |
TransformRequestInput
Sent as part of the engine transform request operation to describe the input
Property | Type | Required | Format | Description |
---|---|---|---|---|
datasetId | string | ✔️ | dataset-id | Unique identifier of the dataset. |
datasetAlias | string | ✔️ | dataset-alias | Alias of the output dataset, for logging purposes only. |
queryAlias | string | ✔️ | | An alias of this input to be used in queries. |
vocab | DatasetVocabulary | ✔️ | | |
offsetInterval | OffsetInterval | | | Subset of data that goes into this transaction. |
dataPaths | array(string) | ✔️ | | TODO: This will be removed when coordinator will be slicing data for the engine. |
schemaFile | string | ✔️ | path | TODO: replace with actual DDL or Parquet schema. |
explicitWatermarks | array( Watermark ) | ✔️ | | |
TransformResponse
Sent by an engine to the coordinator when performing the data transformation.
Union Type | Description |
---|---|
TransformResponse::Progress | |
TransformResponse::Success | |
TransformResponse::InvalidQuery | |
TransformResponse::InternalError |
TransformResponse::Progress
Property | Type | Required | Format | Description |
---|---|---|---|---|
TransformResponse::Success
Property | Type | Required | Format | Description |
---|---|---|---|---|
newOffsetInterval | OffsetInterval | | | Data slice produced by the transaction, if any. |
newWatermark | string | | date-time | Watermark advanced by the transaction, if any. |
TransformResponse::InvalidQuery
Property | Type | Required | Format | Description |
---|---|---|---|---|
message | string | ✔️ | | Explanation of an error |
TransformResponse::InternalError
Property | Type | Required | Format | Description |
---|---|---|---|---|
message | string | ✔️ | | Brief description of an error |
backtrace | string | | | Details of an error (e.g. a backtrace) |
Fragments
AttachmentEmbedded
Embedded attachment item.
Property | Type | Required | Format | Description |
---|---|---|---|---|
path | string | ✔️ | | Path to an attachment if it was materialized into a file. |
content | string | ✔️ | | Content of the attachment. |
Attachments
Defines the source of attachment files.
Union Type | Description |
---|---|
Attachments::Embedded | For attachments that are specified inline and are embedded in the metadata. |
Attachments::Embedded
For attachments that are specified inline and are embedded in the metadata.
Property | Type | Required | Format | Description |
---|---|---|---|---|
items | array( AttachmentEmbedded ) | ✔️ | | |
Checkpoint
Describes a checkpoint produced by an engine
Property | Type | Required | Format | Description |
---|---|---|---|---|
physicalHash | string | ✔️ | multihash | Hash sum of the checkpoint file. |
size | integer | ✔️ | uint64 | Size of checkpoint file in bytes. |
DataSlice
Describes a slice of data added to a dataset or produced via transformation
Property | Type | Required | Format | Description |
---|---|---|---|---|
logicalHash | string | ✔️ | multihash | Logical hash sum of the data in this slice. |
physicalHash | string | ✔️ | multihash | Hash sum of the data part file. |
offsetInterval | OffsetInterval | ✔️ | | Data slice produced by the transaction. |
size | integer | ✔️ | uint64 | Size of data file in bytes. |
DatasetKind
Represents type of the dataset.
Enum Value |
---|
Root |
Derivative |
DatasetVocabulary
Specifies the mapping of system columns onto dataset schema.
Property | Type | Required | Format | Description |
---|---|---|---|---|
offsetColumn | string | ✔️ | | Name of the offset column. |
operationTypeColumn | string | ✔️ | | Name of the operation type column. |
systemTimeColumn | string | ✔️ | | Name of the system time column. |
eventTimeColumn | string | ✔️ | | Name of the event time column. |
EnvVar
Defines an environment variable passed into some job.
Property | Type | Required | Format | Description |
---|---|---|---|---|
name | string | ✔️ | | Name of the variable. |
value | string | | | Value of the variable. |
EventTimeSource
Defines how the event time is extracted from the source.
Union Type | Description |
---|---|
EventTimeSource::FromMetadata | Extracts event time from the source’s metadata. |
EventTimeSource::FromPath | Extracts event time from the path component of the source. |
EventTimeSource::FromSystemTime | Assigns event time from the system time source. |
EventTimeSource::FromMetadata
Extracts event time from the source’s metadata.
Property | Type | Required | Format | Description |
---|---|---|---|---|
EventTimeSource::FromSystemTime
Assigns event time from the system time source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
EventTimeSource::FromPath
Extracts event time from the path component of the source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
pattern | string | ✔️ | regex | Regular expression where first group contains the timestamp string. |
timestampFormat | string | | | Format of the expected timestamp in java.text.SimpleDateFormat form. |
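As an illustration, a source whose files are named like data-2024-03-01.csv could extract event time from the path as sketched below in an assumed YAML rendering; the pattern and format values are hypothetical, not part of the specification.

```yaml
# Hypothetical event time extraction from the file path
eventTime:
  kind: FromPath
  pattern: 'data-(\d{4}-\d{2}-\d{2})\.csv'   # first capture group holds the date
  timestampFormat: yyyy-MM-dd                # java.text.SimpleDateFormat syntax
```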
ExecuteTransformInput
Describes a slice of the input dataset used during a transformation
Property | Type | Required | Format | Description |
---|---|---|---|---|
datasetId | string | ✔️ | dataset-id | Input dataset identifier. |
prevBlockHash | string | | multihash | Last block of the input dataset that was previously incorporated into the derivative transformation, if any. Must be equal to the last non-empty `newBlockHash`. Together with `newBlockHash` defines a half-open (prevBlockHash, newBlockHash] interval of blocks that will be considered in this transaction. |
newBlockHash | string | | multihash | Hash of the last block that will be incorporated into the derivative transformation. When present, defines a half-open (prevBlockHash, newBlockHash] interval of blocks that will be considered in this transaction. |
prevOffset | integer | | uint64 | Last data record offset in the input dataset that was previously incorporated into the derivative transformation, if any. Must be equal to the last non-empty `newOffset`. Together with `newOffset` defines a half-open (prevOffset, newOffset] interval of data records that will be considered in this transaction. |
newOffset | integer | | uint64 | Offset of the last data record that will be incorporated into the derivative transformation, if any. When present, defines a half-open (prevOffset, newOffset] interval of data records that will be considered in this transaction. |
FetchStep
Defines the external source of data.
Union Type | Description |
---|---|
FetchStep::Url | Pulls data from one of the supported sources by its URL. |
FetchStep::FilesGlob | Uses glob operator to match files on the local file system. |
FetchStep::Container | Runs the specified OCI container to fetch data from an arbitrary source. |
FetchStep::Mqtt | Connects to an MQTT broker to fetch events from the specified topic. |
FetchStep::EthereumLogs | Connects to an Ethereum node to stream transaction logs. |
FetchStep::Url
Pulls data from one of the supported sources by its URL.
Property | Type | Required | Format | Description |
---|---|---|---|---|
url | string | ✔️ | url | URL of the data source |
eventTime | EventTimeSource | | | Describes how event time is extracted from the source metadata. |
cache | SourceCaching | | | Describes the caching settings used for this source. |
headers | array( RequestHeader ) | | | Headers to pass during the request (e.g. HTTP Authorization) |
FetchStep::FilesGlob
Uses glob operator to match files on the local file system.
Property | Type | Required | Format | Description |
---|---|---|---|---|
path | string | ✔️ | | Path with a glob pattern. |
eventTime | EventTimeSource | | | Describes how event time is extracted from the source metadata. |
cache | SourceCaching | | | Describes the caching settings used for this source. |
order | string | | | Specifies how input files should be ordered before ingestion. Order is important as every file will be processed individually and will advance the dataset’s watermark. |
FetchStep::Container
Runs the specified OCI container to fetch data from an arbitrary source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
image | string | ✔️ | | Image name and an optional tag. |
command | array(string) | | | Specifies the entrypoint. Not executed within a shell. The default OCI image’s ENTRYPOINT is used if this is not provided. |
args | array(string) | | | Arguments to the entrypoint. The OCI image’s CMD is used if this is not provided. |
env | array( EnvVar ) | | | Environment variables to propagate into or set in the container. |
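A hedged YAML sketch of a container-based fetch step; the image name, arguments, and environment variable are hypothetical.

```yaml
# Hypothetical container fetch step
fetch:
  kind: Container
  image: ghcr.io/example/fetcher:1.0   # made-up image reference
  args:
    - --since-date
    - "2024-01-01"
  env:
    - name: API_TOKEN                  # value omitted - may be propagated from the environment
```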
FetchStep::Mqtt
Connects to an MQTT broker to fetch events from the specified topic.
Property | Type | Required | Format | Description |
---|---|---|---|---|
host | string | ✔️ | | Hostname of the MQTT broker. |
port | integer | ✔️ | | Port of the MQTT broker. |
username | string | | | Username to use for auth with the broker. |
password | string | | | Password to use for auth with the broker (can be templated). |
topics | array( MqttTopicSubscription ) | ✔️ | | List of topic subscription parameters. |
FetchStep::EthereumLogs
Connects to an Ethereum node to stream transaction logs.
Property | Type | Required | Format | Description |
---|---|---|---|---|
chainId | integer | | uint64 | Identifier of the chain to scan logs from. This parameter may be used for RPC endpoint lookup as well as asserting that provided `nodeUrl` corresponds to the expected chain. |
nodeUrl | string | | url | Url of the node. |
filter | string | | | An SQL WHERE clause that can be used to pre-filter the logs before fetching them from the ETH node. |
signature | string | | | Solidity log event signature to use for decoding. Using this field adds `event` to the output containing decoded log as JSON. |
MergeStrategy
Merge strategy determines how newly ingested data should be combined with the data that already exists in the dataset.
Union Type | Description |
---|---|
MergeStrategy::Append | Append merge strategy. |
MergeStrategy::Ledger | Ledger merge strategy. |
MergeStrategy::Snapshot | Snapshot merge strategy. |
MergeStrategy::Append
Append merge strategy.
Under this strategy new data will be appended to the dataset in its entirety, without any deduplication.
Property | Type | Required | Format | Description |
---|---|---|---|---|
MergeStrategy::Ledger
Ledger merge strategy.
This strategy should be used for data sources containing ledgers of events. Currently this strategy will only perform deduplication of events using user-specified primary key columns. This means that the source data can contain a partially overlapping set of records, and only those records that were not previously seen will be appended.
Property | Type | Required | Format | Description |
---|---|---|---|---|
primaryKey | array(string) | ✔️ | | Names of the columns that uniquely identify the record throughout its lifetime |
MergeStrategy::Snapshot
Snapshot merge strategy.
This strategy can be used for data state snapshots that are taken periodically and contain only the latest state of the observed entity or system. Over time such snapshots can have new rows added, and old rows either removed or modified.
This strategy transforms snapshot data into an append-only event stream where data already added is immutable. It does so by performing Change Data Capture - essentially diffing the current state of data against the reconstructed previous state and recording differences as retractions or corrections. The Operation Type “op” column will contain:
- append (`+A`) when a row appears for the first time
- retraction (`-D`) when a row disappears
- correction (`-C`, `+C`) when row data has changed, with the `-C` event carrying the old value of the row and `+C` carrying the new value.
To correctly associate rows between old and new snapshots this strategy relies on user-specified primary key columns.
To identify whether a row has changed this strategy will compare all other columns one by one. If the data contains a column that is guaranteed to change whenever any of the data columns changes (for example a last modification timestamp, an incremental version, or a data hash), then it can be specified in the `compareColumns` property to speed up the detection of modified rows.
Property | Type | Required | Format | Description |
---|---|---|---|---|
primaryKey | array(string) | ✔️ | | Names of the columns that uniquely identify the record throughout its lifetime. |
compareColumns | array(string) | | | Names of the columns compared to determine if a row has changed between two snapshots. |
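To make the strategy concrete, consider a hypothetical snapshot with primary key `city` and a single data column `population`. If the previous snapshot contained `(vancouver, 675000)` and the new snapshot contains `(vancouver, 680000)` plus a new row `(kamloops, 100000)`, the strategy emits a `-C` record carrying the old `(vancouver, 675000)` values, a `+C` record carrying `(vancouver, 680000)`, and a `+A` record for `(kamloops, 100000)`; a row present before but absent from the new snapshot would produce a `-D` retraction.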
MqttQos
MQTT quality of service class.
Enum Value |
---|
AtMostOnce |
AtLeastOnce |
ExactlyOnce |
MqttTopicSubscription
MQTT topic subscription parameters.
Property | Type | Required | Format | Description |
---|---|---|---|---|
path | string | ✔️ | | Name of the topic (may include patterns). |
qos | MqttQos | | | Quality of service class. |
OffsetInterval
Describes a range of data as a closed arithmetic interval of offsets
Property | Type | Required | Format | Description |
---|---|---|---|---|
start | integer | ✔️ | uint64 | Start of the closed interval [start; end]. |
end | integer | ✔️ | uint64 | End of the closed interval [start; end]. |
PrepStep
Defines the steps to prepare raw data for ingestion.
Union Type | Description |
---|---|
PrepStep::Decompress | Decompresses the data using the specified compression format. |
PrepStep::Pipe | Executes external command to process the data using piped input/output. |
PrepStep::Decompress
Decompresses the data using the specified compression format.
Property | Type | Required | Format | Description |
---|---|---|---|---|
format | string | ✔️ | | Name of a compression algorithm used on data. |
subPath | string | | | Path to a data file within a multi-file archive. Can contain glob patterns. |
PrepStep::Pipe
Executes external command to process the data using piped input/output.
Property | Type | Required | Format | Description |
---|---|---|---|---|
command | array(string) | ✔️ | | Command to execute and its arguments. |
ReadStep
Defines how raw data should be read into the structured form.
Union Type | Description |
---|---|
ReadStep::Csv | Reader for comma-separated files. |
ReadStep::GeoJson | Reader for GeoJSON files. It expects one FeatureCollection object in the root and will create a record per each Feature inside it extracting the properties into individual columns and leaving the feature geometry in its own column. |
ReadStep::EsriShapefile | Reader for ESRI Shapefile format. |
ReadStep::Parquet | Reader for Apache Parquet format. |
ReadStep::Json | Reader for JSON files that contain an array of objects within them. |
ReadStep::NdJson | Reader for files containing multiple newline-delimited JSON objects with the same schema. |
ReadStep::NdGeoJson | Reader for Newline-delimited GeoJSON files. It is similar to GeoJson format but instead of FeatureCollection object in the root it expects every individual feature object to appear on its own line. |
ReadStep::Csv
Reader for comma-separated files.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
separator | string | | | Sets a single character as a separator for each field and value. |
encoding | string | | | Decodes the CSV files by the given encoding type. |
quote | string | | | Sets a single character used for escaping quoted values where the separator can be part of the value. Set an empty string to turn off quotations. |
escape | string | | | Sets a single character used for escaping quotes inside an already quoted value. |
header | boolean | | | Use the first line as names of columns. |
inferSchema | boolean | | | Infers the input schema automatically from data. It requires one extra pass over the data. |
nullValue | string | | | Sets the string representation of a null value. |
dateFormat | string | | | Sets the string that indicates a date format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
timestampFormat | string | | | Sets the string that indicates a timestamp format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
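For example, a CSV read step with an explicit DDL schema might look like the YAML sketch below; column names and option values are illustrative.

```yaml
# Hypothetical CSV reader configuration
read:
  kind: Csv
  header: true
  separator: ","
  nullValue: "NA"
  timestampFormat: rfc3339
  schema:
    - id BIGINT
    - name STRING
    - reported_at TIMESTAMP
```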
ReadStep::Json
Reader for JSON files that contain an array of objects within them.
Property | Type | Required | Format | Description |
---|---|---|---|---|
subPath | string | | | Path in the form of a.b.c to a sub-element of the root JSON object that is an array of objects. If not specified it is assumed that the root element is an array. |
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
dateFormat | string | | | Sets the string that indicates a date format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
encoding | string | | | Allows forcibly setting one of the standard basic or extended encodings. |
timestampFormat | string | | | Sets the string that indicates a timestamp format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
ReadStep::NdJson
Reader for files containing multiple newline-delimited JSON objects with the same schema.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
dateFormat | string | | | Sets the string that indicates a date format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
encoding | string | | | Allows forcibly setting one of the standard basic or extended encodings. |
timestampFormat | string | | | Sets the string that indicates a timestamp format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
ReadStep::GeoJson
Reader for GeoJSON files. It expects one `FeatureCollection` object in the root and will create a record per each `Feature` inside it extracting the properties into individual columns and leaving the feature geometry in its own column.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
ReadStep::NdGeoJson
Reader for Newline-delimited GeoJSON files. It is similar to `GeoJson` format but instead of `FeatureCollection` object in the root it expects every individual feature object to appear on its own line.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
ReadStep::EsriShapefile
Reader for ESRI Shapefile format.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
subPath | string | | | If the ZIP archive contains multiple shapefiles use this field to specify a sub-path to the desired .shp file. Can contain glob patterns to act as a filter. |
ReadStep::Parquet
Reader for Apache Parquet format.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
RequestHeader
Defines a header (e.g. HTTP) to be passed into some request.
Property | Type | Required | Format | Description |
---|---|---|---|---|
name | string | ✔️ | | Name of the header. |
value | string | ✔️ | | Value of the header. |
SourceCaching
Defines how external data should be cached.
Union Type | Description |
---|---|
SourceCaching::Forever | After source was processed once it will never be ingested again. |
SourceCaching::Forever
After source was processed once it will never be ingested again.
Property | Type | Required | Format | Description |
---|---|---|---|---|
SourceState
The state of the source the data was added from to allow fast resuming.
Property | Type | Required | Format | Description |
---|---|---|---|---|
sourceName | string | ✔️ | | Identifies the source that the state corresponds to. |
kind | string | ✔️ | | Identifies the type of the state. Standard types include: `odf/etag`, `odf/last-modified`. |
value | string | ✔️ | | Opaque value representing the state. |
SqlQueryStep
Defines a query in a multi-step SQL transformation.
Property | Type | Required | Format | Description |
---|---|---|---|---|
alias | string | | | Name of the temporary view that will be created from the result of the query. A step without this alias will be treated as the output of the transformation. |
query | string | ✔️ | | SQL query the result of which will be exposed under the alias. |
TemporalTable
Temporary Flink-specific extension for creating temporal tables from streams.
Property | Type | Required | Format | Description |
---|---|---|---|---|
name | string | ✔️ | | Name of the dataset to be converted into a temporal table. |
primaryKey | array(string) | ✔️ | | Column names used as the primary key for creating a table. |
Transform
Engine-specific processing queries that shape the resulting data.
Union Type | Description |
---|---|
Transform::Sql | Transform using one of the SQL dialects. |
Transform::Sql
Transform using one of the SQL dialects.
Property | Type | Required | Format | Description |
---|---|---|---|---|
engine | string | ✔️ | | Identifier of the engine used for this transformation. |
version | string | | | Version of the engine to use. |
query | string | | | SQL query the result of which will be used as an output. This is a convenience property meant only for defining queries by hand. When stored in the metadata this property will never be set and instead will be converted into a single-item `queries` array. |
queries | array( SqlQueryStep ) | | | Specifies multi-step SQL transformations. Each step acts as a shorthand for `CREATE TEMPORARY VIEW <alias> AS (<query>)`. The last query in the array should have no alias and will be treated as an output. |
temporalTables | array( TemporalTable ) | | | Temporary Flink-specific extension for creating temporal tables from streams. |
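The YAML sketch below illustrates the multi-step `queries` form: each aliased step acts as a temporary view, and the final step without an alias is treated as the output. The engine identifier, aliases, and SQL are hypothetical.

```yaml
# Hypothetical multi-step SQL transformation
transform:
  kind: Sql
  engine: spark                  # illustrative engine identifier
  queries:
    - alias: filtered            # becomes a temporary view named `filtered`
      query: |
        SELECT * FROM cities WHERE population > 100000
    # The last step has no alias and is treated as the output
    - query: |
        SELECT name, population FROM filtered ORDER BY population DESC
```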
TransformInput
Describes a derivative transformation input
Property | Type | Required | Format | Description |
---|---|---|---|---|
datasetRef | string | ✔️ | dataset-ref | A local or remote dataset reference. When block is accepted this MUST be in the form of a DatasetId to guarantee reproducibility, as aliases can change over time. |
alias | string | | | An alias under which this input will be available in queries. Will be populated from datasetRef if not provided before resolving it to DatasetId. |
Watermark
Represents a watermark in the event stream.
Property | Type | Required | Format | Description |
---|---|---|---|---|
systemTime | string | ✔️ | date-time | |
eventTime | string | ✔️ | date-time | |