Reference
Manifests
Manifest
An object that wraps the metadata resources providing versioning and type identification. All root-level resources are wrapped with a manifest when serialized to disk.
Property | Type | Required | Format | Description |
---|---|---|---|---|
kind | string | ✔️ | multicodec | Type of the resource. |
version | integer | ✔️ | | Major version number of the resource contained in this manifest. It provides the mechanism for introducing compatibility breaking changes. |
content | string | ✔️ | flatbuffers | Resource data. |
DatasetSnapshot
Represents a projection of the dataset metadata at a single point in time. This type is typically used for defining new datasets and changing the existing ones.
Property | Type | Required | Format | Description |
---|---|---|---|---|
name | string | ✔️ | dataset-alias | Alias of the dataset. |
kind | DatasetKind | ✔️ | | Type of the dataset. |
metadata | array( MetadataEvent ) | ✔️ | | An array of metadata events that will be used to populate the chain. Here you can define polling and push sources, set licenses, add attachments etc. |
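Below is an illustrative sketch of a DatasetSnapshot wrapped in a Manifest as it could appear when serialized. A YAML rendering is assumed here for readability; the canonical on-disk encoding may differ, and the dataset name, description, and keywords are hypothetical.

```yaml
# Hypothetical example: a root dataset defined via a DatasetSnapshot,
# wrapped in a Manifest (kind + version + content).
kind: DatasetSnapshot
version: 1
content:
  name: example.cities        # dataset-alias (made-up)
  kind: Root
  metadata:
    - kind: SetInfo
      description: Population statistics for a set of cities.  # hypothetical
      keywords:
        - demographics
        - cities
```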
MetadataBlock
An individual block in the metadata chain that captures the history of modifications of a dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
systemTime | string | ✔️ | date-time | System time when this block was written. |
prevBlockHash | string | | multihash | Hash sum of the preceding block. |
sequenceNumber | integer | ✔️ | uint64 | Block sequence number, starting from zero at the seed block. |
event | MetadataEvent | ✔️ | | Event data. |
Metadata Events
MetadataEvent
Represents a transaction that occurred on a dataset.
Union Type | Description |
---|---|
AddData | Indicates that data has been ingested into a root dataset. |
ExecuteTransform | Indicates that derivative transformation has been performed. |
Seed | Establishes the identity of the dataset. Always the first metadata event in the chain. |
SetPollingSource | Contains information on how externally-hosted data can be ingested into the root dataset. |
SetTransform | Defines a transformation that produces data in a derivative dataset. |
SetVocab | Lets you manipulate names of the system columns to avoid conflicts. |
SetAttachments | Associates a set of files with this dataset. |
SetInfo | Provides basic human-readable information about a dataset. |
SetLicense | Defines a license that applies to this dataset. |
SetDataSchema | Specifies the complete schema of Data Slices added to the Dataset following this event. |
AddPushSource | Describes how to ingest data into a root dataset from a certain logical source. |
DisablePushSource | Disables the previously defined source. |
DisablePollingSource | Disables the previously defined polling source. |
AddData
Indicates that data has been ingested into a root dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
prevCheckpoint | string | | multihash | Hash of the checkpoint file used to restore ingestion state, if any. |
prevOffset | integer | | uint64 | Last offset of the previous data slice, if any. Must be equal to the last non-empty `newData.offsetInterval.end`. |
newData | DataSlice | | | Describes output data written during this transaction, if any. |
newCheckpoint | Checkpoint | | | Describes checkpoint written during this transaction, if any. If an engine operation resulted in no updates to the checkpoint, but checkpoint is still relevant for subsequent runs - a hash of the previous checkpoint should be specified. |
newWatermark | string | | date-time | Last watermark of the output data stream, if any. Initial blocks may not have watermarks, but once watermark is set - all subsequent blocks should either carry the same watermark or specify a new (greater) one. Thus, watermarks are monotonically non-decreasing. |
newSourceState | SourceState | | | The state of the source the data was added from to allow fast resuming. If the state did not change but is still relevant for subsequent runs it should be carried, i.e. only the last state per source is considered when resuming. |
AddPushSource
Describes how to ingest data into a root dataset from a certain logical source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
sourceName | string | ✔️ | | Identifies the source within this dataset. |
read | ReadStep | ✔️ | | Defines how data is read into structured format. |
preprocess | Transform | | | Pre-processing query that shapes the data. |
merge | MergeStrategy | ✔️ | | Determines how newly-ingested data should be merged with existing history. |
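An illustrative YAML sketch of an AddPushSource event follows; the source name, schema columns, and choice of NdJson reader and Append merge strategy are assumptions made for the example.

```yaml
# Hypothetical push source: clients push newline-delimited JSON records
# that are appended to the dataset without deduplication.
kind: AddPushSource
sourceName: default           # made-up source name
read:
  kind: NdJson
  schema:
    - id BIGINT
    - reported_at TIMESTAMP
    - value STRING
merge:
  kind: Append
```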
DisablePollingSource
Disables the previously defined polling source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
DisablePushSource
Disables the previously defined source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
sourceName | string | ✔️ | | Identifies the source to be disabled. |
ExecuteTransform
Indicates that derivative transformation has been performed.
Property | Type | Required | Format | Description |
---|---|---|---|---|
queryInputs | array( ExecuteTransformInput ) | ✔️ | | Defines inputs used in this transaction. Slices corresponding to every input dataset must be present. |
prevCheckpoint | string | | multihash | Hash of the checkpoint file used to restore transformation state, if any. |
prevOffset | integer | | uint64 | Last offset of the previous data slice, if any. Must be equal to the last non-empty `newData.offsetInterval.end`. |
newData | DataSlice | | | Describes output data written during this transaction, if any. |
newCheckpoint | Checkpoint | | | Describes checkpoint written during this transaction, if any. If an engine operation resulted in no updates to the checkpoint, but checkpoint is still relevant for subsequent runs - a hash of the previous checkpoint should be specified. |
newWatermark | string | | date-time | Last watermark of the output data stream, if any. Initial blocks may not have watermarks, but once watermark is set - all subsequent blocks should either carry the same watermark or specify a new (greater) one. Thus, watermarks are monotonically non-decreasing. |
Seed
Establishes the identity of the dataset. Always the first metadata event in the chain.
Property | Type | Required | Format | Description |
---|---|---|---|---|
datasetId | string | ✔️ | dataset-id | Unique identity of the dataset. |
datasetKind | DatasetKind | ✔️ | | Type of the dataset. |
SetAttachments
Associates a set of files with this dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
attachments | Attachments | ✔️ | | One of the supported attachment sources. |
SetDataSchema
Specifies the complete schema of Data Slices added to the Dataset following this event.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | string | ✔️ | flatbuffers | Apache Arrow schema encoded in its native flatbuffers representation. |
SetInfo
Provides basic human-readable information about a dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
description | string | | | Brief single-sentence summary of a dataset. |
keywords | array(string) | | | Keywords, search terms, or tags used to describe the dataset. |
SetLicense
Defines a license that applies to this dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
shortName | string | ✔️ | | Abbreviated name of the license. |
name | string | ✔️ | | Full name of the license. |
spdxId | string | | | License identifier from the SPDX License List. |
websiteUrl | string | ✔️ | url | |
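For example, a SetLicense event could look like the YAML sketch below; the specific license is only illustrative, and a YAML rendering of the event is assumed.

```yaml
# Illustrative license declaration
kind: SetLicense
shortName: CC-BY-4.0
name: Creative Commons Attribution 4.0 International
spdxId: CC-BY-4.0
websiteUrl: https://creativecommons.org/licenses/by/4.0/
```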
SetPollingSource
Contains information on how externally-hosted data can be ingested into the root dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
fetch | FetchStep | ✔️ | | Determines where data is sourced from. |
prepare | array( PrepStep ) | | | Defines how raw data is prepared before reading. |
read | ReadStep | ✔️ | | Defines how data is read into structured format. |
preprocess | Transform | | | Pre-processing query that shapes the data. |
merge | MergeStrategy | ✔️ | | Determines how newly-ingested data should be merged with existing history. |
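The YAML sketch below shows one plausible SetPollingSource configuration: fetching a CSV file by URL, reading it with a header row, and merging it as a ledger keyed by an id column. The URL, schema, and primary key are hypothetical.

```yaml
# Hypothetical polling source: periodically fetch a CSV file over HTTP
kind: SetPollingSource
fetch:
  kind: Url
  url: https://example.org/data/cities.csv   # made-up URL
read:
  kind: Csv
  header: true
  schema:
    - id BIGINT
    - name STRING
    - population BIGINT
merge:
  kind: Ledger
  primaryKey:
    - id
```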
SetTransform
Defines a transformation that produces data in a derivative dataset.
Property | Type | Required | Format | Description |
---|---|---|---|---|
inputs | array( TransformInput ) | ✔️ | | Datasets that will be used as sources. |
transform | Transform | ✔️ | | Transformation that will be applied to produce new data. |
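An illustrative YAML sketch of a SetTransform event for a derivative dataset; the input reference, alias, engine identifier, and query are assumptions made for the example.

```yaml
# Hypothetical derivative transformation over a single input
kind: SetTransform
inputs:
  - datasetRef: example.cities   # made-up input reference
    alias: cities
transform:
  kind: Sql
  engine: datafusion             # illustrative engine identifier
  query: |
    SELECT id, name, population
    FROM cities
    WHERE population > 100000
```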
SetVocab
Lets you manipulate names of the system columns to avoid conflicts.
Property | Type | Required | Format | Description |
---|---|---|---|---|
offsetColumn | string | | | Name of the offset column. |
operationTypeColumn | string | | | Name of the operation type column. |
systemTimeColumn | string | | | Name of the system time column. |
eventTimeColumn | string | | | Name of the event time column. |
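For instance, a dataset whose source data already uses a column named `offset` could remap the system columns, as in this hypothetical YAML sketch (column names are made up).

```yaml
# Hypothetical vocabulary override to avoid column name clashes
kind: SetVocab
offsetColumn: record_offset    # made-up column name
eventTimeColumn: reported_at   # made-up column name
```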
Engine Protocol
RawQueryRequest
Sent by the coordinator to an engine to perform a query on raw input data, usually as part of the ingest preprocessing step.
Property | Type | Required | Format | Description |
---|---|---|---|---|
inputDataPaths | array(string) | ✔️ | | Paths to input data files to perform query over. Must all have identical schema. |
transform | Transform | ✔️ | | Transformation that will be applied to produce new data. |
outputDataPath | string | ✔️ | path | Path where query result will be written. |
RawQueryResponse
Sent by an engine to the coordinator when performing the raw query operation.
Union Type | Description |
---|---|
RawQueryResponse::Progress | |
RawQueryResponse::Success | |
RawQueryResponse::InvalidQuery | |
RawQueryResponse::InternalError |
RawQueryResponse::Progress
Property | Type | Required | Format | Description |
---|---|---|---|---|
RawQueryResponse::Success
Property | Type | Required | Format | Description |
---|---|---|---|---|
numRecords | integer | ✔️ | uint64 | Number of records produced by the query |
RawQueryResponse::InvalidQuery
Property | Type | Required | Format | Description |
---|---|---|---|---|
message | string | ✔️ | | Explanation of an error |
RawQueryResponse::InternalError
Property | Type | Required | Format | Description |
---|---|---|---|---|
message | string | ✔️ | | Brief description of an error |
backtrace | string | | | Details of an error (e.g. a backtrace) |
TransformRequest
Sent by the coordinator to an engine to perform the next step of data transformation
Property | Type | Required | Format | Description |
---|---|---|---|---|
datasetId | string | ✔️ | dataset-id | Unique identifier of the output dataset. |
datasetAlias | string | ✔️ | dataset-alias | Alias of the output dataset, for logging purposes only. |
systemTime | string | ✔️ | date-time | System time to use for new records. |
vocab | DatasetVocabulary | ✔️ | | |
transform | Transform | ✔️ | | Transformation that will be applied to produce new data. |
queryInputs | array( TransformRequestInput ) | ✔️ | | Defines inputs used in this transaction. Slices corresponding to every input dataset must be present. |
nextOffset | integer | ✔️ | uint64 | Starting offset to use for new data records. |
prevCheckpointPath | string | | path | TODO: This will be removed when coordinator will be speaking to engines purely through Arrow. |
newCheckpointPath | string | ✔️ | path | TODO: This will be removed when coordinator will be speaking to engines purely through Arrow. |
newDataPath | string | ✔️ | path | TODO: This will be removed when coordinator will be speaking to engines purely through Arrow. |
TransformRequestInput
Sent as part of the engine transform request operation to describe the input
Property | Type | Required | Format | Description |
---|---|---|---|---|
datasetId | string | ✔️ | dataset-id | Unique identifier of the dataset. |
datasetAlias | string | ✔️ | dataset-alias | Alias of the output dataset, for logging purposes only. |
queryAlias | string | ✔️ | | An alias of this input to be used in queries. |
vocab | DatasetVocabulary | ✔️ | | |
offsetInterval | OffsetInterval | | | Subset of data that goes into this transaction. |
dataPaths | array(string) | ✔️ | | TODO: This will be removed when coordinator will be slicing data for the engine. |
schemaFile | string | ✔️ | path | TODO: replace with actual DDL or Parquet schema. |
explicitWatermarks | array( Watermark ) | ✔️ | | |
TransformResponse
Sent by an engine to the coordinator when performing the data transformation.
Union Type | Description |
---|---|
TransformResponse::Progress | |
TransformResponse::Success | |
TransformResponse::InvalidQuery | |
TransformResponse::InternalError |
TransformResponse::Progress
Property | Type | Required | Format | Description |
---|---|---|---|---|
TransformResponse::Success
Property | Type | Required | Format | Description |
---|---|---|---|---|
newOffsetInterval | OffsetInterval | | | Data slice produced by the transaction, if any. |
newWatermark | string | | date-time | Watermark advanced by the transaction, if any. |
TransformResponse::InvalidQuery
Property | Type | Required | Format | Description |
---|---|---|---|---|
message | string | ✔️ | | Explanation of an error |
TransformResponse::InternalError
Property | Type | Required | Format | Description |
---|---|---|---|---|
message | string | ✔️ | | Brief description of an error |
backtrace | string | | | Details of an error (e.g. a backtrace) |
Fragments
AttachmentEmbedded
Embedded attachment item.
Property | Type | Required | Format | Description |
---|---|---|---|---|
path | string | ✔️ | | Path to an attachment if it was materialized into a file. |
content | string | ✔️ | | Content of the attachment. |
Attachments
Defines the source of attachment files.
Union Type | Description |
---|---|
Attachments::Embedded | For attachments that are specified inline and are embedded in the metadata. |
Attachments::Embedded
For attachments that are specified inline and are embedded in the metadata.
Property | Type | Required | Format | Description |
---|---|---|---|---|
items | array( AttachmentEmbedded ) | ✔️ | | |
Checkpoint
Describes a checkpoint produced by an engine
Property | Type | Required | Format | Description |
---|---|---|---|---|
physicalHash | string | ✔️ | multihash | Hash sum of the checkpoint file. |
size | integer | ✔️ | uint64 | Size of checkpoint file in bytes. |
DataSlice
Describes a slice of data added to a dataset or produced via transformation
Property | Type | Required | Format | Description |
---|---|---|---|---|
logicalHash | string | ✔️ | multihash | Logical hash sum of the data in this slice. |
physicalHash | string | ✔️ | multihash | Hash sum of the data part file. |
offsetInterval | OffsetInterval | ✔️ | | Data slice produced by the transaction. |
size | integer | ✔️ | uint64 | Size of data file in bytes. |
DatasetKind
Represents type of the dataset.
Enum Value |
---|
Root |
Derivative |
DatasetVocabulary
Specifies the mapping of system columns onto dataset schema.
Property | Type | Required | Format | Description |
---|---|---|---|---|
offsetColumn | string | ✔️ | | Name of the offset column. |
operationTypeColumn | string | ✔️ | | Name of the operation type column. |
systemTimeColumn | string | ✔️ | | Name of the system time column. |
eventTimeColumn | string | ✔️ | | Name of the event time column. |
EnvVar
Defines an environment variable passed into some job.
Property | Type | Required | Format | Description |
---|---|---|---|---|
name | string | ✔️ | | Name of the variable. |
value | string | | | Value of the variable. |
EventTimeSource
Defines how the event time is extracted from the source.
Union Type | Description |
---|---|
EventTimeSource::FromMetadata | Extracts event time from the source’s metadata. |
EventTimeSource::FromPath | Extracts event time from the path component of the source. |
EventTimeSource::FromSystemTime | Assigns event time from the system time source. |
EventTimeSource::FromMetadata
Extracts event time from the source’s metadata.
Property | Type | Required | Format | Description |
---|---|---|---|---|
EventTimeSource::FromSystemTime
Assigns event time from the system time source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
EventTimeSource::FromPath
Extracts event time from the path component of the source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
pattern | string | ✔️ | regex | Regular expression where first group contains the timestamp string. |
timestampFormat | string | | | Format of the expected timestamp in java.text.SimpleDateFormat form. |
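As an illustration, a source whose files are named like data-2024-03-01.csv could extract event time from the path as sketched below in an assumed YAML rendering; the pattern and format values are hypothetical, not part of the specification.

```yaml
# Hypothetical event time extraction from the file path
eventTime:
  kind: FromPath
  pattern: 'data-(\d{4}-\d{2}-\d{2})\.csv'   # first capture group holds the date
  timestampFormat: yyyy-MM-dd                # java.text.SimpleDateFormat syntax
```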
ExecuteTransformInput
Describes a slice of the input dataset used during a transformation
Property | Type | Required | Format | Description |
---|---|---|---|---|
datasetId | string | ✔️ | dataset-id | Input dataset identifier. |
prevBlockHash | string | | multihash | Last block of the input dataset that was previously incorporated into the derivative transformation, if any. Must be equal to the last non-empty `newBlockHash`. Together with `newBlockHash` defines a half-open (prevBlockHash, newBlockHash] interval of blocks that will be considered in this transaction. |
newBlockHash | string | | multihash | Hash of the last block that will be incorporated into the derivative transformation. When present, defines a half-open (prevBlockHash, newBlockHash] interval of blocks that will be considered in this transaction. |
prevOffset | integer | | uint64 | Last data record offset in the input dataset that was previously incorporated into the derivative transformation, if any. Must be equal to the last non-empty `newOffset`. Together with `newOffset` defines a half-open (prevOffset, newOffset] interval of data records that will be considered in this transaction. |
newOffset | integer | | uint64 | Offset of the last data record that will be incorporated into the derivative transformation, if any. When present, defines a half-open (prevOffset, newOffset] interval of data records that will be considered in this transaction. |
FetchStep
Defines the external source of data.
Union Type | Description |
---|---|
FetchStep::Url | Pulls data from one of the supported sources by its URL. |
FetchStep::FilesGlob | Uses glob operator to match files on the local file system. |
FetchStep::Container | Runs the specified OCI container to fetch data from an arbitrary source. |
FetchStep::Mqtt | Connects to an MQTT broker to fetch events from the specified topic. |
FetchStep::EthereumLogs | Connects to an Ethereum node to stream transaction logs. |
FetchStep::Url
Pulls data from one of the supported sources by its URL.
Property | Type | Required | Format | Description |
---|---|---|---|---|
url | string | ✔️ | url | URL of the data source |
eventTime | EventTimeSource | | | Describes how event time is extracted from the source metadata. |
cache | SourceCaching | | | Describes the caching settings used for this source. |
headers | array( RequestHeader ) | | | Headers to pass during the request (e.g. HTTP Authorization) |
FetchStep::FilesGlob
Uses glob operator to match files on the local file system.
Property | Type | Required | Format | Description |
---|---|---|---|---|
path | string | ✔️ | | Path with a glob pattern. |
eventTime | EventTimeSource | | | Describes how event time is extracted from the source metadata. |
cache | SourceCaching | | | Describes the caching settings used for this source. |
order | string | | | Specifies how input files should be ordered before ingestion. Order is important as every file will be processed individually and will advance the dataset’s watermark. |
FetchStep::Container
Runs the specified OCI container to fetch data from an arbitrary source.
Property | Type | Required | Format | Description |
---|---|---|---|---|
image | string | ✔️ | | Image name and an optional tag. |
command | array(string) | | | Specifies the entrypoint. Not executed within a shell. The default OCI image’s ENTRYPOINT is used if this is not provided. |
args | array(string) | | | Arguments to the entrypoint. The OCI image’s CMD is used if this is not provided. |
env | array( EnvVar ) | | | Environment variables to propagate into or set in the container. |
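A hedged YAML sketch of a container-based fetch step; the image name, arguments, and environment variable are hypothetical.

```yaml
# Hypothetical container fetch step
fetch:
  kind: Container
  image: ghcr.io/example/fetcher:1.0   # made-up image reference
  args:
    - --since-date
    - "2024-01-01"
  env:
    - name: API_TOKEN                  # value omitted - may be propagated from the environment
```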
FetchStep::Mqtt
Connects to an MQTT broker to fetch events from the specified topic.
Property | Type | Required | Format | Description |
---|---|---|---|---|
host | string | ✔️ | | Hostname of the MQTT broker. |
port | integer | ✔️ | | Port of the MQTT broker. |
username | string | | | Username to use for auth with the broker. |
password | string | | | Password to use for auth with the broker (can be templated). |
topics | array( MqttTopicSubscription ) | ✔️ | | List of topic subscription parameters. |
FetchStep::EthereumLogs
Connects to an Ethereum node to stream transaction logs.
Property | Type | Required | Format | Description |
---|---|---|---|---|
chainId | integer | | uint64 | Identifier of the chain to scan logs from. This parameter may be used for RPC endpoint lookup as well as asserting that provided `nodeUrl` corresponds to the expected chain. |
nodeUrl | string | | url | Url of the node. |
filter | string | | | An SQL WHERE clause that can be used to pre-filter the logs before fetching them from the ETH node. |
signature | string | | | Solidity log event signature to use for decoding. Using this field adds `event` to the output containing decoded log as JSON. |
MergeStrategy
Merge strategy determines how newly ingested data should be combined with the data that already exists in the dataset.
Union Type | Description |
---|---|
MergeStrategy::Append | Append merge strategy. |
MergeStrategy::Ledger | Ledger merge strategy. |
MergeStrategy::Snapshot | Snapshot merge strategy. |
MergeStrategy::Append
Append merge strategy.
Under this strategy new data will be appended to the dataset in its entirety, without any deduplication.
Property | Type | Required | Format | Description |
---|---|---|---|---|
MergeStrategy::Ledger
Ledger merge strategy.
This strategy should be used for data sources containing ledgers of events. Currently this strategy will only perform deduplication of events using user-specified primary key columns. This means that the source data can contain a partially overlapping set of records, and only those records that were not previously seen will be appended.
Property | Type | Required | Format | Description |
---|---|---|---|---|
primaryKey | array(string) | ✔️ | | Names of the columns that uniquely identify the record throughout its lifetime |
MergeStrategy::Snapshot
Snapshot merge strategy.
This strategy can be used for data state snapshots that are taken periodically and contain only the latest state of the observed entity or system. Over time such snapshots can have new rows added, and old rows either removed or modified.
This strategy transforms snapshot data into an append-only event stream where data already added is immutable. It does so by performing Change Data Capture - essentially diffing the current state of data against the reconstructed previous state and recording differences as retractions or corrections. The Operation Type “op” column will contain:
- append (`+A`) when a row appears for the first time
- retraction (`-D`) when a row disappears
- correction (`-C`, `+C`) when row data has changed, with the `-C` event carrying the old value of the row and `+C` carrying the new value.
To correctly associate rows between old and new snapshots this strategy relies on user-specified primary key columns.
To identify whether a row has changed this strategy will compare all other columns one by one. If the data contains a column that is guaranteed to change whenever any of the data columns changes (for example a last modification timestamp, an incremental version, or a data hash), then it can be specified in the `compareColumns` property to speed up the detection of modified rows.
Property | Type | Required | Format | Description |
---|---|---|---|---|
primaryKey | array(string) | ✔️ | | Names of the columns that uniquely identify the record throughout its lifetime. |
compareColumns | array(string) | | | Names of the columns compared to determine if a row has changed between two snapshots. |
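To make the strategy concrete, consider a hypothetical snapshot with primary key `city` and a single data column `population`. If the previous snapshot contained `(vancouver, 675000)` and the new snapshot contains `(vancouver, 680000)` plus a new row `(kamloops, 100000)`, the strategy emits a `-C` record carrying the old `(vancouver, 675000)` values, a `+C` record carrying `(vancouver, 680000)`, and a `+A` record for `(kamloops, 100000)`; a row present before but absent from the new snapshot would produce a `-D` retraction.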
MqttQos
MQTT quality of service class.
Enum Value |
---|
AtMostOnce |
AtLeastOnce |
ExactlyOnce |
MqttTopicSubscription
MQTT topic subscription parameters.
Property | Type | Required | Format | Description |
---|---|---|---|---|
path | string | ✔️ | | Name of the topic (may include patterns). |
qos | MqttQos | | | Quality of service class. |
OffsetInterval
Describes a range of data as a closed arithmetic interval of offsets
Property | Type | Required | Format | Description |
---|---|---|---|---|
start | integer | ✔️ | uint64 | Start of the closed interval [start; end]. |
end | integer | ✔️ | uint64 | End of the closed interval [start; end]. |
PrepStep
Defines the steps to prepare raw data for ingestion.
Union Type | Description |
---|---|
PrepStep::Decompress | Decompresses the data using the specified compression format. |
PrepStep::Pipe | Executes external command to process the data using piped input/output. |
PrepStep::Decompress
Decompresses the data using the specified compression format.
Property | Type | Required | Format | Description |
---|---|---|---|---|
format | string | ✔️ | | Name of a compression algorithm used on data. |
subPath | string | | | Path to a data file within a multi-file archive. Can contain glob patterns. |
PrepStep::Pipe
Executes external command to process the data using piped input/output.
Property | Type | Required | Format | Description |
---|---|---|---|---|
command | array(string) | ✔️ | | Command to execute and its arguments. |
ReadStep
Defines how raw data should be read into the structured form.
Union Type | Description |
---|---|
ReadStep::Csv | Reader for comma-separated files. |
ReadStep::GeoJson | Reader for GeoJSON files. It expects one FeatureCollection object in the root and will create a record per each Feature inside it extracting the properties into individual columns and leaving the feature geometry in its own column. |
ReadStep::EsriShapefile | Reader for ESRI Shapefile format. |
ReadStep::Parquet | Reader for Apache Parquet format. |
ReadStep::Json | Reader for JSON files that contain an array of objects within them. |
ReadStep::NdJson | Reader for files containing multiple newline-delimited JSON objects with the same schema. |
ReadStep::NdGeoJson | Reader for Newline-delimited GeoJSON files. It is similar to GeoJson format but instead of FeatureCollection object in the root it expects every individual feature object to appear on its own line. |
ReadStep::Csv
Reader for comma-separated files.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
separator | string | | | Sets a single character as a separator for each field and value. |
encoding | string | | | Decodes the CSV files by the given encoding type. |
quote | string | | | Sets a single character used for escaping quoted values where the separator can be part of the value. Set an empty string to turn off quotations. |
escape | string | | | Sets a single character used for escaping quotes inside an already quoted value. |
header | boolean | | | Use the first line as names of columns. |
inferSchema | boolean | | | Infers the input schema automatically from data. It requires one extra pass over the data. |
nullValue | string | | | Sets the string representation of a null value. |
dateFormat | string | | | Sets the string that indicates a date format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
timestampFormat | string | | | Sets the string that indicates a timestamp format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
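For example, a CSV read step with an explicit DDL schema might look like the YAML sketch below; column names and option values are illustrative.

```yaml
# Hypothetical CSV reader configuration
read:
  kind: Csv
  header: true
  separator: ","
  nullValue: "NA"
  timestampFormat: rfc3339
  schema:
    - id BIGINT
    - name STRING
    - reported_at TIMESTAMP
```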
ReadStep::Json
Reader for JSON files that contain an array of objects within them.
Property | Type | Required | Format | Description |
---|---|---|---|---|
subPath | string | | | Path in the form of a.b.c to a sub-element of the root JSON object that is an array of objects. If not specified it is assumed that the root element is an array. |
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
dateFormat | string | | | Sets the string that indicates a date format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
encoding | string | | | Allows forcibly setting one of the standard basic or extended encodings. |
timestampFormat | string | | | Sets the string that indicates a timestamp format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
ReadStep::NdJson
Reader for files containing multiple newline-delimited JSON objects with the same schema.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
dateFormat | string | | | Sets the string that indicates a date format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
encoding | string | | | Allows forcibly setting one of the standard basic or extended encodings. |
timestampFormat | string | | | Sets the string that indicates a timestamp format. The rfc3339 is the only required format, the other format strings are implementation-specific. |
ReadStep::GeoJson
Reader for GeoJSON files. It expects one `FeatureCollection` object in the root and will create a record per each `Feature` inside it extracting the properties into individual columns and leaving the feature geometry in its own column.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
ReadStep::NdGeoJson
Reader for Newline-delimited GeoJSON files. It is similar to `GeoJson` format but instead of `FeatureCollection` object in the root it expects every individual feature object to appear on its own line.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
ReadStep::EsriShapefile
Reader for ESRI Shapefile format.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
subPath | string | | | If the ZIP archive contains multiple shapefiles use this field to specify a sub-path to the desired .shp file. Can contain glob patterns to act as a filter. |
ReadStep::Parquet
Reader for Apache Parquet format.
Property | Type | Required | Format | Description |
---|---|---|---|---|
schema | array(string) | | | A DDL-formatted schema. Schema can be used to coerce values into more appropriate data types. |
RequestHeader
Defines a header (e.g. HTTP) to be passed into some request.
Property | Type | Required | Format | Description |
---|---|---|---|---|
name | string | ✔️ | | Name of the header. |
value | string | ✔️ | | Value of the header. |
SourceCaching
Defines how external data should be cached.
Union Type | Description |
---|---|
SourceCaching::Forever | After source was processed once it will never be ingested again. |
SourceCaching::Forever
After source was processed once it will never be ingested again.
Property | Type | Required | Format | Description |
---|---|---|---|---|
SourceState
The state of the source the data was added from to allow fast resuming.
Property | Type | Required | Format | Description |
---|---|---|---|---|
sourceName | string | ✔️ | | Identifies the source that the state corresponds to. |
kind | string | ✔️ | | Identifies the type of the state. Standard types include: `odf/etag`, `odf/last-modified`. |
value | string | ✔️ | | Opaque value representing the state. |
SqlQueryStep
Defines a query in a multi-step SQL transformation.
Property | Type | Required | Format | Description |
---|---|---|---|---|
alias | string | | | Name of the temporary view that will be created from the result of the query. A step without this alias will be treated as the output of the transformation. |
query | string | ✔️ | | SQL query the result of which will be exposed under the alias. |
TemporalTable
Temporary Flink-specific extension for creating temporal tables from streams.
Property | Type | Required | Format | Description |
---|---|---|---|---|
name | string | ✔️ | | Name of the dataset to be converted into a temporal table. |
primaryKey | array(string) | ✔️ | | Column names used as the primary key for creating a table. |
Transform
Engine-specific processing queries that shape the resulting data.
Union Type | Description |
---|---|
Transform::Sql | Transform using one of the SQL dialects. |
Transform::Sql
Transform using one of the SQL dialects.
Property | Type | Required | Format | Description |
---|---|---|---|---|
engine | string | ✔️ | | Identifier of the engine used for this transformation. |
version | string | | | Version of the engine to use. |
query | string | | | SQL query the result of which will be used as an output. This is a convenience property meant only for defining queries by hand. When stored in the metadata this property will never be set and instead will be converted into a single-item `queries` array. |
queries | array( SqlQueryStep ) | | | Specifies multi-step SQL transformations. Each step acts as a shorthand for `CREATE TEMPORARY VIEW <alias> AS (<query>)`. The last query in the array should have no alias and will be treated as an output. |
temporalTables | array( TemporalTable ) | | | Temporary Flink-specific extension for creating temporal tables from streams. |
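The YAML sketch below illustrates the multi-step `queries` form: each aliased step acts as a temporary view, and the final step without an alias is treated as the output. The engine identifier, aliases, and SQL are hypothetical.

```yaml
# Hypothetical multi-step SQL transformation
transform:
  kind: Sql
  engine: spark                  # illustrative engine identifier
  queries:
    - alias: filtered            # becomes a temporary view named `filtered`
      query: |
        SELECT * FROM cities WHERE population > 100000
    # The last step has no alias and is treated as the output
    - query: |
        SELECT name, population FROM filtered ORDER BY population DESC
```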
TransformInput
Describes a derivative transformation input
Property | Type | Required | Format | Description |
---|---|---|---|---|
datasetRef | string | ✔️ | dataset-ref | A local or remote dataset reference. When block is accepted this MUST be in the form of a DatasetId to guarantee reproducibility, as aliases can change over time. |
alias | string | | | An alias under which this input will be available in queries. Will be populated from datasetRef if not provided before resolving it to DatasetId. |
Watermark
Represents a watermark in the event stream.
Property | Type | Required | Format | Description |
---|---|---|---|---|
systemTime | string | ✔️ | date-time | |
eventTime | string | ✔️ | date-time | |