Supported Data Engines
All data processing in kamu is done by a set of plug-in engines. This allows us to integrate many mature data processing frameworks, while having kamu coordinate all processing to track provenance and ensure verifiability.
Known Engine Implementations
||Apache Spark||Spark Streaming SQL||Repository
|Spark is used in
||Apache Flink||Flink Streaming SQL||Repository
|Flink has the most mature support for stream processing, including stream-to-stream and stream-to-table joins, windowed aggregations, and watermarks. It’s thus the recommended engine for most derivative datasets.|
||Apache Arrow DataFusion||DataFusion Batch SQL||Repository
|Experimental engine that has limited functionality (due to being batch-oriented), but is extremely fast and low-footprint. There are ongoing attempts to add stream processing functionality. DataFusion is also embedded into
* There is currently no way to express nested and GIS data types when declaring root dataset schemas, but you can still use them through pre-processing queries.
** Apache Flink has known issues with the Decimal type and currently relies on our patches that have not been upstreamed yet, so stability is not guaranteed (see FLINK-17804).
|Projection / Temporal Table Joins||❌*||✔️||❌|
* The Spark engine is capable of stream processing, but for now we have to use it in batch processing mode. Only row-level operations like map and filter are currently usable, as those do not require correct stream processing and watermark semantics.
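
The distinction in the last note can be illustrated with a small, engine-independent sketch. It is a toy model in plain Python, not how Spark or Flink are implemented: the row layout (`id`, `t`, `value`), the function names, and the rule for dropping late rows are all assumptions made for illustration. It shows that a map/filter query returns the same rows regardless of arrival order, while a windowed aggregation depends on how the watermark treats late data.

```python
def map_filter(rows):
    """Row-level map + filter: each output row depends on exactly one input row."""
    return [
        {"id": r["id"], "t": r["t"], "value_x2": r["value"] * 2}
        for r in rows
        if r["value"] > 0
    ]


def windowed_counts(rows, window=10, max_lateness=0):
    """Tumbling-window row counts with a toy watermark (illustrative only).

    The watermark trails the largest event time seen so far by `max_lateness`.
    A row is dropped as late if its window ended at or before the watermark.
    """
    counts = {}
    watermark = float("-inf")
    for r in rows:
        start = (r["t"] // window) * window
        if start + window <= watermark:
            continue  # window already closed: the late row is dropped
        counts[start] = counts.get(start, 0) + 1
        watermark = max(watermark, r["t"] - max_lateness)
    return counts


def by_id(rows):
    """Sort rows by id so results can be compared independent of order."""
    return sorted(rows, key=lambda r: r["id"])


# Hypothetical events; `t` is the event time
in_order = [
    {"id": 1, "t": 1, "value": 5},
    {"id": 2, "t": 7, "value": -1},
    {"id": 3, "t": 12, "value": 3},
]
# Same rows, but the row with t=7 arrives after the row with t=12
out_of_order = [in_order[0], in_order[2], in_order[1]]

# Row-level operations: arrival order does not affect the result
print(by_id(map_filter(in_order)) == by_id(map_filter(out_of_order)))

# Windowed aggregation: the result depends on watermark handling
print(windowed_counts(in_order))
print(windowed_counts(out_of_order))
print(windowed_counts(out_of_order, max_lateness=5))
```

In this sketch the out-of-order run loses a row from the first window unless the watermark tolerates enough lateness. Row-level operations avoid this class of problem, which is why they are safe to run in batch mode.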