# Supported Data Engines

> Describes the strengths and development state of the different engines supported by kamu

export const Diagram = ({src, alt}) => {
  return <div style={{
    display: "flex",
    flexDirection: "column",
    alignItems: "center"
  }}>
    <img src={src} alt={alt} style={{
    background: "#dddddd",
    marginBottom: 0
  }} />
    <span>{alt}</span>
  </div>;
};

export const YouTubeList = ({id}) => {
  const src = `https://www.youtube.com/embed/videoseries?list=${id}`;
  return <iframe className="w-full aspect-video rounded-xl" src={src} allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen></iframe>;
};

export const YouTube = ({id, width}) => {
  const src = `https://www.youtube.com/embed/${id}`;
  return <iframe className="w-full aspect-video rounded-xl" src={src} allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen width={width}></iframe>;
};

export const Schema = ({t, id}) => {
  const anchor = id ? id : t.toLowerCase().replace(/\s+/g, "-");
  const link = `/odf/schemas#${anchor}`;
  return <a className="schema-object" href={link}>{t}</a>;
};

export const Term = ({t, id}) => {
  const anchor = id ? id : t.toLowerCase().replace(/\s+/g, "-");
  const link = `/general/glossary#${anchor}`;
  return <a className="glossary-term" href={link}>{t}</a>;
};

All data processing in `kamu` is done by a set of plug-in <Term t="engines" />. This lets us integrate many mature data processing frameworks and use them to transform data, while `kamu` coordinates the advanced aspects of processing: it tracks <Term t="provenance" />, ensures <Term t="verifiability" />, and so on.

<Note>
  The opinions below relate to the ODF adapters implemented on top of each engine, not to the engines themselves. The engines featured here have very different designs, which makes each of them more suitable for some tasks than for others. The information below is intended as rough guidance for choosing an engine within ODF and should be taken with a large grain of salt.
</Note>

## Known Engine Implementations

| Name         |                               Technology                              |                                                                       Query Dialect                                                                      |                                                            Links                                                            | Notes                                                                                                                                                                                                                                                                                                                                                      |
| ------------ | :-------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------: | :-------------------------------------------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `spark`      |               [Apache Spark](https://spark.apache.org/)               | [Spark Streaming SQL](https://spark.apache.org/docs/latest/sql-ref.html) with [Sedona GIS Extensions](https://sedona.apache.org/1.5.0/api/sql/Overview/) |      [Repository](https://github.com/kamu-data/kamu-engine-spark)<br />[Image](https://ghcr.io/kamu-data/engine-spark)      | Spark is used in `kamu-cli` for all data ingestion and is the default (but not the only) engine for the SQL shell. Spark is also used in combination with Livy to query data from embedded Jupyter Notebooks. It is currently the only engine that supports **GIS data**, via its Apache Sedona integration. |
| `flink`      |               [Apache Flink](https://flink.apache.org/)               |                     [Flink Streaming SQL](https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/sql/gettingstarted/)                     |      [Repository](https://github.com/kamu-data/kamu-engine-flink)<br />[Image](https://ghcr.io/kamu-data/engine-flink)      | Flink has the most mature support for stream processing, including stream-to-stream and stream-to-table joins, windowed aggregations, and watermarks. It is thus the recommended engine for most derivative datasets. |
| `datafusion` | [Apache Arrow DataFusion](https://github.com/apache/arrow-datafusion) |                                      [DataFusion SQL](https://arrow.apache.org/datafusion/user-guide/sql/index.html)                                     | [Repository](https://github.com/kamu-data/kamu-engine-datafusion)<br />[Image](https://ghcr.io/kamu-data/engine-datafusion) | An extremely fast and low-footprint **batch processing** engine. DataFusion is also embedded into `kamu-cli` and is used for data ingestion, ad-hoc SQL queries, and the SQL console. Despite being a batch-only engine, it can still be used in pipelines for simple map/filter/union operations where temporal semantics are not needed. There are [ongoing attempts](https://github.com/apache/arrow-datafusion/issues/4285) to add stream processing functionality to DataFusion, and we are aiming to make data ingestion use DataFusion instead of Spark in most cases. |
| `risingwave` |       [RisingWave](https://github.com/risingwavelabs/risingwave)      |                                        [RisingWave SQL](https://docs.risingwave.com/docs/current/sql-references/)                                        | [Repository](https://github.com/kamu-data/kamu-engine-risingwave)<br />[Image](https://ghcr.io/kamu-data/engine-risingwave) | Experimental stream processing engine. |

### Schema Support

| Feature      | kamu | Spark |  Flink | DataFusion | RisingWave |
| ------------ | :--: | :---: | :----: | :--------: | :--------: |
| Basic types  |  ✔️  |   ✔️  |   ✔️   |     ✔️     |     ✔️     |
| Decimal type |  ✔️  |   ✔️  | ✔️\*\* |     ✔️     |     ✔️     |
| Nested types | ✔️\* |   ✔️  |    ❌   |   ❔\*\*\*  |   ❔\*\*\*  |
| GIS types    | ✔️\* |   ✔️  |    ❌   |      ❌     |      ❌     |

✔️\* - There is currently no way to express nested and GIS data types when declaring root dataset schemas, but you can still use them through pre-processing queries.

✔️\*\* - Apache Flink has known issues with the Decimal type (see [FLINK-17804](https://issues.apache.org/jira/browse/FLINK-17804)) and currently relies on our patches, which have not been upstreamed yet, so stability is not guaranteed.

❔\*\*\* - Engine capability exists but requires more integration testing.

### Operation Types

Note that ODF always operates in <Term t="event time" />, so all temporal aggregations and joins have to be supported by the engine in event-time processing mode.

| Feature                       | Spark | Flink | DataFusion | RisingWave |
| ----------------------------- | :---: | :---: | :--------: | :--------: |
| Filter                        |   ✔️  |   ✔️  |      ✅     |      ✅     |
| Map                           |   ✔️  |   ✔️  |      ✅     |      ✅     |
| Aggregation: Window functions |   ❌   |   ✔️  |      ❌     |      ✅     |
| Aggregation: Tumbling windows |   ❌   |   ✔️  |      ❌     |      ✅     |
| Aggregation: Top-N            |   ❌   |   ❔   |      ❌     |      ✅     |
| Join: Windowed                |   ❌   |   ✔️  |      ❌     |      ❔     |
| Join: Temporal Table          |   ❌   |   ✔️  |      ❌     |      ❔     |
| GIS extensions                |   ✅   |   ❌   |      ❌     |      ❌     |

✔️ - supported<br />
✅ - supported and recommended<br />
❌ - not supported<br />
❔ - engine capability exists but requires more integration testing
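
To show how the table above can be read as a decision aid, here is a minimal Python sketch that transcribes it into a lookup and lists the engines covering a given set of operations. The names `OPERATIONS`, `candidate_engines`, and the operation keys are hypothetical and exist only for this illustration; they are not part of `kamu` or ODF, and the data simply mirrors the table at the time of writing.

```python
# Support levels, matching the legend above.
RECOMMENDED = "recommended"   # ✅
SUPPORTED = "supported"       # ✔️
UNTESTED = "untested"         # ❔
UNSUPPORTED = "unsupported"   # ❌

# Column order of the "Operation Types" table.
ENGINES = ("spark", "flink", "datafusion", "risingwave")

# Hypothetical keys; one entry per table row, values follow ENGINES order.
OPERATIONS = {
    "filter": (SUPPORTED, SUPPORTED, RECOMMENDED, RECOMMENDED),
    "map": (SUPPORTED, SUPPORTED, RECOMMENDED, RECOMMENDED),
    "window_functions": (UNSUPPORTED, SUPPORTED, UNSUPPORTED, RECOMMENDED),
    "tumbling_windows": (UNSUPPORTED, SUPPORTED, UNSUPPORTED, RECOMMENDED),
    "top_n": (UNSUPPORTED, UNTESTED, UNSUPPORTED, RECOMMENDED),
    "windowed_join": (UNSUPPORTED, SUPPORTED, UNSUPPORTED, UNTESTED),
    "temporal_table_join": (UNSUPPORTED, SUPPORTED, UNSUPPORTED, UNTESTED),
    "gis": (RECOMMENDED, UNSUPPORTED, UNSUPPORTED, UNSUPPORTED),
}


def candidate_engines(required, allow_untested=False):
    """Return the engines that cover every operation in `required`."""
    accepted = {RECOMMENDED, SUPPORTED}
    if allow_untested:
        accepted.add(UNTESTED)
    return [
        engine
        for column, engine in enumerate(ENGINES)
        if all(OPERATIONS[op][column] in accepted for op in required)
    ]


print(candidate_engines(["filter", "map"]))
# ['spark', 'flink', 'datafusion', 'risingwave']
print(candidate_engines(["tumbling_windows", "temporal_table_join"]))
# ['flink']
print(candidate_engines(["gis"]))
# ['spark']
```

Passing `allow_untested=True` also admits engines marked ❔, which for the second query adds `risingwave` to the result. A query that mixes GIS extensions with windowed operations returns an empty list, which matches the table: no single engine currently covers both.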
