Spark, Flink, Kafka Streaming
kamu does not compete with enterprise data processing technologies; it uses them internally and builds on top of them:
- Specifies how data should be stored
  - e.g. making sure that data is never modified or deleted
- Provides stable references to data for reproducibility
- Specifies how data & metadata are shared
- Tracks every processing step executed
  - so that a person on the other side of the world who downloaded your dataset can understand exactly where every single piece of data came from
- Handles dataset evolution
  - so that you can update your processing steps over time without breaking other people’s downstream pipelines that depend on your data
- And much more…
So, to kamu, Spark and Flink are just building blocks, while kamu itself is a higher-level, opinionated system.
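
To make the first few points more concrete, here is a toy sketch of an append-only dataset in which every version gets a stable, content-derived reference and every block records the step that produced it. This is only an illustration of the ideas listed above, not kamu's actual storage format or API; the names `Block`, `AppendOnlyDataset`, `append`, `read`, and `lineage` are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Block:
    """One immutable entry in the dataset's history (hypothetical structure)."""
    prev_hash: Optional[str]  # reference to the previous block, None for the first
    step: str                 # description of the processing step that produced the records
    records: tuple            # records added by this step (JSON-serializable)

    def digest(self) -> str:
        # The reference is derived from the content, so it cannot silently change.
        payload = json.dumps(
            {"prev": self.prev_hash, "step": self.step, "records": list(self.records)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AppendOnlyDataset:
    """Data can only be appended; existing blocks are never modified or deleted."""

    def __init__(self):
        self._blocks = {}  # digest -> Block
        self.head = None   # reference to the latest version

    def append(self, step, records):
        block = Block(self.head, step, tuple(records))
        ref = block.digest()
        self._blocks[ref] = block
        self.head = ref
        return ref  # stable reference to this exact version

    def _chain(self, ref):
        # Walk back from `ref` to the first block, then return oldest-first.
        chain = []
        while ref is not None:
            block = self._blocks[ref]
            chain.append(block)
            ref = block.prev_hash
        return list(reversed(chain))

    def read(self, ref):
        """All records as they were at version `ref`."""
        return [r for block in self._chain(ref) for r in block.records]

    def lineage(self, ref):
        """Every processing step that contributed to version `ref`."""
        return [block.step for block in self._chain(ref)]


ds = AppendOnlyDataset()
v1 = ds.append("ingest january.csv", [{"city": "A", "count": 1}])
v2 = ds.append("ingest february.csv", [{"city": "B", "count": 2}])

# Reading through an old reference still yields the old state of the data.
assert ds.read(v1) == [{"city": "A", "count": 1}]
assert ds.lineage(v2) == ["ingest january.csv", "ingest february.csv"]
```

In this sketch a downstream consumer that pinned `v1` keeps getting the same data after `v2` is appended, which is the property that makes results reproducible. Engines such as Spark or Flink would sit behind the `step` that produces each block.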