Apache Iceberg, Delta Lake

Apache Iceberg and Delta Lake are table formats for modern data lakes that provide a logical representation of a data table, abstracting away the physical layout of the data on disk.

Their features include:

  • Mutability (with atomic updates and read consistency)
  • Limited “time travel” between snapshots (illustrated in the sketch after this list)
  • Automatic snapshot pruning
  • Compactions (merging small data files for query efficiency)
  • Evolution of schemas and partitions
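
For reference, this is roughly what snapshot-based “time travel” looks like in Delta Lake via PySpark - a minimal sketch in which the table path, version number, and timestamp are placeholders:

```python
# Minimal PySpark sketch of Delta Lake "time travel" (table path, version,
# and timestamp are placeholders). Requires the delta-spark package and a
# Spark session configured with the Delta extensions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-time-travel")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Read the table as it was at an earlier snapshot (version 3 is hypothetical).
df_v3 = (
    spark.read.format("delta")
    .option("versionAsOf", 3)
    .load("/tmp/tables/my_table")
)

# The same snapshot can also be addressed by timestamp instead of version.
df_old = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01 00:00:00")
    .load("/tmp/tables/my_table")
)
```

Note that this time travel is limited: once old snapshots are pruned (e.g. by retention settings and VACUUM in Delta), the history they referenced is gone - which is exactly where ODF datasets differ.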

In kamu, ODF datasets serve a similar purpose, acting as a logical representation of a data stream (a ledger of events).

The greatest distinction is that these ledgers are append-only (see the sketch after this list):

  • New data can only be added to the end
  • Unlike Iceberg, Delta, and other formats designed around Change Data Capture, ODF streams preserve the complete and unaltered history of data (for reproducibility and verifiability)
  • Old records can only be amended by issuing explicit corrections or retractions
  • Data and metadata are cryptographically secured and signed
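
For intuition, here is a purely illustrative sketch of such an append-only ledger - the record layout and operation names are assumptions made for this example, not the actual ODF encoding:

```python
# Illustrative sketch of an append-only event ledger. The record layout and
# operation names are hypothetical, not the actual ODF encoding.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    op: str      # "append", "retract", or "correct"
    key: str     # logical identifier of the entity the event is about
    value: dict  # event payload


@dataclass
class Ledger:
    records: list = field(default_factory=list)

    def append(self, key: str, value: dict) -> None:
        # New data can only be added to the end.
        self.records.append(Record("append", key, value))

    def correct(self, key: str, new_value: dict) -> None:
        # A correction is an explicit new event - history is never rewritten.
        self.records.append(Record("correct", key, new_value))

    def retract(self, key: str) -> None:
        # Likewise, a retraction is appended rather than deleting anything.
        self.records.append(Record("retract", key, {}))

    def current_state(self) -> dict:
        # Consumers can fold the history into a "latest" view, while the
        # underlying ledger stays complete and unaltered.
        state: dict = {}
        for r in self.records:
            if r.op in ("append", "correct"):
                state[r.key] = r.value
            else:  # "retract"
                state.pop(r.key, None)
        return state


ledger = Ledger()
ledger.append("sensor-1", {"temp_c": 21.5})
ledger.correct("sensor-1", {"temp_c": 21.7})  # fixes the earlier reading
ledger.retract("sensor-1")                    # retracts it entirely
assert len(ledger.records) == 3               # full history is preserved
assert ledger.current_state() == {}           # but the "latest" view is empty
```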

While Iceberg and Delta only work with conventional file systems and object stores (e.g. HDFS, S3), ODF datasets are also designed to work efficiently with content-addressable storage such as IPFS and Arweave.
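
This is possible because a hash-linked, append-only history maps naturally onto content addressing, where a block's storage address is derived from its content. A rough sketch of the idea - a simplification, not the actual ODF block format or the IPFS CID scheme:

```python
# Rough sketch of content addressing: a block's "address" is the hash of its
# bytes, so blocks are immutable and can be linked into a tamper-evident chain.
# This is a simplification, not the actual ODF block format or IPFS CID scheme.
import hashlib
import json

store: dict[str, bytes] = {}  # stand-in for IPFS, Arweave, S3, or a local disk


def put_block(block: dict) -> str:
    data = json.dumps(block, sort_keys=True).encode()
    address = hashlib.sha256(data).hexdigest()  # the address IS the content hash
    store[address] = data
    return address


def get_block(address: str) -> dict:
    data = store[address]
    # Anyone can verify that the content matches the address they requested.
    assert hashlib.sha256(data).hexdigest() == address
    return json.loads(data)


# Each new metadata block references the previous one by hash, so appending
# never changes existing addresses - a property mutable table formats lack.
genesis = put_block({"prev": None, "event": "seed"})
head = put_block({"prev": genesis, "event": "add-data", "records": 42})
assert get_block(head)["prev"] == genesis
```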

The rich metadata of ODF streams is designed to carry much more information (sketched in code after this list), such as:

  • Applied transformations, for lineage and provenance of derivative data
  • License changes, semantics, governance information…
  • Attachments (data manuals, examples) etc.
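
Purely as an illustration, such a metadata chain can be pictured as a sequence of typed events - the event names and fields below are hypothetical, not the real ODF schema:

```python
# Hypothetical sketch of metadata as a chain of typed events; the event names
# and fields are illustrative, not the actual ODF schema.
from dataclasses import dataclass


@dataclass(frozen=True)
class SetTransform:
    inputs: tuple        # upstream datasets -> lineage and provenance
    engine: str          # e.g. "spark" or "flink"
    query: str           # the transformation that produced the data


@dataclass(frozen=True)
class SetLicense:
    spdx_id: str         # e.g. "CC-BY-4.0"


@dataclass(frozen=True)
class AddAttachment:
    name: str            # e.g. "data-manual.md"
    content_hash: str    # the attachment itself is stored by content address


metadata_chain = [
    SetLicense(spdx_id="CC-BY-4.0"),
    SetTransform(
        inputs=("com.example.raw-temperatures",),
        engine="spark",
        query="SELECT sensor_id, AVG(temp_c) AS avg_temp FROM input GROUP BY sensor_id",
    ),
    AddAttachment(name="data-manual.md", content_hash="bafy..."),
]
```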

So while Iceberg and Delta serve as table abstractions in enterprise data lakes, an ODF dataset is a Web3-native, dynamic, structured data format and a building block for multi-party exchange and transformation of data.

Spark, Flink, Kafka Streams

kamu does not compete with enterprise data processing technologies - it uses them internally and builds on top of them (see the sketch after this list):

  • It specifies how data should be stored
    • e.g. making sure that data is never modified or deleted
  • Provides stable references to data for reproducibility
  • Specifies how data & metadata are shared
  • Tracks every processing step executed
    • so that a person on the other side of the world who downloaded your dataset could understand exactly where every single piece of data came from
  • Handles dataset evolution
    • so that you could update your processing steps over time without breaking other people’s downstream pipelines that depend on your data
  • And much more…
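
For a flavor of how this fits together, here is a hypothetical sketch of a derivative dataset declaring its transformation - the field names, dataset names, and query are illustrative, not the actual ODF/kamu manifest format:

```python
# Hypothetical sketch of a derivative dataset definition; field names, dataset
# names, and the query are illustrative, not the actual ODF/kamu manifest format.
derivative_dataset = {
    "name": "com.example.daily-avg-temperature",
    "kind": "Derivative",
    "inputs": ["com.example.raw-temperatures"],  # lineage: where the data came from
    "transform": {
        "engine": "flink",  # the heavy lifting is delegated to an existing engine
        "query": """
            SELECT
                TUMBLE_START(event_time, INTERVAL '1' DAY) AS day,
                AVG(temp_c) AS avg_temp
            FROM `com.example.raw-temperatures`
            GROUP BY TUMBLE(event_time, INTERVAL '1' DAY)
        """,
    },
}

# Because the inputs, engine, and query become part of the dataset's metadata,
# anyone who downloads the dataset can re-run the exact same steps and verify
# that the published data matches the declared transformation.
```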

So to kamu, Spark and Flink are just building blocks, while kamu itself is a higher-level, opinionated system.

Snowflake

Snowflake vs. kamu:

  • Snowflake: A robust data platform and enterprise data warehouse
    kamu: A Web3-native data sharing and collaborative processing solution
  • Snowflake: Centralized, cloud-only, and based on proprietary technology
    kamu: A decentralized network with open protocol and code that scales from laptop to cluster and can span from edge to on-prem to multi-cloud
  • Snowflake: Best for internal data management
    kamu: Best for cross-organizational data exchange and collaboration
  • Snowflake: Works with cloud storage (S3, Azure, etc.)
    kamu: Works with both cloud and content-addressable storage (IPFS, Arweave)
  • Snowflake: Uses batch processing, where recurrent data pipelines are expensive to execute, typically run only once a day, and are non-composable due to consistency issues when processing dynamic data
    kamu: Uses novel stream processing that enables near real-time latencies, can be 1000x more efficient than batch for many workflows, and provides unparalleled consistency and composability to data pipelines (see the sketch after this comparison)
  • Snowflake: Data is non-verifiable and has no provenance
    kamu: Data is 100% reproducible, verifiable, and comes with built-in provenance
  • Snowflake: Data can be shared on the Snowflake Marketplace, but this ties you to Snowflake products
    kamu: Data can be shared in a decentralized network, without intermediaries, using open protocols
  • Snowflake: On the Snowflake Marketplace you can only be a publisher or a consumer of data
    kamu: Enables infinitely composable collaboration on data, where parties can clean and combine data into higher-level data products while proving that they are not adding malicious data
  • Snowflake: Handles only off-chain data, limited to company-sized silos
    kamu: Seamlessly integrates on- and off-chain data, and can provide data to blockchains as an optimistic oracle
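
To see why recurrent pipelines can be dramatically cheaper with stream processing, here is a toy sketch (the workload and numbers are made up) contrasting a batch job that re-reads the full history on every run with a streaming job that only processes new events against saved state:

```python
# Toy illustration with a made-up workload: a recurrent pipeline that maintains
# a running total over an ever-growing event log.

events = list(range(1_000_000))  # historical events accumulated so far
new_batch = [7, 13, 21]          # the handful of new events since the last run

# Batch style: every scheduled run re-reads and re-aggregates the full history.
batch_total = sum(events + new_batch)  # cost grows with total history

# Streaming style: only new events are processed, updating persisted state.
state = sum(events)                    # computed once, then kept as state
stream_total = state + sum(new_batch)  # cost proportional to new events only

assert batch_total == stream_total     # same answer, very different cost profile
```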

Apache Airflow, Apache NiFi

TBD

Ceramic Network

TBD

Ocean Network

TBD

Dune Analytics

TBD

The Graph

TBD

Chainlink

TBD

Space And Time

TBD

JSON APIs

TBD

Blockchains

TBD