kamu (pronounced kæmˈuː) is a command-line tool for management and verifiable processing of dynamic structured data. It’s a green-field project that aims to enable global collaboration on data on the same scale as seen today in software.
Intro
kamu is based on the Open Data Fabric protocol.
You can think of kamu as:
- Local-first data lakehouse - a free alternative to Databricks / Snowflake / Microsoft Fabric that can run on your laptop without any accounts, and scale to a large on-prem cluster
- Kubernetes for data pipelines - an infrastructure-as-code framework for building ETL pipelines using a wide range of open-source SQL engines
- Git for data - a tamper-proof ledger that handles data ownership and preserves the full history of changes to source data
- Blockchain for data - a verifiable computing system for transforming data and recording fine-grained provenance and lineage
- Peer-to-peer data network - a set of open data formats and protocols for:
  - Non-custodial data sharing
  - Federated querying of global data as if it were one giant database
  - Processing pipelines that can span multiple organizations.
Quick Start
- Use the installer script (Linux / MacOSX / WSL2)
- Watch introductory videos to see kamu in action
- Follow the “First Steps” guide through an online demo and installation instructions.
How it Works
Ingest from any source
kamu works well with popular data extractors like Debezium and provides many built-in sources, ranging from polling data on the web to MQTT brokers and blockchain logs.

Track tamper-proof history
Data is stored in Open Data Fabric (ODF) format - an open Web3-native format inspired by Apache Iceberg and Delta. In addition to a “table” abstraction on top of Parquet files, ODF provides:
- Cryptographic integrity and commitments
- Stable references
- Decentralized identity, ownership, attribution, and permissions (based on W3C DIDs)
- Rich extensible metadata (e.g. licenses, attachments, semantics)
- Compatibility with decentralized storage systems like IPFS
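The integrity and stable-reference properties listed above can be illustrated with a toy hash chain. This is only a conceptual sketch, not ODF's actual metadata format: the helpers `block_hash`, `append_block`, and `verify_chain` and the `prev` / `event` field names are hypothetical and chosen for illustration.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block's canonical JSON form with SHA-256."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, event: dict) -> list:
    """Return a new chain with `event` linked to the previous block's hash."""
    prev = block_hash(chain[-1]) if chain else None
    return chain + [{"prev": prev, "event": event}]


def verify_chain(chain: list) -> bool:
    """Check that every block points at the hash of its predecessor."""
    for i, block in enumerate(chain):
        expected = block_hash(chain[i - 1]) if i > 0 else None
        if block["prev"] != expected:
            return False
    return True
```

In a scheme like this, the hash of the latest block acts as a stable reference to the whole history: changing any earlier block changes every hash after it.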
Explore, query, document
kamu offers a wide range of integrations, including:
- Embedded SQL shell for quick EDA
- Integrated Jupyter notebooks for ML/AI
- Embedded Web UI with SQL editor and metadata explorer
- Apache Superset and many other BI solutions


Build enterprise-grade ETL pipelines
Data in kamu can only be transformed through code. An SQL query that cleans one dataset or combines two via JOIN can be used to create a derivative dataset.
kamu doesn’t implement data processing itself - it integrates many popular data engines (Flink, Spark, DataFusion…) as plugins, so you can build an ETL flow that uses the strengths of different engines at different steps of the pipeline.
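As a rough illustration of a derivative dataset defined purely by an SQL query, the sketch below uses Python's built-in `sqlite3` module as a stand-in engine. It does not use kamu or any of its supported engines, and the table names (`orders`, `cities`, `city_totals`) are made up for the example.

```python
import sqlite3

# The transformation is plain SQL text, so it can be stored and reviewed as code.
DERIVE_SQL = """
CREATE TABLE city_totals AS
SELECT c.name AS city, SUM(o.amount) AS total
FROM orders AS o
JOIN cities AS c ON c.id = o.city_id
WHERE o.amount IS NOT NULL
GROUP BY c.name
"""


def derive(conn: sqlite3.Connection) -> list:
    """Run the stored query and return the rows of the derivative table."""
    conn.execute(DERIVE_SQL)
    return conn.execute(
        "SELECT city, total FROM city_totals ORDER BY city"
    ).fetchall()


# Hypothetical source data: two small input tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (city_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO cities VALUES (?, ?)", [(1, "Oslo"), (2, "Riga")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5), (2, None)],
)
rows = derive(conn)
```

The query both cleans the input (dropping rows without an amount) and combines two sources via JOIN, which are the two cases mentioned above.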

Get near real-time consistent results
All derivative datasets use stream processing that results in some revolutionary qualities:
- Input data is only read once, minimizing the traffic
- Configurable balance between low-latency and high-consistency
- High autonomy - once a pipeline is written it can run and deliver fresh data forever with little to no maintenance.
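The "input data is only read once" property can be sketched as an incremental aggregation that remembers how far it has read. This is a simplified illustration under assumed names (`RunningTotals`, `update`), not how kamu or its engines implement stream processing.

```python
from dataclasses import dataclass, field


@dataclass
class RunningTotals:
    """State kept between runs so earlier input never has to be re-read."""

    offset: int = 0  # number of input records consumed so far
    totals: dict = field(default_factory=dict)

    def update(self, log: list) -> int:
        """Consume only records appended since the last call; return how many."""
        new_records = log[self.offset:]
        for key, value in new_records:
            self.totals[key] = self.totals.get(key, 0) + value
        self.offset = len(log)
        return len(new_records)
```

Each call to `update` touches only the records appended since the previous call, so the cost of a refresh depends on the amount of new data rather than on the size of the full history.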
Share datasets with others
ODF datasets can be shared via any conventional (S3, GCS, Azure) or decentralized (IPFS) storage and easily replicated, which keeps sharing even a large dataset simple.
Reuse verifiable data
kamu will store the transformation code in the dataset metadata and ensure that it’s deterministic and reproducible. This is a form of verifiable computing.
You can send a dataset to someone else and they can confirm that the data they see in fact corresponds to the inputs and code.
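The idea of checking that data corresponds to its inputs and code can be sketched as "re-run the deterministic transformation and compare hashes". The functions below (`fingerprint`, `clean`, `verify`) are hypothetical and only illustrate the principle; they do not reflect kamu's actual verification procedure or hashing scheme.

```python
import hashlib
import json


def fingerprint(rows: list) -> str:
    """Hash rows in a canonical JSON encoding."""
    encoded = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def clean(rows: list) -> list:
    """A deterministic transformation: drop rows without a value, sort by id."""
    kept = (r for r in rows if r["value"] is not None)
    return sorted(kept, key=lambda r: r["id"])


def verify(inputs: list, transform, claimed_output_hash: str) -> bool:
    """Re-run the transformation and compare against the publisher's hash."""
    return fingerprint(transform(inputs)) == claimed_output_hash
```

This only works because the transformation is deterministic: the same inputs and the same code must always produce the same output, otherwise an honest publisher's hash could fail to reproduce.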
Query world’s data as one big database
Through federation, data in different locations can be queried as if it were in one big data lakehouse - kamu will take care of how to compute results most optimally, potentially delegating parts of the processing to other nodes.
Every query result is accompanied by a cryptographic commitment that you can use to reproduce the same query days or even months later.
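One way to picture such a commitment is as a record that binds the query text, the exact versions of the inputs it read, and a hash of the result. The structure below is an assumption made for illustration (the `commit` and `reproduces` helpers and all field names are hypothetical), not kamu's actual commitment format.

```python
import hashlib
import json


def sha256_json(value) -> str:
    """Hash any JSON-serializable value in a canonical encoding."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def commit(query: str, input_heads: dict, result: list) -> dict:
    """Bind a query, the exact input versions it read, and its result."""
    return {
        "query": query,
        "input_heads": input_heads,  # dataset name -> hash of the version queried
        "result_hash": sha256_json(result),
    }


def reproduces(commitment: dict, rerun_result: list) -> bool:
    """Check that a later re-run over the same inputs gives the same result."""
    return sha256_json(rerun_result) == commitment["result_hash"]
```

Pinning the input versions is what makes a later re-run meaningful: new data may have arrived since, but the query can still be evaluated against the state it originally saw.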
Start small and scale progressively
kamu offers unparalleled flexibility of deployment options:
- You can build, test, and debug your data projects and pipelines on a laptop
- Incorporate online storage for larger volumes, but keep processing it locally
- When you need real-time processing and 24/7 querying you can run the same pipelines with kamu-node as a small server
- A node can be deployed in Kubernetes and scale to a large cluster.
Get data to and from blockchains
Using kamu you can easily read on-chain data to run analytics on smart contracts, and provide data to blockchains via the novel Open Data Fabric oracle.
Use Cases
In general, kamu is a great fit for cases where data is exchanged between several independent parties, and for (low to moderate frequency & volume) mission-critical data where a high degree of trustworthiness and protection from malicious actors is required.
- Open Data
- Science & Research
- Data-driven Journalism
- Business core data
- Personal analytics
To share data outside of your organization today you have limited options:
- You can publish it on some open data portal, but lose ownership and control of your data
- You can deploy and operate some open-source data portal (like CKAN or Dataverse), but you probably have neither time nor money to do so
- You can self-host it as a CSV file on some simple HTTP/FTP server, but then you are making it extremely hard for others to discover and use your data
kamu aims to make data publishing cheap and effortless:
- It invisibly guides publishers towards best data management practices (preserving history, making data reproducible and verifiable)
- It adds as little friction as exporting data to CSV
- It lets you host your data on any storage (S3, IPFS, GCS, FTP etc.)
- It lets you maintain full control and ownership of your data
kamu brings publishers closer to their communities, allowing them to see who uses their data and how. You no longer send data into “the ether”, but create a closed feedback loop with your consumers.
Features
kamu connects publishers and consumers of data through a decentralized network and lets people collaborate on extracting insight from data. It offers many perks for everyone who participates in this first-of-a-kind data supply chain:
- For Data Publishers
- For Data Scientists
- For Data Consumers
- For Data Exploration
- Easily share your data with the world without moving it anywhere
- Retain full ownership and control of your data
- Close the feedback loop and see who uses your data and how
- Provide real-time, verifiable and reproducible data that follows the best data management practices
