The kamu and IPFS (InterPlanetary File System) projects have overlapping goals - preserving and growing humanity’s knowledge.
While IPFS focuses on storing files (objects), kamu’s focus is on structured data and dynamic data processing, but under the hood even a real-time dataset in kamu is just a set of files!
Below we show how kamu builds on many of the same concepts as IPFS, and how IPFS can serve as a very efficient storage layer for sharing your datasets and pipelines with the world.
IPFS Basics
IPFS is well-documented, so we will only quickly review the basics - if you find any of them confusing, simply follow the links:
- IPFS is decentralized
- IPFS is content-addressable storage (more)
- Mutability can be simulated with special IPNS URLs that can point to different CIDs at different times
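As a quick illustration of content addressing (a minimal sketch, assuming the ipfs CLI is installed and the daemon is running; the file name is arbitrary):

```bash
# Adding a file returns a CID derived purely from the file's content
echo "hello world" > hello.txt
ipfs add hello.txt

# Any node can then fetch the exact same bytes by that CID
ipfs cat <CID>
```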
Kamu and IPFS
Unlike most data processing systems, kamu was built from the ground up with reproducibility in mind - it defaults to never losing history. As the ODF spec explains in detail, the best way to do this is to represent data as append-only event streams.
These event streams are represented by a linked list of metadata blocks (think git history or blockchain) that reference portions of raw data and checkpoint files.
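You can inspect this chain for any dataset in your workspace (a sketch; my.dataset is a placeholder dataset name):

```bash
# Print the chain of metadata blocks, newest first - each block
# references its predecessor by hash, much like commits in git
kamu log my.dataset
```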
So, very similarly to IPFS, the constituent parts of a dataset in kamu are:
- Immutable - a dataset only changes by appending new blocks
- Content-addressable - every metadata block, data file, and checkpoint is uniquely identified by its hash
Content addressing also gives IPFS deduplication for free. Say you add a directory to IPFS with ipfs add -r . and get back CID1. When you then add a new file to that directory and run ipfs add -r . again, IPFS will notice that it already has CIDs for all but one file and will reuse those objects.
If CID1 remains “pinned”, it effectively represents the same directory as CID2, but at an earlier point in time.
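You can observe this deduplication yourself (a sketch; file names and contents are arbitrary):

```bash
mkdir demo && cd demo
echo "first" > a.txt
ipfs add -r .            # the directory gets CID1
echo "second" > b.txt
ipfs add -r .            # the directory gets CID2, but a.txt's object is reused
```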
This also works for kamu datasets:
- Each time you push to IPFS you are only adding the blocks and objects that were not seen previously. There is no duplication.
- Previous CIDs remain valid - they simply point to an older subset of an event stream.
- Having just one “pointer” to a metadata block, you can “walk” the entire metadata chain and discover all of the dataset’s components.
This is what allows kamu to achieve near real-time data propagation latencies even when handling massive datasets and using immutable storage like IPFS.
The above diagrams are good for building intuition, but this is not exactly how the data looks in IPFS at the IPLD DAG level. An accurate diagram would need to consider:
- Slicing of large files into multiple objects
- Raw vs. wrapped nodes
- Balanced vs. custom DAG structure
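If you want to see the actual structure, the IPFS CLI lets you inspect the DAG directly (a sketch; <CID> is any CID returned by ipfs add):

```bash
# List the links (child objects) of a directory or chunked file
ipfs ls <CID>

# Dump a single DAG node in its IPLD (dag-json) representation
ipfs dag get <CID>
```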
Pulling from IPFS
kamu supports pulling data from IPFS just like from any other repository.
You can use both ipfs:// and ipns:// URLs:
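```bash
# A sketch - {cid}, {key}, and the dataset name are placeholders
# Pull an immutable snapshot of a dataset by its CID:
kamu pull "ipfs://{cid}" --as my.dataset

# Pull the latest state via a mutable IPNS pointer:
kamu pull "ipns://{key}" --as my.dataset
```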
Configuring IPFS Gateway
Under the hood, kamu will use the configured IPFS gateway to fetch data, converting the URL into something like http(s)://{gateway}/ipfs/{cid} or http(s)://{gateway}/ipns/{key}.
By default it will try to use your local IPFS daemon as a gateway. You can see this by running:
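```bash
# Show the effective configuration, including default values
kamu config list --with-defaults

# A sketch, assuming the protocol.ipfs.httpGateway config key -
# point kamu at a public gateway instead of the local daemon:
kamu config set protocol.ipfs.httpGateway "https://ipfs.io"
```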
Pushing to IPFS
Clearly, pushing to an ipfs:// URL will not work, as we don’t know the CID upfront - not until all data has been hashed. This is where IPNS comes in. With IPNS:
- A public/private key pair is generated
- The name is derived by hashing the public key
- The private key is used to sign publish requests, proving that you are the owner of said name
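To publish under a stable name we therefore first generate an IPNS key (the key name my-dataset is arbitrary):

```bash
# Generate a key pair; the printed name is the hash of the public key
ipfs key gen my-dataset
```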
We can now create a repository using the ipns://{key} URL and push the dataset:
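```bash
# A sketch - the repository and dataset names are placeholders,
# {key} is the name printed by `ipfs key gen`, and the exact remote
# reference format may differ between kamu versions
kamu repo add my-repo "ipns://{key}"
kamu push my.dataset --to "my-repo/my.dataset"
```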
Afterwards, we can inspect the push alias kamu created for us:
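```bash
# A sketch, assuming kamu's `repo alias` subcommand
kamu repo alias list

# Anyone can now pull the published dataset from the same URL
kamu pull "ipns://{key}" --as my.dataset
```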