Sharing datasets with IPFS
The kamu and IPFS (the InterPlanetary File System) projects have overlapping goals - preserving and growing humanity’s knowledge. While IPFS focuses on storing files (objects), kamu’s focus is on structured data and dynamic data processing, but under the hood even a real-time dataset in kamu is just a set of files!
Below we show how kamu builds on many of the same concepts as IPFS, and how IPFS can be used as a very efficient storage layer for sharing your datasets and pipelines with the world.
IPFS Basics
IPFS is well documented, so we will only quickly review the basics - if you find any of them confusing, simply follow the links:
- IPFS is decentralized
- IPFS is content-addressable storage (illustrated below)
- Mutability can be simulated with special URLs that can point to different CIDs at different times
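For example, assuming you have the IPFS CLI installed (the file name is arbitrary and the CID output is shown as a placeholder), adding the same content always yields the same identifier:
echo "hello world" > hello.txt
ipfs add hello.txt
> added <CID> hello.txt
# Adding identical content again produces the exact same CID - it is derived purely from the content
ipfs add hello.txt
> added <CID> hello.txt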
Kamu and IPFS
Unlike most data processing systems, kamu was built from the ground up with reproducibility in mind - it defaults to never losing history. As the ODF spec explains in detail, the best way to achieve this is to represent data as append-only event streams.
These event streams are represented by a linked list of metadata blocks (think git history or blockchain) that reference portions of raw data and checkpoint files:
So, very similarly to IPFS, the components of a dataset in kamu are:
- Immutable - a dataset only changes by appending new blocks
- Content-addressable - every metadata block, data file, and checkpoint is uniquely identified by its hash
Why does this matter? Imagine you store a directory containing many large files in IPFS and it is assigned CID1. When you add a new file to it and run ipfs add . -r again, IPFS will notice that it already has CIDs for all but one file and will reuse those objects:
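As a purely hypothetical walkthrough (all CIDs and file names below are placeholders), this looks roughly like:
ipfs add -r ./data
> added <CID_a> data/part-1.parquet
> added <CID_b> data/part-2.parquet
> added <CID1> data
# Add one more file and re-add the directory
cp part-3.parquet ./data/
ipfs add -r ./data
> added <CID_a> data/part-1.parquet
> added <CID_b> data/part-2.parquet
> added <CID_c> data/part-3.parquet
> added <CID2> data
# Only the content of part-3.parquet was actually added - all other objects were reused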
If CID1 remains “pinned”, it basically represents the state of the same directory as CID2, but at a previous point in time.
This also works for kamu datasets:
- Each time you push to IPFS you only add the blocks and objects that were not seen previously - there is no duplication.
- Previous CIDs remain valid - they simply point to an older subset of the event stream.
- With just one “pointer” to the latest metadata block you can “walk” the entire metadata chain and discover all components of a dataset (see the example below).
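For instance, if you already have a dataset locally, you can walk its chain with kamu's log command (the dataset name here is just an example):
# Print the metadata chain of a local dataset, block by block
kamu log my-dataset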
Unlike many other data processing systems that mutate data, the cheapness of the “append” operation allows kamu to achieve near real-time data propagation latencies even when handling massive datasets and using immutable storage like IPFS.
The above diagrams are good for building intuition, but this is not exactly what the data looks like in IPFS at the IPLD DAG level. An accurate diagram would need to consider the following (see the inspection commands after the list):
- Slicing of large files into multiple objects
- Raw vs. wrapped nodes
- Balanced vs. custom DAG structure
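If you are curious about the actual layout, the IPFS CLI lets you inspect it directly (the CID below is a placeholder):
# List the links (child objects) of a directory or chunked file
ipfs ls <CID>
# Dump the raw IPLD node as JSON
ipfs dag get <CID>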
Pulling from IPFS
kamu supports pulling data from IPFS just like from any other repository. You can use both ipfs:// and ipns:// URLs:
kamu pull ipfs://bafybeietcz4lxovy3ejdhb67nt3lj43vaeuyhectkqfnmmlnatfug5vqhe --as my-dataset --no-alias
kamu pull ipns://k51qzi5uqu5dic6zu9i2f4afctxmsm298ypiuy3ijmob1w6m96c092qp4ev7mn --as my-dataset
Since the ipfs:// URL never changes, we use the --no-alias flag to skip the creation of a pull alias.
Configuring IPFS Gateway
Under the hood, kamu will use the configured IPFS Gateway to fetch data, converting the URL into something like http(s)://{gateway}/ipfs/{cid} or http(s)://{gateway}/ipns/{key}.
By default it will try to use your local IPFS daemon as a gateway. You can see this by running:
kamu config get --with-defaults protocol.ipfs.httpGateway
> "http://localhost:8080/"
If you only read data from IPFS and are not planning to write, you can avoid running an IPFS daemon by switching to one of the public IPFS gateways:
kamu config set --user protocol.ipfs.httpGateway "https://dweb.link/"
Pushing to IPFS
Clearly, pushing to an ipfs:// URL will not work, as we don’t know the CID upfront - not until all the data has been hashed:
kamu push my-dataset --to ipfs://<???>
We need a URL that remains stable when the underlying dataset is updated - we can use IPNS for this.
IPNS is essentially a named pointer to a CID that can be re-pointed to a different CID only by its owner. To create unique, collision-free names in a decentralized way, and to let you prove ownership of a name at the same time, IPNS uses cryptographic key pairs:
- A public/private key pair is generated
- The name is derived by hashing the public key
- The private key is used to sign publishing requests, proving that you are the owner of the name
To push data to IPFS you will need a local IPFS daemon running.
We first generate a new key pair that we will use for our dataset:
ipfs key gen my-dataset
> k51qzi5uqu5dgl4gf3uayepenee5tzix3r8oiwyimxsfuikr7alq928xxtwmew
We can now use this key to form a destination URL as ipns://{key} and push the dataset:
kamu push my-dataset --to ipns://k51qzi5uqu5dgl4gf3uayepenee5tzix3r8oiwyimxsfuikr7alq928xxtwmew
IPNS records have a limited lifetime (see ipfs name publish --help), so you’ll need to run the push command periodically (e.g. on a cron job) to prevent them from expiring, otherwise your URL will become non-resolvable.
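A minimal crontab entry for this could look like (the schedule and dataset name are just examples):
# Re-publish the dataset every 12 hours to keep the IPNS record alive
0 */12 * * * kamu push my-dataset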
Since IPNS pointers are mutable, the next time we update the dataset we can push it again using the push alias kamu created for us:
# Update the dataset locally
kamu pull my-dataset
> <Dataset is updated>
# Push updates via IPNS push alias
kamu push my-dataset
> Syncing dataset (my-dataset > ipns://k51q...wmew)
> Updated to <head> (1 block(s))
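If you want to double-check where the IPNS name points after a push, you can resolve it with the IPFS CLI (the resulting CID is shown as a placeholder):
ipfs name resolve k51qzi5uqu5dic6zu9i2f4afctxmsm298ypiuy3ijmob1w6m96c092qp4ev7mn
> /ipfs/<CID of the latest dataset state>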
And that’s all there is to it! You now have all the basic building blocks to create fully decentralized data pipelines, on both the storage and compute levels!