# Sharing datasets with IPFS

> Describes integration with content-addressable decentralized storage

The `kamu` and [IPFS](https://ipfs.io/) (InterPlanetary File System) projects have overlapping goals: preserving and growing humanity's knowledge. While IPFS focuses on storing files (objects), `kamu` focuses on structured data and dynamic data processing. Under the hood, however, even a real-time dataset in `kamu` is just a set of files! Below we show how `kamu` builds on many of the same concepts as IPFS and how IPFS can be used as a very efficient **storage layer** for sharing your datasets and pipelines with the world.

## IPFS Basics

IPFS is well documented, so we will only quickly review the basics. If you find any of them confusing, simply follow the links:

* IPFS is **decentralized**
  * You are fully responsible for your own data's persistence, replication, and availability ([more](https://docs.ipfs.io/concepts/persistence/))
  * To publish data you need to run a node ([more](https://docs.ipfs.io/concepts/nodes/))
  * To read data you can use your own node or a public IPFS Gateway ([more](https://docs.ipfs.io/concepts/ipfs-gateway/))
* IPFS is a **content-addressable** storage ([more](https://docs.ipfs.io/concepts/content-addressing/))
  * An IPFS URL is based on a CID (content identifier), which is derived from a hash of the data it points to ([more](https://docs.ipfs.io/concepts/hashing/))
  * Every time the content of a file changes you get a **different URL**
  * The old and new files are different objects: the data in IPFS is **immutable** ([more](https://docs.ipfs.io/concepts/immutability/))
* **Mutability can be simulated** with special URLs that can point to different CIDs at different times
  * Currently supported naming mechanisms are [IPNS](https://docs.ipfs.io/concepts/ipns/) and [DNSLink](https://docs.ipfs.io/concepts/dnslink/)
  * IPNS uses a hash of a public key to create a unique name entry in a decentralized way ([more](https://docs.ipfs.io/concepts/ipns/)).

## Kamu and IPFS

Unlike any other data processing system, `kamu` was built from the ground up with reproducibility in mind: it defaults to never losing history. As the [ODF spec](/odf/spec) explains in detail, the best way to do this is to represent data as **append-only** event streams.

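
The two ideas introduced so far, content addressing and append-only growth, can be tried out without IPFS at all. The sketch below is a toy illustration only: it uses a plain SHA-256 digest as a stand-in for a CID, and the `address` helper is hypothetical. Real CIDs also encode the hash function and codec, and large files are split into multiple objects, so these digests are not valid CIDs.

```bash
# Toy stand-in for a CID: the SHA-256 digest of the content itself
address() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

before=$(address "event-1")
same=$(address "event-1")
after=$(address "event-1,event-2")   # appending an event changes the content

[ "$before" = "$same" ] && echo "same content, same address"
[ "$before" != "$after" ] && echo "changed content, new address"
```

The address of the earlier content is unaffected by the append, which is why an older address can keep serving as a snapshot of an earlier state.
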
These event streams are represented by a linked list of metadata blocks (think git history or blockchain) that reference portions of raw data and checkpoint files.

So, very similarly to IPFS, the composing parts of a dataset in `kamu` are:

* **Immutable** - a dataset only changes by appending new blocks
* **Content-addressable** - each metadata block, data file, and checkpoint is uniquely identified by its hash

Why does this matter? Imagine you store a directory containing many large files in IPFS and it is assigned `CID1`. When you add a new file to it and run `ipfs add . -r` again, IPFS will notice that it already has CIDs for all but one file and will reuse those objects, assigning the updated directory a new `CID2`.

If `CID1` remains "pinned", it basically represents the state of the same directory as `CID2`, but at a previous point in time.

This also works for `kamu` datasets:

* Each time you push to IPFS you only add the blocks and objects that were not seen previously. There is no duplication.
* Previous CIDs remain valid: they simply point to an older subset of the event stream.
* Having just one "pointer" to a metadata block, you can "walk" the entire metadata chain and discover all dataset components.

Many other data processing systems mutate data. In `kamu`, by contrast, the cheapness of the "append" operation allows it to achieve near real-time latencies of data propagation even when handling massive datasets and using immutable storage like IPFS.

The simplified picture above is good for building intuition, but it is not exactly what the data looks like in IPFS at the [IPLD DAG](https://ipld.io/) level. An accurate picture would need to consider:

* Slicing of large files into multiple objects
* Raw vs. wrapped nodes
* Balanced vs. custom DAG structure

## Pulling from IPFS

`kamu` supports pulling data from IPFS just like from any [other repository](/cli/collab/repositories).
You can use both `ipfs://` and `ipns://` URLs:

```bash
# Pull an immutable snapshot by its CID (no pull alias is needed)
kamu pull ipfs://bafybeietcz4lxovy3ejdhb67nt3lj43vaeuyhectkqfnmmlnatfug5vqhe --as my-dataset --no-alias

# Or pull via a mutable IPNS name
kamu pull ipns://k51qzi5uqu5dic6zu9i2f4afctxmsm298ypiuy3ijmob1w6m96c092qp4ev7mn --as my-dataset
```

Since the content behind an `ipfs://` URL never changes, we use the `--no-alias` flag to skip creating a pull alias.

### Configuring IPFS Gateway

Under the hood, `kamu` uses the configured IPFS Gateway to fetch data, converting the URL into something like `http(s)://{gateway}/ipfs/{cid}` or `http(s)://{gateway}/ipns/{key}`.

By default it will try to use your local IPFS daemon as a gateway. You can see this by running:

```bash
kamu config get --with-defaults protocol.ipfs.httpGateway
# Output: "http://localhost:8080/"
```

If you only read data from IPFS and are not planning to write, you can avoid running an IPFS daemon by switching to one of the [public IPFS Gateways](https://ipfs.github.io/public-gateway-checker/):

```bash
kamu config set --user protocol.ipfs.httpGateway "https://dweb.link/"
```

## Pushing to IPFS

Clearly, pushing to an `ipfs://` URL will not work, as we don't know the CID upfront, not until all data has been hashed:

```bash
# Will not work: the CID is unknown until all data has been hashed
kamu push my-dataset --to ipfs://
```

We need a URL that remains stable when the underlying dataset is updated. We can use [IPNS](https://docs.ipfs.io/concepts/ipns/) for this.

IPNS is like a named pointer to a CID that can be re-pointed to a different CID only by its owner. To create unique, collision-free names in a decentralized way and, at the same time, let you prove ownership of a name, IPNS uses cryptographic key pairs:

* A public/private key pair is generated
* The name is derived by hashing the public key
* The private key is used for signing publishing requests to prove that you are the owner of said name.

To push data to IPFS you will need a local IPFS daemon running.
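
Before running the real commands, the "stable name over changing CIDs" idea described above can be illustrated with a toy model. This is only a sketch under loud assumptions: a temporary directory stands in for IPFS, SHA-256 digests stand in for CIDs, the `put`, `publish`, and `resolve` helpers are hypothetical, and the ownership check is left out entirely (real IPNS only accepts updates signed with the private key).

```bash
# Toy model only: a directory stands in for IPFS and SHA-256 digests for CIDs
store=$(mktemp -d)
mkdir "$store/objects" "$store/names"

# Store content under the hash of its bytes and print that address
put() {
  local addr
  addr=$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)
  printf '%s' "$1" > "$store/objects/$addr"
  echo "$addr"
}

# Re-point a stable name at a new address (real IPNS also requires a signature)
publish() { echo "$2" > "$store/names/$1"; }

# Follow the name to whatever content it currently points at
resolve() { cat "$store/objects/$(cat "$store/names/$1")"; }

cid1=$(put "block-1")
publish my-dataset "$cid1"
cid2=$(put "block-1 block-2")
publish my-dataset "$cid2"

resolve my-dataset            # prints: block-1 block-2
```

The old address stays readable (`$cid1` still maps to `block-1`); only the name moved, which is the property that makes pushing updates to a fixed URL possible.
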
We first generate a new key pair that we will use for our dataset:

```bash
ipfs key gen my-dataset
# Output: k51qzi5uqu5dgl4gf3uayepenee5tzix3r8oiwyimxsfuikr7alq928xxtwmew
```

We can now use this key to form a destination URL as `ipns://{key}` and push the dataset:

```bash
kamu push my-dataset --to ipns://k51qzi5uqu5dgl4gf3uayepenee5tzix3r8oiwyimxsfuikr7alq928xxtwmew
```

Currently, IPNS records have a default lifetime of 24h (see `ipfs name publish --help`). This means you'll need to run the push command periodically (e.g. as a cron job) to prevent the record from expiring. Otherwise your URL will become non-resolvable.

Since IPNS pointers are mutable, the next time we update the dataset we can push it again using the push alias `kamu` created for us:

```bash
# Update the dataset locally
kamu pull my-dataset

# Push updates via IPNS push alias
kamu push my-dataset
# Output:
#   Syncing dataset (my-dataset > ipns://k51q...wmew)
#   Updated to (1 block(s))
```

And that's all there is to it! You now have all the basic building blocks to create **fully decentralized data pipelines**, both on storage and compute levels!
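
As a follow-up to the expiry note above, the periodic push could be wrapped in a small script along the following lines. This is a sketch, not a recommended setup: the `republish` function and the `KAMU_BIN` override are hypothetical names, and it assumes that the push alias already exists from an earlier `kamu push ... --to ipns://...` and that re-running the push refreshes the IPNS record, as the text above implies.

```bash
# Hypothetical override so the wrapper can be dry-run with another command
KAMU_BIN="${KAMU_BIN:-kamu}"

republish() {
  local dataset="$1"
  # Refresh the local copy first; a failed pull should not block re-publishing
  "$KAMU_BIN" pull "$dataset" || echo "pull failed, pushing current state" >&2
  # Push through the existing IPNS push alias
  "$KAMU_BIN" push "$dataset"
}
```

Saved as a script with `republish my-dataset` as its last line, this could be scheduled with a crontab entry such as `0 */12 * * * /path/to/republish.sh` (the path is a placeholder), which runs it every 12 hours and so stays within the 24h lifetime.
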