For a quick overview of key functionality, you can also watch the video version of this tutorial.

This tutorial will give you a high-level tour of kamu and show you how it works through examples.

We assume that you have already followed the installation steps and have the kamu tool ready.

Not ready to install just yet?

Try kamu in this self-serve demo without needing to install anything.

Don’t forget to set up shell completions - they make using kamu a lot more fun!
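
If you use bash, for example, a quick way to set them up is along these lines (kamu can generate completions for other shells too - see kamu completions -h):

# Add kamu completions to your shell profile (assumes bash)
echo 'source <(kamu completions bash)' >> ~/.bashrc
source ~/.bashrc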

The help command

When you execute kamu or kamu -h, a help message listing all top-level commands will be displayed.

To get help on individual commands type:

kamu <command> -h

This will usually contain a detailed description of what the command does along with usage examples.

Note that some commands also have sub-commands, e.g. kamu repo {add,list,...}; the same help pattern applies to those as well:

kamu repo add -h

Command help is also available online on the CLI Reference page.

Ingesting data

Throughout this tutorial we will be using the Modified Zip Code Areas dataset from New York Open Data Portal.

Initializing the workspace

To work with kamu you first need a workspace. A workspace is where kamu will store important information about datasets and cached data. Let’s create one:

mkdir my_workspace
cd my_workspace
kamu init

A workspace is just a directory with a .kamu folder where all sorts of data and metadata are stored. It behaves very similarly to the .git directory in version-controlled repositories.
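
You can verify this with a plain directory listing - inside your new workspace you should see the .kamu folder:

ls -a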

As you’d expect, the workspace is currently empty:

kamu list

Creating a dataset

One of the design principles of kamu is to know exactly where any piece of data came from. So it never blindly copies data around - instead it establishes ownership and links to external sources.

We’ll get into the details of that later, but for now let’s create such a link.

Datasets are created from dataset snapshots - special files that describe the desired state of the metadata upon creation.

We will use a DatasetSnapshot file from the kamu-contrib repo that looks like this:

kind: DatasetSnapshot
version: 1
content:
  name: us.cityofnewyork.data.zipcode-boundaries
  kind: Root
  metadata:
    - kind: SetPollingSource
      fetch:
        kind: Url
        # Dataset home: https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk
        url: https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?date=20240115&accessType=DOWNLOAD&method=export&format=Shapefile
      read:
        kind: EsriShapefile
      merge:
        kind: Snapshot
        primaryKey:
          # Modified ZIP Code Tabulation Area (ZCTA)
          # See for explanation: https://nychealth.github.io/covid-maps/modzcta-geo/about.html
          - modzcta

You can either copy the above into an example.yaml file and run:

kamu add example.yaml

Or add it directly from URL like so:

kamu add https://raw.githubusercontent.com/kamu-data/kamu-contrib/master/us.cityofnewyork.data/zipcode-boundaries.yaml

Such YAML files are called manifests. The first two lines specify that the file contains a DatasetSnapshot object and give the version of its schema, for upgradeability:

kind: DatasetSnapshot
version: 1
content: ...

Next we give the dataset a name and declare its kind:

name: us.cityofnewyork.data.zipcode-boundaries
kind: Root

Datasets that ingest external data in kamu are called Root datasets.

Next we have:

metadata:
  - kind: ...
    ...
  - kind: ...
    ...

This section contains one or more metadata events that can describe different aspects of a dataset, like:

  • where data comes from
  • its schema
  • license
  • relevant documentation
  • query examples
  • data quality checks
  • and much more.

To create your own snapshot file, use the kamu new command - it outputs a well-annotated template that you can customize for your needs.
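
For example, the following sketch (the exact flags may differ between versions - see kamu new -h) generates an annotated template for a new root dataset, with the name here being just a placeholder:

kamu new my.new.dataset --root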

Dataset Identity

When a dataset is created, it is assigned a very special identifier.

You can see it by running:

kamu list -w
┌───────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────┬─────┐
│                                      ID                                       │                   Name                   │ ... │
├───────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┼─────┤
│ did:odf:fed01057c94fb0378e3222704bb70a261d3ebeaa0d1b38c056a0bdd476360b8548db1 │ us.cityofnewyork.data.zipcode-boundaries │ ... │
└───────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────┴─────┘

Or:

kamu log us.cityofnewyork.data.zipcode-boundaries
Block #0: ...
SystemTime: ...
Kind: Seed
DatasetKind: Root
DatasetID: did:odf:fed01057c94fb0378e3222704bb70a261d3ebeaa0d1b38c056a0bdd476360b8548db1

This is a globally unique identity which is based on a cryptographic key-pair that only you control.

Thus by creating an ODF dataset you get both:

  • a way to uniquely identify it on the web
  • and a way to prove ownership over it.

This will be extremely useful when we get to sharing data with others.

Fetching data

At this point our new dataset is still empty:

kamu list
┌──────────────────────────────────────────┬──────┬────────┬─────────┬──────┐
│                   Name                   │ Kind │ Pulled │ Records │ Size │
├──────────────────────────────────────────┼──────┼────────┼─────────┼──────┤
│ us.cityofnewyork.data.zipcode-boundaries │ Root │   -    │       - │    - │
└──────────────────────────────────────────┴──────┴────────┴─────────┴──────┘

But the SetPollingSource event that we specified in the snapshot describes where the external data comes from and how kamu can ingest it.

Polling sources perform the following steps:

  • fetch - downloading the data from some external source (e.g. HTTP/FTP)
  • prepare (optional) - preparing raw binary data for ingestion (e.g. extracting an archive or converting between formats)
  • read - reading the data into a structured form (e.g. from CSV or Parquet)
  • preprocess (optional) - shaping the structured data with queries (e.g. to convert types into best suited form)
  • merge - merging the new data from the source with the history of previously seen data

You can find more information about data sources and ingestion stages in this section.

Note that the data file we are ingesting is in ESRI Shapefile format, which is a widespread format for geo-spatial data, so we are using a special EsriShapefile reader.

To fetch data from the source run:

kamu pull --all
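
Since the workspace contains only one dataset so far, pulling it by name works just as well:

kamu pull us.cityofnewyork.data.zipcode-boundaries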

At this point the source data will be downloaded, decompressed, parsed into structured form, preprocessed, and saved locally.

kamu list
┌──────────────────────────────────────────┬──────┬───────────────┬─────────┬──────────┐
│                   Name                   │ Kind │    Pulled     │ Records │   Size   │
├──────────────────────────────────────────┼──────┼───────────────┼─────────┼──────────┤
│ us.cityofnewyork.data.zipcode-boundaries │ Root │ X seconds ago │     178 │ 1.87 MiB │
└──────────────────────────────────────────┴──────┴───────────────┴─────────┴──────────┘

Note that when you pull a dataset, only the new records that kamu hasn’t previously seen will be added. In fact kamu preserves the complete history of all data - this is what gives you stable references to data, lets you “time travel”, and lets you establish where and how certain data was obtained (provenance). We will discuss this in depth in further tutorials.

For now it suffices to say that all data is tracked by kamu in a series of blocks. The Committed new block X message you’ve seen during the pull tells us that a new data block was appended. You can inspect these blocks using the log command:

kamu log us.cityofnewyork.data.zipcode-boundaries

Exploring data

Since you might not have worked with this dataset before, you’ll want to explore it first.

For this kamu provides many tools (from basic to advanced):

  • tail command
  • SQL shell
  • Jupyter notebooks integration
  • Web UI

Tail command

To quickly preview the last few events of any dataset, use the tail command:

kamu tail us.cityofnewyork.data.zipcode-boundaries

SQL shell

SQL is the lingua franca of data science, and kamu uses it extensively. So naturally it provides you with a simple way to run ad-hoc queries on your data.

The following command will drop you into the SQL shell:

kamu sql

By default this command uses the Apache DataFusion engine, so its powerful SQL is now available to you.

You can also select other engines, e.g. Apache Spark!
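
For example, assuming your build exposes the engine choice via the --engine flag (see kamu sql -h), switching to Spark could look like this:

kamu sql --engine spark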

All datasets in your workspace should be available to you as tables:

show tables;

You can use describe to inspect the dataset schema:

describe "us.cityofnewyork.data.zipcode-boundaries";

The extra quotes are needed so that the dataset name, which contains dots, is treated as a single table name.

And of course you can run queries against any dataset:

select
  *
from "us.cityofnewyork.data.zipcode-boundaries"
order by pop_est desc
limit 5;

Use Ctrl+D to exit the SQL shell.

SQL is a widely supported language, so kamu can be used in conjunction with many other tools that support it, such as Tableau and Power BI. See integrations for details.
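
A common pattern for hooking up such tools is to have kamu expose the data over a SQL server endpoint rather than the interactive shell - roughly like this (the supported protocols and options depend on your version, so check kamu sql server -h first):

kamu sql server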

kamu sql is a very powerful command that you can use both interactively and for scripting. We encourage you to explore more of its options through kamu sql --help.
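
For example, assuming the --command flag listed in kamu sql --help, a one-off non-interactive query could look like this:

kamu sql --command 'select count(*) from "us.cityofnewyork.data.zipcode-boundaries"'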

Notebooks

Kamu also connects the power of Apache Spark with the Jupyter Notebook server. You can get started by running:

kamu notebook -e MAPBOX_ACCESS_TOKEN=<your mapbox token>

Above we also tell kamu to pass the MapBox access token into Jupyter as the MAPBOX_ACCESS_TOKEN environment variable, which we will use for plotting. If you don’t have a MapBox token, simply run kamu notebook.

Executing this should open your default browser with Jupyter running in it.

From here let’s create a PySpark notebook. We start our notebook by loading the kamu extension:

%load_ext kamu

After this the import_dataset command becomes available, and we can load the dataset and give it an alias:

%import_dataset us.cityofnewyork.data.zipcode-boundaries --alias zipcodes

[Screenshot: kamu notebook 001]

This will take a few seconds, as in the background kamu creates an Apache Spark session, and it is Spark that loads the dataset into what it calls a “dataframe”.

You can then start using the zipcodes dataframe in the exact same way you would in an interactive spark-shell.

A few very important things to understand here:

  • Spark and Jupyter are running in separate processes
  • The commands you execute in the notebook are executed “remotely” and the results are transferred back
  • This means that it doesn’t really matter if your data is located on your machine or somewhere else - the notebook will work the same

The dataframe is automatically exposed in the SQL engine too, and you can run SQL queries using the %%sql annotation. You can copy and try some queries from this existing notebook.

[Screenshot: kamu notebook 002]

Thanks to the sparkmagic library you also get some simple instant visualizations for results of your queries.

[Screenshot: kamu notebook 003]

After you are done joining, filtering, and shaping the data via SQL or PySpark, you can choose to get it out into the Jupyter notebook kernel by adding the -o population_density flag. This will transfer the result of the query into the notebook as a Pandas dataframe, which you can further process and visualize:

[Screenshot: kamu notebook 004]

Example of visualizing population density data as a choropleth chart using the mapboxgl library:

[Screenshot: kamu notebook 005]

You can find this as well as many other notebooks in the kamu-contrib repo.

Web UI

All of the above and more is also available to you via the embedded Web UI, which you can launch by running:

kamu ui

The Web UI is especially useful once you start developing complex stream processing pipelines, as it lets you explore them more visually:

[Screenshot: Kamu Web UI]

Conclusion

We hope this quick overview inspires you to give kamu a try!

Don’t get distracted by the pretty notebooks and UIs though - we covered only the tip of the iceberg. The true power of kamu lies in how it manages data, letting you reliably track it, transform it, and share results with your peers in an easily reproducible and verifiable way.

Please continue to the self-serve demo for some hands-on walkthroughs and tutorials, and check out our other learning materials.