First Steps
This tutorial will give you a high-level tour of kamu
and show you how it works through examples.
We assume that you have already followed the installation steps and have the kamu tool ready.
Not ready to install just yet?
Try kamu in this self-serve demo without installing anything.
Don’t forget to set up shell completions - they make kamu
a lot more fun to use!
Using the help command
When you execute kamu or kamu -h, a help message describing all top-level commands will be displayed.
To get help on individual commands type kamu <command> -h - this will usually contain a detailed description of what the command does along with usage examples.
Note that some commands also have sub-commands, e.g. kamu repo {add,list,...}; the same help pattern applies to those as well, e.g. kamu repo add -h.
Getting data in
Throughout this tutorial we will be using the Zip Code Boundaries dataset, which can be found on the New York Open Data Portal.
Initializing the workspace
To work with kamu you first need a workspace - this is where kamu will store important information about datasets and cached data. Let's create one:

$ mkdir my_workspace
$ cd my_workspace
$ kamu init
$ kamu list
As you'd expect, the workspace is currently empty.
Adding a dataset
One of the design principles of kamu is to always know exactly where any piece of data came from, so it never simply copies data - instead, it creates links to external data (we'll get into the details of that later). For now, let's create such a link.

Datasets that ingest external data in kamu are called root datasets. To create one, we will use a DatasetSnapshot manifest from the kamu-contrib repo, which looks like this:
kind: DatasetSnapshot
version: 1
content:
  name: us.cityofnewyork.data.zipcode-boundaries
  kind: root
  metadata:
    - kind: setPollingSource
      fetch:
        kind: url
        url: https://data.cityofnewyork.us/api/views/i8iw-xf4u/files/YObIR0MbpUVA0EpQzZSq5x55FzKGM2ejSeahdvjqR20?filename=ZIP_CODE_040114.zip
      read:
        kind: esriShapefile
      merge:
        kind: snapshot
        primaryKey:
          - ZIPCODE
A DatasetSnapshot manifest contains the name of the dataset, its kind (root or derivative), and a series of metadata events that define the dataset's desired state.
In this example we only have one such event - setPollingSource, which describes how kamu can ingest external data by performing the following operations:
- fetch - obtaining the data from some external source (e.g. HTTP/FTP)
- prepare (optional) - steps for preparing data for ingestion (e.g. extracting an archive or converting between formats)
- read - reading the data into a structured form
- preprocess (optional) - shaping the structured data and converting types into the best suited form using query engines
- merge - merging the new data from the source with the history of previously seen data
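Our dataset above only needs the fetch, read, and merge steps, but to illustrate the optional ones, a hypothetical source that also unpacks an archive and cleans up types with a query might look roughly like this (a sketch only - the URL and columns are made up and not part of the dataset above):
kind: setPollingSource
fetch:
  kind: url
  url: https://example.org/data.zip
prepare:
  - kind: decompress
    format: zip
read:
  kind: csv
  header: true
preprocess:
  kind: sql
  engine: spark
  query: >
    select cast(zipcode as string) as zipcode, population
    from input
merge:
  kind: snapshot
  primaryKey:
    - zipcode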
When creating your own root datasets, check out the kamu new command - it outputs a well-annotated template that you can customize for your needs.
Note that the data file we are ingesting is in the ESRI Shapefile format, which is a common format for geo-spatial data, so we are using the special esriShapefile reader in our dataset manifest.
Let’s add it to our workspace by giving kamu
this file’s URL:
$ kamu add https://raw.githubusercontent.com/kamu-data/kamu-contrib/master/us.cityofnewyork.data/zipcode-boundaries.yaml
At this point no data has been loaded from the source yet, so let's fetch it:
$ kamu pull --all
When you pull, kamu will check the data source for any new data it hasn't seen yet. If there is any, it will be downloaded, decompressed, parsed into structured form, preprocessed, and saved locally.
Whenever you pull data in the future, only the new data that kamu hasn't seen yet will be added to the dataset. In fact, kamu preserves the complete history of all data - this is what gives you stable references to data, lets you "time travel", and lets you establish where and how certain data was obtained (provenance). We will discuss this in depth in further tutorials.
For now it suffices to say that all data is tracked by kamu in a series of blocks. The Committed new block X message you've seen during the pull tells us that a new data block was appended. You can inspect those blocks using the log command:
$ kamu log us.cityofnewyork.data.zipcode-boundaries
Exploring data
Since you might not have worked with this dataset before, you'll want to explore it first. For this kamu provides many tools:
- tail command
- SQL shell
- Jupyter notebooks integration
- Web UI
Tail command
To quickly preview the last few events of any dataset, use the tail command:
$ kamu tail us.cityofnewyork.data.zipcode-boundaries
SQL shell
SQL is the lingua franca of data science, and kamu uses it extensively. So naturally it provides a simple way to run ad-hoc queries on your data.

The following command will drop you into the SQL shell:
$ kamu sql
Under the hood it starts Apache Spark, so its powerful SQL engine is now available to you.
All datasets in your workspace should appear as tables:
kamu> show tables;
You can use describe
to inspect the dataset’s schema:
kamu> describe `us.cityofnewyork.data.zipcode-boundaries`;
Note the extra backticks needed to treat the dataset ID containing dots as a table name.
For brevity, you can create an alias:
kamu> create temp view zipcodes as (select * from `us.cityofnewyork.data.zipcode-boundaries`);
And of course you can run queries against any dataset:
kamu> select po_name, sum(population) from zipcodes group by po_name;
Use Ctrl+D
to exit the SQL shell.
SQL is a widely supported language, so kamu can be used in conjunction with many other tools that support it, such as Tableau and Power BI. Use the following command to expose kamu data through the JDBC server:
$ kamu sql server
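As a sketch of what a client might look like - assuming the server exposes a HiveServer2-compatible endpoint on localhost port 10000 (check the command's output for the actual host and port) - you could query it from Python with PyHive:
from pyhive import hive

# Connect to the Thrift/JDBC endpoint started by `kamu sql server`
# (host and port here are assumptions - see the command's output)
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute(
    "select po_name, sum(population) "
    "from `us.cityofnewyork.data.zipcode-boundaries` "
    "group by po_name"
)
for row in cursor.fetchall():
    print(row)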
kamu sql is a very powerful command that you can use both interactively and for scripting. We encourage you to explore more of its options through kamu sql --help.
Notebooks
Kamu also connects the power of Apache Spark with the Jupyter Notebook server. You can get started by running:
$ kamu notebook -e MAPBOX_ACCESS_TOKEN
Note: Above we do one extra thing - we tell kamu to pass the MapBox access token from the MAPBOX_ACCESS_TOKEN environment variable on your machine into Jupyter - we'll make use of it later. If you don't have a MapBox token, simply run kamu notebook.
Executing this should open your default browser with Jupyter running in it.
From here let's create a PySpark notebook. We start our notebook by loading the kamu extension:
%load_ext kamu
After this, the import_dataset command becomes available, and we can load the dataset and give it an alias:
%import_dataset us.cityofnewyork.data.zipcode-boundaries --alias zipcodes

This will take a few seconds, as in the background it creates an Apache Spark session, and it is Spark that loads the dataset into what it calls a "dataframe".
You can then start using the zipcodes dataframe in the exact same way you would in an interactive spark-shell.
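For example (a minimal sketch using standard Spark dataframe operations - the column names are assumed from the queries earlier in this tutorial):
# Inspect the schema Spark inferred for the dataset
zipcodes.printSchema()

# Regular dataframe operations work as usual
zipcodes.select("zipcode", "po_name", "population").show(5)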
There are a few very important things to understand here:
- Spark and Jupyter are running in separate processes
- The commands you execute in the notebook are executed “remotely” and the results are transferred back
- This means that it doesn’t really matter if your data is located on your machine or somewhere else - the notebook will work the same
The dataframe is automatically exposed in the SQL engine too, and you can run SQL queries using the %%sql annotation.
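For example, a simple aggregation over the imported alias (a sketch that reuses the columns from the SQL shell example above):
%%sql
select po_name, sum(population) as population
from zipcodes
group by po_name
order by population desc
limit 10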

Thanks to the sparkmagic library, you also get some simple instant visualizations for the results of your queries.

After you are done joining, filtering, and shaping the data, you can choose to get it out of Spark and into the Jupyter notebook kernel.
Let’s make this more interesting and try to visualize the population density map of New York’s zipcodes.
The following command executes an SQL query and, using -o count_per_zipcode, transfers the result into the notebook as a Pandas dataframe.
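A sketch of such a query (the geometry column name, Apache Sedona's st_* functions, and the EPSG codes - 2263 for the New York State Plane projection, 4326 for WGS84 - are assumptions for illustration):
%%sql -o count_per_zipcode
select
    zipcode,
    po_name,
    population / st_area(geometry) as population_density,
    st_astext(st_transform(geometry, 'epsg:2263', 'epsg:4326')) as geometry
from zipcodes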

Note that we had to convert the geometry data into text here, as it's stored in a binary format which Pandas doesn't understand. We also had to change projections, which is very easy using Apache Sedona (formerly known as GeoSpark) spatial functions. More on geometry later.
Now that we have the data in Jupyter, we can use any of our favorite tools and libraries to further process or visualize it. With a little bit of tinkering we can use the mapboxgl library to display our density map as a choropleth chart.
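A sketch of that tinkering, assuming the count_per_zipcode dataframe from above (with geometry as WKT text and a population_density column) and that the shapely and mapboxgl libraries are installed in the notebook environment:
import os
from shapely import wkt
from shapely.geometry import mapping
from mapboxgl.viz import ChoroplethViz
from mapboxgl.utils import create_color_stops

# Convert the WKT geometry strings into a GeoJSON FeatureCollection
features = [
    {
        "type": "Feature",
        "geometry": mapping(wkt.loads(row["geometry"])),
        "properties": {"density": row["population_density"]},
    }
    for _, row in count_per_zipcode.iterrows()
]
geojson = {"type": "FeatureCollection", "features": features}

# Color zipcodes by population density (breaks chosen arbitrarily)
viz = ChoroplethViz(
    geojson,
    access_token=os.environ["MAPBOX_ACCESS_TOKEN"],
    color_property="density",
    color_stops=create_color_stops([0, 20, 50, 100, 200], colors="YlOrRd"),
    center=(-73.97, 40.71),  # New York City
    zoom=9,
)
viz.show()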

You can find this as well as many other notebooks in the kamu-contrib repo.
Web UI
Once your pipelines become more complex, it can be really useful to explore them visually. For this kamu comes with an embedded Web UI, which you can launch by running:
$ kamu ui

Don't get distracted by the pretty notebooks and UIs though - the true power of kamu lies in how it manages data, letting you reliably track it, transform it, and share results with your peers in an easily reproducible and verifiable way.
To learn more, make sure to try the self-serve demo and check out our other learning materials!