This section will introduce you to kamu and show you how it works through examples. We assume that you have already followed the installation steps and have the kamu tool ready.

Not ready to install just yet? Try kamu in this online tutorial without needing to install anything - it also makes learning kamu a lot more fun!
The help command
When you execute kamu or kamu -h, a help message listing all top-level commands will be displayed.
To get help on individual commands, add -h to the command. Some commands have subcommands of their own, e.g. kamu repo {add,list,...}, and the same help pattern applies to those as well:
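For example (a sketch; this assumes the kamu CLI is installed):

```bash
kamu -h            # top-level help listing all commands
kamu repo -h       # help for the repo command group
kamu repo add -h   # help for an individual subcommand
```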
Ingesting data
Throughout this tutorial we will be using the Modified Zip Code Areas dataset from the New York Open Data Portal.

Initializing the workspace
To work with kamu you first need a workspace. A workspace is where kamu will store important information about datasets and cached data. Let's create one:
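For example (the directory name is arbitrary):

```bash
mkdir my_workspace && cd my_workspace   # any new directory will do
kamu init                               # turns the directory into a kamu workspace
```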
The workspace is simply a directory with a .kamu folder where all sorts of metadata and cached data are stored. It behaves very similarly to the .git directory in version-controlled repositories.
As you'd expect, the workspace is currently empty:
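Listing the datasets should return nothing at this point:

```bash
kamu list
```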
Creating a dataset
One of the design principles of kamu is to know exactly where any piece of data came from. So it never blindly copies data around - instead, it establishes ownership and links to external sources.
We'll get into the details of that later, but for now let's create such a link.
Datasets are created from dataset snapshots - special files that describe the desired state of the metadata upon creation.
We will use a file from the kamu-contrib repo that looks like this:
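A sketch of what such a snapshot file typically contains - the dataset name, URL, format, and primary key below are illustrative placeholders, not the actual contents of the kamu-contrib file:

```yaml
kind: DatasetSnapshot
version: 1
content:
  name: us.cityofnewyork.data.zipcode-boundaries   # illustrative name
  kind: Root
  metadata:
    # Where and how to ingest the data
    - kind: SetPollingSource
      fetch:
        kind: Url
        url: https://example.org/zipcode-areas.csv   # placeholder URL
      read:
        kind: Csv
        header: true
      merge:
        kind: Snapshot
        primaryKey:
          - zipcode
```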
Save it to an example.yaml file and run:
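Assuming the file is in the current directory:

```bash
kamu add example.yaml
```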
The main units of data in kamu are called datasets. Besides the data itself, each dataset carries metadata that describes:
- where data comes from
- its schema
- license
- relevant documentation
- query examples
- data quality checks
- and much more.
Dataset Identity
During creation every dataset is assigned a very special identifier. You can see it in the extended output of the list command. This identifier is:
- a way to uniquely identify the dataset on the web
- and a way to prove ownership over it.
Fetching data
At this point our new dataset is still empty - time to try one of the many ways in which kamu can ingest external data.
Polling sources perform the following steps:
- fetch - downloading the data from some external source (e.g. HTTP/FTP)
- prepare (optional) - preparing raw binary data for ingestion (e.g. extracting an archive or converting between formats)
- read - reading the data into a structured form (e.g. from CSV or Parquet)
- preprocess (optional) - shaping the structured data with queries (e.g. to convert types into a best-suited form)
- merge - merging the new data from the source with the history of previously seen data
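All of these steps run as part of a single pull (the dataset name here is illustrative):

```bash
kamu pull us.cityofnewyork.data.zipcode-boundaries
```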
When you pull a dataset, only the new records that kamu hasn't previously seen will be added. In fact, kamu preserves the complete history of all data - this is what enables you to have stable references to data, lets you "time travel", and establish from where and how certain data was obtained (provenance). We will discuss this in depth in further tutorials.
For now it suffices to say that all data is tracked by kamu in a series of blocks. The Committed new block X message you've seen during the pull tells us that a new data block was appended. You can inspect these blocks using the log command:
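For example (the dataset name is illustrative):

```bash
kamu log us.cityofnewyork.data.zipcode-boundaries
```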
Exploring data
Since you might not have worked with this dataset before, you'd want to explore it first. For this kamu provides many tools (from basic to advanced):
- tail command
- SQL shell
- Jupyter notebooks integration
- Web UI
Tail command
To quickly preview the last few records of any dataset, use the tail command:
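For example (the dataset name is illustrative):

```bash
kamu tail us.cityofnewyork.data.zipcode-boundaries
```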
SQL shell
SQL is the lingua franca of data science, and kamu uses it extensively. So naturally it provides you a simple way to run ad-hoc queries on data.
The following command will drop you into the SQL shell:
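For example:

```bash
kamu sql
```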
Use describe to inspect the dataset schema:
The extra quotes are needed to treat the dataset name containing dots as a table name.
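For example (the dataset name is illustrative; note the quoting):

```sql
describe "us.cityofnewyork.data.zipcode-boundaries";
```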
Press Ctrl+D to exit the SQL shell.
SQL is a widely supported language, so kamu can be used in conjunction with many other tools that support it, such as Tableau and Power BI. See integrations for details.
kamu sql is a very powerful command that you can use both interactively and for scripting. We encourage you to explore more of its options through kamu sql --help.
Notebooks
kamu lets you access the power of multiple data engines through the convenient interface of Jupyter Notebooks. Get started by running the kamu notebook command.

When launching, we also tell kamu to pass the MapBox access token as the MAPBOX_ACCESS_TOKEN environment variable into Jupyter, which we will use for plotting. You can get the token for free, or skip this step and simply run kamu notebook. Once inside Jupyter, start by loading the kamu extension:
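Assuming the extension is registered under the name kamu, this is a single notebook cell:

```
%load_ext kamu
```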
When loaded, the kamu Python library will automatically connect to your local workspace node.
After this the import_dataset command becomes available, and we can load the dataset and give it an alias by doing:
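A sketch of such a cell - the dataset name and alias below are illustrative:

```
%import_dataset us.cityofnewyork.data.zipcode-boundaries --alias zipcodes
```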
You can now query the datasets using the connection:
Or via the %%sql cell magic:
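For example (the table alias zipcodes is illustrative):

```
%%sql
select * from zipcodes
limit 3
```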
To save the result of a %%sql cell to a variable, use:
Web UI
All of the above and more is also available to you via the embedded Web UI, which you can launch by running the kamu ui command.

Conclusion
We hope this quick overview inspires you to give kamu a try!
Don't get distracted by the pretty notebooks and UIs though - we covered only the tip of the iceberg. The true power of kamu lies in how it manages data, letting you reliably track it, transform it, and share results with your peers with ease.
Please continue to the online tutorial for some hands-on walkthroughs, and check out our other learning materials.