Exploring Data
While kamu focuses primarily on the problem of data management, you may often want to do some basic data exploration before exporting data for further use in your data science projects, so we provide a few simple exploration tools that let you assess the state of data without leaving the comfort of one tool.
Tail Command
To quickly view a sample of the latest events in a dataset:
$ kamu tail ca.bccdc.covid19.case-details
Lineage Command
To display the lineage of a certain dataset in a browser:
$ kamu inspect lineage ca.covid19.daily-cases -b

SQL Console
kamu provides a simple way to run ad-hoc queries and explore data using SQL.

The following command will drop you into the SQL shell:
$ kamu sql
By default, the SQL console uses the Apache Spark engine.
All datasets in your workspace should be available to you as tables:
kamu> show tables;
You can use describe to inspect the dataset's schema:
kamu> describe `us.cityofnewyork.data.zipcode-boundaries`;
For brevity you can create aliases:
kamu> create temp view zipcodes as (select * from `us.cityofnewyork.data.zipcode-boundaries`);
And of course you can run queries against any dataset:
kamu> select po_name, sum(population) from zipcodes group by po_name;
Use Ctrl+D to exit the SQL shell.
SQL is a widely supported language, so kamu can be used in conjunction with many other tools that support it, such as Tableau and Power BI. Use the following command to expose kamu data through the JDBC server:
$ kamu sql server
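As an illustration only, here is a minimal sketch of querying that server from Python with the PyHive client. It assumes the engine exposes a HiveServer2-compatible endpoint listening on localhost:10000; adjust the host and port to match your actual setup.

# Minimal sketch: querying the JDBC/Thrift endpoint from Python with PyHive.
# Assumptions: the server speaks the HiveServer2 protocol and listens on
# localhost:10000 -- adjust host/port to match your actual setup.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute(
    "select po_name, sum(population) "
    "from `us.cityofnewyork.data.zipcode-boundaries` "
    "group by po_name"
)
for row in cursor.fetchall():
    print(row)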
kamu sql is a very powerful command that you can use both interactively and in scripts. We encourage you to explore more of its options through kamu sql --help.
Jupyter Notebooks
kamu also connects the power of Apache Spark with the Jupyter Notebook server. You can get started by running:
$ kamu notebook
Use the -e ENV_VAR option to pass additional environment variables into the notebook server; this can be very useful for the access and security tokens needed by different visualization APIs. Executing this command should open your default browser with Jupyter running in it.
From here, create a PySpark notebook. We start all notebooks by loading the kamu extension:
%load_ext kamu
After this, the import_dataset command becomes available, and we can load a dataset and alias it by doing:
%import_dataset us.cityofnewyork.data.zipcode-boundaries --alias zipcodes

This will take a few seconds, as in the background it creates an Apache Spark session, and it is Spark that loads the dataset into what it calls a "dataframe".
You can then start using the zipcodes dataframe in the exact same way you would in an interactive spark-shell.
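For example, a sketch like the following works in a notebook cell (it assumes the column names seen in the SQL examples above, po_name and population):

# Inspect the schema of the imported dataframe
zipcodes.printSchema()

# Regular PySpark DataFrame operations work as usual
zipcodes \
    .select("po_name", "population") \
    .filter(zipcodes.population > 10000) \
    .orderBy("population", ascending=False) \
    .show(10)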
There are a few very important things to understand here:
- Spark and Jupyter are running in separate processes
- The commands you execute in the notebook are executed “remotely” and the results are transferred back
- This means that it doesn’t really matter if your data is located on your machine or somewhere else - the notebook will work the same
The dataframe is automatically exposed in the SQL engine too, and you can run SQL queries using the %%sql annotation:
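For example, a cell along these lines (using the zipcodes alias imported above) runs the query on Spark and displays the result in the notebook:

%%sql
select po_name, sum(population) as population
from zipcodes
group by po_name
order by population desc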

Thanks to the sparkmagic library, you also get simple instant visualizations for the results of your queries.

After you are done joining, filtering, and shaping the data, you can choose to pull it out of Spark and into the Jupyter notebook kernel using the %%sql -o alias command:
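For instance, a cell like the one below would run the query on Spark and place the result into the notebook kernel as a pandas dataframe named population_by_po (the name is just an illustration):

%%sql -o population_by_po
select po_name, sum(population) as population
from zipcodes
group by po_name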

Now that you have the data in Jupyter, you can use any of your favorite tools and libraries to further process or visualize it.
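Continuing with the hypothetical population_by_po dataframe from the sketch above, a regular Python cell could process and plot it with pandas and matplotlib:

import matplotlib.pyplot as plt

# population_by_po is now an ordinary pandas DataFrame in the notebook kernel
top10 = population_by_po.sort_values("population", ascending=False).head(10)
top10.plot.bar(x="po_name", y="population", legend=False)
plt.ylabel("population")
plt.tight_layout()
plt.show()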
Web UI
Once your pipelines become more complex, it can be really useful to explore them visually. For this, kamu comes with an embedded Web UI, which you can launch by running:
$ kamu ui
It should look something like this:
