# Quick Start

> A quick rundown of the key features to get a feel for the tool

export const Diagram = ({src, alt}) => {
  return <div style={{
    display: "flex",
    "flex-direction": "column",
    "align-items": "center"
  }}>
    <img src={src} alt={alt} style={{
    background: "#dddddd",
    "margin-bottom": 0
  }} />
    <span>{alt}</span>
  </div>;
};

export const YouTubeList = ({id}) => {
  const src = `https://www.youtube.com/embed/videoseries?list=${id}`;
  return <iframe className="w-full aspect-video rounded-xl" src={src} allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen></iframe>;
};

export const YouTube = ({id, width}) => {
  const src = `https://www.youtube.com/embed/${id}`;
  return <iframe className="w-full aspect-video rounded-xl" src={src} allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen width={width}></iframe>;
};

export const Schema = ({t, id}) => {
  const anchor = id ? id : t.toLowerCase().replace(/\s+/g, "-");
  const link = `/odf/schemas#${anchor}`;
  return <a class="schema-object" href={link}>{t}</a>;
};

export const Term = ({t, id}) => {
  const anchor = id ? id : t.toLowerCase().replace(/\s+/g, "-");
  const link = `/general/glossary#${anchor}`;
  return <a class="glossary-term" href={link}>{t}</a>;
};

For a quick overview of key functionality you can also view this tutorial:

<YouTube id="oUTiWW6W78A" />

This tutorial will give you a high-level tour of `kamu` and show you how it works through examples.

We assume that you have already followed the [installation steps](/cli/install) and have the `kamu` tool ready.

<Info>
  **Not ready to install just yet?**

  Try `kamu` in this [online tutorial](/start/tutorial) without needing to install anything.
</Info>

Don't forget to set up **shell completions** - they make using `kamu` a lot more fun!

## The help command

Running `kamu` or `kamu -h` displays a help message that lists all top-level commands.

To get help on individual commands type:

```bash theme={null}
kamu <command> -h
```

This will usually contain a detailed description of what the command does along with usage examples.

Note that some commands also have sub-commands, e.g. `kamu repo {add,list,...}`. The same help pattern applies to those as well:

```bash theme={null}
kamu repo add -h
```

Command help is also available online on the [CLI Commands Reference](/cli/commands) page.

## Ingesting data

Throughout this tutorial we will be using the [Modified Zip Code Areas](https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk) dataset from the New York Open Data Portal.

### Initializing the workspace

To work with `kamu` you first need a **workspace**. A <Term t="workspace" /> is where `kamu` stores important information about <Term t="datasets" id="dataset" /> and cached data. Let's create one:

<Diagram src="/images/cli/first-steps/init.gif" alt="kamu init" />

```bash theme={null}
mkdir my_workspace
cd my_workspace
kamu init
```

A <Term t="workspace" /> is just a directory with a `.kamu` folder where all sorts of <Term t="data" /> and <Term t="metadata" /> are stored. It behaves very similarly to the `.git` directory in version-controlled repositories.

As you'd expect the <Term t="workspace" /> is currently empty:

```bash theme={null}
kamu list
```

### Creating a dataset

One of the design principles of `kamu` is to know exactly where any piece of data came from. So it never blindly copies data around - instead we establish ownership and links to external sources.

We'll get into the details of that later, but for now let's create such a link.

<Diagram src="/images/cli/first-steps/pull.gif" alt="kamu pull" />

<Term t="Datasets" id="dataset" /> are created from <Term t="dataset snapshots" id="dataset-snapshot" /> - special files that describe the **desired state** of the metadata upon creation.

We will use a <Schema t="DatasetSnapshot" /> file from [kamu-contrib repo](https://github.com/kamu-data/kamu-contrib/blob/master/us.cityofnewyork.data/zipcode-boundaries.yaml) that looks like this:

```yaml theme={null}
kind: DatasetSnapshot
version: 1
content:
  name: us.cityofnewyork.data.zipcode-boundaries
  kind: Root
  metadata:
    - kind: SetPollingSource
      fetch:
        kind: Url
        # Dataset home: https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk
        url: https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?date=20240115&accessType=DOWNLOAD&method=export&format=Shapefile
      read:
        kind: EsriShapefile
      merge:
        kind: Snapshot
        primaryKey:
          # Modified ZIP Code Tabulation Area (ZCTA)
          # See for explanation: https://nychealth.github.io/covid-maps/modzcta-geo/about.html
          - modzcta
```

You can either copy the above into an `example.yaml` file and run:

```bash theme={null}
kamu add example.yaml
```

Or add it directly from a URL like so:

```bash theme={null}
kamu add https://raw.githubusercontent.com/kamu-data/kamu-contrib/master/us.cityofnewyork.data/zipcode-boundaries.yaml
```

Such YAML files are called <Term t="manifests" id="manifest" />. The first two lines specify that the file contains a <Schema t="DatasetSnapshot" /> object and give the version of the schema, for upgradeability:

```yaml theme={null}
kind: DatasetSnapshot
version: 1
content: ...
```

Next we give the dataset a name and declare its kind:

```yaml theme={null}
name: us.cityofnewyork.data.zipcode-boundaries
kind: Root
```

<Term t="Datasets" id="dataset" /> that ingest external data in `kamu` are called <Term t="Root" id="root-dataset" /> datasets.

Next we have:

```yaml theme={null}
metadata:
  - kind: ...
    ...
  - kind: ...
    ...
```

This section contains one or many <Term t="metadata events" id="metadata-chain" /> that can describe different aspects of a dataset, like:

* where data comes from
* its schema
* license
* relevant documentation
* query examples
* data quality checks
* and much more.

<Tip>
  To create your own snapshot file use the `kamu new` command - it outputs a well-annotated template that you can customize for your needs.
</Tip>

### Dataset Identity

When a dataset is created, it is assigned a unique identifier.

You can see it by running:

```sh theme={null}
kamu list -w
```

```
┌───────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────┬─────┐
│                                      ID                                       │                   Name                   │ ... │
├───────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┼─────┤
│ did:odf:fed01057c94fb0378e3222704bb70a261d3ebeaa0d1b38c056a0bdd476360b8548db1 │ us.cityofnewyork.data.zipcode-boundaries │ ... │
└───────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────┴─────┘
```

Or:

```sh theme={null}
kamu log us.cityofnewyork.data.zipcode-boundaries
```

```
Block #0: ...
SystemTime: ...
Kind: Seed
DatasetKind: Root
DatasetID: did:odf:fed01057c94fb0378e3222704bb70a261d3ebeaa0d1b38c056a0bdd476360b8548db1
```

This is a [globally unique identity](/odf/spec#dataset-identity) which is based on a cryptographic key-pair that only you control.

Thus by creating an ODF dataset you get both:

* a way to **uniquely identify** it on the web
* and a way to **prove ownership** over it.

This will be extremely useful when we get to [sharing data with others](/cli/collab).
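To get a feel for what is packed into such an identifier, here is a minimal Python sketch that decodes the ID shown above. It assumes the layout that the identifier appears to follow: a multibase prefix `f` (lowercase hexadecimal), then the multicodec code for an Ed25519 public key (`0xed 0x01`), then the 32-byte key itself. The function name `decode_odf_did` is hypothetical and is not part of `kamu`.

```python
# Multicodec code for an Ed25519 public key (0xed), varint-encoded as two bytes
ED25519_PUB_MULTICODEC = bytes([0xED, 0x01])


def decode_odf_did(did: str) -> bytes:
    """Return the raw public key bytes embedded in a did:odf identifier.

    Hypothetical helper for illustration; the layout is an assumption.
    """
    prefix = "did:odf:"
    if not did.startswith(prefix):
        raise ValueError("not a did:odf identifier")
    encoded = did[len(prefix):]
    # Multibase: the first character names the encoding, "f" is lowercase base16
    if not encoded.startswith("f"):
        raise ValueError("this sketch only handles base16 (multibase 'f')")
    raw = bytes.fromhex(encoded[1:])
    if raw[:2] != ED25519_PUB_MULTICODEC:
        raise ValueError("unexpected key type")
    return raw[2:]


did = "did:odf:fed01057c94fb0378e3222704bb70a261d3ebeaa0d1b38c056a0bdd476360b8548db1"
public_key = decode_odf_did(did)
print(len(public_key))  # 32
```

Because the identifier is derived from a public key, whoever holds the matching private key can prove ownership by signing with it.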

### Fetching data

At this point our new dataset is still empty:

```bash theme={null}
kamu list
```

```
┌──────────────────────────────────────────┬──────┬────────┬─────────┬──────┐
│                   Name                   │ Kind │ Pulled │ Records │ Size │
├──────────────────────────────────────────┼──────┼────────┼─────────┼──────┤
│ us.cityofnewyork.data.zipcode-boundaries │ Root │   -    │       - │    - │
└──────────────────────────────────────────┴──────┴────────┴─────────┴──────┘
```

But the <Schema t="SetPollingSource" /> event that we specified in the snapshot describes where `kamu` should get external data from and how to ingest it.

Polling sources perform the following steps:

* `fetch` - downloading the data from some external source (e.g. HTTP/FTP)
* `prepare` (optional) - preparing raw binary data for ingestion (e.g. extracting an archive or converting between formats)
* `read` - reading the data into a structured form (e.g. from CSV or Parquet)
* `preprocess` (optional) - shaping the structured data with queries (e.g. to convert types into best suited form)
* `merge` - merging the new data from the source with the **history of previously seen data**

You can find more information about data sources and ingestion stages in [this section](/cli/ingest).
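The order of the stages listed above can be pictured as a chain of small functions. The Python sketch below is only an illustration of that flow, not how `kamu` implements ingestion: the function bodies, the gzip-compressed CSV input, and the sample values are all made up for the example.

```python
import csv
import gzip
import io


def fetch() -> bytes:
    # Stand-in for an HTTP/FTP download: returns a gzip-compressed CSV
    # (the rows are invented sample data)
    raw = b"modzcta,pop_est\n10001,23072\n10002,74993\n"
    return gzip.compress(raw)


def prepare(blob: bytes) -> bytes:
    # Optional step: unpack the raw binary data
    return gzip.decompress(blob)


def read(data: bytes) -> list[dict]:
    # Parse into structured records (all values are strings at this point)
    return list(csv.DictReader(io.StringIO(data.decode("utf-8"))))


def preprocess(rows: list[dict]) -> list[dict]:
    # Optional step: convert types into a better-suited form
    return [{"modzcta": r["modzcta"], "pop_est": int(r["pop_est"])} for r in rows]


def merge(history: list[dict], rows: list[dict], key: str) -> list[dict]:
    # Append only the records whose key has not been seen before
    seen = {r[key] for r in history}
    return history + [r for r in rows if r[key] not in seen]


history: list[dict] = []
history = merge(history, preprocess(read(prepare(fetch()))), key="modzcta")
# A second pull of the same source adds nothing new
history = merge(history, preprocess(read(prepare(fetch()))), key="modzcta")
print(len(history))  # 2
```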

Note that the data file we are ingesting is in [ESRI Shapefile](https://en.wikipedia.org/wiki/Shapefile) format, which is a widespread format for geo-spatial data, so we are using a special <Schema t="EsriShapefile" id="ReadStep::EsriShapefile" /> reader.

To fetch data from the source run:

```bash theme={null}
kamu pull --all
```

At this point the source data will be downloaded, decompressed, parsed into the structured form, preprocessed and saved locally.

```bash theme={null}
kamu list
```

```
┌──────────────────────────────────────────┬──────┬───────────────┬─────────┬──────────┐
│                   Name                   │ Kind │    Pulled     │ Records │   Size   │
├──────────────────────────────────────────┼──────┼───────────────┼─────────┼──────────┤
│ us.cityofnewyork.data.zipcode-boundaries │ Root │ X seconds ago │     178 │ 1.87 MiB │
└──────────────────────────────────────────┴──────┴───────────────┴─────────┴──────────┘
```

Note that when you `pull` a dataset, only the new records that `kamu` hasn't previously seen will be added. In fact `kamu` preserves the complete history of all data - this is what enables you to have stable references to data, lets you "time travel", and establish from where and how certain data was obtained (provenance). We will discuss this in depth in further tutorials.
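As a rough mental model of how a history can be preserved while ingesting repeated snapshots, the Python sketch below compares each new snapshot with the current state by primary key and records only the differences in an append-only log. The event names (`append`, `correct`, `retract`), the helper functions, and the sample values are assumptions made for illustration; the way `kamu` actually encodes changes may differ.

```python
def diff_snapshot(current: dict, snapshot: list[dict], key: str) -> list[tuple]:
    """Compare a new snapshot with the current state and return change events."""
    events = []
    incoming = {row[key]: row for row in snapshot}
    for k, row in incoming.items():
        if k not in current:
            events.append(("append", row))    # never seen before
        elif current[k] != row:
            events.append(("correct", row))   # seen, but values changed
    for k, row in current.items():
        if k not in incoming:
            events.append(("retract", row))   # disappeared from the source
    return events


def apply_events(current: dict, events: list[tuple], key: str) -> dict:
    """Derive a new state by applying change events to a copy of the old one."""
    state = dict(current)
    for op, row in events:
        if op == "retract":
            del state[row[key]]
        else:
            state[row[key]] = row
    return state


log = []    # append-only history of events, never rewritten
state = {}  # latest known state, derived from the log

pulls = [
    [{"modzcta": "10001", "pop_est": 100}, {"modzcta": "10002", "pop_est": 200}],
    [{"modzcta": "10001", "pop_est": 100}, {"modzcta": "10002", "pop_est": 250}],
]
for snapshot in pulls:
    events = diff_snapshot(state, snapshot, key="modzcta")
    log.extend(events)
    state = apply_events(state, events, key="modzcta")

print([op for op, _ in log])  # ['append', 'append', 'correct']
```

In this model, replaying only a prefix of the log (for example `apply_events({}, log[:2], key="modzcta")`) reconstructs an earlier state, which is the intuition behind "time travel".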

For now it suffices to say that all data is tracked by `kamu` in a series of blocks. The `Committed new block X` message you've seen during the `pull` tells us that the new data block was appended. You can inspect these blocks using the `log` command:

```bash theme={null}
kamu log us.cityofnewyork.data.zipcode-boundaries
```
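One way to picture a series of blocks is an append-only chain in which every block records the hash of the block before it, so that any change to past blocks becomes detectable. The Python sketch below shows that general idea only: the field names, the use of JSON and SHA-256, and the payloads are assumptions, not the format `kamu` stores on disk.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    # Hash a canonical JSON encoding of the block
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain: list[dict], kind: str, payload: dict) -> None:
    # Each new block points at the hash of the previous one (None for the first)
    prev = block_hash(chain[-1]) if chain else None
    chain.append({"sequence": len(chain), "prev": prev, "kind": kind, "payload": payload})


def verify(chain: list[dict]) -> bool:
    # Every block must point at the hash of the block before it
    return all(
        chain[i]["prev"] == block_hash(chain[i - 1]) for i in range(1, len(chain))
    )


chain: list[dict] = []
append_block(chain, "Seed", {"dataset_kind": "Root"})
append_block(chain, "SetPollingSource", {"fetch": "Url"})
append_block(chain, "AddData", {"records": 178})

print(verify(chain))  # True
chain[1]["payload"]["fetch"] = "tampered"
print(verify(chain))  # False
```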

## Exploring data

Since you might not have worked with this dataset before, you'll probably want to explore it first.

For this `kamu` provides many tools (from basic to advanced):

* `tail` command
* SQL shell
* Jupyter notebooks integration
* Web UI

### Tail command

To quickly preview the last few <Term t="events" id="event" /> of any dataset use the `tail` command:

```bash theme={null}
kamu tail us.cityofnewyork.data.zipcode-boundaries
```

### SQL shell

SQL is the *lingua franca* of data science and `kamu` uses it extensively. So naturally it provides you with a simple way to run ad-hoc queries on data.

<Diagram src="/images/cli/first-steps/sql.gif" alt="kamu sql" />

The following command will drop you into the SQL shell:

```bash theme={null}
kamu sql
```

By default this command uses the [Apache DataFusion](https://arrow.apache.org/datafusion/) <Term t="engine" />, so its [powerful SQL](https://arrow.apache.org/datafusion/user-guide/sql/index.html) is now available to you.

<Tip>
  You can also select other engines, e.g. [Apache Spark](https://spark.apache.org/)!
</Tip>

All datasets in your <Term t="workspace" /> should be available to you as tables:

```sql theme={null}
show tables;
```

You can use `describe` to inspect the dataset schema:

```sql theme={null}
describe "us.cityofnewyork.data.zipcode-boundaries";
```

<Note>
  The extra quotes are needed to treat the dataset name containing dots as a table name.
</Note>
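If you build queries in a script, a small helper can apply this quoting for you. The sketch below follows the standard SQL convention of wrapping an identifier in double quotes and doubling any embedded double quote; the function name `quote_identifier` is hypothetical and not something `kamu` provides.

```python
def quote_identifier(name: str) -> str:
    """Wrap a name in double quotes, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


table = quote_identifier("us.cityofnewyork.data.zipcode-boundaries")
query = f"select * from {table} limit 5"
print(query)
# select * from "us.cityofnewyork.data.zipcode-boundaries" limit 5
```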

And of course you can run queries against any dataset:

```sql theme={null}
select
  *
from "us.cityofnewyork.data.zipcode-boundaries"
order by pop_est desc
limit 5;
```

Use `Ctrl+D` to exit the SQL shell.

SQL is a widely supported language, so `kamu` can be used in conjunction with many other tools that support it, such as Tableau and Power BI. See [integrations](/integrations) for details.

`kamu sql` is a very powerful command that you can use both interactively and for scripting. We encourage you to explore more of its options through `kamu sql --help`.

### Notebooks

Kamu lets you access the power of multiple data engines through a convenient interface of [Jupyter Notebooks](https://jupyter.org/).

Get started by running:

```bash theme={null}
kamu notebook -e MAPBOX_ACCESS_TOKEN=<your mapbox token>
```

<Note>
  Above we also tell `kamu` to pass the [MapBox](https://www.mapbox.com/) access token as the `MAPBOX_ACCESS_TOKEN` environment variable into Jupyter, which we will use for plotting. You can get the token for free or skip this step and simply run `kamu notebook`.
</Note>

Executing this should open your default browser with Jupyter running in it.

From here let's create a notebook and start it by loading the `kamu` extension:

```
%load_ext kamu
```

We then need to create a connection:

```python theme={null}
import kamu
con = kamu.connect()
```

The `kamu` Python library will automatically connect to your local workspace node.

You can now query the datasets using the connection:

```python theme={null}
con.query("select * from 'us.cityofnewyork.data.zipcode-boundaries' limit 3")
```

Or, more conveniently, using the `%%sql` cell magic:

```sql theme={null}
%%sql
select * from 'us.cityofnewyork.data.zipcode-boundaries' limit 3
```

The queries return a regular [Pandas](https://pandas.pydata.org/) DataFrame.

<Diagram src="/images/cli/first-steps/notebook-002.png" alt="kamu notebook 002" />

Thanks to the [autovizwidget](https://github.com/jupyter-incubator/sparkmagic) library you also get some simple instant visualizations for results of your queries.

<Diagram src="/images/cli/first-steps/notebook-003.png" alt="kamu notebook 003" />

To assign the result of a `%%sql` cell to a variable use:

```sql theme={null}
%%sql -o population_density
select ...
```

<Diagram src="/images/cli/first-steps/notebook-004.png" alt="kamu notebook 004" />

You can use any of your favorite libraries to further process and visualize it. For example, here is population density data visualized as a *choropleth* chart using the [mapboxgl](https://github.com/mapbox/mapboxgl-jupyter) library:

<Diagram src="/images/cli/first-steps/notebook-005.png" alt="kamu notebook 005" />

You can find this as well as many other notebooks in the [kamu-contrib](https://github.com/kamu-data/kamu-contrib) repo.

### Web UI

All the above and more is also available to you via the embedded Web UI, which you can launch by running:

```bash theme={null}
kamu ui
```

The Web UI is especially useful once you start developing complex stream processing pipelines, to explore them more visually:

<Diagram src="/images/cli/first-steps/kamu-ui.png" alt="Kamu Web UI" />

## Conclusion

We hope this quick overview inspires you to give `kamu` a try!

Don't get distracted by the pretty notebooks and UIs though - we covered only the tip of the iceberg. The true power of `kamu` lies in how it manages data, letting you reliably track it, transform it, and share results with your peers in an easily **<Term t="reproducible and verifiable" id="verifiability" />** way.

Please continue to the [online tutorial](/start/tutorial) for some hands-on walkthroughs and tutorials, and check out our other [learning materials](/start/learning-materials).
