kamu

Usage: kamu [OPTIONS] <COMMAND>

Subcommands:

  • add — Add a new dataset or modify an existing one
  • completions — Generate tab-completion scripts for your shell
  • config — Get or set configuration options
  • delete [rm] — Delete a dataset
  • ingest — Adds data to the root dataset according to its push source configuration
  • init — Initialize an empty workspace in the current directory
  • inspect — Group of commands for exploring dataset metadata
  • list [ls] — List all datasets in the workspace
  • log — Shows dataset metadata history
  • login — Authenticates with a remote ODF server interactively
  • logout — Logs out from a remote Kamu server
  • new — Creates a new dataset manifest from a template
  • notebook — Starts the notebook server for exploring the data in the workspace
  • pull — Pull new data into the datasets
  • push — Push local data into a repository
  • rename [mv] — Rename a dataset
  • reset — Revert the dataset back to the specified state
  • repo — Manage set of tracked repositories
  • search — Searches for datasets in the registered repositories
  • sql — Executes an SQL query or drops you into an SQL shell
  • system — Command group for system-level functionality
  • tail — Displays a sample of most recent records in a dataset
  • ui — Opens web interface
  • verify — Verifies the validity of a dataset
  • version — Outputs build information

Options:

  • -v — Sets the level of verbosity (repeat for more)
  • --no-color — Disable color output in the terminal
  • -q, --quiet — Suppress all non-essential output
  • -y, --yes — Do not ask for confirmation and assume the ‘yes’ answer
  • --trace — Record and visualize the command execution as perfetto.dev trace
  • --metrics — Dump all metrics at the end of command execution

To get help for individual commands use: kamu <command> -h

kamu add

Add a new dataset or modify an existing one

Usage: kamu add [OPTIONS] [MANIFEST]...

Arguments:

  • <MANIFEST> — Dataset manifest reference(s) (path or URL)

Options:

  • -r, --recursive — Recursively search for all manifests in the specified directory
  • --replace — Delete and re-add datasets that already exist
  • --stdin — Read manifests from standard input
  • --name <N> — Overrides the name in a loaded manifest

This command creates a new dataset from the provided DatasetSnapshot manifest.

Note that after kamu creates a dataset, changes in the source file will not have any effect unless you run the add command again. When experimenting with adding a new dataset, you may currently need to delete and re-add it multiple times until you get your parameters and schema right.
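
For example, assuming a manifest file named org.example.data.yaml (a placeholder), you can shorten this loop with the --replace flag, which deletes and re-adds datasets that already exist:

kamu add --replace org.example.data.yaml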

In future versions the add command will allow you to modify the structure of already existing datasets (e.g. changing schema in a compatible way).

Examples:

Add a root/derivative dataset from local manifest:

kamu add org.example.data.yaml

Add datasets from all manifests found in the current directory:

kamu add --recursive .

Add a dataset from a manifest hosted externally (e.g. on GitHub):

kamu add https://raw.githubusercontent.com/kamu-data/kamu-contrib/master/ca.bankofcanada/ca.bankofcanada.exchange-rates.daily.yaml

To add a dataset from a repository, see the kamu pull command.

kamu completions

Generate tab-completion scripts for your shell

Usage: kamu completions <SHELL>

Arguments:

  • <SHELL>

    Possible values: bash, elvish, fish, powershell, zsh

The command outputs to STDOUT, allowing you to redirect the output to a file of your choosing. Where you place the file will depend on which shell and which operating system you are using. Your particular configuration may also determine where these scripts need to be placed.

Here are some common setups:

Bash:

Append the following to your ~/.bashrc:

source <(kamu completions bash)

You will need to reload your shell session (or execute the same command in your current one) for changes to take effect.

Zsh:

Append the following to your ~/.zshrc:

autoload -U +X bashcompinit && bashcompinit
source <(kamu completions bash)
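
Fish:

A minimal setup, assuming the default Fish user completions directory (~/.config/fish/completions):

kamu completions fish > ~/.config/fish/completions/kamu.fish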

Please contribute a guide for your favorite shell!

kamu config

Get or set configuration options

Usage: kamu config <COMMAND>

Subcommands:

  • list [ls] — Display current configuration combined from all config files
  • get — Get current configuration value
  • set — Set or unset configuration value

Configuration in kamu is managed very similarly to git. Starting with your current workspace and going up the directory tree you can have multiple .kamuconfig YAML files which are all merged together to get the resulting config.

Most commonly you will have a workspace-scoped config inside the .kamu directory and the user-scoped config residing in your home directory.

Examples:

List current configuration as combined view of config files:

kamu config list

Get current configuration value:

kamu config get engine.runtime

Set configuration value in workspace scope:

kamu config set engine.runtime podman

Set configuration value in user scope:

kamu config set --user engine.runtime podman

Unset or revert to default value:

kamu config set --user engine.runtime
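
Inspect only the user-scope value, or fall back to the default when nothing is set explicitly (engine.runtime is used purely for illustration):

kamu config get --user engine.runtime
kamu config get --with-defaults engine.runtime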

kamu config list

Display current configuration combined from all config files

Usage: kamu config list [OPTIONS]

Options:

  • --user — Show only user scope configuration
  • --with-defaults — Show configuration with all default values applied

kamu config get

Get current configuration value

Usage: kamu config get [OPTIONS] <CFGKEY>

Arguments:

  • <CFGKEY> — Path to the config option

Options:

  • --user — Operate on the user scope configuration file
  • --with-defaults — Get default value if config option is not explicitly set

kamu config set

Set or unset configuration value

Usage: kamu config set [OPTIONS] <CFGKEY> [VALUE]

Arguments:

  • <CFGKEY> — Path to the config option
  • <VALUE> — New value to set

Options:

  • --user — Operate on the user scope configuration file

kamu delete

Delete a dataset

Usage: kamu delete [OPTIONS] [DATASET]...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • -a, --all — Delete all datasets in the workspace
  • -r, --recursive — Also delete all transitive dependencies of specified datasets

This command deletes the dataset from your workspace, including both metadata and the raw data.

Take great care when deleting root datasets. If you have not pushed your local changes to a repository, the data will be lost.

Deleting a derivative dataset is usually not a big deal, since they can always be reconstructed, but it will disrupt downstream consumers.

Examples:

Delete a local dataset:

kamu delete my.dataset

Delete local datasets matching pattern:

kamu delete my.dataset.%

kamu ingest

Adds data to the root dataset according to its push source configuration

Usage: kamu ingest [OPTIONS] <DATASET> [FILE]...

Arguments:

  • <DATASET> — Local dataset reference
  • <FILE> — Data file(s) to ingest

Options:

  • --source-name <SRC> — Name of the push source to use for ingestion

  • --event-time <T> — Event time to be used if data does not contain one

  • --stdin — Read data from the standard input

  • -r, --recursive — Recursively propagate the updates into all downstream datasets

  • --input-format <FMT> — Overrides the media type of the data expected by the push source

    Possible values: csv, json, ndjson, geojson, ndgeojson, parquet, esrishapefile

Examples:

Ingest data from files:

kamu ingest org.example.data path/to/data.csv

Ingest data from standard input (assumes source is defined to use NDJSON):

echo '{"key": "value1"}\n{"key": "value2"}' | kamu ingest org.example.data --stdin

Ingest data with format conversion:

echo '[{"key": "value1"}, {"key": "value2"}]' | kamu ingest org.example.data --stdin --input-format json

Ingest data with event time hint:

kamu ingest org.example.data data.json --event-time 2050-01-02T12:00:00Z
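
Ingest through a specific push source and propagate the new data into downstream datasets (the source name my-source is a placeholder):

kamu ingest org.example.data data.csv --source-name my-source --recursive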

kamu init

Initialize an empty workspace in the current directory

Usage: kamu init [OPTIONS]

Options:

  • --exists-ok — Don’t return an error if workspace already exists
  • --pull-images — Only pull container images and exit

A workspace is where kamu stores all the important information about datasets (metadata) and in some cases raw data.

It is recommended to create one kamu workspace per data science project, grouping all related datasets together.

Initializing a workspace creates a .kamu directory that contains dataset metadata, data, and all supporting files (configs, known repositories, etc.).
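
A typical way to start a new project might look like this (the directory name is arbitrary):

mkdir my-project
cd my-project
kamu init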

kamu inspect

Group of commands for exploring dataset metadata

Usage: kamu inspect <COMMAND>

Subcommands:

  • lineage — Shows the dependency tree of a dataset
  • query — Shows the transformations used by a derivative dataset
  • schema — Shows the dataset schema

kamu inspect lineage

Shows the dependency tree of a dataset

Usage: kamu inspect lineage [OPTIONS] [DATASET]...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • -o, --output-format <FMT> — Format of the output

    Possible values: shell, dot, csv, html

  • -b, --browse — Produce HTML and open it in a browser

Presents the dataset-level lineage that includes current and past dependencies.

Examples:

Show lineage of a single dataset:

kamu inspect lineage my.dataset

Show lineage graph of all datasets in a browser:

kamu inspect lineage --browse

Render the lineage graph into a png image (needs graphviz installed):

kamu inspect lineage -o dot | dot -Tpng > depgraph.png

kamu inspect query

Shows the transformations used by a derivative dataset

Usage: kamu inspect query <DATASET>

Arguments:

  • <DATASET> — Local dataset reference

This command allows you to audit the transformations performed by a derivative dataset and their evolution. Such an audit is an important step in validating the trustworthiness of data (see the kamu verify command).
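
For example, to review the transformation queries of a derivative dataset (the name is a placeholder):

kamu inspect query org.example.derivative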

kamu inspect schema

Shows the dataset schema

Usage: kamu inspect schema [OPTIONS] <DATASET>

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -o, --output-format <FMT> — Format of the output

    Possible values: ddl, parquet, parquet-json, arrow-json

Displays the schema of the dataset. Note that dataset schemas can evolve over time and by default the latest schema will be shown.

Examples:

Show logical schema of a dataset in the DDL format:

kamu inspect schema my.dataset

Show physical schema of the underlying Parquet files:

kamu inspect schema my.dataset -o parquet

kamu list

List all datasets in the workspace

Usage: kamu list [OPTIONS]

Options:

  • -o, --output-format <FMT> — Format to display the results in

    Possible values:

    • csv: Comma-separated values
    • json: Array of Structures format
    • ndjson: One JSON object per line - an easily splittable format
    • json-soa: Structure of arrays - a more compact and efficient format for encoding an entire dataframe
    • json-aoa: Array of arrays - compact and efficient, and preserves column order
    • table: A pretty human-readable table
  • -w, --wide — Show more details (repeat for more)

Examples:

To see a human-friendly list of datasets in your workspace:

kamu list

To see more details:

kamu list -w

To get a machine-readable list of datasets:

kamu list -o csv

kamu log

Shows dataset metadata history

Usage: kamu log [OPTIONS] <DATASET>

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -o, --output-format <FMT> — Format of the output

    Possible values: shell, yaml

  • -f, --filter <FLT> — Types of events to include

  • --limit <LIMIT> — Maximum number of blocks to display

    Default value: 500

Metadata of a dataset contains a historical record of everything that ever influenced how the data currently looks.

This includes events such as:

  • Data ingestion / transformation
  • Change of query
  • Change of schema
  • Change of source URL or other ingestion steps in a root dataset

Use this command to explore how a dataset evolved over time.

Examples:

Show brief summaries of individual metadata blocks:

kamu log org.example.data

Show detailed content of all blocks:

kamu log -o yaml org.example.data

Using a filter to inspect blocks containing query changes of a derivative dataset:

kamu log -o yaml --filter source org.example.data

kamu login

Authenticates with a remote ODF server interactively

Usage: kamu login [OPTIONS] [SERVER] [COMMAND]

Subcommands:

  • oauth — Performs non-interactive login to a remote Kamu server via OAuth provider token
  • password — Performs non-interactive login to a remote Kamu server via login and password

Arguments:

  • <SERVER> — ODF server URL (defaults to kamu.dev)

Options:

  • --user — Store access token in the user home folder rather than in the workspace
  • --check — Check whether existing authorization is still valid without triggering a login flow
  • --access-token <ACCESS_TOKEN> — Provide an existing access token
  • --repo-name <REPO_NAME> — Repository name under which this server will be stored in the repositories list
  • --skip-add-repo — Don’t automatically add a remote repository for this host
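
Examples:

Log in to the default server interactively, then later check that the session is still valid (a minimal sketch using only the flags above):

kamu login
kamu login --check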

kamu login oauth

Performs non-interactive login to a remote Kamu server via OAuth provider token

Usage: kamu login oauth [OPTIONS] <PROVIDER> <ACCESS_TOKEN> [SERVER]

Arguments:

  • <PROVIDER> — Name of the OAuth provider, e.g. ‘github’
  • <ACCESS_TOKEN> — OAuth provider access token
  • <SERVER> — ODF backend server URL (defaults to kamu.dev)

Options:

  • --user — Store access token in the user home folder rather than in the workspace

kamu login password

Performs non-interactive login to a remote Kamu server via login and password

Usage: kamu login password [OPTIONS] <LOGIN> <PASSWORD> [SERVER]

Arguments:

  • <LOGIN> — Specify user name
  • <PASSWORD> — Specify password
  • <SERVER> — ODF backend server URL (defaults to kamu.dev)

Options:

  • --user — Store access token in the user home folder rather than in the workspace

kamu logout

Logs out from a remote Kamu server

Usage: kamu logout [OPTIONS] [SERVER]

Arguments:

  • <SERVER> — ODF server URL (defaults to kamu.dev)

Options:

  • --user — Drop access token stored in the user home folder rather than in the workspace
  • -a, --all — Log out of all servers

kamu new

Creates a new dataset manifest from a template

Usage: kamu new [OPTIONS] <NAME>

Arguments:

  • <NAME> — Name of the new dataset

Options:

  • --root — Create a root dataset
  • --derivative — Create a derivative dataset

This command will create a dataset manifest from a template, allowing you to customize the most relevant parts without having to remember the exact structure of the YAML file.

Examples:

Create org.example.data.yaml file from template in the current directory:

kamu new org.example.data --root
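
After editing the generated manifest, register it in the workspace (file name assumed from the example above):

kamu add org.example.data.yaml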

kamu notebook

Starts the notebook server for exploring the data in the workspace

Usage: kamu notebook [OPTIONS]

Options:

  • --address <ADDRESS> — Expose HTTP server on specific network interface
  • --http-port <HTTP_PORT> — Expose HTTP server on specific port
  • -e, --env <VAR> — Propagate or set an environment variable in the notebook (e.g. -e VAR or -e VAR=foo)

This command will run the Jupyter server and the Spark engine connected together, letting you query data with SQL before pulling it into the notebook for final processing and visualization.

For more information check out notebook examples at https://github.com/kamu-data/kamu-cli
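
Examples:

Start the notebook server on a specific port and pass an environment variable through to the notebook (port and variable name are placeholders):

kamu notebook --http-port 8888 -e AWS_ACCESS_KEY_ID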

kamu pull

Pull new data into the datasets

Usage: kamu pull [OPTIONS] [DATASET]...

Arguments:

  • <DATASET> — Local or remote dataset reference(s)

Options:

  • -a, --all — Pull all datasets in the workspace
  • -r, --recursive — Also pull all transitive dependencies of specified datasets
  • --fetch-uncacheable — Pull latest data from uncacheable data sources
  • --as <NAME> — Local name of a dataset to use when syncing from a repository
  • --no-alias — Don’t automatically add a remote push alias for this destination
  • --set-watermark <TIME> — Injects a manual watermark into the dataset to signify that no data is expected to arrive with event time that precedes it
  • -f, --force — Overwrite local version with remote, even if revisions have diverged
  • --reset-derivatives-on-diverged-input — Run hard compaction of derivative dataset if transformation failed due to root dataset compaction

Pull is a multi-functional command that lets you update a local dataset. Depending on the parameters and the types of datasets involved, it can be used to:

  • Run polling ingest to pull data into a root dataset from an external source
  • Run transformations on a derivative dataset to process previously unseen data
  • Pull dataset from a remote repository into your workspace
  • Update watermark on a dataset

Examples:

Fetch latest data in a specific dataset:

kamu pull org.example.data

Fetch latest data in datasets matching pattern:

kamu pull org.example.%

Fetch latest data for the entire dependency tree of a dataset:

kamu pull --recursive org.example.derivative

Refresh data of all datasets in the workspace:

kamu pull --all

Fetch dataset from a registered repository:

kamu pull kamu/org.example.data

Fetch dataset from a URL (see kamu repo add -h for supported sources):

kamu pull ipfs://bafy...a0dx/data
kamu pull s3://my-bucket.example.org/odf/org.example.data
kamu pull s3+https://example.org:5000/data --as org.example.data

Advance the watermark of a dataset:

kamu pull --set-watermark 2020-01-01 org.example.data

kamu push

Push local data into a repository

Usage: kamu push [OPTIONS] [DATASET]...

Arguments:

  • <DATASET> — Local or remote dataset reference(s)

Options:

  • -a, --all — Push all datasets in the workspace

  • -r, --recursive — Also push all transitive dependencies of specified datasets

  • --no-alias — Don’t automatically add a remote push alias for this destination

  • --to <REM> — Remote alias or a URL to push to

  • -f, --force — Overwrite remote version with local, even if revisions have diverged

  • --visibility <VIS> — Visibility to assign to the dataset(s) when they are pushed for the first time

    Default value: private

    Possible values: private, public

Use this command to share your new dataset or new data with others. All changes performed by this command are atomic and non-destructive. This command will analyze the state of the dataset at the repository and will only upload data and metadata that wasn’t previously seen.

Similarly to git, if someone else modified the dataset concurrently with you, your push will be rejected and you will have to resolve the conflict.

Examples:

Sync dataset to a destination URL (see kamu repo add -h for supported protocols):

kamu push org.example.data --to s3://my-bucket.example.org/odf/org.example.data

Sync dataset to a named repository (see kamu repo command group):

kamu push org.example.data --to kamu-hub/org.example.data

Sync dataset that already has a push alias:

kamu push org.example.data

Sync datasets matching pattern that already have push aliases:

kamu push org.example.%

Add dataset to local IPFS node and update IPNS entry to the new CID:

kamu push org.example.data --to ipns://k5..zy

kamu rename

Rename a dataset

Usage: kamu rename <DATASET> <NAME>

Arguments:

  • <DATASET> — Dataset reference
  • <NAME> — The new name to give it

Use this command to rename a dataset in your local workspace. Renaming is safe in terms of downstream derivative datasets as they use stable dataset IDs to define their inputs.

Examples:

Renaming is often useful when you pull a remote dataset by URL and it gets auto-assigned a name that is not the most convenient:

kamu pull ipfs://bafy...a0da
kamu rename bafy...a0da my.dataset

kamu reset

Revert the dataset back to the specified state

Usage: kamu reset <DATASET> <HASH>

Arguments:

  • <DATASET> — Dataset reference
  • <HASH> — Hash of the block to reset to

Resetting a dataset to the specified block erases all metadata blocks that followed it and deletes all data added since that point. This can sometimes be useful to resolve conflicts, but otherwise should be used with care.

Keep in mind that blocks that were pushed to a repository could already have been observed by other people, so resetting the history will not let you take that data back and will instead create conflicts for the downstream consumers of your data.
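
Examples:

Inspect the metadata chain to pick a block, then reset to it (the hash is a truncated placeholder, use real hashes from kamu log):

kamu log my.dataset
kamu reset my.dataset zW1a...5d2f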

kamu repo

Manage set of tracked repositories

Usage: kamu repo <COMMAND>

Subcommands:

  • add — Adds a repository
  • delete [rm] — Deletes a reference to repository
  • list [ls] — Lists known repositories
  • alias — Manage set of remote aliases associated with datasets

Repositories are nodes on the network that let users exchange datasets. In the most basic form, a repository can simply be a location where the dataset files are hosted over one of the supported file or object-based data transfer protocols. The owner of a dataset will have push privileges to this location, while other participants can pull data from it.

Examples:

Show available repositories:

kamu repo list

Add S3 bucket as a repository:

kamu repo add example-repo s3://bucket.my-company.example/

kamu repo add

Adds a repository

Usage: kamu repo add <NAME> <URL>

Arguments:

  • <NAME> — Local alias of the repository
  • <URL> — URL of the repository

For local file system repositories use the following URL formats:

file:///home/me/example/repository/
file:///c:/Users/me/example/repository/

For S3-compatible basic repositories use:

s3://bucket.my-company.example/
s3+http://my-minio-server:9000/bucket/
s3+https://my-minio-server:9000/bucket/

For ODF-compatible smart repositories use:

odf+http://odf-server/
odf+https://odf-server/
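
For example, to register an ODF smart repository under a local alias (the alias and host are placeholders):

kamu repo add kamu-hub odf+https://odf-server/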

kamu repo delete

Deletes a reference to repository

Usage: kamu repo delete [OPTIONS] [REPOSITORY]...

Arguments:

  • <REPOSITORY> — Repository name(s)

Options:

  • -a, --all — Delete all known repositories

kamu repo list

Lists known repositories

Usage: kamu repo list [OPTIONS]

Options:

  • -o, --output-format <FMT> — Format to display the results in

    Possible values:

    • csv: Comma-separated values
    • json: Array of Structures format
    • ndjson: One JSON object per line - an easily splittable format
    • json-soa: Structure of arrays - a more compact and efficient format for encoding an entire dataframe
    • json-aoa: Array of arrays - compact and efficient, and preserves column order
    • table: A pretty human-readable table

kamu repo alias

Manage set of remote aliases associated with datasets

Usage: kamu repo alias <COMMAND>

Subcommands:

  • add — Adds a remote alias to a dataset
  • delete [rm] — Deletes a remote alias associated with a dataset
  • list [ls] — Lists remote aliases

When you pull and push datasets from repositories, kamu uses aliases to let you avoid specifying the full remote reference each time. Aliases are usually created the first time you do a push or pull and are saved for later. If you have an unusual setup (e.g. pushing to multiple repositories), you can use this command to manage the aliases.

Examples:

List all aliases:

kamu repo alias list

List all aliases of a specific dataset:

kamu repo alias list org.example.data

Add a new pull alias:

kamu repo alias add --pull org.example.data kamu.dev/me/org.example.data

kamu repo alias add

Adds a remote alias to a dataset

Usage: kamu repo alias add [OPTIONS] <DATASET> <ALIAS>

Arguments:

  • <DATASET> — Local dataset reference
  • <ALIAS> — Remote dataset name

Options:

  • --push — Add a push alias
  • --pull — Add a pull alias

kamu repo alias delete

Deletes a remote alias associated with a dataset

Usage: kamu repo alias delete [OPTIONS] [DATASET] [ALIAS]

Arguments:

  • <DATASET> — Local dataset reference
  • <ALIAS> — Remote dataset name

Options:

  • -a, --all — Delete all aliases
  • --push — Delete a push alias
  • --pull — Delete a pull alias

kamu repo alias list

Lists remote aliases

Usage: kamu repo alias list [OPTIONS] [DATASET]

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -o, --output-format <FMT> — Format to display the results in

    Possible values:

    • csv: Comma-separated values
    • json: Array of Structures format
    • ndjson: One JSON object per line - an easily splittable format
    • json-soa: Structure of arrays - a more compact and efficient format for encoding an entire dataframe
    • json-aoa: Array of arrays - compact and efficient, and preserves column order
    • table: A pretty human-readable table

kamu search

Searches for datasets in the registered repositories

Usage: kamu search [OPTIONS] [QUERY]

Arguments:

  • <QUERY> — Search terms

Options:

  • -o, --output-format <FMT> — Format to display the results in

    Possible values:

    • csv: Comma-separated values
    • json: Array of Structures format
    • ndjson: One JSON object per line - an easily splittable format
    • json-soa: Structure of arrays - a more compact and efficient format for encoding an entire dataframe
    • json-aoa: Array of arrays - compact and efficient, and preserves column order
    • table: A pretty human-readable table
  • --repo <REPO> — Repository name(s) to search in

Search is delegated to the repository implementations and its capabilities depend on the type of the repo. Whereas smart repos may support advanced full-text search, simple storage-only repos may be limited to a substring search by dataset name.

Examples:

Search all repositories:

kamu search covid19

Search only specific repositories:

kamu search covid19 --repo kamu --repo statcan.gc.ca

kamu sql

Executes an SQL query or drops you into an SQL shell

Usage: kamu sql [OPTIONS] [COMMAND]

Subcommands:

  • server — Run JDBC server only

Options:

  • -o, --output-format <FMT> — Format to display the results in

    Possible values:

    • csv: Comma-separated values
    • json: Array of Structures format
    • ndjson: One JSON object per line - an easily splittable format
    • json-soa: Structure of arrays - a more compact and efficient format for encoding an entire dataframe
    • json-aoa: Array of arrays - compact and efficient, and preserves column order
    • table: A pretty human-readable table
  • --engine <ENG> — Engine type to use for this SQL session

    Possible values: datafusion, spark

  • --url <URL> — URL of a running JDBC server (e.g. jdbc:hive2://example.com:10000)

  • -c, --command <CMD> — SQL command to run

  • --script <FILE> — SQL script file to execute

The SQL shell allows you to explore data of all datasets in your workspace using one of the supported data processing engines. This can be a great way to prepare and test a query that you can later turn into a derivative dataset.

Examples:

Drop into SQL shell:

kamu sql

Execute SQL command and return its output in CSV format:

kamu sql -c 'SELECT * FROM `org.example.data` LIMIT 10' -o csv

Run SQL server to use with external data processing tools:

kamu sql server --address 0.0.0.0 --port 8080

Connect to a remote SQL server:

kamu sql --url jdbc:hive2://example.com:10000

Note: currently, when connecting to a remote kamu SQL server, you will need to manually instruct it to load datasets from the data files. This can be done using the following command:

CREATE TEMP VIEW `my.dataset` AS (SELECT * FROM parquet.`kamu_data/my.dataset`);

kamu sql server

Run JDBC server only

Usage: kamu sql server [OPTIONS]

Options:

  • --address <ADDRESS> — Expose JDBC server on specific network interface
  • --port <PORT> — Expose JDBC server on specific port
  • --livy — Run Livy server instead of Spark JDBC
  • --flight-sql — Run Flight SQL server instead of Spark JDBC
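
For example, to expose a Flight SQL endpoint instead of the Spark JDBC server (address and port are placeholders):

kamu sql server --flight-sql --address 0.0.0.0 --port 50050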

kamu system

Command group for system-level functionality

Usage: kamu system <COMMAND>

Subcommands:

  • api-server — Run HTTP + GraphQL server
  • compact — Compact a dataset
  • debug-token — Validate a Kamu token
  • diagnose — Run basic system diagnose check
  • generate-token — Generate a platform token from a known secret for debugging
  • gc — Runs garbage collection to clean up cached and unreachable objects in the workspace
  • info — Summary of the system information
  • ipfs — IPFS helpers
  • upgrade-workspace — Upgrade the layout of a local workspace to the latest version

kamu system api-server

Run HTTP + GraphQL server

Usage: kamu system api-server [OPTIONS] [COMMAND]

Subcommands:

  • gql-query — Executes the GraphQL query and prints out the result
  • gql-schema — Prints the GraphQL schema

Options:

  • --address <ADDRESS> — Bind to a specific network interface
  • --http-port <HTTP_PORT> — Expose HTTP+GraphQL server on specific port
  • --get-token — Output a JWT token you can use to authorize API queries
  • --external-address <EXTERNAL_ADDRESS> — Allows changing the base URL used in the API. Can be handy when launching inside a container

Examples:

Run API server on a specified port:

kamu system api-server --http-port 12312

Execute a single GraphQL query and print result to stdout:

kamu system api-server gql-query '{ apiVersion }'

Print out GraphQL API schema:

kamu system api-server gql-schema

kamu system api-server gql-query

Executes the GraphQL query and prints out the result

Usage: kamu system api-server gql-query [OPTIONS] <QUERY>

Arguments:

  • <QUERY> — GQL query

Options:

  • --full — Display the full result including extensions

kamu system api-server gql-schema

Prints the GraphQL schema

Usage: kamu system api-server gql-schema

kamu system compact

Compact a dataset

Usage: kamu system compact [OPTIONS] [DATASET]...

Arguments:

  • <DATASET> — Local dataset references

Options:

  • --max-slice-size <SIZE> — Maximum size of a single data slice file in bytes

    Default value: 300000000

  • --max-slice-records <RECORDS> — Maximum amount of records in a single data slice file

    Default value: 10000

  • --hard — Perform ‘hard’ compaction that rewrites the history of a dataset

  • --keep-metadata-only — Perform compaction without saving data blocks

  • --verify — Perform verification of the dataset before running a compaction

For datasets that receive frequent small appends, the number of data slices can grow over time and affect query performance. This command allows you to merge multiple small data slices into a few large files, which can be beneficial both in size, thanks to more compact encoding, and in query performance, as data engines will have to scan through far fewer file headers.

There are two types of compactions: soft and hard.

Soft compactions produce new files while leaving the old blocks intact. This allows for faster queries while still preserving the accurate history of how the dataset evolved over time.
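
A sketch of a soft compaction, assuming soft is the default mode when --hard is not specified, with slice limits adjusted via the documented flags:

kamu system compact --max-slice-records 1000000 my.dataset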

Hard compactions rewrite the history of the dataset as if the data was originally written in big batches. They let you shrink the history of a dataset to just a few blocks and reclaim the space used by old data files, but at the expense of losing history. Hard compactions rewrite the metadata chain, changing block hashes, and will therefore break all downstream datasets that depend on them.

Examples:

Perform a history-altering hard compaction:

kamu system compact --hard my.dataset

kamu system debug-token

Validate a Kamu token

Usage: kamu system debug-token <TOKEN>

Arguments:

  • <TOKEN> — Access token

kamu system diagnose

Run basic system diagnose check

Usage: kamu system diagnose

kamu system generate-token

Generate a platform token from a known secret for debugging

Usage: kamu system generate-token [OPTIONS]

Options:

  • --subject <SUBJECT> — Account ID to generate token for

  • --login <LOGIN> — Account name to derive ID from (for predefined accounts only)

  • --expiration-time-sec <EXPIRATION_TIME_SEC> — Token expiration time in seconds

    Default value: 3600

kamu system gc

Runs garbage collection to clean up cached and unreachable objects in the workspace

Usage: kamu system gc

kamu system info

Summary of the system information

Usage: kamu system info [OPTIONS]

Options:

  • -o, --output-format <FMT> — Format of the output

    Possible values: shell, json, yaml

kamu system ipfs

IPFS helpers

Usage: kamu system ipfs <COMMAND>

Subcommands:

  • add — Adds the specified dataset to IPFS and returns the CID

kamu system ipfs add

Adds the specified dataset to IPFS and returns the CID

Usage: kamu system ipfs add <DATASET>

Arguments:

  • <DATASET> — Dataset reference

kamu system upgrade-workspace

Upgrade the layout of a local workspace to the latest version

Usage: kamu system upgrade-workspace

kamu tail

Displays a sample of most recent records in a dataset

Usage: kamu tail [OPTIONS] <DATASET>

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -o, --output-format <FMT> — Format to display the results in

    Possible values:

    • csv: Comma-separated values
    • json: Array of Structures format
    • ndjson: One JSON object per line - an easily splittable format
    • json-soa: Structure of arrays - a more compact and efficient format for encoding an entire dataframe
    • json-aoa: Array of arrays - compact and efficient, and preserves column order
    • table: A pretty human-readable table
  • -n, --num-records <NUM> — Number of records to display

    Default value: 10

  • -s, --skip-records <SKP> — Number of initial records to skip before applying the limit

    Default value: 0

This command can be thought of as a shortcut for:

kamu sql --engine datafusion --command 'select * from "{dataset}" order by {offset_col} desc limit {num_records}'
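
For example, to fetch the last 100 records in CSV form (the dataset name is a placeholder):

kamu tail my.dataset -n 100 -o csv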

kamu ui

Opens web interface

Usage: kamu ui [OPTIONS]

Options:

  • --address <ADDRESS> — Expose HTTP server on specific network interface
  • --http-port <HTTP_PORT> — Which port to run HTTP server on
  • --get-token — Output a JWT token you can use to authorize API queries

Starts a built-in HTTP + GraphQL server and opens a pre-packaged Web UI application in your browser.

Examples:

Start the server and open the UI in your default browser:

kamu ui

Start server on a specific port:

kamu ui --http-port 12345

kamu verify

Verifies the validity of a dataset

Usage: kamu verify [OPTIONS] [DATASET]...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • -r, --recursive — Verify the entire transformation chain starting with root datasets
  • --integrity — Check only the hashes of metadata and data without replaying transformations

Validity of derivative data is determined by:

  • Trustworthiness of the source data that went into it
  • Soundness of the derivative transformation chain that shaped it
  • Guaranteeing that derivative data was in fact produced by declared transformations

For the first two, you can inspect the dataset lineage to see which root datasets the data is coming from and whether their publishers are credible. Then you can audit all derivative transformations to ensure they are sound and non-malicious.

This command can help you with the last stage. It uses the history of transformations stored in metadata to first compare the hashes of data with ones stored in metadata (i.e. verify that data corresponds to metadata). Then it repeats all declared transformations locally to ensure that what’s declared in metadata actually produces the presented result.

The combination of the above steps can give you a high certainty that the data you’re using is trustworthy.

When called on a root dataset the command will only perform the integrity check of comparing data hashes to metadata.

Examples:

Verify the data in a dataset starting from its immediate inputs:

kamu verify com.example.deriv

Verify the data in datasets matching pattern:

kamu verify com.example.%

Verify the entire transformation chain starting with root datasets (may download a lot of data):

kamu verify --recursive com.example.deriv

Verify only the hashes of metadata and data, without replaying the transformations. This is useful when you trust the peers performing the transformations but want to ensure data was not tampered with in storage or during transmission:

kamu verify --integrity com.example.deriv

kamu version

Outputs build information

Usage: kamu version [OPTIONS]

Options:

  • -o, --output-format <FMT> — Format of the output

    Possible values: shell, json, yaml