kamu

Usage: kamu [OPTIONS] <COMMAND>

Subcommands:

  • add — Add a new dataset or modify an existing one
  • completions — Generate tab-completion scripts for your shell
  • config — Get or set configuration options
  • delete — Delete a dataset
  • ingest — Adds data to the root dataset according to its push source configuration
  • init — Initialize an empty workspace in the current directory
  • inspect — Group of commands for exploring dataset metadata
  • list — List all datasets in the workspace
  • log — Shows dataset metadata history
  • login — Logs in to a remote Kamu server
  • logout — Logs out from a remote Kamu server
  • new — Creates a new dataset manifest from a template
  • notebook — Starts the notebook server for exploring the data in the workspace
  • pull — Pull new data into the datasets
  • push — Push local data into a repository
  • rename — Rename a dataset
  • reset — Revert the dataset back to the specified state
  • repo — Manage set of tracked repositories
  • search — Searches for datasets in the registered repositories
  • sql — Executes an SQL query or drops you into an SQL shell
  • system — Command group for system-level functionality
  • tail — Displays a sample of most recent records in a dataset
  • ui — Opens web interface
  • verify — Verifies the validity of a dataset
  • version — Outputs build information

Options:

  • -v — Sets the level of verbosity (repeat for more)
  • -q, --quiet — Suppress all non-essential output
  • --trace — Record and visualize the command execution as a perfetto.dev trace
  • -a, --account <ACCOUNT>

To get help for individual commands use: kamu <command> -h

kamu add

Add a new dataset or modify an existing one

Usage: kamu add [OPTIONS] [manifest]...

Arguments:

  • <MANIFEST> — Dataset manifest reference(s) (path or URL)

Options:

  • -r, --recursive — Recursively search for all manifests in the specified directory
  • --replace — Delete and re-add datasets that already exist
  • --stdin — Read manifests from standard input

This command creates a new dataset from the provided DatasetSnapshot manifest.

Note that once kamu creates a dataset, changes to the source manifest file will not have any effect unless you run the add command again. When experimenting with adding a new dataset you may currently need to delete and re-add it multiple times until you get the parameters and schema right (see the last example below).

In future versions the add command will allow you to modify the structure of already existing datasets (e.g. changing schema in a compatible way).

Examples:

Add a root/derivative dataset from local manifest:

kamu add org.example.data.yaml

Add datasets from all manifests found in the current directory:

kamu add --recursive .

Add a dataset from a manifest hosted externally (e.g. on GitHub):

kamu add https://raw.githubusercontent.com/kamu-data/kamu-contrib/master/ca.bankofcanada/ca.bankofcanada.exchange-rates.daily.yaml
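
As noted above, changes to a manifest only take effect when you re-add the dataset. A typical experimentation loop might therefore look like this (the manifest name is illustrative; --replace deletes and re-adds the dataset in one step):

kamu add org.example.data.yaml
kamu add --replace org.example.data.yaml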

To add a dataset from a repository, see the kamu pull command.

kamu completions

Generate tab-completion scripts for your shell

Usage: kamu completions <shell>

Arguments:

  • <SHELL>

    Possible values: bash, elvish, fish, powershell, zsh

The command outputs to STDOUT, allowing you to redirect the output to a file of your choosing. Where you place the file will depend on which shell and which operating system you are using. Your particular configuration may also determine where these scripts need to be placed.

Here are some common setups:

Bash:

Append the following to your ~/.bashrc:

source <(kamu completions bash)

You will need to reload your shell session (or execute the same command in your current one) for changes to take effect.

Zsh:

Append the following to your ~/.zshrc:

autoload -U +X bashcompinit && bashcompinit
source <(kamu completions bash)
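
Fish:

A possible setup (assuming the standard fish completions directory) is to write the script to a completions file:

kamu completions fish > ~/.config/fish/completions/kamu.fish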

Please contribute a guide for your favorite shell!

kamu config

Get or set configuration options

Usage: kamu config <COMMAND>

Subcommands:

  • list — Display current configuration combined from all config files
  • get — Get current configuration value
  • set — Set or unset configuration value

Configuration in kamu is managed very similarly to git. Starting with your current workspace and going up the directory tree you can have multiple .kamuconfig YAML files which are all merged together to get the resulting config.

Most commonly you will have a workspace-scoped config inside the .kamu directory and the user-scoped config residing in your home directory.

Examples:

List current configuration as combined view of config files:

kamu config list

Get current configuration value:

kamu config get engine.runtime

Set configuration value in workspace scope:

kamu config set engine.runtime podman

Set configuration value in user scope:

kamu config set --user engine.runtime podman

Unset or revert to default value:

kamu config set --user engine.runtime

kamu config list

Display current configuration combined from all config files

Usage: kamu config list [OPTIONS]

Options:

  • --user — Show only user scope configuration
  • --with-defaults — Show configuration with all default values applied

kamu config get

Get current configuration value

Usage: kamu config get [OPTIONS] <cfgkey>

Arguments:

  • <CFGKEY> — Path to the config option

Options:

  • --user — Operate on the user scope configuration file
  • --with-defaults — Get default value if config option is not explicitly set

kamu config set

Set or unset configuration value

Usage: kamu config set [OPTIONS] <cfgkey> [value]

Arguments:

  • <CFGKEY> — Path to the config option
  • <VALUE> — New value to set

Options:

  • --user — Operate on the user scope configuration file

kamu delete

Delete a dataset

Usage: kamu delete [OPTIONS] <dataset>...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • -a, --all — Delete all datasets in the workspace
  • -r, --recursive — Also delete all transitive dependencies of specified datasets
  • -y, --yes — Don’t ask for confirmation

This command deletes the dataset from your workspace, including both metadata and the raw data.

Take great care when deleting root datasets. If you have not pushed your local changes to a repository, the data will be lost.

Deleting a derivative dataset is usually not a big deal, since they can always be reconstructed, but it will disrupt downstream consumers.

Examples:

Delete a local dataset:

kamu delete my.dataset

Delete local datasets matching pattern:

kamu delete my.dataset.%

kamu ingest

Adds data to the root dataset according to its push source configuration

Usage: kamu ingest [OPTIONS] <dataset> [FILE]...

Arguments:

  • <DATASET> — Local dataset reference
  • <FILE> — Data file(s) to ingest

Options:

  • --source-name <SRC> — Name of the push source to use for ingestion

  • --stdin — Read data from the standard input

  • -r, --recursive — Recursively propagate the updates into all downstream datasets

  • --input-format <FMT> — Overrides the media type of the data expected by the push source

    Possible values: csv, json, ndjson, geojson, ndgeojson, parquet, esrishapefile

Examples:

Ingest data from files:

kamu ingest org.example.data path/to/data.csv

Ingest data from standard input (assumes source is defined to use NDJSON):

printf '{"key": "value1"}\n{"key": "value2"}\n' | kamu ingest org.example.data --stdin

Ingest data with format conversion:

echo '[{"key": "value1"}, {"key": "value2"}]' | kamu ingest org.example.data --stdin --input-format json

kamu init

Initialize an empty workspace in the current directory

Usage: kamu init [OPTIONS]

Options:

  • --exists-ok — Don’t return an error if workspace already exists
  • --pull-images — Only pull container images and exit
  • --list-only — List image names instead of pulling
  • --multi-tenant — Initialize a workspace for multiple tenants

A workspace is where kamu stores all the important information about datasets (metadata) and in some cases raw data.

It is recommended to create one kamu workspace per data science project, grouping all related datasets together.

Initializing a workspace creates a .kamu directory that contains dataset metadata, data, and all supporting files (configs, known repositories, etc.).
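
For example, to start a fresh project (the directory name is illustrative):

mkdir my-data-project
cd my-data-project
kamu init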

kamu inspect

Group of commands for exploring dataset metadata

Usage: kamu inspect <COMMAND>

Subcommands:

  • lineage — Shows the dependency tree of a dataset
  • query — Shows the transformations used by a derivative dataset
  • schema — Shows the dataset schema

kamu inspect lineage

Shows the dependency tree of a dataset

Usage: kamu inspect lineage [OPTIONS] <dataset>...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • -o, --output-format <FMT> — Format of the output

    Possible values: shell, dot, csv, html

  • -b, --browse — Produce HTML and open it in a browser

Presents the dataset-level lineage that includes current and past dependencies.

Examples:

Show lineage of a single dataset:

kamu inspect lineage my.dataset

Show lineage graph of all datasets in a browser:

kamu inspect lineage --browse

Render the lineage graph into a png image (needs graphviz installed):

kamu inspect lineage -o dot | dot -Tpng > depgraph.png

kamu inspect query

Shows the transformations used by a derivative dataset

Usage: kamu inspect query <dataset>

Arguments:

  • <DATASET> — Local dataset reference

This command allows you to audit the transformations performed by a derivative dataset and their evolution. Such an audit is an important step in validating the trustworthiness of data (see the kamu verify command).
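
For example, to review the transformations of a hypothetical derivative dataset:

kamu inspect query org.example.derivative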

kamu inspect schema

Shows the dataset schema

Usage: kamu inspect schema [OPTIONS] <dataset>

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -o, --output-format <FMT> — Format of the output

    Possible values: ddl, parquet, parquet-json, arrow-json

  • --from-data-file

Displays the schema of the dataset. Note that dataset schemas can evolve over time and by default the latest schema will be shown.

Examples:

Show logical schema of a dataset in the DDL format:

kamu inspect schema my.dataset

Show physical schema of the underlying Parquet files:

kamu inspect schema my.dataset -o parquet

kamu list

List all datasets in the workspace

Usage: kamu list [OPTIONS]

Options:

  • -w, --wide — Show more details (repeat for more)

  • --target-account <TARGET-ACCOUNT>

  • --all-accounts

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

Examples:

To see a human-friendly list of datasets in your workspace:

kamu list

To see more details:

kamu list -w

To get a machine-readable list of datasets:

kamu list -o csv

kamu log

Shows dataset metadata history

Usage: kamu log [OPTIONS] <dataset>

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -o, --output-format <FMT>

    Possible values: yaml

  • -f, --filter <FLT>

  • --limit <LIMIT> — Maximum number of blocks to display

    Default value: 500

The metadata of a dataset contains a historical record of everything that ever influenced how the data currently looks.

This includes events such as:

  • Data ingestion / transformation
  • Change of query
  • Change of schema
  • Change of source URL or other ingestion steps in a root dataset

Use this command to explore how a dataset evolved over time.

Examples:

Show brief summaries of individual metadata blocks:

kamu log org.example.data

Show detailed content of all blocks:

kamu log -o yaml org.example.data

Using a filter to inspect blocks containing query changes of a derivative dataset:

kamu log -o yaml --filter source org.example.data

kamu login

Logs in to a remote Kamu server

Usage: kamu login [OPTIONS] [server]

Arguments:

  • <SERVER> — ODF server URL (defaults to kamu.dev)

Options:

  • --user — Store access token in the user home folder rather than in the workspace
  • --check — Check whether existing authorization is still valid without triggering a login flow
  • --access-token <ACCESS-TOKEN> — Provide an existing access token
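
Examples:

Log in to the default server interactively:

kamu login

Log in to a self-hosted node (the URL is illustrative):

kamu login https://odf.example.org

Check whether your current session is still valid:

kamu login --check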

kamu logout

Logs out from a remote Kamu server

Usage: kamu logout [OPTIONS] [server]

Arguments:

  • <SERVER> — ODF server URL (defaults to kamu.dev)

Options:

  • --user — Drop access token stored in the user home folder rather than in the workspace
  • -a, --all — Log out of all logged in servers

kamu new

Creates a new dataset manifest from a template

Usage: kamu new [OPTIONS] <name>

Arguments:

  • <NAME> — Name of the new dataset

Options:

  • --root — Create a root dataset
  • --derivative — Create a derivative dataset

This command will create a dataset manifest from a template, allowing you to customize the most relevant parts without having to remember the exact structure of the YAML file.

Examples:

Create org.example.data.yaml file from template in the current directory:

kamu new org.example.data --root

kamu notebook

Starts the notebook server for exploring the data in the workspace

Usage: kamu notebook [OPTIONS]

Options:

  • --address <ADDRESS> — Expose HTTP server on specific network interface
  • --http-port <HTTP-PORT> — Expose HTTP server on specific port
  • -e, --env <VAR> — Propagate or set an environment variable in the notebook (e.g. -e VAR or -e VAR=foo)

This command will run the Jupyter server and the Spark engine connected together, letting you query data with SQL before pulling it into the notebook for final processing and visualization.
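
For example, to start the notebook server on a specific port while propagating an environment variable into the notebook environment (both values are illustrative):

kamu notebook --http-port 8888 -e AWS_PROFILE=dev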

For more information check out notebook examples at https://github.com/kamu-data/kamu-cli

kamu pull

Pull new data into the datasets

Usage: kamu pull [OPTIONS] [dataset]...

Arguments:

  • <DATASET> — Local or remote dataset reference(s)

Options:

  • -a, --all — Pull all datasets in the workspace
  • -r, --recursive — Also pull all transitive dependencies of specified datasets
  • --fetch-uncacheable — Pull latest data from the uncacheable data sources
  • --as <NAME> — Local name of a dataset to use when syncing from a repository
  • --no-alias — Don’t automatically add a remote push alias for this destination
  • --set-watermark <TIME> — Injects a manual watermark into the dataset to signify that no data is expected to arrive with event time that precedes it
  • -f, --force — Overwrite local version with remote, even if revisions have diverged

Pull is a multi-functional command that lets you update a local dataset. Depending on the parameters and the types of datasets involved it can be used to:

  • Run polling ingest to pull data into a root dataset from an external source
  • Run transformations on a derivative dataset to process previously unseen data
  • Pull dataset from a remote repository into your workspace
  • Update watermark on a dataset

Examples:

Fetch latest data in a specific dataset:

kamu pull org.example.data

Fetch latest data in datasets matching pattern:

kamu pull org.example.%

Fetch latest data for the entire dependency tree of a dataset:

kamu pull --recursive org.example.derivative

Refresh data of all datasets in the workspace:

kamu pull --all

Fetch dataset from a registered repository:

kamu pull kamu/org.example.data

Fetch dataset from a URL (see kamu repo add -h for supported sources):

kamu pull ipfs://bafy...a0dx/data
kamu pull s3://my-bucket.example.org/odf/org.example.data
kamu pull s3+https://example.org:5000/data --as org.example.data

Advance the watermark of a dataset:

kamu pull --set-watermark 2020-01-01 org.example.data

kamu push

Push local data into a repository

Usage: kamu push [OPTIONS] [dataset]...

Arguments:

  • <DATASET> — Local or remote dataset reference(s)

Options:

  • -a, --all — Push all datasets in the workspace
  • -r, --recursive — Also push all transitive dependencies of specified datasets
  • --no-alias — Don’t automatically add a remote push alias for this destination
  • --to <REM> — Remote alias or a URL to push to
  • -f, --force — Overwrite remote version with local, even if revisions have diverged

Use this command to share your new dataset or new data with others. All changes performed by this command are atomic and non-destructive. This command will analyze the state of the dataset at the repository and will only upload data and metadata that wasn’t previously seen.

Similarly to git, if someone else modified the dataset concurrently with you, your push will be rejected and you will have to resolve the conflict.

Examples:

Sync dataset to a destination URL (see kamu repo add -h for supported protocols):

kamu push org.example.data --to s3://my-bucket.example.org/odf/org.example.data

Sync dataset to a named repository (see kamu repo command group):

kamu push org.example.data --to kamu-hub/org.example.data

Sync dataset that already has a push alias:

kamu push org.example.data

Sync datasets matching pattern that already have push aliases:

kamu push org.example.%

Add dataset to local IPFS node and update IPNS entry to the new CID:

kamu push org.example.data --to ipns://k5..zy

kamu rename

Rename a dataset

Usage: kamu rename <dataset> <name>

Arguments:

  • <DATASET> — Dataset reference
  • <NAME> — The new name to give it

Use this command to rename a dataset in your local workspace. Renaming is safe in terms of downstream derivative datasets as they use stable dataset IDs to define their inputs.

Examples:

Renaming is often useful when you pull a remote dataset by URL and it gets auto-assigned a name that is not the most convenient:

kamu pull ipfs://bafy...a0da
kamu rename bafy...a0da my.dataset

kamu reset

Revert the dataset back to the specified state

Usage: kamu reset [OPTIONS] <dataset> <hash>

Arguments:

  • <DATASET> — ID of the dataset
  • <HASH> — Hash of the block to reset to

Options:

  • -y, --yes — Don’t ask for confirmation

Resetting a dataset to the specified block erases all metadata blocks that followed it and deletes all data added since that point. This can sometimes be useful to resolve conflicts, but otherwise should be used with care.

Keep in mind that blocks that were pushed to a repository could have already been observed by other people, so resetting the history will not let you take that data back and will instead create conflicts for the downstream consumers of your data.
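
Examples:

Find the hash of the block you want to return to using kamu log, then reset to it (the dataset name is illustrative and <hash> stands for a block hash copied from the log output):

kamu log my.dataset
kamu reset my.dataset <hash>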

kamu repo

Manage set of tracked repositories

Usage: kamu repo <COMMAND>

Subcommands:

  • add — Adds a repository
  • delete — Deletes a reference to repository
  • list — Lists known repositories
  • alias — Manage set of remote aliases associated with datasets

Repositories are nodes on the network that let users exchange datasets. In the most basic form, a repository can simply be a location where the dataset files are hosted over one of the supported file or object-based data transfer protocols. The owner of a dataset will have push privileges to this location, while other participants can pull data from it.

Examples:

Show available repositories:

kamu repo list

Add S3 bucket as a repository:

kamu repo add example-repo s3://bucket.my-company.example/

kamu repo add

Adds a repository

Usage: kamu repo add <name> <url>

Arguments:

  • <NAME> — Local alias of the repository
  • <URL> — URL of the repository

For local file system repositories use the following URL formats:

file:///home/me/example/repository/
file:///c:/Users/me/example/repository/

For S3-compatible basic repositories use:

s3://bucket.my-company.example/
s3+http://my-minio-server:9000/bucket/
s3+https://my-minio-server:9000/bucket/

For ODF-compatible smart repositories use:

odf+http://odf-server/
odf+https://odf-server/

kamu repo delete

Deletes a reference to repository

Usage: kamu repo delete [OPTIONS] [repository]...

Arguments:

  • <REPOSITORY> — Repository name(s)

Options:

  • -a, --all — Delete all known repositories
  • -y, --yes — Don’t ask for confirmation

kamu repo list

Lists known repositories

Usage: kamu repo list [OPTIONS]

Options:

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

kamu repo alias

Manage set of remote aliases associated with datasets

Usage: kamu repo alias <COMMAND>

Subcommands:

  • list — Lists remote aliases
  • add — Adds a remote alias to a dataset
  • delete — Deletes a remote alias associated with a dataset

When you pull and push datasets from repositories kamu uses aliases to let you avoid specifying the full remote reference each time. Aliases are usually created the first time you do a push or pull and saved for later. If you have an unusual setup (e.g. pushing to multiple repositories) you can use this command to manage the aliases.

Examples:

List all aliases:

kamu repo alias list

List all aliases of a specific dataset:

kamu repo alias list org.example.data

Add a new pull alias:

kamu repo alias add --pull org.example.data kamu.dev/me/org.example.data

kamu repo alias list

Lists remote aliases

Usage: kamu repo alias list [OPTIONS] [dataset]

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

kamu repo alias add

Adds a remote alias to a dataset

Usage: kamu repo alias add [OPTIONS] <dataset> <alias>

Arguments:

  • <DATASET> — Local dataset reference
  • <ALIAS> — Remote dataset name

Options:

  • --push — Add a push alias
  • --pull — Add a pull alias

kamu repo alias delete

Deletes a remote alias associated with a dataset

Usage: kamu repo alias delete [OPTIONS] <dataset> [alias]

Arguments:

  • <DATASET> — Local dataset reference
  • <ALIAS> — Remote dataset name

Options:

  • -a, --all — Delete all aliases
  • --push — Delete the push alias
  • --pull — Delete the pull alias

kamu search

Searches for datasets in the registered repositories

Usage: kamu search [OPTIONS] [QUERY]

Arguments:

  • <QUERY> — Search terms

Options:

  • --repo <REPO> — Repository name(s) to search in

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

Search is delegated to the repository implementations and its capabilities depend on the type of the repo. Whereas smart repos may support advanced full-text search, simple storage-only repos may be limited to a substring search by dataset name.

Examples:

Search all repositories:

kamu search covid19

Search only specific repositories:

kamu search covid19 --repo kamu --repo statcan.gc.ca

kamu sql

Executes an SQL query or drops you into an SQL shell

Usage: kamu sql [OPTIONS] [COMMAND]

Subcommands:

  • server — Run JDBC server only

Options:

  • --url <URL> — URL of a running JDBC server (e.g. jdbc:hive2://example.com:10000)

  • -c, --command <CMD> — SQL command to run

  • --script <FILE> — SQL script file to execute

  • --engine <ENG> — Engine type to use for this SQL session

    Possible values: spark, datafusion

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

The SQL shell allows you to explore the data of all datasets in your workspace using one of the supported data processing engines. This can be a great way to prepare and test a query that you can later turn into a derivative dataset.

Examples:

Drop into SQL shell:

kamu sql

Execute SQL command and return its output in CSV format:

kamu sql -c 'SELECT * FROM `org.example.data` LIMIT 10' -o csv

Run SQL server to use with external data processing tools:

kamu sql server --address 0.0.0.0 --port 8080

Connect to a remote SQL server:

kamu sql --url jdbc:hive2://example.com:10000

Note: Currently, when connecting to a remote kamu SQL server you will need to manually instruct it to load datasets from the data files. This can be done using the following command:

CREATE TEMP VIEW `my.dataset` AS (SELECT * FROM parquet.`kamu_data/my.dataset`);

kamu sql server

Run JDBC server only

Usage: kamu sql server [OPTIONS]

Options:

  • --address <ADDRESS> — Expose JDBC server on specific network interface

    Default value: 127.0.0.1

  • --port <PORT> — Expose JDBC server on specific port

    Default value: 10000

  • --livy — Run Livy server instead of Spark JDBC

  • --flight-sql — Run Flight SQL server instead of Spark JDBC

kamu system

Command group for system-level functionality

Usage: kamu system <COMMAND>

Subcommands:

  • gc — Runs garbage collection to clean up cached and unreachable objects in the workspace
  • upgrade-workspace — Upgrade the layout of a local workspace to the latest version
  • api-server — Run HTTP + GraphQL server
  • info — Summary of the system information
  • diagnose — Run basic system diagnose check
  • ipfs — IPFS helpers
  • check-token — Validate a Kamu token
  • generate-token — Generate a platform token from a known secret for debugging
  • compact — Compact a dataset

kamu system gc

Runs garbage collection to clean up cached and unreachable objects in the workspace

Usage: kamu system gc

kamu system upgrade-workspace

Upgrade the layout of a local workspace to the latest version

Usage: kamu system upgrade-workspace

kamu system api-server

Run HTTP + GraphQL server

Usage: kamu system api-server [OPTIONS] [COMMAND]

Subcommands:

  • gql-query — Executes the GraphQL query and prints out the result
  • gql-schema — Prints the GraphQL schema

Options:

  • --address <ADDRESS> — Bind to a specific network interface
  • --http-port <HTTP-PORT> — Expose HTTP+GraphQL server on specific port
  • --get-token — Output a JWT token you can use to authorize API queries

Examples:

Run API server on a specified port:

kamu system api-server --http-port 12312

Execute a single GraphQL query and print result to stdout:

kamu system api-server gql-query '{ apiVersion }'

Print out GraphQL API schema:

kamu system api-server gql-schema

kamu system api-server gql-query

Executes the GraphQL query and prints out the result

Usage: kamu system api-server gql-query [OPTIONS] <query>

Arguments:

  • <QUERY>

Options:

  • --full — Display the full result including extensions

kamu system api-server gql-schema

Prints the GraphQL schema

Usage: kamu system api-server gql-schema

kamu system info

Summary of the system information

Usage: kamu system info [OPTIONS]

Options:

  • -o, --output-format <FMT>

    Possible values: shell, json, yaml

kamu system diagnose

Run basic system diagnose check

Usage: kamu system diagnose

kamu system ipfs

IPFS helpers

Usage: kamu system ipfs <COMMAND>

Subcommands:

  • add — Adds the specified dataset to IPFS and returns the CID

kamu system ipfs add

Adds the specified dataset to IPFS and returns the CID

Usage: kamu system ipfs add <dataset>

Arguments:

  • <DATASET> — Dataset reference

kamu system check-token

Validate a Kamu token

Usage: kamu system check-token <token>

Arguments:

  • <TOKEN> — Kamu token

kamu system generate-token

Generate a platform token from a known secret for debugging

Usage: kamu system generate-token [OPTIONS] --login <login>

Options:

  • --login <LOGIN> — Account name

  • --gh-access-token <GH-ACCESS-TOKEN> — An existing GitHub access token

  • --expiration-time-sec <EXPIRATION-TIME-SEC> — Token expiration time in seconds

    Default value: 3600
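
For example, to generate a short-lived token for a hypothetical account:

kamu system generate-token --login alice --expiration-time-sec 600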

kamu system compact

Compact a dataset

Usage: kamu system compact [OPTIONS] <dataset>...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • --max-slice-size <SIZE> — Maximum size of a single data slice file in bytes

    Default value: 1073741824

  • --max-slice-records <RECORDS> — Maximum amount of records in a single data slice file

    Default value: 10000

  • --hard — Perform ‘hard’ compaction that rewrites the history of a dataset

  • --verify — Perform verification of the dataset before running a compaction

For datasets that receive frequent small appends, the number of data slices can grow over time and affect query performance. This command allows you to merge many small data slices into a few large files, which can be beneficial both in size, thanks to more compact encoding, and in query performance, as data engines have to scan through far fewer file headers.

There are two types of compaction: soft and hard.

Soft compactions produce new files while leaving the old blocks intact. This allows for faster queries while still preserving the accurate history of how the dataset evolved over time.

Hard compactions rewrite the history of the dataset as if the data had originally been written in big batches. They let you shrink the history of a dataset to just a few blocks and reclaim the space used by old data files, but at the expense of history loss. Hard compactions rewrite the metadata chain, changing block hashes, and will therefore break all downstream datasets that depend on them.

Examples:

Perform a history-altering hard compaction:

kamu system compact --hard my.dataset
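
Perform a soft compaction that keeps the metadata history intact (assuming soft compaction is the default when --hard is not passed):

kamu system compact my.dataset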

kamu tail

Displays a sample of most recent records in a dataset

Usage: kamu tail [OPTIONS] <dataset>

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -n, --num-records <NUM> — Number of records to display

    Default value: 10

  • -s, --skip-records <NUM> — Number of initial records to skip before applying the limit

    Default value: 0

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

This command can be thought of as a shortcut for:

kamu sql --engine datafusion --command 'select * from "{dataset}" order by {offset_col} desc limit {num_records}'
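
For example, to show the 25 most recent records in CSV form (the dataset name is illustrative):

kamu tail my.dataset -n 25 -o csv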

kamu ui

Opens web interface

Usage: kamu ui [OPTIONS]

Options:

  • --address <ADDRESS> — Expose HTTP server on specific network interface
  • --http-port <HTTP-PORT> — Which port to run HTTP server on
  • --get-token — Output a JWT token you can use to authorize API queries

Starts a built-in HTTP + GraphQL server and opens a pre-packaged Web UI application in your browser.

Examples:

Starts server and opens UI in your default browser:

kamu ui

Start server on a specific port:

kamu ui --http-port 12345

kamu verify

Verifies the validity of a dataset

Usage: kamu verify [OPTIONS] <dataset>...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • -r, --recursive — Verify the entire transformation chain starting with root datasets
  • --integrity — Check only the hashes of metadata and data without replaying transformations

Validity of derivative data is determined by:

  • Trustworthiness of the source data that went into it
  • Soundness of the derivative transformation chain that shaped it
  • Guaranteeing that derivative data was in fact produced by declared transformations

For the first two, you can inspect the dataset lineage to see which root datasets the data is coming from and whether their publishers are credible. Then you can audit all derivative transformations to ensure they are sound and non-malicious.

This command can help you with the last stage. It uses the history of transformations stored in metadata to first compare the hashes of the data with the ones stored in metadata (i.e. verify that the data corresponds to the metadata). It then repeats all declared transformations locally to ensure that what’s declared in metadata actually produces the presented result.

The combination of the above steps can give you a high certainty that the data you’re using is trustworthy.

When called on a root dataset the command will only perform the integrity check of comparing data hashes to metadata.

Examples:

Verify the data in a dataset starting from its immediate inputs:

kamu verify com.example.deriv

Verify the data in datasets matching pattern:

kamu verify com.example.%

Verify the entire transformation chain starting with root datasets (may download a lot of data):

kamu verify --recursive com.example.deriv

Verify only the hashes of metadata and data, without replaying the transformations. This is useful when you trust the peers performing the transformations but want to ensure the data was not tampered with in storage or during transmission:

kamu verify --integrity com.example.deriv

kamu version

Outputs build information

Usage: kamu version [OPTIONS]

Options:

  • -o, --output-format <FMT>

    Possible values: shell, json, yaml