kamu

Usage: kamu [OPTIONS] <COMMAND>

Subcommands:

  • add — Add a new dataset or modify an existing one
  • completions — Generate tab-completion scripts for your shell
  • config — Get or set configuration options
  • delete — Delete a dataset
  • ingest — Adds data to the root dataset according to its push source configuration
  • init — Initialize an empty workspace in the current directory
  • inspect — Group of commands for exploring dataset metadata
  • list — List all datasets in the workspace
  • log — Shows dataset metadata history
  • login — Logs in to a remote Kamu server
  • logout — Logs out from a remote Kamu server
  • new — Creates a new dataset manifest from a template
  • notebook — Starts the notebook server for exploring the data in the workspace
  • pull — Pull new data into the datasets
  • push — Push local data into a repository
  • rename — Rename a dataset
  • reset — Revert the dataset back to the specified state
  • repo — Manage set of tracked repositories
  • search — Searches for datasets in the registered repositories
  • sql — Executes an SQL query or drops you into an SQL shell
  • system — Command group for system-level functionality
  • tail — Displays a sample of most recent records in a dataset
  • ui — Opens web interface
  • verify — Verifies the validity of a dataset
  • version — Outputs build information

Options:

  • -v — Sets the level of verbosity (repeat for more)
  • -q, --quiet — Suppress all non-essential output
  • --trace — Record and visualize the command execution as a perfetto.dev trace
  • -a, --account <ACCOUNT>

To get help for individual commands use: kamu <command> -h

kamu add

Add a new dataset or modify an existing one

Usage: kamu add [OPTIONS] [manifest]...

Arguments:

  • <MANIFEST> — Dataset manifest reference(s) (path or URL)

Options:

  • -r, --recursive — Recursively search for all manifests in the specified directory
  • --replace — Delete and re-add datasets that already exist
  • --stdin — Read manifests from standard input

This command creates a new dataset from the provided DatasetSnapshot manifest.

Note that once kamu creates a dataset, changes to the source manifest file will not have any effect unless you run the add command again. When experimenting with adding a new dataset you may currently need to delete and re-add it multiple times until you get the parameters and schema right (see the last example below).

In future versions the add command will allow you to modify the structure of already existing datasets (e.g. changing schema in a compatible way).

Examples:

Add a root/derivative dataset from local manifest:

kamu add org.example.data.yaml

Add datasets from all manifests found in the current directory:

kamu add --recursive .

Add a dataset from a manifest hosted externally (e.g. on GitHub):

kamu add https://raw.githubusercontent.com/kamu-data/kamu-contrib/master/ca.bankofcanada/ca.bankofcanada.exchange-rates.daily.yaml
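
As noted above, changes to a manifest only take effect when you re-add the dataset. A typical experimentation loop might therefore look like this (the manifest name is illustrative; --replace deletes and re-adds the dataset in one step):

kamu add org.example.data.yaml
kamu add --replace org.example.data.yaml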

To add a dataset from a repository, see the kamu pull command.

kamu completions

Generate tab-completion scripts for your shell

Usage: kamu completions <shell>

Arguments:

  • <SHELL>

    Possible values: bash, elvish, fish, powershell, zsh

The command outputs to STDOUT, allowing you to redirect the output to a file of your choosing. Where you place the file will depend on which shell and which operating system you are using. Your particular configuration may also determine where these scripts need to be placed.

Here are some common setups:

Bash:

Append the following to your ~/.bashrc:

source <(kamu completions bash)

You will need to reload your shell session (or execute the same command in your current one) for changes to take effect.

Zsh:

Append the following to your ~/.zshrc:

autoload -U +X bashcompinit && bashcompinit
source <(kamu completions bash)
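
Fish:

A possible setup (assuming the standard fish completions directory) is to write the script to a completions file:

kamu completions fish > ~/.config/fish/completions/kamu.fish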

Please contribute a guide for your favorite shell!

kamu config

Get or set configuration options

Usage: kamu config <COMMAND>

Subcommands:

  • list — Display current configuration combined from all config files
  • get — Get current configuration value
  • set — Set or unset configuration value

Configuration in kamu is managed very similarly to git. Starting with your current workspace and going up the directory tree you can have multiple .kamuconfig YAML files which are all merged together to get the resulting config.

Most commonly you will have a workspace-scoped config inside the .kamu directory and the user-scoped config residing in your home directory.

Examples:

List current configuration as combined view of config files:

kamu config list

Get current configuration value:

kamu config get engine.runtime

Set configuration value in workspace scope:

kamu config set engine.runtime podman

Set configuration value in user scope:

kamu config set --user engine.runtime podman

Unset or revert to default value:

kamu config set --user engine.runtime

kamu config list

Display current configuration combined from all config files

Usage: kamu config list [OPTIONS]

Options:

  • --user — Show only user scope configuration
  • --with-defaults — Show configuration with all default values applied

kamu config get

Get current configuration value

Usage: kamu config get [OPTIONS] <cfgkey>

Arguments:

  • <CFGKEY> — Path to the config option

Options:

  • --user — Operate on the user scope configuration file
  • --with-defaults — Get default value if config option is not explicitly set

kamu config set

Set or unset configuration value

Usage: kamu config set [OPTIONS] <cfgkey> [value]

Arguments:

  • <CFGKEY> — Path to the config option
  • <VALUE> — New value to set

Options:

  • --user — Operate on the user scope configuration file

kamu delete

Delete a dataset

Usage: kamu delete [OPTIONS] <dataset>...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • -a, --all — Delete all datasets in the workspace
  • -r, --recursive — Also delete all transitive dependencies of specified datasets
  • -y, --yes — Don’t ask for confirmation

This command deletes the dataset from your workspace, including both metadata and the raw data.

Take great care when deleting root datasets. If you have not pushed your local changes to a repository, the data will be lost.

Deleting a derivative dataset is usually not a big deal, since they can always be reconstructed, but it will disrupt downstream consumers.

Examples:

Delete a local dataset:

kamu delete my.dataset

Delete local datasets matching pattern:

kamu delete my.dataset.%

kamu ingest

Adds data to the root dataset according to its push source configuration

Usage: kamu ingest [OPTIONS] <dataset> [FILE]...

Arguments:

  • <DATASET> — Local dataset reference
  • <FILE> — Data file(s) to ingest

Options:

  • --source-name <SRC> — Name of the push source to use for ingestion

  • --stdin — Read data from the standard input

  • -r, --recursive — Recursively propagate the updates into all downstream datasets

  • --input-format <FMT> — Overrides the media type of the data expected by the push source

    Possible values: csv, json, ndjson, geojson, ndgeojson, parquet, esrishapefile

Examples:

Ingest data from files:

kamu ingest org.example.data path/to/data.csv

Ingest data from standard input (assumes source is defined to use NDJSON):

printf '{"key": "value1"}\n{"key": "value2"}\n' | kamu ingest org.example.data --stdin

Ingest data with format conversion:

echo '[{"key": "value1"}, {"key": "value2"}]' | kamu ingest org.example.data --stdin --input-format json

kamu init

Initialize an empty workspace in the current directory

Usage: kamu init [OPTIONS]

Options:

  • --exists-ok — Don’t return an error if workspace already exists
  • --pull-images — Only pull container images and exit
  • --list-only — List image names instead of pulling
  • --multi-tenant — Initialize a workspace for multiple tenants

A workspace is where kamu stores all the important information about datasets (metadata) and in some cases raw data.

It is recommended to create one kamu workspace per data science project, grouping all related datasets together.

Initializing a workspace creates a .kamu directory that contains dataset metadata, data, and all supporting files (configs, known repositories, etc.).
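
For example, to start a fresh project (the directory name is illustrative):

mkdir my-data-project
cd my-data-project
kamu init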

kamu inspect

Group of commands for exploring dataset metadata

Usage: kamu inspect <COMMAND>

Subcommands:

  • lineage — Shows the dependency tree of a dataset
  • query — Shows the transformations used by a derivative dataset
  • schema — Shows the dataset schema

kamu inspect lineage

Shows the dependency tree of a dataset

Usage: kamu inspect lineage [OPTIONS] <dataset>...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • -o, --output-format <FMT> — Format of the output

    Possible values: shell, dot, csv, html

  • -b, --browse — Produce HTML and open it in a browser

Presents the dataset-level lineage that includes current and past dependencies.

Examples:

Show lineage of a single dataset:

kamu inspect lineage my.dataset

Show lineage graph of all datasets in a browser:

kamu inspect lineage --browse

Render the lineage graph into a png image (needs graphviz installed):

kamu inspect lineage -o dot | dot -Tpng > depgraph.png

kamu inspect query

Shows the transformations used by a derivative dataset

Usage: kamu inspect query <dataset>

Arguments:

  • <DATASET> — Local dataset reference

This command allows you to audit the transformations performed by a derivative dataset and their evolution. Such an audit is an important step in validating the trustworthiness of data (see the kamu verify command).
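
For example, to review the transformations of a hypothetical derivative dataset:

kamu inspect query org.example.derivative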

kamu inspect schema

Shows the dataset schema

Usage: kamu inspect schema [OPTIONS] <dataset>

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -o, --output-format <FMT> — Format of the output

    Possible values: ddl, parquet, parquet-json, arrow-json

  • --from-data-file

Displays the schema of the dataset. Note that dataset schemas can evolve over time and by default the latest schema will be shown.

Examples:

Show logical schema of a dataset in the DDL format:

kamu inspect schema my.dataset

Show physical schema of the underlying Parquet files:

kamu inspect schema my.dataset -o parquet

kamu list

List all datasets in the workspace

Usage: kamu list [OPTIONS]

Options:

  • -w, --wide — Show more details (repeat for more)

  • --target-account <TARGET-ACCOUNT>

  • --all-accounts

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

Examples:

To see a human-friendly list of datasets in your workspace:

kamu list

To see more details:

kamu list -w

To get a machine-readable list of datasets:

kamu list -o csv

kamu log

Shows dataset metadata history

Usage: kamu log [OPTIONS] <dataset>

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -o, --output-format <FMT>

    Possible values: yaml

  • -f, --filter <FLT>

  • --limit <LIMIT> — Maximum number of blocks to display

    Default value: 500

The metadata of a dataset contains a historical record of everything that ever influenced how the data currently looks.

This includes events such as:

  • Data ingestion / transformation
  • Change of query
  • Change of schema
  • Change of source URL or other ingestion steps in a root dataset

Use this command to explore how a dataset evolved over time.

Examples:

Show brief summaries of individual metadata blocks:

kamu log org.example.data

Show detailed content of all blocks:

kamu log -o yaml org.example.data

Using a filter to inspect blocks containing query changes of a derivative dataset:

kamu log -o yaml --filter source org.example.data

kamu login

Logs in to a remote Kamu server

Usage: kamu login [OPTIONS] [server]

Arguments:

  • <SERVER> — ODF server URL (defaults to kamu.dev)

Options:

  • --user — Store access token in the user home folder rather than in the workspace
  • --check — Check whether existing authorization is still valid without triggering a login flow
  • --access-token <ACCESS-TOKEN> — Provide an existing access token
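
Examples:

Log in to the default server interactively:

kamu login

Log in to a self-hosted node (the URL is illustrative):

kamu login https://odf.example.org

Check whether your current session is still valid:

kamu login --check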

kamu logout

Logs out from a remote Kamu server

Usage: kamu logout [OPTIONS] [server]

Arguments:

  • <SERVER> — ODF server URL (defaults to kamu.dev)

Options:

  • --user — Drop access token stored in the user home folder rather than in the workspace
  • -a, --all — Log out of all logged in servers

kamu new

Creates a new dataset manifest from a template

Usage: kamu new [OPTIONS] <name>

Arguments:

  • <NAME> — Name of the new dataset

Options:

  • --root — Create a root dataset
  • --derivative — Create a derivative dataset

This command will create a dataset manifest from a template, allowing you to customize the most relevant parts without having to remember the exact structure of the YAML file.

Examples:

Create org.example.data.yaml file from template in the current directory:

kamu new org.example.data --root

kamu notebook

Starts the notebook server for exploring the data in the workspace

Usage: kamu notebook [OPTIONS]

Options:

  • --address <ADDRESS> — Expose HTTP server on specific network interface
  • --http-port <HTTP-PORT> — Expose HTTP server on specific port
  • -e, --env <VAR> — Propagate or set an environment variable in the notebook (e.g. -e VAR or -e VAR=foo)

This command will run the Jupyter server and the Spark engine connected together, letting you query data with SQL before pulling it into the notebook for final processing and visualization.
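
For example, to start the notebook server on a specific port while propagating an environment variable into the notebook environment (both values are illustrative):

kamu notebook --http-port 8888 -e AWS_PROFILE=dev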

For more information check out notebook examples at https://github.com/kamu-data/kamu-cli

kamu pull

Pull new data into the datasets

Usage: kamu pull [OPTIONS] [dataset]...

Arguments:

  • <DATASET> — Local or remote dataset reference(s)

Options:

  • -a, --all — Pull all datasets in the workspace
  • -r, --recursive — Also pull all transitive dependencies of specified datasets
  • --fetch-uncacheable — Pull latest data from the uncacheable data sources
  • --as <NAME> — Local name of a dataset to use when syncing from a repository
  • --no-alias — Don’t automatically add a remote push alias for this destination
  • --set-watermark <TIME> — Injects a manual watermark into the dataset to signify that no data is expected to arrive with event time that precedes it
  • -f, --force — Overwrite local version with remote, even if revisions have diverged

Pull is a multi-functional command that lets you update a local dataset. Depending on the parameters and the types of datasets involved it can be used to:

  • Run polling ingest to pull data into a root dataset from an external source
  • Run transformations on a derivative dataset to process previously unseen data
  • Pull dataset from a remote repository into your workspace
  • Update watermark on a dataset

Examples:

Fetch latest data in a specific dataset:

kamu pull org.example.data

Fetch latest data in datasets matching pattern:

kamu pull org.example.%

Fetch latest data for the entire dependency tree of a dataset:

kamu pull --recursive org.example.derivative

Refresh data of all datasets in the workspace:

kamu pull --all

Fetch dataset from a registered repository:

kamu pull kamu/org.example.data

Fetch dataset from a URL (see kamu repo add -h for supported sources):

kamu pull ipfs://bafy...a0dx/data
kamu pull s3://my-bucket.example.org/odf/org.example.data
kamu pull s3+https://example.org:5000/data --as org.example.data

Advance the watermark of a dataset:

kamu pull --set-watermark 2020-01-01 org.example.data

kamu push

Push local data into a repository

Usage: kamu push [OPTIONS] [dataset]...

Arguments:

  • <DATASET> — Local or remote dataset reference(s)

Options:

  • -a, --all — Push all datasets in the workspace
  • -r, --recursive — Also push all transitive dependencies of specified datasets
  • --no-alias — Don’t automatically add a remote push alias for this destination
  • --to <REM> — Remote alias or a URL to push to
  • -f, --force — Overwrite remote version with local, even if revisions have diverged

Use this command to share your new dataset or new data with others. All changes performed by this command are atomic and non-destructive. This command will analyze the state of the dataset at the repository and will only upload data and metadata that wasn’t previously seen.

Similarly to git, if someone else modified the dataset concurrently with you, your push will be rejected and you will have to resolve the conflict.

Examples:

Sync dataset to a destination URL (see kamu repo add -h for supported protocols):

kamu push org.example.data --to s3://my-bucket.example.org/odf/org.example.data

Sync dataset to a named repository (see kamu repo command group):

kamu push org.example.data --to kamu-hub/org.example.data

Sync dataset that already has a push alias:

kamu push org.example.data

Sync datasets matching pattern that already have push aliases:

kamu push org.example.%

Add dataset to local IPFS node and update IPNS entry to the new CID:

kamu push org.example.data --to ipns://k5..zy

kamu rename

Rename a dataset

Usage: kamu rename <dataset> <name>

Arguments:

  • <DATASET> — Dataset reference
  • <NAME> — The new name to give it

Use this command to rename a dataset in your local workspace. Renaming is safe in terms of downstream derivative datasets as they use stable dataset IDs to define their inputs.

Examples:

Renaming is often useful when you pull a remote dataset by URL and it gets auto-assigned a name that is not the most convenient:

kamu pull ipfs://bafy...a0da
kamu rename bafy...a0da my.dataset

kamu reset

Revert the dataset back to the specified state

Usage: kamu reset [OPTIONS] <dataset> <hash>

Arguments:

  • <DATASET> — ID of the dataset
  • <HASH> — Hash of the block to reset to

Options:

  • -y, --yes — Don’t ask for confirmation

Resetting a dataset to the specified block erases all metadata blocks that followed it and deletes all data added since that point. This can sometimes be useful to resolve conflicts, but otherwise should be used with care.

Keep in mind that blocks that were pushed to a repository could have already been observed by other people, so resetting the history will not let you take that data back and will instead create conflicts for the downstream consumers of your data.
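
Examples:

Find the hash of the block you want to return to using kamu log, then reset to it (the dataset name is illustrative and <hash> stands for a block hash copied from the log output):

kamu log my.dataset
kamu reset my.dataset <hash>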

kamu repo

Manage set of tracked repositories

Usage: kamu repo <COMMAND>

Subcommands:

  • add — Adds a repository
  • delete — Deletes a reference to repository
  • list — Lists known repositories
  • alias — Manage set of remote aliases associated with datasets

Repositories are nodes on the network that let users exchange datasets. In the most basic form, a repository can simply be a location where the dataset files are hosted over one of the supported file or object-based data transfer protocols. The owner of a dataset will have push privileges to this location, while other participants can pull data from it.

Examples:

Show available repositories:

kamu repo list

Add S3 bucket as a repository:

kamu repo add example-repo s3://bucket.my-company.example/

kamu repo add

Adds a repository

Usage: kamu repo add <name> <url>

Arguments:

  • <NAME> — Local alias of the repository
  • <URL> — URL of the repository

For local file system repositories use the following URL formats:

file:///home/me/example/repository/
file:///c:/Users/me/example/repository/

For S3-compatible basic repositories use:

s3://bucket.my-company.example/
s3+http://my-minio-server:9000/bucket/
s3+https://my-minio-server:9000/bucket/

For ODF-compatible smart repositories use:

odf+http://odf-server/
odf+https://odf-server/

kamu repo delete

Deletes a reference to repository

Usage: kamu repo delete [OPTIONS] [repository]...

Arguments:

  • <REPOSITORY> — Repository name(s)

Options:

  • -a, --all — Delete all known repositories
  • -y, --yes — Don’t ask for confirmation

kamu repo list

Lists known repositories

Usage: kamu repo list [OPTIONS]

Options:

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

kamu repo alias

Manage set of remote aliases associated with datasets

Usage: kamu repo alias <COMMAND>

Subcommands:

  • list — Lists remote aliases
  • add — Adds a remote alias to a dataset
  • delete — Deletes a remote alias associated with a dataset

When you pull and push datasets from repositories kamu uses aliases to let you avoid specifying the full remote reference each time. Aliases are usually created the first time you do a push or pull and saved for later. If you have an unusual setup (e.g. pushing to multiple repositories) you can use this command to manage the aliases.

Examples:

List all aliases:

kamu repo alias list

List all aliases of a specific dataset:

kamu repo alias list org.example.data

Add a new pull alias:

kamu repo alias add --pull org.example.data kamu.dev/me/org.example.data

kamu repo alias list

Lists remote aliases

Usage: kamu repo alias list [OPTIONS] [dataset]

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

kamu repo alias add

Adds a remote alias to a dataset

Usage: kamu repo alias add [OPTIONS] <dataset> <alias>

Arguments:

  • <DATASET> — Local dataset reference
  • <ALIAS> — Remote dataset name

Options:

  • --push — Add a push alias
  • --pull — Add a pull alias

kamu repo alias delete

Deletes a remote alias associated with a dataset

Usage: kamu repo alias delete [OPTIONS] <dataset> [alias]

Arguments:

  • <DATASET> — Local dataset reference
  • <ALIAS> — Remote dataset name

Options:

  • -a, --all — Delete all aliases
  • --push — Delete the push alias
  • --pull — Delete the pull alias

kamu search

Searches for datasets in the registered repositories

Usage: kamu search [OPTIONS] [QUERY]

Arguments:

  • <QUERY> — Search terms

Options:

  • --repo <REPO> — Repository name(s) to search in

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

Search is delegated to the repository implementations and its capabilities depend on the type of the repo. Whereas smart repos may support advanced full-text search, simple storage-only repos may be limited to a substring search by dataset name.

Examples:

Search all repositories:

kamu search covid19

Search only specific repositories:

kamu search covid19 --repo kamu --repo statcan.gc.ca

kamu sql

Executes an SQL query or drops you into an SQL shell

Usage: kamu sql [OPTIONS] [COMMAND]

Subcommands:

  • server — Run JDBC server only

Options:

  • --url <URL> — URL of a running JDBC server (e.g. jdbc:hive2://example.com:10000)

  • -c, --command <CMD> — SQL command to run

  • --script <FILE> — SQL script file to execute

  • --engine <ENG> — Engine type to use for this SQL session

    Possible values: spark, datafusion

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

The SQL shell allows you to explore the data of all datasets in your workspace using one of the supported data processing engines. This can be a great way to prepare and test a query that you can later turn into a derivative dataset.

Examples:

Drop into SQL shell:

kamu sql

Execute SQL command and return its output in CSV format:

kamu sql -c 'SELECT * FROM `org.example.data` LIMIT 10' -o csv

Run SQL server to use with external data processing tools:

kamu sql server --address 0.0.0.0 --port 8080

Connect to a remote SQL server:

kamu sql --url jdbc:hive2://example.com:10000

Note: Currently, when connecting to a remote kamu SQL server you will need to manually instruct it to load datasets from the data files. This can be done using the following command:

CREATE TEMP VIEW `my.dataset` AS (SELECT * FROM parquet.`kamu_data/my.dataset`);

kamu sql server

Run JDBC server only

Usage: kamu sql server [OPTIONS]

Options:

  • --address <ADDRESS> — Expose JDBC server on specific network interface

    Default value: 127.0.0.1

  • --port <PORT> — Expose JDBC server on specific port

    Default value: 10000

  • --livy — Run Livy server instead of Spark JDBC

  • --flight-sql — Run Flight SQL server instead of Spark JDBC

kamu system

Command group for system-level functionality

Usage: kamu system <COMMAND>

Subcommands:

  • gc — Runs garbage collection to clean up cached and unreachable objects in the workspace
  • upgrade-workspace — Upgrade the layout of a local workspace to the latest version
  • api-server — Run HTTP + GraphQL server
  • info — Summary of the system information
  • diagnose — Run basic system diagnose check
  • ipfs — IPFS helpers
  • check-token — Validate a Kamu token
  • generate-token — Generate a platform token from a known secret for debugging
  • compact — Compact a dataset

kamu system gc

Runs garbage collection to clean up cached and unreachable objects in the workspace

Usage: kamu system gc

kamu system upgrade-workspace

Upgrade the layout of a local workspace to the latest version

Usage: kamu system upgrade-workspace

kamu system api-server

Run HTTP + GraphQL server

Usage: kamu system api-server [OPTIONS] [COMMAND]

Subcommands:

  • gql-query — Executes the GraphQL query and prints out the result
  • gql-schema — Prints the GraphQL schema

Options:

  • --address <ADDRESS> — Bind to a specific network interface
  • --http-port <HTTP-PORT> — Expose HTTP+GraphQL server on specific port
  • --get-token — Output a JWT token you can use to authorize API queries

Examples:

Run API server on a specified port:

kamu system api-server --http-port 12312

Execute a single GraphQL query and print result to stdout:

kamu system api-server gql-query '{ apiVersion }'

Print out GraphQL API schema:

kamu system api-server gql-schema

kamu system api-server gql-query

Executes the GraphQL query and prints out the result

Usage: kamu system api-server gql-query [OPTIONS] <query>

Arguments:

  • <QUERY>

Options:

  • --full — Display the full result including extensions

kamu system api-server gql-schema

Prints the GraphQL schema

Usage: kamu system api-server gql-schema

kamu system info

Summary of the system information

Usage: kamu system info [OPTIONS]

Options:

  • -o, --output-format <FMT>

    Possible values: shell, json, yaml

kamu system diagnose

Run basic system diagnose check

Usage: kamu system diagnose

kamu system ipfs

IPFS helpers

Usage: kamu system ipfs <COMMAND>

Subcommands:

  • add — Adds the specified dataset to IPFS and returns the CID

kamu system ipfs add

Adds the specified dataset to IPFS and returns the CID

Usage: kamu system ipfs add <dataset>

Arguments:

  • <DATASET> — Dataset reference

kamu system check-token

Validate a Kamu token

Usage: kamu system check-token <token>

Arguments:

  • <TOKEN> — Kamu token

kamu system generate-token

Generate a platform token from a known secret for debugging

Usage: kamu system generate-token [OPTIONS] --login <login>

Options:

  • --login <LOGIN> — Account name

  • --gh-access-token <GH-ACCESS-TOKEN> — An existing GitHub access token

  • --expiration-time-sec <EXPIRATION-TIME-SEC> — Token expiration time in seconds

    Default value: 3600
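
For example, to generate a short-lived token for a hypothetical account:

kamu system generate-token --login alice --expiration-time-sec 600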

kamu system compact

Compact a dataset

Usage: kamu system compact [OPTIONS] <dataset>...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • --max-slice-size <SIZE> — Maximum size of a single data slice file in bytes

    Default value: 1073741824

  • --max-slice-records <RECORDS> — Maximum amount of records in a single data slice file

    Default value: 10000

  • --hard — Perform ‘hard’ compaction that rewrites the history of a dataset

  • --verify — Perform verification of the dataset before running a compaction

For datasets that receive frequent small appends, the number of data slices can grow over time and affect query performance. This command allows you to merge many small data slices into a few large files, which can be beneficial both in size, thanks to more compact encoding, and in query performance, as data engines have to scan through far fewer file headers.

There are two types of compaction: soft and hard.

Soft compactions produce new files while leaving the old blocks intact. This allows for faster queries while still preserving the accurate history of how the dataset evolved over time.

Hard compactions rewrite the history of the dataset as if the data had originally been written in big batches. They let you shrink the history of a dataset to just a few blocks and reclaim the space used by old data files, but at the expense of history loss. Hard compactions rewrite the metadata chain, changing block hashes, and will therefore break all downstream datasets that depend on them.

Examples:

Perform a history-altering hard compaction:

kamu system compact --hard my.dataset
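
Perform a soft compaction that keeps the metadata history intact (assuming soft compaction is the default when --hard is not passed):

kamu system compact my.dataset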

kamu tail

Displays a sample of most recent records in a dataset

Usage: kamu tail [OPTIONS] <dataset>

Arguments:

  • <DATASET> — Local dataset reference

Options:

  • -n, --num-records <NUM> — Number of records to display

    Default value: 10

  • -s, --skip-records <NUM> — Number of initial records to skip before applying the limit

    Default value: 0

  • -o, --output-format <FMT> — Format to display the results in

    Possible values: table, csv, json, ndjson, json-soa

This command can be thought of as a shortcut for:

kamu sql --engine datafusion --command 'select * from "{dataset}" order by {offset_col} desc limit {num_records}'
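
For example, to show the 25 most recent records in CSV form (the dataset name is illustrative):

kamu tail my.dataset -n 25 -o csv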

kamu ui

Opens web interface

Usage: kamu ui [OPTIONS]

Options:

  • --address <ADDRESS> — Expose HTTP server on specific network interface
  • --http-port <HTTP-PORT> — Which port to run HTTP server on
  • --get-token — Output a JWT token you can use to authorize API queries

Starts a built-in HTTP + GraphQL server and opens a pre-packaged Web UI application in your browser.

Examples:

Starts server and opens UI in your default browser:

kamu ui

Start server on a specific port:

kamu ui --http-port 12345

kamu verify

Verifies the validity of a dataset

Usage: kamu verify [OPTIONS] <dataset>...

Arguments:

  • <DATASET> — Local dataset reference(s)

Options:

  • -r, --recursive — Verify the entire transformation chain starting with root datasets
  • --integrity — Check only the hashes of metadata and data without replaying transformations

Validity of derivative data is determined by:

  • Trustworthiness of the source data that went into it
  • Soundness of the derivative transformation chain that shaped it
  • Guaranteeing that derivative data was in fact produced by declared transformations

For the first two, you can inspect the dataset lineage to see which root datasets the data is coming from and whether their publishers are credible. Then you can audit all derivative transformations to ensure they are sound and non-malicious.

This command can help you with the last stage. It uses the history of transformations stored in metadata to first compare the hashes of the data with the ones stored in metadata (i.e. verify that the data corresponds to the metadata). It then repeats all declared transformations locally to ensure that what’s declared in metadata actually produces the presented result.

The combination of the above steps can give you a high certainty that the data you’re using is trustworthy.

When called on a root dataset the command will only perform the integrity check of comparing data hashes to metadata.

Examples:

Verify the data in a dataset starting from its immediate inputs:

kamu verify com.example.deriv

Verify the data in datasets matching pattern:

kamu verify com.example.%

Verify the entire transformation chain starting with root datasets (may download a lot of data):

kamu verify --recursive com.example.deriv

Verify only the hashes of metadata and data, without replaying the transformations. This is useful when you trust the peers performing the transformations but want to ensure the data was not tampered with in storage or during transmission:

kamu verify --integrity com.example.deriv

kamu version

Outputs build information

Usage: kamu version [OPTIONS]

Options:

  • -o, --output-format <FMT>

    Possible values: shell, json, yaml