CLI Reference
kamu
Usage: kamu [OPTIONS] <COMMAND>
Subcommands:
add
— Add a new dataset or modify an existing one
completions
— Generate tab-completion scripts for your shell
config
— Get or set configuration options
delete
— Delete a dataset
ingest
— Adds data to the root dataset according to its push source configuration
init
— Initialize an empty workspace in the current directory
inspect
— Group of commands for exploring dataset metadata
list
— List all datasets in the workspace
log
— Shows dataset metadata history
login
— Logs in to a remote Kamu server interactively
logout
— Logs out from a remote Kamu server
new
— Creates a new dataset manifest from a template
notebook
— Starts the notebook server for exploring the data in the workspace
pull
— Pull new data into the datasets
push
— Push local data into a repository
rename
— Rename a dataset
reset
— Revert the dataset back to the specified state
repo
— Manage set of tracked repositories
search
— Searches for datasets in the registered repositories
sql
— Executes an SQL query or drops you into an SQL shell
system
— Command group for system-level functionality
tail
— Displays a sample of most recent records in a dataset
ui
— Opens web interface
verify
— Verifies the validity of a dataset
version
— Outputs build information
Options:
-v
— Sets the level of verbosity (repeat for more)
--no-color
— Disable color output in the terminal
-q, --quiet
— Suppress all non-essential output
--trace
— Record and visualize the command execution as perfetto.dev trace
To get help for individual commands use:
kamu <command> -h
kamu add
Add a new dataset or modify an existing one
Usage: kamu add [OPTIONS] [manifest]...
Arguments:
<MANIFEST>
— Dataset manifest reference(s) (path, or URL)
Options:
-r, --recursive
— Recursively search for all manifests in the specified directory
--replace
— Delete and re-add datasets that already exist
--stdin
— Read manifests from standard input
--name <N>
— Overrides the name in a loaded manifest
This command creates a new dataset from the provided DatasetSnapshot manifest.
Note that after kamu creates a dataset, changes to the source file have no effect unless you run the add command again. While experimenting with a new dataset, you may currently need to delete and re-add it several times until you get your parameters and schema right.
In future versions the add command will allow you to modify the structure of already existing datasets (e.g. changing schema in a compatible way).
Examples:
Add a root/derivative dataset from local manifest:
kamu add org.example.data.yaml
Add datasets from all manifests found in the current directory:
kamu add --recursive .
Add a dataset from a manifest hosted externally (e.g. on GitHub):
kamu add https://raw.githubusercontent.com/kamu-data/kamu-contrib/master/ca.bankofcanada/ca.bankofcanada.exchange-rates.daily.yaml
To add a dataset from a repository, see the kamu pull command.
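The edit-and-retry loop described above can be shortened with the --replace option, which deletes and re-adds a dataset that already exists. The following is a minimal sketch rather than an official workflow: readd_manifest is a hypothetical helper name, it combines only options listed in this section, and because --replace deletes the existing dataset, any data already ingested into it should be expected to be lost.

```shell
# Hypothetical helper (not part of kamu): re-add a dataset after editing
# its manifest, using only the --replace option documented above.
readd_manifest() {
    manifest="$1"
    if [ -z "$manifest" ]; then
        echo "usage: readd_manifest <manifest>" >&2
        return 2
    fi
    # --replace deletes the existing dataset before adding it again
    kamu add --replace "$manifest"
}

# Usage (after editing the manifest):
#   readd_manifest org.example.data.yaml
```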
kamu completions
Generate tab-completion scripts for your shell
Usage: kamu completions <shell>
Arguments:
<SHELL>
Possible values: bash, elvish, fish, powershell, zsh
The command outputs to STDOUT, allowing you to redirect the output to a file of your choosing. Where you place the file depends on your shell, your operating system, and your particular configuration.
Here are some common setups:
Bash:
Append the following to your ~/.bashrc:
source <(kamu completions bash)
You will need to reload your shell session (or execute the same command in your current one) for changes to take effect.
Zsh:
Append the following to your ~/.zshrc:
autoload -U +X bashcompinit && bashcompinit
source <(kamu completions bash)
Please contribute a guide for your favorite shell!
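As one possible addition for fish, the sketch below writes the generated script into a per-user completions directory. The default path ~/.config/fish/completions is where fish conventionally autoloads completions from; it is an assumption about your setup rather than something this reference specifies, and install_fish_completions is a hypothetical helper name.

```shell
# Hypothetical helper: save kamu's fish completions where fish autoloads them.
# The default target directory is an assumption; pass another one as $1.
install_fish_completions() {
    dir="${1:-$HOME/.config/fish/completions}"
    mkdir -p "$dir" || return 1
    # The command prints the script to STDOUT, so redirect it into a file
    kamu completions fish > "$dir/kamu.fish"
}

# Usage:
#   install_fish_completions
```

New fish sessions should then pick up the completions; an already running session may need to be restarted.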
kamu config
Get or set configuration options
Usage: kamu config <COMMAND>
Subcommands:
list
— Display current configuration combined from all config files
get
— Get current configuration value
set
— Set or unset configuration value
Configuration in kamu is managed very similarly to git. Starting with your current workspace and going up the directory tree, you can have multiple .kamuconfig YAML files, which are all merged together to produce the resulting config.
Most commonly you will have a workspace-scoped config inside the .kamu directory and a user-scoped config residing in your home directory.
Examples:
List current configuration as combined view of config files:
kamu config list
Get current configuration value:
kamu config get engine.runtime
Set configuration value in workspace scope:
kamu config set engine.runtime podman
Set configuration value in user scope:
kamu config set --user engine.runtime podman
Unset or revert to default value:
kamu config set --user engine.runtime
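To see which files could take part in the merge, the sketch below walks from a starting directory up to the filesystem root and prints every .kamuconfig file it finds. It only illustrates the lookup described above and is not kamu's implementation: list_kamuconfigs is a hypothetical name, and the sketch does not cover the workspace-scoped config inside the .kamu directory or the order in which kamu resolves conflicting values.

```shell
# Hypothetical helper: print .kamuconfig files from a directory up to the root.
# Illustrates the directory walk described above; not kamu's actual code.
list_kamuconfigs() {
    dir="$(cd "${1:-.}" && pwd)" || return 1
    while :; do
        if [ -f "$dir/.kamuconfig" ]; then
            echo "$dir/.kamuconfig"
        fi
        # Stop once the filesystem root has been checked
        [ "$dir" = "/" ] && break
        dir="$(dirname "$dir")"
    done
}

# Usage:
#   list_kamuconfigs            # start from the current directory
#   list_kamuconfigs ~/project  # start from another directory
```

Comparing this listing with the output of kamu config list can help locate the file that sets a particular value.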
kamu config list
Display current configuration combined from all config files
Usage: kamu config list [OPTIONS]
Options:
--user
— Show only user scope configuration
--with-defaults
— Show configuration with all default values applied
kamu config get
Get current configuration value
Usage: kamu config get [OPTIONS] <cfgkey>
Arguments:
<CFGKEY>
— Path to the config option
Options:
--user
— Operate on the user scope configuration file
--with-defaults
— Get default value if config option is not explicitly set
kamu config set
Set or unset configuration value
Usage: kamu config set [OPTIONS] <cfgkey> [value]
Arguments:
<CFGKEY>
— Path to the config option
<VALUE>
— New value to set
Options:
--user
— Operate on the user scope configuration file
kamu delete
Delete a dataset
Usage: kamu delete [OPTIONS] [dataset]...
Arguments:
<DATASET>
— Local dataset reference(s)
Options:
-a, --all
— Delete all datasets in the workspace
-r, --recursive
— Also delete all transitive dependencies of specified datasets
-y, --yes
— Don’t ask for confirmation
This command deletes the dataset from your workspace, including both metadata and the raw data.
Take great care when deleting root datasets. If you have not pushed your local changes to a repository, the data will be lost.
Deleting a derivative dataset is usually less of a concern, since it can always be reconstructed, but it will disrupt downstream consumers.
Examples:
Delete a local dataset:
kamu delete my.dataset
Delete local datasets matching pattern:
kamu delete my.dataset.%
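Because an unpushed root dataset cannot be recovered after deletion, one cautious pattern is to push first and delete only if the push succeeded. This is a minimal sketch under two assumptions that this reference does not state: that the dataset already has a push alias, and that kamu push exits with a non-zero status on failure. push_then_delete is a hypothetical helper name.

```shell
# Hypothetical helper: delete a dataset only after a successful push.
# Assumes the dataset has a push alias and that a failed push returns non-zero.
push_then_delete() {
    dataset="$1"
    if kamu push "$dataset"; then
        # kamu delete still asks for confirmation unless --yes is given
        kamu delete "$dataset"
    else
        echo "push failed; keeping $dataset" >&2
        return 1
    fi
}

# Usage:
#   push_then_delete my.dataset
```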
kamu ingest
Adds data to the root dataset according to its push source configuration
Usage: kamu ingest [OPTIONS] <dataset> [FILE]...
Arguments:
<DATASET>
— Local dataset reference
<FILE>
— Data file(s) to ingest
Options:
--source-name <SRC>
— Name of the push source to use for ingestion
--event-time <T>
— Event time to be used if data does not contain one
--stdin
— Read data from the standard input
-r, --recursive
— Recursively propagate the updates into all downstream datasets
--input-format <FMT>
— Overrides the media type of the data expected by the push source
Possible values: csv, json, ndjson, geojson, ndgeojson, parquet, esrishapefile
Examples:
Ingest data from files:
kamu ingest org.example.data path/to/data.csv
Ingest data from standard input (assumes source is defined to use NDJSON):
printf '{"key": "value1"}\n{"key": "value2"}\n' | kamu ingest org.example.data --stdin
Ingest data with format conversion:
echo '[{"key": "value1"}, {"key": "value2"}]' | kamu ingest org.example.data --stdin --input-format json
Ingest data with event time hint:
kamu ingest org.example.data data.json --event-time 2050-01-02T12:00:00Z
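When piping newline-delimited JSON from a script, note that whether echo expands \n depends on the shell, while printf behaves consistently. The sketch below sends one JSON record per argument to kamu ingest via --stdin. It assumes the dataset's push source expects NDJSON, and ingest_records is a hypothetical helper name.

```shell
# Hypothetical helper: ingest JSON records given as arguments, one per line.
# Assumes the push source of the dataset is configured for NDJSON.
ingest_records() {
    if [ "$#" -lt 2 ]; then
        echo "usage: ingest_records <dataset> <json-record>..." >&2
        return 2
    fi
    dataset="$1"
    shift
    # printf repeats the format for every remaining argument
    printf '%s\n' "$@" | kamu ingest "$dataset" --stdin
}

# Usage:
#   ingest_records org.example.data '{"key": "value1"}' '{"key": "value2"}'
```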
kamu init
Initialize an empty workspace in the current directory
Usage: kamu init [OPTIONS]
Options:
--exists-ok
— Don’t return an error if workspace already exists
--pull-images
— Only pull container images and exit
A workspace is where kamu stores all the important information about datasets (metadata) and in some cases raw data.
It is recommended to create one kamu workspace per data science project, grouping all related datasets together.
Initializing a workspace creates a .kamu directory that contains dataset metadata, data, and all supporting files (configs, known repositories, etc.).
kamu inspect
Group of commands for exploring dataset metadata
Usage: kamu inspect <COMMAND>
Subcommands:
lineage
— Shows the dependency tree of a dataset
query
— Shows the transformations used by a derivative dataset
schema
— Shows the dataset schema
kamu inspect lineage
Shows the dependency tree of a dataset
Usage: kamu inspect lineage [OPTIONS] <dataset>...
Arguments:
<DATASET>
— Local dataset reference(s)
Options:
-o, --output-format <FMT>
— Format of an output
Possible values: shell, dot, csv, html
-b, --browse
— Produce HTML and open it in a browser
Presents the dataset-level lineage that includes current and past dependencies.
Examples:
Show lineage of a single dataset:
kamu inspect lineage my.dataset
Show lineage graph of all datasets in a browser:
kamu inspect lineage --browse
Render the lineage graph into a png image (needs graphviz installed):
kamu inspect lineage -o dot | dot -Tpng > depgraph.png
kamu inspect query
Shows the transformations used by a derivative dataset
Usage: kamu inspect query <dataset>
Arguments:
<DATASET>
— Local dataset reference
This command allows you to audit the transformations performed by a derivative dataset and their evolution. Such an audit is an important step in validating the trustworthiness of data (see the kamu verify command).
kamu inspect schema
Shows the dataset schema
Usage: kamu inspect schema [OPTIONS] <dataset>
Arguments:
<DATASET>
— Local dataset reference
Options:
-o, --output-format <FMT>
— Format of an output
Possible values: ddl, parquet, parquet-json, arrow-json
Displays the schema of the dataset. Note that dataset schemas can evolve over time and by default the latest schema will be shown.
Examples:
Show logical schema of a dataset in the DDL format:
kamu inspect schema my.dataset
Show physical schema of the underlying Parquet files:
kamu inspect schema my.dataset -o parquet
kamu list
List all datasets in the workspace
Usage: kamu list [OPTIONS]
Options:
-w, --wide
— Show more details (repeat for more)
-o, --output-format <FMT>
— Format to display the results in
Possible values: table, csv, json, ndjson, json-soa, json-aoa
Examples:
To see a human-friendly list of datasets in your workspace:
kamu list
To see more details:
kamu list -w
To get a machine-readable list of datasets:
kamu list -o csv
kamu log
Shows dataset metadata history
Usage: kamu log [OPTIONS] <dataset>
Arguments:
<DATASET>
— Local dataset reference
Options:
-o, --output-format <FMT>
Possible values: yaml
-f, --filter <FLT>
--limit <LIMIT>
— Maximum number of blocks to display
Default value: 500
The metadata of a dataset contains a historical record of everything that ever influenced how the data currently looks.
This includes events such as:
- Data ingestion / transformation
- Change of query
- Change of schema
- Change of source URL or other ingestion steps in a root dataset
Use this command to explore how a dataset evolved over time.
Examples:
Show brief summaries of individual metadata blocks:
kamu log org.example.data
Show detailed content of all blocks:
kamu log -o yaml org.example.data
Using a filter to inspect blocks containing query changes of a derivative dataset:
kamu log -o yaml --filter source org.example.data
kamu login
Logs in to a remote Kamu server interactively
Usage: kamu login [OPTIONS] [server] [COMMAND]
Subcommands:
oauth
— Performs non-interactive login to a remote Kamu server via OAuth provider token
password
— Performs non-interactive login to a remote Kamu server via login and password
Arguments:
<SERVER>
— ODF server URL (defaults to kamu.dev)
Options:
--user
— Store access token in the user home folder rather than in the workspace
--check
— Check whether existing authorization is still valid without triggering a login flow
--access-token <ACCESS-TOKEN>
— Provide an existing access token
kamu login oauth
Performs non-interactive login to a remote Kamu server via OAuth provider token
Usage: kamu login oauth [OPTIONS] <provider> <access-token> [server]
Arguments:
<PROVIDER>
— Name of the OAuth provider, e.g. ‘github’
<ACCESS-TOKEN>
— OAuth provider access token
<SERVER>
— ODF backend server URL (defaults to kamu.dev)
Options:
--user
— Store access token in the user home folder rather than in the workspace
kamu login password
Performs non-interactive login to a remote Kamu server via login and password
Usage: kamu login password [OPTIONS] <login> <password> [server]
Arguments:
<LOGIN>
— Specify user name
<PASSWORD>
— Specify password
<SERVER>
— ODF backend server URL (defaults to kamu.dev)
Options:
--user
— Store access token in the user home folder rather than in the workspace
kamu logout
Logs out from a remote Kamu server
Usage: kamu logout [OPTIONS] [server]
Arguments:
<SERVER>
— ODF server URL (defaults to kamu.dev)
Options:
--user
— Drop access token stored in the user home folder rather than in the workspace
-a, --all
— Log out of all logged in servers
kamu new
Creates a new dataset manifest from a template
Usage: kamu new [OPTIONS] <name>
Arguments:
<NAME>
— Name of the new dataset
Options:
--root
— Create a root dataset
--derivative
— Create a derivative dataset
This command creates a dataset manifest from a template, letting you customize the most relevant parts without having to remember the exact structure of the YAML file.
Examples:
Create the org.example.data.yaml file from a template in the current directory:
kamu new org.example.data --root
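Put together with kamu add, a typical flow is to generate the manifest, edit it, and then add it to the workspace. The sketch below assumes the generated file is named after the dataset with a .yaml extension and placed in the current directory, as in the example above. scaffold_root_dataset is a hypothetical helper, and the editing step relies on the conventional EDITOR environment variable holding a single command name.

```shell
# Hypothetical helper: create a root dataset manifest, edit it, then add it.
# Assumes "kamu new <name>" writes <name>.yaml into the current directory.
scaffold_root_dataset() {
    name="$1"
    kamu new "$name" --root || return 1
    # Fill in the template before it is added to the workspace
    "${EDITOR:-vi}" "$name.yaml" || return 1
    kamu add "$name.yaml"
}

# Usage:
#   scaffold_root_dataset org.example.data
```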
kamu notebook
Starts the notebook server for exploring the data in the workspace
Usage: kamu notebook [OPTIONS]
Options:
--address <ADDRESS>
— Expose HTTP server on specific network interface
--http-port <HTTP-PORT>
— Expose HTTP server on specific port
-e, --env <VAR>
— Propagate or set an environment variable in the notebook (e.g. -e VAR or -e VAR=foo)
This command will run the Jupyter server and the Spark engine connected together, letting you query data with SQL before pulling it into the notebook for final processing and visualization.
For more information check out notebook examples at https://github.com/kamu-data/kamu-cli
kamu pull
Pull new data into the datasets
Usage: kamu pull [OPTIONS] [dataset]...
Arguments:
<DATASET>
— Local or remote dataset reference(s)
Options:
-a, --all
— Pull all datasets in the workspace
-r, --recursive
— Also pull all transitive dependencies of specified datasets
--fetch-uncacheable
— Pull latest data from the uncacheable data sources
--as <NAME>
— Local name of a dataset to use when syncing from a repository
--no-alias
— Don’t automatically add a remote push alias for this destination
--set-watermark <TIME>
— Injects a manual watermark into the dataset to signify that no data is expected to arrive with event time that precedes it
-f, --force
— Overwrite local version with remote, even if revisions have diverged
--reset-derivatives-on-diverged-input
— Run hard compaction of derivative dataset if transformation failed due to root dataset compaction
Pull is a multi-functional command that lets you update a local dataset. Depending on the parameters and the types of datasets involved it can be used to:
- Run polling ingest to pull data into a root dataset from an external source
- Run transformations on a derivative dataset to process previously unseen data
- Pull dataset from a remote repository into your workspace
- Update watermark on a dataset
Examples:
Fetch latest data in a specific dataset:
kamu pull org.example.data
Fetch latest data in datasets matching pattern:
kamu pull org.example.%
Fetch latest data for the entire dependency tree of a dataset:
kamu pull --recursive org.example.derivative
Refresh data of all datasets in the workspace:
kamu pull --all
Fetch dataset from a registered repository:
kamu pull kamu/org.example.data
Fetch dataset from a URL (see kamu repo add -h for supported sources):
kamu pull ipfs://bafy...a0dx/data
kamu pull s3://my-bucket.example.org/odf/org.example.data
kamu pull s3+https://example.org:5000/data --as org.example.data
Advance the watermark of a dataset:
kamu pull --set-watermark 2020-01-01 org.example.data
kamu push
Push local data into a repository
Usage: kamu push [OPTIONS] [dataset]...
Arguments:
<DATASET>
— Local or remote dataset reference(s)
Options:
-a, --all
— Push all datasets in the workspace
-r, --recursive
— Also push all transitive dependencies of specified datasets
--no-alias
— Don’t automatically add a remote push alias for this destination
--to <REM>
— Remote alias or a URL to push to
-f, --force
— Overwrite remote version with local, even if revisions have diverged
Use this command to share your new dataset or new data with others. All changes performed by this command are atomic and non-destructive. This command will analyze the state of the dataset at the repository and will only upload data and metadata that wasn’t previously seen.
Similarly to git, if someone else modified the dataset concurrently with you, your push will be rejected and you will have to resolve the conflict.
Examples:
Sync dataset to a destination URL (see kamu repo add -h for supported protocols):
kamu push org.example.data --to s3://my-bucket.example.org/odf/org.example.data
Sync dataset to a named repository (see the kamu repo command group):
kamu push org.example.data --to kamu-hub/org.example.data
Sync dataset that already has a push alias:
kamu push org.example.data
Sync datasets matching pattern that already have push aliases:
kamu push org.example.%
Add dataset to local IPFS node and update IPNS entry to the new CID:
kamu push org.example.data --to ipns://k5..zy
kamu rename
Rename a dataset
Usage: kamu rename <dataset> <name>
Arguments:
<DATASET>
— Dataset reference
<NAME>
— The new name to give it
Use this command to rename a dataset in your local workspace. Renaming is safe in terms of downstream derivative datasets as they use stable dataset IDs to define their inputs.
Examples:
Renaming is often useful when you pull a remote dataset by URL and it is auto-assigned an inconvenient name:
kamu pull ipfs://bafy...a0da
kamu rename bafy...a0da my.dataset
kamu reset
Revert the dataset back to the specified state
Usage: kamu reset [OPTIONS] <dataset> <hash>
Arguments:
<DATASET>
— ID of the dataset
<HASH>
— Hash of the block to reset to
Options:
-y, --yes
— Don’t ask for confirmation
Resetting a dataset to the specified block erases all metadata blocks that followed it and deletes all data added since that point. This can sometimes be useful to resolve conflicts, but otherwise should be used with care.
Keep in mind that blocks that were pushed to a repository may already have been observed by other people, so resetting the history will not let you take that data back and will instead create conflicts for the downstream consumers of your data.
kamu repo
Manage set of tracked repositories
Usage: kamu repo <COMMAND>
Subcommands:
add
— Adds a repository
delete
— Deletes a reference to repository
list
— Lists known repositories
alias
— Manage set of remote aliases associated with datasets
Repositories are nodes on the network that let users exchange datasets. In the most basic form, a repository can simply be a location where the dataset files are hosted over one of the supported file or object-based data transfer protocols. The owner of a dataset will have push privileges to this location, while other participants can pull data from it.
Examples:
Show available repositories:
kamu repo list
Add S3 bucket as a repository:
kamu repo add example-repo s3://bucket.my-company.example/
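Registering a repository and pushing a dataset to it can be combined into one step. The sketch below uses the repository/dataset form of a remote reference that appears in the kamu push examples. publish_to_repo is a hypothetical helper, the repository name and URL are placeholders, and the sketch assumes the repository has not been registered yet.

```shell
# Hypothetical helper: register a repository, then push a dataset into it.
# Assumes the repository name is not already registered in the workspace.
publish_to_repo() {
    repo="$1"
    url="$2"
    dataset="$3"
    kamu repo add "$repo" "$url" || return 1
    # Remote reference in the <repository>/<dataset> form
    kamu push "$dataset" --to "$repo/$dataset"
}

# Usage:
#   publish_to_repo example-repo s3://bucket.my-company.example/ org.example.data
```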
kamu repo add
Adds a repository
Usage: kamu repo add <name> <url>
Arguments:
<NAME>
— Local alias of the repository
<URL>
— URL of the repository
For local file system repositories use the following URL formats:
file:///home/me/example/repository/
file:///c:/Users/me/example/repository/
For S3-compatible basic repositories use:
s3://bucket.my-company.example/
s3+http://my-minio-server:9000/bucket/
s3+https://my-minio-server:9000/bucket/
For ODF-compatible smart repositories use:
odf+http://odf-server/
odf+https://odf-server/
kamu repo delete
Deletes a reference to repository
Usage: kamu repo delete [OPTIONS] [repository]...
Arguments:
<REPOSITORY>
— Repository name(s)
Options:
-a, --all
— Delete all known repositories
-y, --yes
— Don’t ask for confirmation
kamu repo list
Lists known repositories
Usage: kamu repo list [OPTIONS]
Options:
-o, --output-format <FMT>
— Format to display the results in
Possible values: table, csv, json, ndjson, json-soa, json-aoa
kamu repo alias
Manage set of remote aliases associated with datasets
Usage: kamu repo alias <COMMAND>
Subcommands:
list
— Lists remote aliases
add
— Adds a remote alias to a dataset
delete
— Deletes a remote alias associated with a dataset
When you pull and push datasets from repositories kamu uses aliases to let you avoid specifying the full remote reference each time. Aliases are usually created the first time you do a push or pull and saved for later. If you have an unusual setup (e.g. pushing to multiple repositories) you can use this command to manage the aliases.
Examples:
List all aliases:
kamu repo alias list
List all aliases of a specific dataset:
kamu repo alias list org.example.data
Add a new pull alias:
kamu repo alias add --pull org.example.data kamu.dev/me/org.example.data
kamu repo alias list
Lists remote aliases
Usage: kamu repo alias list [OPTIONS] [dataset]
Arguments:
<DATASET>
— Local dataset reference
Options:
-o, --output-format <FMT>
— Format to display the results in
Possible values: table, csv, json, ndjson, json-soa, json-aoa
kamu repo alias add
Adds a remote alias to a dataset
Usage: kamu repo alias add [OPTIONS] <dataset> <alias>
Arguments:
<DATASET>
— Local dataset reference
<ALIAS>
— Remote dataset name
Options:
--push
— Add a push alias
--pull
— Add a pull alias
kamu repo alias delete
Deletes a remote alias associated with a dataset
Usage: kamu repo alias delete [OPTIONS] <dataset> [alias]
Arguments:
<DATASET>
— Local dataset reference
<ALIAS>
— Remote dataset name
Options:
-a, --all
— Delete all aliases
--push
— Delete a push alias
--pull
— Delete a pull alias
kamu search
Searches for datasets in the registered repositories
Usage: kamu search [OPTIONS] [QUERY]
Arguments:
<QUERY>
— Search terms
Options:
--repo <REPO>
— Repository name(s) to search in
-o, --output-format <FMT>
— Format to display the results in
Possible values: table, csv, json, ndjson, json-soa, json-aoa
Search is delegated to the repository implementations and its capabilities depend on the type of the repo. Whereas smart repos may support advanced full-text search, simple storage-only repos may be limited to a substring search by dataset name.
Examples:
Search all repositories:
kamu search covid19
Search only specific repositories:
kamu search covid19 --repo kamu --repo statcan.gc.ca
kamu sql
Executes an SQL query or drops you into an SQL shell
Usage: kamu sql [OPTIONS] [COMMAND]
Subcommands:
server
— Run JDBC server only
Options:
--url <URL>
— URL of a running JDBC server (e.g. jdbc:hive2://example.com:10000)
-c, --command <CMD>
— SQL command to run
--script <FILE>
— SQL script file to execute
--engine <ENG>
— Engine type to use for this SQL session
Possible values: spark, datafusion
-o, --output-format <FMT>
— Format to display the results in
Possible values: table, csv, json, ndjson, json-soa, json-aoa
The SQL shell allows you to explore the data of all datasets in your workspace using one of the supported data processing engines. This can be a great way to prepare and test a query that you can later turn into a derivative dataset.
Examples:
Drop into SQL shell:
kamu sql
Execute SQL command and return its output in CSV format:
kamu sql -c 'SELECT * FROM `org.example.data` LIMIT 10' -o csv
Run SQL server to use with external data processing tools:
kamu sql server --address 0.0.0.0 --port 8080
Connect to a remote SQL server:
kamu sql --url jdbc:hive2://example.com:10000
Note: Currently when connecting to a remote SQL kamu server you will need to manually instruct it to load datasets from the data files. This can be done using the following command:
CREATE TEMP VIEW `my.dataset` AS (SELECT * FROM parquet.`kamu_data/my.dataset`);
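For scripting, the -c and -o options shown in the examples can be wrapped in a small function. In this minimal sketch, sample_csv is a hypothetical helper that prints the first N rows of a dataset as CSV. The backtick quoting of the dataset name follows the example in this section and may need to change depending on the engine.

```shell
# Hypothetical helper: print the first N rows of a dataset as CSV.
# Identifier quoting follows the example above and may be engine-specific.
sample_csv() {
    dataset="$1"
    limit="${2:-10}"
    # Reject non-numeric limits, since the value is spliced into the SQL text
    case "$limit" in
        ''|*[!0-9]*)
            echo "limit must be a non-negative integer" >&2
            return 2
            ;;
    esac
    kamu sql -c "SELECT * FROM \`$dataset\` LIMIT $limit" -o csv
}

# Usage:
#   sample_csv org.example.data 5 > sample.csv
```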
kamu sql server
Run JDBC server only
Usage: kamu sql server [OPTIONS]
Options:
--address <ADDRESS>
— Expose JDBC server on specific network interface
Default value: 127.0.0.1
--port <PORT>
— Expose JDBC server on specific port
Default value: 10000
--flight-sql
— Run Flight SQL server instead of Spark JDBC
kamu system
Command group for system-level functionality
Usage: kamu system <COMMAND>
Subcommands:
gc
— Runs garbage collection to clean up cached and unreachable objects in the workspace
upgrade-workspace
— Upgrade the layout of a local workspace to the latest version
api-server
— Run HTTP + GraphQL server
info
— Summary of the system information
diagnose
— Run basic system diagnose check
ipfs
— IPFS helpers
debug-token
— Validate a Kamu token
generate-token
— Generate a platform token from a known secret for debugging
compact
— Compact a dataset
kamu system gc
Runs garbage collection to clean up cached and unreachable objects in the workspace
Usage: kamu system gc
kamu system upgrade-workspace
Upgrade the layout of a local workspace to the latest version
Usage: kamu system upgrade-workspace
kamu system api-server
Run HTTP + GraphQL server
Usage: kamu system api-server [OPTIONS] [COMMAND]
Subcommands:
gql-query
— Executes the GraphQL query and prints out the result
gql-schema
— Prints the GraphQL schema
Options:
--address <ADDRESS>
— Bind to a specific network interface
--http-port <HTTP-PORT>
— Expose HTTP+GraphQL server on specific port
--get-token
— Output a JWT token you can use to authorize API queries
--external-address <EXTERNAL-ADDRESS>
— Allows changing the base URL used in the API. Can be handy when launching inside a container
Examples:
Run API server on a specified port:
kamu system api-server --http-port 12312
Execute a single GraphQL query and print result to stdout:
kamu system api-server gql-query '{ apiVersion }'
Print out GraphQL API schema:
kamu system api-server gql-schema
kamu system api-server gql-query
Executes the GraphQL query and prints out the result
Usage: kamu system api-server gql-query [OPTIONS] <query>
Arguments:
<QUERY>
Options:
--full
— Display the full result including extensions
kamu system api-server gql-schema
Prints the GraphQL schema
Usage: kamu system api-server gql-schema
kamu system info
Summary of the system information
Usage: kamu system info [OPTIONS]
Options:
-o, --output-format <FMT>
Possible values: shell, json, yaml
kamu system diagnose
Run basic system diagnose check
Usage: kamu system diagnose
kamu system ipfs
IPFS helpers
Usage: kamu system ipfs <COMMAND>
Subcommands:
add
— Adds the specified dataset to IPFS and returns the CID
kamu system ipfs add
Adds the specified dataset to IPFS and returns the CID
Usage: kamu system ipfs add <dataset>
Arguments:
<DATASET>
— Dataset reference
kamu system debug-token
Validate a Kamu token
Usage: kamu system debug-token <token>
Arguments:
<TOKEN>
— Kamu token
kamu system generate-token
Generate a platform token from a known secret for debugging
Usage: kamu system generate-token [OPTIONS]
Options:
--subject <SUBJECT>
— AccountID to generate token for
--login <LOGIN>
— Account name to derive ID from (for predefined accounts only)
--expiration-time-sec <EXPIRATION-TIME-SEC>
— Token expiration time in seconds
Default value: 3600
kamu system compact
Compact a dataset
Usage: kamu system compact [OPTIONS] <dataset>...
Arguments:
<DATASET>
— Local dataset reference(s)
Options:
--max-slice-size <SIZE>
— Maximum size of a single data slice file in bytes
Default value: 300000000
--max-slice-records <RECORDS>
— Maximum number of records in a single data slice file
Default value: 10000
--hard
— Perform ‘hard’ compaction that rewrites the history of a dataset
--keep-metadata-only
— Perform compaction without saving data blocks
--verify
— Perform verification of the dataset before running a compaction
For datasets that get frequent small appends, the number of data slices can grow over time and degrade query performance. This command lets you merge multiple small data slices into a few large files, which can reduce size through more compact encoding and improve query performance, as data engines have to scan through far fewer file headers.
There are two types of compactions: soft and hard.
Soft compactions produce new files while leaving the old blocks intact. This allows for faster queries while still preserving an accurate history of how the dataset evolved over time.
Hard compactions rewrite the history of the dataset as if data was originally written in big batches. They let you shrink the history of a dataset to just a few blocks and reclaim the space used by old data files, at the expense of history loss. Hard compactions rewrite the metadata chain, changing block hashes, and therefore break all downstream datasets that depend on them.
Examples:
Perform a history-altering hard compaction:
kamu system compact --hard my.dataset
kamu tail
Displays a sample of most recent records in a dataset
Usage: kamu tail [OPTIONS] <dataset>
Arguments:
<DATASET>
— Local dataset reference
Options:
-n, --num-records <NUM>
— Number of records to display
Default value: 10
-s, --skip-records <NUM>
— Number of initial records to skip before applying the limit
Default value: 0
-o, --output-format <FMT>
— Format to display the results in
Possible values: table, csv, json, ndjson, json-soa, json-aoa
This command can be thought of as a shortcut for:
kamu sql --engine datafusion --command 'select * from "{dataset}" order by {offset_col} desc limit {num_records}'
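The template above can be filled in by hand when you need a variation that kamu tail does not offer, such as an extra filter. The sketch below only builds the SQL text. It assumes the offset column is named offset, which you should check against your dataset's schema with kamu inspect schema, and it double-quotes the column name as a precaution in case the name clashes with an SQL keyword. tail_query is a hypothetical helper name.

```shell
# Hypothetical helper: build the SQL text that the shortcut above stands for.
# The default column name "offset" is an assumption about the dataset schema.
tail_query() {
    dataset="$1"
    num_records="${2:-10}"
    offset_col="${3:-offset}"
    printf 'select * from "%s" order by "%s" desc limit %s\n' \
        "$dataset" "$offset_col" "$num_records"
}

# Usage:
#   kamu sql --engine datafusion --command "$(tail_query my.dataset 5)"
```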
kamu ui
Opens web interface
Usage: kamu ui [OPTIONS]
Options:
--address <ADDRESS>
— Expose HTTP server on specific network interface
--http-port <HTTP-PORT>
— Which port to run HTTP server on
--get-token
— Output a JWT token you can use to authorize API queries
Starts a built-in HTTP + GraphQL server and opens a pre-packaged Web UI application in your browser.
Examples:
Start the server and open the UI in your default browser:
kamu ui
Start server on a specific port:
kamu ui --http-port 12345
kamu verify
Verifies the validity of a dataset
Usage: kamu verify [OPTIONS] <dataset>...
Arguments:
<DATASET>
— Local dataset reference(s)
Options:
-r, --recursive
— Verify the entire transformation chain starting with root datasets
--integrity
— Check only the hashes of metadata and data without replaying transformations
Validity of derivative data is determined by:
- Trustworthiness of the source data that went into it
- Soundness of the derivative transformation chain that shaped it
- Guaranteeing that derivative data was in fact produced by declared transformations
For the first two, you can inspect the dataset lineage to see which root datasets the data is coming from and whether their publishers are credible. Then you can audit all derivative transformations to ensure they are sound and non-malicious.
This command can help you with the last stage. It uses the history of transformations stored in metadata to first compare the hashes of data with ones stored in metadata (i.e. verify that data corresponds to metadata). Then it repeats all declared transformations locally to ensure that what’s declared in metadata actually produces the presented result.
The combination of the above steps can give you a high certainty that the data you’re using is trustworthy.
When called on a root dataset the command will only perform the integrity check of comparing data hashes to metadata.
Examples:
Verify the data in a dataset starting from its immediate inputs:
kamu verify com.example.deriv
Verify the data in datasets matching pattern:
kamu verify com.example.%
Verify the entire transformation chain starting with root datasets (may download a lot of data):
kamu verify --recursive com.example.deriv
Verify only the hashes of metadata and data, without replaying the transformations. This is useful when you trust the peers performing transformations but want to ensure data was not tampered with in storage or during transmission:
kamu verify --integrity com.example.deriv
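Since replaying transformations can be expensive, one option is to run the cheaper integrity check first and replay only if it passes. This sketch assumes that kamu verify exits with a non-zero status when verification fails, which this reference does not state. verify_in_stages is a hypothetical helper name.

```shell
# Hypothetical helper: cheap integrity check first, full replay second.
# Assumes "kamu verify" returns a non-zero status when verification fails.
verify_in_stages() {
    dataset="$1"
    # Stage 1: compare hashes of data and metadata only
    kamu verify --integrity "$dataset" || return 1
    # Stage 2: replay the declared transformations locally
    kamu verify "$dataset"
}

# Usage:
#   verify_in_stages com.example.deriv
```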
kamu version
Outputs build information
Usage: kamu version [OPTIONS]
Options:
-o, --output-format <FMT>
Possible values: shell, json, yaml