kamu supports multiple fetch sources and read formats that, in combination with custom preparation steps, can be used to ingest all kinds of data.
Note that kamu is not meant to be or replace data workflow tools like Apache Airflow or Apache NiFi, or data extraction tools like Debezium. The utilities described below are here only to simplify the initial data ingestion step - the very first step in the data’s journey through a web of structured stream processing pipelines.
CSV Variants
You can customize many of the formatting options of the CSV format parser. For example, a tab-separated file can be read as:
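A minimal sketch of such a read step, assuming the Csv reader's separator and header options (check the ReadStep reference for the full list):

```yaml
read:
  kind: Csv
  header: true       # first line contains column names
  separator: "\t"    # tab-delimited instead of the default comma
```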
JSON Document
If you have a JSON document in which the array of records is nested inside the document, the subPath option points the reader at that array.
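For illustration, assume a hypothetical document where the records live under nested.values:

```json
{
  "nested": {
    "values": [
      {"id": 1, "key": "A"},
      {"id": 2, "key": "B"}
    ]
  }
}
```

Such a document could be read with (a sketch, assuming a dot-separated subPath):

```yaml
read:
  kind: Json
  subPath: nested.values   # the array of records to read
```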
See: ReadStep::Json
NDJSON
NDJSON, aka newline-delimited JSON, is a file with one complete JSON object per line, such as:
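```
{"id": 1, "key": "A"}
{"id": 2, "key": "B"}
```

A minimal sketch of the corresponding read step:

```yaml
read:
  kind: NdJson
```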
GeoJSON Document
Simply use the GeoJson reader. It expects one FeatureCollection object in the root and will create a record for each Feature inside it, extracting the properties into individual columns and leaving the feature geometry in its own column.
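A minimal sketch of the read step:

```yaml
read:
  kind: GeoJson
```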
See: ReadStep::GeoJson
NDGeoJSON Document
Simply use the NdGeoJson reader. Unlike the GeoJson reader, which expects a FeatureCollection object in the root, it expects every individual Feature to appear on its own line.
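A minimal sketch of the read step:

```yaml
read:
  kind: NdGeoJson
```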
See: ReadStep::NdGeoJson
Esri Shapefile
GIS data in ESRI Shapefile format can be read as:
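A minimal sketch of the read step (a subPath option for selecting a file within an archive should also be available - check the ReadStep reference for details):

```yaml
read:
  kind: EsriShapefile
```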
Compressed Data & Archives
Use the decompress preparation step to extract data from gzip and zip archives:
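A sketch of such a preparation step, assuming a zip archive (gzip archives would use format: Gzip):

```yaml
prepare:
  - kind: Decompress
    format: Zip
```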
Other Formats
If you have to work with formats that are not natively supported you’ll need to transcode them. Using the pipe preparation step you can specify a custom program or script that receives data via STDIN and outputs the result to STDOUT.
For example, here’s how transcoding a JSON document into CSV using jq might look:
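A sketch, assuming a hypothetical document where the records live under .data:

```yaml
prepare:
  - kind: Pipe
    command:
      - jq
      - -r
      - '.data[] | [.id, .key] | @csv'   # emit each record as a CSV row
```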
Directory of Timestamped Files
The filesGlob fetch source is used in cases where a directory contains a growing set of files. Files can be periodic snapshots of your database or represent batches of new data in a ledger. In either case file content should never change - once kamu processes a file it will not consider it again. It’s OK for files to disappear - kamu will remember the name of the file it ingested last and will only consider files that are higher in order than that one (lexicographically based on file name, or based on event time as shown below).
In the example below, the data inside the files is in snapshot format, and to complicate things it does not itself contain an event time - the event time is written into the file’s name.
Directory contents:
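Hypothetical file names, for illustration:

```
zipcodes-2020-01-01.csv
zipcodes-2020-01-08.csv
zipcodes-2020-01-15.csv
```

A fetch step for such a directory might look like this (a sketch; double-check field names against the FetchStep::FilesGlob reference):

```yaml
fetch:
  kind: FilesGlob
  path: /home/user/data/zipcodes-*.csv
  eventTime:
    kind: FromPath
    pattern: 'zipcodes-(\d+-\d+-\d+)\.csv'   # capture group extracts the date
    timestampFormat: '%Y-%m-%d'
  order: ByEventTime
```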
Dealing with API Keys
Sometimes you may want to parametrize the URL to include things like API keys and auth tokens. For this kamu supports basic variable substitution:
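A sketch, assuming the ${{ env.VAR }} substitution syntax and a hypothetical endpoint:

```yaml
fetch:
  kind: Url
  url: "https://api.example.com/v1/data?api_key=${{ env.EXAMPLE_API_KEY }}"
```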
Using Ingest Scripts
Sometimes you may need the power of a general-purpose programming language to deal with a particularly complex API, or when doing web scraping. For this kamu supports containerized ingestion tasks:
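A sketch of a containerized fetch step, assuming a hypothetical image name:

```yaml
fetch:
  kind: Container
  image: "ghcr.io/example-org/my-ingest:0.1.0"
  env:
    - name: EXAMPLE_API_KEY   # forwarded into the container from the host environment
```

The image is expected to conform to the following interface: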
- Produce data to stdout
- Write warnings / errors to stderr
- Use the following environment variables:
  - ODF_LAST_MODIFIED - last modified time of data from the previous ingest run, if any (in RFC3339 format)
  - ODF_ETAG - caching tag of data from the previous ingest run, if any
  - ODF_BATCH_SIZE - the recommended number of records per batch, for ingest scripts that provide a continuous stream of data and can resume from previous state
    - default value is 10 000, can be overridden via env
  - ODF_NEW_LAST_MODIFIED_PATH - path to a text file where the ingest script may write a new Last-Modified timestamp
  - ODF_NEW_ETAG_PATH - path to a text file where the ingest script may write a new eTag
  - ODF_NEW_HAS_MORE_DATA_PATH - path to a text file which the ingest script can create to indicate that it has more data for the next batch
    - ⚠️ Please note: if the file is created, one of the following output marks must also be present: eTag or Last-Modified timestamp