Polling sources are used for cases when external data is stored somewhere in bulk and we want to periodically synchronize its state with an ODF dataset. Polling sources are suitable for ingesting data from:
- Periodic database dumps
- Data published as a set of files on the web
- Bulk data access APIs
- External systems using custom connector libraries.
Source Metadata
Polling sources are defined via a metadata event with the following sections:
- fetch - specifies how to download the data from some external source (e.g. HTTP/FTP) and how to cache it efficiently
- prepare (optional) - specifies how to prepare raw binary data for reading (e.g. extracting an archive or converting between formats)
- read - specifies how to read the data into structured form (e.g. as CSV or Parquet)
- preprocess (optional) - allows shaping the structured data with queries (e.g. to parse and convert types into the best-suited form with SQL)
- merge - specifies how to combine the read data with the history of previously seen data. This step is extremely important as it performs “ledgerization” / “historization” of the evolving state of data (see the Merge Strategies section for an explanation).
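To make the order of these steps concrete, here is a toy Python sketch of the read, preprocess, and merge stages applied to two successive polls of the same growing file. It is not how kamu implements ingestion: the function names, the sample data, and the simplified append-only merge keyed on one column are all assumptions made for illustration, and the fetch step is replaced by in-memory byte strings.

```python
import csv
import io


def read_csv(raw: bytes) -> list[dict]:
    """read: parse raw bytes into structured records."""
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


def preprocess(rows: list[dict]) -> list[dict]:
    """preprocess: convert string columns into better-suited types."""
    return [{"city": r["city"], "population": int(r["population"])} for r in rows]


def merge_append_only(history: list[dict], rows: list[dict], key: str) -> list[dict]:
    """merge: keep only records whose key was not seen in earlier polls."""
    seen = {r[key] for r in history}
    return [r for r in rows if r[key] not in seen]


# Stand-ins for the fetch step: two polls of a file that grew in between.
first_poll = b"city,population\nA,100\n"
second_poll = b"city,population\nA,100\nB,200\n"

history: list[dict] = []
for raw in (first_poll, second_poll):
    new_rows = merge_append_only(history, preprocess(read_csv(raw)), key="city")
    history.extend(new_rows)  # only previously unseen records are appended
```

The second poll contains record A again, but the merge step appends only record B, so the history never holds duplicates.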
Polling Data via CLI
To poll data into a dataset via kamu, use the general-purpose kamu pull command.
Polling Data via API
See the APIs documentation for various options of polling data programmatically.
Event Time
The ideal scenario for kamu is when data records contain event time within them as a column, but many data sources on the web are not like that.
If event time is not present in the data, kamu will try to infer it. This can be:
- Modification time for files on local or remote file systems
- Last-Modified time for HTTP resources.
How event time is inferred can be controlled in the fetch section of the polling source definition. It is fairly flexible, even allowing you to extract time from timestamps that are part of file names.
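As an illustration of the idea only, the following Python sketch extracts an event time from a timestamp embedded in a file name. It does not show kamu's configuration syntax or its implementation. The naming scheme (for example db-dump-2024-03-15.csv), the regular expression, and the function name are all hypothetical, and the timestamp is assumed to be in UTC.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Hypothetical naming scheme: a date right before the ".csv" extension.
DATE_IN_NAME = re.compile(r"(\d{4}-\d{2}-\d{2})\.csv$")


def event_time_from_name(file_name: str) -> Optional[datetime]:
    """Return the date embedded in the file name, or None if there is none."""
    match = DATE_IN_NAME.search(file_name)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%d")
    return parsed.replace(tzinfo=timezone.utc)  # assumption: names use UTC dates
```

When the function returns None, a caller could fall back to one of the inferred times listed above, such as the file modification time.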
Source Caching
kamu does its best to avoid redundant work and does not ingest data if the source was not updated since the last poll.
The exact cache control mechanism depends on how the data is fetched and on the protocol used. In the case of HTTP, for example, it relies on standard HTTP caching headers like ETag and Last-Modified.
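For the HTTP case, the standard mechanism behind these headers is the conditional request: the client sends back the validators it saw last time and the server answers 304 Not Modified if nothing changed. The Python sketch below shows only that general protocol behavior. The function names are hypothetical and this is not a description of kamu's internals.

```python
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build request headers from validators remembered from the previous poll."""
    headers = {}
    if etag is not None:
        headers["If-None-Match"] = etag  # server compares against its current ETag
    if last_modified is not None:
        headers["If-Modified-Since"] = last_modified  # HTTP-date from Last-Modified
    return headers


def source_unchanged(status_code: int) -> bool:
    """304 Not Modified means the cached copy is still current."""
    return status_code == 304
```

On the first poll no validators are known, so the request is unconditional and the full response is downloaded. Later polls can skip ingestion entirely whenever the server replies with 304.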
The latest caching information is stored in the dataset metadata, in a special object that records the source state. This means that an ingest may return no new data and still write a metadata event containing only the new source state.
You can control caching behavior in the fetch section of the polling source definition.