Phases

- Fetching Data
- Preparing Data
- Reading Data
- Preprocessing Data
- Merging Data

See Merge Strategies for a detailed description of the merge phase.
Purpose

Open Data Fabric by design stores all data in an append-only event log format, always preserving the entire history. Unfortunately, a lot of data in the world is not stored or exposed this way. Some organizations expose their data as periodic database dumps, while others provide it as a log of changes between the current and the previous export. When ingesting data from external sources, Root Datasets can choose between different merge strategies that define how newly ingested data is combined with the data already in the dataset.
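To make the idea of merging a periodic dump into an append-only log more concrete, here is a minimal Python sketch that diffs two full exports by a key column and emits change events. It is an illustration only, not how Open Data Fabric implements its merge strategies: the function name `snapshot_to_events`, the event labels, and the sample rows are all hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Event = Tuple[str, Row]  # (operation label, row); labels are hypothetical


def snapshot_to_events(previous: List[Row], current: List[Row], primary_key: str) -> List[Event]:
    """Diff two full dumps and return the changes between them as events."""
    prev_by_key = {row[primary_key]: row for row in previous}
    curr_by_key = {row[primary_key]: row for row in current}
    events: List[Event] = []

    # Rows that are new or whose contents differ from the previous dump.
    for key, row in curr_by_key.items():
        if key not in prev_by_key:
            events.append(("added", row))
        elif row != prev_by_key[key]:
            events.append(("changed", row))

    # Rows that were present before but are missing from the new dump.
    for key, row in prev_by_key.items():
        if key not in curr_by_key:
            events.append(("removed", row))

    return events


dump_1 = [{"id": 1, "city": "Vancouver"}, {"id": 2, "city": "Seattle"}]
dump_2 = [
    {"id": 1, "city": "Vancouver"},
    {"id": 2, "city": "Portland"},
    {"id": 3, "city": "Calgary"},
]

# Only the differences are appended to the log; unchanged rows produce nothing.
events = snapshot_to_events(dump_1, dump_2, "id")
```

Appending only the resulting events, rather than the whole dump, is what lets the history stay complete without duplicating unchanged records on every export.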
Compressed Data & Archives

Use the decompress preparation step to extract data from gzip or zip archives.

    prepare:
      - kind: decompress
        format: gzip

For a multi-file archive you can specify which file should be extracted:

    prepare:
      - kind: decompress
        format: zip
        subPath: specific-file-*.csv  # Note: can contain glob patterns

See also: PrepStep::Decompress

CSV and Variants

Tab-separated file:

    read:
      kind: csv
      separator: "\t"
      quote: '"'

See also: ReadStep::Csv
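As a rough illustration of what the decompress and read settings above describe, the following Python sketch picks a file out of a zip archive with a glob pattern and parses it as tab-separated text. It uses only the standard library and is not the tool's implementation. The helper name `read_tsv_from_zip` is hypothetical, and taking the first match in sorted order when several files match is an assumption, since the text above does not say how multiple matches are resolved.

```python
import csv
import fnmatch
import io
import zipfile


def read_tsv_from_zip(archive_bytes: bytes, sub_path: str) -> list:
    """Extract the file matching the glob `sub_path` and parse it as TSV."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        matches = sorted(
            name for name in archive.namelist() if fnmatch.fnmatch(name, sub_path)
        )
        if not matches:
            raise FileNotFoundError(f"no archive entry matches {sub_path!r}")
        # Assumption: when several entries match, use the first one.
        with archive.open(matches[0]) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Mirrors separator "\t" and quote '"' from the read step.
            return list(csv.reader(text, delimiter="\t", quotechar='"'))


# Build a small in-memory archive so the sketch runs without external files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "not data")
    archive.writestr("specific-file-2024.csv", 'id\tname\n1\t"Ada L."\n')

rows = read_tsv_from_zip(buffer.getvalue(), "specific-file-*.csv")
```

Here `rows` holds the header and one record, with the quoting removed from the quoted field and the unrelated `readme.txt` entry ignored.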