# Ingesting Data

export const Diagram = ({src, alt}) => {
  return <div style={{
    display: "flex",
    "flex-direction": "column",
    "align-items": "center"
  }}>
    <img src={src} alt={alt} style={{
    background: "#dddddd",
    "margin-bottom": 0
  }} />
    <span>{alt}</span>
  </div>;
};

export const YouTubeList = ({id}) => {
  const src = `https://www.youtube.com/embed/videoseries?list=${id}`;
  return <iframe className="w-full aspect-video rounded-xl" src={src} allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen></iframe>;
};

export const YouTube = ({id, width}) => {
  const src = `https://www.youtube.com/embed/${id}`;
  return <iframe className="w-full aspect-video rounded-xl" src={src} allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen width={width}></iframe>;
};

export const Schema = ({t, id}) => {
  const anchor = id ? id : t.toLowerCase().replace(/\s+/g, "-");
  const link = `/odf/schemas#${anchor}`;
  return <a class="schema-object" href={link}>{t}</a>;
};

export const Term = ({t, id}) => {
  const anchor = id ? id : t.toLowerCase().replace(/\s+/g, "-");
  const link = `/general/glossary#${anchor}`;
  return <a class="glossary-term" href={link}>{t}</a>;
};

Ingestion is the process by which external data is added into Open Data Fabric's <Term t="root datasets" id="root-dataset" />. Kamu supports a wide variety of sources including resources on the web and [blockchains](/cli/ingest/blockchain-source). Below we will describe why this process is necessary and how it works.

## Motivation

When interacting with data on the web, we usually cannot assume anything about the guarantees its publisher provides:

* Just by looking at the data, we cannot tell whether it is **mutable** or preserves a **history of changes**.
* We often can't tell where it **originates** from and who can be held **accountable** for its **validity**.
* We can't be sure that it's sufficiently **replicated**, so that if the server hosting it today disappears tomorrow, our data science projects are not left **non-reproducible**.

Data on the web is in a state of *constant churn*: it is often updated in destructive ways, and even important medical datasets disappear overnight. Any guarantees that do exist are hidden behind many layers of service agreements, and still give us no strong assurance that they won't be violated (on purpose or by accident). This is why step #1 of every data science project is to **copy and version** the data.

[Open Data Fabric](/odf/spec) was created to **avoid excessive copying and harmful versioning** of data, and instead embed these guarantees into the data format itself, making them **impossible to violate**.

The ingestion step is about getting external data from this *"churning world"* into the strict ledgers of <Term t="root datasets" id="root-dataset" />, where properties like clear ownership, reproducibility, accountability, and a complete historical account can be guaranteed and made explicit.

## Sources

There are two types of ingestion sources:

* **Push sources** - for cases when an external actor actively sends data into a dataset. This is suitable e.g. for IoT devices that periodically write data, business processes that can report events directly into an ODF dataset, or for ingesting data from streaming data APIs and event queues like [Apache Kafka](https://kafka.apache.org/).
* **Polling sources** - for cases when external data is stored somewhere in bulk and we want to periodically synchronize its state with an ODF dataset. This is suitable e.g. for datasets that exist as files on the web or for reading data from bulk APIs.

## Phases

The ingestion process has several well-defined phases:

<div align="center">
  <Diagram src="/images/cli/ingest/ingest.png" alt="Ingest flow" />
</div>

These phases are directly reflected in the <Schema t="SetPollingSource" /> event:

* `fetch` - specifies how to download the data from some external source (e.g. HTTP/FTP) and how to cache it efficiently
* `prepare` (optional) - specifies how to prepare raw binary data for reading (e.g. extracting an archive or converting between formats)
* `read` - specifies how to read the data into structured form (e.g. as CSV or Parquet)
* `preprocess` (optional) - allows shaping the structured data with queries (e.g. to parse and convert types into the best-suited form with SQL)
* `merge` - specifies how to **combine the read data with the history of previously seen data** (this step is extremely important as it performs "ledgerization" / "historization" of the evolving state of data; see the [Merge Strategies](/cli/ingest/merge-strategies) section for an explanation).

<Tip>
  If you are unsure what the `SetPollingSource` event is, please refer to the [First Steps](/cli/quick-start) section, which explains dataset creation.
</Tip>

The phases of push ingest are defined by the <Schema t="AddPushSource" /> event and are very similar, except that the `fetch` and `prepare` steps are omitted.

For more information refer to [Polling Sources](/cli/ingest/polling-source) and [Push Sources](/cli/ingest/push-source) sections.
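
To make the phases above more concrete, here is a minimal Python sketch that models each phase as a plain function. It is a conceptual illustration only, not how Kamu implements ingestion: all function names, the gzip-compressed CSV input, the made-up records, and the simplified key-based `merge` are assumptions chosen for the example.

```python
import csv
import gzip
import io


def fetch(source_text: str) -> bytes:
    """fetch: stand-in for an HTTP/FTP download; returns gzip-compressed bytes."""
    return gzip.compress(source_text.encode("utf-8"))


def prepare(blob: bytes) -> bytes:
    """prepare: make raw binary data readable (here: decompress it)."""
    return gzip.decompress(blob)


def read(data: bytes) -> list:
    """read: parse bytes into structured records (here: CSV rows as dicts)."""
    return list(csv.DictReader(io.StringIO(data.decode("utf-8"))))


def preprocess(rows: list) -> list:
    """preprocess: shape the records, e.g. convert strings into proper types."""
    return [{"city": r["city"], "population": int(r["population"])} for r in rows]


def merge(history: list, rows: list, key: str = "city") -> list:
    """merge: combine new records with history.

    Simplified assumption: records whose key was already seen are skipped,
    so re-reading the same source does not duplicate data.
    """
    seen = {r[key] for r in history}
    return history + [r for r in rows if r[key] not in seen]


def poll_ingest(source_text: str, history: list) -> list:
    """Polling flow: fetch -> prepare -> read -> preprocess -> merge."""
    return merge(history, preprocess(read(prepare(fetch(source_text)))))


def push_ingest(pushed: bytes, history: list) -> list:
    """Push flow: data arrives ready to read, so fetch and prepare are skipped."""
    return merge(history, preprocess(read(pushed)))


# Made-up records, used only to exercise the sketch.
history = poll_ingest("city,population\nA,100\n", [])
history = poll_ingest("city,population\nA,100\nB,200\n", history)
history = push_ingest(b"city,population\nC,300\n", history)
print([r["city"] for r in history])  # prints ['A', 'B', 'C']
```

Polling the same source twice adds only the record that was not seen before, which is the intuition behind the `merge` phase. The actual strategies are more involved and are covered in the [Merge Strategies](/cli/ingest/merge-strategies) section.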

## Further Reading

* For more information about defining data sources, refer to the [Polling Sources](/cli/ingest/polling-source) and [Push Sources](/cli/ingest/push-source) sections.
* For examples of dealing with various types of data, refer to the [Input Formats](/cli/ingest/input-formats) section.
* For a detailed explanation of the "ledgerization" process, see the [Merge Strategies](/cli/ingest/merge-strategies) section.
* For more inspiration on creating <Term t="root datasets" id="root-dataset" />, see [Examples](/examples) and the [`kamu-contrib`](https://github.com/kamu-data/kamu-contrib/) repo.
