Validity of Data
With kamu, sharing data becomes very easy, but with that problem out of the way you will soon start wondering: “How can I trust the data I downloaded from someone else?” Let’s first define what validity or trustworthiness of data means.
Validity of Root Data
Let’s say you’re about to use a root dataset containing historical weather information in your city. How can you be sure it’s trustworthy?
Because source data is non-reproducible, its validity depends entirely on its publisher. The publisher is in full control of the data they present, which also means that measurement and processing errors, and even malicious data, can easily make their way into the root dataset. Make sure that your publishers have a good reputation, and prefer data that comes from well-established organizations (governments or NGOs) that directly collect the data or operate the systems from which it is gathered.
Aside from external audits, another way to improve confidence in data is to correlate it with data from other similar sources. In our example we could compare it with the data from a weather station in the neighbouring city and look for anomalies.
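As a rough illustration of that kind of cross-check, here is a minimal Python sketch. It assumes both stations' readings have already been exported as date-to-temperature mappings. The `find_anomalies` function, the `tolerance` parameter, and the readings themselves are all hypothetical and made up for this example; they are not part of kamu.

```python
from statistics import median


def find_anomalies(local, neighbour, tolerance=5.0):
    """Return dates where two stations disagree by unusually much.

    local, neighbour: dicts mapping ISO date -> temperature in degrees C.
    The typical offset between the stations is estimated as the median
    difference; days deviating from it by more than `tolerance` are flagged.
    """
    shared = sorted(set(local) & set(neighbour))
    if not shared:
        return []
    diffs = {day: local[day] - neighbour[day] for day in shared}
    typical = median(diffs.values())
    return [day for day in shared if abs(diffs[day] - typical) > tolerance]


# Made-up readings, for illustration only
local = {"2020-01-01": 4.0, "2020-01-02": 5.5, "2020-01-03": 21.0, "2020-01-04": 3.5}
neighbour = {"2020-01-01": 3.0, "2020-01-02": 4.5, "2020-01-03": 4.0, "2020-01-04": 2.5}

print(find_anomalies(local, neighbour))  # ['2020-01-03']
```

A flagged date is not proof of bad data (local weather can genuinely differ), only a pointer to where a closer look is warranted.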
Validity of Derivative Data
Derivative data in kamu is created purely through transformations that are recorded in metadata. These transformations can still be malicious, but since they are usually small (e.g. a few SQL queries) we can easily audit them to ensure they are sound and done in good faith. Repeating this process for the entire transformation chain, starting with root datasets, will give you confidence in the trustworthiness of derivative data.
Based on the above, here are the steps needed to ensure a dataset is trustworthy:
- Inspect lineage and identify all root datasets it’s composed of
- Ensure publishers of root datasets are reputable and credible
- Use lineage to audit all derivative transformations to see if they are sound and non-malicious
- Use the kamu verify command to ensure that the data you downloaded actually matches the declared transformations
    # Inspect the dependency graph of a dataset to find all root sources
    kamu inspect lineage ca.vancouver.opendata.weather.aggregated-daily

    # Inspect the transformations applied
    # Repeat this for every derivative dataset in the dependency graph
    kamu inspect query ca.vancouver.opendata.weather.aggregated-daily

    # Verify the data matches the declared metadata and transformations
    kamu verify --recursive ca.vancouver.opendata.weather.aggregated-daily
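To make the first step more concrete, the sketch below shows what "identifying all root datasets" amounts to once lineage is viewed as a graph. It is a conceptual Python illustration only: the `find_roots` function, the dictionary representation of lineage, and the dataset names are assumptions made for this example and do not reflect how kamu stores or exposes lineage.

```python
def find_roots(dataset, inputs_of):
    """Collect all root datasets that `dataset` transitively depends on.

    inputs_of: dict mapping a dataset name to the list of its input
    datasets. A dataset with no inputs is treated as a root.
    """
    roots, seen, stack = set(), set(), [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        inputs = inputs_of.get(current, [])
        if inputs:
            stack.extend(inputs)  # derivative: keep walking upstream
        else:
            roots.add(current)  # no inputs: this is a root dataset
    return roots


# Hypothetical lineage, for illustration only
lineage = {
    "weather.aggregated-daily": ["weather.cleaned"],
    "weather.cleaned": ["weather.station-a", "weather.station-b"],
    "weather.station-a": [],
    "weather.station-b": [],
}

print(sorted(find_roots("weather.aggregated-daily", lineage)))
# ['weather.station-a', 'weather.station-b']
```

Every dataset visited along the way that is not a root is a derivative dataset whose transformation should be audited.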
For every derivative dataset the kamu verify command does two things:
- Compares hashes of the data you downloaded to the ones stored in metadata (ensures the data was not tampered with or corrupted)
- Executes the declared derivative transformations locally and compares the hash of the result to the one stored in metadata (ensures that metadata was not spoofed to match a fake result)
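The following Python sketch illustrates why both checks are needed. It is a conceptual model, not kamu's actual implementation: the `content_hash`, `verify`, and `daily_max` functions, the JSON-based hashing, and the sample readings are all assumptions introduced for this example.

```python
import hashlib
import json


def content_hash(records):
    """Hash records via a canonical JSON encoding (an assumption of this sketch)."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(inputs, downloaded_output, recorded_hash, transform):
    # Check 1: the downloaded data must match the hash stored in metadata
    if content_hash(downloaded_output) != recorded_hash:
        return "data does not match metadata"
    # Check 2: re-running the declared transformation must give the same hash
    if content_hash(transform(inputs)) != recorded_hash:
        return "metadata does not match the declared transformation"
    return "ok"


def daily_max(readings):
    """Hypothetical declared transformation: maximum temperature per day."""
    result = {}
    for r in readings:
        result[r["date"]] = max(result.get(r["date"], r["temp"]), r["temp"])
    return [{"date": d, "max_temp": t} for d, t in sorted(result.items())]


# Made-up input readings, for illustration only
inputs = [
    {"date": "2020-01-01", "temp": 3.0},
    {"date": "2020-01-01", "temp": 6.5},
    {"date": "2020-01-02", "temp": 4.0},
]
honest = daily_max(inputs)
fake = [{"date": "2020-01-01", "max_temp": 30.0}]

print(verify(inputs, honest, content_hash(honest), daily_max))
print(verify(inputs, fake, content_hash(honest), daily_max))
print(verify(inputs, fake, content_hash(fake), daily_max))
```

The first call passes both checks. The second models data altered after publication and fails the first check. The third models a publisher who records a hash that matches fake data: the first check passes, and only re-executing the transformation exposes the mismatch.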