Adding large datasets

I am planning to use some large MODIS remote sensing datasets, which are all stored online:

What is a good way of dealing with such datasets? Would it be good to have a script that first downloads the data to some temporary location, extracts just what is needed, writes a relatively small output file, and deletes the downloaded files? That way I could still run it with renku, but the input data itself would not be stored in the repository. Or would you suggest adding the data to the renku repository first? In that case, the repository might become quite big.
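To make the script idea above concrete, here is a minimal sketch of the download, extract, delete pattern using only the Python standard library. The names `fetch_to`, `process_remote_file`, and the `extract` callable are hypothetical and not part of renku; the sketch also assumes the URL can be fetched without authentication and that `extract` returns text small enough to hold in memory.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_to(url, dest):
    """Stream the content at `url` into the file `dest` (no authentication handled)."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return dest


def process_remote_file(url, out_path, extract, fetch=fetch_to):
    """Download `url` to a temporary directory, keep only what `extract` returns.

    `extract` is a user-supplied callable (hypothetical here) that takes the
    path of the downloaded file and returns the small result as a string.
    The temporary directory, including the large download, is removed when
    the `with` block exits, even if `extract` raises.
    """
    out_path = Path(out_path)
    with tempfile.TemporaryDirectory() as tmp:
        raw = fetch(url, Path(tmp) / "download.bin")
        result = extract(raw)
        out_path.write_text(result)
    return out_path
```

Only the file written to `out_path` would need to be committed to the repository; the raw download never leaves the temporary directory.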


Your use case is not clear to me: do you want to process the downloaded data and then discard it, or do you need the downloaded data in the future?
If you need the data only once (e.g. to generate some output), then it’s best not to add it to your repo. You can download it to a temporary location on your file system and add it to your project as EXTERNAL data. In this case, renku creates a pointer to your data in the project but does not copy the actual data. You can use these pointers in your workflows like normal files and delete the external files once the processing is done. Note that if you delete the external files, the update and rerun commands will fail because they cannot find the source data, but I guess you don’t want to run those commands if your use case is one-time processing. To add external data to a dataset, use `renku dataset add <ds-name> --external /path/to/external/data`.

If you want to reuse that data then it’s probably best to add the data to your repo. You can add data to your dataset directly from a URL: `renku dataset add url(s)`.

Okay, yes, that mostly answers my question. So one option is indeed as you say: download the data, process it (i.e. extract the time series for one specific location instead of keeping all the data) and then discard it. The other option would be to add the full data to the repository.

The external option seems a good way then. But this will break the lineage, from the original source data to the outputs, correct? Because the downloading will then happen outside of renku. I indeed don’t need the rerun and update commands, but once we finish and publish, it should be possible for someone to see the right version of, and a link to, the original data. I am just asking because I wonder what the best practices are here: should I add this information to the metadata of the external source?

But this will break the lineage, from the original source data to the outputs, correct?

Right, external metadata in renku only includes links to your local filesystem path and not the original URL. At the moment, we don’t have a solution to track this single-use data in renku without storing it in the repo.

@rcnijzink pointed me to this, as I had a similar question. @mohammad-sdsc is there a plan to implement clear tracking of external data? I have been using `renku dataset add url(s)`, but I never see these URLs in the KG, nor is there any other obvious way to get to them in the UI. The original sources are actually a bit opaque for all datasets. For example, when I downloaded this dataset:
I actually added the link manually in the metadata, otherwise I would not know how to find out later where I got the data from.
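As an illustration of recording the source by hand, one possible workaround is a small sidecar file next to the data that stores the original URL and a checksum. This is only a sketch and not a renku feature: the function name `write_provenance`, the `.source.json` suffix, and the field names are all assumptions made for this example.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(data_path, source_url, sidecar_path=None):
    """Write a JSON sidecar recording where `data_path` came from.

    The sidecar layout (file name suffix and keys) is hypothetical; adapt it
    to whatever metadata convention your project uses.
    """
    data_path = Path(data_path)

    # Hash the file in 1 MiB chunks so large downloads do not fill memory.
    digest = hashlib.sha256()
    with open(data_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)

    record = {
        "file": data_path.name,
        "source_url": source_url,
        "sha256": digest.hexdigest(),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }

    if sidecar_path is None:
        sidecar_path = data_path.with_name(data_path.name + ".source.json")
    sidecar_path = Path(sidecar_path)
    sidecar_path.write_text(json.dumps(record, indent=2))
    return sidecar_path
```

Committing the sidecar together with the processed output keeps the link to the original data in the repository history, even when the raw download itself is deleted.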

When you add data from a URL, Renku stores the URL in the dataset’s metadata. However, as you’ve mentioned, the KG doesn’t process this information and the UI doesn’t display it. I’ll create a story for this issue in the Renku repo.


@rcnijzink I just came across this: "Skip the download! Stream NASA data directly into Python objects" by Scott Henderson (pangeo, on Medium).

Perhaps worth exploring.