Renku dataset: avoiding redundant directories

Hello,

Let’s say I have just generated new data in a Renku project, either by executing commands manually or via renku run.

What is the recommended way of creating a dataset with that new data? In particular:

  • When creating the dataset, should the short name of the dataset match the directory hierarchy in data/ to prevent additional directories from being created?
  • Should I specify the git repo URL and the path to the data within the repo, even though the dataset is created within the project where the data was generated?
  • If I re-run renku run, do I need to renku dataset update, even though the dataset belongs to the project where the data was generated?

Thanks!

Cyril


Hello Cyril,

Please excuse the late reply. I can try to answer your questions, but they’re rather tricky.

  1. It is fine for a directory with the shortname to already exist in the data/ directory, but, as far as I understand it, adding a file that is already inside that directory would cause the command to fail.
  2. You should specify a local path in the repo. If you use the git URL, I’m not sure exactly what would happen (you could run into corner cases where a newer file gets overwritten by an older one if you didn’t push, I guess). That said, using the git URL might actually work, and might even enable dataset update to work for renku run outputs, so you’re welcome to try. But it’s definitely not something we envisioned users doing.
  3. Currently, renku run and datasets don’t work together. Adding support for this is on our roadmap and we see it as a critical feature, but it is tricky to implement in a way that covers all use cases. You should be able to renku run into the dataset folder in data/, but the metadata of the dataset would not be updated and would still contain the old file information. However, if someone else imported the dataset into their project, my guess is they would still get the newest version of the file. dataset update does not work for local files to the best of my knowledge; it is intended to keep imported/remote datasets in sync. You should be able to call renku dataset add --overwrite to add the new file to the dataset after the rerun.
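To illustrate point 3, here is a minimal sketch of that workflow; the dataset name my-data and the file paths are placeholders, and the exact flags may differ between renku versions:

```shell
# Create a dataset and add a generated file to it
# (this copies the file into data/my-data/).
renku dataset create my-data
renku dataset add my-data results/output.csv

# After re-running the step that regenerates results/output.csv,
# the dataset metadata still describes the old file, so re-add it:
renku dataset add --overwrite my-data results/output.csv
```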

Hopefully this will all work together more smoothly once we figure out how to use dataset together with renku run. It’d be great if you could describe your use case, or how you’d like datasets and renku run to interact, in https://github.com/SwissDataScienceCenter/renku-python/issues/787

Oh, and a heads-up: “shortname” will be renamed to “name” in the next renku release, and “name” will become “title”.

I hope this helps and that I understood your questions correctly.

Regards,

Ralf


Hello Ralf,

Thanks for the nice answers! Questions 2 and 3 are clarified.

I was not clear with question 1, sorry. The question I should have asked instead was:

Does it make sense at all to create a dataset within the project where the data was generated, when the goal is to import that data as a dataset in another project?

Given your answers, I’m pretty sure the answer is no. That would explain why creating a dataset automatically copies files, even when the dataset is created from data generated directly within the Renku project, and also why it makes no sense to add the git URL of the repo for a dataset created from local files.

I’ll think about renku run and renku dataset, and post to that thread.

Best,

Cyril

Hello Cyril,

I think for question 1, the main difference would be whether you
(1) dataset add all files in project A and import the dataset in project B or
(2) dataset add from project A in project B directly.

If there’s more than one project using the data, then (1) is easier for the other projects, as they don’t need to know about the individual files. Also with (1), if you later add additional files to the dataset in project A, they will automatically end up in project B if you dataset update there.
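A sketch of workflow (1), with placeholder dataset names, paths, and URL; the exact argument accepted by dataset import depends on your renku version and where project A is hosted:

```shell
# In project A: group the generated files into a dataset.
renku dataset create shared-data
renku dataset add shared-data results/file1.csv results/file2.csv

# In project B: import the dataset as a whole, without having to
# list the individual files (the URL here is hypothetical).
renku dataset import https://renkulab.io/projects/cyril/project-a/datasets/shared-data

# Later in project B: pull in files that were added to the
# dataset in project A after the import.
renku dataset update shared-data
```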

To avoid cluttering the project, you should be able to delete the original file (the copy outside data/) once it is in the dataset.

Regards,

Ralf
