Renku dataset names

Hi All,

I’m having a bit of a confusing time with different dataset names and where / how they are used. My confusion could also be an artefact of using renku dataset import.

Context: I wanted to list the datasets and the files they contain, to check that everything is correct.

Renku version: 0.8.2 on Jupyterlab

renku dataset gives a list of the datasets within the repo which is very handy. In the example here, I created three different datasets:
renku dataset create meteorology-raw-wind-summary
renku dataset import <link-to-zenodo-dataset>
renku dataset import --name test-import-dataset <link-to-zenodo-dataset>

$ renku dataset

ID                                    DISPLAY_NAME                 VERSION    CREATED              CREATORS
------------------------------------  ---------------------------  ---------  -------------------  ------------------
4a3e1ca1-e8b2-48dd-b65d-ef30c5005d3f  meteorologyrawwindsum                   2020-03-19 14:04:56  J.Thomas
f6cca5d3-9d1d-495d-ad09-17cd92371fe0  summary_raw_wind_data_fr_11  1.1        2020-04-27 15:32:50  P.E.Carles,T.Jenny
18d733fd-958e-4f2f-b18a-f1ed06c0872c  summary_raw_wind_data_fr_11  1.1        2020-04-27 16:06:12  P.E.Carles,T.Jenny

Problem: The commands I used to check files, and then to change datasets, did not work as expected, which I think is due to the name I was using. Which dataset name should be used in renku dataset commands?

Example listing dataset files (ls-files):
The following output was as I expected. All files that I had added to the dataset were listed.

renku dataset ls-files meteorology-raw-wind-summary

ADDED CREATORS DATASET PATH
------------------- ---------- ---------------------------- -------------------------------------------------------------------------------------------------------
2020-03-19 14:05:22 Jen Thomas meteorology-raw-wind-summary /work/meteorology-raw-wind-legs0-4/data/meteorology-raw-wind-summary/metdata_wind_20161220_20170118.csv
2020-03-19 14:05:22 Jen Thomas meteorology-raw-wind-summary /work/meteorology-raw-wind-legs0-4/data/meteorology-raw-wind-summary/metdata_wind_20170122_20170223.csv
2020-03-19 14:05:22 Jen Thomas meteorology-raw-wind-summary /work/meteorology-raw-wind-legs0-4/data/meteorology-raw-wind-summary/metdata_wind_20170226_20170319.csv
2020-03-19 14:05:22 Jen Thomas meteorology-raw-wind-summary /work/meteorology-raw-wind-legs0-4/data/meteorology-raw-wind-summary/metdata_wind_20170322_20170411.csv
2020-03-20 11:55:19 Jen Thomas meteorology-raw-wind-summary /work/meteorology-raw-wind-legs0-4/data/meteorology-raw-wind-summary/metdata_wind_20161119_20161216.csv
2020-03-20 11:55:26 Jen Thomas meteorology-raw-wind-summary /work/meteorology-raw-wind-legs0-4/data/meteorology-raw-wind-summary/data_file_header.txt
2020-03-20 11:55:31 Jen Thomas meteorology-raw-wind-summary /work/meteorology-raw-wind-legs0-4/data/meteorology-raw-wind-summary/README.txt

To list the files, I used renku dataset ls-files name-given-on-create which is displayed in the UI on renkulab.

However, if I try the same for the other datasets within the repo, it looks as though the datasets do not have any files; the following output was not expected. The files and datasets are listed, though, when using git lfs ls-files. Using the name that is displayed in the UI does not work, nor does the <DISPLAY_NAME>.

renku dataset ls-files summary_raw_wind_data_fr_11

ADDED CREATORS DATASET PATH
------- ---------- --------- ------

and

renku dataset ls-files test-dataset-import

ADDED CREATORS DATASET PATH
------- ---------- --------- ------

I wasn’t sure which name to use to refer to the dataset here, but I don’t get the expected result using either the <DISPLAY_NAME> or the name used with renku dataset import. Files are being tracked using git lfs and appear in the renkulab.io files and dataset sections, so they seem to have been added to the dataset correctly.

Example unlinking dataset files:
renku dataset unlink --include metdata* test-import-dataset

Warning: You are about to remove following from "dataset" dataset.
/work/meteorology-raw-wind-legs0-4/data/summary_raw_wind_data_fr_11/metdata_wind_20161117_20161216.csv
/work/meteorology-raw-wind-legs0-4/data/summary_raw_wind_data_fr_11/metdata_wind_20161220_20170118.csv
/work/meteorology-raw-wind-legs0-4/data/summary_raw_wind_data_fr_11/metdata_wind_20170122_20170223.csv
/work/meteorology-raw-wind-legs0-4/data/summary_raw_wind_data_fr_11/metdata_wind_20170226_20170319.csv
/work/meteorology-raw-wind-legs0-4/data/summary_raw_wind_data_fr_11/metdata_wind_20170322_20170411.csv
/work/meteorology-raw-wind-legs0-4/data/test-import-dataset/metdata_wind_20161117_20161216.csv
/work/meteorology-raw-wind-legs0-4/data/test-import-dataset/metdata_wind_20161220_20170118.csv
/work/meteorology-raw-wind-legs0-4/data/test-import-dataset/metdata_wind_20170122_20170223.csv
/work/meteorology-raw-wind-legs0-4/data/test-import-dataset/metdata_wind_20170226_20170319.csv
/work/meteorology-raw-wind-legs0-4/data/test-import-dataset/metdata_wind_20170322_20170411.csv
Do you wish to continue? [y/N]: N
Aborted!

Here I used the dataset name that was given on import, but trying to unlink files suggests that files from two different datasets could be removed.

Questions:

  • which dataset names should I use with the renku dataset commands, particularly when using a dataset that has been created using renku dataset import?
  • is there a way to show all the names associated with a dataset?

Hi Jen,

You are using a very old version of Renku. A lot has changed recently, and a bunch of bug fixes and UX improvements have been added. Please consider upgrading to a recent version.

To answer your questions:

  1. You have to use the real name of the dataset, which can be found in the DATASET column of the output of renku dataset ls-files. However, this won’t solve your problem, because the name is shared by the two imported datasets.
  2. A workaround is to use -I or --include with the path of the dataset you want to list/unlink: renku dataset ls-files --include 'data/test-import-dataset/*'.
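
The path-based workaround can be wrapped in two small shell helpers. This is only a sketch: the function names are made up for illustration, the data/<directory>/ layout is taken from the paths shown in the question, and combining --include with a dataset name for unlink follows the command used there. It has not been checked against Renku 0.8.2, so look at the confirmation prompt before answering "y".

```shell
# Hypothetical helpers; the names are illustrative and not part of Renku.

# List only the files stored under data/<dir>/.
# Assumes each dataset lives in its own directory under data/,
# as in the paths shown in the question.
# $1 = directory name under data/
ls_dataset_dir() {
  renku dataset ls-files --include "data/$1/*"
}

# Unlink files matching a pattern, restricted to one dataset directory.
# The pattern is quoted so the shell passes it to renku unexpanded.
# $1 = dataset name, $2 = directory name under data/, $3 = file pattern
unlink_in_dir() {
  renku dataset unlink --include "data/$2/$3" "$1"
}
```

For example, ls_dataset_dir test-import-dataset lists one imported copy, and unlink_in_dir test-import-dataset test-import-dataset 'metdata*' should limit the unlink prompt to files in that directory.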

The unlink command is buggy in that version: it won’t commit the changes in metadata after unlinking. You need to add the metadata file and create a commit manually. This bug is fixed in the latest version of Renku.
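
The manual commit could look roughly like the following. It assumes that the dataset metadata changed by unlink sits somewhere under the project's .renku/ directory; the exact file is not named in this thread, so check the git status output before committing. The function name and the default commit message are placeholders.

```shell
# Hypothetical helper: stage and commit dataset metadata that
# `renku dataset unlink` left uncommitted in old Renku versions.
# Assumption: the modified metadata is under .renku/ (verify with git status).
# $1 = commit message (optional)
commit_dataset_metadata() {
  git status --short -- .renku    # show which metadata files were modified
  git add -- .renku               # stage only the metadata changes
  git commit -m "${1:-Unlink files from dataset}"
}
```

Run it from the repository root right after the unlink, for example: commit_dataset_metadata "Unlink metdata files from test-import-dataset".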

Thanks and kind regards,
Mohammad