Renku dataset edit

I’d like to improve the metadata of my Renku datasets so that it is correct in Renku, and also in preparation for export to and publication on Zenodo. For my particular use case, I often create and publish datasets on behalf of others, so I include their details (e.g. mailto, affiliation, name, email etc.) rather than mine.

I understand that the metadata is retrieved from the dataset and my GitLab user. Is this correct?

Is there anywhere else that this metadata comes from, that I would be able to add further information to make it more correct when it is created so less editing is required?

I currently edit it using
renku dataset edit [OPTIONS] DATASET_ID

Are there any other ways of editing this metadata that I might have missed?
Are there any plans to be able to edit the file to be able to define fields in an easier way through the Renku cli?
i.e. something like renku dataset edit affiliation "institution-name"
This would be very useful!

Finally, I often end up with a discrepancy between the metadata in Renku and that in Zenodo because I edit fields in Zenodo before publication, that are not automatically filled during the dataset export. Can you recommend the best way that I could go about importing the correct metadata into Renku from Zenodo?

Hi @jen-thomas thanks for this request and use-case! We have an issue currently open and part of the current sprint to implement exactly this - you can follow it here: https://github.com/SwissDataScienceCenter/renku-python/issues/1074

We recently released a UI upgrade that allows you to add datasets through the UI. The plan is to also allow for metadata editing in the same place so it should be a bit more user-friendly than through the CLI.

For the second part - I guess what you have in mind is a sync between the renku dataset metadata and the metadata on zenodo for the same dataset after publication?

Hi @rrrrrok, that’s great news!

We recently released a UI upgrade that allows you to add datasets through the UI. The plan is to also allow for metadata editing in the same place so it should be a bit more user-friendly than through the CLI.
That sounds great. I’d be happy to have it through the CLI too though, so I can do everything in one place and would be able to automate it in the future.

Regarding Zenodo, there are a few things:

  • I find that some of the metadata are not populated correctly in the Zenodo fields, e.g. Zenodo requires author names in the format “surname, firstname” but they are populated as “firstname surname”
  • some fields are not populated at all during the upload, e.g. keywords, license
  • I’d like to be able to populate the other fields directly from the Renku dataset if possible (I realise some are mandatory and others are not), e.g. I always have multiple authors, contributors etc.
  • as you said, it would be brilliant to be able to sync from Zenodo metadata to Renku dataset metadata. I don’t know if the renku dataset import already does this - it’s not something I’ve used recently.
  • I also sometimes notice changes that need to be made to a dataset (missing metadata or a change to a file) after I have exported it to Zenodo but before it has been published. In this case, I understand that I currently need to make the changes, then do the export again and repopulate the Zenodo metadata. It would be really useful to be able to get the UID of the Zenodo record and make the changes to an existing, unpublished record (I’ve previously used curl to upload some large files to a particular Zenodo record that needed modifications before publication).
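
Until the name-format issue in the first bullet is fixed, the conversion could be done by hand or scripted. The following is only a minimal sketch: the function name to_zenodo_name is hypothetical, and it assumes that the last whitespace-separated token of a name is the surname, which will not hold for every name.

```python
def to_zenodo_name(full_name: str) -> str:
    """Convert 'Firstname Surname' to 'Surname, Firstname'.

    Assumption (not from Renku or Zenodo): the last whitespace-separated
    token is the surname. Names that already contain a comma, or that
    consist of a single token, are returned unchanged.
    """
    name = " ".join(full_name.split())  # collapse repeated whitespace
    if "," in name or " " not in name:
        return name
    given, surname = name.rsplit(" ", 1)
    return f"{surname}, {given}"


print(to_zenodo_name("John Doe"))
print(to_zenodo_name("Doe, John"))
```

Names with multi-word surnames would need to be handled manually, since the sketch cannot tell where the given names end.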

I see, thanks for clarifying! It’s good to know that the CLI is important in this case since mostly we get requests for adding UI functionality.

I will make an issue for the first two points - those sound like bugs. Regarding multiple authors etc. we have added the option to specify creators on the command line recently:

renku dataset create --help
Usage: renku dataset create [OPTIONS] SHORT_NAME

  Create an empty dataset in the current repo.

Options:
  --title TEXT            Title of the dataset.
  -d, --description TEXT  Dataset's description.
  -c, --creator TEXT      Creator's name and email ("Name <email>").
  -h, --help              Show this message and exit.

Are you using that already?

re: syncing metadata - no, import will not sync metadata of an existing renku dataset. We need to think about how to make the appropriate links between the renku-only dataset (a draft I guess) and the published zenodo dataset. This will need a bit of work.

Regarding your last point - do you mean to make those changes in the zenodo interface (i.e. you just want Renku to give you the URL of the draft dataset) or you want to make changes in Renku (perhaps from a script) and push those changes to the draft?

I would definitely like to have the same functionality through the CLI: ideally it would be nice to properly incorporate renku into our workflow, which would mean needing to automate some steps.

Another small point regarding automation and using the CLI:
As I mentioned automating some processes in the future, I would also be really keen on getting the UID of the Zenodo record and / or the publication DOI when the dataset is exported (with or without the --publish option).

Re: renku dataset create
Ah I hadn’t seen that it is possible to add those bits of metadata in that way: that’s brilliant! I’ll try that out :slight_smile: Sorry I missed it.

For the last point about syncing with Zenodo, perhaps if I explain the problem that I have a little better that would help:
Part 1: Currently, I create a renku dataset, then edit the dataset metadata and export it to Zenodo. I then add the additional metadata in the Zenodo interface and publish it through Zenodo. I then end up with a mismatch between the metadata in Renku and that in Zenodo. Of course this will be easier as the metadata is improved anyway :slight_smile:
Part 2: If I then notice that one of the dataset files needs updating before publication, I add the file to the Renku dataset again and as far as I know, I cannot then “export” or add this file to the Zenodo record that I have already created and modified, so I need to do a new export then re-enter the additional metadata.

So I guess it would be useful to be able to make the changes preferentially in Renku and push them to the draft Zenodo record (for both data files and dataset metadata).

Iirc, it should be possible to update an unpublished zenodo deposition, but I’m not 100% sure - I’m checking with Sam who implemented that feature.

About the first part of your question - the issue to track is this one: https://github.com/SwissDataScienceCenter/renku-python/issues/1074 and it’s being worked on as we speak so there will be a PR soon.

Hello Jen,

Thank you for your feedback. As rrrrrok pointed out, it is possible to update an unpublished Zenodo deposit; however, this update will overwrite any edits that you have not synced with your local renku repo.

For this case we would have to build a syncing mechanism that ensures that your local repository state matches the remote one (Zenodo in this case); this goes for both data and metadata.

I will open a ticket for this since I think it should be theoretically possible to do this syncing (at least as far as I can tell for the Zenodo case; I am not sure if the same applies to Dataverse).

Again, thank you for pointing this out to us, we will try to address it as soon as we find resources to do so.

@jsam, @rrrrrok thanks both for your replies. Sounds good :slight_smile:

Hi @jen-thomas,

We implemented the new dataset editing feature. Now, you can run renku dataset edit <dataset-name> and pass --creator, --title, and --description (or their shorter counterparts: -c, -t, and -d) to set a new list of creators, title, and description for a dataset.

For example, executing renku dataset edit my-dataset -c 'John Doe <john.doe@example.com> [An Affiliation]' will change the creators list of the dataset to John Doe and set his email and affiliation.
If a dataset has more than one creator, you can pass multiple -c options. The string format for a person’s name, email, and affiliation is Name <email> [Affiliation]. Both the email and the affiliation are optional.
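
For anyone scripting this, the creator strings and the repeated -c options could be assembled along the following lines. This is a sketch only: format_creator and edit_command are hypothetical helper names, "Jane Roe" is made-up example data, and the only things taken from the description above are the Name <email> [Affiliation] format and the -c option of renku dataset edit. The command is printed rather than executed.

```python
import shlex
from typing import List, Optional


def format_creator(name: str, email: Optional[str] = None,
                   affiliation: Optional[str] = None) -> str:
    """Build a 'Name <email> [Affiliation]' string.

    Email and affiliation are optional, as described above.
    """
    parts = [name]
    if email:
        parts.append(f"<{email}>")
    if affiliation:
        parts.append(f"[{affiliation}]")
    return " ".join(parts)


def edit_command(dataset: str, creators: List[str]) -> List[str]:
    """Return an argument list with one -c option per creator."""
    argv = ["renku", "dataset", "edit", dataset]
    for creator in creators:
        argv += ["-c", creator]
    return argv


creators = [
    format_creator("John Doe", "john.doe@example.com", "An Affiliation"),
    format_creator("Jane Roe"),  # hypothetical second creator, name only
]
argv = edit_command("my-dataset", creators)

# Print a shell-quoted version for inspection instead of running it.
print(" ".join(shlex.quote(arg) for arg in argv))
```

The argument list could be passed to subprocess.run once the printed command looks right; keeping it as a list avoids quoting problems with the angle and square brackets.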

To see the current list of creators for your datasets, just run renku dataset -c short_name,creators_full. You can use this information when editing your datasets.

This feature is not merged yet but you are welcome to try it. If you are using pipx to install renku (e.g. if you are using renkulab) then run pipx install --editable --force git+https://github.com/SwissDataScienceCenter/renku-python.git@1074-edit-dataset-metadata#egg=renku to install the branch that contains this feature.

Please let me know if you have any questions or comments.

Hi @mohammad-sdsc - that sounds great, thanks very much! I look forward to trying it out!

Hi @mohammad-sdsc - I tried installing that version but cannot create a repo at the moment: I get a bug with renku init repo-name so I’ll wait until it is a bit more stable and give it a go then.

@jen-thomas can you paste the stacktrace? I tested this and didn’t get the same problem.

(venv) jen@jen:~/projects/ace_data_management/renku-projects$ renku --version
0.9.2.dev19
(venv) jen@jen:~/projects/ace_data_management/renku-projects$ renku init test-repo
Ahhhhhhhh! You have found a bug. :beetle:

  1. Open an issue by typing “open”;
  2. Print human-readable information by typing “print”;
  3. See the full traceback without submitting details (default: “ignore”).

Please select an action by typing its name (open, print, ignore) [ignore]: print

Describe the bug

A clear and concise description.

Details

Please verify and redact the details.

Renku version: 0.9.2.dev19
OS: Linux (#1 SMP Debian 4.19.98-1 (2020-01-26))
Python: 3.7.3

Traceback

Traceback (most recent call last):
  File "[...]/renku/cli/exception_handler.py", line 119, in main
    result = super().main(*args, **kwargs)
  File "[...]/renku/cli/exception_handler.py", line 90, in main
    return super().main(*args, **kwargs)
  File "[...]/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "[...]/site-packages/click/core.py", line 1134, in invoke
    Command.invoke(self, ctx)
  File "[...]/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "[...]/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "[...]/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "[...]/renku/cli/__init__.py", line 204, in cli
    check_for_migration()
  File "[...]/renku/core/commands/client.py", line 103, in new_func
    result = ctx.invoke(method, client, *args, **kwargs)
  File "[...]/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "[...]/renku/core/commands/migrate.py", line 29, in check_for_migration
    if is_migration_required(client):
  File "[...]/renku/core/management/migrate.py", line 43, in is_migration_required
    _is_renku_project(client) and
  File "[...]/renku/core/management/migrate.py", line 92, in _is_renku_project
    return client.project is not None
  File "[...]/site-packages/werkzeug/utils.py", line 90, in __get__
    value = self.func(obj)
  File "[...]/renku/core/management/repository.py", line 153, in project
    return Project.from_yaml(self.renku_metadata_path, client=self)
  File "[...]/renku/core/models/jsonld.py", line 563, in from_yaml
    __source__=deepcopy(source)
  File "[...]/renku/core/models/jsonld.py", line 544, in from_jsonld
    self = cls(**data_)
  File "<attrs generated init renku.core.models.jsonld.Project>", line 17, in __init__
    self.__attrs_post_init__()
  File "[...]/renku/core/models/projects.py", line 95, in __attrs_post_init__
    self.client.renku_metadata_path, return_first=True
  File "[...]/renku/core/management/repository.py", line 235, in find_previous_commit
    file_commits = list(self.repo.iter_commits(revision, paths=paths))
AttributeError: 'NoneType' object has no attribute 'iter_commits'

Additional context

Add any other context about the problem.

@mohammad-sdsc I really like this, thanks very much! Just tested it out with the new version (0.10.0) :slight_smile:

I also like being able to view what I have in the metadata - that is super helpful.

It works without the name of the dataset:

renku dataset -c creators_full

but I expected it to need the dataset name as you said above:

renku dataset -c ammonia-concentration-surface-seawater,creators_full
Error: Invalid parameter value - Invalid column name: "ammonia-concentration-surface-seawater".
Possible values: id, created, short_name, creators, creators_full, tags, version, title

Happy to hear that feature helped :slight_smile:!

The -c option takes column names, not dataset names. To see a dataset’s name, use the short_name column. For example, to see each dataset’s name (i.e. short_name), creators, and human-readable title:

$ renku dataset -c short_name,creators_full,title

SHORT_NAME    CREATORS                      TITLE
------------  ----------------------------  --------------------------------
my-dataset    Name <name@email.com> [SDSC]  Dataset containing COVID-19 data

You can use any combination of columns from this list only: id, created, short_name, creators, creators_full, tags, version, title.
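
In a script, the column list could be checked before calling renku, so that a dataset name passed by mistake is caught early. The sketch below is an illustration, not part of Renku: columns_arg is a hypothetical helper, and the allowed names are copied from the error message quoted earlier in this thread, so they may differ in other versions.

```python
# Column names as listed in this thread (renku 0.10.0); other versions may differ.
ALLOWED_COLUMNS = (
    "id", "created", "short_name", "creators",
    "creators_full", "tags", "version", "title",
)


def columns_arg(*columns: str) -> str:
    """Return a comma-separated value for 'renku dataset -c'.

    Raises ValueError if any name is not a known column.
    """
    unknown = [c for c in columns if c not in ALLOWED_COLUMNS]
    if unknown:
        raise ValueError(
            "Invalid column name(s): " + ", ".join(unknown)
            + ". Possible values: " + ", ".join(ALLOWED_COLUMNS)
        )
    return ",".join(columns)


print(columns_arg("short_name", "creators_full", "title"))
```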

Thanks for your reply @mohammad-sdsc.

Ah sorry, I think I misunderstood! So I guess if I had more than one dataset within the renku project, would they all be listed in the output of renku dataset -c short_name,creators_full,title?

I think I was trying to get the information for a specific dataset, so I was using the short_name of a particular dataset in that way.

Sorry for the confusion!

No problem! ATM, it’s not possible to list a specific dataset. A workaround is to use the Unix grep command.
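
The same filtering can also be done in a script instead of with grep. The following is a minimal sketch under a few assumptions: rows_for_dataset is a hypothetical helper, short_name is requested as the first column, short names contain no whitespace, and the table text has already been captured from renku dataset -c short_name,creators_full,title. The second row of the sample is invented for illustration.

```python
from typing import List


def rows_for_dataset(table_text: str, short_name: str) -> List[str]:
    """Return the table rows whose first column equals short_name.

    Unlike a plain grep, this matches the first column exactly, so a
    dataset called 'my-dataset-2' is not returned for 'my-dataset'.
    """
    matches = []
    for line in table_text.splitlines():
        fields = line.split()
        if fields and fields[0] == short_name:
            matches.append(line)
    return matches


# Sample output in the format shown above; 'other-data' is a made-up row.
SAMPLE = """\
SHORT_NAME    CREATORS                      TITLE
------------  ----------------------------  --------------------------------
my-dataset    Name <name@email.com> [SDSC]  Dataset containing COVID-19 data
other-data    Name <name@email.com> [SDSC]  Another dataset
"""

for row in rows_for_dataset(SAMPLE, "my-dataset"):
    print(row)
```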

@mohammad-sdsc would it make sense to enable a command like

renku dataset <dataset-name> 

that would display extra details about the dataset?

It won’t be possible without a verb, I believe; something like: renku dataset info/show/describe <dataset-name>. We already have a story for that: https://github.com/SwissDataScienceCenter/renku-python/issues/511

ok makes sense - I would then argue that renku dataset should just show the help text and renku dataset show/list/whatever would give info about the datasets