How to find the source of an imported dataset?

Could someone tell me how to find out what was the source of an imported dataset?
I imported a dataset from github (ESSM_plotting) and then tried renku dataset show ESSM_plotting, but this only shows:

Name: ESSM_plotting
Created: 2023-11-20 19:03:35+01:00
Data Directory: modules/imported/ESSM_plotting
Creator(s): Stan Schymanski <stan.schymanski@env.ethz.ch>
Keywords: 
Version: 
Storage: 
Annotations: 
Title: ESSM_plotting
Description: 

How do I get the original github link?

I don’t think there’s an easy way to see this. Your best bet is probably to use renku dataset ls --format json-ld, find the entry for the dataset and check its sameAs property.
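The suggestion above can be sketched in Python as a small helper that walks the parsed JSON-LD and collects every value stored under a key ending in sameAs. This is only a sketch: the sample document, the example.org URL and the exact key spelling (http://schema.org/sameAs) are assumptions for illustration, not verified renku output.

```python
import json


def find_property(node, suffix="sameAs"):
    """Recursively collect every value stored under a key ending in `suffix`."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key.endswith(suffix):
                found.append(value)
            # Keep descending, since the property may be nested deeper.
            found.extend(find_property(value, suffix))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_property(item, suffix))
    return found


# Hypothetical, heavily shortened JSON-LD; real output will contain many more fields.
sample = json.loads("""
[
  {
    "http://schema.org/name": [{"@value": "ESSM_plotting"}],
    "http://schema.org/sameAs": [{"@id": "https://example.org/original-dataset"}]
  }
]
""")

matches = find_property(sample)
print(matches)
```

To try it on a real project, one option is to redirect the output of renku dataset ls --format json-ld into a file and load that file with json.load instead of using the inline sample. If the list comes back empty, the dataset has no dataset-level sameAs entry.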

Hi @schymans.
I’m guessing you are using our renku CLI. However, if using our API is an option for you, you can call https://renkulab.io/knowledge-graph/datasets/63154e15a55b4db1abbf4ea868e7111c. In the response you can find the previous version of the dataset (derivedFrom) and the initial one (versions.initial, or follow the initial-version link listed under the _links property).
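As a rough sketch of the API route, the Python below fetches such a response and pulls out the three fields mentioned above. Only the names derivedFrom, versions.initial, _links and initial-version come from this thread; the assumption that _links is a list of objects with rel and href keys, the Accept header, and the sample payload are mine and may not match the real response.

```python
import json
import urllib.request


def fetch_dataset(url):
    """Fetch a dataset description from the knowledge-graph API as parsed JSON."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def provenance_summary(payload):
    """Extract the provenance-related fields from a parsed response."""
    initial_link = None
    # Assumption: _links is a list of {"rel": ..., "href": ...} objects.
    for link in payload.get("_links", []):
        if isinstance(link, dict) and link.get("rel") == "initial-version":
            initial_link = link.get("href")
    return {
        "derivedFrom": payload.get("derivedFrom"),
        "initial": (payload.get("versions") or {}).get("initial"),
        "initial_version_link": initial_link,
    }


# Hypothetical payload, used here instead of a live request.
example_payload = {
    "versions": {"initial": "abc123"},
    "_links": [
        {"rel": "self", "href": "https://example.org/datasets/def456"},
        {"rel": "initial-version", "href": "https://example.org/datasets/abc123"},
    ],
}

summary = provenance_summary(example_payload)
print(summary)
```

Against the live service, provenance_summary(fetch_dataset(url)) with the URL from the post above would be the equivalent call; fields that are missing from the response simply come back as None.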

Hi @jachro, Thanks for your help. Unless I made a mistake somewhere, the dataset should be linked to the GitHub repository schymans/ESSM_plotting (Python modules to simplify plotting of equations generated using ESSM), but I don’t see any mention of this link in the metadata. Perhaps I deleted the wrong one when I removed the old, local dataset in order to re-import it?

Thanks @ralf.grubenmann, I do find the link in the output, but I would never be able to find it if I didn’t know what to look for. Considering that transparent re-use of data and giving credit to the sources of datasets is an important motivation for renku, it would be very important to see these links easily and prominently. Is this on the todo-list for the next renku update?

Hi @schymans, Thank you for your feedback! I agree that making dataset provenance easily understandable is important for Renku, and not something we do a good job at currently! This is not on our immediate roadmap (which you can find here, by the way), but we are planning to overhaul Renku’s dataset concept in 2024.

Hi @schymans. Actually, the command you suggest in your GitHub repo might work a bit differently from what you expect. When you initialise a Dataset this way

renku --no-external-storage dataset add --create ESSM_plotting \
git@github.com:schymans/ESSM_plotting.git

a new Dataset is created and the files from the provided GitHub repo are copied into it. Although information about the origin of each file is added to the metadata (at the level of each file), the Dataset is still treated as a created (not imported) Dataset, and no link stating that it is the same as the GitHub Dataset exists. We only add such a link (at the Dataset level) when the dataset import command is used. The import command would probably be more suitable in your case but, unfortunately, it is currently not possible to issue it with a link to a git repository.

Aaah, thank you! So the clean way would be to publish the dataset on zenodo first and then import from zenodo? So could I e.g. execute:
renku --no-external-storage dataset import --datadir modules/imported/ https://zenodo.org/records/4380979?

Yes, this command should do the job but, unfortunately, recent changes on the Zenodo API cause our import to fail. I know @ralf.grubenmann was looking into this but cannot say what’s the status now.


Hmmm, I just tried it out again using renku 2.7.0, and renku dataset add does actually preserve the link. Here is what I did:

renku --no-external-storage dataset add modules --source modules/* git@renkulab.io:wave/li-6800.git

Today, I found a bug, so I modified a file in the modules folder in renkulab.io:wave/li-6800. Then I ran:

(main *%)stan:~/notebooks/jupyter/renku/fortress$ renku dataset update modules
Checking git files for updates:   0%|               | 0.00/1.00 [00:00<?, ?iB/s]Cloning remote repository...
Updated 1 files 

The changed file is now updated in the repo where I executed these commands, so the link to the source is evidently preserved even if I use renku dataset add instead of renku dataset import. The command renku dataset ls --format json-ld, as suggested by @ralf.grubenmann, also worked very well, showing the links to the origins of the different files in the datasets. Thanks!
