Proper use of renku dataset to link data from project to project

Hello,

I’ve been working these last two days on sv-renku, quite successfully (new projects, renkuized old projects). I intend to properly link the data produced by project 1 to project 2, where it will be used as an input. I’ve created a dataset with renku dataset in project 1, and added the path to the dataset by accessing the repo through https (that seems kind of circular to me, so might be ill-advised). When I do that, the dataset is not saved.

I’ve now tried a different approach: I’m creating the dataset directly in project 2, passing the path to the data in project 1 through the --source option (https access to the project 1 repo). I managed to copy the data, but still nothing shows up in the graph in the GUI. Is this way of proceeding correct? Or should I create the dataset in project 1 and then import it into project 2 with renku dataset import?

Thanks !

Cyril

Hi Cyril,

It’s better to use renku dataset import to import a dataset from another renku project (in your case from project1). That way the proper metadata is copied over from the original dataset.
If you are using the CLI to add data, please post the command you used so that I can take a look.
Note that after you push changes to the project, it takes a while (normally a few minutes) for the KG to be built and for updates (e.g. new datasets) to become visible in the GUI.

Regards,
Mohammad

Hi Mohammad, thanks for your help

Here’s what I did in project 1:

renku dataset create dataset1

renku dataset add --source path/to/data/in/project1 dataset1 https://to-project1-repo.git

The dataset still does not appear in the KG (the KG does not build at all in fact in project 1), although I’ve verified it exists and contains the files as expected in project1 by running renku dataset ls-files dataset1

Now, I tried importing dataset1 in project2 with:

renku dataset import https://to-project1-repo.git

but the command fails:
Error: Invalid parameter value - Could not process https://sv-renku-git.epfl.ch/pulver/cisbp-to-meme.git. Couldn't test provider <class 'renku.core.commands.providers.dataverse.DataverseProvider'>: Expecting value: line 1 column 1 (char 0) Reason: provider not found for https://sv-renku-git.epfl.ch/pulver/cisbp-to-meme.git Hint: Supported providers are: dataverse, renku, zenodo

I’m confused as to what renku dataset import expects as an input.

Also, could the systematic failure of the KG building process be due to specifics of the SV deployment of Renku? We are running Renku version 0.6.2 (April 29th 2020).

So, the first dataset add operation succeeds since you can list the files. To see them on KG make sure that:

  1. Changes are pushed to the remote repo
  2. KG is activated for the project. If project1 is private then you need to click on the “Activate KG” button that shows up in the project UI page.
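As a rough sketch of the sequence discussed so far (create the dataset, add the files, check them, then push so the KG can pick up the changes), the commands can be assembled like this. The helper name dataset_publish_commands and the example.org URL are made up for illustration; the renku commands are the ones quoted earlier in this thread, and nothing is executed here.

```python
import shlex


def dataset_publish_commands(name, source_path, repo_url):
    """Return the CLI steps from this thread, in order, as argv lists.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    return [
        ["renku", "dataset", "create", name],
        ["renku", "dataset", "add", "--source", source_path, name, repo_url],
        ["renku", "dataset", "ls-files", name],  # local check that the files are there
        ["git", "push"],  # the KG only sees what has reached the remote repo
    ]


# Placeholder values; replace them with your own dataset name, path and repo URL.
for argv in dataset_publish_commands(
    "dataset1", "path/to/data", "https://example.org/project1.git"
):
    print(shlex.join(argv))
```

Each printed line can be pasted into a terminal inside the project. The final git push is the step that is easy to forget.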

To import a renku dataset, you need to use the dataset’s URL from the UI (and not the Git link). URLs have a format like https://sv-renku.epfl.ch/datasets/<dataset-id> or https://sv-renku.epfl.ch/projects/<user>/<project>/datasets/<dataset-id>.
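To make the difference between the two kinds of URL concrete, here is a small heuristic check. It is only a sketch based on the two URL formats given above; looks_like_dataset_url is a hypothetical helper, not part of renku, and it is not how renku itself picks a provider.

```python
from urllib.parse import urlparse


def looks_like_dataset_url(url):
    """Heuristic: the URL path ends in 'datasets/<dataset-id>'.

    Assumption: this mirrors only the two URL formats quoted in this thread.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    return (
        len(parts) >= 2
        and parts[-2] == "datasets"
        and not parts[-1].endswith(".git")
    )


candidates = [
    "https://renkulab.io/datasets/f4c204e3-99ec-43d0-b97e-e7d6926c8c13",
    "https://sv-renku-git.epfl.ch/pulver/cisbp-to-meme.git",
]
for url in candidates:
    kind = "dataset URL" if looks_like_dataset_url(url) else "not a dataset URL"
    print(f"{url}: {kind}")
```

The Git link from the failing command is reported as not being a dataset URL, which is consistent with the "provider not found" error.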

KG must be activated for your project1 if you want to import from that project.

I don’t know about the failure of the KG in the SV deployment. It’s better to ask your devops team about it.

Ok thanks for the clear answer !

I’m pretty sure I accepted KG activation when I created the two projects, even though they are private.

I’m sending a message to the team at SV-IT in charge of the deployment for this KG build problem.

Update: the KG build was activated/repaired by the SV-IT team. Now we still suffer from the known issue that deleted datasets still haunt the GUI.

Thanks again for your help !

Cyril

Hi everyone.

I seem to have a similar issue. I have a renku project that I use as data repository. The KG is activated for this project.

In another renku project I try to import the data using the following command:
renku dataset import https://renkulab.io/datasets/<ID>

The command fails with the following error:
Error: Cannot access knowledge graph: […]
Response code: 500

Are there additional steps required to access a dataset from another project?

Hi @rchris,

There are no more steps required to import a dataset.

Can you please post the complete error message and dataset’s ID?

Thanks!
Mohammad

Hi Mohammad

Thx, for the quick response.

This is the target project to where I try to import the dataset:
https://renkulab.io/projects/abiz-lab/roche/toolchain/notebook

This is the command I use (fails):
renku dataset import https://renkulab.io/datasets/67003e0c-0ae5-4663-9678-ae281ea8034d

I tried with a different, public dataset (works):
renku dataset import https://renkulab.io/datasets/f4c204e3-99ec-43d0-b97e-e7d6926c8c13

The project seems to be private and I don’t have access to it. Can you please try the following to see if it solves the issue:

  • Go to project’s gitlab page (go to renkulab and click on “View in Gitlab” on the upper right)
  • In gitlab, go to the project’s Settings > Webhooks (in the left pane)
  • Scroll down the page and under Project Hooks delete the one for renku lab: https://renkulab.io/webhooks/events
  • Go back to project’s page in renkulab; refresh the page and the orange button to Activate KG for the project should appear again; click on it and wait for the project to be indexed.
  • Retry the import and now it should work
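For anyone who prefers scripting the webhook cleanup instead of clicking through the UI, the GitLab REST API exposes project hooks (list with GET /projects/:id/hooks, remove with DELETE /projects/:id/hooks/:hook_id). The sketch below only picks out the renkulab hook from such a list and builds the delete endpoint; it makes no network calls. The API base https://renkulab.io/gitlab/api/v4 and the helper names are assumptions, and re-activating the KG still has to be done from the renkulab project page.

```python
from urllib.parse import quote

RENKU_HOOK_URL = "https://renkulab.io/webhooks/events"


def renku_hook_ids(hooks, hook_url=RENKU_HOOK_URL):
    """Return the ids of hooks (dicts with 'id' and 'url') that point at renkulab."""
    target = hook_url.rstrip("/")
    return [h["id"] for h in hooks if h.get("url", "").rstrip("/") == target]


def hook_delete_endpoint(api_base, project_path, hook_id):
    """Build the URL for DELETE /projects/:id/hooks/:hook_id.

    GitLab accepts the URL-encoded project path in place of the numeric id.
    """
    return f"{api_base}/projects/{quote(project_path, safe='')}/hooks/{hook_id}"


# Example hook list in the shape returned by GET /projects/:id/hooks (made-up ids).
hooks = [
    {"id": 11, "url": "https://renkulab.io/webhooks/events"},
    {"id": 12, "url": "https://example.org/other-hook"},
]
for hook_id in renku_hook_ids(hooks):
    # Assumed API base for the GitLab behind renkulab.
    print(hook_delete_endpoint("https://renkulab.io/gitlab/api/v4", "abiz-lab/roche_data", hook_id))
```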

Please let me know if it worked or not.

Mohammad

Thx Mohammad.

I followed your instructions to re-enable the KG, but it did not resolve the problem.

However, I gave your user access to the two mentioned projects, if you like to take a look.
Currently, this is a testbed with no real data or code. Feel free to play around with it if needed.

  • dataset project: https://renkulab.io/gitlab/abiz-lab/roche_data
  • data-consumer project: https://renkulab.io/gitlab/abiz-lab/roche/toolchain/notebook

Thanks for the access right!

It turned out that there is a bug where projects without a README.md file won’t get indexed in the KG. I’ve added a README.md file to your project (sorry about that) and re-activated the KG for it. Now the project is in the KG and the import worked for me locally.

This bug is fixed in a newer version of renku and will be fixed in renkulab once we upgrade it.
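Until the fix is deployed, a quick local check for the README.md requirement might look like the sketch below. has_root_readme is a hypothetical helper; it compares the file name exactly, since the name that worked in this thread is README.md, and the temporary directory merely stands in for a cloned project.

```python
import tempfile
from pathlib import Path


def has_root_readme(repo_root):
    """True if a file named exactly 'README.md' sits in the repo root.

    Directory entries are compared by name so that 'readme.md' does not
    pass on case-insensitive file systems.
    """
    return any(
        entry.name == "README.md" and entry.is_file()
        for entry in Path(repo_root).iterdir()
    )


# Usage on a throwaway directory standing in for a cloned project.
with tempfile.TemporaryDirectory() as repo:
    print("before:", has_root_readme(repo))
    (Path(repo) / "README.md").write_text("# My project\n")
    print("after:", has_root_readme(repo))
```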

Best,
Mohammad

Thx a lot mohammad. This is indeed not something I would have come up with in my bug fixing attempts :slight_smile:

I have now continued to create our test setup on renkulab. It still doesn’t quite work for me:

I’ve re-created a data project WITH a readme and added a dataset. This is the dataset I try to access:
https://renkulab.io/gitlab/abiz-lab/roche/toolchain/data/-/blob/4ef4eebe257255e9437f11e081ce217b4ab4742d/.renku/datasets/3c53bc7b-5ffa-4443-a09c-b305b4a069f8/metadata.yml

Again, I try to cross-import the dataset to this project:
https://renkulab.io/gitlab/abiz-lab/roche/toolchain/notebook

This is the command I use:
renku dataset import https://renkulab.io/datasets/3c53bc7b-5ffa-4443-a09c-b305b4a069f8

This is the error I get:
Error: Cannot access knowledge graph: https://renkulab.io/knowledge-graph/projects/toolchain/data
Response code: 500

It seems not to handle the group hierarchy correctly.
Do you have some advice for me?
Thx a lot.

fyi: I feel bad to keep you that busy with this. We are currently evaluating how we’re gonna use renkulab in the future and are about to introduce it at our university. This is part of the eval, and our support requests will naturally decrease :slight_smile:

Hello,

No worries at all! I’m happy to help!

I am not sure if I understand your question about group hierarchy; do you mean internal visibility?

Also, how do you create these projects? Through UI using a pre-defined template? Are you using a custom template?

Make sure that the readme file name is README.md and that it is in the root of your repo. Also try the steps that I proposed above (Proper use of renku dataset to link data from project to project).

BTW, could you import the other dataset: renku dataset import https://renkulab.io/datasets/67003e0c-0ae5-4663-9678-ae281ea8034d?

Best,
Mohammad

Hi Mohammad

Thx for your feedback. Please find my responses below:

not sure if I understand your question about group hierarchy; do you mean internal visibility?

No, I mean groups and subgroups in GitLab.

Also, how do you create these projects? Through UI using a pre-defined template? Are you using a custom template?

I’ve created them using the Renku-Lab WebPortal. I used the template “Minimal Renku”

Make sure that the readme file name is README.md and it is in the root of your repo.

Yes, that is the case.

Let me point out the problem with sub groups in detail:

The data-project is located at Reproducible Data Science | Open Research | Renku
(group: abiz-lab, subgroup1: roche, subgroup2: toolchain).

Renkulab translates the dataset-ID to the path https://renkulab.io/knowledge-graph/projects/toolchain/data

It ignores the group and the subgroup1.
It seems as if renkulab datasets only work if the project is located in the root, and not within subgroups.
Can you verify that?

Thanks for clarification! What you observed is true, Renku does not support subgroups in Gitlab. I’ll create an issue for this use-case.
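To illustrate the observation, the sketch below reproduces the URL from the error message by keeping only the last two segments of the project path. This is purely a hypothetical reconstruction of the symptom, not Renku’s actual code; the helper names are made up, and the URL built from the full path is only what one would presumably expect once subgroups are supported.

```python
def kg_project_url(base_url, project_path):
    """Build a knowledge-graph project URL in the shape seen in the error message."""
    return f"{base_url}/knowledge-graph/projects/{project_path}"


def last_two_segments(project_path):
    """Keep only '<namespace>/<project>', dropping any parent groups.

    Hypothetical: this mimics the observed behaviour, nothing more.
    """
    return "/".join(project_path.split("/")[-2:])


full_path = "abiz-lab/roche/toolchain/data"  # group / subgroup1 / subgroup2 / project

# URL reported in the error (group and subgroup1 are missing):
print(kg_project_url("https://renkulab.io", last_two_segments(full_path)))
# URL built from the full path, for comparison (an assumption, see above):
print(kg_project_url("https://renkulab.io", full_path))
```

For a project directly under a single group the two URLs coincide, which fits the fact that the earlier import from abiz-lab/roche_data worked.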

Best,
Mohammad