Dataset is not in KG

Hi,
I’m testing some dataset import/updates and I’m wondering about some datasets that are not in the KG after import followed by an update.

More precisely:

  1. I generated a test_dataset_1 in this project. At that time the dataset had the ID fe4ea92a05774811813d9daf26c8f69b.
  2. I imported that test dataset into a second project. That all worked fine and I could find the dataset with a new ID (ffa44150b88347c9a0f1f14ad43d204a) pointing to the dataset in project 1.
  3. I added a new file to dataset 1 and updated the dataset in project 1.
  4. I updated the dataset in the second project by running renku dataset update test_dataset1. This showed a message that the dataset was updated, and the new file was available in project 2.
    However, the dataset is now shown with a yellow warning flag as not being in the KG at RenkuLab.
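
For anyone who wants to repeat step 4 from a script, here is a minimal sketch. It only uses the two commands quoted in this thread (renku dataset update and renku dataset ls); the function names and the project_dir argument are my own and purely illustrative.

```python
import shutil
import subprocess
from typing import List


def update_and_list_commands(dataset_name: str) -> List[List[str]]:
    """Commands for step 4, plus a listing to read back the dataset ID afterwards."""
    return [
        ["renku", "dataset", "update", dataset_name],
        ["renku", "dataset", "ls"],
    ]


def run_in_project(dataset_name: str, project_dir: str = ".") -> None:
    """Run the commands inside a Renku project checkout, stopping at the first failure."""
    if shutil.which("renku") is None:
        raise RuntimeError("renku CLI not found on PATH")
    for command in update_and_list_commands(dataset_name):
        # check=True raises CalledProcessError if a command exits with a non-zero status
        subprocess.run(command, cwd=project_dir, check=True)
```

Calling run_in_project("test_dataset1") from the second project would run the update and then print the dataset list, so the ID can be compared with what the KG reports.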

I also cannot find a dataset with the new ID shown in the second project (see the screenshot below) when querying the KG endpoint.
[Screenshot from 2022-05-13 09-02-42]

The KG status of all projects is shown as active, and the template and renku versions are the latest. I waited a day to see whether it was just a matter of time. The same thing happened when I imported and updated that dataset in a third project.

This is a minor detail, since the dataset is available and the updates themselves work, but I’m optimizing our dataset queries across projects and was wondering whether there is a reason for this.

Thanks in advance.
Best,
Almut

Thanks for drawing our attention to this problem and providing a concrete example. We are investigating and will get back to you as soon as we understand what is causing it.

Hi @Almut
Sorry, it took me a while to get back to you, but I had to find the cause of the problem. It turned out that there is an issue on the KG side, and some of the operations you did for the second and third projects didn’t get through to the KG. As a result, the datasets you mention are not visible in it.
There’s an issue I’ve just created for this and we’ll try to work on it as soon as possible.

Many thanks for reporting the problem in such a detailed way!

Thanks a lot for looking into this and opening the issue!

Hi @Almut. Just to let you know we rolled out a new version and the problem you reported here is solved now. You can check here.

Hi all,
I’m very sorry, but somehow I came across the same or a similar problem again with this dataset. The problem is basically identical to the one described above, but the underlying problem could still be different?

In short I created the renku dataset yesterday and added quite a few (72) files. This all seemed to have worked fine and the files are displayed as added from within the project:

Unfortunately the dataset is still not updated in the KG:

When querying for it using the KG endpoints, I find a previous version without files (https://renkulab.io/datasets/8eda0b02800e4784b9773fc286a34608).

renku dataset ls shows the following id: 53922202e38c4e7996e1fafa39dc8292, but https://renkulab.io/knowledge-graph/datasets/53922202e38c4e7996e1fafa39dc8292 links to a “resource not found”.
Do you have any idea what could have happened? Could the number or size of the files cause a problem (the files are medium-sized, 50-70 MB each)? But all files were successfully uploaded to Git LFS.
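
In case it helps with reproducing this, here is a minimal sketch of the check I am doing. The URL pattern is simply the one from the link above; the function names are my own, and I am assuming a public project (a private one would presumably need an access token, which this sketch does not handle).

```python
import urllib.error
import urllib.request

# Base URL taken from the link quoted above; an assumption for any other deployment.
KG_DATASETS_URL = "https://renkulab.io/knowledge-graph/datasets"


def kg_dataset_url(dataset_id: str, base_url: str = KG_DATASETS_URL) -> str:
    """Build the KG lookup URL for a dataset ID as printed by renku dataset ls."""
    return f"{base_url.rstrip('/')}/{dataset_id.strip()}"


def dataset_in_kg(dataset_id: str, timeout: float = 10.0) -> bool:
    """Return True if the lookup succeeds, False if the KG answers with HTTP 404."""
    request = urllib.request.Request(
        kg_dataset_url(dataset_id), headers={"Accept": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # other HTTP errors (auth, server side) are not evidence either way
```

With the ID above, dataset_in_kg("53922202e38c4e7996e1fafa39dc8292") is the call that corresponds to the "resource not found" page I see in the browser.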

Hi @Almut,

Yes, the problem manifests in a very similar way to the one we fixed some time ago. However, the underlying issue is slightly different. The metadata of the omnibenchmark/omni-batch-py/mnn-py repo seems to contain some workflow parameters in a shape that the KG provisioning process cannot work with (although they look correct). As a result, many commits of that project are in a failure status. I’ve done some initial investigation and will soon open an issue to address that.

Sorry for the inconvenience.
Jakub

Hi Jakub,
thanks a lot for looking into it! You are right, and now that you have pointed me to it, there are indeed suspicious/unintended parameters without a position associated with the workflow (parameter-ab12, parameter-c1b0). They seem to point to some parameter values (auto and sig, which are values for parameters 7 and 8), but they are not part of the original command line call that generated the workflow. Could I have added them through execution of the plan?
I will try to figure out when I introduced them. Maybe the problem is on my side of the code (I suspect somewhere here) and not on the renku side. I will look into it and let you know.
Almut

Thanks a lot, Jakub, for looking into it!
The problem was indeed the workflow. There I had specified explicit parameter values, but was using different parameter values (with the same name) in the command line call. This led to unbound parameters (without positions) and caused the problem in the KG provisioning process.
The parameter mismatch was not intended on my side. I cleaned the repo (unlinked dataset files, reverted activities, and removed the workflow and dataset) and it seems to work now :)
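
For anyone running into the same thing, this is roughly the check I would use to spot such parameters. It is only a sketch: PlanParameter is a hypothetical stand-in and not Renku's actual data model, and the labels parameter-7 and parameter-8 and the positions are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlanParameter:
    """Hypothetical stand-in for a workflow parameter; not Renku's real class."""
    name: str
    value: str
    position: Optional[int] = None  # None means not bound to a command-line position


def unbound_parameters(parameters: List[PlanParameter]) -> List[PlanParameter]:
    """Return the parameters that have no command-line position."""
    return [p for p in parameters if p.position is None]


# Illustrative data modelled on the case in this thread.
params = [
    PlanParameter("parameter-7", "auto", position=7),
    PlanParameter("parameter-8", "sig", position=8),
    PlanParameter("parameter-ab12", "auto"),
    PlanParameter("parameter-c1b0", "sig"),
]

print([p.name for p in unbound_parameters(params)])
# prints ['parameter-ab12', 'parameter-c1b0']
```

Anything this returns is worth a second look before pushing, since parameters of that kind are what the provisioning process could not handle here.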