Is it possible to publish code through Renku?

We would like to be able to publish the code of a Renku project/repo to get a DOI so that it can be referenced. I wondered if this is already possible?

Hi @jen-thomas, good question - it's not possible atm. Do you have something in mind like the github/zenodo integration?

Hi @rrrrrok, thanks for your answer.

Yes, something like that would be perfect. On the occasions where I have wanted to publish code so far, I have done just as you say: worked on it in Github, then published it through the integration with Zenodo.

But we are now looking to use Renku more as it was intended: import datasets from Zenodo, have the code there as well, create a finalised dataset from there and publish it through Zenodo. We would also like to publish the code, so it can be referenced. Does that make sense?

Yeah it makes sense - we could certainly just do something like package a tarball and ship it to zenodo (this is what github does afaik). The question is what to do with the data? Should it come with the project? Or would you have one project for code and one project for data?

Hi @jen-thomas and @rrrrrok. Thanks for linking me to this question. I think in some cases it would make sense to publish code and data in the same project. E.g. in my case I would use datasets from zenodo and other sources and derive a 'correction' to one data set. Some of the functions used here could be very useful for other projects and I think they are best published together with the 'example data' in one project - if that should be possible. But there might be other cases where separate projects make more sense.


Hi @sebastian-landwehr thanks for chiming in! Yeah I agree in that case it makes perfect sense - I was just curious in the case of datasets that get published via zenodo as individual entities, and they also happen to be parts of a renku project, would it then make sense to publish the same data again with the code or would it be better to have a separation? Apart from redundancy, I don't really see why you couldn't push the same data twice - basically, when you publish a dataset it's only for the purpose of providing the data as-is and making it possible to reference it in future work; then when you publish the full project (potentially with some pre-processing pipelines etc.) you are basically saving a snapshot of a work that uses that dataset.

Sorry for being verbose but does that make sense?

Hi @rrrrrok hm yes, I think what you suggest makes perfect sense apart from the redundancy, but I don't see a big issue. If someone wants to use the project to build upon it they can cite the project, and if they just use the data they can cite the dataset. @jen-thomas what do you think?

Hm… I'm not sure. I understand that whilst it seems useful to have everything bundled in one publication, I prefer having code and datasets (whether input, output etc.) separate from each other.

I could see a situation where someone wants to cite code that was used to process their own dataset. If that code was "bundled" with a dataset and other things, I feel that it would be very confusing: if someone sees the citation, how do they know it was the code that the person had used rather than the data? The part of the publication that is being cited is not explicit.

I also think it would be confusing if there were different versions of code and data: what would be the best thing to do if one of them had to change?

DOIs also have a "type" - in the case of DataCite, this is resourceType, which would be distinct for software and dataset. This is a one-to-one relationship in this case.

There are also ways now to easily link DOIs with a relationType from DataCite, using for example IsDerivedFrom.
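To make the point about resourceType and relationType concrete, here is a minimal sketch of how two separately published records could be described and linked. The field names follow the DataCite metadata schema (resourceTypeGeneral, relatedIdentifiers, relationType), but the helper function `datacite_record` and all DOIs are hypothetical, and the result is not a complete or validated DataCite record (mandatory fields such as creators and publisher are left out).

```python
import json


def datacite_record(doi, title, resource_type_general, related=()):
    """Build a minimal DataCite-style metadata dictionary (illustrative only).

    `related` is an iterable of (relation_type, doi) pairs.
    """
    return {
        "doi": doi,
        "titles": [{"title": title}],
        "types": {"resourceTypeGeneral": resource_type_general},
        "relatedIdentifiers": [
            {
                "relatedIdentifier": other_doi,
                "relatedIdentifierType": "DOI",
                "relationType": relation_type,
            }
            for relation_type, other_doi in related
        ],
    }


# Hypothetical DOIs, for illustration only.
code = datacite_record("10.1234/example-code", "Processing code", "Software")
derived = datacite_record(
    "10.1234/example-derived-data",
    "Corrected dataset",
    "Dataset",
    related=[("IsDerivedFrom", "10.1234/example-input-data")],
)

print(json.dumps(derived, indent=2))
```

With one record per resource, a citation of the code DOI is unambiguous, and the link from the derived dataset back to its input is carried by the relation rather than by bundling.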

Having said all of that, I think it would be very useful to be able to publish a summary of the workflow (again linking this to the other publications). I'm not sure if this is to do with the renku run command, I haven't used this :slight_smile:

There could also be the case that a dataset was imported from outside of the Renku project and it wouldn't necessarily be right to publish it with everything else if the creator / originator was different, for example, to that of the code. Or vice versa.

Jen's points make lots of sense.

I agree that it would be ideal to observe a reasonably strict separation of data and code - however it might not always be possible or practical. In that case, I see the "tarball to zenodo" approach as basically a means of taking a snapshot of the project, so that for the purposes of reproducibility one could always retrieve the state exactly as it was. In practice, I don't think anyone initiates work from a zenodo snapshot of code - they just go to the github repo and start from there. I see the same thing happening here - if someone wanted to continue work on the project, they would fork the project. If they wanted to just use the data, they would pull the original data.

And yes, part of what I'm thinking is preserving the full provenance of the project - in renku, we store all the necessary information to rerun an analysis in the project so for replication/reproducibility this could work more or less out of the box if the data is also there.

otoh, we must also be mindful of the fact that some data will be external to the project - we are working on mechanisms to "lazily" fetch this data and in this case it would not be archived with the project itself. Of course, there are then no guarantees that you can actually reproduce the work.

Yes I agree it would be nice to have the "tarball to zenodo" for those purposes too.

Is it already possible, or could it be made possible, to import/export a repo of code to/from Github, so that it could be published with Zenodo in that way, if publishing the code through Renku is maybe not suitable?

You can already do that simply by adding another remote to your project and pushing to whatever git server you want. Or do you mean could we automate this from the renku CLI somehow?
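As a rough sketch of the "add another remote and push" route, the snippet below only builds the git commands and prints them; `run_all` would execute them inside a checked-out project. The remote name `github` and the repository URL are placeholders, not real endpoints, and the helper names are made up for this example.

```python
import subprocess


def mirror_commands(remote_name, remote_url, ref="HEAD"):
    """Return the git commands that add a second remote and push to it."""
    return [
        ["git", "remote", "add", remote_name, remote_url],
        # Push the currently checked-out branch to the new remote.
        ["git", "push", remote_name, ref],
        # Push tags as well, since releases are usually cut from tags.
        ["git", "push", remote_name, "--tags"],
    ]


def run_all(commands, cwd="."):
    """Run each command in the given repository, stopping on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)


# Placeholder URL, for illustration only.
cmds = mirror_commands("github", "git@github.com:example-user/example-repo.git")
for cmd in cmds:
    print(" ".join(cmd))
```

Once the code is mirrored to GitHub this way, the existing GitHub/Zenodo integration mentioned earlier in the thread can be used on that mirror.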

I just found this because I was looking for a way to publish a renkulab project on zenodo. Yes, I think it would be really great if we could do this both from the renku CLI and the renkulab UI. Or at least some instructions on the easiest way to publish a chosen commit of a renku project on zenodo.
Oh, and I agree that in a perfect world, data would be pulled into a renku project from persistent databases and hence would not need to be included in the tarball on zenodo for ensuring reproducibility, but it would be good if the user could decide which datasets are safe to be left out and which should be included for reproducibility.
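One way the "user decides which datasets go into the tarball" idea could look is sketched below with Python's standard tarfile module. It assumes the chosen commit is already checked out and that the archive is written outside the project directory; the function name `snapshot`, the demo project layout and the excluded path `data/external` are all invented for the example. Uploading the resulting file to zenodo is not shown.

```python
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def snapshot(project_dir, archive_path, exclude=(".git",)):
    """Write a gzipped tarball of project_dir, skipping the excluded relative paths."""
    project_dir = Path(project_dir).resolve()
    root = project_dir.name
    excluded = [PurePosixPath(p) for p in exclude]

    def keep(info):
        # Member names look like "<root>/data/file"; compare relative to the root.
        rel = PurePosixPath(info.name).relative_to(root)
        if any(rel == ex or ex in rel.parents for ex in excluded):
            return None  # dropping a directory also skips everything below it
        return info

    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(str(project_dir), arcname=root, filter=keep)
    return archive_path


# Small self-contained demo on a throwaway project layout.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "my-project"
    (project / "data" / "external").mkdir(parents=True)
    (project / "data" / "external" / "big.csv").write_text("x\n")
    (project / "src").mkdir()
    (project / "src" / "process.py").write_text("print('hi')\n")

    archive = snapshot(
        project,
        Path(tmp) / "snapshot.tar.gz",
        exclude=(".git", "data/external"),
    )
    with tarfile.open(archive) as tar:
        names = sorted(tar.getnames())

print(names)
```

Datasets that can be re-fetched from a persistent source would go in `exclude`; anything that has no stable home elsewhere stays in the archive.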


I wondered if there is already a good solution for this, as we try to publish an entire renku-repo on zenodo. What is now the best way of doing this?

I just came across this: GitLab to Zenodo - eossr documentation
I have not tried it out yet, but it looks like something that could be implemented in renku-gitlab repos. Perhaps someone could even create a template for this.

After working a bit more with datasets in renku, I had some additional thoughts on the above issue. I usually develop code to solve a particular data-related problem, so the code is born and grows in a data repo. Eventually, I find out that the code can be generalised and used with many different datasets. At this point, I should probably publish the code as a separate package, along with some sample data to showcase its use. For example, in LI-6800, generated from Reproducible Data Science | Open Research | Renku, I have a notebooks folder, where I import different .py files and show how they can be used in context to import data from instrument output files, perform computations on the data and plot the results. The .py files themselves are in the data folder, as a separate dataset called 'modules'. Ideally, I would publish the .py files related to the importing and processing of the data in a separate package, and the .py files related to plotting of data in a different one again, as they could be used with very different sorts of data. Perhaps I should pack the different .py files into different datasets and then publish them separately, then pull them back into the li-6800 repo as external data, and publish li-6800 as a use case for data processing? Would this be consistent with your thinking @jen-thomas and @rrrrrok?
The best way to proceed would probably be to create a new repo, e.g. on github or gitlab or renkulab with that code and sample data, publish it on zenodo, and then replace

If you are going to publish your more general / modular code as a python package, then putting it on pypi or in your own conda channel and adding it like any other dependency to your requirements.txt or environment.yml might be the way to go, rather than treating the code more like a dataset.

Also, an FYI following up on one of the old comments: eossr implements their own thing in their library for sending project snapshots over to zenodo, but this project is more general: gitlab2zenodo (gitlab2zenodo · PyPI). I'm using it with a ci/cd file to automatically snapshot my book (https://data-guide.hdbi.org/) and send an update to zenodo when I set a new version tag. The tarball upload part has been a bit flaky for me, so that might need some tweaking. I was working on a project template with this but got bogged down trying to debug it because of the difficulty of testing templates; going via gitlab.com for pages hosting also adds quite a bit to the initial setup complexity.


There is a 5-year-old issue on gitlab-zenodo integration, unfortunately still no easy solution: connect to GitLab just as to GitHub · Issue #1404 · zenodo/zenodo · GitHub
The pypi way would make it most convenient for re-importing, but it does not solve the archiving challenge, so we would still need to put each release on zenodo, right?

I've not used pypi myself, but I'd assume that packaging for it would be based off a git repo, probably with some kind of build operation first. So if you had a gitlab ci/cd job or github action that builds a package for pypi whenever you set a new version tag, you could also integrate with zenodo using gitlab2zenodo or github's built-in zenodo integration to generate those archived snapshots.