We would like to be able to publish the code of a Renku project/repo to get a DOI so that it can be referenced. I wondered if this is already possible?
Hi @jen-thomas good question - it’s not possible atm. Do you have something in mind like the github/zenodo integration?
Hi @rrrrrok, thanks for your answer.
Yes, something like that would be perfect. On the occasions where I have wanted to publish code so far, I have done just as you say: work on it in GitHub, then publish it through the integration with Zenodo.
But we are now looking to use Renku more as it was intended: import datasets from Zenodo, have the code there as well, create a finalised dataset from there and publish it through Zenodo. We would also like to publish the code, so it can be referenced. Does that make sense?
Yeah it makes sense - we could certainly just do something like package a tarball and ship it to zenodo (this is what github does afaik). The question is what to do with the data? Should it come with the project? Or would you have one project for code and one project for data?
Hi @jen-thomas and @rrrrrok. Thanks for linking me to this question. I think in some cases it would make sense to publish code and data in the same project. E.g. in my case I would use datasets from zenodo and other sources and derive a ‘correction’ to one data set. Some of the functions used here could be very useful for other projects and I think they are best published together with the ‘example data’ in one project - if that should be possible. But there might be other cases where separate projects make more sense.
Hi @sebastian-landwehr thanks for chiming in! Yeah I agree in that case it makes perfect sense - I was just curious in the case of datasets that get published via zenodo as individual entities, and they also happen to be parts of a renku project, would it then make sense to publish the same data again with the code or would it be better to have a separation? Apart from redundancy, I don’t really see why you couldn’t push the same data twice - basically, when you publish a dataset it’s only for the purpose of providing the data as-is and making it possible to reference it in future work; then when you publish the full project (potentially with some pre-processing pipelines etc.) you are basically saving a snapshot of a work that uses that dataset.
Sorry for being verbose but does that make sense?
Hi @rrrrrok hm yes I think what you suggest makes perfect sense apart from the redundancy, but I don’t see a big issue. If someone wants to use the project to build upon it they can cite the project, and if they just use the data they can cite the dataset. @jen-thomas what do you think?
Hm…I’m not sure. I understand that it seems useful to have everything bundled in one publication, but I prefer having code and datasets (whether input, output etc) separate from each other.
I could see a situation where someone wants to cite code that was used to process their own dataset. If that code was “bundled” with a dataset and other things, I feel that it would be very confusing: if someone sees the citation, how do they know it was the code that the person had used rather than the data? The part of the publication that is being cited is not explicit.
I also think it would be confusing if there were different versions of code and data: what would be the best thing to do if one of them had to change?
DOIs also have a “type” - in the case of DataCite, this is resourceType which would be distinct for software and dataset. This is a one to one relationship in this case.
There are also ways now to easily link DOIs with a relationType from DataCite, using for example IsDerivedFrom.
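To make the last two points concrete, here is a rough sketch of what separate, linked records could look like. The dict layout is a simplified illustration rather than the exact DataCite serialisation, the helper `datacite_record` is hypothetical, and the DOIs are made-up placeholders. Only the vocabulary values `Software`, `Dataset` and `IsDerivedFrom` come from DataCite itself.

```python
def datacite_record(doi, title, resource_type_general, related=()):
    """Build a minimal DataCite-style metadata dict (illustrative subset only).

    `related` is a sequence of (relationType, DOI) pairs.
    """
    return {
        "identifier": {"identifier": doi, "identifierType": "DOI"},
        "titles": [{"title": title}],
        "resourceType": {"resourceTypeGeneral": resource_type_general},
        "relatedIdentifiers": [
            {
                "relatedIdentifier": related_doi,
                "relatedIdentifierType": "DOI",
                "relationType": relation_type,
            }
            for relation_type, related_doi in related
        ],
    }


# Placeholder DOIs, not real records.
code_record = datacite_record(
    "10.0000/example-code", "Processing code", "Software"
)
derived_record = datacite_record(
    "10.0000/example-derived",
    "Corrected dataset",
    "Dataset",
    related=[("IsDerivedFrom", "10.0000/example-input")],
)
```

Each record keeps a single type, and the link between the derived dataset and its input is carried by the relation rather than by bundling the two together.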
Having said all of that, I think it would be very useful to be able to publish a summary of the workflow (again linking this to the other publications). I’m not sure if this is to do with the renku run command; I haven’t used it.
There could also be the case that a dataset was imported from outside of the Renku project and it wouldn’t necessarily be right to publish it with everything else if the creator / originator was different, for example, from that of the code. Or vice versa.
Jen’s points make lots of sense.
I agree that it would be ideal to observe a reasonably strict separation of data and code - however it might not always be possible or practical. In that case, I see the “tarball to zenodo” approach as basically a means of taking a snapshot of the project, so that for the purposes of reproducibility one could always retrieve the state exactly as it was. In practice, I don’t think anyone initiates work from a zenodo snapshot of code - they just go to the github repo and start from there. I see the same thing happening here - if someone wanted to continue work on the project, they would fork the project. If they wanted to just use the data, they would pull the original data.
And yes, part of what I’m thinking is preserving the full provenance of the project - in renku, we store all the necessary information to rerun an analysis in the project so for replication/reproducibility this could work more or less out of the box if the data is also there.
otoh, we must also be mindful of the fact that some data will be external to the project - we are working on mechanisms to “lazily” fetch this data and in this case it would not be archived with the project itself. Of course, there are then no guarantees that you can actually reproduce the work.
Yes I agree it would be nice to have the “tarball to zenodo” for those purposes too.
Is it already possible, or would it be possible, to import/export a repo of code to/from GitHub, so that it could be published with Zenodo in that way, if publishing the code through Renku is maybe not suitable?
You can already do that simply by adding another remote to your project and pushing to whatever git server you want. Or do you mean could we automate this from the renku CLI somehow?
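For reference, the manual route amounts to two plain git commands, sketched here via Python's `subprocess`. The helper names, the remote name and the URL are placeholders for illustration; nothing here is a renku CLI feature.

```python
import subprocess


def mirror_commands(remote_name, remote_url, branch):
    """Return the git commands that add a second remote and push a branch to it."""
    return [
        ["git", "remote", "add", remote_name, remote_url],
        ["git", "push", remote_name, branch],
    ]


def mirror(repo_dir, remote_name, remote_url, branch):
    """Run the commands inside an existing local clone of the project."""
    for command in mirror_commands(remote_name, remote_url, branch):
        # check=True raises CalledProcessError if git exits with an error
        subprocess.run(command, cwd=repo_dir, check=True)


# Hypothetical usage, assuming an empty repository already exists at the URL:
# mirror(".", "github", "https://github.com/<user>/<repo>.git", "master")
```

Once the code is on GitHub, the existing GitHub/Zenodo integration mentioned earlier in the thread can take over from there.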
I just found this because I was looking for a way to publish a renkulab project on zenodo. Yes, I think it would be really great if we could do this both from the renku CLI and the renkulab UI. Or at least some instructions on the easiest way to publish a chosen commit of a renku project on zenodo.
Oh, and I agree that in a perfect world, data would be pulled into a renku project from persistent databases and hence would not need to be included in the tarball on zenodo for ensuring reproducibility, but it would be good if the user could decide which datasets are safe to be left out and which should be included for reproducibility.
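Until something like this exists in the CLI or UI, a manual version of the “tarball to zenodo” snapshot with user-chosen exclusions could look roughly like the sketch below. The function `snapshot` and the example paths are hypothetical, and it assumes datasets sit in their own directories inside the project (e.g. under `data/`); it only builds the archive and leaves the upload to Zenodo as a separate step.

```python
import tarfile
from pathlib import Path


def snapshot(project_dir, archive_path, exclude=()):
    """Write project_dir to a gzipped tarball, skipping .git and excluded paths.

    `exclude` holds paths relative to project_dir, e.g. "data/external".
    Write the archive outside project_dir so it does not include itself.
    Returns the sorted list of relative file paths that were archived.
    """
    root = Path(project_dir)
    skipped = {Path(".git"), *(Path(p) for p in exclude)}
    included = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in root.rglob("*"):
            if not path.is_file():
                continue
            rel = path.relative_to(root)
            # Skip a file if it is an excluded path or lies below one
            if any(rel == s or s in rel.parents for s in skipped):
                continue
            tar.add(str(path), arcname=rel.as_posix())
            included.append(rel.as_posix())
    return sorted(included)


# Hypothetical usage: leave out a dataset that is safely archived elsewhere.
# snapshot("my-project", "my-project-snapshot.tar.gz", exclude=["data/external"])
```

The returned list makes it easy to check what actually went into the snapshot before uploading it.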