How to ensure reproducibility of the environment?

When I specify an environment in a Dockerfile, environment.yml, or requirements.txt, I can specify versions of the different packages or leave that information out. If I don’t specify a version, the currently up-to-date version will be installed, right? What if I or someone else tries to re-use the project in a year’s time, when the up-to-date versions of the packages are different? Is there a way to find out what the original package versions were? Something like pip freeze? Here is a related issue: https://github.com/SwissDataScienceCenter/renku-python/issues/1266
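
For illustration, the difference between an unpinned and a pinned entry in requirements.txt, and the command that reveals the installed versions (the package name and version are just examples):

numpy           # unpinned: whatever version is current gets installed at build time
numpy==1.18.1   # pinned: the same version is installed every time

pip freeze      # prints "package==version" lines for everything currently installed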

Hi @schymans, that’s an excellent question - in general, yes, it is a problem. On Renku, the images are persisted for quite a long time, so as long as the original image is still available in the registry, the versions will be stable. If the image is rebuilt, however, you are right that they will not be.

I don’t quite follow how the issue you linked is related, though - can you elaborate?

Thanks for the quick response @rrrrrok. The linked issue just shows that, without specifying the exact version of a package, the install may not work - sorry, it’s not directly related. Should I create an issue? I think it would be great if we could keep track of any version combinations that have been tested.

We’ve not implemented functionality like this because it can lead to lots of dependency-resolution issues. Python packaging tools work fine most of the time like this, but on occasion they can break things in ways that are difficult for users to recover from. OTOH, not pinning can clearly also lead to problems. Maybe if this were done as a deliberate action by the user, e.g. a renku freeze command that would essentially just do pip freeze > requirements.txt, that would be enough? Of course, this doesn’t handle the non-Python system libraries at all.
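
For context, that would capture exact pins for everything installed in the current environment, e.g. (versions illustrative):

pip freeze > requirements.txt   # overwrite requirements.txt with exact pins
# requirements.txt then contains lines like:
# numpy==1.18.1
# pandas==1.0.1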

I think the renku freeze function you described would be great, but the possibility to preserve a Docker container, including any non-Python system libraries, would also be useful. I am not sure pip freeze > requirements.txt would be sufficient in all cases, since the order in which libraries are installed sometimes matters. @rcnijzink might have some examples.

We do preserve the Docker images, but we don’t make guarantees (not on renkulab.io, anyway). There might be other Renku deployments that make guarantees about how long-lived the images in the registry are. In general, for proper preservation it would be important to export the image to an archive like Zenodo. We don’t currently offer an automatic way of doing that, but I’d be happy if you wanted to draft more details around this use-case.
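
For what it’s worth, this can already be done by hand with plain Docker (the image name is a placeholder):

docker save -o my-project-image.tar my-project:1.0   # write the image and all its layers to a tarball
gzip my-project-image.tar                            # compress before uploading to e.g. Zenodo
docker load -i my-project-image.tar.gz               # later: restore the image from the archive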

I think both would be great: renku freeze just pinning the current packages in requirements.txt, and adding the Docker container to a permanent archive, e.g. on Zenodo. Maybe a renku unfreeze would also be helpful, to remove the pinning in case the user wants to test the code with the most up-to-date packages.
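
Hypothetically (neither command exists today), the workflow might look like:

renku freeze      # hypothetical: pin the current versions in requirements.txt / environment.yml
git commit -am "freeze environment for release"
renku unfreeze    # hypothetical: remove the pins to test against up-to-date packages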

I don’t know enough about Docker to give more details, but the general idea is to guarantee that results remain reproducible and code remains re-usable in the future.

Sure, that’s fine - I think the technical details can be worked out later. I’m more interested in how you imagine working with it as a user. What would you do with the archived image, for example? Somehow automatically recover it locally so you could run against an old version of the project?

Yes, I would like to be able to still run and modify my code from years ago, but a lot of it does not run on the current versions of the libraries. It is a major effort to figure out which versions were up to date when I originally developed the code. Ideally, I would have pinned those versions and could re-create a Docker container that would run my old code. If that fails because the old code relies on some non-Python programs, I would like to be able to pull out the Docker container from back then and run the code in it, to see the original behaviour of the code. This could be done at the release level, i.e. for versions of the project that are actually published and preserved permanently.
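
As a sketch of that workflow (the tag name and image reference are placeholders):

git checkout v1.0                  # check out the released, preserved version of the project
pip install -r requirements.txt    # pinned versions re-create the original Python environment
# if this fails because of non-Python dependencies, fall back to the preserved image:
docker run -it --rm -v "$(pwd)":/work <archived-image> bash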

Right - I think it would make a lot of sense to have this sort of functionality in the form of exporting a tarball of the entire project to a repository. The question then is what you would expect the hosted Renku (i.e. renkulab.io) to do for you, if anything. Or would it be enough to do something like this locally:

renku unarchive https://zenodo.org/<project-archive-doi>

which would then create a directory with the contents of the archived project and extract the preserved Docker image so you can run it. From the implementation point of view, I can see the CLI version of this being relatively straightforward, but I’m not sure what we could do on the renkulab side.
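
Under the hood, I imagine this would be roughly equivalent to (file names hypothetical):

tar -xzf project-archive.tar.gz            # unpack the archived repository contents
docker load -i project-archive/image.tar   # restore the preserved Docker image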


Turns out we already have several issues for exporting projects to external repositories: https://github.com/SwissDataScienceCenter/renku-python/issues/1327

I’ve added a bit at the bottom to mention the need to export Docker images as well. Thanks for bringing up this use-case!


Yes, thank you - a holistic solution as discussed in the issue would be great. In the meantime, would it be possible to implement something like renku freeze, which would just add explicit version numbers to all entries in requirements.txt and environment.yml? This could be used whenever creating a tag or submitting to Zenodo, so that if I want to use a tag years later, I am more likely to still get a functioning combination of libraries for that particular repo.
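
Until then, both files can be pinned by hand before creating a tag, e.g. (the tag name is illustrative):

pip freeze > requirements.txt        # pin all installed Python packages
conda env export > environment.yml   # pin conda packages, including many non-Python libraries
git tag v1.0                         # tag the pinned state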

Actually, I just realised that the issues and the discussions therein should be part of an open-science publication. At the moment, these can only be downloaded through the GitLab UI (Settings > General > Advanced > Export project). I just tried that, but I am still waiting for the email with the download link…
I wonder if there could be a better way of making the discussions in the issues available, e.g. through Zenodo.