How to ensure reproducibility of an environment?

When I specify an environment in a Dockerfile, environment.yml, or requirements.txt, I can specify versions of the different packages or leave that information out. If I don’t specify a version, the currently up-to-date version will be installed, right? What if I or someone else tries to re-use the project in a year’s time, when the up-to-date versions of the packages are different? Is there a way to find out what the original package versions were? Something like pip freeze? Here is a related issue: https://github.com/SwissDataScienceCenter/renku-python/issues/1266
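For concreteness, the difference I mean is between an unpinned and a pinned requirements.txt (package names and versions below are just examples):

# unpinned requirements.txt - installs whatever is newest at build time:
#   numpy
#   pandas
# pinned requirements.txt - installs the exact versions that were tested:
#   numpy==1.21.1
#   pandas==1.3.1
pip freeze > requirements.txt  # records the exact versions installed right now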

Hi @schymans, that’s an excellent question - in general, yes, it is a problem. On renku, the images are persisted for quite a long time, so as long as the original image is available in the registry, the versions will be stable. If the image is rebuilt, however, you are right that they will not be.

I don’t quite follow how the issue you linked is related, though. Can you elaborate?

Thanks for the quick response @rrrrrok. The linked issue just shows that, without specifying the exact version of a package, the install may not work - sorry, it’s not directly related. Should I create an issue? I think it would be great if we could keep track of which version combinations have been tested.

We’ve not implemented functionality like this because it can lead to lots of dependency-resolution issues. Python packaging tools work fine like this most of the time, but on occasion pinning can break things in ways that are difficult for users to recover from. OTOH, not pinning can clearly also lead to problems. Maybe if this was done as a deliberate action from the user, e.g. a renku freeze that would essentially just do pip freeze > requirements.txt, it would be enough? Of course this doesn’t handle non-python system libraries at all.
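A minimal sketch of the idea (renku freeze does not exist today; this is just what it might boil down to):

pip freeze > requirements.txt  # pin the exact versions currently installed
git add requirements.txt
git commit -m "freeze package versions"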

I think the renku freeze function you described would be great, but a way to preserve a docker container, including any non-python system libraries, would also be useful. I am not sure pip freeze > requirements.txt would be sufficient in all cases, as the order in which libraries are installed sometimes matters. @rcnijzink might have some examples.

We do preserve the docker images, but we don’t make guarantees (not on renkulab.io, anyway). There might be other renku deployments that would make guarantees about how long-lived the images in the registry are. In general, for proper preservation it would be important to export the image to an archive like zenodo. We don’t currently offer an automatic way of doing that, but I’d be happy if you wanted to draft more details around this use-case.
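Doing it by hand today would look roughly like this (the image name is just an example):

docker save my-project:latest | gzip > my-project-image.tar.gz  # export the image to a tarball
# ...upload my-project-image.tar.gz to zenodo manually...
gunzip -c my-project-image.tar.gz | docker load                 # restore it later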

I think both would be great: renku freeze just pinning the current packages in requirements.txt, and adding the docker container to a permanent archive, e.g. on zenodo. Maybe a renku unfreeze would also be helpful, to remove the pinning in case the user wants to test the code with the most up-to-date packages.
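Unpinning could be as simple as stripping the version specifiers back out again, e.g. (GNU sed; just a sketch):

sed -i 's/==.*$//' requirements.txt  # drop the ==<version> pin from every line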

I don’t know enough about docker to give more details, but the general idea is to guarantee that results remain reproducible and code re-usable in the future.

Sure, that’s fine - I think the technical details can be deliberated later. I’m more interested in how you imagine working with it as a user. What would you do with the archived image, for example? Somehow automatically recover it locally so you could run against an old version of the project?

Yes, I would like to be able to still run and modify my code from years ago, but a lot of it does not run on current versions of libraries, and it is a major effort to figure out which versions were current when I originally developed the code. Ideally, I would have pinned those versions and could re-create a docker container that runs my old code. If this fails because the old code relies on some non-python programs, I would like to be able to pull out the docker container from back then and run the code in it, to see what the original behaviour of the code was. This could be done at the release level, i.e. for versions of the project that are actually published and preserved permanently.

Right - I think it would make a lot of sense to have this sort of functionality in a mode where we could export a tarball of the entire project to a repository. The question then is what you would expect the hosted Renku (i.e. renkulab.io) to do for you (if anything), or whether it would be enough to do something like this locally:

renku unarchive https://zenodo.org/<project-archive-doi>

which would then create a directory with the contents of the archived project and extract the preserved docker image so you can run it. From an implementation point of view, I can see the CLI version of this being relatively straightforward, but I’m not sure what we could do on the renkulab side.
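For concreteness, under the hood this might amount to something like the following (all file and image names here are hypothetical):

curl -L -o project-archive.tar.gz https://zenodo.org/record/<record-id>/files/project-archive.tar.gz
tar -xzf project-archive.tar.gz && cd project-archive
docker load < image.tar           # restore the preserved docker image
docker run -it --rm <image-name>  # run the code in its original environment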


Turns out we already have several issues for exporting projects to external repositories: https://github.com/SwissDataScienceCenter/renku-python/issues/1327

I’ve added a bit at the bottom to mention the need to export Docker images as well. Thanks for bringing up this use-case!


Yes, thank you, a holistic solution as discussed in the issue would be great. In the meantime, would it be possible to implement something like renku freeze, which would just add explicit version numbers to all entries in requirements.txt and environment.yml? This could be used whenever creating a tag or submitting to zenodo, so that if I want to use a tag years later, I am more likely to still get a functioning combination of libraries for that particular repo.
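Until then, a manual equivalent would be to pin both files from a working environment, roughly:

pip freeze > requirements.txt                    # pin the pip-installed packages
conda env export --no-builds > environment.yml   # pin the conda packages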

Actually, I just realised that the issues and the discussions therein should be part of an open science publication. At the moment, these can only be downloaded through the gitlab UI: Settings, General, Advanced, Export project. I just tried that, but I am still waiting for the email with the download link…
I wonder if there could be a better way of making the discussions in the issues available, e.g. through zenodo.

Three years later, I faced the challenge of re-using a project that had last been run in 2021, without a pip freeze. If I create a new environment now, nothing works any more, as the packages listed in requirements.txt did not have versions specified. However, since I know that the project still worked on 30/07/2021, I was able to re-create the output of pip freeze from back then with the help of https://github.com/astrofrog/pypi-timemachine:
In a terminal, create a new conda environment and then:

pip install pypi-timemachine
pypi-timemachine 2021-07-30 --port 5000

Then, in another terminal tab, using the same conda environment:

pip install --index-url http://localhost:5000/ -r requirements.txt
pip freeze > requirements_2021-07-30.txt

After this, I renamed the original requirements.txt to requirements_specific.txt, and requirements_2021-07-30.txt to requirements.txt. After pushing back to my repo and creating a new renkulab environment based on the new requirements.txt file, I obtained a state similar to that of 30/07/2021, where the old code works again. Could such a procedure be simplified in renku? It happens quite often that I need to re-use an old project that no longer works with new python packages.
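For the record, the renaming step was just:

mv requirements.txt requirements_specific.txt
mv requirements_2021-07-30.txt requirements.txt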

Here is an issue submitted to pypi asking to include such functionality in the pip command, but it did not gain any traction:

Did you try spinning up the old docker image? Is it available for your project?

No, I didn’t try that - I hadn’t thought of it. But I also wouldn’t know where to look or how to re-activate it. The project is: Reproducible Data Science | Open Research | Renku

One thing that we do on our end to make image builds for our services reproducible is using poetry for dependencies, and installing them from the lock file poetry creates, which has specific versions.

E.g. here: renku-data-services/projects/renku_data_service/Dockerfile at main · SwissDataScienceCenter/renku-data-services · GitHub

If we just did poetry install, it would install the newest versions compatible with our pyproject.toml. But since we do poetry export, this uses the poetry.lock file with fixed versions to create a requirements.txt, which we then install. This way, for our Renku services, no matter when we build the image, we always get the same packages. So you could possibly adopt a similar workflow. (Note that this breaks down if a package from the lock file isn’t available on pypi anymore due to being deleted.)
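The pattern boils down to these commands (simplified from the Dockerfile linked above; in newer poetry versions, poetry export needs the export plugin):

poetry lock                                                  # resolve and pin all dependency versions into poetry.lock
poetry export -f requirements.txt --output requirements.txt  # turn poetry.lock into a pinned requirements.txt
pip install -r requirements.txt                              # install the exact locked versions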

I just cloned your project, checked out the last pre-2024 commit and did docker build - it worked after I changed the renku version to 2.9.2 (there was an issue with one of the dependencies).

Thanks for checking - the build works, but I was not able to run the notebooks, due to deprecated packages and other run-time incompatibilities.
Incidentally, the workflows in this repo have been broken since the upgrade; see my other post here: How to resolve circular workflows?
Could you help me fix this? I don’t even know where to start. :frowning: