How to ensure reproducibility of the environment?

When I specify an environment in a Dockerfile, environment.yml, or requirements.txt, I can specify versions of the different packages or leave that information out. If I don’t specify a version, the currently up-to-date version will be installed, right? What if I or someone else tries to re-use the project in a year’s time, when the up-to-date versions of the packages are different? Is there a way to find out what the original package versions were? Something like pip freeze? Here is a related issue: https://github.com/SwissDataScienceCenter/renku-python/issues/1266
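
For illustration, the difference between an unpinned and a pinned entry in requirements.txt, and the command that reveals the installed versions (the package name and version are just examples):

numpy           # unpinned: whatever version is current gets installed at build time
numpy==1.18.1   # pinned: the same version is installed every time

pip freeze      # prints "package==version" lines for everything currently installed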

Hi @schymans, that’s an excellent question - in general, yes, it is a problem. On Renku, the images are persisted for quite a long time, so as long as the original image is still available in the registry, the versions will be stable. If the image is rebuilt, however, you are right that they will not be.

I don’t quite follow how the issue you linked is related, though - can you elaborate?

Thanks for the quick response @rrrrrok. The linked issue just shows that, without specifying the exact version of a package, the install may not work - sorry, it’s not directly related. Should I create an issue? I think it would be great if we could keep track of any version combinations that have been tested.

We’ve not implemented functionality like this because it can lead to lots of dependency-resolution issues. Python packaging tools work fine most of the time like this, but on occasion they can break things in ways that are difficult for users to recover from. OTOH, not pinning can clearly also lead to problems. Maybe if this were done as a deliberate action by the user, e.g. a renku freeze command that would essentially just do pip freeze > requirements.txt, that would be enough? Of course, this doesn’t handle the non-Python system libraries at all.
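
For context, that would capture exact pins for everything installed in the current environment, e.g. (versions illustrative):

pip freeze > requirements.txt   # overwrite requirements.txt with exact pins
# requirements.txt then contains lines like:
# numpy==1.18.1
# pandas==1.0.1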

I think the renku freeze function you described would be great, but the possibility to preserve a Docker container, including any non-Python system libraries, would also be useful. I am not sure pip freeze > requirements.txt would be sufficient in all cases, since the order in which libraries are installed sometimes matters. @rcnijzink might have some examples.

We do preserve the Docker images, but we don’t make guarantees (not on renkulab.io, anyway). There might be other Renku deployments that make guarantees about how long-lived the images in the registry are. In general, for proper preservation it would be important to export the image to an archive like Zenodo. We don’t currently offer an automatic way of doing that, but I’d be happy if you wanted to draft more details around this use-case.
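
For what it’s worth, this can already be done by hand with plain Docker (the image name is a placeholder):

docker save -o my-project-image.tar my-project:1.0   # write the image and all its layers to a tarball
gzip my-project-image.tar                            # compress before uploading to e.g. Zenodo
docker load -i my-project-image.tar.gz               # later: restore the image from the archive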

I think both would be great: renku freeze just pinning the current packages in requirements.txt, and adding the Docker container to a permanent archive, e.g. on Zenodo. Maybe a renku unfreeze would also be helpful, to remove the pinning in case the user wants to test the code with the most up-to-date packages.
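
Hypothetically (neither command exists today), the workflow might look like:

renku freeze      # hypothetical: pin the current versions in requirements.txt / environment.yml
git commit -am "freeze environment for release"
renku unfreeze    # hypothetical: remove the pins to test against up-to-date packages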

I don’t know enough about Docker to give more details, but the general idea is to guarantee that results remain reproducible and code remains re-usable in the future.

Sure, that’s fine - I think the technical details can be worked out later. I’m more interested in how you imagine working with it as a user. What would you do with the archived image, for example? Somehow automatically recover it locally so you could run against an old version of the project?

Yes, I would like to be able to still run and modify my code from years ago, but a lot of it does not run on the current versions of the libraries. It is a major effort to figure out which versions were up to date when I originally developed the code. Ideally, I would have pinned those versions and could re-create a Docker container that would run my old code. If that fails because the old code relies on some non-Python programs, I would like to be able to pull out the Docker container from back then and run the code in it, to see the original behaviour of the code. This could be done at the release level, i.e. for versions of the project that are actually published and preserved permanently.
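
As a sketch of that workflow (the tag name and image reference are placeholders):

git checkout v1.0                  # check out the released, preserved version of the project
pip install -r requirements.txt    # pinned versions re-create the original Python environment
# if this fails because of non-Python dependencies, fall back to the preserved image:
docker run -it --rm -v "$(pwd)":/work <archived-image> bash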

Right - I think it would make a lot of sense to have this sort of functionality in the form of exporting a tarball of the entire project to a repository. The question then is what you would expect the hosted Renku (i.e. renkulab.io) to do for you, if anything. Or would it be enough to do something like this locally:

renku unarchive https://zenodo.org/<project-archive-doi>

which would then create a directory with the contents of the archived project and extract the preserved Docker image so you can run it. From the implementation point of view, I can see the CLI version of this being relatively straightforward, but I’m not sure what we could do on the renkulab side.
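
Under the hood, I imagine this would be roughly equivalent to (file names hypothetical):

tar -xzf project-archive.tar.gz            # unpack the archived repository contents
docker load -i project-archive/image.tar   # restore the preserved Docker image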


Turns out we already have several issues for exporting projects to external repositories: https://github.com/SwissDataScienceCenter/renku-python/issues/1327

I’ve added a bit at the bottom to mention the need to export Docker images as well. Thanks for bringing up this use-case!


Yes, thank you - a holistic solution as discussed in the issue would be great. In the meantime, would it be possible to implement something like renku freeze, which would just add explicit version numbers to all entries in requirements.txt and environment.yml? This could be used whenever creating a tag or submitting to Zenodo, so that if I want to use a tag years later, I am more likely to still get a functioning combination of libraries for that particular repo.
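
Until then, both files can be pinned by hand before creating a tag, e.g. (the tag name is illustrative):

pip freeze > requirements.txt        # pin all installed Python packages
conda env export > environment.yml   # pin conda packages, including many non-Python libraries
git tag v1.0                         # tag the pinned state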

Actually, I just realised that the issues and the discussions therein should be part of an open-science publication. At the moment, these can only be downloaded through the GitLab UI (Settings > General > Advanced > Export project). I just tried that, but I am still waiting for the email with the download link…
I wonder if there could be a better way of making the discussions in the issues available, e.g. through Zenodo.