CI/CD Runners - something has changed?

Hello,

For the past few days I have been experiencing an issue with the CI/CD runners while building Docker images across multiple repositories.

In particular, they get stuck on the following line until the timeout is reached (even if it is 10h):

RUN mamba env update -q -f /tmp/environment.yml && \
    /opt/conda/bin/pip install -r /tmp/requirements.txt --no-cache-dir && \
    mamba clean -y --all && \
    mamba env export -n "root" && \
    rm -rf ${HOME}/.renku/venv

In case it helps, I have already experienced this across:

So I am wondering if something has changed with the runners that causes them to get stuck while building conda environments.

I have been facing the same issue on renku limited since last Friday. The CI/CD runners get stuck at the same line that Firat showed above.

@firat @Lin thanks for reporting this - have you tried using mamba/conda locally to install the packages?

Hi @rrrrrok, I can install the conda environment locally with mamba, and quite fast. Moreover, the two repos where I see it are very different. Also, the CI/CD stopped working despite no changes to the environment, Dockerfile or requirements.txt. So it must be some update on the side of the runners.
I am wondering if the RAM, disk or other specs of the CI/CD runners got downgraded? Or if something cached is causing the issue?
See below where it always gets stuck. It does not fail; it just times out once the maximum time is reached.

Step 7/9 : RUN conda env update -q -f /tmp/environment.yml && /opt/conda/bin/pip install -r /tmp/requirements.txt && conda clean -y --all && conda env export -n "root"
 ---> Running in 082fdb2ddff6
Collecting package metadata (repodata.json): ...working...
ERROR: Job failed: execution took longer than 1h0m0s seconds
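A hang like the one in this log can be made to fail early instead of using up the whole job timeout. The sketch below wraps a command in `timeout`, assuming GNU coreutils is available in the image; the helper name `run_limited` and the limits used are hypothetical, not something from the Renku templates.

```shell
# Run a command with a time limit.
# GNU coreutils `timeout` exits with status 124 when the limit is hit.
run_limited() {
    limit="$1"; shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "step exceeded $limit: $*" >&2
    fi
    return "$status"
}

# In the Dockerfile this could wrap the solver step, for example
# (20m is an arbitrary, assumed limit):
#   RUN timeout 20m mamba env update -f /tmp/environment.yml
# Dropping -q there also keeps the solver's progress output in the job log.

# Small demonstration with a command that runs longer than its limit:
run_limited 1 sleep 5 || echo "exit status: $?"
```

This does not fix the underlying problem; it only shortens the feedback loop while debugging.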

Hi @rrrrrok, I can install the packages locally using conda, also tried building the docker image locally, which works fine on my local machine.

Hi @rrrrrok, I am experiencing the same. The CI/CD job fails because it takes too long (the limit is now 1h, but much less is normally needed), and it gets stuck at the mamba env update, same as for @firat.

Any update on this? Locally on a VM everything works (with conda; mamba is not installed on it).

Sorry about that, we’re looking into it. As a workaround in the meantime: if you can build the image locally, you could push that to the registry and it will be picked up when you try to launch a session.

Thanks for the workaround. Could you give us pointers to anything in the Renku docs (e.g. an example) on where we should push the images so that they are recognized?
At the moment, I have switched to merging pull requests without waiting for the build to succeed.

The information here is a little bit outdated wrt launching the sessions locally, but all the bits about the registry, the docker login and the image naming should be correct.
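As a rough sketch of the workaround, the commands could look like the following. The registry host, the project path and the convention of tagging the image with the first 7 characters of the commit SHA are assumptions here, so check them against the linked docs. The script only prints the commands (a dry run) so they can be reviewed before being run by hand.

```shell
# Compose the image reference:
#   <registry>/<namespace>/<project>:<first 7 chars of commit SHA>
# The 7-character tag is an assumption; verify it in the docs.
image_name() {
    registry="$1"; project_path="$2"; commit_sha="$3"
    short_sha=$(printf '%s' "$commit_sha" | cut -c1-7)
    printf '%s/%s:%s\n' "$registry" "$project_path" "$short_sha"
}

# Print the docker commands instead of executing them.
print_push_commands() {
    image=$(image_name "$1" "$2" "$3")
    echo "docker login $1"
    echo "docker build --no-cache -t $image ."
    echo "docker push $image"
}

# Hypothetical values; in a real checkout the SHA would come from
# `git rev-parse HEAD`.
print_push_commands "registry.example.com" "my-namespace/my-project" \
    "0123456789abcdef0123456789abcdef01234567"
```

Building with `--no-cache` makes the local build redo the conda step from scratch, which is closer to what the runner does.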

Hey @firat @lin - I’ve been looking into this for an hour or so; it seems there is some issue with the conda environments generally which is causing the conda env update to get stuck. I’ve tried a couple of basic things to no avail as yet - I’ll look into it a bit more tomorrow…

Could it be related to this: Add Packages to Renku project via Conda - pyGIMLi - #6 by champost

@firat @Lin I just realized you’re building with conda - have you tried with mamba?

Oh wow, you are right, the first project seems to be outdated. But the second one (a private project) is actually using mamba.
Let me try to replace the Dockerfile with a newer Renku Dockerfile setup.

Edit: Updating Dockerfile didn’t fix the issue.

@rrrrrok @seanrmurphy Just some more details, in case they help: I am using mamba, and 4 days earlier the project was building in a few minutes, with no issues at all. The conda environment was consistent with no conflicts, as it had been rebuilt from scratch after important packages (such as torch) had to be updated. No change was made to the environment.yaml or to the Dockerfile; it just started getting stuck at the mamba update on a commit that pushed only documentation changes. Not sure if this helps, but it seems that this is not caused by the project itself (or at least not directly or in an obvious way, and the log from the image build job does not show anything suspicious). Let me know if you want to look at the project, I can add you! Cheers and thanks for looking into this!

@Lin you said you could build the docker image locally. Can you try again without cache? I can no longer build them locally?!

Just tried. It still works on my side.

@mitch - thanks for providing more details here - if you could add me to the project, it would be great.

As you note, it seems unlikely to be a project specific issue if it’s affecting multiple people across multiple projects at the same time.

@rrrrrok Same issue with mamba. My situation is pretty similar to @mitch's: one package added to environment.yaml (definitely no conflicts, and the local build is fine), and no change to the Dockerfile. All previous CI/CD runs worked fine.

Hi folks - I have been looking into this a little this morning - I can confirm that I can build the project of @mitch (mzb-workflow) on a VM and almost certainly on my local machine. My current thinking is that there could be issues with our access to anaconda from inside the GitLab runners (similar issues were reported last week) - I could of course be wrong on this.

The VM build did take time (20m) but it was a build from scratch - this is still significantly less than the 1h time limit we have as a default on the job in the pipeline so I don’t think we are arbitrarily hitting the timeout; also, I see @firat made the timeout 10h and still the problem manifested.

I’ll update when I find out more.

Just another data point here - I tested this with another project and switching to the conda-forge channel seems to have helped…
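The post above does not say how the channel was switched. One common way, sketched here under assumptions, is to list conda-forge in the `channels` section of environment.yml. The file contents and the `ENV_FILE` variable below are hypothetical placeholders, not taken from any of the affected projects.

```shell
# Write a hypothetical environment.yml that resolves from conda-forge only.
# "nodefaults" tells conda/mamba not to add the default channels on top.
ENV_FILE="${ENV_FILE:-$(mktemp -d)/environment.yml}"
cat > "$ENV_FILE" <<'EOF'
channels:
  - conda-forge
  - nodefaults
dependencies:
  - python   # placeholder dependency, not from this thread
EOF

# The Dockerfile step would then pick the file up unchanged, e.g.:
#   mamba env update -q -f /tmp/environment.yml
echo "wrote $ENV_FILE"
```

Whether this helps in a given project presumably depends on whether all of its packages are available on conda-forge.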