CI/CD Runners - something has changed?

firat · 14 August 2023 07:50

Hello,

For the past few days I have been experiencing an issue with the CI/CD runners when building Docker images across multiple repositories.

Specifically, they get stuck on the following line until the timeout is reached (even if it’s 10h):

RUN mamba env update -q -f /tmp/environment.yml && \
    /opt/conda/bin/pip install -r /tmp/requirements.txt --no-cache-dir && \
    mamba clean -y --all && \
    mamba env export -n "root" && \
    rm -rf ${HOME}/.renku/venv

In case it helps, I have already seen this across:

So I am wondering if something has changed with the runners that causes them to get stuck while building conda environments.

Lin · 14 August 2023 13:19

I am facing the same issue on renku limited since last Friday. The CI/CD runners got stuck at the same line as Firat showed above.

rrrrrok · 14 August 2023 14:08

@firat @Lin thanks for reporting this - have you tried using mamba/conda locally to install the packages?

firat · 14 August 2023 14:48

Hi @rrrrrok, I can install the conda environment locally with mamba, and quite fast. Also, the two repos where I see the problem are very different from each other, and the CI/CD stopped working despite no changes to the environment, the Dockerfile or requirements.txt. So it must be some update on the side of the runners.
I am wondering if the RAM / disk / etc. specs of the CI/CD runners got downgraded? Or if something that is cached is causing an issue?
Below is where it always gets stuck. It does not fail; it just times out when the maximum time is reached.

Step 7/9 : RUN conda env update -q -f /tmp/environment.yml && /opt/conda/bin/pip install -r /tmp/requirements.txt && conda clean -y --all && conda env export -n "root"
 ---> Running in 082fdb2ddff6
Collecting package metadata (repodata.json): ...working...
ERROR: Job failed: execution took longer than 1h0m0s seconds

Lin · 14 August 2023 15:08

Hi @rrrrrok, I can install the packages locally using conda. I also tried building the Docker image locally, which works fine on my machine.

mitch · 16 August 2023 09:42

Hi @rrrrrok, I am experiencing the same. The CI/CD job fails because it takes too long (the limit is now 1h, but much less is normally needed), and it gets stuck at the mamba env update, same as for @firat.

Any update on this? Locally on a VM everything works (with conda; mamba is not installed on it).

rrrrrok · 16 August 2023 11:08

Sorry about that, we’re looking into it. As a workaround in the meantime: if you can build the image locally, you could push that to the registry and it will be picked up when you try to launch a session.

firat · 16 August 2023 11:21

Thanks for the workaround. Could you give us pointers to anything in the Renku docs (e.g., an example) on where we should push the images so that they are recognized?
At the moment, I have switched to merging pull requests without waiting for the build to succeed.

rrrrrok · 16 August 2023 11:35

The information here is a little bit outdated wrt launching the sessions locally, but all the bits about the registry, the docker login and the image naming should be correct.
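
A rough sketch of that workaround, for orientation only. The registry host, the project path and the convention of tagging the image with the first 7 characters of the commit SHA are assumptions here, not something confirmed in this thread, so check them against the docs before pushing. The helper names are made up for illustration, and the script only prints the docker commands so they can be reviewed first.

```shell
#!/bin/sh
# ASSUMPTIONS (hypothetical values, not from this thread):
#   - REGISTRY and PROJECT_PATH are placeholders for your deployment's registry
#     host and your namespace/project.
#   - The image tag is the first 7 characters of the commit SHA.
REGISTRY="${REGISTRY:-registry.example.com}"
PROJECT_PATH="${PROJECT_PATH:-namespace/project}"

# Compose <registry>/<project path>:<short sha> from its three parts.
image_name() {
    short_sha="$(printf '%s' "$3" | cut -c1-7)"
    printf '%s/%s:%s\n' "$1" "$2" "$short_sha"
}

# Print (do not run) the login, build and push commands for a given commit SHA.
print_push_commands() {
    image="$(image_name "$REGISTRY" "$PROJECT_PATH" "$1")"
    echo "docker login $REGISTRY"
    echo "docker build -t $image ."
    echo "docker push $image"
}
```

From the root of the repository, something like `print_push_commands "$(git rev-parse HEAD)"` would print the three commands for the current commit.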

seanrmurphy · 16 August 2023 15:51

Hey @firat @lin - I’ve been looking into this for an hour or so; it seems there is some issue with the conda environments generally which is causing the conda env update to get stuck. I’ve tried a couple of basic things to no avail as yet - I’ll look into it a bit more tomorrow…

rrrrrok · 16 August 2023 16:03

Could it be related to this: Add Packages to Renku project via Conda - pyGIMLi - #6 by champost

rrrrrok · 16 August 2023 17:12

@firat @Lin I just realized you’re building with conda - have you tried with mamba?

firat · 16 August 2023 20:03

Oh wow, you are right, the first project seems to be outdated. But the second one (a private project) is actually using mamba.
Let me try to replace the Dockerfile with a newer Renku Dockerfile setup.

Edit: Updating Dockerfile didn’t fix the issue.

mitch · 17 August 2023 06:33

@rrrrrok @seanrmurphy Just some more details in case they help: I am using mamba, and the project was building in a few minutes 4 days earlier, with no issues at all. The conda environment was consistent with no conflicts, as it had been rebuilt from scratch after important packages (such as torch) had to be updated. No change was made to the environment.yaml or the Dockerfile; it just started getting stuck at the mamba update on a commit that only pushed changes to the documentation. Not sure if this helps, but it seems that this is not caused by the project itself (or at least not directly or in any obvious way, and the log from the image build job does not show anything suspicious). Let me know if you want to look at the project, I can add you! Cheers and thanks for looking into this!

firat · 17 August 2023 07:02

@Lin you said you could build the docker image locally. Can you try again without cache? I can no longer build them locally?!

Lin · 17 August 2023 07:25

Just tried. It still works on my side.

seanrmurphy · 17 August 2023 07:40

@mitch - thanks for providing more details here - if you could add me to the project, it would be great.

As you note, it seems unlikely to be a project specific issue if it’s affecting multiple people across multiple projects at the same time.

Lin · 17 August 2023 07:41

@rrrrrok I get the same issue with mamba. My situation is pretty similar to @mitch's: one package was added to environment.yaml (for sure no conflicts, and the local build is also fine), with no change to the Dockerfile. All previous CI/CD runs worked fine.

seanrmurphy · 17 August 2023 09:41

Hi folks - I have been looking into this a little this morning - I can confirm that I can build the project of @mitch (mzb-workflow) on a VM and almost certainly on my local machine. My current thinking is that there could be issues with our access to anaconda inside the GitLab runners (similar issues were reported last week) - I could of course be wrong on this.

The VM build did take time (20m) but it was a build from scratch - this is still significantly less than the 1h time limit we have as a default on the job in the pipeline so I don’t think we are arbitrarily hitting the timeout; also, I see @firat made the timeout 10h and still the problem manifested.

I’ll update when I find out more.

rrrrrok · 17 August 2023 13:26

Just another data point here - I tested this with another project and switching to the conda-forge channel seems to have helped…
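
The thread does not show what the channel switch looked like, so the following is only a sketch under the assumption that the project declares its channels in environment.yml. The `list_channels` helper is hypothetical and uses a plain text match rather than a YAML parser; the example file is written to a temporary directory so that no real project file is touched, and the `python` dependency is a placeholder.

```shell
#!/bin/sh
# Print the channels declared in an environment.yml.
# Plain text matching only; assumes a simple block-style "channels:" list.
list_channels() {
    sed -n '/^channels:/,/^[^ #-]/p' "$1" | sed -n 's/^ *- *//p'
}

# Hypothetical example file that resolves packages from conda-forge only.
workdir="$(mktemp -d)"
cat > "$workdir/environment.yml" <<'EOF'
name: base
channels:
  - conda-forge
dependencies:
  - python
EOF

channels="$(list_channels "$workdir/environment.yml")"
echo "$channels"
```

Running `list_channels environment.yml` in a project shows which channels it currently declares; if conda-forge is not among them, adding it under `channels:` as in the example file is one way to try the switch.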
