Renku session on own machine with GPUs

I would like to run Renku on my own machine (as described in Renku on your Own Machine — Renku documentation), but I’m having trouble getting the GPUs to work.

On the host machine I have installed Docker, the NVIDIA drivers, and the nvidia-container-toolkit. If I run the Renku CUDA image myself, the GPUs are recognised:

rawlik_m@magnifico ~/r/gict-of-human-breast (master)> docker run --gpus all renku/renkulab-cuda:11.7-55d6c10 nvidia-smi
cat: ./.ssh/authorized_keys: input file is output file
Thu Apr  6 16:24:48 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:15:00.0 Off |                  N/A |
| 35%   29C    P8    20W / 260W |      5MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:2D:00.0 Off |                  Off |
| 30%   39C    P8    16W / 230W |     75MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I have changed the FROM directive in my project’s Dockerfile to use the Renku CUDA image:

# RENKU_VERSION determines the version of the renku CLI
# that will be used in this image. To find the latest version,
# visit https://pypi.org/project/renku/#history.
ARG RENKU_VERSION=2.3.2

# Install renku from pypi or from github if a dev version
RUN if [ -n "$RENKU_VERSION" ] ; then \
        source .renku/venv/bin/activate ; \
        currentversion=$(renku --version) ; \
        if [ "$RENKU_VERSION" != "$currentversion" ] ; then \
            pip uninstall renku -y ; \
            gitversion=$(echo "$RENKU_VERSION" | sed -n "s/^[[:digit:]]\+\.[[:digit:]]\+\.[[:digit:]]\+\(rc[[:digit:]]\+\)*\(\.dev[[:digit:]]\+\)*\(+g\([a-f0-9]\+\)\)*\(+dirty\)*$/\4/p") ; \
            if [ -n "$gitversion" ] ; then \
                pip install --no-cache-dir --force "git+https://github.com/SwissDataScienceCenter/renku-python.git@$gitversion" ;\
            else \
                pip install --no-cache-dir --force renku==${RENKU_VERSION} ;\
            fi \
        fi \
    fi
#             End Renku install section                #
########################################################

FROM renku/renkulab-cuda:11.7-0769e3b

# Uncomment and adapt if code is to be included in the image
# COPY src /code/src

# Uncomment and adapt if your R or python packages require extra linux (ubuntu) software
# e.g. the following installs apt-utils and vim; each pkg on its own line, all lines
# except for the last end with backslash '\' to continue the RUN line
#
# USER root
# RUN apt-get update && \
#    apt-get install -y --no-install-recommends \
#    apt-utils \
#    vim
# USER ${NB_USER}

# install the python dependencies
COPY requirements.txt environment.yml /tmp/
RUN mamba env update -q -f /tmp/environment.yml && \
    /opt/conda/bin/pip install -r /tmp/requirements.txt --no-cache-dir && \
    mamba clean -y --all && \
    mamba env export -n "root" && \
    rm -rf ${HOME}/.renku/venv

COPY --from=builder ${HOME}/.renku/venv ${HOME}/.renku/venv

But when I start the session with renku session start --port 8888 and connect to it, the command nvidia-smi is not found.

Renku version: 2.3.2.dev20+g35f9d22

Hi @mrawlik - the renku session start command doesn’t do anything other than run the image with all the flags it needs to set up the container (port, volume mounts, and command). Are you sure it uses the correct image when you do that? My first guess would be that it’s somehow not using the image based on the Renku CUDA image. Did you commit that change before running renku session start?

Hi @rrrrrok I am absolutely sure. I just checked:

rawlik_m@magnifico ~/r/gict-of-human-breast (master)> renku session start --force-build --port 8889                                                                                                                          (base)
Image registry.renkulab.io/stamplab/gict-of-human-breast:a59aca5 built successfully.

a59aca5 is the hash of the newest commit. In that Renku session I cannot access the GPUs:

 base ▶ ~ ▶ work ❯ gict-of-human-breast ▶ # ▶ nvidia-smi
bash: nvidia-smi: command not found

Does renku pass --gpus all to docker?

I just checked: running that very image manually with --gpus all works fine, but without it I get the same nvidia-smi: command not found as inside the Renku session.
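For reference, here is a minimal sketch of what docker run --gpus all adds to the container configuration in the Docker Engine API. The field names come from the API's HostConfig schema; it is shown as a plain dict rather than docker-py's docker.types.DeviceRequest so it runs without the docker package installed. This is presumably the request renku would need to attach when starting the session container.

```python
# Sketch of the Docker Engine API payload behind `docker run --gpus all`:
# the CLI adds a DeviceRequest entry to the container's HostConfig.
# docker-py exposes the same structure as docker.types.DeviceRequest.
device_request = {
    "Count": -1,                # -1 is the sentinel for "all GPUs"
    "Capabilities": [["gpu"]],  # the capability `--gpus all` requests
}
host_config = {"DeviceRequests": [device_request]}

print(host_config)
```

Running a container without any DeviceRequests entry is exactly the situation above: no GPU devices are injected, so nvidia-smi is not found inside the container.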

rawlik_m@magnifico ~/r/gict-of-human-breast (master)> docker run --gpus all registry.renkulab.io/stamplab/gict-of-human-breast:a59aca5 nvidia-smi                                                                            (base)
cat: ./.ssh/authorized_keys: input file is output file
Tue Apr 11 07:34:40 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:15:00.0 Off |                  N/A |
| 35%   29C    P8    19W / 260W |      5MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:2D:00.0 Off |                  Off |
| 30%   38C    P8    15W / 230W |     75MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
rawlik_m@magnifico ~/r/gict-of-human-breast (master)> docker run registry.renkulab.io/stamplab/gict-of-human-breast:a59aca5 nvidia-smi                                                                                       (base)
cat: ./.ssh/authorized_keys: input file is output file
/entrypoint.sh: line 60: nvidia-smi: command not found

Good point! It almost certainly does not. We’ll need to add that. Thanks for reporting this; it should be an easy fix in the next release.

Fantastic! As this is blocking for our group at the moment, I’d very much like to keep track of it. Could you post the GitHub issue once it is made?

Also, do you think a general option for passing arbitrary flags to docker might be a good idea? I would certainly find it useful. There was a similar situation with the lack of a --port flag.

Did you use the --gpu flag for renku session start? It can be passed all or a number, and it sets the docker-py DeviceRequest.

You can also set it permanently for a project using e.g. renku config set interactive.gpu_request all.

I tried with a number, but I got an error:

Traceback (most recent call last):
  File "[...]/renku/ui/cli/exception_handler.py", line 132, in main
    return super().main(*args, **kwargs)
  File "[...]/renku/ui/cli/exception_handler.py", line 91, in main
    return super().main(*args, **kwargs)
  File "[...]/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "[...]/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "[...]/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "[...]/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "[...]/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "[...]/renku/ui/cli/session.py", line 276, in start
    session_start_command().with_communicator(communicator).build().execute(
  File "[...]/renku/command/command_builder/command.py", line 250, in execute
    output = self._operation(*args, **kwargs)  # type: ignore
  File "pydantic/decorator.py", line 40, in pydantic.decorator.validate_arguments.validate.wrapper_function
  File "pydantic/decorator.py", line 134, in pydantic.decorator.ValidatedFunction.call
  File "pydantic/decorator.py", line 206, in pydantic.decorator.ValidatedFunction.execute
  File "[...]/renku/core/session/session.py", line 161, in session_start
    provider_message, warning_message = provider_api.session_start(
  File "[...]/renku/core/session/docker.py", line 280, in session_start
    result = session_start_helper(consider_disk_request=True)
  File "[...]/renku/core/session/docker.py", line 201, in session_start_helper
    docker.types.DeviceRequest(count=[gpu_request], capabilities=[["compute", "utility"]])
  File "[...]/docker/types/containers.py", line 190, in __init__
    raise ValueError('DeviceRequest.count must be an integer')
ValueError: DeviceRequest.count must be an integer

I dug through the documentation to see what the possible values are, but couldn’t find them; I didn’t know all was possible. I made an issue for it: Possible values for --gpu flag in renku session start · Issue #3388 · SwissDataScienceCenter/renku-python · GitHub

When starting the session with --gpu all it works! Thank you very much!

That does look like a bug; we probably need to cast the value from string to int. Thank you for reporting it!
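For what it’s worth, the fix could be as simple as normalising the flag value before building the DeviceRequest. A sketch with a hypothetical helper name (parse_gpu_request is not actual renku code), assuming docker-py’s convention that count=-1 means "all GPUs":

```python
def parse_gpu_request(value: str) -> int:
    """Map the --gpu flag value to a DeviceRequest.count integer.

    Hypothetical helper, not real renku code: 'all' becomes -1
    (docker-py's "all GPUs" sentinel), and a numeric string is cast
    to int, avoiding the 'DeviceRequest.count must be an integer'
    error from the traceback above.
    """
    if value == "all":
        return -1
    return int(value)

print(parse_gpu_request("all"))
print(parse_gpu_request("2"))
```

The resulting integer would then be passed as docker.types.DeviceRequest(count=..., capabilities=[["compute", "utility"]]) instead of the raw string.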