No NVIDIA driver on the system found

anisioti · 6 April 2021 15:40

Hello,

I am trying to run some PyTorch code that uses GPUs. I have configured my environment with 2 GPUs, and I made sure the code can run on a GPU by testing it on my own computer. When I run it here, I get the following error, which I cannot solve.

Do you have any suggestions on how I can handle this error?

Thank you very much
Athina

lusamino · 6 April 2021 16:01

Hi @anisioti !!

Ohh, that is actually a strange error considering you get a correct output from nvidia-smi. In any case, could you paste here the first lines of your Dockerfile (located in the project root)? Perhaps the environment is not correct…

Also, have you tried on a notebook just importing torch, and then:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()  # 0 means PyTorch sees no GPU
torch.cuda.get_device_name(0)  # raises an error if no GPU is visible

That way you can check whether your code actually finds the GPU.
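
A slightly more defensive version of this check, as a minimal sketch: it avoids the error from torch.cuda.get_device_name(0) when no GPU is visible, and it also reports torch.version.cuda, which is None when the installed PyTorch build has no CUDA support. The function name cuda_report and the dictionary keys are made up for illustration.

```python
def cuda_report():
    """Collect PyTorch's view of the GPUs without raising if none are found."""
    report = {
        "torch_installed": False,
        "built_with_cuda": None,   # CUDA version PyTorch was built against
        "cuda_available": False,
        "devices": [],
    }
    try:
        import torch
    except ImportError:
        return report  # PyTorch is not installed in this environment

    report["torch_installed"] = True
    report["built_with_cuda"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Only query device names when at least one GPU is usable.
        for index in range(torch.cuda.device_count()):
            report["devices"].append(torch.cuda.get_device_name(index))
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

A built_with_cuda of None would point to a CPU-only PyTorch build rather than to a driver problem.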

I hope this helps! But as said, please let us know how your Dockerfile looks.

Cheers
Luis

anisioti · 6 April 2021 16:15

Hello!
Thank you very much for your reply.
I just tried what you suggested and got the following output, so I think the code is not finding the GPUs: n_gpu is zero, and torch.cuda.get_device_name(0) raises an error.

So maybe something is wrong with my environment setup. My Dockerfile is the following:

Best,
Athina

lusamino · 6 April 2021 16:18

Hi @anisioti

Then I believe we found it! In the Dockerfile, the Renku base image is not a CUDA-enabled one. Please try replacing the first two lines with:

ARG RENKU_BASE_IMAGE=renku/renkulab-cuda10.0-tf1.14:0.7.4
FROM ${RENKU_BASE_IMAGE}

Wait for GitLab to build the new Docker environment, and try again! I bet it will work now. And if not, we are here to help.
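
To check which base image a Dockerfile currently requests, the ARG line can be read programmatically. The sketch below is only a heuristic: it assumes that CUDA-enabled images carry "cuda" in their name, as the one above does, and the helper names base_image and looks_cuda_enabled are hypothetical.

```python
import re


def base_image(dockerfile_text):
    """Return the default value of ARG RENKU_BASE_IMAGE, or None if it is absent."""
    match = re.search(
        r"^\s*ARG\s+RENKU_BASE_IMAGE=(\S+)", dockerfile_text, re.MULTILINE
    )
    return match.group(1) if match else None


def looks_cuda_enabled(image):
    """Heuristic (an assumption, not a Renku rule): CUDA images have 'cuda' in the name."""
    return image is not None and "cuda" in image.lower()


# The two lines suggested above, used here as sample input.
dockerfile = """ARG RENKU_BASE_IMAGE=renku/renkulab-cuda10.0-tf1.14:0.7.4
FROM ${RENKU_BASE_IMAGE}
"""

image = base_image(dockerfile)
print(image, looks_cuda_enabled(image))
```

In a session, the text would come from the Dockerfile in the project root, for example via open("Dockerfile").read().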

Cheers
Luis

anisioti · 6 April 2021 16:51

Hello!
I changed the 2 lines of the Dockerfile and now it looks like this:

I also stopped the previous environment and created a new one, but the problem persists.
Should I create yet another environment, or just wait longer? (I see you are referring to a Docker environment, but I am not sure what that is.)

Thank you very much again!
Athina

rrrrrok · 6 April 2021 19:47

Hi @anisioti - when you are editing the file in your running session, please make sure the Dockerfile is permanently saved in your repository by committing and pushing the file to the server. The easiest way to do this is simply opening a terminal in JupyterLab and running renku save.

I took the liberty to have a look at your project and I noticed that the format of your conda environment.yml is slightly wrong - the dependencies need to be listed like this:

dependencies:
  - torch
  - torchvision

… and so on. Once you fix this, the image should be able to build properly (you have a failing job right now). Hope that helps!

anisioti · 7 April 2021 09:50

I hadn’t noticed that I should put a - before each dependency.
It worked indeed!
Thank you very much!

rrrrrok · 7 April 2021 11:07

Hi @anisioti it’s actually still not quite right - at the moment your environment.yml looks like this:

name: "base"
channels:
  - defaults
# dependencies:
# - add packages here
  - torch 
  - torchvision
  - tqdm
  - numpy
  - pandas
  - time
  - sklearn
  - subword_nmt
# - one per line
prefix: "/opt/conda"

which says that it should use channels torch, torchvision etc. It should be like this (with dependencies uncommented):

name: "base"
channels:
  - defaults
dependencies:
  - torch 
  - torchvision
  - tqdm
  - numpy
  - pandas
  - time
  - sklearn
  - subword_nmt
prefix: "/opt/conda"

The reason they got installed anyway is that you also have them in the requirements.txt file. They only need to be specified in one place.
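
The difference between the two files above can also be checked mechanically. The following is a minimal sketch using only the standard library; top_level_lists is a hypothetical helper that handles flat files like this one and is not a general YAML parser.

```python
def top_level_lists(text):
    """Map each top-level key to the '- item' entries listed under it.

    Comments are dropped first, so a commented-out key does not count.
    """
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # strip comments and trailing spaces
        if not line.strip():
            continue
        if not line.startswith((" ", "-")) and ":" in line:
            current = line.split(":", 1)[0].strip()  # a new top-level key
            sections[current] = []
        elif line.lstrip().startswith("- ") and current is not None:
            sections[current].append(line.lstrip()[2:].strip())
    return sections


broken = """name: "base"
channels:
  - defaults
# dependencies:
# - add packages here
  - torch
  - torchvision
  - tqdm
  - numpy
  - pandas
  - time
  - sklearn
  - subword_nmt
# - one per line
prefix: "/opt/conda"
"""

# Uncommenting the key is the only change needed.
fixed = broken.replace("# dependencies:", "dependencies:")

for label, text in (("broken", broken), ("fixed", fixed)):
    sections = top_level_lists(text)
    print(label, "channels:", sections.get("channels"))
    print(label, "dependencies:", sections.get("dependencies"))
```

With the key commented out, every package name ends up in the channels list and there is no dependencies entry at all, which matches the behaviour described above.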

anisioti · 7 April 2021 11:40

Thank you very much and sorry for the weird mistakes, I haven’t worked with this kind of file before.

I have one more question. I see that we are working with Python 3.7, but for some code I need to run I will need Python 3.6 with GPU support in order to install all the requirements. Is there a way to change this setting? I see the choices for the Docker image in the GitHub README, but I do not understand what should be written in the first lines of the Dockerfile for the new environment I will need to create.

rrrrrok · 7 April 2021 11:47

No problem at all, the syntax is definitely not super obvious

It is possible to install a different python version in the image, but it’s not entirely trivial - are you absolutely sure you need python 3.6? If it’s a must, @mitch has done this in the past, I believe… maybe he has an example he can share.

mitch · 7 April 2021 12:10

Hi All, sorry if I am late to the party.

Regarding the environment.yml first: it is best to install torch and torchvision from the pytorch channel (see https://pytorch.org/). To do this, you can name a channel other than defaults by putting it before the package to be installed, e.g. - pytorch::pytorch (the conda package for torch is called pytorch), so the environment definition becomes:

name: "base"
channels:
  - defaults
dependencies:
  - pytorch::pytorch
  - pytorch::torchvision
  - tqdm
  - numpy
  - pandas
  - time
  - sklearn
  - subword_nmt
prefix: "/opt/conda"

Regarding the python version, I personally use pytorch=1.7.1 with python 3.7 (py3.7_cuda10.1.243_cudnn7.6.3_0 ). I think it is best to stick to that… @rrrrrok I did downgrade it once, but it is kinda messy as renku wants python 3.7 which is the one provided in the docker image. It was much simpler to use 3.7 or even 3.8, torch now provides versions for a variety of CUDA and python combos. And I never used that within a renkulab environment, it was a local env. I would just pin all package versions to make sure that all packages are happy with python 3.7.
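
One way to get started with pinning, as a rough sketch: list the versions that are already installed in a working environment and copy them into the dependency file. The helper name pinned and the package list are only examples.

```python
import sys

try:
    from importlib import metadata  # available from Python 3.8
except ImportError:
    import importlib_metadata as metadata  # backport package for Python 3.7


def pinned(names):
    """Return 'name==version' lines for those packages that are installed."""
    lines = []
    for name in names:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            pass  # not installed in this environment, so nothing to pin
    return lines


print("python", ".".join(map(str, sys.version_info[:3])))
for line in pinned(["torch", "torchvision", "tqdm", "numpy", "pandas"]):
    print(line)
```

The output uses the requirements.txt form with ==. In environment.yml a conda pin is written with a single =, as in pytorch=1.7.1 above.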

@anisioti may I ask which package requires Python 3.6? Maybe I can help more!

Cheers all!
