I am trying to run some PyTorch code that uses GPUs. I have configured my environment with 2 GPUs, and I verified that the code can run on a GPU by testing it on my own computer. When running it I get the following error, which I cannot solve.
Do you have any suggestions on how I can handle this error?
Thank you very much
Hi @anisioti !!
Ohh, that is actually a strange error, considering you get a correct output from
nvidia-smi. In any case, could you paste here the first lines of your
Dockerfile (the one in the repository root)? Perhaps the environment is not set up correctly…
Also, have you tried, in a notebook, just importing
torch and then running:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
There you could really assess whether your code will correctly find the GPUs.
I hope this helps! But as said, please let us know how it goes.
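Putting those two checks together with a device-name lookup, a minimal diagnostic sketch (the if-guard around get_device_name is my addition, since that call raises an error when no GPU is visible):

```python
import torch

# Use the GPU if torch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
print(f"device: {device}, visible GPUs: {n_gpu}")

# get_device_name(0) only works when at least one GPU is visible,
# so guard the call to avoid an exception on CPU-only setups.
if n_gpu > 0:
    print(torch.cuda.get_device_name(0))
```

If this prints `cpu` and 0 inside your session, the container image rather than your code is the place to look.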
Thank you very much for your reply.
I just tried what you suggested and got the following output, so I think the code is not finding the GPUs: n_gpu is zero,
and with torch.cuda.get_device_name(0) I am getting an error.
So maybe something is wrong with my environment setup. My Dockerfile is the following:
Then, I believe we found it! In the
Dockerfile, the Renku base image is not a CUDA-enabled one. Please try changing the first two lines to:
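The exact image name and tag depend on the Renku template in use - the line below is a placeholder, to be replaced with the CUDA-enabled image listed in the template README:

```dockerfile
# Placeholder: substitute the CUDA-enabled Renku image and tag
# from the template README here.
FROM renku/renkulab-cuda:<tag>
```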
Wait for GitLab to build the new Docker image, and try again! I bet it will work now. And if not, we are here to help.
I changed the two lines of the Dockerfile and now it looks like this:
I also stopped the previous environment I had and created a new one, but the problem still exists.
Shall I create another environment or just wait longer? (I see you are referring to a Docker environment, but I am not sure what this is.)
Thank you very much again!
Hi @anisioti - when you are editing the file in your running session, please make sure the Dockerfile is permanently saved in your repository by committing and pushing the file to the server. The easiest way to do this is to open a terminal in JupyterLab and commit and push from there.
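The Git commands would look something like this (the commit message is just an illustration):

```shell
# Run from the repository root in a JupyterLab terminal.
git add Dockerfile
git commit -m "Use a CUDA-enabled base image"
git push
```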
I took the liberty of having a look at your project and noticed that the format of your conda
environment.yml is slightly wrong - the dependencies need to be listed like this:
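For example, with the packages mentioned in this thread (a sketch; use whatever package list your project actually needs):

```yaml
dependencies:
  - torch
  - torchvision
```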
… and so on. Once you fix this, the image should be able to build properly (you have a failing job right now). Hope that helps!
I hadn’t noticed that I should put a - before each dependency.
It worked indeed!
Thank you very much!
Hi @anisioti it’s actually still not quite right - at the moment your
environment.yml looks like this:
# - add packages here
# - one per line
which still only contains the template comments, so conda is not actually asked to install
torch, torchvision etc. The packages should be listed there explicitly, one per line with a leading -.
The reason they got installed anyway is because you also have them in the
requirements.txt file - they only need to be specified in one place.
Thank you very much, and sorry for the weird mistakes - I haven’t worked with this kind of file before.
I have one more question. I see that we are working with Python 3.7, but for some code I need to run I will need Python 3.6 with GPU support in order to install all the requirements. Is there a way to change this setting? I see the choices for the Docker image in the GitHub README, but I cannot understand what should be written in the first lines of the Dockerfile of the new environment I will need to create.
No problem at all, the syntax is definitely not super obvious
It is possible to install a different Python version in the image, but it’s not entirely trivial - are you absolutely sure you need Python 3.6? If it’s a must, @mitch has done this in the past, I believe… maybe he has an example he can share.
Hi All, sorry if I am late to the party.
Let me address the environment.yml first. It is best to install torch and torchvision from the pytorch channel (see https://pytorch.org/). For this, one can specify a channel that differs from
defaults by simply adding the channel name before the package to be installed, e.g.
- pytorch::pytorch, so the environment definition will become:
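Something along these lines, presumably (the version pins are illustrative, taken from the build mentioned just below):

```yaml
channels:
  - defaults
dependencies:
  - python=3.7
  - pytorch::pytorch=1.7.1
  - pytorch::torchvision
```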
Regarding the Python version, I personally use
pytorch=1.7.1 with Python 3.7 (
py3.7_cuda10.1.243_cudnn7.6.3_0 ). I think it is best to stick to that. @rrrrrok I did downgrade it once, but it is kind of messy, as Renku wants Python 3.7, which is the version provided in the Docker image. It was much simpler to use 3.7 or even 3.8; torch now provides builds for a variety of CUDA and Python combinations. Also, I only ever did that in a local environment, never within a renkulab one. I would just pin all package versions to make sure that every package is happy with Python 3.7.
@anisioti may I ask which package requires Python 3.6? Maybe I can help more!