I am trying to run some PyTorch code that uses GPUs. I have configured my environment with 2 GPUs, and I made sure the code can run on a GPU by testing it on my own computer. When running, I get the following error that I cannot solve.
Ohh, that is actually a strange error considering you get correct output from nvidia-smi. In any case, could you paste here the first lines of your Dockerfile (in the repository root)? Perhaps the environment is not correct…
Also, have you tried on a notebook just importing torch, and then:
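Something along these lines should show what torch sees; this is a minimal sketch (the helper name cuda_diagnostics is just for illustration, and it degrades gracefully if torch is missing):

```python
def cuda_diagnostics():
    """Collect basic information about torch's view of the GPUs."""
    info = {}
    try:
        import torch
    except ImportError:
        # torch is not installed at all in this environment
        info["torch_installed"] = False
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    info["n_gpu"] = torch.cuda.device_count()
    if info["cuda_available"]:
        # Only query device names when CUDA is actually available,
        # otherwise get_device_name raises an error.
        info["device_names"] = [
            torch.cuda.get_device_name(i) for i in range(info["n_gpu"])
        ]
    return info

print(cuda_diagnostics())
```

If n_gpu comes back as 0 while nvidia-smi sees the cards, the PyTorch build inside the image is likely CPU-only, or the CUDA runtime is not visible to the container.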
Hello!
Thank you very much for your reply.
I just tried what you suggested and got the following output, so I think the code is not finding the GPUs: n_gpu is zero,
and torch.cuda.get_device_name(0) raises an error.
So maybe something is wrong with my environment setup. My Dockerfile is the following:
I also stopped the previous environment I had and created a new one, but still the problem exists.
Shall I create another environment or just wait longer? (I see you are referring to a Docker environment, but I am not sure what this is.)
Hi @anisioti - when you are editing the file in your running session, please make sure the Dockerfile is permanently saved in your repository by committing and pushing the file to the server. The easiest way to do this is simply opening a terminal in JupyterLab and running renku save.
I took the liberty to have a look at your project and I noticed that the format of your conda environment.yml is slightly wrong - the dependencies need to be listed like this:
dependencies:
- torch
- torchvision
… and so on. Once you fix this, the image should be able to build properly (you have a failing job right now). Hope that helps!
Thank you very much and sorry for the weird mistakes, I haven’t worked with this kind of file before.
I have one more question. I see that we are working with Python version 3.7, but for some code I need to run I will need Python 3.6 with GPU support in order to install all the requirements. Is there a way to change this setting? I see the choices for the Docker image in the GitHub README, but I cannot figure out what should go in the first lines of the Dockerfile of the new environment I will need to make.
No problem at all, the syntax is definitely not super obvious
It is possible to install a different python version in the image, but it’s not entirely trivial - are you absolutely sure you need python 3.6? If it’s a must, @mitch has done this in the past, I believe… maybe he has an example he can share.
Regarding the environment.yml first: it is best to install torch and torchvision from the pytorch channel (see https://pytorch.org/). You can specify a channel that differs from the defaults by simply prefixing the package with it, e.g. - pytorch::pytorch (note that the conda package is named pytorch, while torch is the pip name), so the environment definition will become:
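Put together, a sketch of the resulting environment.yml could look like this (the environment name is hypothetical, and note that on the pytorch channel the conda package is named pytorch, while torch is the pip name):

```yaml
# Illustrative sketch of an environment.yml pulling packages from the pytorch channel
name: my-env          # hypothetical environment name
dependencies:
  - pytorch::pytorch
  - pytorch::torchvision
```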
Regarding the Python version, I personally use pytorch=1.7.1 with Python 3.7 (py3.7_cuda10.1.243_cudnn7.6.3_0). I think it is best to stick to that… @rrrrrok I did downgrade it once, but it is kind of messy, since renku wants Python 3.7, which is the version provided in the Docker image. It was much simpler to use 3.7 or even 3.8; torch now provides builds for a variety of CUDA and Python combinations. And I only ever did that in a local environment, never within a renkulab one. I would just pin all package versions to make sure that all packages are happy with Python 3.7.
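A fully pinned version of such an environment.yml could look like the sketch below; the pytorch pin comes from the build string mentioned above, while the torchvision and cudatoolkit pins are my assumptions about the matching versions:

```yaml
# Illustrative pinned environment; check https://pytorch.org/ for the exact combo you need
dependencies:
  - python=3.7
  - pytorch::pytorch=1.7.1
  - pytorch::torchvision=0.8.2
  - cudatoolkit=10.1
```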
@anisioti may I ask which package requires Python 3.6? Maybe I can help more!