I have successfully deployed Renku on my Kubernetes cluster with two nodes. However, I am unable to add my GPUs to Renku, as shown in the attached screenshot.
Could you kindly help me resolve this issue? I’d greatly appreciate your support.
@ALQATF have you added the GPUs to the nodes in your k8s cluster, and are they accessible to k8s workloads running on those nodes? If so, you can combine node affinities and tolerations in the resource classes of a resource pool with taints on the nodes in k8s, so that certain resource classes schedule only on certain nodes in your cluster.
The documentation around how to create resource pools or classes that use specific nodes is still missing.
But the hardest part is usually making the GPUs usable by Kubernetes pods running in the cluster. We cannot provide documentation for this because it depends on too many things (e.g. the GPU hardware, the hardware that runs the cluster, the Kubernetes version, cloud or bare metal, permissions, networking setup, etc.). If you have added the GPUs to nodes in your cluster and confirmed that they are usable by Kubernetes pods, you can taint those nodes in k8s, add the matching toleration to a specific resource class in Renku, and add an affinity for those nodes to the same class.
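To make the taint, toleration and affinity combination described above more concrete, here is a minimal Python sketch of what the pieces look like at the Kubernetes level. The taint key `gpu`, its value `true` and the node label `gpu-node=true` are hypothetical names chosen for illustration, and the sketch does not show how the Renku admin panel names these fields in a resource class; it only shows the scheduling settings that a pod would need to carry in the end.

```python
import json

# Hypothetical taint and node label, chosen only for illustration.
GPU_TAINT = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
GPU_NODE_LABEL = {"key": "gpu-node", "value": "true"}


def toleration_for(taint):
    """Return a Kubernetes toleration that matches the given taint exactly."""
    return {
        "key": taint["key"],
        "operator": "Equal",
        "value": taint["value"],
        "effect": taint["effect"],
    }


def required_node_affinity(label):
    """Return an affinity that only allows nodes carrying the given label."""
    expression = {
        "key": label["key"],
        "operator": "In",
        "values": [label["value"]],
    }
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{"matchExpressions": [expression]}]
            }
        }
    }


def tolerates(toleration, taint):
    """Simplified match check that only handles the 'Equal' operator."""
    return (
        toleration["operator"] == "Equal"
        and toleration["key"] == taint["key"]
        and toleration["value"] == taint["value"]
        and toleration["effect"] == taint["effect"]
    )


# The scheduling part of a pod spec that may run on the tainted GPU nodes.
pod_scheduling = {
    "tolerations": [toleration_for(GPU_TAINT)],
    "affinity": required_node_affinity(GPU_NODE_LABEL),
}

print(json.dumps(pod_scheduling, indent=2))
```

With these example names, the node side would be set up with `kubectl taint nodes <node-name> gpu=true:NoSchedule` and `kubectl label nodes <node-name> gpu-node=true`. The taint keeps other workloads off the GPU nodes, while the affinity keeps the GPU resource class from landing anywhere else.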
@ALQATF I think you should also try to run a simple Pod or Deployment that uses those GPUs on one of the nodes, to confirm that the drivers and everything else needed are present on the nodes. If that does not work, then running anything on Renku will not work either.
Once you are certain the GPUs are usable from within a K8s pod or deployment, you need to do the following:
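A quick way to run such a test is a one-off Pod that requests a GPU and runs `nvidia-smi`. The sketch below builds that manifest in Python and prints it as JSON, which `kubectl` accepts. It assumes NVIDIA GPUs exposed through the NVIDIA device plugin (resource name `nvidia.com/gpu`); the pod name and the image tag are assumptions too, so swap in whatever CUDA image fits your drivers.

```python
import json


def gpu_test_pod(name="gpu-smoke-test",
                 image="nvidia/cuda:12.4.1-base-ubuntu22.04",
                 gpus=1):
    """Build a Pod manifest that requests GPUs and runs nvidia-smi once."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Run once and stop, so the logs can be inspected afterwards.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "cuda",
                    "image": image,  # assumed image, adjust to your setup
                    "command": ["nvidia-smi"],
                    # Assumes the NVIDIA device plugin resource name.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


print(json.dumps(gpu_test_pod(), indent=2))
```

If this is saved as, say, `gpu_pod.py`, then `python gpu_pod.py | kubectl apply -f -` followed by `kubectl logs gpu-smoke-test` should print the GPU table when everything is wired up. If the pod stays Pending or `nvidia-smi` fails, the problem is in the cluster setup rather than in Renku. If the GPU nodes are already tainted, the pod also needs a matching toleration.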
Make your Renku account an administrator
Log out and back in
Go to the admin panel and create the resource pools and classes
Assign users that can access those resource pools you just created
To make yourself an admin do the following:
Navigate to https://<where-renku-is-installed>/auth
Log in as the Keycloak admin: the username is admin, and the password can be found in the Kubernetes secret named keycloak-password-secret
Change the Keycloak realm to Renku
Go to the users page and find your own user
Assign the renku-admin role to the user
Log out of the Keycloak admin panel
Log out of Renku
Log back in
If you click on the profile outline in the upper right corner after logging in, you will see an Admin panel option where you can create and manage resource pools.
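For the Keycloak admin password mentioned in the steps above: values in a Kubernetes secret are base64-encoded, so they need decoding before use. The sketch below is one way to read them with Python and `kubectl`. The namespace is an assumption (use wherever Renku is installed), and since the key name inside keycloak-password-secret is not stated here, the helper simply decodes every key it finds.

```python
import base64
import json
import subprocess


def decode_secret_data(secret):
    """Decode every base64 value in a Secret's .data mapping."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in secret.get("data", {}).items()
    }


def read_secret(name, namespace):
    """Fetch a Secret with kubectl and return its decoded data."""
    raw = subprocess.run(
        ["kubectl", "get", "secret", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return decode_secret_data(json.loads(raw))
```

For example, `read_secret("keycloak-password-secret", "renku")` would return a dictionary of the decoded entries, assuming Renku lives in a namespace called `renku`.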
We have definitely had cases where kubectl describe nodes showed that the nodes had GPUs, but the GPUs could not be used or seen from Pods running on those nodes.
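If you want to check what the nodes advertise without reading through the full describe output, a small sketch like the one below can summarize it from `kubectl get nodes -o json`. It assumes the GPUs show up under the `nvidia.com/gpu` resource name, which depends on the device plugin in use. As noted above, a non-zero count here only means the node advertises GPUs; it does not prove that Pods can actually use them.

```python
import json
import subprocess


def allocatable_gpus(node_list, resource="nvidia.com/gpu"):
    """Map each node name to the number of GPUs it reports as allocatable."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes without the resource simply do not list it, so default to 0.
        result[node["metadata"]["name"]] = int(allocatable.get(resource, "0"))
    return result


def fetch_nodes():
    """Return the parsed output of `kubectl get nodes -o json`."""
    raw = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(raw)
```

Calling `allocatable_gpus(fetch_nodes())` on a machine with cluster access gives one entry per node. Any node that shows GPUs here still needs the Pod-level test before it is used from Renku.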