Renku on a Slurm cluster

Dear Renku Team,
I would like to set up my own Renku instance on a cluster. I am following the documentation on how to deploy Renku in a cluster:
https://renku.readthedocs.io/en/0.7.4/admin/index.html?highlight=cluster#deploying-the-renku-platform

The first obstacle is the requirement for Kubernetes. I have a Slurm cluster, and Slurm does not support Kubernetes pods. The Slurm page suggests some alternatives, like rootless Kubernetes, and I would like to discuss with you whether you have experience with this type of setup.
https://slurm.schedmd.com/containers.html#k8s

I’d be grateful for your feedback and we can take this conversation offline if you think this is not the right place to discuss this topic.

Thank you very much for your help.

Hi @droqueiro thanks for coming to our forum!

I’m curious to learn a bit more about your use-case - generally, deploying long-running server processes on a Slurm cluster (or any HPC cluster for that matter) through standard batch scheduling is probably not advisable. How did you want to use Renku? Do you need the server-side or did you want to use it to just track provenance? If the latter, you don’t need to deploy anything, you just need to install the renku CLI and you can then push your repository to a public instance like renkulab.io.
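For the provenance-only route, a minimal sketch of what that looks like (project name, script, and file paths below are hypothetical placeholders; it assumes Python with pip and a git remote already configured):

```shell
# Provenance-only setup: no server deployment needed, just the Renku CLI.
# Project name, script and file paths are placeholders.
setup_renku_project() {
  pip install --user renku               # install the Renku CLI
  mkdir my-analysis && cd my-analysis
  renku init                             # initialize a Renku project (a git repo)
  # `renku run` executes the command and records inputs/outputs as provenance
  renku run python analyze.py data/input.csv results/out.csv
  # push to a hosted instance such as renkulab.io to share the repository
  git push origin master
}
```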

Hi Rok,
Thanks for getting back to me.

The goal is to use Renku in the context of our SDSC-funded project “Deep-ephys” (link).

To give you some additional background, we will have a website that allows users to perform data analysis. Users will be able to upload their own data and analyze it on our website. To conduct the analyses, the website will run jobs and these jobs will be executed and managed by a Slurm cluster in the background. The website is only the front-end.

Each possible type of analysis job will be a Renku workflow that we would have previously prepared and tested. My plan is to have our own Renku instance.

The Slurm cluster is already deployed and functional. It consists of CPU and GPU servers. The website is being developed with the Django framework; it is only the front-end. Now my goal is to create our own instance of Renku on the Slurm cluster. I tried to follow the installation guidelines for clusters, version 0.7.4 (link), but I am stuck on the first requirement, which is to have Kubernetes. Slurm does not support Kubernetes.

Do you know of any possible workaround? Perhaps you can suggest a different type of deployment? I’m open to any possibility that would allow us to have our own instance of Renku in this setup.

Thanks again for your help. With kind regards,

Damian.

Hi @droqueiro, apologies for replying late.

Typically, a Slurm cluster and a Kubernetes cluster serve two very different purposes, though in some cases there may be overlap. A Kubernetes cluster would typically be used to orchestrate long-running server processes, like your Django web app together with other services, e.g. an authentication service, a certificate service, etc. I’m not aware of a configuration where a Slurm cluster and a Kubernetes cluster share the same resources, so typically these would need to be separate. That means you would need to provision your own Kubernetes cluster alongside the Slurm cluster if you wanted a stand-alone Renku deployment next to your current setup.

This is a super interesting use-case, though it deviates a bit from the “vanilla” way that a Renku workflow might be used. If I’m understanding correctly, you would like to do something like this:

  1. A (parametrized) workflow using the renku CLI is developed in a Renku project - this will consist of several steps, some being embarrassingly parallel when executed in batch mode.
  2. A user submits data via the web app in the specified format so that it can be consumed by the workflow.
  3. The workflow is shipped to the Slurm cluster and executed - several things need to happen here:
    a. the data is copied to the HPC cluster
    b. the “gold-standard” repository with the workflow is cloned into a temporary space dedicated to this specific workflow execution
    c. the user-provided data is linked into the repository so that it can be tracked self-consistently with the workflow
    d. the pre-compiled workflow is re-executed with the user-provided data using a Slurm-aware workflow engine that can analyze the workflow DAG and execute some steps in parallel
    e. once the execution finishes, the results are collected by a process outside of Renku’s control and sent back to the web app (this could be done via a script triggered by a post-execution hook in Slurm)
    f. the temporary clone of the repository where the execution ran is removed
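The steps above could be sketched as a single Slurm batch script. This is only a single-node sketch: the repo URL, data paths, and result-notification script are all hypothetical, and step (d) would in practice be handled by a Slurm-aware workflow engine rather than a plain `renku update` (whose exact invocation also depends on the CLI version):

```shell
#!/usr/bin/env bash
# Hypothetical wrapper that writes a Slurm job script covering steps (a)-(f).
# The repo URL, data paths, and notify script are placeholders.
write_job_script() {
  cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=renku-batch
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=8

set -euo pipefail
RUN_DIR=$(mktemp -d)                                 # (b) temporary workspace
git clone https://renkulab.io/gitlab/group/gold-standard.git "$RUN_DIR/repo"
cd "$RUN_DIR/repo"
cp /data/uploads/"$UPLOAD_ID"/* data/input/          # (a)+(c) stage user data
renku update                                         # (d) re-execute outdated steps
/opt/webapp/bin/notify-results results/              # (e) send results to the web app
cd / && rm -rf "$RUN_DIR"                            # (f) remove the temporary clone
EOF
}
```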

This is one possibility - I’m assuming here that you (and the user) care only about the final result, meaning that the repository where the workflow was executed does not need to be preserved. In this case, the server side of Renku is only used for storing the repository with the gold-standard workflow, and the majority of Renku-related functionality is handled by the command-line client. In my view, a stand-alone deployment of Renku would not really be necessary here; you could simply use renkulab.io to keep the reference repository (or repositories).

A potential downside of this approach is that the knowledge-graph information about these batch executions on user-provided data will be lost, though you could manually extract it before getting rid of the temporarily cloned repository. A benefit is that you could make the “gold-standard” repo publicly accessible so anyone could clone it and execute the workflows on their own infrastructure if they wanted to.
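For the manual extraction, a sketch: before deleting the clone you could dump the provenance with `renku log` (the output location is arbitrary, and depending on the CLI version there may be machine-readable formats available via a `--format` option):

```shell
# Sketch: archive provenance from a temporary clone before removing it.
# Paths are hypothetical; `renku log` prints the provenance recorded in the repo.
export_provenance() {
  repo_dir="$1"; out_dir="$2"
  mkdir -p "$out_dir"
  ( cd "$repo_dir" && renku log > "$out_dir/provenance.txt" )
}
```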

Let me know what you think! To support this use-case fully we would likely need to develop some additional pieces in the Renku command-line client so it’s good to keep the conversation going to clarify exactly what is needed.

Hi @droqueiro, just following up here with an issue relevant to this discussion: enable batch execution of workflows · Issue #1929 · SwissDataScienceCenter/renku · GitHub - feel free to comment!