Guide: Running long ML training jobs without interruptions

If one of your long-running training processes has ever died because of a Renku timeout, this guide is for you!

Renku keeps a session active while it is using CPU, but the front-end application (JupyterLab/VSCode) may still stop long-running processes. To prevent this, we recommend running your training inside tmux, a terminal multiplexer whose sessions keep running after you disconnect from the terminal. tmux is installed by default in all Renku global environments and code-based environments.

:test_tube: Workflow

  1. Open your session terminal in Renku.
  2. Start a new tmux session: tmux
  3. Run your script inside the tmux session, e.g.: python train.py

You can now safely close your browser window. Your model will keep training in the background until completion.
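
If you want to check on the run after reopening your session, a named tmux session makes it easier to find again. The helpers below are a minimal sketch, not part of Renku: the session name "train", the log file "train.log", and the function names are all hypothetical choices, and the script name train.py is taken from the example above.

```shell
# Sketch only: the session name and log file are hypothetical choices,
# not Renku defaults.
SESSION="train"
LOGFILE="train.log"

# Start the training script in a detached, named tmux session.
# Output is also written to $LOGFILE, because tmux closes the session
# (and its scrollback) once the command exits.
start_training() {
    tmux new-session -d -s "$SESSION" "python train.py 2>&1 | tee $LOGFILE"
}

# Report whether the session still exists, i.e. whether the command
# started above is still running.
training_status() {
    if tmux has-session -t "$SESSION" 2>/dev/null; then
        echo "running"
    else
        echo "finished or not started"
    fi
}

# Reattach to watch live output; detach again with Ctrl+b, then d.
reattach_training() {
    tmux attach -t "$SESSION"
}
```

After pasting these into the session terminal, start_training launches the run in the background, and training_status or reattach_training can be used once you come back. If the session has already ended, the log file still holds the output of the run.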

:open_book: Full documentation: Handle long training runs