2 GPU instance got killed and results are lost

Hi there,

I haven’t logged in to my 2 GPU machine since Thursday as model training was still ongoing.
I wanted to check back in today and save my results but noticed that my instance got killed?!!

How come you are killing machines without prior notice?

Also, I cannot see any autosave branch in the remote repo for some reason. And I don’t get prompted
that there’s unsaved work when starting a new instance.

Any chance I can access my results?

In any case, I try not to get too upset about renku’s recovery system failing when it’s most needed, I just wanted to pass along that it severely affects other people’s work and projects are delayed by more than a week! I know we’re all human, but I’d appreciate it if you guys took a look at the issue. Thank you.

Raphaela

Hi @rwagner, we know that the reliability and user experience of autosaves could and should be much better, and we are working on improving things. However, it will take some time before these improvements are available to users.

I looked for your project to see whether I could restore the data, but I could not find anything.

As for why we shut down sessions, the answer is that we simply have to. Our resources are limited, and if we did not shut down idle sessions we would need essentially unlimited resources to serve our users. This is especially important on a deployment such as limited, where there are only a few GPUs and they are easily taken up. If we never shut down idle sessions, very few users would have access to the GPUs, or users would have to wait a very long time to get one. I know that wait times can be quite long now, but without shutting down idle sessions, getting a session with a GPU would be impossible. There are improvements that could be made in this regard too.

May I recommend that you add something in your code that will call renku save when you have a session that you will leave to run for a very long time? This way you do not have to be exposed to the same risk while we are working on the improvements I mentioned.

For example, if you are using Python, something like this should run renku save regardless of whether your script finishes successfully. I have not had a chance to test the code below myself, so I would give it a quick test before using it with important data.

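import subprocess

try:
    # Placeholders for your own long-running training code.
    some_long_running_function()
    some_other_long_running_function()
finally:
    # Runs whether the code above succeeds or raises, so the work is saved either way.
    # Replace the cwd value with the path to your repository root.
    subprocess.run(["renku", "save"], cwd="path_to_repository_root_folder", check=True)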
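
If a run lasts several days, it may also be worth saving at intervals rather than only at the very end, so that a crash of the whole session loses less work. Below is a minimal, untested sketch of that idea. The names `PeriodicSaver`, `maybe_save` and `renku_save` are made up for illustration and are not part of Renku. The only Renku command it relies on is the `renku save` mentioned above.

```python
import subprocess
import time


def renku_save(repo_root):
    """Run `renku save` in the repository root; raises if the command fails."""
    subprocess.run(["renku", "save"], cwd=repo_root, check=True)


class PeriodicSaver:
    """Hypothetical helper: call save_fn at most once per interval_seconds."""

    def __init__(self, save_fn, interval_seconds, clock=time.monotonic):
        self.save_fn = save_fn
        self.interval_seconds = interval_seconds
        self.clock = clock
        # Start counting from the moment the saver is created.
        self.last_saved = clock()

    def maybe_save(self):
        """Save if the interval has elapsed; return True if a save ran."""
        now = self.clock()
        if now - self.last_saved < self.interval_seconds:
            return False
        self.save_fn()
        self.last_saved = now
        return True
```

In a training loop you could create `saver = PeriodicSaver(lambda: renku_save("path_to_repository_root_folder"), 3600)` once and call `saver.maybe_save()` after each epoch. Saving between epochs rather than from a background thread avoids committing files that are only half written.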

I also looked at our logs and can see that your autosave branch failed to be created. I quickly tried to replicate the error but could not. If I am still unable to replicate the bug that causes this, I will reach out for more information on Tuesday or later.

One more thing to add. The automated session removal is not random. We monitor the resource usage of sessions every few minutes. Sessions are removed only if they have been consistently idle for a period of 24 hours.

Idle means that the session has not been visited by a user AND that its resource usage is essentially zero. So even if you did not visit your session, it would not have been stopped while your scripts/workloads were still running.
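
To make the rule above concrete, here is a rough sketch of how such an idle check could be expressed. This is not the actual Renku implementation. The `Sample` fields, the `usage_threshold` value and the function name `is_idle` are all assumptions for illustration. Only the 24 hour window and the "not visited AND essentially zero usage" condition come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    timestamp: datetime
    cpu_usage: float    # hypothetical metric, e.g. CPU cores in use
    user_visited: bool  # whether a user opened the session around this sample


def is_idle(samples, now, window=timedelta(hours=24), usage_threshold=0.01):
    """Return True only if every sample in the window shows no visit and ~zero usage."""
    recent = [s for s in samples if now - window <= s.timestamp <= now]
    if not recent:
        return False  # no data: do not treat the session as idle
    # A session with less than a full window of history cannot be idle yet.
    if min(s.timestamp for s in samples) > now - window:
        return False
    return all(
        not s.user_visited and s.cpu_usage <= usage_threshold for s in recent
    )
```

Under this sketch, a single sample showing real resource usage or a user visit within the last 24 hours is enough to keep the session alive.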

Thanks for getting back.

I wasn’t aware of the 24h idling && no user login criteria. If I had known before, I would have taken precautionary actions of course.

I’m logging all results in Weights & Biases now, so I no longer rely on autosave.

I’d appreciate any updates regarding the data loss recovery.

Thank you,
Raphaela