I was working in a session on SV-Renku when I suddenly lost the connection to it for a little while. A minute later I could connect again, but the same session had reverted to a state three commits in the past. All the unpushed work was gone (it was not a lot of work, so no real harm done).
I was working with the parallel package in R. I doubt it has any link with the issue, but I mention it just in case.
Any idea why this happened?
Hi @bopekno, it is a bit hard to troubleshoot this without access to the cluster or a few more details.
I have seen similar behaviour on Renku deployments that use a specific type of volume for user session disk storage, and the SV-Renku deployment falls in this category. In your case the disk storage for user sessions is part of the nodes that make up the cluster where Renku is running. Initially we used this type of storage without imposing any limits, but at some point we introduced limits. The problem with these limits is that they are not obvious to the user. Even with these limits in place, if you were to run `df -h ./` in your Renku session you would see a lot of free space, regardless of how much storage you requested when you started your session. However, as soon as you consume more than the amount you requested when you started your session, your session is “evicted” from the cluster. This eviction could cause the loss of data as you describe it.
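Since the free space reported by `df` does not reflect the limit, one rough way to keep an eye on it is to measure how much the session directory itself holds and compare that with what you requested. The snippet below is only a sketch under assumptions: the function names, the 1 GB default and the 80% warning threshold are made up for illustration, it sums apparent file sizes rather than disk blocks, and what exactly the cluster counts toward the limit is not covered here, so treat the result as an approximation.

```python
import os


def dir_usage_bytes(path="."):
    """Sum the sizes of all files under `path` (an approximation of `du -s`)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                # lstat does not follow symlinks, so link targets are not counted
                total += os.lstat(full).st_size
            except OSError:
                # The file disappeared or cannot be read; skip it
                pass
    return total


def check_quota(path=".", requested_gb=1.0, warn_fraction=0.8):
    """Return (used_bytes, limit_bytes, near_limit).

    `requested_gb` is a hypothetical parameter: set it to the amount of
    storage you asked for when you started the session.
    """
    used = dir_usage_bytes(path)
    limit = requested_gb * 1024**3
    return used, limit, used >= warn_fraction * limit


if __name__ == "__main__":
    used, limit, near_limit = check_quota(".", requested_gb=1.0)
    print(f"used {used / 1024**2:.1f} MiB of {limit / 1024**2:.1f} MiB requested")
    if near_limit:
        print("Warning: close to the requested storage, consider cleaning up or pushing work")
```

Running this from the top of the project directory every now and then, and pushing commits regularly, should reduce the amount of work at risk if the session does get evicted.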
Now we did not implement the eviction - that is just part of the Kubernetes cluster that is used to run Renku. After we saw this behaviour we did roll out a feature that enables a deployment to essentially ignore these limits when the same type of storage as the SV deployment is used. This feature was rolled out to the SV deployment recently. But I think that it may have been rolled out after you experienced this problem.
Anyhow, I hope this sheds some light on the problem. Let me know if you have any questions.