Long renku migration

Hello, I am trying to migrate this repo to the newest Renku version:

https://renkulab.io/projects/remko.nijzink/budyko

But this takes extremely long: it ran for a couple of days and didn’t finish. Is there a way I could speed this up?

Hi!

What renku version are you trying to update to? 0.16.2?

I currently have 0.16.1.post1. Would it be easier with 0.16.2?

Unfortunately, 0.16.2 is just as slow as 0.16.1.post1.
Renku goes through all git commits recursively when migrating/updating, which can take a very long time on projects with a lot of commits and workflows.

We are at the moment getting ready to release a new version of renku-python (the command line client) that does not depend on git commits anymore and greatly alleviates these problems on large repositories, as well as introducing many other goodies in relation to workflows.

The initial migration for big projects like budyko will still be slow (but maybe 1 day instead of weeks; we tested it on vomcases, which I think is comparable, and that took ~12-18 hours on my machine), but future updates after that should be really fast.

The release of this new version of renku-python should happen any day now (we’re waiting for 2 PRs to be merged, then it’s finished), but other components might not be as fast to catch up and a full release of the whole renku platform might take longer. What this means is, when we release the new renku-python, it will probably be a while until the platform as a whole supports projects migrated to it, so the UI on e.g. renkulab.io will initially not work with projects that are on the new version.

So you could try it with the current master of renku-python (SwissDataScienceCenter/renku-python on GitHub) if you’re just working locally. Everything should work, but you would likely have to git reset to before the update and update again once we make a proper release, and the project wouldn’t work for things such as adding datasets through the UI on renkulab.io.

Other than that I can’t offer you a solution. One of the big motivations for this new release was improving the speed on projects such as yours, which just isn’t possible with the way 0.16.2 does things.

That said, if you want to test current master and give the changes we made a try (maybe on a separate branch or copy of the project), your feedback would be really valuable! If you’re interested, you can find the initial draft of the release notes in renku-release-notes-1.0.0.md on the 1.0.0-release-notes branch of SwissDataScienceCenter/renku-python on GitHub.

Ok, thanks for this! That sounds really good indeed! Running it for just a day would be more than fine compared to what it is now. I’ll give it a try with the current master version then.

FYI: We tried migrating budyko with current master today and there seems to be an issue with a git commit that is a bit out of the ordinary. We’ll look into it on Monday.

So I am also trying it, and the first part went okay, with some warnings. Now I have reached step 9:

Applying migration m_0009__new_metadata_storage…

and here it slows down a lot. There are also warnings here, especially related to the submodules that I use:

Warning: Entity 'src_flexsimple/.git' not found at '212c7c116ac199e7b03b74db65caff5a5a2c8f96'

Yes, step 9 is the slow one. It goes through each commit to gather metadata, and sometimes has to search previous commits for each commit to gather dependent metadata throughout the project history. This was one of the main bottlenecks that made renku slow in the past. Basically, it had to do what this migration does every time you called renku status or renku update.
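To give a feel for why searching previous commits for each commit gets expensive, here is a toy model in Python. It is only a sketch: the `Commit` tuple, the file names and both functions are made up for illustration and are not renku's data model or code. It compares rescanning older history for every commit with remembering producers in a single dictionary during one pass.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy history: each "commit" records the file it produced and the
# file that output depended on. This is NOT renku's real data model.
Commit = Tuple[str, Optional[str]]  # (output_file, input_file or None)


def backward_search_lookups(history: List[Commit]) -> int:
    """Count commits visited when every commit rescans older history."""
    visited = 0
    for i, (_output, needed) in enumerate(history):
        if needed is None:
            continue
        # Walk back from the previous commit until the producer is found.
        for j in range(i - 1, -1, -1):
            visited += 1
            if history[j][0] == needed:
                break
    return visited


def single_pass_lookups(history: List[Commit]) -> int:
    """Count lookups when producers are remembered in one dict during a single walk."""
    producer_of: Dict[str, int] = {}
    lookups = 0
    for i, (output, needed) in enumerate(history):
        if needed is not None:
            lookups += 1  # one dict lookup instead of a history scan
            _ = producer_of.get(needed)
        producer_of[output] = i
    return lookups


# Worst case for the backward search: everything depends on the very first file.
history: List[Commit] = [("raw.csv", None)] + [
    (f"out_{n}.csv", "raw.csv") for n in range(1, 1000)
]
print(backward_search_lookups(history), single_pass_lookups(history))
```

In this worst case the backward search grows roughly with the square of the number of commits, while the single pass grows linearly. Real projects will sit somewhere in between.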

This migration changes it so all the data is stored in one place, not scattered through commits, which improves speed a lot. We have to do the long calculation once to convert it to the new format when migrating, and then after that it’s fast. We have also moved away from using commit shas in our metadata, instead using hashes of files, which makes metadata much more robust against rebases. But that means the migration has to go through all commits and get the file hashes for the files at that point in time to be consistent, which isn’t that fast.
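As an illustration of why file hashes survive a rebase while commit shas do not, here is a small Python sketch. `git_blob_sha1` reproduces how git itself hashes file content (a blob). Whether renku uses exactly this hash is an assumption, since the post above only says "hashes of files". `toy_commit_id` is a hypothetical, simplified stand-in for a commit id, not git's real commit format.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 that git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def toy_commit_id(parent_id: str, message: str, blob_id: str) -> str:
    """Hypothetical stand-in for a commit id: it also depends on parent and message."""
    payload = f"{parent_id}\n{message}\n{blob_id}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


data = b"hello\n"
file_id = git_blob_sha1(data)

# Same file content committed on top of two different parents, as after a rebase.
before_rebase = toy_commit_id("parent-a", "add data", file_id)
after_rebase = toy_commit_id("parent-b", "add data", file_id)

print(file_id)                        # depends only on the file content
print(before_rebase != after_rebase)  # the commit-like id changed with the parent
```

The file hash depends only on the bytes of the file, so metadata keyed on it stays valid when history is rewritten. Anything keyed on the commit id would need to be looked up again.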

The warnings are usually due to renku metadata referring to commits that don’t exist anymore or that don’t contain the information that our old metadata says they should. This could be due to history rewrites or rebases, or in your case mentioned above, it’s due to commit 212c7c116ac199e7b03b74db65caff5a5a2c8f96 being a git submodule commit, not in the main repository being migrated. So renku metadata refers to a file that is in a submodule and it can’t find it at that commit in the main repository.
I’m not sure if pulling/checking out the submodules would help in that case.
The library we use for git, GitPython, is not really maintained anymore (and there’s not a good replacement for it), and it only partially supports submodules. As per the maintainer, using GitPython’s submodules should be avoided as it’s not a complete implementation anymore, and this makes it basically impossible for us to support projects with submodules properly. So unfortunately all we can do is ignore these types of errors, as the library often just returns with “Success” but no result, with no indication that there even was an error. So our hands are tied in these cases.

Okay, thank you for the explanation! Makes sense! Unfortunately, I had to break it off now (I did it locally and had to restart my pc). Do you know if I can continue the migration from here again? It now says there are untracked files in the repository for .renku/metadata. Could I do git add .renku/metadata and continue?

Unfortunately no, since we don’t keep track of where the migration was at. You’d have to git reset --hard origin/master (or something along those lines) to get back to before the migrations were started, and then start over.
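For anyone who wants to script the reset, here is a minimal Python sketch. The names `restart_plan` and `run_plan` are hypothetical helpers, `origin/master` is only an assumed ref for "before the migrations were started", and the extra `git clean` step is an assumption added here because `git reset --hard` leaves untracked files such as the ones reported under .renku/metadata in place. Both commands discard local changes, so check the printed plan before running it.

```python
import subprocess
from typing import List


def restart_plan(
    base_ref: str = "origin/master",        # assumption: ref from before the migration
    metadata_dir: str = ".renku/metadata",  # untracked directory mentioned above
) -> List[List[str]]:
    """Build the git commands that discard a half-finished migration."""
    return [
        ["git", "reset", "--hard", base_ref],
        # Assumption: also remove the untracked, partially written metadata.
        ["git", "clean", "-fd", "--", metadata_dir],
    ]


def run_plan(plan: List[List[str]], repo_path: str = ".") -> None:
    """Run each command in the repository, stopping at the first failure."""
    for argv in plan:
        subprocess.run(argv, cwd=repo_path, check=True)


# Print the plan first; call run_plan(restart_plan()) only once it looks right.
for argv in restart_plan():
    print(" ".join(argv))
```

Doing the migration on a separate branch or a copy of the project, as suggested earlier in the thread, keeps the original safe if the run gets interrupted again.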

On a side note, I noticed yesterday that the migration is slower than it used to be and wanted to look into that a bit to maybe speed it up.

Okay, that’s too bad, I’ll try again then!