Long renku migration

Hello, I am trying to migrate this repo to the newest Renku version:

https://renkulab.io/projects/remko.nijzink/budyko

But this takes extremely long: it ran for a couple of days and didn’t finish. Is there a way I could speed this up?

Hi!

What renku version are you trying to update to? 0.16.2?

I now have 0.16.1.post1. Would it be easier with 0.16.2?

Unfortunately, 0.16.2 is just as slow as 0.16.1.post1.
Renku goes through all git commits recursively when migrating/updating, which can take a very long time on projects with a lot of commits and workflows.

We are at the moment getting ready to release a new version of renku-python (the command line client) that does not depend on git commits anymore and greatly alleviates these problems on large repositories, as well as introducing many other goodies in relation to workflows.

The initial migration for big projects like budyko will still be slow (but maybe 1 day instead of weeks; we tested it on vomcases, which I think is comparable, and that took ~12-18 hours on my machine), but any future updates after that should be really fast.

The release of this new version of renku-python should happen any day now (we’re waiting for 2 PRs to be merged, then it’s finished), but other components might not be as fast to catch up and a full release of the whole renku platform might take longer. What this means is, when we release the new renku-python, it will probably be a while until the platform as a whole supports projects migrated to it, so the UI on e.g. renkulab.io will initially not work with projects that are on the new version.

So you could try it with the current master of renku-python (SwissDataScienceCenter/renku-python on GitHub) if you’re just working locally, and everything should work. However, you would likely have to git reset to before the update and update again once we make a proper release, and the project wouldn’t work for things such as adding datasets through the UI on renkulab.io.

Other than that I can’t offer you a solution. One of the big motivations for this new release was improving the speed on projects such as yours, which just isn’t possible with the way 0.16.2 does things.

That said, if you want to test current master and give the changes we made a try (maybe on a separate branch or copy of the project), your feedback would be really valuable! If you’re interested, you can find the initial draft of the release notes in renku-release-notes-1.0.0.md on the 1.0.0-release-notes branch of SwissDataScienceCenter/renku-python on GitHub.

Ok, thanks for this! That sounds really good indeed! Running it for just a day would be more than fine, compared to what it is now. I’ll give it a try with the current master version then.

FYI: We tried migrating budyko with current master today and there seems to be an issue with a git commit that is a bit out of the ordinary. We’ll look into it on Monday.

So I am also trying, and the first part went okay, with some warnings. Now I have reached step 9:

Applying migration m_0009__new_metadata_storage…

and here the speed drops a lot. There are also warnings here, especially related to the submodules that I use:

Warning: Entity 'src_flexsimple/.git' not found at '212c7c116ac199e7b03b74db65caff5a5a2c8f96'

Yes, step 9 is the slow one. It goes through each commit to gather metadata, and for each commit it sometimes has to search previous commits to gather dependent metadata throughout the project history. This was one of the main bottlenecks that made renku slow in the past. Basically, it had to do what this migration does every time you called renku status or renku update.

This migration changes it so all the data is stored in one place, not scattered through commits, which improves speed a lot. We have to do the long calculation once to convert it to the new format when migrating, and then after that it’s fast. We have also moved away from using commit shas in our metadata, instead using hashes of files, which makes metadata much more robust against rebases. That means the migration has to go through all commits and get the file hashes for the files at that point in time to be consistent, which isn’t that fast.
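To illustrate why file hashes survive a rebase while commit shas do not, here is a minimal sketch. The thread does not say which hashing scheme renku uses, so as an assumption this uses git's own blob id (SHA-1 over a short header plus the file content); the function name is made up for the example.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the object id git assigns to a blob with this content.

    Assumption for illustration only: renku's real metadata may hash
    files differently. The id depends only on the bytes of the file.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The same bytes give the same id, no matter which commit (or rebased
# copy of a commit) the file appears in.
before_rebase = git_blob_hash(b"hello world\n")
after_rebase = git_blob_hash(b"hello world\n")
same_id = before_rebase == after_rebase
```

A commit sha, by contrast, also covers the parent commit, author and timestamps, so it changes whenever history is rewritten even if the file contents are identical.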

The warnings are usually due to renku metadata referring to commits that don’t exist anymore or that don’t contain the information that our old metadata says they should. This could be due to history rewrites or rebases, or in your case mentioned above, it’s due to commit 212c7c116ac199e7b03b74db65caff5a5a2c8f96 being a git submodule commit, not in the main repository being migrated. So renku metadata refers to a file that is in a submodule and it can’t find it at that commit in the main repository.
I’m not sure if pulling/checking out the submodules would help in that case.
The library we use for git, GitPython, is not really maintained anymore (and there’s not a good replacement for it), and it only partially supports submodules. According to the maintainer, using GitPython’s submodules should be avoided as it’s not a complete implementation anymore, and this makes it basically impossible for us to support projects with submodules properly. So unfortunately all we can do is ignore these types of errors, as the library often just returns with “Success” but no result, with no indication that there even was an error. So our hands are tied in these cases.
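If you want to see which paths in the main repository are submodules (and so likely to trigger this kind of warning), a rough sketch using the plain git command line instead of GitPython could look like this. The function names are hypothetical, and the parsing assumes simple paths that git does not need to quote.

```python
import subprocess
from typing import List

GITLINK_MODE = "160000"  # file mode git records for a submodule entry


def is_gitlink(ls_tree_line: str) -> bool:
    """True if one line of `git ls-tree` output describes a submodule."""
    meta, _, _path = ls_tree_line.partition("\t")
    parts = meta.split()
    return len(parts) >= 2 and parts[0] == GITLINK_MODE and parts[1] == "commit"


def submodule_paths(repo_dir: str, rev: str = "HEAD") -> List[str]:
    """List the paths recorded as submodules at `rev` (calls the git CLI)."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "ls-tree", "-r", rev],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line.partition("\t")[2] for line in out.splitlines() if is_gitlink(line)]
```

For a submodule entry, the sha shown by git ls-tree is a commit in the submodule's own repository, which is why it cannot be looked up in the history of the main project.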

Okay, thank you for the explanation! Makes sense! Unfortunately, I had to break it off now (I did it locally and had to restart my PC). Do you know if I can continue the migration from here again? It now says there are untracked files in the repository for .renku/metadata. Could I do git add .renku/metadata and continue?

Unfortunately no, since we don’t keep track of where the migration was at. You’d have to git reset --hard origin/master (or something along those lines, plus removing the untracked .renku/metadata files) to get back to the state before the migrations were started, and then start over.

On a side note, I noticed yesterday that the migration is slower than it used to be and wanted to look into that a bit to maybe speed it up.

Okay, that’s too bad, I’ll try again then!

Unfortunately, the project is still migrating… is there really no option to do this differently? Would it be better to install an older version of renku in an environment to be able to still work with this project?

I’m afraid migration might take a long time depending on the project’s size. You can always use an older version of renku in a virtual env (locally) or in an interactive session, however, be advised that new features/bugfixes won’t be back-ported to older versions.
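As a rough sketch of the virtual env route, using only the Python standard library: the directory name renku-old and the helper names are made up for the example, and the version pin is simply the one mentioned earlier in this thread.

```python
import os
import subprocess
import venv
from pathlib import Path
from typing import List


def pip_install_command(env_dir: str, version: str) -> List[str]:
    """Build the command that installs a pinned renku inside env_dir."""
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    python = Path(env_dir) / bin_dir / "python"
    return [str(python), "-m", "pip", "install", f"renku=={version}"]


def create_pinned_env(env_dir: str, version: str) -> None:
    """Create a fresh virtual env and install the pinned renku into it."""
    venv.create(env_dir, with_pip=True)
    subprocess.run(pip_install_command(env_dir, version), check=True)


# Example (not run here, it needs network access):
# create_pinned_env("renku-old", "0.16.2")
```

Keeping the old client in its own environment means it does not clash with a newer renku installed elsewhere on the same machine.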

I guess it’s the only solution. I have been running it since the 25th of November now, and it is still not even at 10%:

Processing commits 1297/21951 8eb7923942069b98b50e9018169e4addc88db743
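For a sense of scale, the progress line above can be turned into a rough estimate of the remaining time. This is a simple linear extrapolation and assumes every commit takes about the same time, which may not hold; the elapsed time is a placeholder, since the thread only gives the start date.

```python
def estimate_remaining_days(done: int, total: int, elapsed_days: float) -> float:
    """Linearly extrapolate the remaining run time from progress so far."""
    if done <= 0 or elapsed_days <= 0:
        raise ValueError("need some progress and elapsed time to extrapolate")
    commits_per_day = done / elapsed_days
    return (total - done) / commits_per_day


# Numbers from the progress line above: roughly 5.9% of commits processed.
fraction_done = 1297 / 21951

# elapsed_days is hypothetical; fill in the real time since the start.
remaining = estimate_remaining_days(done=1297, total=21951, elapsed_days=10.0)
```

With the placeholder of 10 days elapsed, the estimate comes out at several months, which is why waiting it out does not look practical at this rate.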

Which migration? There are a few of them.

It is just this project:

https://renkulab.io/projects/remko.nijzink/budyko

And I started migrating it with a new version of Renku (the current master) and the Renku version I have installed (0.16.2).

Oh sure, I know it’s that project 🙂 I’m asking which migration because several migrations are needed to go from its current metadata version to the version used by renku 1.0. So when you start the migration, you will see something like this at the top:

Applying migration m_0005__1_pyld2...
Applying migration m_0005__2_cwl...

For me, this migration proceeds relatively quickly (still very slowly, but this is a huge project with tens of thousands of commits). If it’s taking that long for you to get through this first migration, something else might be wrong.

Ah okay, sorry, I misunderstood. It is currently this migration:

Applying migration m_0009__new_metadata_storage…

But I am just wondering if it would be worth being patient and letting it run for some more time, or whether I should just stick with an older version of renku for this specific project.

I’m not sure what to suggest in this case. With an older version of renku, this project is not really usable anyway. What do you hope to do with renku here? If you want to use any of the workflow features, then you definitely need the new version of renku.

What I would try here is re-initialize your project with renku init using renku>1.0. By re-initializing the project, you’ve still got the git history in case you need it, but you should be able to use the updated renku functionality going forward. You will, however, lose the old metadata - can you redo some of the steps taken so far? If there are workflows that need to be recorded, can you record them again? If there are datasets, can you move the data and re-create the datasets?

Okay, that sounds like an idea. Actually, the work in this repo is mostly finished, and we will probably submit a paper related to it soon. So we wanted to publish the repo on Zenodo or similar, and to make it most useful for others, I wanted to migrate it to the newest Renku version. But this repo had many large model runs (which, in hindsight, should have been spread over multiple smaller repositories), so it is also not really possible to redo things here.