What is the future of reproducibility in Renku?

I just attended the Renku 2.0 launch webinar, and I understand that Renku 2.0 currently has no lineage tracking and that it is unclear whether it ever will again. If this is true, how can we ensure reproducibility of workflows in the future? The SDSC GitHub page still states that “Renku provides a platform and tools for reproducible and collaborative data analysis”, but my impression is that reproducibility has been sacrificed for progress in collaboration and interlinking possibilities. I would find this tragic, as reproducibility has been the main distinguishing feature of Renku and RenkuLab compared to other platforms. Is there a place to collect feedback from the user community about this decision? Or did I misunderstand?

Hi Stan, thanks for your feedback!

What we came to realize some time ago is that “reproducibility” comes in many forms and on many levels. After talking to many researchers over the years, it became clear that the biggest hurdle to working in a more reproducible and reusable way wasn’t lineage tracking, but access to all of the resources needed to carry out a project - data, code, and compute. Workflows and lineage only come into the picture after this is solved - and there are already many tools out there for creating workflows! Renku could not really compete with the feature set of those tools, and it is quite difficult to use the Renku CLI for more complex projects.

So what we’ve decided to focus on is the part that no one is actually solving very well right now, which is connecting the research ecosystem and providing bridges between different kinds of resources and infrastructures. We see it as giving way more freedom to researchers, because they are no longer tied to bringing data into git repositories (which is a very unnatural way to work with data) and Renku actually doesn’t force their hand as much in how they need to work. They are also free to choose where their code lives, which is quite important to a lot of people. By providing a simple way for them to connect up all the resources required for a project (including bespoke containerized compute environments), we are encouraging and enabling reproducibility because it means that it is simple for anyone to engage with the data, code, and results of a project in a repeatable way. If users want to track lineage or use a workflow tool inside their project they are still very much encouraged to do so - you can certainly keep using the Renku CLI in your projects if it meets your needs!
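As a rough illustration only (this is not a Renku feature, and every name below, such as `run_step`, `is_current`, and the log file, is hypothetical), lightweight lineage tracking inside a project could be sketched with nothing but the Python standard library: hash the inputs, run the command, hash the outputs, and append the record to a log.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_step(command, inputs, outputs, log_path):
    """Run a command and append one lineage record to a JSON Lines file.

    Hypothetical helper for illustration; not part of Renku or any library.
    """
    record = {
        "command": list(command),
        "started": datetime.now(timezone.utc).isoformat(),
        # Hash inputs before the command runs, in case it modifies them.
        "inputs": {str(p): sha256_of(p) for p in inputs},
    }
    # check=True raises CalledProcessError if the command fails,
    # so failed steps never end up in the log.
    subprocess.run(command, check=True)
    record["outputs"] = {str(p): sha256_of(p) for p in outputs}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record


def is_current(record):
    """Return True if every recorded input and output file is unchanged on disk."""
    files = {**record["inputs"], **record["outputs"]}
    return all(
        Path(p).exists() and sha256_of(p) == digest
        for p, digest in files.items()
    )
```

A call such as `run_step(["python", "clean.py"], ["raw.csv"], ["clean.csv"], "lineage.jsonl")` (file names assumed) would log one step, and `is_current` can later tell you whether that step needs to be rerun. A dedicated workflow tool will do far more than this; the sketch only shows the basic idea.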

I hope that makes sense and I hope you’ll give Renku 2.0 a try - the experience is so much better than before!

Thanks, @rrrrrok! I don’t see how removing the option to bring data into a project and offering soft links to data instead improves the reproducibility of research, but I guess I have to try it out and see. I would really like to see usage examples for Renku 2.0 that cover data sourcing, code development, intermediate data products, final data and graphical output, and archiving for later use and reproduction, including whatever workflow system can be usefully integrated.
On my side, some of us have invested a lot of time trying to use renku-python, papermill, etc. for this purpose, as recommended in the Renku 1 documentation, and I was hoping that Renku 2.0 would expand on these possibilities rather than drop the development altogether. But you are right, there is no point in reinventing the wheel, so if there are existing ways of tracking lineage, it would be really great to see some examples. Otherwise, we have to start from scratch. I hope someone can point the readers of this topic to examples. Thanks in advance!
