Feedback from new user

Hi everyone,
first I would like to say that I find the idea of renku very cool and I’m very excited about the project. I have been working with renku for a few days now and have some notes/questions. I have been trying to move projects to renku, but it has been giving me quite a hard time :) I think the overall issue is that, while the concepts work well in theory, real repos are much messier and require prototyping and sandboxing. If the repo is not super clean and thought through from the beginning, it is often very complicated to get the desired results. I’m sure a lot of the issues I had can be resolved easily if I only knew how, so please let me know if my approach was wrong (and maybe add it to the docs :)). I was told you are keen to hear feedback, so I hope the following helps you (please let me know if I should submit this somewhere else):

  • I think it should be possible to run multiple “renku run” sessions in parallel (multiple terminals). I would even argue that this could be quite essential. Or at least the documentation should state very clearly that this is not possible, or maybe the CLI should warn about it directly. As far as I can tell, the outcome of such a parallel session is just unknown (only one workflow will be recorded?). This is super annoying.

  • Renku run won’t detect an output if the file already exists and is only updated. I have to test my code before I run renku run, so the output file usually already exists. Before running the code again I have to delete all output files first so that renku will detect them. Couldn’t renku also detect when an existing file changes?

  • Can’t tag output parameters with renku workflow iterate. I need to run a plan multiple times. My script generates a different output file depending on its parameters (put simply, it downloads data from a remote database to local files). If I auto-generate the output file name in the script, it won’t work, because renku can’t detect that the output file name has changed slightly between the first workflow recording and the re-execution with different parameters. A solution here is to also add the output file name as an input parameter; then I can simply specify the output file name for each execution. I need to iterate over a set of parameters, for which renku workflow iterate would be perfect. But I don’t want a “grid search”-like execution, so I wanted to use tags. Sadly, tagging does not work with output parameters. The CLI fails with Error: Invalid parameter value - The value of 'output-3@tag1' parameter is neither a list nor templated variable!. Why?

  • No untracked files before running renku run. While I see that this makes sense in a perfectly clean repo, it is super tedious in a more realistic, real-world setting. Often I have local test outputs, maybe some test scripts where I experiment, that I don’t want to push to the repo. The solution I found is to just put everything in .gitignore, but it is still annoying.

  • Unsetting parameters from plans. I may want to add/remove a flag from a script/plan (upon re-execution via execute or iterate). This is currently not possible and I had to hack my way around it.

  • The documentation is a bit limited and could really benefit from more examples. I am working with someone from the SDSC, so luckily I had direct support. E.g. the section “Specifying in/outs programmatically” was not clear to me at all and could really benefit from examples.

  • renku log just dumps the entire log to the console. This is a bit annoying, since what I immediately see after running the command is the first-ever activity, but what I’m probably more interested in is the most recent activity. To see that I have to scroll up a lot. I think it should use paging and initially just show the most recent activity. Check git log for a good example.

I often had to adapt my code/scripts to be able to run them via renku workflows. My use case is probably not the most common (dataset creation), but I still think renku should move towards a more seamless integration with existing projects/tools. Otherwise the time overhead is just too much. Let me know if you need any more details or comments, I’ll be happy to help you improve renku!


One more;)

  • Logging from workflow execution: I currently log to files. Naturally I don’t want the logs to be tracked. If I execute a workflow, I would still like to be able to inspect those logs, but this is not possible: either I specify the log file as an output file, which means it is tracked, or I don’t, which means I will not have access to it. Is there a solution to this?

Thank you very much for all the feedback, it’s super valuable to see what issues users run into in the wild and helps us a lot with our development.

I’ll try to address each point you raised in turn.

I think it should be possible to run multiple “renku run” sessions in parallel (multiple terminals). I would even argue that this could be quite essential. Or at least the documentation should state very clearly that this is not possible, or maybe the CLI should warn about it directly. As far as I can tell, the outcome of such a parallel session is just unknown (only one workflow will be recorded?). This is super annoying.

As mentioned in the other thread you posted last week, we do generally allow running multiple renku run (and some other) commands in parallel using the --isolation flag, which runs the command in an isolated git worktree (preventing parallelism issues like concurrent file access). The big issue here is that we recently moved to a better-optimized way of storing Renku metadata, which git cannot easily merge, and we don’t want our users to have to deal with the internals of our metadata in order to merge things themselves. To this end, we recently wrote a custom git mergetool, which should make it into our regular release this Friday, at which point --isolation should work again as expected, merging together the generated metadata.
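Once that release is out, a parallel setup would look roughly like this (the script names are just placeholders):

# terminal 1
renku run --isolation python preprocess.py
# terminal 2
renku run --isolation python train.py

Each invocation then works in its own isolated worktree, and the recorded metadata gets merged back afterwards.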

Renku run won’t detect an output if the file already exists and is only updated. I have to test my code before I run renku run, so the output file usually already exists. Before running the code again I have to delete all output files first so that renku will detect them. Couldn’t renku also detect when an existing file changes?

This is on purpose, as we don’t allow circular workflow graphs. We do detect file modifications, e.g. if you rerun the same workflow with different parameters, we pick up the change. But in the case of a file that exists already, we don’t know if the file was both read from and written to, or if it was appended to. Both of these cases represent a file being both an input and an output at the same time, which would cause a circular dependency. Circular dependencies are not allowed as we have renku update that updates all outdated outputs based on modified inputs, and a circular dependency would end up with the project being perpetually out of date/an infinite regression. So not having circular dependencies is a hard requirement of Renku.

That said, we are considering ways to at least alleviate the issue you are facing. We want to allow users to delete activities after the fact, so they could renku run repeatedly and delete executions that they aren’t happy with when they’re just playing around. We could also consider deleting explicit output files (specified using --output myfile) before executing the renku run. The only way I see of doing it automatically, without relaxing the requirements for strict reproducibility, would be to have two passes for renku run: executing once to detect modified files, then deleting those and executing again to record the metadata. But that would double the execution time of commands, which might not be something users want.
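In the meantime, the manual workaround looks something like this (file and script names made up):

rm -f results/summary.csv
renku run --output results/summary.csv python analyze.py

i.e. remove the stale output yourself before the run and declare it explicitly with --output so it is recorded as an output.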

Can’t tag output parameters with renku workflow iterate. I need to run a plan multiple times. My script generates a different output file depending on its parameters (put simply, it downloads data from a remote database to local files). If I auto-generate the output file name in the script, it won’t work, because renku can’t detect that the output file name has changed slightly between the first workflow recording and the re-execution with different parameters. A solution here is to also add the output file name as an input parameter; then I can simply specify the output file name for each execution. I need to iterate over a set of parameters, for which renku workflow iterate would be perfect. But I don’t want a “grid search”-like execution, so I wanted to use tags. Sadly, tagging does not work with output parameters. The CLI fails with Error: Invalid parameter value - The value of 'output-3@tag1' parameter is neither a list nor templated variable!. Why?

Could you give an example of the command you tried? Tags are specifically for providing lists of values, of the same length, for multiple parameters and going through them in lockstep: the first value of param-a’s list is used together with the first value of param-b’s list (with both sharing the same tag), the second value of param-a’s list with the second value of param-b’s list, and so on. It should work for outputs as well. The error sounds like you didn’t specify a list of values for the output.
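As a rough sketch (plan and parameter names made up), lockstep iteration with a shared tag would look like:

renku workflow iterate --map param-a@tag1='[1,2,3]' --map param-b@tag1='["a.out","b.out","c.out"]' my-plan

so iteration one uses param-a=1 together with param-b=a.out, iteration two uses 2 with b.out, and so on, instead of the cross product of the two lists.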

No untracked files before running renku run. While I see that this makes sense in a perfectly clean repo, it is super tedious in a more realistic, real-world setting. Often I have local test outputs, maybe some test scripts where I experiment, that I don’t want to push to the repo. The solution I found is to just put everything in .gitignore, but it is still annoying.

We couldn’t do this before due to technical reasons, but we recently switched large pieces of the code base to a better approach. I’ve created this issue to investigate this at some point. For now, what works for a lot of users is to just git stash before running renku run and then git stash pop afterwards.
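Concretely (my_script.py being a placeholder), that workaround is just:

git stash --include-untracked
renku run python my_script.py
git stash pop

(--include-untracked is needed because a plain git stash leaves untracked files in place.)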

Unsetting parameters from plans. I may want to add/remove a flag from a script/plan (upon re-execution via execute or iterate). This is currently not possible and I had to hack my way around it.

Yes, this is a design choice on our side: if there are differing parameters, it’s not really the same workflow anymore in our view. It would become confusing to have two executions of a Plan that did completely different things, e.g. if you’re a third party inspecting a repo to understand what happened. There are many command line tools whose behavior changes completely depending on which flags are specified, to the point that it’s essentially a different program being executed, and we want to keep the history clear and understandable.

There are also some technical considerations with undefined answers: e.g. if I do renku run script.py --flag1="a" my_file and then edit the plan so the command becomes script.py --flag2="b" <output>, what should happen if I do renku rerun my_file or renku update my_file? Do I want the old Plan to be executed or the new one? And in the latter case, what should the value for flag2 be?

We do allow users to change default values for parameters, but I think this is on a spectrum, from not being able to modify a Plan at all to being able to completely change the command that is executed, like changing a plan from mv a b to git log --pretty=.... The latter clearly doesn’t make sense, as it’d be an entirely different command and the concept of a Plan would become pretty useless at that point. So when we had this discussion, we decided to draw the line at changing default values (and things like the description) of Plans, but not adding/removing inputs/outputs/parameters.
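For completeness, changing a default value (as opposed to adding or removing a parameter) can be done either per execution or on the Plan itself, roughly like this (plan and parameter names made up, and please double-check the exact flags against renku workflow execute --help and renku workflow edit --help):

renku workflow execute --set parameter-1=42 my-plan
renku workflow edit --set parameter-1=42 my-plan

The first overrides the value for a single execution, the second changes the stored default.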

The documentation is a bit limited and could really benefit from more examples. I am working with someone from the SDSC, so luckily I had direct support. E.g. the section “Specifying in/outs programmatically” was not clear to me at all and could really benefit from examples.

Noted.

renku log just dumps the entire log to the console. This is a bit annoying, since what I immediately see after running the command is the first-ever activity, but what I’m probably more interested in is the most recent activity. To see that I have to scroll up a lot. I think it should use paging and initially just show the most recent activity. Check git log for a good example.

You can do renku log | less, which is essentially what git does internally. It’s a pretty new command and not polished at all. I’ve created an issue for this.

Logging from workflow execution: I currently log to files. Naturally I don’t want the logs to be tracked. If I execute a workflow, I would still like to be able to inspect those logs, but this is not possible: either I specify the log file as an output file, which means it is tracked, or I don’t, which means I will not have access to it. Is there a solution to this?

You can use --no-output-detection so renku doesn’t automatically detect outputs, and specify your desired tracked outputs manually using --output [name=]path.
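For example (paths made up), something like

renku run --no-output-detection --output data/result.csv python fetch.py

should record only data/result.csv as an output, while a log file your script writes (and that you keep in .gitignore) stays untracked but still readable on disk.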

Hi Julian! Thanks so much for your detailed feedback! I am particularly interested to hear more about your experience adapting your code to the renku workflow system, and what other projects and tools you use or would like to see renku integrate with. Would you be open to setting up a call to chat? You can reach out to me at laura.kinkead at sdsc.eth.ch

Thanks for your responses. I’m sorry for my long silence, I have been focussing on other stuff but now I’m back and will check all your listed solutions.

renku workflow iterate --map parameter-2@tag1=[1,2] --map output-3@tag1=["first.out","second.out"] parliamentdb_fetch
This results in the error:

Id: /plans/9674c3ac93c14127aed18703421aaea0
Name: parliamentdb_fetch
Command: python conversion/database_fetch.py 38 data/parliamentdb/LP38_nv_c_data.pickle -n -c
Success Codes:
Inputs:
        - input-1:
                Default Value: conversion/database_fetch.py
                Position: 1
        - input-353b:
                Default Value: conversion/democrasci_iterator.py
                Position: None
Outputs:
        - output-3:
                Default Value: data/parliamentdb/LP38_nv_c_data.pickle
                Position: 3
Parameters:
        - parameter-2:
                Default Value: 38
                Position: 2
        - parameter-4:
                Default Value: -n
                Position: 4
        - parameter-5:
                Default Value: -c
                Position: 5
Error: Invalid parameter value - The value of 'output-3@tag1' parameter is neither a list nor templated variable!

You need to write it like renku workflow iterate --map parameter-2@tag1=[1,2] --map output-3@tag1='["first.out","second.out"]' parliamentdb_fetch (ideally you’d also use '[1,2]' quoted, but that works fine without quotes).

Otherwise your shell mangles the value (the quotes around the file names get stripped), so renku doesn’t see it as a list.

Oh, that’s super stupid of me, I could have seen that :D Thanks a lot!