Workflow iterate: output naming

Hi,

I’m testing renku workflow iterate in a test project. The workflow takes a file and a parameter as input and creates a file from them:

Id: /plans/3743bc29ee7a4ba78c08489a15b8b48b
Name: test_iter
Command: Rscript --vanilla script.R --input_folder data/input1 --input_param 1 --model_name test
Success Codes: 
Inputs: 
        - vanilla-1:
                Default Value: script.R
                Position: 1
                Prefix: --vanilla 
        - input_folder-2:
                Default Value: data/input1
                Position: 2
                Prefix: --input_folder 
Outputs: 
        - output-0639:
                Default Value: data/output/test.txt
                Position: None
Parameters: 
        - input_param-3:
                Default Value: 1
                Position: 3
                Prefix: --input_param 
        - model_name-4:
                Default Value: test
                Position: 4
                Prefix: --model_name

I’ve tried renku workflow iterate in several manners to create multiple outputs, without success.

Case 1

I don’t force any naming of the output and simply iterate over the parameters, which are stored in param.yaml:

renku workflow iterate --mapping param.yaml test_iter

But only 1 file is created, although I specified 3 variations of the parameter.

Case 2

I try to specify explicitly the output with {iter_index}

renku workflow iterate --mapping param.yaml test_iter --map "output-0639=data/output/test_{iter_index}.txt"

But CWL stops, telling me that
Did not find output file with glob pattern: '['data/output/test_0.txt']'.", {}).

Case 3

I try to name the outputs inside my script, based on the parameters used:

renku run --name test_iter2 Rscript --vanilla script2.R --input_folder input1 --input_param 1 --model_name test
renku workflow iterate --mapping param.yaml test_iter2

But again, CWL cannot find file with glob pattern: '['data/output/testinput11.txt']'.", {})

Summary

So I’m a bit puzzled about how I should use renku workflow iterate. I guess something is wrong with how I define the outputs, but I tried to follow the examples as closely as possible (although I’m working with R here). Ultimately, I would like to:

  • iterate over parameters
  • iterate over input files
  • map the outputs to a specific name

Hi there.
In your Case 2, you should probably do --map output-0639="data/output/test_{iter_index}.txt" (note the differing position of the quotes).

But the main issue is that your script always writes to the same place: write.table(out, file = paste0("data/output/", model_name, ".txt")). Renku does not rename the output for you; it just passes the name of an expected output to your script.

So you could either accept the output path as an additional parameter and use that as the filename when writing, or, since you write based on the model-name parameter, you could do --map output-0639="data/output/test_{iter_index}.txt" --map model_name-4="test_{iter_index}". With your existing code, it’d then write to e.g. data/output/test_0.txt, and that would be picked up as the proper output of the command.

Essentially what you’re doing at the moment is telling Renku “I expect an output at data/output/test_0.txt”, but then your script writes to data/output/test.txt and Renku goes “hey, you said there’d be a file at data/output/test_0.txt but I don’t see any such file created by your script”.

I hope that makes sense.
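
To make the mismatch concrete, here is a small Python sketch of the idea. It is only a model of the behaviour described above, not Renku's or CWL's actual code: expand and check_iterations are made-up helper names, and script_output_path simply mirrors the paste0(...) call in script.R.

```python
def expand(template: str, iter_index: int) -> str:
    """Fill in the {iter_index} placeholder (illustrative only, not Renku's code)."""
    return template.replace("{iter_index}", str(iter_index))


def script_output_path(model_name: str) -> str:
    """Mirror of script.R: paste0("data/output/", model_name, ".txt")."""
    return f"data/output/{model_name}.txt"


def check_iterations(n: int, output_template: str, model_name_template: str):
    """For each iteration, compare the declared output with what the script writes."""
    results = []
    for i in range(n):
        expected = expand(output_template, i)
        written = script_output_path(expand(model_name_template, i))
        results.append((expected, written, expected == written))
    return results


# Case 2: only the output is mapped, the script still writes test.txt every time.
mismatch = check_iterations(3, "data/output/test_{iter_index}.txt", "test")

# Suggested fix: map model_name-4 as well, so both sides agree.
match = check_iterations(
    3, "data/output/test_{iter_index}.txt", "test_{iter_index}"
)

for expected, written, ok in mismatch + match:
    print(f"expected={expected} written={written} ok={ok}")
```

In the first run every iteration expects test_0.txt, test_1.txt, ... but the script keeps writing test.txt, which is the situation behind the glob error. In the second run the two paths line up.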

You can also do all of it in the param.yaml, like

input_param-3: ["1", "2", "3"]
model_name-4: "test_{iter_index}"
output-0639: "data/output/test_{iter_index}.txt"

And we recently added new functionality that will be in the next release that allows relative references, so once we release that, you can do:

input_param-3: ["1", "2", "3"]
model_name-4: "test_{input_param-3}"
output-0639: "data/output/{model_name-4}.txt"

to have your files named based on the parameter instead of just an incrementing index.
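
As a rough illustration of how such relative references could resolve per iteration, here is a Python sketch. The semantics are assumed from the example above rather than taken from the Renku implementation, and resolve and PLACEHOLDER are hypothetical names.

```python
import re

# Matches {name} placeholders, including names with dashes like input_param-3.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def resolve(mapping: dict, iter_index: int) -> dict:
    """Resolve one iteration of a mapping (assumed semantics, not Renku's code)."""
    # Lists contribute their iter_index-th element, plain strings are templates.
    values = {
        key: (value[iter_index] if isinstance(value, list) else value)
        for key, value in mapping.items()
    }
    # Substitute repeatedly so chained references (output -> model_name -> param)
    # settle; unknown placeholders are left untouched.
    for _ in range(len(values) + 1):
        context = dict(values, iter_index=str(iter_index))
        updated = {
            key: PLACEHOLDER.sub(
                lambda m: context.get(m.group(1), m.group(0)), text
            )
            for key, text in values.items()
        }
        if updated == values:
            break
        values = updated
    return values


mapping = {
    "input_param-3": ["1", "2", "3"],
    "model_name-4": "test_{input_param-3}",
    "output-0639": "data/output/{model_name-4}.txt",
}

resolved = [resolve(mapping, i) for i in range(3)]
for row in resolved:
    print(row["model_name-4"], "->", row["output-0639"])
```

With this mapping the outputs are named after the parameter value (test_1, test_2, test_3) rather than after the iteration index.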


Thank you for your reply Ralf! The following worked very well:

renku workflow iterate --mapping param.yaml test_iter --map output-0639="data/output/test_{iter_index}.txt" --map model_name-4="test_{iter_index}"

I now understand the problem.

I’m facing a new challenge, though: I’m trying to iterate over inputs (data/input* folders) with something like this:

renku run --name test_iter_data --output data/output/test.txt --input data/input1 --input data/input2 Rscript --vanilla script.R --input_folder data/input1 --input_param 1 --model_name test

I’m setting the two data folders as inputs so that they get imported by CWL when I run the iteration, but the workflow looks like this:

Success Codes: 
Inputs: 
        - vanilla-1:
                Default Value: script.R
                Position: 1
                Prefix: --vanilla 
        - input_folder-2:
                Default Value: data/input1
                Position: 2
                Prefix: --input_folder 
        - input-2c3a:
                Default Value: data/input2
                Position: None
Outputs: 
        - output-7140:
                Default Value: data/output/test.txt
                Position: None
Parameters: 
        - input_param-3:
                Default Value: 1
                Position: 3
                Prefix: --input_param 
        - model_name-4:
                Default Value: test
                Position: 4
                Prefix: --model_name 

From what I understand, the problem here is that input_folder is recognized as an input and not as a parameter. Is there a way to have an argument treated as both an input and a parameter? Or do you see an alternative way of running it?

It can be either an input (file/folder) or a parameter (essentially a string), but not both. The main difference is that if it’s an input, Renku can watch its content and check whether any output is out of date, and it copies it to the temporary directory where the CWL is run (neither of which happens for a parameter).

Of course, you can always manually declare it as a parameter by using --param data/input1 if that’s what you want, and if you use absolute paths (although that’s not that nice) then it doesn’t matter that the workflow gets executed in a temporary directory (no copying/linking necessary).

But may I ask what the issue with it being an input is? You can also use a list for an input, or template it with {iter_index}; it then gets resolved for each run and the resolved path gets copied by CWL.

By the way, you can separate your run command using -- to make it clearer what is a parameter for Renku and what is a parameter for your script, like renku run --input data/input1 -- Rscript script.R --input something. Without the -- this would be ambiguous, but this way it’s clear that the first --input is meant for Renku and the second is meant for your script. Not really relevant here, though, I just find it helps me read the command when I work with Renku.

Ok, got it! I could set up the data folders as inputs and simply set --input_folder input1 to iterate over it later on.

My problem with data/input1 being an input is that I wanted to use it as a parameter and, as you said, it can either be an input or a parameter.

And thanks for the tip about --, I didn’t know it could be used as a separator. Very useful to know!

Yes.

And just to make it clear, in your yaml you can do

input_folder: ["data/input1", "data/input2", "data/input3"]

with input_folder being an input (not a parameter) and everything should work.

And that way, if e.g. data/input2's contents were to change in the future, Renku would detect that and a renku update --all would rerun that part of the pipeline, without rerunning the one for input1 or input3. And this is something that wouldn’t work if it was a parameter.
And of course it tracks that this specific version of input2 was used to create the output, so someone else could later reconstruct exactly which file was used (we store the hash of each input file at the moment of execution, something we don’t do for a parameter as to us that’s just a string).
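
The idea of detecting a changed input from a stored hash can be sketched like this in Python. This is just an illustration under assumptions: SHA-256 and the in-memory byte strings stand in for whatever Renku actually records, and content_hash and stale_inputs are made-up names.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash of an input's content (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(data).hexdigest()


def stale_inputs(recorded: dict, current: dict) -> list:
    """Names of inputs whose current content no longer matches the recorded hash."""
    return sorted(
        name
        for name, data in current.items()
        if recorded.get(name) != content_hash(data)
    )


# Content at the time of the original runs (bytes stand in for folder content).
at_execution = {
    "data/input1": b"a,b\n1,2\n",
    "data/input2": b"a,b\n3,4\n",
    "data/input3": b"a,b\n5,6\n",
}
recorded = {name: content_hash(data) for name, data in at_execution.items()}

# Later, only data/input2 is modified.
now = dict(at_execution)
now["data/input2"] = b"a,b\n3,40\n"

to_rerun = stale_inputs(recorded, now)
print(to_rerun)
```

Only the run that consumed data/input2 shows up as stale, which is the behaviour described for renku update --all. A plain string parameter has no content to hash, so nothing comparable can be detected for it.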

Speaking of changes to the workflow: what if I later add another input (say data/input4 in our example)? I’ve tried it, but renku update --all will only update tracked inputs. Is there a way of adding a new input and rerunning the workflow iterate while ignoring the combinations that were already computed?

You could just renku workflow execute with that new value (or run a new renku workflow iterate if there’s more than one, with only those new values).

But there’s no way to “extend” a previous iterate.
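
Since the bookkeeping is left to you, one option is to work out the new values yourself before calling Renku. A tiny Python sketch follows; new_values and suggested_subcommand are hypothetical helpers, and nothing here calls Renku.

```python
def new_values(previous: list, requested: list) -> list:
    """Values in `requested` that were not part of an earlier iterate, in order."""
    seen = set(previous)
    return [value for value in requested if value not in seen]


def suggested_subcommand(values: list) -> str:
    """One new value: a single execute; several: a fresh iterate over just those."""
    if not values:
        return "nothing to run"
    return "renku workflow execute" if len(values) == 1 else "renku workflow iterate"


already_run = ["data/input1", "data/input2", "data/input3"]
wanted = ["data/input1", "data/input2", "data/input3", "data/input4"]

todo = new_values(already_run, wanted)
print(todo, "->", suggested_subcommand(todo))
```

Here only data/input4 is left over, so a single renku workflow execute with that value would be enough.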