Renku update: cwl error


I’m working on a test project to create and later update a dataset on Renku.

To explain a bit how the dataset is created:

  • The dummy dataset to be saved in the knowledge graph is created by src/dummy_data6.R, which simulates a matrix and some metadata and saves them in data/dummy_data6/.
  • src/load_dataset creates the renku dataset and runs the .R script. If the data are already there, they are updated with renku update.

The first run of src/load_dataset executes fine and saves the dataset in the knowledge graph. However, if I change the metadata of one of the files from the R script, renku update fails, which seems to be related to CWL (see the error message here). I’m a bit puzzled because I use the same scripts with other data, where only the content of the matrix and the metadata change, and there the update runs fine.

Do you know where the error related to renku update could come from, and how I could resolve it?

Thanks in advance for your help!

Hi Anthony.

The problem is that your script uses absolute paths, but renku update uses CWL to re-execute the workflow in a temporary directory, and it expects the outputs to be created in that temporary directory.

I.e. it’s looking for something like /tmp/tmpk6vcyr58/data/dummy_data6/meta_dummy.json that it can copy back to the project directory, but instead the script created /work/project/data/dummy_data6/meta_dummy.json. So it complains that it couldn’t find an expected output in the temporary working directory.
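To illustrate with a minimal shell sketch (not Renku internals, just the temp-dir behavior; file names mirror the ones in this thread): only outputs written via relative paths end up where CWL’s glob will look for them.

```shell
# Simulate CWL's behaviour: the workflow is re-executed in a fresh temp dir.
workdir=$(mktemp -d)          # stand-in for something like /tmp/tmpk6vcyr58
cd "$workdir"

# A script writing to a RELATIVE path creates the file inside the temp dir,
# where CWL can find it and copy it back to the project:
mkdir -p data/dummy_data6
echo '{}' > data/dummy_data6/meta_dummy.json
ls data/dummy_data6/meta_dummy.json

# A script writing to an ABSOLUTE path (e.g. /work/project/data/...) would
# create the file outside $workdir, and CWL would report a missing output.
```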

I hope that helps.



Thanks for your suggestion Ralf!

So I’ve removed the dataset from the project (renku dataset rm), the data files, and the absolute paths from all scripts. However, I still have a problem with CWL (updated log here). Do you maybe know if it is the same error, or if I need to clean the project in another way? (Maybe git reset --hard, but I’m not sure if it takes care of the data uploaded to the knowledge graph…)

Did you record the workflow again? I see a renku run call that only gets executed if workflow_exists = False. So it might be that the recorded workflow still refers to the old version of your script.

The error still happens, unfortunately, although I’ve manually run the renku run step to create a new workflow. I’ve also created a ‘clean copy’ of the project to rerun everything from scratch (with relative paths), but the same error message appears.

Is there any reason why the command wouldn’t produce data/dummy_data6/counts_dummy.mtx.gz when run? Because it seems that file isn’t being created.

I’m not too familiar with R, so I’m not sure I can be of help there, unfortunately.

In the error message, if you check e.g.
("Error collecting output for parameter 'outputs_9d220fb7e262446ea0d9f7dcbe95ce4a':\n../../tmp/tmp97zty416/93fb75a0-1448-48da-b959-58b09adaf565.cwl:53:5: Did not find output file with glob pattern: '['data/dummy_data6/counts_dummy.mtx.gz']'", {})
then /tmp/tmp97zty416/ should be the temporary directory it ran the workflow in. Though CWL is a bit weird: it uses multiple tmp directories (one for the script, one with the outputs, I think) and I never remember which one shows up in the log. You could take a look inside that temp dir to see if you can find anything amiss. Maybe the file exists but is not at data/dummy_data6/counts_dummy.mtx.gz, or maybe it wasn’t generated at all?
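One way to poke around is to check, from inside the temp directory, whether the glob target exists at the expected relative path or whether only an uncompressed intermediate is there. A hedged sketch (the paths mimic this thread; in practice use the /tmp/tmpXXXX directory from your own error message):

```shell
# Fake a CWL temp dir where only the intermediate .mtx was produced.
tmp=$(mktemp -d)              # in practice: the tmp dir shown in the log
mkdir -p "$tmp/data/dummy_data6"
touch "$tmp/data/dummy_data6/counts_dummy.mtx"

# Does the glob target exist at the expected relative path?
ls "$tmp"/data/dummy_data6/counts_dummy.mtx.gz 2>/dev/null \
  || echo "glob target missing"

# Did the file land somewhere else, or under a different name?
find "$tmp" -name 'counts_dummy*'
```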

Also keep in mind that renku run tries to detect inputs automatically, and you can specify them manually as well (I saw you already use --input to declare a manual input). This is not just for provenance; it also lets CWL know which files it has to copy into the temp dir for the command to work. So if your script needs a file, but this file doesn’t show up on the command line (so no automatic detection) and is not specified manually, your command might be failing on update because it’s missing an input.
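To make that concrete with a plain-shell analogy (hypothetical file contents echoing this thread, not actual Renku internals): if only the main script is declared as an input, only it gets staged into the temp dir, and anything it sources is simply absent there.

```shell
proj=$(mktemp -d)   # pretend project directory
tmp=$(mktemp -d)    # pretend CWL temp working dir

mkdir -p "$proj/src"
printf 'source("src/r_utils.R")\n' > "$proj/src/dummy_data6.R"  # main script
printf 'helper <- function() 42\n' > "$proj/src/r_utils.R"      # sourced, but never declared

# Stage only the DECLARED input, as CWL would:
mkdir -p "$tmp/src"
cp "$proj/src/dummy_data6.R" "$tmp/src/"

ls "$tmp/src"   # dummy_data6.R is there, r_utils.R is not:
                # running the script here would fail at source()
```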

Normally the src/dummy_data6.R script first creates data/dummy_data6/counts_dummy.mtx, which is then compressed to a .gz file (with an overwrite=TRUE option). I’ve tried it manually, and apparently none of the intermediate or final files are empty. But maybe the presence of intermediate files could be causing the problem?

I will have a deeper look at the temporary directory, and also at the renku run options.

Sorry for the late reply:
So eventually my problem was indeed linked to the inputs. My main script (which was correctly flagged as an input) was sourcing another script that was unknown to CWL, as it was not detected by the initial renku run.

I realize that the error message comes from CWL and that it was my mistake to forget to specify this other file as an input, but maybe I would suggest making renku run check whether all required files are correctly listed as inputs? For example, in this case it was a bit hard to debug because CWL wrongly directed my search toward a problem with the output when, in fact, it was due to a missing input.

Anyway, thank you very much for your comments! I wouldn’t have been able to solve this issue without your last suggestion.

Glad to hear you could fix the issue!

Renku doesn’t inspect scripts or anything of the sort, so all it sees is what was provided on the command line and what changed in the project directory after the command ran (for detecting outputs). Since you can run arbitrary programs that do arbitrary things with renku, we really couldn’t automatically detect all the inputs a script or program might use. Even with just a single programming language this would be all but impossible, given dynamic imports and the many ways there are of opening a file.

Given that, I think the biggest issue is the very cryptic error message. Does the R command print an error message if that file is missing? Does it return an exit code that suggests an error, i.e. something other than 0? (You can test this by executing the command and then running echo $?, which prints the exit code.)
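For completeness, a tiny sketch of the exit-code check (generic shell, with a failing `ls` standing in for the R command):

```shell
# A command's exit code is available as $? immediately after it runs.
code=0
ls /nonexistent/path 2>/dev/null || code=$?
echo "exit code: $code"   # non-zero here, because ls failed
```

If the R script exits non-zero when the sourced file is missing, CWL should at least be able to surface the failure; if it exits 0 anyway, CWL only notices the missing output later, which produces the cryptic glob error.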

Maybe we could show the command stdout/stderr when a rerun fails, to support users in debugging the issue. I’d have to look into cwl-runner to check if that’s possible.

Right, it’s true that it would not be feasible to detect all requirements/inputs. I was more thinking about running a “fake” renku update right after the initial renku run, to check whether the declared inputs are sufficient to reproduce the outputs? (Just a random thought.)

So yes, I think that the error message is certainly the biggest issue. In my case, the R error message was quite straightforward:

cannot open file 'src/r_utils.R': No such file or directory
Execution halted

but the main challenge was to find this message somewhere in the CWL cache folders. I indeed think that showing stdout/stderr would be very helpful :slight_smile:

Oh that’s a cool idea! Maybe with a flag like --verify.

We’re working on adding a --dryrun option, this could go hand in hand with this.

I’ll open an issue for this on our end.

Can’t you use renku rerun for this?

At least for my script, it could be a solution, yes. While a run of renku update straight after my renku run (without any change to the input files) yields

All files were generated from the latest inputs.

running renku rerun fails with an error. I could use this in my future scripts to check whether all inputs are correctly set.

@ansonrel you mean it crashes because it can’t find one of the dependencies? Or it actually gives you a stacktrace? If it’s because of a missing input, it should give a better error than a stacktrace and it’s something we should fix. But yes, this was the original intent of rerun, i.e. to use it potentially as a part of continuous integration and verify that the project remains reproducible.

Sorry, my knowledge of bash is limited and I’m not sure what you mean by “stacktrace”?

Basically, executing renku rerun without specifying a script that is sourced somewhere in the workflow will result in an error message finishing with:

[job 0edc002c-a60a-4fd4-aed0-ad077b5a579f] Job error:
("Error collecting output for parameter 'outputs_9d220fb7e262446ea0d9f7dcbe95ce4a':\n../../tmp/tmp97zty416/93fb75a0-1448-48da-b959-58b09adaf565.cwl:53:5: Did not find output file with glob pattern: '['data/dummy_data6/counts_dummy.mtx.gz']'", {})
ERROR:cwltool:[job 0edc002c-a60a-4fd4-aed0-ad077b5a579f] Job error:
("Error collecting output for parameter 'outputs_9d220fb7e262446ea0d9f7dcbe95ce4a':\n../../tmp/tmp97zty416/93fb75a0-1448-48da-b959-58b09adaf565.cwl:53:5: Did not find output file with glob pattern: '['data/dummy_data6/counts_dummy.mtx.gz']'", {})
[job 0edc002c-a60a-4fd4-aed0-ad077b5a579f] completed permanentFail
WARNING:cwltool:[job 0edc002c-a60a-4fd4-aed0-ad077b5a579f] completed permanentFail
Error: Unable to finish re-executing workflow; check the workflow execution outline above and the generated /tmp/tmp97zty416/93fb75a0-1448-48da-b959-58b09adaf565.cwl file for potential issues, then remove the /tmp/tmp97zty416/93fb75a0-1448-48da-b959-58b09adaf565.cwl file and try again

which comes from this missing dependency (in this case, the script that is sourced somewhere in the workflow but wasn’t tagged as input).

I see. Yes, if it’s a missing dependency then it makes sense that you get an error on rerun. If you specify the script as a dependency with --input or similar, it should work. Incidentally, it sounds like this recent discussion was addressing the same problem :slight_smile:
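For anyone landing here later, the fix discussed above amounts to declaring the sourced helper explicitly when recording the workflow. Roughly (an untested sketch using the --input flag mentioned earlier in this thread; adapt the paths to your project):

```shell
# Declare the sourced helper as an explicit input so CWL stages it
# into the temp dir on renku update / renku rerun:
renku run --input src/r_utils.R Rscript src/dummy_data6.R
```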