In older renku versions, git log myfile always showed the exact command that created the file. I used this a lot; it was very useful, because you could check which script was used and with which settings in case you wanted to do something similar. The newest renku version only shows renku run: committing 1 newly added files, which is not very informative. How can I see the full command now?
Hi @rcnijzink,
We are working on improving the way the knowledge graph stores metadata and these improvements (when ready) should help with the issue you are having.
In the meantime you can get almost the same thing you are looking for by doing the following:
work ❯ test-project-10 ▶ master ▶ $ ▶ renku run python script.py --output file2.txt
work ❯ test-project-10 ▶ master ▶ 2⬆ ▶ $ ▶ renku show outputs -v
PATH COMMIT GENERATION TIME WORKFLOW
--------- ---------------------------------------- ------------------- ------------------------------------------------------------
file.txt fc0713229e316990872ab8431452089e52bbc46d 2021-02-25 14:51:28 .renku/workflow/d230e7d2d103427db46ad93401491cc1_rerun.yaml
file2.txt cd90aca00964bf09affaf121eee896f2caac7173 2021-02-25 15:33:22 .renku/workflow/588bc59ccad544358c91f03caf5e0300_python.yaml
work ❯ test-project-10 ▶ master ▶ 2⬆ ▶ $ ▶ git log -n 1 .renku/workflow/588bc59ccad544358c91f03caf5e0300_python.yaml
commit 747e13c5c55ca13f5c9480acc349a3e4c6005eb6 (HEAD -> master)
Author: Tasko Olevski <tasko.olevski@sdsc.ethz.ch>
Date: Thu Feb 25 15:33:22 2021 +0000
renku run python script.py --output file2.txt
So running renku show outputs -v will show you all outputs from the renku run commands you have executed, as well as the workflows associated with them.
The workflows are committed, and the commit message stores the command that was run. So you can pick the right workflow filename from renku show outputs -v and use it in git log -n 1 <workflow_filename> to get what you are looking for.
And you can even combine the steps above into a one-liner like this (xargs is needed because git show does not read commit hashes from standard input):
renku show outputs -v <output_file_name> | tail -n +3 | awk '{ print $2 }' | xargs git show --quiet
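For longer commands, the same lookup can be sketched in Python. This is only a sketch under assumptions: it relies on the table layout printed by renku show outputs -v above (a header line, a dashed separator, whitespace-separated columns, no spaces in paths), and the helper names parse_outputs_table and full_commit_message are made up for illustration. It asks git for the complete message with --format=%B. If renku itself shortened the message when it created the commit, git cannot restore the missing part.

```python
import subprocess


def parse_outputs_table(table_text):
    """Map each output path to its workflow file.

    Assumes the layout shown above: a header line, a dashed separator,
    then rows of PATH, COMMIT, date, time, WORKFLOW separated by whitespace.
    """
    workflows = {}
    for line in table_text.splitlines()[2:]:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        path, workflow = fields[0], fields[-1]
        workflows[path] = workflow
    return workflows


def full_commit_message(workflow_file):
    """Return the whole message of the last commit that touched the file."""
    result = subprocess.run(
        ["git", "log", "-n", "1", "--format=%B", "--", workflow_file],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

To use it, capture the output of renku show outputs -v (for example with subprocess.run), pass the text to parse_outputs_table, and call full_commit_message on the workflow file of the output you are interested in.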
Thanks! But this only works partially, as the command is cut off at the end, like this:
commit 6a2c006844a06513ee8f51574c8480fba6fc362c (HEAD -> master)
Author: Remko Nijzink <remko.nijzink@list.lu>
Date: Mon Mar 1 11:32:44 2021 +0100
renku run python3 src_py/plot_meanannuals_vom.py -i data/VOM_output/additional_analyses/comp2015/...
But it is exactly the more complex, longer commands that I’d like to look at. Just running renku show outputs -v file also takes a very long time, and only the one-liner actually shows something.
I came across a similar problem today. When going through my commit messages to find out how I added certain datasets, I found out that the commit messages are truncated and I never see the full path. This is really annoying. Is the full command stored somewhere else?
Yes, it would be great to have a fix here. I installed an older renku version in a conda environment to avoid this. It was probably a simple feature, but for me it was actually the most useful one.
This was added because long git commit messages are discouraged. Usually 50/72 characters (for the summary and body, respectively) are recommended, though we opted for 100, as things like renku dataset import zenodo are already 28 characters long. Long commit messages also caused issues for some users: Overwrite default commit when adding files in bunch to renku · Issue #1633 · SwissDataScienceCenter/renku-python · GitHub. The biggest issue is when someone runs e.g. renku dataset add folder/*, where * gets expanded by the shell and you get a very long command that’s probably not useful to anyone.
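To illustrate how quickly an expanded glob exceeds the 100-character limit mentioned above, here is a small sketch. It only mimics shell expansion with Python’s glob module on dummy files in a temporary folder; the file names and the helpers expanded_command and demo are made up, and nothing here calls renku.

```python
import glob
import os
import tempfile


def expanded_command(prefix, pattern):
    """Build the command line roughly as the shell would after expanding the glob."""
    return " ".join([prefix, *sorted(glob.glob(pattern))])


def demo(file_count=50):
    """Create dummy files and return the command that folder/* expands to."""
    with tempfile.TemporaryDirectory() as folder:
        for index in range(file_count):
            open(os.path.join(folder, f"sample_{index:03d}.csv"), "w").close()
        return expanded_command("renku dataset add", os.path.join(folder, "*"))


command = demo()
print(len(command), "characters; first 100:", command[:100])
```

With a few dozen files the command is already far longer than 100 characters, which is the situation the truncation was meant to handle.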
If you look at the implementation, it actually is flexible; it unfortunately just isn’t configurable: renku-python/scm.py at master · SwissDataScienceCenter/renku-python · GitHub
I think there is merit in limiting the length of the first line of the commit message, so I wouldn’t want to change that. But we could make the length configurable on a per-project level (via renku config set), and turn on the wrapping that’s already supported, so no information is lost in any case.
Thanks, @ralf.grubenmann, I understand now why the length of commit messages has to be limited, but I don’t understand how to ensure that no information is lost, or how renku could allow that part of a command is lost. Is it not possible to retrieve the full command if it was too long? Wouldn’t this break reproducibility?
I was mostly thinking that since a commit message can have multiple lines, we can wrap the command, so you have something like
renku run --python myscript.py file1 file2...
file3 file4 file5 file6...
file7 file8
So the information is still there, just not on a single long line. It might still need an upper limit to not break things, I’m not sure about that.
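A minimal sketch of that idea, using Python’s standard textwrap module: keep a short first line and put the full command, wrapped, into the body. This is not renku’s implementation; format_commit_message and recover_command are hypothetical helpers, the limits of 100 and 72 are just the numbers discussed in this thread, and the round trip only works if arguments are separated by single spaces.

```python
import textwrap


def format_commit_message(command, subject_limit=100, body_width=72):
    """Return a commit message with a short subject and the full command below.

    Hypothetical helper, not renku's implementation. The limits are
    illustrative defaults.
    """
    if len(command) <= subject_limit:
        return command
    subject = command[: subject_limit - 3] + "..."
    body = textwrap.wrap(
        command,
        width=body_width,
        break_long_words=False,  # keep long paths intact on their own line
        break_on_hyphens=False,  # do not split options such as --output
    )
    return subject + "\n\n" + "\n".join(body)


def recover_command(message):
    """Rebuild the command from the wrapped body of such a message."""
    subject, _, body = message.partition("\n\n")
    return " ".join(body.splitlines()) if body else subject
```

The subject stays within the limit for tools that only show the first line, while the body keeps the complete command and can be joined back into a single line.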
But honestly, I think using the commit message to figure out what happened isn’t necessarily the right thing to do and is more of a crutch to achieve something that we don’t yet properly support.
We are currently designing a more fully featured and improved renku workflow experience, so at least on the workflow side, I think we should handle this specifically with this use case in mind. The current design docs for this are at workflow UX improvements · Issue #1875 · SwissDataScienceCenter/renku-python · GitHub, and I think that, for instance, the proposed renku workflow history command would be a much better place to retrieve the command used in an execution. Something like renku workflow history --full-command myfile would seem much cleaner than the git history, which is more of a side effect of renku operations than a proper use case. Also, with these changes, there might not even be a commit, or multiple workflow executions could end up in a single commit, so using git log for this purpose wouldn’t work anymore anyway.
All of this is still in the design phase, so subject to change. But any wishes, suggestions or criticism are very welcome!
Thanks a lot, @ralf.grubenmann! This all sounds good, but I have 2 comments and a question.
Comment1: Please don’t forget to include renku database add in these discussions, as the link is often truncated as well.
Comment2: I think it would still be good to have the full command in the commit message. Renku could keep the subject of the commit message short and put the full command in the body.
Question: How can we access the full commands executed in the current and previous versions of renku?
Regarding comment1: Assuming you mean renku dataset add, do you use git log to find out when each file was added? With renku dataset ls-files you can already see the files in a dataset, so I’m interested in hearing your use case for seeing the full command in git log. In any case, having a truncated summary line and the full command in the body, as discussed above, would apply to all renku commands, so this case would be covered.
Comment2: Agreed
As to the question, I don’t think it’s easily possible at the moment. We do know what command was run through the renku metadata, and eventually you can see that with e.g. the renku workflow history command mentioned above and probably also in the UI on renkulab, once those features are implemented.
But right now the information is spread across several nodes in the knowledge graph and you’d probably have to write python code importing renku classes to get it in a human-readable form.
Other than that, I think the closest you can get at the moment is by doing renku log --format Makefile <paths>, which will output a makefile with the commands used to create <paths>.
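If you want only the commands out of that makefile, a rough sketch could look like this. The exact layout of renku’s Makefile output is not shown in this thread, so the parser assumes ordinary Makefile rules (a "targets: prerequisites" line followed by tab-indented recipe lines); recipes_by_target and commands_for are made-up helper names.

```python
import subprocess


def recipes_by_target(makefile_text):
    """Collect the recipe lines for each target of a Makefile.

    Assumes plain rules: "targets: prerequisites" followed by recipe lines
    that start with a tab. Variables and includes are not handled.
    """
    recipes = {}
    current_targets = []
    for line in makefile_text.splitlines():
        if line.startswith("\t"):
            for target in current_targets:
                recipes[target].append(line.strip())
        elif ":" in line and not line.lstrip().startswith("#"):
            current_targets = line.split(":", 1)[0].split()
            for target in current_targets:
                recipes.setdefault(target, [])
        elif not line.strip():
            current_targets = []  # a blank line ends the current rule
    return recipes


def commands_for(paths):
    """Run the renku command from the post above and parse its output."""
    result = subprocess.run(
        ["renku", "log", "--format", "Makefile", *paths],
        capture_output=True, text=True, check=True,
    )
    return recipes_by_target(result.stdout)
```

Calling commands_for(["myfile"]) would then return a dictionary from each target to the list of commands in its rule, provided the output really follows that layout.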
Yes, thanks for all the clarifications! I wondered whether looking at the workflows would also be an option. I found renku workflow set-name create output_file but didn’t manage to see the workflow for one specific file. When I ran the above, it just added a commit with that command, but I didn’t see any exported workflow file.
set-name just gives an identifier to a file.
I just saw that our docs mention this command for exporting workflows, but I think that part of the docs is wrong; it should read: renku workflow create output_file. This will generate a CWL file. As a side note, create isn’t the best name; it should really be called renku workflow export.
Okay, thanks! But then I get an error:
Traceback (most recent call last):
  File "[...]/renku/cli/exception_handler.py", line 121, in main
    result = super().main(*args, **kwargs)
  File "[...]/renku/cli/exception_handler.py", line 87, in main
    return super().main(*args, **kwargs)
  File "[...]/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "[...]/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "[...]/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "[...]/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "[...]/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "[...]/renku/cli/workflow.py", line 172, in create
    result = create_workflow_command().build().execute(output_file=output_file, revision=revision, paths=paths)
  File "[...]/renku/core/incubation/command.py", line 131, in execute
    output = context["click_context"].invoke(self._operation, context["client"], *args, **kwargs)
  File "[...]/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "[...]/renku/core/commands/workflow.py", line 101, in _create_workflow
    workflow = graph.as_workflow(outputs=outputs,)
  File "[...]/renku/core/commands/graph.py", line 435, in as_workflow
    assert isinstance(node.activity, ProcessRun)
AssertionError
This also happens for other files. Sometimes, I get this error:
Error: Found multiple activities that produced the same entity at commit c88685865ec400ed03cdb0ea74180fb953fe938d
Any thoughts on what is going wrong?
The UX around that command isn’t that great, it’s a command that hasn’t gotten a lot of love, unfortunately.
While I can’t be sure, that error indicates that the file you passed to the command was created by a regular git add & commit, not through a workflow. The <path> passed to the command has to be a file generated by a workflow (i.e. an output file) for it to work.
So, for example:
$ renku run cp myfile myoutputfile
$ renku workflow create myoutputfile # this works
$ renku workflow create myfile # this gives the error you got
The semantics of the command are “Produce a CWL that generates a file as it was generated by renku run/rerun/update commands”.
But with how little known/used that command is, it might also be a proper bug.
Okay, thanks! They were all created with renku run, though. I was also asking because the renku log --format Makefile <paths> command that you suggested before takes a really long time. But that probably relates to this issue: Importing dataset: resource not in KG - #2 by jachro
I am currently looking at these workflows and histories, mainly to check the lineage and see whether everything is indeed reproducible, as this repo accompanies a paper we hope to submit soon. Is there also an option to do something like renku update --dry-run? The repository contains some quite long model runs, and I don’t actually want to re-run anything; I mainly want to check that everything needed to reproduce the results is there.
We are working on improving how we store and process the metadata for workflows, which should significantly improve performance. This is already partly implemented, though hidden in the CLI help for now.
You can do
$ renku graph generate # this generates the new metadata format alongside the old format
$ renku graph update --dry-run
But I think the output of that command is not as detailed as you’d need for your purposes: it just lists the names of all the steps that would be involved in the update, not the commands. If you need a (hacky) solution right now, you could probably edit the renku source to output what you want, by changing this line: renku-python/graph.py at master · SwissDataScienceCenter/renku-python · GitHub. You’d want to output p.to_run().to_argv() + p.to_run().to_stream_repr() instead of p.
renku log currently still works by walking and processing individual commits, and the commits those commits depend on (O(n^2) upper bound), so in a project with as many commits as yours it can take quite long. The new way we handle metadata stores all relevant metadata in the head commit in two files, so all of it is immediately available (apart from the time it takes to load it into memory). It therefore performs much better and is more robust to git rebases and the like.
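To see where the quadratic upper bound comes from, here is a toy cost model. It is not renku code: it simply counts how many commits get visited when the ancestry of every commit is walked from scratch, compared with reading everything once from the head commit. The function names and the linear-history example are assumptions made for illustration.

```python
def visits_when_rewalking(parents):
    """Count commit visits if each commit's ancestry is walked from scratch."""
    visits = 0
    for commit in parents:
        seen = set()
        stack = [commit]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            visits += 1
            stack.extend(parents[current])
    return visits


def visits_with_head_metadata(parents):
    """Count visits if all metadata is read once from the head commit."""
    return len(parents)


def linear_history(length):
    """Build a chain in which every commit depends on the previous one."""
    return {index: ([index - 1] if index else []) for index in range(length)}


history = linear_history(1000)
print(visits_when_rewalking(history))      # grows roughly with n squared
print(visits_with_head_metadata(history))  # grows linearly with n
```

For a chain of n commits the first count is n(n+1)/2, so doubling the history roughly quadruples the work, whereas the second count only doubles.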
I remember there being some issues in your repository that we had to fix as part of implementing the renku graph generate command, so you might be running into those in the renku log / renku workflow create commands.
The initial generate still takes a while, though, as it has to walk the commits to process this metadata. This new metadata storage also prompted us to start the workflow UX changes I mentioned above, as it enables us to be much more flexible with what we do with workflows. But unfortunately most of that only exists in our heads so far.
Ok, super, that looks useful, I will try it out! Thanks a lot for your quick responses!