I am using Jupyter notebooks for experimental data acquisition. Basically, I set up my instruments in the notebook and then run a sequence of interactions while reading and plotting data from the instruments, saving the data along with notes on the interactions in a .csv file for later analysis. Since I do this interactively, I cannot run the notebook through papermill, but I would still like to save the notebook at the end, along with the information about which csv files were created in the course of its execution. Is there a way to do this manually after the experiment is finished?
We currently don’t support the use case you describe above. We are working on a user-facing API that allows setting inputs/outputs etc. from code, but for now that is limited to scripts executed with
Since we don’t just track the inputs/outputs of commands, but also offer easy ways to reproduce results with renku rerun or renku update, just telling renku what happened, but in a non-reproducible way, goes a bit against our goals for renku.
We might add tracking of individual cell executions in notebooks at some point in the future, but there it gets difficult not to make things too messy, especially in the initial stages of development, where there’s lots of trial and error and executing things that don’t matter in the end.
You can of course fake this information by doing something like renku run echo "" --input notebook.ipynb --output file1.csv --output file2.csv, which would track the execution of echo but specify the notebook as an explicit input and the two csv files as explicit outputs (see https://renku.readthedocs.io/projects/renku-python/en/latest/commands.html#detecting-input-paths for more details). Here echo wouldn’t do anything; any other operation that does nothing would work as well, like renku run cat notebook.ipynb --output file1. Of course, rerunning/updating wouldn’t work in this case and might even break renku update for other workflows in your repo, so I wouldn’t recommend this (as you’d essentially be lying to renku about what happened), but it is a way of manually adding some information on how a file was generated to the knowledge graph. So if you just want it as a way to remember what you did and don’t use rerun, this should be fine.
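If the list of csv files is only known at the end of a session, the no-op command above could be assembled from within the notebook. The following is a minimal Python sketch, not part of renku: the helper name build_fake_renku_run is hypothetical, and it only reproduces the flags exactly as written in this post, returning the command as a string rather than executing it.

```python
import shlex


def build_fake_renku_run(notebook, outputs):
    """Return the no-op `renku run` command line from this post as a string.

    Hypothetical helper: it only formats the command, it does not run it.
    """
    # echo "" does nothing useful; it just gives renku a command to record.
    args = ["renku", "run", "echo", "", "--input", notebook]
    for path in outputs:
        # One --output flag per csv file created during the session.
        args += ["--output", path]
    # shlex.join quotes each argument so the string can be pasted into a shell.
    return shlex.join(args)


print(build_fake_renku_run("notebook.ipynb", ["file1.csv", "file2.csv"]))
```

The same caveat applies as for the hand-typed command: this records information only, and rerunning or updating would not work for it.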
Since we want to add more support for use cases like yours in the future, do you have any idea what workflow you would like to use to achieve your goal? We’re always gathering use cases and would love to hear what you’d ideally like a feature like this to look like.
I hope I could help you with your question.
Thanks a lot, Ralf! You are right, my fake renku run command would stuff up renku update. Can I undo it again somehow? I mean, could I remove the renku run entry without removing the data generated? Maybe I should treat this part of the workflow like adding a dataset and put the information on how this dataset was created (by executing commands recorded in Notebook XX) in the metadata? Would it be possible to standardize this somehow so that I would see in the knowledge graph that a given dataset was created by a certain notebook, so that I could view the notebook in the state it was in when the dataset was created?
For example, I would have a notebook ‘Sensor_calibration.ipynb’, which I would use to re-calibrate the sensor every now and then. Every time, I would create a new, dated calibration dataset and update the notebook. Maybe I will find out that a different way of reading the data from the device is more effective or I find a faster way of doing the calibration, so I would like to be able to find out in the history which calibration dataset was created with which version of Sensor_calibration.ipynb. Does this make sense and do you see a way to achieve this already?
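Until something like this is supported natively, one way to keep the link between a calibration dataset and the notebook version that produced it might be a small manifest written at the end of each session. This is only a sketch under stated assumptions and not a renku feature: the function record_provenance, the file name provenance.json, and the entry fields are all hypothetical, and the notebook is identified by a SHA-256 hash of its saved file.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_provenance(notebook, datasets, manifest="provenance.json"):
    """Append one entry linking dataset files to the notebook's content hash.

    Hypothetical helper; the manifest format is an assumption, not a standard.
    """
    notebook_path = Path(notebook)
    # Hash the notebook as saved on disk, so save it before calling this.
    digest = hashlib.sha256(notebook_path.read_bytes()).hexdigest()
    entry = {
        "notebook": notebook_path.name,
        "notebook_sha256": digest,
        "datasets": [str(d) for d in datasets],
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    manifest_path = Path(manifest)
    # Load earlier entries if the manifest already exists, then append.
    entries = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    entries.append(entry)
    manifest_path.write_text(json.dumps(entries, indent=2))
    return entry
```

Calling record_provenance("Sensor_calibration.ipynb", ["calibration_dated.csv"]) as the last cell (the csv name here is a placeholder) and committing the notebook, the data, and the manifest together would let the commit history show which notebook state belongs to which dataset. The hash only reflects the file on disk, so unsaved changes in the running notebook would not be captured.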