I would like to be able to transparently import code files from other projects, similar to renku dataset add. For example, I would like to use https://renkulab.io/projects/remko.nijzink/vomcases/files/blob/src_py/plot_et_ass.py in a different project, but be clear about where I took it from. Should I just pretend that plot_et_ass.py is a dataset and use renku dataset add, or is there a more appropriate function for this? I was also thinking of submodules, but this would require plot_et_ass.py to be in a separate git repo.
Hi! As you point out, using a submodule would be one solution. I think this is probably the best solution if you want to use the entire other project, but if you just want a file or two, it could be overkill.
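For readers who want to try the submodule route, here is a minimal sketch. It uses two throwaway local repositories so that it runs without network access. The directory names (upstream, consumer, external/upstream) and the placeholder script content are assumptions made for the demo; in a real project you would pass the other project's git URL instead of a local path. The protocol.file.allow setting is only there because the demo clones from a local path.

```shell
set -e
work=$(mktemp -d)

# Identity for the demo commits, so the sketch does not depend on global git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# 1. A stand-in for the other project (hypothetical content).
git init -q "$work/upstream"
echo 'print("placeholder for plot_et_ass.py")' > "$work/upstream/plot_et_ass.py"
git -C "$work/upstream" add plot_et_ass.py
git -C "$work/upstream" commit -q -m "Add plotting script"

# 2. The receiving project pulls the whole upstream repo in as a submodule.
git init -q "$work/consumer"
git -C "$work/consumer" -c protocol.file.allow=always \
    submodule --quiet add "$work/upstream" external/upstream
git -C "$work/consumer" commit -q -m "Add upstream project as a submodule"

# 3. The submodule is pinned to a specific upstream commit, which records where the code came from.
git -C "$work/consumer" submodule status
```

The last command lists the pinned commit next to external/upstream. Note that this brings in the entire upstream repository, which is why it can be overkill for a single file.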
The semantics of renku dataset add [file from a git repo] are pretty close to what you want here, but the main problem is that it would put the file into git-lfs by default, and it is cumbersome (though not impossible) to store the file directly in git.
I will see if anyone here has a better idea and get them to contribute here.
Hello,
ATM, what @cramakri proposed is the best solution. We have an open issue to allow adding/sharing/tracking code similar to what can be done with data: https://github.com/SwissDataScienceCenter/renku-python/issues/671.
Thanks a lot, @cramakri and @mohammad-sdsc! I had not seen the issue. Just added a comment to it, as it seems to be concerning data files only, not code. As @cramakri pointed out, placing a code file in git-lfs is counter-productive. Would it not be easy to copy the functionality of renku dataset add but remove the git-lfs part? Or add an option to renku dataset add to let the user choose where the file is placed, and maybe allow a tag for code?
This has been fixed in v0.10.0: you need to include the --no-external-storage flag in renku commands, like this:
renku --no-external-storage dataset add ...
Thanks so much for the update on this. It worked like a charm, except that I just got tripped up by an error message that git lfs was not installed. Could it be added to the instructions for installing renku, so that the users know that they need to install git-lfs separately?
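Until the installation instructions mention it, a small preflight check can catch the missing git-lfs before renku does. This is only a sketch: the helper name require_commands is made up for illustration, and it assumes the tools are installed under their usual executable names (git, git-lfs, renku) on your PATH.

```shell
# Return 0 if every named command is on PATH; otherwise report the missing ones on stderr.
require_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

if require_commands git git-lfs renku; then
    echo "all prerequisites found"
else
    echo "install the missing tools before running renku" >&2
fi
```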
Hi @schymans,
sorry for taking a while to respond and thanks for the suggestion - I’ve made an issue to add this to the docs: https://github.com/SwissDataScienceCenter/renku/issues/1396
I am still struggling a little with the following workflow: I take a jupyter notebook from another repo and then modify it to work with my new repo, but I want to record the original provenance of the file for giving proper credit at the end. I can do this with renku --no-external-storage dataset add ... as suggested earlier, but this will put the file into data/externaldataset1/... even if I use the --destination option, whereas I keep my jupyter files in jupyter/... instead. What would be the best way of moving or copying the file to my jupyter folder transparently?
I am having somewhat similar issues with sharing code across projects. I wondered whether it would help to collect the scripts I use most often in a Python package. In that case, I could just add the package and version to requirements.txt and update this whenever I change some of the scripts. But will renku also notice that outputs were generated with an older version of a certain package?
Good idea, but to include it in requirements.txt, you would need to put the package on PyPI, right? And then I don’t think that renku would track which version of a PyPI package was used for a given result. How about renku --no-external-storage dataset add ... followed by renku mv to put the script where you want it?
HOORAY! This actually works, I just tried it out. I created some .py scripts in git@renkulab.io:wave/li-6800.git in the folder modules_exported, then imported them into another project as:
renku --no-external-storage dataset add modules --source modules/* git@renkulab.io:wave/li-6800.git
Then I moved them from data/modules to modules in the receiving repo:
renku mv data/modules/ modules/
Then I updated one of the .py files in the original repo, committed, and executed renku dataset update modules in the receiving repo. The file was updated there, and in the right place! Awesome!
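The three steps from this post can be collected into a small script. This is only a sketch of the workflow as described here, not an official recipe: the helper names (run, import_modules, update_modules) are made up, and the script defaults to a dry run that prints the renku commands instead of executing them, so it can be tried without a renku project. The --source pattern is put in single quotes so that the local shell hands it to renku unexpanded.

```shell
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        # Prints the arguments joined by spaces (shell quoting is not shown).
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Import the scripts as a dataset, then move them out of data/.
import_modules() {
    source_repo=$1
    run renku --no-external-storage dataset add modules --source 'modules/*' "$source_repo"
    run renku mv data/modules/ modules/
}

# Pull in changes made in the original repo.
update_modules() {
    run renku dataset update modules
}

import_modules git@renkulab.io:wave/li-6800.git
update_modules
```

Behaviour of these commands depends on the renku version, so it is worth keeping the dry run on until the printed commands look right for your setup.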
Unfortunately, this approach does not work any more:
$ renku --no-external-storage dataset add --create modules --source modules/* git@renkulab.io:wave/li-6800.git
Error: Invalid parameter value - Cannot use '--source' with multiple URLs.
Is there another way now, or do I have to list every single file in the folder?
Hi @schymans,
The reason you get this error is that you have a modules directory in the root of your project, so your shell expands the wildcard into several paths before renku sees it. To prevent the shell from expanding it, just put modules/* inside single quotes (double quotes won’t work):
renku --no-external-storage dataset add --create modules --source 'modules/*' git@renkulab.io:wave/li-6800.git
Let me know if this doesn’t fix your issue.
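To see what the shell does with the wildcard, here is a small demonstration that needs neither renku nor the real repository. It creates a temporary directory containing modules/a.py and modules/b.py (made-up file names) and counts how many arguments a command would receive with and without the single quotes. The helper count_args is hypothetical and only stands in for renku.

```shell
# Print the number of arguments the command was called with.
count_args() { echo "$#"; }

demo_dir=$(mktemp -d)
mkdir "$demo_dir/modules"
touch "$demo_dir/modules/a.py" "$demo_dir/modules/b.py"

# Unquoted: the shell replaces modules/* with the matching local paths.
unquoted=$(cd "$demo_dir" && count_args --source modules/* git@renkulab.io:wave/li-6800.git)

# Single-quoted: the pattern is passed through literally as one argument.
quoted=$(cd "$demo_dir" && count_args --source 'modules/*' git@renkulab.io:wave/li-6800.git)

echo "unquoted: $unquoted arguments"   # 4: --source, two local paths, the URL
echo "quoted:   $quoted arguments"     # 3: --source, the pattern, the URL

rm -rf "$demo_dir"
```

In the unquoted case the local paths end up where renku expects a single pattern followed by one URL, which would explain the "multiple URLs" error.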
Thank you, @mohammad-sdsc , this works! I probably would have never found this simple solution. I’ve been running a python script to add every file individually…
In renku 2.7.0 the above procedure does not seem to work any more:
$ renku --no-external-storage dataset add --create ESSM_plotting git@github.com:schymans/ESSM_plotting.git
Error: Cannot find file Cannot find '/home/stan/notebooks/jupyter/WAVE/renkulab/LI-6800/.renku/cache/github.com/schymans/ESSM_plotting.git/.git/config' in the remote project
The file renku complains about definitely exists:
$ more /home/stan/notebooks/jupyter/WAVE/renkulab/LI-6800/.renku/cache/github.com/schymans/ESSM_plotting.git/.git/config
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
[submodule]
active = .
[remote "origin"]
url = git@github.com:schymans/ESSM_plotting.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
remote = origin
merge = refs/heads/main
[lfs]
repositoryformatversion = 0
[filter "lfs"]
clean = git-lfs clean -- %f
smudge = git-lfs smudge --skip -- %f
process = git-lfs filter-process --skip
required = true
@mohammad-sdsc @cramakri do you see what I might have overlooked? Or should I use a different method now?
OK, the original command still works, but not without --source 'modules/*'.
This also works:
renku dataset add --create ESSM_plotting --source '*' git@github.com:schymans/ESSM_plotting.git
Why not without --source? I thought that option was only needed if we want to limit ourselves to a subfolder. Is this a bug?
Hello @schymans
You must always specify the files (i.e. sources) that will be added from a git repository. This means that you cannot call renku dataset add without the --source option when adding data from a git repository. There’s no default that adds the whole git repository if no source is specified. I’ll create an enhancement issue to allow this in the future.
Argh, there is another problem. Previously, I could just execute renku mv data/modules/ modules/ to move the entire dataset, but if I execute this command now, I get a warning that the files will be removed from the dataset. So the location of the dataset is now hard-wired?
Just ran into this problem again. Is there a way to move code imported as ‘dataset’ out of the ‘data’ folder without removing it from the dataset itself? Otherwise, it gets confusing to have code in the folder ‘data’. I used to be able to use renku mv on a dataset, but now I get:
Warning: You are trying to move dataset files out of a datasets data directory. These files will be removed from the source dataset: ...
You can set a dataset’s data directory when you create it; see the renku dataset page of the Renku documentation.
You can’t change it once a dataset is created, but you can create a new dataset with the data directory you want and move files into it (if they’re already in a dataset; otherwise just add them) using renku mv with the flag described on the renku mv page of the Renku documentation.