Add code from different project

I would like to be able to transparently import code files from other projects, similar to renku dataset add. For example, I would like to use https://renkulab.io/projects/remko.nijzink/vomcases/files/blob/src_py/plot_et_ass.py in a different project, but be clear about where I took it from. Should I just pretend that plot_et_ass.py is a dataset and use renku dataset add, or is there a more appropriate function to do this? I was also thinking of submodules, but this would require plot_et_ass.py to be in a separate git repo.

Hi! As you point out, using a submodule would be one solution. I think this is probably the best solution if you want to use the entire other project, but if you just want a file or two, it could be overkill.

The semantics of renku dataset add [file from a git repo] are pretty close to what you want here, but the main problem is that it would put the file into git-lfs by default, and it is cumbersome (though not impossible) to store the file directly in git.

I will see if anyone here has a better idea and get them to contribute here.

Hello,
ATM, what @cramakri proposed is the best solution. We have an open issue to allow adding/sharing/tracking code similar to what can be done with data: https://github.com/SwissDataScienceCenter/renku-python/issues/671.

Thanks a lot, @cramakri and @mohammad-sdsc! I had not seen the issue. Just added a comment to it, as it seems to be concerning data files only, not code. As @cramakri pointed out, placing a code file in git-lfs is counter-productive. Would it not be easy to copy the functionality of renku dataset add but remove the git-lfs part? Or add an option to renku dataset add to let the user choose where the file is placed, and maybe allow a tag for code?

This has been fixed in v0.10.0. You need to include the --no-external-storage flag in renku commands, like this:

renku --no-external-storage dataset add ...

Thanks so much for the update on this. It worked like a charm, except that I just got tripped up by an error message that git lfs was not installed. Could it be added to the instructions for installing renku, so that the users know that they need to install git-lfs separately?

Hi @schymans,

sorry for taking a while to respond and thanks for the suggestion - I’ve made an issue to add this to the docs: https://github.com/SwissDataScienceCenter/renku/issues/1396

I am still struggling a little with the following workflow: I take a jupyter notebook from another repo and then modify it to work with my new repo, but I want to record the original provenance of the file for giving proper credit at the end. I can do this with renku --no-external-storage dataset add ... as suggested earlier, but this will put the file into data/externaldataset1/..., even if I use the --destination option, whereas I keep my jupyter files in jupyter/.... What would be the best way of moving or copying the file to my jupyter folder transparently?

I am having somewhat similar issues with sharing code across projects. Now I wonder if it would be helpful to collect the scripts that I use often in a Python package. In that case, I could just add the package and version to requirements.txt and update this whenever I change some scripts. But will renku also notice that outputs were generated with an older version of a certain package?

Good idea, but to include it in requirements.txt, you would need to put the package on pypi, right? And then I don’t think that renku would track which version of a pypi package was used for a given result. How about renku --no-external-storage dataset add ... followed by renku mv to put the script where you want it?

HOORAY! This actually works, I just tried it out. I created some .py scripts in git@renkulab.io:wave/li-6800.git in the folder modules_exported, then imported them in another one as:

renku --no-external-storage dataset add modules --source modules/* git@renkulab.io:wave/li-6800.git

Then I moved them from data/modules to modules in the receiving repo:

renku mv data/modules/ modules/

Then I updated one of the .py files in the original repo, committed, and executed renku dataset update modules in the receiving repo. The file was updated there, and in the right place! Awesome!
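
For anyone who wants to repeat this, here is a minimal sketch that collects the three commands from this post in one script. The run wrapper and the DRY_RUN variable are hypothetical helpers, not part of renku: by default the script only prints the commands so they can be reviewed first. The --source pattern is quoted here so that the local shell passes it to renku unchanged.

```shell
#!/bin/sh
# Sketch only: DRY_RUN defaults to 1, so commands are printed, not executed.
# Set DRY_RUN=0 to actually run them (requires renku and git-lfs installed).
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# 1. Import the scripts from the source repository, bypassing git-lfs.
#    (Commands further down the thread also pass --create on newer versions.)
run renku --no-external-storage dataset add modules --source 'modules/*' git@renkulab.io:wave/li-6800.git

# 2. Move the imported files out of data/ into the code folder.
run renku mv data/modules/ modules/

# 3. Later, after the source repository has changed, pull in the updates.
run renku dataset update modules
```

The repository URL and folder names are the ones from this post; replace them with your own.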

Unfortunately, this approach does not work any more:

$ renku --no-external-storage dataset add --create modules --source modules/* git@renkulab.io:wave/li-6800.git
Error: Invalid parameter value - Cannot use '--source' with multiple URLs.

Is there another way now, or do I have to list every single file in the folder?

Hi @schymans,

The reason you get this error is that you have a modules directory in the root of your project, so your shell expands the wildcard before renku sees it. To prevent the shell from expanding it, just put modules/* inside single quotes:

renku --no-external-storage dataset add --create modules --source 'modules/*' git@renkulab.io:wave/li-6800.git

Let me know if this doesn’t fix your issue.
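
The effect can be reproduced without renku. The sketch below uses a throwaway directory and a hypothetical count_args helper to show how many arguments a command receives with and without the single quotes; the file names a.py and b.py are made up for illustration.

```shell
# Throwaway directory with a modules/ folder, mimicking the receiving project.
workdir=$(mktemp -d)
mkdir "$workdir/modules"
touch "$workdir/modules/a.py" "$workdir/modules/b.py"
cd "$workdir" || exit 1

# Hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell replaces the pattern with the matching local files.
unquoted=$(count_args modules/*)

# Single-quoted: the pattern is passed through literally as one argument.
quoted=$(count_args 'modules/*')

echo "unquoted: $unquoted argument(s)"
echo "quoted:   $quoted argument(s)"

# Clean up the throwaway directory.
cd - >/dev/null && rm -rf "$workdir"
```

With two files in modules/, the unquoted call receives two arguments and the quoted call receives one. In the renku command, the extra arguments produced by the expansion are presumably what ends up being read as additional URLs, hence the "multiple URLs" error.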

Thank you, @mohammad-sdsc , this works! I probably would have never found this simple solution. I’ve been running a python script to add every file individually…

In renku 2.7.0 the above procedure does not seem to work any more:

$ renku --no-external-storage dataset add --create ESSM_plotting git@github.com:schymans/ESSM_plotting.git
Error: Cannot find file Cannot find '/home/stan/notebooks/jupyter/WAVE/renkulab/LI-6800/.renku/cache/github.com/schymans/ESSM_plotting.git/.git/config' in the remote project

The file renku complains about definitely exists:

$ more /home/stan/notebooks/jupyter/WAVE/renkulab/LI-6800/.renku/cache/github.com/schymans/ESSM_plotting.git/.git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
[submodule]
	active = .
[remote "origin"]
	url = git@github.com:schymans/ESSM_plotting.git
	fetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
	remote = origin
	merge = refs/heads/main
[lfs]
	repositoryformatversion = 0
[filter "lfs"]
	clean = git-lfs clean -- %f
	smudge = git-lfs smudge --skip -- %f
	process = git-lfs filter-process --skip
	required = true

@mohammad-sdsc @cramakri do you see what I might have overlooked? Or should I use a different method now?

OK, the original command still works, but not without --source 'modules/*'.
This also works:

renku dataset add --create ESSM_plotting --source '*' git@github.com:schymans/ESSM_plotting.git

Why not without --source? I thought that option is only needed if we want to limit ourselves to a subfolder. Is this a bug?

Hello @schymans

You must always specify files (i.e. sources) that will be added from a git repository. This means that you cannot call renku dataset add without using the --source option (when adding data from a git repository). There’s no default value to add the whole git repository if no source is specified. I’ll create an enhancement issue to allow this in future.
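
Until such a default exists, one way to avoid forgetting the option is a small wrapper. The sketch below is a hypothetical helper, not part of renku: it only prints a dataset add command and falls back to --source '*' (the pattern reported to work above) when no pattern is given.

```shell
# Hypothetical helper (not part of renku): prints a "dataset add" command for
# a git repository, using --source '*' when no pattern is supplied.
renku_add_cmd() {
    name=$1               # dataset name
    url=$2                # git URL of the source repository
    pattern=${3:-'*'}     # optional source pattern, defaults to everything
    printf "renku dataset add --create %s --source '%s' %s\n" "$name" "$pattern" "$url"
}

# Whole repository: no pattern given, so '*' is used.
renku_add_cmd ESSM_plotting git@github.com:schymans/ESSM_plotting.git

# Only one folder of the repository.
renku_add_cmd modules git@renkulab.io:wave/li-6800.git 'modules/*'
```

The printed line can be checked and then pasted into the terminal. Add --no-external-storage after renku if the files should stay out of git-lfs.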

Argh, there is another problem. Previously, I could just execute renku mv data/modules/ modules/ to move the entire dataset, but if I execute this command now, I get a warning that the files will be removed from the dataset. So the location of the dataset is now hard-wired?

Just ran into this problem again. Is there a way to move code imported as ‘dataset’ out of the ‘data’ folder without removing it from the dataset itself? Otherwise, it gets confusing to have code in the folder ‘data’. I used to be able to use renku mv on a dataset, but now I get:
Warning: You are trying to move dataset files out of a datasets data directory. These files will be removed from the source dataset: ...

You can set a dataset’s data directory when you create it; see renku dataset — Renku documentation.
You can’t change it once a dataset is created, but you can create a new dataset with the data directory you want and move files into it (if they’re already in a dataset; otherwise just add them) using renku mv with this flag: renku mv — Renku documentation