Import renku dataset in a notebook / script

Hellooo dear Renkuers.

I have a question / request! It’s about something I was using some months ago, but it seems to have changed in the meantime.

Context: I want to import a renku dataset without using the CLI, so that I can easily and flexibly import datasets from notebooks or scripts, where I can easily control local directories and their content.

Problem: I used to do the following, to import data from another gitlab repo into a renku dataset:

from renku import LocalClient
client = LocalClient('.') 

dataset_repo = 'repo_url' # Or whatever repo
dataset_name = 'my_data' # the name of the newly imported dataset
source_name = 'my_source_directory' # relative to repo root

with client.with_dataset(dataset_name) as dataset:
    client.add_data_to_dataset(dataset, urls=[dataset_repo],
                sources=(f'{source_name}', ), 
                destination=f'data/{dataset_name}')

But it now fails with an (unsurprising, tbh) error: DatasetNotFound: Dataset "my_data" is not found. I thought a new dataset was created (and I think that’s what used to happen), but now it seems I have to create a new one (which is fine, but I need to know how to do so).

Furthermore, I would like to do it directly from Zenodo, so the variable dataset_repo would be a Zenodo DOI. But being able to do it again from another renkulab.io/gitlab repository would still be great.

I wonder how I can restore that functionality. I also think that in many situations a nice interface to renku datasets would help, in particular when dataset handling needs to be streamlined but depends on local constraints. I know a bash script might be the obvious answer, but I’d prefer using the snake. I think many users would agree on this.

Thanks a lot for any help. I can provide more evidence if you contact me directly as I would need to bring you into a private repo.

Ciao,
– Michele

Hey Michele,

with_dataset has a parameter “create”, which defaults to False. If you set it to True, it should work as expected.
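
A minimal sketch of that change, assuming with_dataset and add_data_to_dataset behave as in the snippet from the question. The helper name add_repo_data is made up for illustration, and the client is passed in as an argument so the helper does not depend on a particular Renku version:

```python
def add_repo_data(client, dataset_name, repo_url, source_dir):
    """Create the dataset if it is missing, then add source_dir from repo_url.

    `client` is assumed to expose with_dataset() and add_data_to_dataset()
    as used in the question (e.g. a renku LocalClient of that era).
    """
    # create=True asks with_dataset to create the dataset when it does not
    # exist yet, instead of raising DatasetNotFound.
    with client.with_dataset(dataset_name, create=True) as dataset:
        # No destination is passed, so files go to the default location.
        client.add_data_to_dataset(
            dataset,
            urls=[repo_url],
            sources=(source_dir,),
        )
    return dataset
```

With a real client this would be called as add_repo_data(LocalClient('.'), 'my_data', 'repo_url', 'my_source_directory').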

Cheers,

Ralf

Hi Michele,

Renku’s internals have changed a lot in recent months. As @ralf.grubenmann mentioned, you can get around this specific problem by passing create=True to with_dataset. You also shouldn’t pass destination like that; files by default go to data/dataset_name and any value that you pass to destination will be treated as additional subdirectories, so files will go to data/dataset_name/data/dataset_name in this case.
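
To make the destination rule concrete, here is a small model of it in plain Python. This is not Renku code, only a sketch of the path arithmetic described above; the function name expected_location is hypothetical:

```python
from pathlib import PurePosixPath


def expected_location(dataset_name, destination=''):
    """Model where added files end up: data/<dataset_name>/<destination>.

    An empty destination contributes no extra subdirectory.
    """
    return PurePosixPath('data') / dataset_name / destination


# Default: files land directly under the dataset directory.
print(expected_location('my_data'))
# Passing destination='data/my_data' nests the same path a second time.
print(expected_location('my_data', 'data/my_data'))
```

The second call shows why passing destination=f'data/{dataset_name}' duplicates the directory structure.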

Adding (or as we call it, importing) from Zenodo is a completely different story and it’s not possible to do it from the client with a single function call. However, you can take a look at renku/core/commands/dataset.py for a list of functions that are at a higher level of abstraction for working with datasets. They do some extra work and might call multiple functions from LocalClient to achieve the desired functionality. If you want to use these functions in your code, keep in mind that 1. Data/metadata will be committed to Git after each call, and 2. A LocalClient is passed automatically to them and you don’t need to pass yours.

For example, to do the same as above:

from renku.core.commands import dataset
dataset.add_file(urls=[dataset_repo], short_name=dataset_name, create=True,
                 sources=[source_name],
                 commit_message='Create and add new data')

Or to import a dataset from Zenodo:

dataset.import_dataset(uri='10.5281/zenodo.3266798', commit_message='imported from Zenodo')

Note that if you don’t pass a commit_message you will get some weird (and perhaps ugly) commit message in your Git repo.
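
Since the source can be either a Git repository URL or a Zenodo DOI, a script could pick between the two calls above automatically. The following is only a sketch: the DOI pattern is a simplified assumption, the names is_doi and fetch_dataset are hypothetical, and add_file / import_dataset are expected to be the functions from renku/core/commands/dataset.py with the keyword arguments shown in this thread (which may differ between Renku versions). They are passed in as arguments so the sketch itself does not import Renku:

```python
import re

# Simplified DOI shape: "10." + registrant code + "/" + suffix (an assumption,
# not a full DOI validator).
_DOI_RE = re.compile(r'^10\.\d{4,9}/\S+$')


def is_doi(source):
    """Return True if source looks like a bare DOI, e.g. 10.5281/zenodo.3266798."""
    return bool(_DOI_RE.match(source.strip()))


def fetch_dataset(source, name, source_dir, add_file, import_dataset):
    """Import from a DOI, or add source_dir from a Git repo to dataset `name`.

    add_file and import_dataset are injected callables; an explicit
    commit_message is always passed to avoid an auto-generated one.
    """
    source = source.strip()
    if is_doi(source):
        return import_dataset(
            uri=source,
            commit_message=f'Import dataset {source}',
        )
    return add_file(
        urls=[source],
        short_name=name,
        create=True,
        sources=[source_dir],
        commit_message=f'Add {source_dir} from {source} to {name}',
    )
```

In a project this might be called as fetch_dataset(dataset_repo, dataset_name, source_name, dataset.add_file, dataset.import_dataset), after importing dataset from renku.core.commands.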

We totally agree with you on having an interface to reveal some of the Renku internals to end-users. We’ve had some discussion before but ATM it is not clear what this interface should look like (-> use-cases are welcome).

Feel free to ping me if you need further information!

Regards,
Mohammad

Hi Mohammad and Ralf,

Well, you both solved my concerns and problems in no time!

Seeing things like that makes me wonder whether a simpler interface is really needed, but I agree that something exposing the functionality would be nice to have, in particular for creating a dataset from current analyses: creating the dataset object, handling metadata and ultimately even publishing it to Zenodo from within your notebook or script, if that makes sense to you.

Still, as you pointed out, looking at the functions in renku/core/commands/dataset.py answers almost any question I might come up with, so thanks for pointing it out!

Happy to discuss further for use cases, but for now, thanks a ton for solving my issue!

Cheers,
– Michele

Edited to add the comment about renku/core/commands/dataset.py