Renku Python API

When executing a command or script using renku run, Renku tracks a lineage of all arguments that appear on the command-line. These arguments include files and directories containing input/output data and parameters that are passed to a command or script. This enables users to create reproducible workflows by simply prepending renku run to their processing commands without any other modification.
Renku also supports more advanced use cases where dependencies are within a script and don’t appear on the command line. For example, parameter values can be read from a database, or a list of input files can be derived from the current date. Users can use features like indirect inputs and outputs to tell Renku about such inputs, outputs, and parameters.
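As an illustration of a dependency that never appears on the command line, here is a minimal sketch of a script that derives its input files from the current date. The function name input_files_for and the data/<YYYY>/<MM>/<DD>/*.csv layout are hypothetical and only serve the example.

```python
from datetime import date
from pathlib import Path


def input_files_for(day, data_dir="data"):
    """Return the input paths a script would read for the given day.

    The data/<YYYY>/<MM>/<DD>/*.csv layout is a hypothetical naming scheme.
    """
    day_dir = Path(data_dir) / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
    return sorted(day_dir.glob("*.csv"))


if __name__ == "__main__":
    # The file list depends on when the script runs, so none of these
    # paths show up among the command-line arguments.
    for path in input_files_for(date.today()):
        print(path)
```

Running the same command on two different days reads two different sets of files, which is exactly the kind of input that command-line tracking alone cannot see.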

To further simplify tracking and storing such indirect information in Renku, we created a Python library that provides an interface to Renku internals: the Renku Python API. It introduces the Renku concepts of Input, Output, Parameter and Dataset as Python objects that users can access in their scripts. The first version of this library allows accessing a project’s metadata inside a script and tracking specific lineage information.

In this example, we will use the Renku API in a Python script to demonstrate some of its features. Our example reads the first n lines from each file in all datasets and writes them to an output file. The following script achieves this goal without using the Renku API:

from pathlib import Path

n = 10

output_file = "output.txt"

with open(output_file, "w") as result:
    for path in Path("data").rglob("*/*"):  # Walk the list of all data files
        with open(path) as input:
            for line in input.readlines()[:n]:  # Get the first n lines of an input
                result.write(line)  # Append each line to the output file

We can execute this as a pure Python script, but we can also run it through Renku to track its lineage. In this case, we need to provide a list of all of its dependencies on the command-line:

renku run --input data/sets/sets.csv --input ... --output output.txt python3

Note that in future executions of this script we may need to adapt the list of input files if they change. Overall, this could be a rather cumbersome and error-prone process. In addition, if the list of files is long, the script invocation becomes difficult to read and understand.
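To see how quickly the explicit invocation grows, the following sketch assembles the argument list from the files found under a data directory. It uses only the --input and --output flags shown above. The helper name build_renku_command and the script name script.py are hypothetical placeholders, since the script's file name is not given here.

```python
import shlex
from pathlib import Path


def build_renku_command(data_dir, output_file, script):
    """Assemble a `renku run` invocation that lists every data file explicitly."""
    command = ["renku", "run"]
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():  # Skip directories, list files only
            command += ["--input", path.as_posix()]
    command += ["--output", output_file, "python3", script]
    return command


if __name__ == "__main__":
    # "script.py" is a hypothetical script name used for illustration.
    print(shlex.join(build_renku_command("data", "output.txt", "script.py")))
```

Every file added to the data directory adds two more arguments, and the command has to be regenerated whenever the file list changes.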

We can solve these shortcomings by re-writing the same script using the Renku Python API. The following shows the new script (a sample project is available at Renku API Demo):

from renku.api import Dataset, Input, Output, Parameter

n = Parameter(name="n_lines", value=10)  # Define a Parameter

output_file = Output("output.txt")  # Define an Output

with open(output_file, "w") as result:
    for dataset in Dataset.list():  # Iterate through all Datasets
        for dataset_file in dataset.files:  # Walk the list of a dataset's files
            with open(Input(dataset_file.path)) as input:  # Define an Input
                for line in input.readlines()[:n]:
                    result.write(line)  # Append each line to the output file

  • Parameter(name="n_lines", value=10) creates a Parameter instance that tells Renku that a parameter named n_lines with a value of 10 is used in this execution.
  • Output("output.txt") creates an Output path. Instances of this class can be used in Python functions that work with paths, for example by passing them to the standard open function.
  • Dataset.list() returns a list of all available datasets in a project. A Dataset object holds the available metadata for a Renku dataset like its name, title, list of creators, etc.
  • dataset.files is used to get a list of all data files in a dataset.
  • Input(dataset_file.path) creates an instance of the Input class, which informs Renku that the file at this path is an input to our script.
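The bullets above mention that Input and Output instances can be passed straight to open. One way a Python class can behave like this is by implementing the os.PathLike protocol. The sketch below uses a stand-in class to show the idea; it is not Renku's implementation, and TrackedPath, recorded_inputs, recorded_outputs, and copy_first_lines are all hypothetical names.

```python
import os

# Stand-ins for the metadata that a tracking tool would collect.
recorded_inputs = []
recorded_outputs = []


class TrackedPath(os.PathLike):
    """Hypothetical stand-in: a path object that records itself on creation."""

    def __init__(self, path, registry):
        self.path = os.fspath(path)
        registry.append(self.path)  # Register the path as soon as it is declared

    def __fspath__(self):
        # open() and os.fspath() call this hook to obtain the real path.
        return self.path


def copy_first_lines(input_paths, output_path, n):
    """Write the first n lines of every input file to the output file."""
    with open(TrackedPath(output_path, recorded_outputs), "w") as result:
        for path in input_paths:
            with open(TrackedPath(path, recorded_inputs)) as source:
                for line in source.readlines()[:n]:
                    result.write(line)
```

Because each path is registered at the moment the script opens it, the recorded list always matches the files that were actually processed, however the list of files changes between runs.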

We can run this script in a Renku project using the following command:

renku run python3

Because the script registers the files as inputs as it processes them, we can reuse the same command even if the list of files changes and the results will remain self-consistent.

When the execution is over, Renku uses the information that we provided in the script to produce proper metadata for the execution: it records all dataset files as inputs, sets output.txt as the output, and records that an n_lines parameter with a value of 10 was used in this run. The resulting lineage from this execution looks like the following:

Comparing the two versions of the script and command shows that it’s simpler, cleaner, and more reliable to access Renku metadata and record lineage using the Renku Python API.

For more information, see the Renku Python API documentation.

This is still a work in progress. We will include more features in future versions. Feel free to reach out to us for any feedback, comments, or feature requests that you might have.


Hi @mohammad-sdsc

Wow, this feature is fantastic!! I can finally test it out, and I was wondering: when specifying the inputs, is it possible to just indicate the path to a folder? Instead of a specific file. Or would that cause an error?

Thank you so much!

Hi @lusamino, thanks!

It should work with no issue. Basically, you can pass the same input types as renku run accepts, which includes directories as well.



Fantastic! I knew you would have considered that possibility :wink: I will let you know in any case if I encounter any errors.

Thanks @mohammad-sdsc


Hi @mohammad-sdsc ,

One last quick question: does Renku track other generated outputs beyond the ones specified? I am generating some temporary files that are used for validation, but that I would not like to track in the workflow. If they are automatically added, is there a way to deactivate this behaviour?

Thank you so much!

Hi @lusamino,
You can pass --no-input-detection and/or --no-output-detection flags to a renku run command to prevent renku from automatically adding inputs/outputs to your workflow. In this case, only inputs/outputs that are explicitly defined by you (using the API, or --input, --output flags) will be recorded.



These are truly awesome features @mohammad-sdsc Thank you so much!

