Renku Python API

mohammad-sdsc · 27 May 2021 15:24

When executing a command or script using renku run, Renku tracks a lineage of all arguments that appear on the command-line. These arguments include files and directories containing input/output data and parameters that are passed to a command or script. This enables users to create reproducible workflows by simply prepending renku run to their processing commands without any other modification.
Renku also supports more advanced use cases where dependencies are within a script and don’t appear on the command line. For example, parameter values can be read from a database, or a list of input files can be derived from the current date. Users can use features like indirect inputs and outputs to tell Renku about such inputs/outputs/parameters.

To further simplify tracking and storing such indirect information in Renku, we created a Python library that provides an interface to Renku internals: the Renku Python API. It introduces Renku concepts of Input, Output, Parameter and Dataset as Python objects that users can access in their scripts. The first version of this library allows accessing a project’s metadata inside a script or tracking specific lineage information.

In this example, we will use the Renku API in a Python script to demonstrate some of its features. Our example will read the first n lines from each file in all datasets and write them to an output file. The following script achieves this goal without using the Renku API:

from pathlib import Path

n = 10

output_file = "output.txt"

with open(output_file, "w") as result:
    for path in Path("data").rglob("*/*"):  # Walk the list of all data files
        if not path.is_file():  # Skip directories that also match the pattern
            continue
        with open(path) as input:
            for line in input.readlines()[:n]:  # Get the first n lines of an input
                result.write(line)

We can execute this as a pure Python script, but we can also run it through Renku to track its lineage. In this case, we need to provide a list of all of its dependencies on the command-line:

renku run --input data/sets/sets.csv --input ... --output output.txt python3 script.py

Note that in future executions of this script we may need to adapt the list of input files if they change. Overall, this could be a rather cumbersome and error-prone process. In addition, if the list of files is long, the script invocation becomes difficult to read and understand.
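
To make this maintenance burden concrete, here is a minimal sketch that rebuilds the command line from whatever files are currently on disk. The helper name build_renku_run_command is hypothetical and not part of Renku; the only Renku-specific pieces are the --input and --output flags already shown in the command above.

```python
import shlex
from pathlib import Path


def build_renku_run_command(data_dir, output_file, script="script.py"):
    """Build a `renku run` command string (hypothetical helper, not part of Renku)."""
    # Collect regular files only; sorting keeps the command stable between runs.
    inputs = sorted(p for p in Path(data_dir).rglob("*") if p.is_file())

    args = ["renku", "run"]
    for path in inputs:
        args += ["--input", path.as_posix()]  # One --input flag per data file
    args += ["--output", str(output_file), "python3", script]

    # shlex.join quotes any argument that contains spaces or shell characters.
    return shlex.join(args)
```

Every new or renamed data file changes the generated command, so the command has to be regenerated before each run.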

We can solve these shortcomings by rewriting the same script using the Renku Python API. The following shows the new script (a sample project is available at Renku API Demo):

from renku.api import Dataset, Input, Output, Parameter

n = Parameter(name="n_lines", value=10)  # Define a Parameter

output_file = Output("output.txt")  # Define an Output

with open(output_file, "w") as result:
    for dataset in Dataset.list():  # Iterate through all Datasets
        for dataset_file in dataset.files:  # Walk the list of a dataset's files
            with open(Input(dataset_file.path)) as input:  # Define an Input
                for line in input.readlines()[:n]:
                    result.write(line)

Parameter(name="n_lines", value=10) creates a Parameter instance that tells Renku a parameter with name n_lines and value of 10 is used in this execution.
Output("output.txt") creates an Output path. Instances of this class can be used in Python functions that work with paths, like passing it to the standard open function.
Dataset.list() returns a list of all available datasets in a project. A Dataset object holds the available metadata for a Renku dataset like its name, title, list of creators, etc.
dataset.files is used to get a list of all data files in a dataset.
Input(dataset_file.path) creates an Input instance, which informs Renku that the file’s path is an input to our script.
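
To illustrate how an object can both record itself and still be passed to open, here is a minimal sketch based on Python's standard os.PathLike protocol. The class TrackedOutput and the list recorded_outputs are hypothetical stand-ins for illustration; we are not claiming this is how Renku implements Input and Output internally.

```python
import os
import tempfile

# Hypothetical registry standing in for a metadata store.
recorded_outputs = []


class TrackedOutput(os.PathLike):
    """Illustrative path-like object that records its path when created."""

    def __init__(self, path):
        self.path = os.fspath(path)
        recorded_outputs.append(self.path)  # Remember the path for later reporting

    def __fspath__(self):
        # open(), pathlib.Path() and os.path functions call this to get the path.
        return self.path


# Usage: the object is passed to open() exactly like a plain string path.
target = TrackedOutput(os.path.join(tempfile.mkdtemp(), "output.txt"))
with open(target, "w") as result:
    result.write("hello\n")
```

Because the object implements __fspath__, existing code that expects a path keeps working unchanged, and the recording happens as a side effect of creating the object.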

We can run this script in a Renku project using the following command:

renku run python3 script.py

Because the script registers the files as inputs as it processes them, we can reuse the same command even if the list of files changes and the results will remain self-consistent.

When the execution is over, Renku uses the information that we provided in the script to produce proper metadata for the execution: it records all dataset files as inputs (along with script.py), sets output.txt as the output, and marks that an n_lines parameter with a value of 10 was used in this run. The resulting lineage from this execution looks like the following:

Comparing the two different versions of script/command shows that it’s simpler, cleaner, and more reliable to access Renku metadata and record lineage using the Renku Python API.

For more information, see the Renku Python API documentation.

This is still a work in progress. We will include more features in future versions. Feel free to reach out to us for any feedback, comments, or feature requests that you might have.

lusamino · 6 July 2021 16:08

Hi @mohammad-sdsc

Wow, this feature is fantastic!! I can finally test it out, and I was wondering: when specifying the inputs, is it possible to just indicate the path to a folder? Instead of a specific file. Or would that cause an error?

Thank you so much!
Luis

mohammad-sdsc · 6 July 2021 17:09

Hi @lusamino, thanks!

It should work with no issue. Basically, you can pass the same input types that renku run accepts, which include directories as well.

Cheers,
Mohammad

lusamino · 6 July 2021 20:35

Fantastic! I knew you would have considered that possibility. I will let you know in any case if I encounter any errors.

Thanks @mohammad-sdsc
Cheers
Luis

lusamino · 8 July 2021 20:57

Hi @mohammad-sdsc ,

One last quick question: is Renku tracking other generated outputs beyond the ones specified?? Actually, I am generating some temporary files that are used for validation, but that I would not like to track in the workflow. If they are automatically added, is there a way to deactivate this behaviour?

Thank you so much!
Cheers
Luis

mohammad-sdsc · 9 July 2021 06:59

Hi @lusamino,
You can pass --no-input-detection and/or --no-output-detection flags to a renku run command to prevent renku from automatically adding inputs/outputs to your workflow. In this case, only inputs/outputs that are explicitly defined by you (using the API, or --input, --output flags) will be recorded.
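
As a rough illustration, the sketch below assembles such a command in Python. The function renku_run_args and its keyword arguments are hypothetical helpers, not part of Renku; the flags are the ones named above, and placing them before the wrapped command is an assumption of this sketch.

```python
import shlex


def renku_run_args(command, inputs=(), outputs=(), detect_inputs=True, detect_outputs=True):
    """Assemble `renku run` arguments as a list (hypothetical helper, not part of Renku)."""
    args = ["renku", "run"]
    if not detect_inputs:
        args.append("--no-input-detection")  # Only explicitly declared inputs are recorded
    if not detect_outputs:
        args.append("--no-output-detection")  # Only explicitly declared outputs are recorded
    for path in inputs:
        args += ["--input", path]
    for path in outputs:
        args += ["--output", path]
    return args + list(command)


# Example: keep temporary validation files out of the workflow by declaring
# output.txt explicitly and switching off automatic output detection.
command = shlex.join(
    renku_run_args(["python3", "script.py"], outputs=["output.txt"], detect_outputs=False)
)
```

The resulting list can also be handed directly to subprocess.run instead of being joined into a string.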

Mohammad

lusamino · 9 July 2021 07:29

These are truly awesome features @mohammad-sdsc Thank you so much!

Cheers
Luis
