When executing a command or script using renku run, Renku tracks a lineage of all arguments that appear on the command-line. These arguments include files and directories containing input/output data and parameters that are passed to a command or script. This enables users to create reproducible workflows by simply prepending renku run to their processing commands without any other modification.
Renku also supports more advanced use cases where dependencies are defined within a script and don’t appear on the command line. For example, parameter values can be read from a database, or a list of input files can be derived from the current date. Users can use features like indirect inputs and outputs to tell Renku about such inputs, outputs, and parameters.
To further simplify tracking and storing such indirect information in Renku, we created a Python library that provides an interface to Renku internals: the Renku Python API. It exposes Renku concepts such as Dataset, Input, Output, and Parameter as Python objects that users can access in their scripts. The first version of this library allows accessing a project’s metadata inside a script and tracking specific lineage information.
In this example, we will use the Renku API in a Python script to demonstrate some of its features. Our example will read the first n lines from each file in all datasets and write them to an output file. The following script achieves this goal without using the Renku API:
    from pathlib import Path

    n = 10
    output_file = "output.txt"

    with open(output_file, "w") as result:
        for path in Path("data").rglob("*/*"):  # Walk the list of all data files
            if not path.is_file():  # Skip directories matched by the pattern
                continue
            with open(path) as input:
                for line in input.readlines()[:n]:  # Get the first n lines of an input
                    result.write(line)
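As an aside, the script above loads every file completely with readlines() before slicing. A possible variant, given only as a sketch, wraps the same walk in a function and uses itertools.islice so that reading stops after the first n lines. The function name copy_head_lines and its parameters are illustrative assumptions, not part of the original script.

```python
from itertools import islice
from pathlib import Path


def copy_head_lines(data_dir, output_file, n=10):
    """Write the first n lines of every data file below data_dir to output_file.

    Returns the list of files that were read, in sorted order.
    """
    # Collect the inputs before opening the output, using the same pattern
    # as the script above and skipping directories.
    paths = sorted(p for p in Path(data_dir).rglob("*/*") if p.is_file())
    with open(output_file, "w") as result:
        for path in paths:
            with open(path) as source:
                # islice stops after n lines instead of loading the whole file
                result.writelines(islice(source, n))
    return paths
```

The rest of this example keeps using the original script, saved as script.py.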
We can execute this as a pure Python script, but we can also run it through Renku to track its lineage. In this case, we need to provide a list of all of its dependencies on the command-line:
renku run --input data/sets/sets.csv --input ... --output output.txt python3 script.py
Note that in future executions of this script we may need to adapt the list of input files if they change. Overall, this could be a rather cumbersome and error-prone process. In addition, if the list of files is long, the script invocation becomes difficult to read and understand.
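To make this maintenance burden concrete, the following sketch builds the argument list for such a command from the files currently on disk. The helper build_renku_command is hypothetical and not part of Renku; it relies only on the --input and --output options shown above and does not execute anything.

```python
from pathlib import Path


def build_renku_command(data_dir, output_file, script):
    """Return a `renku run` argument list with one --input per data file."""
    command = ["renku", "run"]
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():  # directories are not passed as inputs here
            command += ["--input", path.as_posix()]
    command += ["--output", str(output_file), "python3", str(script)]
    return command
```

Every added, removed, or renamed file changes this list, so the command would have to be regenerated before each run.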
We can solve these shortcomings by re-writing the same script using the Renku Python API. The following shows the new script (a sample project is available at Renku API Demo):
    from renku.api import Dataset, Input, Output, Parameter

    n = Parameter(name="n_lines", value=10)  # Define a Parameter
    output_file = Output("output.txt")  # Define an Output

    with open(output_file, "w") as result:
        for dataset in Dataset.list():  # Iterate through all Datasets
            for dataset_file in dataset.files:  # Walk the list of a dataset's files
                with open(Input(dataset_file.path)) as input:  # Define an Input
                    for line in input.readlines()[:n]:
                        result.write(line)
Parameter(name="n_lines", value=10) creates a Parameter instance that tells Renku a parameter with name n_lines and value of 10 is used in this execution.
Output("output.txt") defines an Output path. Instances of this class can be used in Python functions that work with paths, for example by passing them to the standard open function.
Dataset.list() returns a list of all available datasets in a project. A Dataset object holds the available metadata for a Renku dataset like its name, title, list of creators, etc.
dataset.files is used to get a list of all data files in a dataset.
Input(dataset_file.path) informs Renku that the file’s path is an input to our script by creating an instance of the Input class.
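The reason such objects can be handed directly to open is Python's path-like protocol. The sketch below is not Renku's implementation; it is a hypothetical TrackedPath class, invented here for illustration, that shows how a wrapper can record a path and still be accepted anywhere a path is expected.

```python
import os


class TrackedPath(os.PathLike):
    """Hypothetical path wrapper that remembers every path it was created for."""

    registry = []  # paths seen so far, in creation order

    def __init__(self, path):
        self._path = os.fspath(path)
        TrackedPath.registry.append(self._path)

    def __fspath__(self):
        # open(), os.path functions, and pathlib.Path call this to get the real path
        return self._path
```

With this class, open(TrackedPath("output.txt"), "w") behaves like open("output.txt", "w") while the path is also appended to TrackedPath.registry. Renku's Input and Output classes additionally report the path to Renku, which this sketch does not attempt.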
We can run the new script in a Renku project using the following command:
renku run python3 script.py
Because the script registers the files as inputs as it processes them, we can reuse the same command even if the list of files changes and the results will remain self-consistent.
When the execution is over, Renku uses the information that we provided in the script to produce proper metadata for the execution: it records all dataset files as inputs and output.txt as the output, and notes that an n_lines parameter with a value of 10 was used in this run. The resulting lineage from this execution looks like the following:
Comparing the two versions of the script and command shows that it’s simpler, cleaner, and more reliable to access Renku metadata and record lineage using the Renku Python API.
For more information, see the Renku Python API documentation.
This is still a work in progress. We will include more features in future versions. Feel free to reach out to us for any feedback, comments, or feature requests that you might have.