When executing a command or script using renku run, Renku tracks a lineage of all arguments that appear on the command line. These arguments include files and directories containing input/output data and parameters that are passed to a command or script. This enables users to create reproducible workflows by simply prepending renku run to their processing commands without any other modification.
Renku also supports more advanced use cases where dependencies are within a script and don’t appear on the command line. For example, parameter values can be read from a database, or a list of input files can be derived from the current date. Users can use features like indirect inputs and outputs to tell Renku about such inputs, outputs, and parameters.
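To make the date-derived case concrete, here is a minimal sketch of a script whose inputs never appear on the command line. The data/daily directory layout and the daily_input_paths helper are hypothetical and chosen only for illustration; they are not part of Renku.

```python
from datetime import date, timedelta
from pathlib import Path


def daily_input_paths(today, days=3, root=Path("data/daily")):
    """Return hypothetical per-day CSV paths for the last `days` days."""
    paths = []
    for offset in range(days):
        day = today - timedelta(days=offset)
        paths.append(root / f"{day.isoformat()}.csv")
    return paths


if __name__ == "__main__":
    # The file names depend on when the script runs, so they cannot be
    # listed up front as command-line arguments.
    for path in daily_input_paths(date.today()):
        print(path)
```

Because the paths are computed inside the script, a tool that only inspects command-line arguments has no way to see them.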
To further simplify tracking and storing such indirect information in Renku, we created a Python library that provides an interface to Renku internals: the Renku Python API. It introduces the Renku concepts of Input, Output, Parameter, and Dataset as Python objects that users can access in their scripts. The first version of this library allows accessing a project’s metadata inside a script and tracking specific lineage information.
In this example, we will use the Renku API in a Python script to demonstrate some of its features. Our example will read the first n lines from each file in all datasets and write them to an output file. The following script achieves this goal without using the Renku API:
from pathlib import Path

n = 10
output_file = "output.txt"

with open(output_file, "w") as result:
    for path in Path("data").rglob("*/*"):  # Walk the list of all data files
        with open(path) as input:
            for line in input.readlines()[:n]:  # Get the first n lines of an input
                result.write(line)
We can execute this as a pure Python script, but we can also run it through Renku to track its lineage. In this case, we need to provide a list of all of its dependencies on the command-line:
renku run --input data/sets/sets.csv --input ... --output output.txt python3 script.py
Note that in future executions of this script we may need to adapt the list of input files if they change. Overall, this could be a rather cumbersome and error-prone process. In addition, if the list of files is long, the script invocation becomes difficult to read and understand.
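To illustrate how quickly such an invocation grows, the following sketch assembles the renku run command by listing every file under a data directory as an --input. The build_renku_command helper is hypothetical; it relies only on the --input and --output flags shown above and prints the command rather than executing it.

```python
import shlex
from pathlib import Path


def build_renku_command(data_dir, output_file, script="script.py"):
    """Assemble a `renku run` invocation with one --input per data file."""
    command = ["renku", "run"]
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():  # Skip directories; only files are inputs
            command += ["--input", path.as_posix()]
    command += ["--output", output_file, "python3", script]
    return command


if __name__ == "__main__":
    # Print the command instead of running it; it gains two tokens per file.
    print(shlex.join(build_renku_command("data", "output.txt")))
```

Every added or removed data file changes the generated command, which is exactly the maintenance burden described above.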
We can solve these shortcomings by re-writing the same script using the Renku Python API. The following shows the new script (a sample project is available at Renku API Demo):
from renku.api import Dataset, Input, Output, Parameter

n = Parameter(name="n_lines", value=10)  # Define a Parameter
output_file = Output("output.txt")  # Define an Output

with open(output_file, "w") as result:
    for dataset in Dataset.list():  # Iterate through all Datasets
        for dataset_file in dataset.files:  # Walk the list of a dataset's files
            with open(Input(dataset_file.path)) as input:  # Define an Input
                for line in input.readlines()[:n]:
                    result.write(line)
- Parameter(name="n_lines", value=10) creates a Parameter instance that tells Renku that a parameter named n_lines with a value of 10 is used in this execution.
- Output("output.txt") creates an Output path. Instances of this class can be used in Python functions that work with paths, for example by passing them to the standard open function.
- Dataset.list() returns a list of all available datasets in a project. A Dataset object holds the available metadata for a Renku dataset, such as its name, title, and list of creators.
- dataset.files is used to get a list of all data files in a dataset.
- Input(dataset_file.path) informs Renku that the file’s path is an input to our script by creating an instance of the Input class.
We can run this script in a Renku project using the following command:
renku run python3 script.py
Because the script registers the files as inputs as it processes them, we can reuse the same command even if the list of files changes and the results will remain self-consistent.
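The idea of an object that both behaves like a path and registers itself on creation can be sketched with Python's os.PathLike protocol. This is an illustration of the general technique under our own assumptions, not Renku's actual implementation: TrackedInput and REGISTERED_INPUTS are hypothetical names, and the in-memory list stands in for real lineage metadata.

```python
import os
import tempfile

# Paths registered during this run; a stand-in for real lineage metadata.
REGISTERED_INPUTS = []


class TrackedInput(os.PathLike):
    """Hypothetical path wrapper that records itself when it is created."""

    def __init__(self, path):
        self.path = os.fspath(path)
        REGISTERED_INPUTS.append(self.path)

    def __fspath__(self):
        # open(), os.stat(), pathlib.Path(), etc. call this to get the path.
        return self.path


if __name__ == "__main__":
    # Create a throwaway file so the example is self-contained.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write("hello\n")

    # Wrapping the path registers it; open() accepts the wrapper directly.
    with open(TrackedInput(handle.name)) as source:
        print(source.readline().strip())

    print(REGISTERED_INPUTS)
    os.remove(handle.name)
```

Since registration happens at the point of use, the set of recorded inputs always matches the files the script actually opened, without any change to the command line.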
When the execution is over, Renku uses the information that we provided in the script to produce proper metadata for the execution: it records all dataset files as inputs (along with script.py), sets output.txt as the output, and marks that an n_lines parameter with a value of 10 was used in this run. The resulting lineage from this execution looks like the following:
Comparing the two versions of the script and command shows that it’s simpler, cleaner, and more reliable to access Renku metadata and record lineage using the Renku Python API.
For more information, see the Renku Python API documentation.
This is still a work in progress. We will include more features in future versions. Feel free to reach out to us for any feedback, comments, or feature requests that you might have.