Thoughts arising from the Renku user group meeting

I just wanted to share some thoughts from my notes following the Renku user group meeting recently held in Bern, which I attended remotely. They cover a pretty broad range of topics, are fairly stream-of-consciousness, and contain some partially formed ideas, but I thought it might be helpful to put them on the record somewhere as food for thought.

Workflow files

  • Awesome to have a declarative way of creating a Renku workflow in a simple YAML file!
  • Still using a monolithic environment/resource management approach
    • As I understand it, in contrast to Snakemake or Nextflow, the dependencies of all steps in a workflow must be present in the Renku image, and the resources needed for the most compute-intensive step of the analysis must be available to that session, whereas these other workflow managers allow environments and compute resources to be specified independently per step.
    • Renku’s strengths remain in providing an interactive analysis environment for smaller datasets, rather than direct execution of large, batched, computationally intensive workflows.
    • Could Renku integrate with other pipeline management tools to extract provenance and record the steps needed to reproduce the upstream analysis, perhaps as part of dataset metadata?

Project Templates

  • I think these are potentially underutilised as a way of demonstrating the platform’s value-add.
  • GitLab CI/CD pipelines can be included in templates, and a lot of cool automations can be done here.
    • These can potentially be modularised with includes.
  • Examples of automations that might make cool demo templates
    • Automatically minting Zenodo DOIs on new version tags and uploading a git snapshot to Zenodo using gitlab2zenodo. This could do with a helper to set up the Zenodo metadata JSON file (see the first sketch after this list).
    • Templates with pipelines that automatically build Jupyter Book / Quarto projects and serve them via GitLab Pages sites would be very useful. (Pipelines with extra features, like also serving previous versions of sites, could be made.)
      • This could serve as a vehicle for making a dashboard for launching apps in Renku just by creating a static web page with links to Renku session autostart links; the page could be visible only to members of the GitLab group (see the second sketch after this list).
    • Shiny app builds/deployment
      • A Shiny app template based on the {golem} framework for Shiny development (see also Engineering Production-Grade Shiny Apps) could provide a Shiny app development and testing environment, using golem’s tools to build apps for shinyapps.io (with credentials in GitLab environment variables) or a Docker image for ShinyProxy, if the ShinyProxy instance were configured to pull from the Renku container registry.
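
As a rough illustration of the helper suggested above: a minimal Python sketch that interactively assembles the Zenodo metadata JSON file for gitlab2zenodo. The field names follow Zenodo’s deposition metadata schema, but the exact file name and structure gitlab2zenodo expects are assumptions to check against its README.

```python
#!/usr/bin/env python3
"""Interactively generate a minimal Zenodo metadata file for gitlab2zenodo.

A sketch only: field names follow Zenodo's deposition metadata schema;
check the gitlab2zenodo README for the exact file name and structure.
"""
import json

def prompt_creators() -> list:
    """Collect one or more creators as {"name": ..., "affiliation": ...}."""
    creators = []
    while True:
        name = input('Creator name ("Family, Given", empty to finish): ').strip()
        if not name:
            return creators
        creator = {"name": name}
        affiliation = input("Affiliation (optional): ").strip()
        if affiliation:
            creator["affiliation"] = affiliation
        creators.append(creator)

def main() -> None:
    metadata = {
        "title": input("Title: ").strip(),
        "description": input("Description: ").strip(),
        "upload_type": "software",
        "creators": prompt_creators(),
    }
    keywords = input("Keywords (comma separated, optional): ").strip()
    if keywords:
        metadata["keywords"] = [k.strip() for k in keywords.split(",")]
    # Assumed file name and wrapping; gitlab2zenodo's README is authoritative.
    with open(".zenodo.json", "w") as f:
        json.dump(metadata, f, indent=2)
    print("Wrote .zenodo.json")

if __name__ == "__main__":
    main()
```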
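
And a sketch of the static launcher page idea: a Python script that generates an HTML page of session autostart links, which a GitLab Pages pipeline could publish. The autostart URL pattern, instance URL, and project paths are placeholders/assumptions to verify against your Renku instance.

```python
#!/usr/bin/env python3
"""Generate a static HTML dashboard of Renku session autostart links.

A sketch: the autostart URL pattern is an assumption based on
renkulab.io session links; instance and project paths are placeholders.
"""
import os
from html import escape

RENKU_BASE = "https://renkulab.io"  # placeholder instance URL

# (label, "namespace/project") pairs for the apps to expose
APPS = [
    ("Data explorer", "my-group/data-explorer"),
    ("QC dashboard", "my-group/qc-dashboard"),
]

def autostart_url(project_path: str) -> str:
    """Assumed pattern for a link that launches a session immediately."""
    return f"{RENKU_BASE}/projects/{project_path}/sessions/new?autostart=1"

def render(apps) -> str:
    items = "\n".join(
        f'    <li><a href="{autostart_url(path)}">{escape(label)}</a></li>'
        for label, path in apps
    )
    return (
        "<!DOCTYPE html>\n<html>\n<head><title>Renku app launcher</title></head>\n"
        f"<body>\n  <h1>Renku app launcher</h1>\n  <ul>\n{items}\n  </ul>\n"
        "</body>\n</html>\n"
    )

if __name__ == "__main__":
    os.makedirs("public", exist_ok=True)  # GitLab Pages serves from public/
    with open("public/index.html", "w") as f:
        f.write(render(APPS))
```

Restricting the page to group members would then be a GitLab Pages access-control setting rather than anything in the page itself.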

Hub Pages

  • How will these relate to GitLab groups/sub-groups?
  • Not having the structure of GitLab groups mirrored in the Renku UI has been a bit of a pain point. Navigating to https://renkulab.io/projects/<group>/<sub-group>/ and being able to see the projects in that group would be very convenient.
  • A section explicitly for Renku project templates within Hub pages would be a nice addition and promote the templates feature.

What does a Renku native project without a repo look like?

  • These are mostly fairly open-ended questions about what the Renku native project ‘format’ will look like, with an eye to not losing what I perceive to be the benefits of having the git repo be the single source of truth for a project.
  • The current approach of having everything in the git repo makes things incredibly portable - I view this as a major contributor to Renku’s extremely minimal ‘vendor lock-in’.
  • What is authoritative about project state?
  • Repos as the source of truth provide a ‘declarative’ definition of a project, insofar as in theory you could write out a complete description of a Renku project, even if in practice some of this is normally done imperatively by invoking renku CLI commands.
  • Is it possible to cache state in DBs etc. for efficiency while still keeping the git repo / flat files as the single source of truth?
    • Checking that the hashes of cached items match their contents in the repo/files, and only updating caches when these change (see the sketch after this list)?
    • Can efficiencies be gained from making shallow clones when the need arises to write to the git repo?
  • There is a lot that could be done to extend Renku automation based on GitLab CI/CD. Would having GitHub or other git forges as options for the repo provider get in the way of work in this area, since such work would not generalise to GitHub Actions and similar CI/CD stacks? (A third-party automation tool, e.g. Jenkins, could be used instead, but that is an extra thing to manage.)
  • How does this relate to earlier discussions about bundling / exporting Renku projects into publishable formats like RO-Crates? See: Zenodo Code release / dataset tag automation
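
A minimal sketch of the hash-checking idea above, in Python: recompute git-style blob hashes for the files of interest and refresh a cache entry only when the hash no longer matches. The cache file name and the example metadata path are hypothetical.

```python
#!/usr/bin/env python3
"""Refresh cached project state only when the underlying files change.

A sketch: the cache here is a JSON file mapping paths to git blob
hashes; a real implementation might keep the same mapping in a DB.
"""
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path(".renku-state-cache.json")  # hypothetical cache location

def git_blob_hash(path: Path) -> str:
    """Hash file contents the way git hashes blobs: sha1 over a header + body."""
    data = path.read_bytes()
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()

def load_cache() -> dict:
    return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

def refresh_stale(paths) -> list:
    """Return the paths whose cached hash is missing or out of date."""
    cache = load_cache()
    stale = []
    for path in map(Path, paths):
        if not path.exists():
            continue  # nothing to hash yet
        current = git_blob_hash(path)
        if cache.get(str(path)) != current:
            stale.append(path)
            cache[str(path)] = current  # record the new content hash
    CACHE_FILE.write_text(json.dumps(cache, indent=2))
    return stale

if __name__ == "__main__":
    # Example path; only re-parse files whose contents actually changed.
    for path in refresh_stale([".renku/metadata.yml"]):
        print(f"cache stale, reprocessing: {path}")
```

When a write to the repo is actually needed, a shallow clone (git clone --depth 1) made on demand, as suggested above, would avoid copying the full history into the session.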

Data Versioning

  • Provenance, Duplication, & locality challenges
    • How to handle the use of data referenced by DOI/other URI and stored in external public repositories when you need a local copy of it to compute on?
    • Reproducibly creating ephemeral versions as Renku datasets?
    • Suggested approaches: templates for common data sources?
  • On the various data versioning technologies that could replace git-lfs:
    • DVC (Data Version Control) seems to simplify deployment relative to git-lfs but still has the file-copying issue. It does integrate nicely with the commit history, though, like git-lfs.
    • ZFS can be used for data versioning; it’s ‘closer to the metal’, with highly tuneable performance, but not ‘cloud native’, so it needs a layer for network-based file sharing between it and the session if not running locally. It might be a good fit if you were deploying an instance as an ‘appliance’ and wanted to maximise performance.
    • lakeFS would seem to require the least additional effort to integrate further, as it appears compatible with the S3 work already done and avoids the issues of DVC & git-lfs, though tighter git/renku-cli integration might be helpful for more complete provenance. This seems like the best option to pursue tighter integration with (see the sketch after this list).
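
To illustrate the S3-compatibility point: a minimal Python sketch of reading a branch-pinned object from lakeFS through its S3 gateway with boto3. The endpoint, credentials, repository, and branch names are placeholders; lakeFS addresses objects as <repository>/<branch or commit>/<path>.

```python
#!/usr/bin/env python3
"""Read a versioned object from lakeFS via its S3-compatible gateway.

A sketch: endpoint, credentials, repository, and branch are
placeholders for a hypothetical deployment.
"""
import boto3

# lakeFS speaks the S3 API, so a stock S3 client works; only the
# endpoint and credentials differ from a normal AWS setup.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.org",  # placeholder gateway URL
    aws_access_key_id="LAKEFS_ACCESS_KEY",      # placeholder credentials
    aws_secret_access_key="LAKEFS_SECRET_KEY",
)

# Objects are addressed as <repository>/<branch or commit>/<path>, so
# pinning the key to a branch (or commit ID) gives a versioned read.
response = s3.get_object(
    Bucket="my-dataset-repo",          # lakeFS repository name
    Key="main/data/measurements.csv",  # branch followed by object path
)
print(response["Body"].read()[:200])
```

Pinning the key to a commit ID instead of a branch name would give a fully immutable read for provenance purposes.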

Compute Environment Reproducibility & Portability

  • You can only go so far on this front with the combination of a Dockerfile based on a conventional Linux distro and a conda environment, because of the need to version system-level dependencies.
  • Nix
    • Unlike conda, Nix makes it possible to define even operating-system-level dependencies, and it can bundle the resulting environment into different ‘formats’. Environments can be built from a Nix configuration as Docker images, VM images, or Linux binaries, making environments managed with Nix potentially much more portable to different infrastructures, which often favour different and idiosyncratic approaches to deploying compute environments.
    • With flakes, Nix provides detailed dependency versioning and a lock file for the entire environment, meaning that rebuilding an environment from a Nix flake should never fail (assuming the sources are still available), unlike rebuilding a Docker image from a Dockerfile.
    • The major downside is, of course, exposing Nix to end users, as it has something of a learning curve. I’m not personally convinced, though, that the curve for Nix is so much steeper than that for combining Docker and conda as to outweigh the advantages Nix offers over that combination.
    • (Note: reproducible environments are taken to their extreme in Guix, which draws on the ideas of Nix. Guix has a full-source bootstrap, i.e. the ability to build the entire dependency tree of a given package from source, including build dependencies, with versioning of all dependencies down to git-commit resolution if desired.)

Misc

  • Idea: a possible hack for using GitLab CI/CD environment variables as a credentials store (see the sketch at the end of this list)?
    • Make a GitLab PAT for the user a default environment variable in the Renku session and use the GitLab API to get CI/CD environment variables from the GitLab repo on project start?
    • See: Project-level CI/CD variables API | GitLab
    • (This assumes you can automatically create a PAT for users and inject it into the Renku session as an environment variable. If not, it is still a way of getting down to a single environment variable to set when starting a session, even if you have multiple credentials to access.)
  • A thought following the mention of connecting to sessions over SSH: should locally run Renku sessions serve on 0.0.0.0 or 127.0.0.1 by default?
    • 0.0.0.0 may inadvertently expose the ports of a user’s Renku session to others on the local network, as most iptables-based firewalls do not block traffic to services bound to 0.0.0.0 (e.g. ufw enabled on Ubuntu would allow connections to any port served on 0.0.0.0, but not 127.0.0.1, unless ports were explicitly allowed).
  • Note: federation between Renku instances might be made easier by GitLab’s planned ActivityPub integration for cross-instance collaboration.
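
A minimal sketch of the credentials-store hack above, in Python: use a PAT injected into the session to pull the project’s CI/CD variables into the session environment via the documented project-level variables endpoint. The GITLAB_PAT variable name, instance URL, and project path are assumptions.

```python
#!/usr/bin/env python3
"""Pull GitLab project CI/CD variables into the session environment.

A sketch of the hack above: GITLAB_PAT, the instance URL, and the
project path are placeholders; GET /projects/:id/variables is the
documented project-level CI/CD variables endpoint.
"""
import os
import urllib.parse
import requests

GITLAB_URL = "https://gitlab.renkulab.io"  # placeholder instance
PROJECT_PATH = "my-group/my-project"       # placeholder project

def fetch_project_variables(pat: str, project_path: str) -> dict:
    """Return {key: value} for the project's CI/CD variables."""
    project_id = urllib.parse.quote_plus(project_path)  # URL-encoded path works as :id
    response = requests.get(
        f"{GITLAB_URL}/api/v4/projects/{project_id}/variables",
        headers={"PRIVATE-TOKEN": pat},
    )
    response.raise_for_status()
    return {var["key"]: var["value"] for var in response.json()}

if __name__ == "__main__":
    # Assumes a PAT was injected into the session as GITLAB_PAT
    pat = os.environ["GITLAB_PAT"]
    for key, value in fetch_project_variables(pat, PROJECT_PATH).items():
        os.environ.setdefault(key, value)  # expose credentials to child tools
```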

Thank you very much @RichardJActon for joining the meeting and sharing your thoughts in Discourse. They are extremely helpful and will serve as inspiration for further efforts.