Conversations about longer-term reproducibility and archiving of environments have come up here and there over the years, so I wanted to check whether this is still something the community cares about and, if so, whether the idea below might be a useful, practical option.
Below I sketch out an idea for a small, intentionally minimal manifest format (“Renku-Rehydrate”) that records stable references to code (commit SHAs), data (DOIs/PIDs), and environments (conda files or container digests) so a project can be rehydrated later with minimal effort. I’d be grateful for any thoughts on viability, or on additional features you would want from such a manifest.
For context, here’s an excerpt from a related Oct 2020 discussion on this forum (full thread: https://renku.discourse.group/t/how-to-ensure-reproducibility-of-environment/181/19):
schymans (Oct 2020): If I don’t specify package versions, the currently up-to-date version will be installed — what happens in a year? Is there a way to find the original package versions?
rrrrrok (Oct 2020): Images on Renku are persisted for quite a long time while available in the registry, but rebuilding can make them unstable. Maybe a deliberate action like renku freeze (e.g., pip freeze > requirements.txt) could help; for proper preservation, exporting the image to an archive like Zenodo is recommended.
A proposed solution:
Make a very small manifest file (YAML or JSON) that contains only stable identifiers: exact git commits for code, DOIs/PIDs/handles for data, and concrete environment references (conda files or container digests). The file is intentionally lightweight (no bundled data), portable (can live in Zenodo/GitHub/institutional archives), and human-readable, so it can serve as a durable archival “bootstrap” for reconstructing a project in Renku.
I’ve also put a draft spec with examples into a Markdown document in a git repo, in case anyone wants to edit it collaboratively:
https://gitlab.eawag.ch/chase.nunez/renku_rehydrate
Why this helps Renku users
- Provides a compact archival snapshot that is easy to store/share and robust to changes in infrastructure, promoting reproducibility.
- Lets users or reviewers reconstruct the same code + data + environment without hunting for commits or dataset versions, promoting collaboration.
- Enables a simple user experience path (e.g., drag-and-drop a renku-rehydrate.yml) to recreate a project workspace, adding additional functionality to Renku’s platform.
- Offers an interoperable artifact that can be used by other platforms (RO-Crate, institutional archives, Zenodo).
Example manifest
manifest-version: 1.0
project: my-renku-study
description: "Analysis for Paper XYZ, v1.0"
created-by: renku-publish
code:
  - repo: https://github.com/alice/my-analysis.git
    commit: 3a1f7b9d0e2...
data:
  - source: Zenodo
    id: 10.5281/zenodo.1234567
  - source: SciCat
    id: scicat://abcde-12345
environment:
  conda-file: environment.yml
  commit: 3a1f7b9d0e2...
  container-image: myrepo/analysis-env:1.0.0
  container-digest: sha256:abcdef123...
  container-archive: DOI:10.5281/zenodo.7654321
run:
  entrypoint: analysis_script.sh
compute:
  platform: "Lustre HPC cluster"
  cpus: 16
  memory: 64GB
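For anyone who prefers to read the structure as types, here is a minimal Python sketch of how the core sections could map onto records. It assumes the JSON form of the manifest (so only the standard library is needed), and the class and function names (`CodeRef`, `DataRef`, `Manifest`, `load_manifest`) are my own placeholders rather than anything that exists in Renku.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CodeRef:
    repo: str
    commit: str


@dataclass(frozen=True)
class DataRef:
    source: str
    id: str


@dataclass(frozen=True)
class Manifest:
    project: str
    code: list[CodeRef] = field(default_factory=list)
    data: list[DataRef] = field(default_factory=list)
    # Kept as a plain mapping because the environment keys are still in flux.
    environment: dict[str, str] = field(default_factory=dict)


def load_manifest(text: str) -> Manifest:
    """Parse the JSON form of a manifest into typed records.

    Unknown keys inside a code or data entry raise TypeError, which keeps
    the sketch strict about the (assumed) schema.
    """
    raw = json.loads(text)
    return Manifest(
        project=raw["project"],
        code=[CodeRef(**entry) for entry in raw.get("code", [])],
        data=[DataRef(**entry) for entry in raw.get("data", [])],
        environment=dict(raw.get("environment", {})),
    )
```

Optional sections such as run and compute are left out here on purpose; they could be added the same way once the core fields are settled.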
Potential UX / backend approach
Drag-and-drop Rehydration
- User drops renku-rehydrate.yml into a “Rehydrate Project” area in the Renku portal.
- Renku parses the manifest and:
  - clones referenced repos at specified commits,
  - fetches datasets by DOI/PID,
  - restores the declared environment (conda or container),
  - creates a project workspace / GitLab repo with metadata and provenance.
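To make the steps above a bit more concrete, here is a rough sketch that turns a parsed manifest into an ordered list of commands without running anything. The git, docker and conda invocations are the standard ones; `fetch-dataset` is a made-up placeholder for whatever connector would handle a given source, and the function names and the `rehydrated` directory are assumptions of mine. It also assumes the container digest should win over the conda file when both are present.

```python
import shlex


def image_without_tag(image: str) -> str:
    """Drop a trailing ':tag' so the image can be addressed by digest only."""
    head, sep, last = image.rpartition("/")
    return head + sep + last.split(":", 1)[0]


def rehydration_plan(manifest: dict, workdir: str = "rehydrated") -> list[str]:
    """List, in order, the commands a rehydration run might execute."""
    steps: list[str] = []

    # Clone each repository, then pin the checkout to the recorded commit.
    for entry in manifest.get("code", []):
        name = entry["repo"].rstrip("/").rsplit("/", 1)[-1].removesuffix(".git")
        target = shlex.quote(f"{workdir}/{name}")
        steps.append(f"git clone {shlex.quote(entry['repo'])} {target}")
        steps.append(f"git -C {target} checkout {shlex.quote(entry['commit'])}")

    # 'fetch-dataset' is a hypothetical placeholder, not a real CLI.
    for entry in manifest.get("data", []):
        steps.append(
            f"fetch-dataset {shlex.quote(entry['source'])} {shlex.quote(entry['id'])}"
        )

    # Prefer the immutable container digest; fall back to the conda file.
    env = manifest.get("environment", {})
    if "container-image" in env and "container-digest" in env:
        ref = f"{image_without_tag(env['container-image'])}@{env['container-digest']}"
        steps.append(f"docker pull {shlex.quote(ref)}")
    elif "conda-file" in env:
        steps.append(f"conda env create -f {shlex.quote(env['conda-file'])}")

    return steps
```

Returning a plan instead of executing it would also let the portal show users what is about to happen before anything is cloned or pulled.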
Backend pieces
- manifest parser (YAML/JSON)
- GitLab project creation + commit population
- connectors to Zenodo/SciCat/EnviDat/etc. for dataset retrieval
- conda/container restore (or link to container archive DOI)
- provenance logging so rehydrated projects can be archived again
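For the connector piece in particular, one possible shape is a small registry keyed by the manifest's source field, so new repositories can be added without touching the parser. This is only a sketch: the `connector` decorator, `CONNECTORS` and `resolve` are names I made up, and the only resolver shown relies on the fact that DOIs resolve through the doi.org proxy. SciCat, EnviDat and others would each need their own function.

```python
from typing import Callable

# Maps a lower-cased manifest 'source' to a function that turns an id into a URL.
CONNECTORS: dict[str, Callable[[str], str]] = {}


def connector(source: str):
    """Decorator that registers a resolver for one dataset source."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        CONNECTORS[source.lower()] = func
        return func
    return register


@connector("zenodo")
def zenodo_landing_page(identifier: str) -> str:
    # Zenodo ids in the manifest are DOIs, which resolve via the doi.org proxy.
    return "https://doi.org/" + identifier.removeprefix("DOI:")


def resolve(entry: dict) -> str:
    """Return a URL for a data entry, or raise if no connector is registered."""
    try:
        handler = CONNECTORS[entry["source"].lower()]
    except KeyError:
        raise LookupError(f"no connector for source {entry['source']!r}") from None
    return handler(entry["id"])
```

An unregistered source fails loudly here, which seems preferable to silently skipping a dataset during rehydration.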
Ways forward
Low-hanging fruit:
- Define a minimal, unambiguous schema for code, data, and environment.
- Support a “launch link” that pulls the same commit + data into a running session (less than full export format).
- Implement a parser that validates a manifest and reports missing/invalid identifiers.
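As a starting point for the validation item, here is a small sketch of what "reports missing/invalid identifiers" could mean in practice. The rules are my assumptions, not a settled spec: commits must be full 40-character SHA-1 hashes, ids beginning with "10." must look like DOIs, and container digests must be sha256. Other PID schemes are passed through unchecked.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")          # assumes SHA-1 git repositories
DOI = re.compile(r"10\.\d{4,9}/\S+")            # loose DOI shape, not a registry lookup
DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


def validate(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems: list[str] = []

    for key in ("code", "data", "environment"):
        if not manifest.get(key):
            problems.append(f"missing section: {key}")

    for i, entry in enumerate(manifest.get("code") or []):
        if not FULL_SHA.fullmatch(str(entry.get("commit", ""))):
            problems.append(f"code[{i}]: commit is not a full 40-character SHA")

    for i, entry in enumerate(manifest.get("data") or []):
        ident = str(entry.get("id", ""))
        if not ident:
            problems.append(f"data[{i}]: empty id")
        elif ident.startswith("10.") and not DOI.fullmatch(ident):
            # Only DOI-shaped ids are checked; other PID schemes pass through.
            problems.append(f"data[{i}]: malformed DOI {ident!r}")

    digest = (manifest.get("environment") or {}).get("container-digest")
    if digest is not None and not DIGEST.fullmatch(str(digest)):
        problems.append("environment: container-digest is not a sha256 digest")

    return problems
```

Note that the abbreviated placeholder values in the example manifest would be flagged by these rules, which is the intended behaviour: an archived manifest should carry complete identifiers.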
Medium / aspirational
- Full drag-and-drop with GitLab project creation and automatic provenance records.
- Integration with DOI services for container archives and dataset fetching across repositories.
- UI to show which parts are available now vs. only available via archived DOIs.
Questions for the community
- Are there existing Renku design patterns or internal APIs I should align with? (e.g., preferred metadata fields, provenance model)
- Which dataset registries would you prioritize connectors for (Zenodo, EnviDat, SciCat, others)?
- Would a “launch link” (that always resolves to the same code+data) be a useful intermediate feature before full import/export?
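To give the launch-link question something concrete to react to, here is one way such a link could be built: the pinned references are simply encoded as query parameters. The base URL and the parameter names (`repo`, `commit`, `data`) are purely hypothetical; I am not aware of an existing Renku route like this.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
LAUNCH_BASE = "https://renku.example.org/launch"


def launch_link(repo: str, commit: str, dataset_ids: list[str]) -> str:
    """Encode pinned code and data references into a shareable URL."""
    # Dataset ids are sorted so the same set always yields the same link.
    query = [("repo", repo), ("commit", commit)]
    query += [("data", dataset_id) for dataset_id in sorted(dataset_ids)]
    return f"{LAUNCH_BASE}?{urlencode(query)}"
```

Because the link carries only a commit SHA and dataset identifiers, it would stay stable as long as those identifiers do, which is the same property the full manifest relies on.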
I’m grateful for any feedback, suggested features, or user stories where this sort of thing would be useful for you. I am happy to iterate based on what the community thinks are the most valuable, doable next steps.