What about a long-term archival object combining data, code, and compute details?

I’ve noticed conversations about longer-term reproducibility and archiving of environments have come up here and there over the years, so I wanted to check whether this is still something the community cares about and — if so — whether the idea below might be a useful, practical option.

Below I sketch out an idea for a small, intentionally minimal manifest format (“Renku-Rehydrate”) that records stable references to code (commit SHAs), data (DOIs/PIDs), and environments (conda files or container digests) so a project can be rehydrated later with minimal effort. I’d be grateful for any thoughts on its viability, or on additional features such an object should include.

For context, here’s an excerpt from a related Oct 2020 discussion on this forum (full thread: https://renku.discourse.group/t/how-to-ensure-reproducibility-of-environment/181/19):

schymans (Oct 2020): If I don’t specify package versions, the currently up-to-date version will be installed — what happens in a year? Is there a way to find the original package versions?
rrrrrok (Oct 2020): Images on Renku are persisted for quite a long time while they remain available in the registry, but rebuilding them can be unstable. Maybe a deliberate action like renku freeze (analogous to pip freeze > requirements.txt) could help; for proper preservation, exporting the image to an archive like Zenodo is recommended.


A proposed solution:

Make a very small manifest file (YAML or JSON) that contains only stable identifiers: exact git commits for code, DOIs/PIDs/handles for data, and concrete environment references (conda files or container digests). The file is intentionally lightweight (no bundled data), portable (can live in Zenodo/GitHub/institutional archives), and human readable so it can serve as a durable archival “bootstrap” for reconstructing a project in Renku.

I’ve also drafted a markdown document with the draft spec and examples in a git repo, in case anyone would prefer to edit collaboratively there:
https://gitlab.eawag.ch/chase.nunez/renku_rehydrate

Why this helps Renku users

  • Provides a compact archival snapshot that is easy to store/share and robust to changes in infrastructure, promoting reproducibility.

  • Lets users or reviewers reconstruct the same code + data + environment without hunting for commits or dataset versions, promoting collaboration.

  • Enables a simple user-experience path (e.g., drag-and-drop a renku-rehydrate.yml) to recreate a project workspace, extending Renku’s platform.

  • Offers an interoperable artifact that can be used by other platforms (RO-Crate, institutional archives, Zenodo).

Example manifest

manifest-version: 1.0
project: my-renku-study
description: "Analysis for Paper XYZ, v1.0"
created-by: renku-publish

code:
  - repo: https://github.com/alice/my-analysis.git
    commit: 3a1f7b9d0e2...

data:
  - source: Zenodo
    id: 10.5281/zenodo.1234567
  - source: SciCat
    id: scicat://abcde-12345

environment:
  conda-file: environment.yml
  commit: 3a1f7b9d0e2...
  container-image: myrepo/analysis-env:1.0.0
  container-digest: sha256:abcdef123...

container-archive: 10.5281/zenodo.7654321

run:
  entrypoint: analysis_script.sh

compute:
  platform: "Lustre HPC cluster"
  cpus: 16
  memory: 64GB

Potential UX / backend approach

Drag-and-drop Rehydration

  1. User drops renku-rehydrate.yml into a “Rehydrate Project” area in the Renku portal.

  2. Renku parses the manifest and:

    • clones referenced repos at specified commits,

    • fetches datasets by DOI/PID,

    • restores the declared environment (conda or container),

    • creates a project workspace / GitLab repo with metadata and provenance.
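
To make step 2 concrete, here is a rough sketch of how a backend might turn a parsed manifest into ordinary git/conda/docker commands. This is only an illustration of the idea, not an actual Renku API; the repo, image, and digest values in the usage below are the placeholder values from the example manifest, and commands are returned rather than executed so the plan could be inspected and logged for provenance first:

```python
# Sketch: build the shell commands a rehydration backend could run,
# given a manifest already parsed into a dict (field names follow the
# example manifest above).

def rehydration_plan(manifest):
    commands = []
    for repo in manifest.get("code", []):
        # Clone, then pin the working tree to the exact archived commit.
        commands.append(["git", "clone", repo["repo"], "workspace"])
        commands.append(["git", "-C", "workspace", "checkout", repo["commit"]])
    env = manifest.get("environment", {})
    if "conda-file" in env:
        # Recreate the conda environment from the archived spec file.
        commands.append(["conda", "env", "create", "-f", env["conda-file"]])
    if "container-image" in env and "container-digest" in env:
        # Pull by digest rather than tag, so the bytes match the archive.
        image = env["container-image"].rsplit(":", 1)[0]
        commands.append(["docker", "pull", image + "@" + env["container-digest"]])
    return commands

plan = rehydration_plan({
    "code": [{"repo": "https://github.com/alice/my-analysis.git",
              "commit": "3a1f7b9"}],
    "environment": {"conda-file": "environment.yml",
                    "container-image": "myrepo/analysis-env:1.0.0",
                    "container-digest": "sha256:abcdef123"},
})
# Yields a git clone + checkout, a conda env create, and a
# digest-pinned docker pull, in that order.
```

Pulling by digest instead of tag is what makes the environment reference stable: a tag like 1.0.0 can be re-pointed at a different image, while a digest cannot.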

Backend pieces

  • manifest parser (YAML/JSON)

  • GitLab project creation + commit population

  • connectors to Zenodo/SciCat/EnviDat/etc. for dataset retrieval

  • conda/container restore (or link to container archive DOI)

  • provenance logging so rehydrated projects can be archived again

Ways forward

Low-hanging fruit:

  • Define a minimal, unambiguous schema for code, data, and environment.

  • Support a “launch link” that pulls the same commit + data into a running session (a lighter-weight step than a full export format).

  • Implement a parser that validates a manifest and reports missing/invalid identifiers.
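
As a rough sketch of what that validating parser could look like: since the format allows JSON as well as YAML, the example below stays stdlib-only by reading the JSON form. The field names follow the example manifest above; the regex checks are illustrative placeholders, not a finished schema:

```python
import json
import re

# Illustrative manifest validator: checks that the stable identifiers
# the format relies on are present and plausibly formed.
COMMIT_RE = re.compile(r"^[0-9a-f]{7,40}$")  # abbreviated or full git SHA
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")    # bare DOI, e.g. 10.5281/zenodo.1234567

def validate_manifest(text):
    """Return a list of problems; an empty list means the manifest passed."""
    problems = []
    try:
        m = json.loads(text)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    if "manifest-version" not in m:
        problems.append("missing manifest-version")
    for i, repo in enumerate(m.get("code", [])):
        if "repo" not in repo:
            problems.append(f"code[{i}]: missing repo URL")
        if not COMMIT_RE.match(repo.get("commit", "")):
            problems.append(f"code[{i}]: missing or malformed commit SHA")
    for i, ds in enumerate(m.get("data", [])):
        ident = ds.get("id", "")
        if not (DOI_RE.match(ident) or "://" in ident):
            problems.append(f"data[{i}]: id is neither a DOI nor a PID URI")
    env = m.get("environment", {})
    if not ("conda-file" in env or "container-digest" in env):
        problems.append("environment: need a conda-file or a container-digest")
    return problems
```

A real implementation would likely hang these checks off a published JSON Schema so other tools can validate manifests without this code, but even this much catches the common failure mode: a manifest whose identifiers cannot actually be resolved later.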

Medium / aspirational

  • Full drag-and-drop with GitLab project creation and automatic provenance records.

  • Integration with DOI services for container archives and dataset fetching across repositories.

  • UI to show which parts are available now vs. only available via archived DOIs.

Questions for the community

  • Are there existing Renku design patterns or internal APIs I should align with? (e.g., preferred metadata fields, provenance model)

  • Which dataset registries would you prioritize connectors for (Zenodo, EnviDat, SciCat, others)?

  • Would a “launch link” (that always resolves to the same code+data) be a useful intermediate feature before full import/export?

I’m grateful for any feedback, suggested features, or user stories where this sort of thing would be useful for you. I am happy to iterate based on what the community thinks are the most valuable, doable next steps.


Hi @chasenunez, we still care about reproducibility. However, it is a very broad topic, and I think we need a bit more exploration of which aspect of reproducibility you think is most crucial and should be addressed. I think we can capture this in a user story with an outline like the one below:

As a Renku user [or someone else]
I want to [what is the intent of the user, not the feature they will use]
So that I [how does the intent fit in the bigger picture, what is the end goal the user tries to achieve]

Can you come up with 2 or 3 statements like that related to your proposal?

Hi @tolevski — thanks for the steer. Happy to boil this down into agile user stories; I’ve drafted two concise ones below (pulled from the proposal above):

  1. As a Renku user I want a minimal manifest that records stable references to code (commit SHAs), data (DOIs/PIDs), and environments (conda files or container digests), so that I have a compact archival snapshot that is easy to store/share and robust to changes in infrastructure, promoting reproducibility and enabling FAIR archival practices (persistent identifiers and clear metadata for findability and access).

  2. As a Renku user I want an easy way to drag-and-drop that archival manifest into a Renku session and have it fetch the referenced repos at the specified commits, fetch datasets by DOI/PID, and restore the declared environment, so that I (or my collaborators, or my children’s children) can reconstruct the same code + data + environment without hunting for commits or dataset versions, ensuring analyses remain reproducible.

I have to say I do not find the idea very compelling as presented at the moment. This is not to dismiss concerns about long-term archival and reproducibility.

  1. There is no guarantee that the references are stored on platforms providing the same storage lifetime as where the manifest itself is stored. If the code from the manifest is hosted on a GitLab instance long gone, what good is it to create such a manifest?
    The obvious answer here is that all the components of the project are either already hosted on a long-term archival platform, or that the components are stored (together or in separate records) in the same platform. When looking around Zenodo, this seems to be a common practice where code is submitted as a separate record, linked to the relevant dataset or article record.
    In light of this observation, it then becomes much less obvious that there is a need for a “Renku manifest”, given that the components should be findable and retrievable. What seems more important is to have proper documentation (i.e. a README, plus further docs) that helps re-create the environment (inside or outside of Renku) as close a match as possible to the original work.

  2. While we do not yet have a “Renku export” / “Renku import” mechanism to transfer Renku projects between Renku instances, we have also not yet seen the need for such a tool at scale.
    Providing such tooling would require keeping project-related data in a format that can survive changes to Renku’s API, which is evolving rapidly with our pace of feature development. This makes providing an export/import tool challenging at the moment.

I do not want you to see this as a deterrent to exploring the idea, but I would encourage you to explore further what practical use the manifest would have.

In fact, you can already experiment with fetching data from the Renku API and loading data into it.

Re-reading my first point, it becomes clear that providing a way to export a Renku project into a record (or a set of records) which can be hosted on Zenodo and then re-imported into a fully-functioning project seems like a more complete story.

@chasenunez I don’t disagree with @leafty. Making a Renku manifest like you describe (that works reliably against an evolving Renku API) is not an easy task. So I am trying to drill down to the user stories to see what the concerns are and whether there is some alternative or compromise that can keep us all happy. I will post some comments on the two user stories you posted. Those are a great start. Thank you for writing them.

@chasenunez here are my comments. My notes are basically trying to ask the question “can we achieve the same goals you are trying to achieve without assuming we will develop a Renku manifest?” It does not mean that we won’t, but let’s fully explore the motivation and problem first without focusing on a specific solution.

Story #1

As a Renku user I want a minimal manifest that records stable references to code (commit SHAs), data (DOIs/PIDs), and environments (conda files or container digests), so that I have a compact archival snapshot that is easy to store/share and robust to changes in infrastructure, promoting reproducibility and enabling FAIR archival practices (persistent identifiers and clear metadata for findability and access).

Can we rephrase this as:

As a Renku user
I want project snapshots (as described in story #2) to be easy to store and share
So that I can share them with collaborators or add them to research archives (e.g., Zenodo)

Story #2

As a Renku user I want an easy way to drag-and-drop that archival manifest into a Renku session and have it fetch the referenced repos at the specified commits, fetch datasets by DOI/PID, and restore the declared environment, so that I (or my collaborators, or my children’s children) can reconstruct the same code + data + environment without hunting for commits or dataset versions, ensuring analyses remain reproducible.

I think if we remove the specific feature here (drag and drop a manifest) and focus on the outcomes (reproduce the state of the project), we get the following:

As a Renku user
I want to create a snapshot of my project (including the code, data and environments)
So that I (or other users) can reproduce the exact state of my project at a specific point in time

What do you think @chasenunez? If we agree to ignore potential solution(s) do my edits capture the intent and goals of the users in the stories? Please keep editing them or add more stories if I missed something.

Thank you all for the thoughtful engagement — I really appreciate the careful reading and the practical concerns raised.

Thanks, @Tolevski. That rephrasing is useful — I agree it helps focus on outcomes rather than prematurely locking us into a single UI pattern. Being solution-agnostic is the best way to innovate, but I think in this case concretely agreeing on what must be captured (and, importantly, what is not included) is necessary to differentiate it from currently available options (as @leafty hints at).

To illustrate this, I’ll offer a gentle counterpoint: for researchers who aim to produce reproducible scripts and analyses, the compute environment isn’t an optional ornament — it’s part of the research object. When the environment is only described in prose or scattered across places with no machine-readable glue, reproducing results too often becomes an act of educated guessing rather than a straightforward, verifiable step. That’s the real practical problem I’m trying to name.

The proposal here is not meant to supplant archives like Zenodo or to replace documentation. Rather, it’s intended as a small, machine-readable bridge: a compact way to point at the exact code commit, the PID’d dataset, and the container or environment recipe so tooling can rehydrate a project reliably. Think of it as a recipe card that complements the pantry of preserved ingredients: it doesn’t store the ingredients, it tells you how to combine them so the dish comes out the same every time. The drag-and-drop functionality is just the logical extension of this idea.

I appreciate the concerns about implementation, API stability, and scope. Those are important and valid. My hope, though, is that we can separate two questions:

  1. Is this capability useful and, if so, for whom?
  2. How do we implement it in a way that’s minimally invasive and sustainable?

My original note was about the first question; I’d be glad to address implementation details in response to concrete follow-ups, but I don’t want those operational worries to make us lose sight of whether the capability itself is worth having.

I don’t mean to propose a large overhaul or to understate how complex a platform like Renku is. My intent was simply to start a community conversation about a small, focused feature that would materially advance Renku’s stated goal of enabling reproducible research: a compact, machine-actionable way to record and reference compute state alongside code and data. If the project isn’t inclined to move in this direction, that’s also valid and informative.

Thanks again for the pushback and the questions — they sharpen the problem and help find the most productive way forward.