How to make a large calibration dataset available

Hello,
for a specific astrophysical analysis, I need to make a rather large calibration dataset available (3 or 7 GB, depending on the needs). It consists of many small files. These calibration files need to be available at a specific location in the file system inside the Docker container and are accessed very frequently, so good performance is required. So far, I have added them inside the Docker image, but that makes the image immense. What solution would you suggest?

Many thanks in advance.

Hi @carloferrigno, Renkulab offers the possibility to connect to cloud storage (at the moment, only S3 is supported, and other storage formats will be available soon). Check here for details. Does this solution fit your purpose?

No, that does not solve it, as I have no S3 storage and I need fast access to these files to run my software. I would need some fast storage available locally. Otherwise, I will pack the files into the container.

@carloferrigno it shouldn’t be too difficult to obtain an S3 bucket (or some other cloud storage, e.g. SWITCHdrive). You can set it up to copy the data into local storage when the container starts: if you put a script called post-init.sh into your project (assuming you are using an image based on one of the Renku base images), the commands inside it will be executed on start-up. For better performance, you can use the rclone command to do the copy. I can set up an example for you if you’re not sure how to do this.
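For illustration, here is a minimal sketch of such a start-up script. It assumes an rclone remote named "calib" has already been configured in an rclone.conf committed to the project; the remote name and target directory are placeholders, not the actual setup discussed below.

#!/bin/bash
# post-init.sh -- executed automatically at session start-up on Renku base images.
# Copy the calibration files from the (assumed) "calib" remote into local storage.
mkdir -p /home/jovyan/work/data/ccf
rclone --config ./rclone.conf copy calib: /home/jovyan/work/data/ccf -P --transfers 8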

Hi @rrrrrok, I have access to SWITCHdrive, thanks for your suggestion. This container will be used by a colleague for a course at Unige with 12 students. The files are on SWITCHdrive, and my image is based on renku/renkulab-py:3.10-0.22.0. In the repository Reproducible Data Science | Open Research | Renku there is a Dockerfile-build, which I use to build the image that I push to Docker Hub; the Dockerfile then fetches this image to speed up the session. Currently, the Docker image is big because I embed these calibration files, but I can easily remove them. They should end up in /opt/ccf. If you could send me an example of this post-init.sh, I would be grateful.


Hi @carloferrigno, I’ve put together a project that uses rclone to copy data from your SWITCHdrive folder on start-up. You can find it here: https://renkulab.io/projects/rok.roskar/rclone-sync-demo

The downside is that session start-up is delayed by ~1 minute due to the copy. As you can see in the post-init.sh script, it’s just a single command; you could also instruct users to run it when they need it instead of running it on start-up.

I made a symbolic link from /opt/ccf to /home/jovyan/work/data/ccf, because the /opt directory is provided by the overlay filesystem and should not be used for data. Keep that in mind: some scripts might complain about symbolic links.

I used rclone with a WebDAV config because it can parallelize the file transfer. If you want something simpler, you could also just use wget, but it’s not as fast.
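As an illustration of the simpler (but slower) wget route: ownCloud-style public shares such as SWITCHdrive can usually be downloaded as a single zip archive via the public-link download endpoint. The <share-token> below is a placeholder for the public-link token, and the target path matches the rclone setup above; treat this as a sketch under those assumptions rather than a tested recipe.

#!/bin/bash
# Hypothetical wget-based alternative to the rclone copy.
# <share-token> is a placeholder for the SWITCHdrive public-link token.
mkdir -p /home/jovyan/work/data/ccf
wget -q -O /tmp/ccf.zip "https://drive.switch.ch/index.php/s/<share-token>/download"
unzip -q /tmp/ccf.zip -d /home/jovyan/work/data/ccf
rm /tmp/ccf.zip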

Hope this helps!

This is the diff:

diff --git a/.renku/renku.ini b/.renku/renku.ini
index 1659a57..0bce25e 100644
--- a/.renku/renku.ini
+++ b/.renku/renku.ini
@@ -1,2 +1,9 @@
 [interactive]
 default_url = /lab
+disk_request = 50G
+
+[renku]
+autocommit_lfs = false
+lfs_threshold = 100kb
+check_datadir_files = true
+
diff --git a/Dockerfile b/Dockerfile
index a485fcb..6117695 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -41,6 +41,10 @@ FROM renku/renkulab-py:3.10-0.18.1
 #    vim
 # USER ${NB_USER}
 
+USER root
+RUN ln -s ${HOME}/work/data/ccf /opt/ccf && chown -R 1000:100 /opt/ccf
+USER ${NB_USER}
+
 # install the python dependencies
 COPY requirements.txt environment.yml /tmp/
 RUN mamba env update -q -f /tmp/environment.yml && \
diff --git a/post-init.sh b/post-init.sh
new file mode 100644
index 0000000..8fad2a8
--- /dev/null
+++ b/post-init.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+
+# copy over the data
+rclone --config ./rclone.conf copy data: /home/jovyan/work/data/ccf -P --transfers 8
diff --git a/rclone.conf b/rclone.conf
new file mode 100644
index 0000000..db7c6fe
--- /dev/null
+++ b/rclone.conf
@@ -0,0 +1,6 @@
+[data]
+type = webdav
+url = https://drive.switch.ch/public.php/webdav
+vendor = owncloud
+user = O8CE613GjhKMhtU
+