Feedback on offline deployment with custom root CA

Hi there!

We deployed Renku on our own premises. The situation is a bit unusual: our infrastructure is offline, and all internal certificates are signed by a custom root Certificate Authority (root CA), as LetsEncrypt is not an option for offline services. This feedback is a quick overview of our deployment strategy, with some deeper insight into a few interesting points.

:warning: Do not take this as a detailed tutorial on how to deploy Renku in an air-gapped environment with a custom root CA. This post lacks many important details, and writing a full tutorial would take much more than a Discourse post.

Our environment
As mentioned, our infrastructure is close to offline. Inbound connections are forbidden (making LetsEncrypt unusable), and outbound connections are very restricted (making access to external resources impossible).
This situation caused two major problems:

  • all Renku components accessing an internal service must be modified to trust our own root CA
  • all resources have to be stored internally before being used (Helm charts, Docker images, template repos, …)

Additionally, as kaniko was already installed on our Kubernetes cluster, we wanted Renku to use it for building Docker images instead of requiring access to a dedicated host.

The resources already available were:

To keep a high level of reproducibility and keep track of the heavy patching of the deployment, we decided to patch and deploy Renku from the CI/CD of a dedicated renku-deployment GitLab repository.

Getting all the resources internally
Since most outbound connections are forbidden, we had to download all required resources first. A quick checklist:

The certificate hell
Now, if you try to helm install the downloaded chart, it may deploy properly, but nothing is going to work. Looking at the logs of the different pods, you’ll notice a metric ton of

SunCertPathBuilderException
Certificate not trusted
Impossible to connect to <something>

and other messages containing some combination of “certificate” and “invalid”. This basically means that our custom root CA is not imported in the various containers, and any HTTPS connection to a service (mostly GitLab) will fail. Depending on the way the connection is initiated, the fix will vary. I identified 4 different cases:

  • Python (through the http or requests library)
  • git
  • OS command
  • Java/Scala

Python case
The fix for Python was fairly easy. The requests library reads the environment variable REQUESTS_CA_BUNDLE, so setting it tells Python code built on requests which CA bundle to trust. The fix was then to kustomize the helm chart to mount the root CA in the container and add this env var to each container using Python for HTTPS requests. The patch itself looks something like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: <DEPLOYMENT_NAME>
  namespace: renku
spec:
  template:
    spec:
      containers:
      - name: <CONTAINER_NAME>
        env:
        - name: REQUESTS_CA_BUNDLE
          value: /etc/ssl/certs/ca-cert.pem
        volumeMounts:
        - name: ca-pemstore
          mountPath: /etc/ssl/certs/ca-cert.pem
          subPath: ca-cert.pem
          readOnly: true
      volumes:
      - name: ca-pemstore
        configMap:
          name: ca-pemstore
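
The patch above assumes that a ConfigMap named ca-pemstore, with a ca-cert.pem key, already exists in the renku namespace. A minimal sketch of one way to create it, assuming the root CA is available as a local PEM file (the function name and the file path in the usage line are hypothetical, not part of our actual deployment):

```shell
#!/usr/bin/env bash
set -eu

# Create or update the ConfigMap that holds the root CA.
# $1: path to the PEM-encoded root CA (hypothetical local file)
# $2: target namespace (defaults to "renku", as in the patch above)
create_ca_configmap() {
  local ca_file="$1" namespace="${2:-renku}"

  # Render the ConfigMap client-side, then apply it, so the same
  # command can be re-run when the certificate is renewed.
  kubectl create configmap ca-pemstore \
    --namespace "$namespace" \
    --from-file=ca-cert.pem="$ca_file" \
    --dry-run=client -o yaml |
    kubectl apply -f -
}
```

It would be called as `create_ca_configmap ./root-ca.pem`. The key name (ca-cert.pem) has to match the subPath used in the volumeMounts.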

Git case
The fix is quite similar. Git does not use an env var for setting which root CA you want to use, but a config file. The patch therefore mounts a file with the right config.
The config file is defined in k8s as a configMap with the following content:

gitconfig: |-
  [filter "lfs"]
          clean = git-lfs clean -- %f
          smudge = git-lfs smudge -- %f
          process = git-lfs filter-process
          required = true
  [http]
          sslCAInfo = /etc/ssl/certs/ca-cert.pem

Patching the containers to include the config file then only requires mounting the 2 configMaps (the config file itself, and the certificate) in the right places:

spec:
  template:
    spec:
      containers:
      - name: <CONT_NAME>
        volumeMounts:
        - name: ca-pemstore
          mountPath: /etc/ssl/certs/ca-cert.pem
          subPath: ca-cert.pem
          readOnly: true
        - name: gitconfig
          mountPath: /root/.gitconfig
          subPath: gitconfig
          readOnly: true
      volumes:
      - name: ca-pemstore
        configMap:
          name: ca-pemstore
      - name: gitconfig
        configMap:
          name: gitconfig

Easy, right?
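
Before loading the file into the ConfigMap, it can be worth checking that git parses it the way we expect. A small sketch, assuming the gitconfig content is saved in a local file (the function name and the file name are hypothetical):

```shell
#!/usr/bin/env bash
set -eu

# Print the CA bundle path that git reads from a given config file.
# $1: path to a local copy of the gitconfig content (hypothetical)
gitconfig_ca_path() {
  git config --file "$1" --get http.sslCAInfo
}
```

Running `gitconfig_ca_path ./gitconfig` should print the same path as the mountPath of the certificate in the patch above.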

OS commands
Most of the OS commands (curl, wget, openssl, …) can take a custom root CA through an env var named SSL_CERT_FILE. Great! The fix for these tools is very similar to the one we used for Python.

Java/Scala
Now, we touch a sensitive part. First problem, I’m not very familiar with Java. Second problem, Java does not seem to have an easy setting for custom root CA, but instead, a Certificate Management Tool named keytool.

Unfortunately, it means we cannot add a custom root CA by changing the environment and a few files, but we need to actually run a command. And it makes a huge difference, because we cannot simply modify the k8s deployment file for this to work.

The solution we came up with was to modify the Docker image. Our first attempt was to write a new Dockerfile that executes the keytool command. This attempt failed miserably, and I still don’t know why. The other solution was to properly add the root CA “container-wide” and hope that Java picks it up. Here is the Dockerfile for building a new image of webhook-service:

FROM renku/webhook-service:1.13.2

USER root
ADD GC-rootCA.crt /usr/local/share/ca-certificates/gc-ca.crt
RUN chmod 644 /usr/local/share/ca-certificates/gc-ca.crt && update-ca-certificates
USER ${NB_USER}

This image is then built, and we need to modify the helm chart to use this new image instead of renku/webhook-service:1.13.2. It’s easily done with yet another patch for kustomize.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: renku-webhook-service
  namespace: renku
spec:
  template:
    spec:
      containers:
      - name: webhook-service
        image: <WEBHOOK_IMAGE>

A similar patch is applied to other containers running Java/Scala. But then, we got blessed by this merge request. With this PR, we can simply give the custom root CA as part of the values.yaml file, as follows:

global:
  clientCertificate:
    value: |-
      -----BEGIN CERTIFICATE-----
      <whole cert>
      -----END CERTIFICATE-----

Much cleaner! But the previous trick is still needed for the following deployments which use containers not written by SDSC:

  • hub
  • renku-keycloak

Jobs case
The end? Not yet. If we re-deploy with all these patches, we notice that the jobs still fail with a crappy certificate problem. They should have been patched, though… They actually are not, because of this helm issue: kustomize was executed via the --post-renderer argument of helm install, but the rendered template passed to it is not complete and misses the jobs. We modified the process to work around the issue:

  • Write the output of helm template to a deployment.yaml
  • Apply all patches to deployment.yaml
  • kubectl apply -f deployment.yaml

And now it works.
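
The three steps can be sketched roughly as follows. This is an illustration under assumptions, not our actual CI script: the release name, chart path, values file and the layout of the kustomize directory are all hypothetical, and the kustomization.yaml in that directory is assumed to list rendered.yaml under resources together with the patch files.

```shell
#!/usr/bin/env bash
set -eu

# Render -> patch -> apply, as a workaround for the post-renderer issue.
# $1: release name, $2: chart path, $3: values file,
# $4: directory containing kustomization.yaml and the patches
render_patch_apply() {
  local release="$1" chart="$2" values="$3" kustomize_dir="$4"

  # 1. Render the whole chart to a file. helm template includes
  #    hooks by default, so the jobs are part of the output.
  helm template "$release" "$chart" --namespace renku -f "$values" \
    > "$kustomize_dir/rendered.yaml"

  # 2. Apply all patches to the rendered manifests.
  kubectl kustomize "$kustomize_dir" > deployment.yaml

  # 3. Deploy the patched manifests.
  kubectl apply -f deployment.yaml
}
```

A call would look like `render_patch_apply renku ./renku values.yaml ./kustomize`. One thing to keep in mind with this approach: kubectl does not interpret helm hook annotations, so hook resources are applied like any other manifest.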

Kaniko
Now that we are out of the certificate hell, we can take care of a much more fun topic. Instead of building the Docker images of the Renku projects in a VM, we want to use Kaniko. A simple modification of the CI/CD file in each project template to update the image_build step is enough. The modifications can be summarized as:

  • changing the docker image to use
  • updating the script to launch kaniko
  • and of course, adding our root CA everywhere

After modification, the interesting part of the .gitlab-ci.yml looks like:

image_build:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  before_script:
    - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
    - |
      echo "-----BEGIN CERTIFICATE-----
      <root CA>
      -----END CERTIFICATE-----" >> /kaniko/ssl/certs/additional-ca-cert-bundle.crt

  script:
    - CI_COMMIT_SHA_7=$(echo $CI_COMMIT_SHA | cut -c1-7)
    - echo $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA_7
    - /kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/Dockerfile --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA_7
  tags:
    - image-build

Other
I could mention many other things, but you can most likely find the solutions elsewhere.

  • Add the root CA to the Docker images in Renku projects
  • Use our own GitLab as identity provider
  • Use custom repositories as project templates
  • Fix permissions and ownership on NFS volumes
  • Make sure that the previously defined secrets are not overwritten with a new deployment
  • And probably many other things

Thanks
A big thanks to @pameladelgado who helped me a lot through the deployment process

Great post, thanks @ahungler, and thanks as well for the effort you invested in this deployment.
We will soon open issues on things we can improve on our side to make an air-gapped deployment go a bit more smoothly. I will post updates here so that you can follow up.

This is great @ahungler, thanks very much for taking the time to write it up! I don’t want to derail the topic, but I’m curious what your experience with Kaniko has been?

Kaniko works great in our setup. All docker image building for Renku deployment is actually done with Kaniko. Once you figure out that you need to write the pipeline credentials in /kaniko/.docker/config.json, everything goes well.

The .gitlab-ci.yml I posted in my first message really includes all useful content to make it work anywhere.

Thanks @ahungler that’s good to know - I’ve been playing with it in a kubernetes gitlab runner on and off for a few years but it always seemed to me like it was still a bit buggy and prone to strange failures. Maybe we’ll try again more seriously.

Thanks a lot @ahungler for this useful writeup. I’m troubleshooting a similar setup (custom CA) and there seems to be a problem with the communication between traefik (the gateway component) and gitlab. Out of the 4 different cases mentioned in your post

  • Python (through http or request library)
  • git
  • OS command
  • Java/Scala

do you remember which one applies to traefik? Does it pick up the SSL_CERT_FILE environment variable?

@andreas we applied the Python case to gateway and gateway-auth pods

Thanks @pameladelgado for answering this question. I didn’t remember having seen “traefik” anywhere during the deployment.
I do confirm, though, that the only modification we do to gateway and gateway-auth is to define an environment variable REQUESTS_CA_BUNDLE, and of course mount the ConfigMap containing the certificate.

Note that there is now an epic in which we will try to persist the modifications described here at the chart level and thus make such deployments easier to create and maintain: Enable propagation of custom root CA certificate through chart · Issue #2281 · SwissDataScienceCenter/renku · GitHub

For several Renku versions now, we have had initContainers in our helm chart that inject custom CA certs into every service that needs them.

The certificates that should be trusted should be saved in k8s secrets separately. Then the secrets are listed in the renku values file at global.certificates.customCAs. That is all that is required - no more patching or manual edits.
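
As a rough sketch of what this could look like: the secret name custom-ca, the key name and the function names below are hypothetical, and the shape of the list entries (a `secret:` key, as in the official GitLab chart) is an assumption that should be checked against the values file of your chart version.

```shell
#!/usr/bin/env bash
set -eu

# Store the root CA in a Kubernetes secret.
# $1: path to the PEM-encoded root CA (hypothetical local file)
# $2: target namespace (defaults to "renku")
create_ca_secret() {
  local ca_file="$1" namespace="${2:-renku}"
  kubectl create secret generic custom-ca \
    --namespace "$namespace" \
    --from-file=ca-cert.pem="$ca_file"
}

# Print a values fragment that lists the secret under
# global.certificates.customCAs (entry shape is an assumption).
print_ca_values() {
  cat <<'EOF'
global:
  certificates:
    customCAs:
      - secret: custom-ca
EOF
}
```

For example, `create_ca_secret ./root-ca.pem` followed by `print_ca_values > ca-values.yaml`, with ca-values.yaml then passed to helm as an additional values file.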

Great news!
We are several releases late, but I’ll try to re-deploy an up-to-date instance when I find some free time. Many thanks for your work!

@ahungler one thing to note is that this functionality is not extended to the Gitlab that comes with Renku and can be optionally enabled. So if you use the Gitlab that comes bundled with the Renku Helm chart then this is not yet an option for you. Going forward we recommend that users deploy the official Gitlab helm chart separately from Renku and then connect the Renku deployment to this Gitlab deployment through the configuration in the Renku Helm chart.

If you are using the Gitlab Helm chart that comes pre-bundled with Renku then I can take a look at how I can extend this functionality there. As a matter of fact the way we inject the CA certificates in our Helm chart is borrowed from the official Gitlab Helm chart.

There is nothing wrong with the Gitlab chart that comes pre-bundled with Renku. It is just easier to manage the two things separately. And we are a bit slower when it comes to rolling out updates for our Gitlab Helm chart compared to the official Gitlab chart. The main reason we have our own version of the Gitlab Helm chart is that when Renku first came out there was no official Helm chart for Gitlab.

But reading your initial post it seems that you already had a separately deployed and already running Gitlab instance. So then you can definitely give this a try. Please let me know how it goes. I am happy to help with any issues you may stumble upon.

Indeed, we use a standalone gitlab installation.
I hope to get some calmer days during the summer. If it happens, I should deploy the updated instance during August.

The new instance (0.15.0) is not fully functional, but I believe there is no SSL issue left.
And for an additional “feel good” info, this is a summary of the PR for our internal deployment:

Showing 17 changed files with 137 additions and 680 deletions

@ahungler thanks for testing this out, and I am glad it is helping (as far as I can tell from your message).

I am happy to jump on a quick call if you need help with troubleshooting Renku deployment issues, SSL-related or otherwise. Let me know, and I can reach out in a direct message.