Feedback on offline deployment with custom root CA

Hi there!

We deployed Renku on our own premises. The situation is a bit unusual: our infrastructure is offline, and all internal certificates are signed by a custom root Certificate Authority (root CA), since Let's Encrypt is not an option for offline services. This feedback is a quick overview of our deployment strategy, with deeper insight into a few interesting points.

:warning: Do not take this as a detailed tutorial on how to deploy Renku in an air-gapped environment with a custom root CA. This post lacks many important details, and writing a full tutorial would take much more than a forum post.

Our environment
As mentioned, our infrastructure is close to offline. Inbound connections are forbidden (making Let's Encrypt unusable), and outbound connections are very restricted (making access to external resources impossible).
This situation caused two major problems:

  • every Renku component that accesses an internal service must be modified to trust our own root CA
  • all resources have to be stored internally before being used (helm charts, Docker images, template repositories, …)

Additionally, since Kaniko was already available on our Kubernetes cluster, we wanted Renku to use it rather than requiring access to a dedicated host for building Docker images.

The resources already available were:

To keep a high level of reproducibility and to keep track of the heavy patching of the deployment, we decided to patch and deploy Renku from the CI/CD pipeline of a dedicated renku-deployment GitLab repository.

Getting all the resources internally
Since most outbound connections are forbidden, we had to download all required resources first. A quick checklist:
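
For illustration, the mirroring itself boils down to commands along these lines (URLs and versions are placeholders; the image is just one example among the many Renku images to copy):

# run on a machine with internet access, then move the artifacts into the offline network
helm repo add renku <RENKU_HELM_REPO_URL>
helm pull renku/renku --version <CHART_VERSION>
docker pull renku/webhook-service:1.13.2
docker save renku/webhook-service:1.13.2 -o webhook-service.tar   # load/push into the internal registry
git clone --mirror <TEMPLATE_REPO_URL>                             # project template repositories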

The certificate hell
Now, if you try to helm install the downloaded chart, it may deploy properly, but nothing else is going to work. Looking at the logs of the different pods, you’ll notice a metric ton of

SunCertPathBuilderException
Certificate not trusted
Impossible to connect to <something>

and other messages containing every possible combination of “certificate” and “invalid”. This basically means that our custom root CA is not imported into the various containers, so any HTTPS connection to an internal service (mostly GitLab) fails. Depending on how the connection is initiated, the fix varies. I identified four different cases:

  • Python (through the http or requests libraries)
  • git
  • OS commands
  • Java/Scala

Python case
The fix for Python was fairly easy. By setting the environment variable REQUESTS_CA_BUNDLE, we can tell the requests library to use a custom root CA bundle. The fix was therefore to kustomize the helm chart to mount the root CA in the container and add the right env var to every container that uses Python for HTTPS requests. The patch itself looks something like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: <DEPLOYMENT_NAME>
  namespace: renku
spec:
  template:
    spec:
      containers:
      - name: <CONTAINER_NAME>
        env:
        - name: REQUESTS_CA_BUNDLE
          value: /etc/ssl/certs/ca-cert.pem
        volumeMounts:
        - name: ca-pemstore
          mountPath: /etc/ssl/certs/ca-cert.pem
          subPath: ca-cert.pem
          readOnly: true
      volumes:
      - name: ca-pemstore
        configMap:
          name: ca-pemstore
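
For completeness, the ca-pemstore ConfigMap referenced above has to exist in the namespace, and the patch has to be wired into kustomize. A minimal kustomization.yaml sketch (file names are illustrative, not the exact ones from our repository):

namespace: renku
resources:
- deployment.yaml                # the rendered chart output (see the Jobs section below)
generatorOptions:
  disableNameSuffixHash: true    # keep the plain name "ca-pemstore" used in the patches
configMapGenerator:
- name: ca-pemstore
  files:
  - ca-cert.pem=GC-rootCA.crt    # the custom root CA
patchesStrategicMerge:
- python-ca-patch.yaml           # the deployment patch shown above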

Git case
The fix is quite similar. Git does not use an env var to set which root CA you want to use, but a config file. The patch therefore mounts a file with the right config.
The config file is defined in k8s as a configMap with the following content:

gitconfig: |-
  [filter "lfs"]
          clean = git-lfs clean -- %f
          smudge = git-lfs smudge -- %f
          process = git-lfs filter-process
          required = true
  [http]
          sslCAInfo = /etc/ssl/certs/ca-cert.pem
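
Wrapped in a full manifest (with the name and namespace expected by the patch below), the ConfigMap looks roughly like this; only the [http] section matters for the CA:

apiVersion: v1
kind: ConfigMap
metadata:
  name: gitconfig
  namespace: renku
data:
  gitconfig: |-
    [http]
            sslCAInfo = /etc/ssl/certs/ca-cert.pem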

Patching the containers to include the config file is then just a matter of mounting the two ConfigMaps (the config file itself and the certificate) in the right places:

spec:
  template:
    spec:
      containers:
      - name: <CONT_NAME>
        volumeMounts:
        - name: ca-pemstore
          mountPath: /etc/ssl/certs/ca-cert.pem
          subPath: ca-cert.pem
          readOnly: true
        - name: gitconfig
          mountPath: /root/.gitconfig
          subPath: gitconfig
          readOnly: true
      volumes:
      - name: ca-pemstore
        configMap:
          name: ca-pemstore
      - name: gitconfig
        configMap:
          name: gitconfig

Easy, right?

OS commands
Most OS-level tools (curl, wget, openssl, …) can take a custom root CA through an env var named SSL_CERT_FILE. Great! The fix for these tools is therefore very similar to the one we used for Python.
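
The patch differs from the Python one only in the variable name; the relevant fragment (container name is a placeholder, as before):

      containers:
      - name: <CONTAINER_NAME>
        env:
        - name: SSL_CERT_FILE
          value: /etc/ssl/certs/ca-cert.pem
        # volumeMounts and the ca-pemstore volume are identical to the Python patch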

Java/Scala
Now we touch a sensitive part. First problem: I’m not very familiar with Java. Second problem: Java does not seem to have an easy setting for a custom root CA; instead, it relies on a certificate management tool named keytool.

Unfortunately, this means we cannot add a custom root CA just by changing the environment and a few files; we need to actually run a command. That makes a huge difference, because we cannot simply modify the k8s deployment file to make it work.
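
For reference, importing a certificate into the JVM trust store with keytool typically looks like the command below (alias and paths are illustrative; the trust-store location depends on the Java version, and "changeit" is the default password):

keytool -importcert -trustcacerts -noprompt \
        -alias gc-root-ca \
        -file /usr/local/share/ca-certificates/gc-ca.crt \
        -keystore "$JAVA_HOME/lib/security/cacerts" \
        -storepass changeit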

The solution we came up with was to modify the Docker image. The first attempt was to write a new Dockerfile that runs the keytool command. That attempt failed miserably, and I still don’t know why. The other option was to properly add the root CA “container-wide” and hope that Java picks it up. Here is the Dockerfile for building a new image of webhook-service:

FROM renku/webhook-service:1.13.2

USER root
ADD GC-rootCA.crt /usr/local/share/ca-certificates/gc-ca.crt
RUN chmod 644 /usr/local/share/ca-certificates/gc-ca.crt && update-ca-certificates
USER ${NB_USER}

This image is then built, and we modify the helm chart to use it instead of renku/webhook-service:1.13.2. This is easily done with yet another kustomize patch.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: renku-webhook-service
  namespace: renku
spec:
  template:
    spec:
      containers:
      - name: webhook-service
        image: <WEBHOOK_IMAGE>
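
Side note: an alternative we did not use is kustomize’s images transformer, which rewrites image references without a strategic-merge patch per deployment (the internal registry name and tag are placeholders):

images:
- name: renku/webhook-service
  newName: <INTERNAL_REGISTRY>/renku/webhook-service
  newTag: 1.13.2-custom-ca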

A similar patch is applied to the other containers running Java/Scala. But then we got blessed by this merge request. Since this PR, we can simply provide the custom root CA as part of the values.yaml file, as follows:

global:
  clientCertificate:
    value: |-
      -----BEGIN CERTIFICATE-----
      <whole cert>
      -----END CERTIFICATE-----

Much cleaner! But the previous trick is still needed for the following deployments, which use containers not written by SDSC:

  • hub
  • renku-keycloak

Jobs case
The end? Not yet. If we re-deploy with all these patches, we notice that the jobs still fail with a crappy certificate problem. They should have been patched, though… It turns out they are not, because of this helm issue: kustomize was executed via the --post-render argument of helm install, but the rendered template is not complete and is missing the jobs. We modified the process to work around the issue (sketched after the list):

  • Write the output of helm template to a deployment.yaml
  • Apply all patches to deployment.yaml
  • kubectl apply -f deployment.yaml
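
A sketch of these three steps, assuming the kustomization.yaml shown earlier lists deployment.yaml under resources (chart reference, release name, and file names are illustrative):

helm template renku <RENKU_CHART> --namespace renku --values values.yaml > deployment.yaml
kubectl kustomize . > patched.yaml   # applies all the CA patches, jobs included
kubectl apply -f patched.yaml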

And now it works.

Kaniko
Now that we are out of the certificate hell, we can take care of a much more fun topic. Instead of building the Docker images of Renku projects in a VM, we want to use Kaniko. A simple modification of the CI/CD file in each project template to update the image_build step is enough. The modifications can be summarized as:

  • changing the docker image to use
  • updating the script to launch kaniko
  • and of course, adding our root CA everywhere

After modification, the interesting part of the .gitlab-ci.yml looks like:

image_build:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  before_script:
    - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
    - |
      echo "-----BEGIN CERTIFICATE-----
     \<root CA>
      -----END CERTIFICATE-----" >> /kaniko/ssl/certs/additional-ca-cert-bundle.crt

  script:
    - CI_COMMIT_SHA_7=$(echo $CI_COMMIT_SHA | cut -c1-7)
    - echo $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA_7
    - /kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/Dockerfile --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA_7
  tags:
    - image-build

Other
I could mention many other things, but you can most likely find the solutions elsewhere.

  • Add the root CA to the Docker images in Renku projects
  • Use our own GitLab as identity provider
  • Use custom repositories as project templates
  • Fix permissions and ownership on NFS volumes
  • Make sure that previously defined secrets are not overwritten by a new deployment
  • And probably many other things

Thanks
A big thanks to @pameladelgado, who helped me a lot throughout the deployment process.


Great post @ahungler, and thanks for the effort you invested in this deployment as well.
We will soon create issues for the things we can improve on our side to make an air-gapped deployment go a bit more smoothly. I will post updates here so that you can follow along.

This is great @ahungler, thanks very much for taking the time to write it up! I don’t want to derail the topic, but I’m curious what your experience with Kaniko has been?

Kaniko works great in our setup. All Docker image building for our Renku deployment is actually done with Kaniko. Once you figure out that you need to write the pipeline credentials to /kaniko/.docker/config.json, everything goes well.

The .gitlab-ci.yml I posted in my first message really includes everything useful to make it work anywhere.

Thanks @ahungler, that’s good to know. I’ve been playing with it in a Kubernetes GitLab runner on and off for a few years, but it always seemed to me like it was still a bit buggy and prone to strange failures. Maybe we’ll try again more seriously.