Hi there!
We deployed Renku on our own premises. The situation is a bit strange. Our infrastructure is offline, and all internal certificates are signed by a custom root Certificate Authority (root CA), as LetsEncrypt is not an option for offline services. This feedback is a quick overview of our deployment strategy, with some deeper insight into a few interesting points.
Do not take this as a detailed tutorial on how to deploy Renku in an air-gapped environment with a custom root CA. This post lacks many important details, and writing a full tutorial would take much more than a Discord post.
Our environment
As mentioned, our infrastructure is close to offline. Inbound connections are forbidden (making LetsEncrypt unusable), and outbound connections are very restricted (making access to external resources impossible).
This situation caused two major problems:
- all Renku components accessing an internal service must be modified to trust our own root CA
- all resources have to be stored internally before being used (helm charts, docker images, template repos, …)
Additionally, as Kaniko was already installed on our kubernetes cluster, we wanted Renku to use it instead of relying on access to a Docker host for building images.
The resources already available were:
- A GitLab instance (with a certificate signed by our own root CA)
- A kubernetes cluster
- A virtual machine with kubectl, helm3, and a gitlab-runner
- Kaniko on k8s
To keep a high level of reproducibility and keep track of the heavy patching of the deployment, we decided to patch and deploy Renku from the CI/CD of a dedicated renku-deployment GitLab repository.
Getting all the resources internally
Since most outbound connections are forbidden, we had to download all required resources and store them internally first. A quick checklist (a rough mirroring sketch follows the list):
- Helm chart (currently version 0.7.0-82ccfc7, not available in the repo)
- Docker images
- Renku project templates
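To give an idea of what the mirroring looks like in practice, here is a rough sketch of a CI job copying one image into an internal registry. The image name, registry URL, and job layout are illustrative only, not our actual pipeline; the same idea applies to charts and template repositories.
mirror-renku-core:
  stage: mirror
  script:
    # pull through the restricted egress, then push to the internal registry
    - docker pull renku/renku-core:<TAG>
    - docker tag renku/renku-core:<TAG> registry.example.internal/renku/renku-core:<TAG>
    - docker push registry.example.internal/renku/renku-core:<TAG>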
The certificate hell
Now, if you try to helm install the downloaded chart, it may deploy properly, but nothing else is going to work. Looking at the logs of the different pods, you'll notice a metric ton of
- SunCertPathBuilderException
- Certificate not trusted
- Impossible to connect to <something>
and other messages combining "certificate" and "invalid" in every possible way. This basically means that our custom root CA is not imported in the various containers, and any HTTPS connection to an internal service (mostly GitLab) will fail. Depending on how the connection is initiated, the fix varies. I identified 4 different cases:
- Python (through the http or requests library)
- git
- OS commands
- Java/Scala
Python case
The fix for Python was fairly easy. By setting the environment variable REQUESTS_CA_BUNDLE, we can tell the requests library to use our custom root CA bundle. The fix was then to kustomize the helm chart to mount the root CA into the container and add the right env var to every container using Python for HTTPS requests. The patch itself looks something like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <DEPLOYMENT_NAME>
  namespace: renku
spec:
  template:
    spec:
      containers:
        - name: <CONTAINER_NAME>
          env:
            - name: REQUESTS_CA_BUNDLE
              value: /etc/ssl/certs/ca-cert.pem
          volumeMounts:
            - name: ca-pemstore
              mountPath: /etc/ssl/certs/ca-cert.pem
              subPath: ca-cert.pem
              readOnly: true
      volumes:
        - name: ca-pemstore
          configMap:
            name: ca-pemstore
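The ca-pemstore configMap referenced by this patch simply wraps the root CA itself. We don't reproduce our exact manifest here, but it is essentially equivalent to the following (certificate body elided):
apiVersion: v1
kind: ConfigMap
metadata:
  name: ca-pemstore
  namespace: renku
data:
  # the key must match the subPath used in the volume mounts
  ca-cert.pem: |-
    -----BEGIN CERTIFICATE-----
    <root CA>
    -----END CERTIFICATE-----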
Git case
The fix is quite similar. Here, git is pointed at the root CA through a config file rather than an env var. The patch therefore mounts a file with the right config.
The config file is defined in k8s as a configMap with the following content:
gitconfig: |-
  [filter "lfs"]
    clean = git-lfs clean -- %f
    smudge = git-lfs smudge -- %f
    process = git-lfs filter-process
    required = true
  [http]
    sslCAInfo = /etc/ssl/certs/ca-cert.pem
Patching the containers to include the config file then boils down to mounting the two configMaps (the config file itself, and the certificate) in the right places:
spec:
  template:
    spec:
      containers:
        - name: <CONT_NAME>
          volumeMounts:
            - name: ca-pemstore
              mountPath: /etc/ssl/certs/ca-cert.pem
              subPath: ca-cert.pem
              readOnly: true
            - name: gitconfig
              mountPath: /root/.gitconfig
              subPath: gitconfig
              readOnly: true
      volumes:
        - name: ca-pemstore
          configMap:
            name: ca-pemstore
        - name: gitconfig
          configMap:
            name: gitconfig
Easy, right?
OS commands
Most of the OS commands (curl, wget, openssl, …) can take a custom root CA through an env var named SSL_CERT_FILE. Great! The fix for these tools is very similar to the one we used for Python.
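For completeness, the env part of such a patch looks like this; the volume mount of the ca-pemstore configMap is identical to the Python case:
env:
  - name: SSL_CERT_FILE
    value: /etc/ssl/certs/ca-cert.pem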
Java/Scala
Now, we touch a sensitive part. First problem, I'm not very familiar with Java. Second problem, Java does not seem to have an easy setting for a custom root CA; instead, it comes with a certificate management tool named keytool.
Unfortunately, this means we cannot add a custom root CA just by changing the environment and a few files: we need to actually run a command. And that makes a huge difference, because we cannot simply patch the k8s deployment file to make it work.
The solution we came up with was to modify the Docker image. The first idea was to write a new Dockerfile executing the keytool command. This trial failed miserably, and I still don't know why. The other solution was to properly add the root CA "container-wide" and hope that Java picks it up. Here is the Dockerfile for building a new image of webhook-service:
FROM renku/webhook-service:1.13.2
USER root
ADD GC-rootCA.crt /usr/local/share/ca-certificates/gc-ca.crt
RUN chmod 644 /usr/local/share/ca-certificates/gc-ca.crt && update-ca-certificates
USER ${NB_USER}
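How this image gets built and pushed is not Renku-specific; as a purely illustrative sketch (the registry URL and tag are made up, and any docker-capable host will do), it boils down to:
build-webhook-image:
  stage: build
  script:
    # build the patched image and push it to the internal registry
    - docker build -t registry.example.internal/renku/webhook-service:1.13.2-ca .
    - docker push registry.example.internal/renku/webhook-service:1.13.2-ca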
Once the image is built and pushed, we need to modify the helm chart to use it instead of renku/webhook-service:1.13.2. It's easily done with yet another patch for kustomize.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: renku-webhook-service
  namespace: renku
spec:
  template:
    spec:
      containers:
        - name: webhook-service
          image: <WEBHOOK_IMAGE>
A similar patch is applied to the other containers running Java/Scala. But then, we got blessed by this merge request. Since this PR, we can simply pass the custom root CA as part of the values.yaml file, as follows:
global:
  clientCertificate:
    value: |-
      -----BEGIN CERTIFICATE-----
      <whole cert>
      -----END CERTIFICATE-----
Much cleaner! But the previous trick is still needed for the following deployments, which use containers not written by SDSC:
- hub
- renku-keycloak
Jobs case
The end? Not yet. If we re-deploy with all these patches, we notice that the jobs still fail with a crappy certificate problem. They should have been patched, though… It's actually not the case because of this helm issue: kustomize was executed via the --post-render argument of helm install, but the rendered template is not complete and misses the jobs. We modified the process to work around the issue (a kustomize sketch follows the list):
- Write the output of helm template to a deployment.yaml
- Apply all patches to deployment.yaml
- kubectl apply -f deployment.yaml
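For the record, the patching step can be driven by a kustomization.yaml along these lines (the patch file names are made up; they stand for the patches shown earlier in this post):
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml          # output of helm template
patchesStrategicMerge:
  - python-ca-patch.yaml
  - gitconfig-patch.yaml
  - webhook-image-patch.yaml
Running kubectl kustomize . then yields the fully patched manifests, jobs included, ready for the kubectl apply step.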
And now it works.
Kaniko
Now that we are out of the certificate hell, we can take care of a much more fun topic. Instead of building the Docker images of the Renku projects in a VM, we want to use Kaniko. A simple modification of the CI/CD file in each project template to update the image_build step is enough. The modifications can be summarized as:
- changing the Docker image to use
- updating the script to launch Kaniko
- and of course, adding our root CA everywhere
After modification, the interesting part of the .gitlab-ci.yml looks like:
image_build:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  before_script:
    - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
    - |
      echo "-----BEGIN CERTIFICATE-----
      <root CA>
      -----END CERTIFICATE-----" >> /kaniko/ssl/certs/additional-ca-cert-bundle.crt
  script:
    - CI_COMMIT_SHA_7=$(echo $CI_COMMIT_SHA | cut -c1-7)
    - echo $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA_7
    - /kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/Dockerfile --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA_7
  tags:
    - image-build
Other
I could mention many other things, but you can most likely find the solutions elsewhere:
- Add the root CA to the Docker images in Renku projects
- Use our own GitLab as identity provider
- Use custom repositories as project templates
- Fix permissions and ownership on NFS volumes
- Make sure that the previously defined secrets are not overwritten with a new deployment
- And probably many other things
Thanks
A big thanks to @pameladelgado, who helped me a lot through the deployment process.