On-prem binderhub deployment w/ internal CA

Hi,

I’m in the process of setting up an on-prem deployment. My company prefers to host almost everything internally, and we have our own certificate authority that issues certs for all of our internal sites. Our self-hosted gitlab, binderhub, and the in-deployment jupyterhub will all make use of this CA.

I’ve successfully stood up the site. However, in the process I ran into an issue where, in order to resolve the ref for the repo being requested, binderhub (specifically the request made via Tornado's AsyncHTTPClient) needed to be aware of the CA. If it wasn't, I ran into SSL cert issues. To get past this, I mounted a secret, went into the code, and explicitly configured the individual API request to use ca_certs='<ca_cert_bundle_path>':

resp = yield client.fetch(
    api_url,
    user_agent="BinderHub",
    # hard-coded path to the mounted CA bundle (the hack in question)
    ca_certs='<<ca cert>>',
)

I rebuilt, used my tweaked image, and successfully made the call to gitlab. Then I ran into the same SSL issues when calling the hub API after a build completed, and had to do some further tweaking. So I've got the deployment set up and it's now working, but with a hacked copy of your source. This is unsustainable on my end, and I'd like to figure out a way to install and/or use my CA certs either through init containers or configuration options.

TL;DR: Is it possible to configure binderhub to use a specific CA cert bundle for all outgoing requests?

If you would like me to elaborate at all on what I know, let me know. And please feel free to direct me elsewhere if this isn’t the right place to raise this type of request.

Thanks for your help,
Dale Mittleman

Thinking more about this, what I’m looking for is the ability to configure a cafile arg for two separate interfaces: calls made to gitlab, and calls made to the hub api.


On the latter: in the jupyterhub codebase, from what I've seen, there are two places that leverage environment variables to configure the SSL context. The relevant environment variables are…

  • JUPYTERHUB_SSL_KEYFILE
  • JUPYTERHUB_SSL_CERTFILE
  • JUPYTERHUB_SSL_CLIENT_CA

…and their usages can be found in mixins.py and auth.py. What configuration of the SSL context looks like in the code can be seen below. It would be great if binderhub performed the same configuration work for its calls out to the hub. I'm still new-ish to the code, but I think the only place this happens is in launcher.py.
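Roughly, that configuration looks like this (a paraphrase from memory of the jupyterhub code, not a verbatim copy):

import os
import ssl

def make_ssl_context_from_env():
    """Build an SSL context from the JUPYTERHUB_SSL_* environment
    variables, roughly the way jupyterhub's mixins.py/auth.py do."""
    keyfile = os.environ.get("JUPYTERHUB_SSL_KEYFILE", "")
    certfile = os.environ.get("JUPYTERHUB_SSL_CERTFILE", "")
    client_ca = os.environ.get("JUPYTERHUB_SSL_CLIENT_CA", "")
    if not (keyfile and certfile):
        return None  # internal SSL is not configured
    # trust the configured CA (falling back to system defaults if unset)
    context = ssl.create_default_context(
        ssl.Purpose.SERVER_AUTH, cafile=client_ca or None
    )
    # present our own cert/key to the other side
    context.load_cert_chain(certfile, keyfile)
    return context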


As for the calls to the internally-hosted gitlab, a CA cert could come either from additional environment variables or from additional configuration options on either the GitLabRepoProvider or the base RepoProvider, as sketched below.
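For example, something like this on the base provider (purely hypothetical names, just to sketch the shape of the option):

from traitlets import Unicode
from traitlets.config import LoggingConfigurable

class RepoProvider(LoggingConfigurable):
    # hypothetical: path to a CA bundle used to verify TLS when
    # talking to the repo host; empty means system defaults
    ca_certs = Unicode(
        "",
        help="Path to a CA bundle for verifying the repo host",
        config=True,
    )

Each provider's API calls could then pass ca_certs=self.ca_certs through to the AsyncHTTPClient request.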

Let me know your thoughts. I'm willing to work towards a PR if that's needed. I understand this is an awfully particular use case.

I don’t know the answer but I have a hunch.

Instead of passing a specific bundle to every call made in the BinderHub code (which means modifying lots of code and isn't sustainable), I would try to add your CA's certs to the trusted roots of the docker image in which BinderHub, repo2docker, and JupyterHub run.

On an Ubuntu base image, something like the following copies certificates into the image and adds them to the trusted roots for all(?) software running in that image:

# directory for extra, locally-trusted CA certificates
RUN mkdir /usr/local/share/ca-certificates/extra
# the .crt extension is required for update-ca-certificates to pick it up
COPY some-certs-for-my-CA-here.crt /usr/local/share/ca-certificates/extra/
# regenerate /etc/ssl/certs and the ca-certificates.crt bundle
RUN update-ca-certificates

I unfortunately don’t have a link handy to the docs/blog where I found those lines.
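A quick way to sanity-check, at least for Python's own ssl module, that a rebuilt image picked up the extra certs (a sketch):

import ssl

# where this Python build looks for CA material by default
print(ssl.get_default_verify_paths())

# the x509_ca count should grow after update-ca-certificates runs
context = ssl.create_default_context()
print(context.cert_store_stats())

Note this only tells you what the ssl module sees, which is not necessarily what tornado/pycurl uses, as it turns out below.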

And right after posting my reply I found a discussion which suggests that tornado does not use the OS-level trusted roots?! This is annoying :frowning: I'd still try to dig into why/how to persuade tornado to use the OS-level CA bundle, and only resort to customising the code after exhausting that option.

I'm glad you pinged my message here; I'm still struggling with this problem. Calls via cURL, and Python 3 calls via PoolManager, Request, etc., all work successfully against our custom-CA-enabled K8s API endpoint.

But whenever the request originates from Tornado, it seems a different set of certs is being used, and it isn't adhering to any of the HTTPS verify-disabling environment variables I've tried setting.

Frustrating.


JupyterHub tries to use pycurl if available. It also looks like Tornado leaves the handling of CA certs to pycurl, but I couldn't find a clear statement of which CA bundle pycurl uses by default.
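From memory the opt-in looks something like this (a sketch, not the exact JupyterHub code):

from tornado.httpclient import AsyncHTTPClient

try:
    import pycurl  # noqa
    # libcurl's idea of the default CA bundle applies here
    AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
except ImportError:
    # the pure-Python client uses the ssl module's defaults instead
    pass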

pycurl is “just” a wrapper around libcurl, so it is weird that curl works but tornado (via pycurl via libcurl) doesn't.

On Ubuntu there seem to be different packages that provide libcurl (https://packages.ubuntu.com/bionic/libcurl-dev). Maybe it is worth checking whether one of them works, and understanding which one uses the certificate store that is updated by the three-line snippet above?

I don't have a custom CA right now. Is there one we can use to test? I'm thinking of a webhost that uses a self-signed certificate, with the certificate of that “self-signed CA” available somewhere.


The currently deployed BinderHub and JupyterHub images on gke.mybinder.org give slightly different answers when I run curl-config --configure:

# on binderhub
... '--with-ca-path=/etc/ssl/certs' ...
# on jupyterhub
... '--with-ca-path=/etc/ssl/certs' '--with-ca-bundle=/etc/ssl/certs/ca-certificates.crt' ...

which makes me wonder if this means only the bundle is used on JupyterHub? This would mean installing extra certs wouldn’t have any effect?
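One thing you can do from inside the running containers is ask pycurl what it was actually linked against (a sketch; the exact output depends on the build):

import pycurl

# e.g. "PycURL/7.43.0 libcurl/7.64.0 OpenSSL/1.1.1 ..." shows the
# libcurl version and SSL backend in use at runtime
print(pycurl.version)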

(I am massively beyond what I know about how all these things should work together and even more so when it comes to how they actually work together)

Hmmmm. That's a good find. From what I'm seeing, the difference has to do with the fact that JupyterHub is hosted on a focal (Ubuntu 20.04) image, while binderhub is hosted on a buster (Debian 10) image.

I wonder if this would be a non-issue if binderhub were hosted on Ubuntu.

It feels a bit like we have a case of “the blind leading the blind” here. For example, I am not even sure if the different arguments for ./configure matter; I just noted that they are different :-/

Does someone know of a public webserver that is running with a self-signed cert that we could use to establish what works and what doesn’t?

I put together a set of docker containers and instructions to get a “test setup”.

For me things “work” with this setup. So adding the certificate to the OS-level CA bundle makes BinderHub's AsyncHTTPClient “just work” without needing to specify certificates “per call”.

Let me know what you find.


@betatim I really appreciate the help. Good idea to set up a test scenario.

I'm seeing the same thing. I set up your test containers, and was able to 1) curl https://jovyan.example.com and then 2) configure the AsyncHTTPClient and use it to successfully hit the same URL.

I also went one step further and created a cert signed by a separate CA, rather than a self-signed one. This changes the flow to:

  1. Create the CA .pem and .key
  2. Create the jovyan.example.com .key and .csr
  3. Sign the .csr with the CA key and cert, producing a .crt for nginx to use
  4. Add the CA .pem to the client's CA cert store
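I did the above with openssl on the command line; here is the same flow sketched in Python with the cryptography package, in case it helps anyone reproduce it (the names, key sizes, and validity periods are placeholders):

import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa

now = datetime.datetime.utcnow()

# 1. CA key and self-signed CA certificate
ca_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
ca_name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "Example Test CA")])
ca_cert = (
    x509.CertificateBuilder()
    .subject_name(ca_name)
    .issuer_name(ca_name)
    .public_key(ca_key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + datetime.timedelta(days=365))
    .add_extension(x509.BasicConstraints(ca=True, path_length=None), critical=True)
    .sign(ca_key, hashes.SHA256())
)

# 2. Server key and CSR for jovyan.example.com
server_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "jovyan.example.com")]))
    .sign(server_key, hashes.SHA256())
)

# 3. Sign the CSR with the CA, producing the cert nginx serves
server_cert = (
    x509.CertificateBuilder()
    .subject_name(csr.subject)
    .issuer_name(ca_name)
    .public_key(csr.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + datetime.timedelta(days=365))
    .sign(ca_key, hashes.SHA256())
)

# 4. Drop the CA cert where update-ca-certificates will find it
with open("/usr/local/share/ca-certificates/example-test-ca.crt", "wb") as f:
    f.write(ca_cert.public_bytes(serialization.Encoding.PEM))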

I’m not seeing any difference in behavior between the two patterns. Both curl and the AsyncHTTPClient can get content from https://jovyan.example.com

This tells me that the tornado client, by default, does pick up CA certs from the default OS store.

When I run similar tests in my live environment (within the binderhub pod deployed via helm), however, I still see SSL issues. I’ve installed my company’s CA cert, so I know I have the required certs in the store.

So, what could be different between the test case and the deployed copy? Just brainstorming here, no confidence in any of these paths:

  • The pod container environment has a variable set that changes the behavior of libcurl.
  • My company proxies all http(s) traffic by default. Something about the proxy setup is screwing up the flow.
  • Something else is happening in the binderhub code that is causing pycurl to look away from the default CA store.

I'll try to falsify each of these one at a time. I think we're getting closer here. Thanks again for the help.


Could you perhaps try disabling pycurl to force BinderHub to use the built-in client instead, and see if that makes a difference? Either uninstall it, or hack the code to use simple_httpclient.
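For example, a one-line override early in startup (a sketch):

from tornado.httpclient import AsyncHTTPClient

# force Tornado's pure-Python client, which builds its SSL context
# from the ssl module defaults (the OS bundle on most distros)
AsyncHTTPClient.configure("tornado.simple_httpclient.SimpleAsyncHTTPClient")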

Nice! If you have the openssl (?) commands handy for doing this, could you make a PR to the repo to add them? I skipped it because I didn't know how, but it would be cool to know.


One thing worth trying: do things also work if you use the JupyterHub image as the “base”? Maybe Ubuntu vs Debian does make a difference?

It would be good to triple check whether we mean BinderHub or JupyterHub when posting here and in which container image we are running things. I would not be surprised if they behave differently. There is waaaaay more code in JupyterHub that could influence the exact settings of AsyncHTTPClient (for example the “use SSL internally” feature) than in the BinderHub code base.

Which brings me to another idea for something to investigate: @manics linked to a snippet from the Tornado code which has a comment about “once we set the CAINFO option we can never restore the old behaviour”. This makes me wonder if in JupyterHub (via the “internal SSL” feature) we ever end up calling curl.setopt(pycurl.CAINFO, request.ca_certs) with an empty string or something. And because that wipes out the “default value” we somehow get screwed? hashtag-speculation :smiley:

Ok, I think I got it.

For all of our internal applications, we install a cert bundle which contains 2 certs. Solution: un-bundle the certs inside and install both separately.

For context: update-ca-certificates looks for any file suffixed with .crt that lives in /usr/local/share/ca-certificates (and maybe sub-folders?) and 1) copies it to /etc/ssl/certs, changing the suffix to .pem, and 2) appends the content to /etc/ssl/certs/ca-certificates.crt. I'm sure some other stuff is done, but this is what's relevant to my following explanation.

The difference in libcurl configuration between Ubuntu and Debian definitely matters. I don't know exactly what is going on, but my sense is that without the --with-ca-bundle flag, the library looks in the --with-ca-path location for anything it recognizes as a single cert. Before I split the cert bundle in two, the library was ignoring my installed bundle.

On Ubuntu, with a similar flow, the bundle installation works just fine. This is because curl is configured to explicitly look at the ca-certificates.crt file, which after installation has the bundle content appended.
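For reference, the un-bundling step itself is simple. A Python equivalent of the awk one-liner I use in the init container below (see the P.S.) would be something like this (assuming the bundle is a plain concatenation of PEM blocks):

from pathlib import Path

# split the bundle into single-cert .crt files that
# update-ca-certificates will install individually
END = "-----END CERTIFICATE-----"
bundle = Path("CertBundle.pem").read_text()
certs = [
    chunk.strip() + "\n" + END + "\n"
    for chunk in bundle.split(END)
    if "-----BEGIN CERTIFICATE-----" in chunk
]
for i, cert in enumerate(certs):
    Path(f"/usr/local/share/ca-certificates/cert-{i}.crt").write_text(cert)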


P.S. For anyone that runs into a similar problem… I was able to set up my CA certs in the container without having to override the image. You can do this by storing your cert bundle as a secret, using an init container to un-bundle the certs within, and pushing them to a shared emptyDir volume that is then mapped to /usr/local/share/ca-certificates on the binder container. You still need to run update-ca-certificates on the container in order to install those certs, but you can do so at startup using the available extraConfig option.

The volume mounts and init container:

initContainers:
  # split the mounted bundle into one certs/cert-<n>.crt per certificate
  - name: init-install-ca-certs
    image: alpine
    command:
      - 'sh'
      - '-c'
      # start a new output file at every BEGIN CERTIFICATE line
      - awk 'BEGIN {c=0;} /BEGIN CERT/{c++} { print > "certs/cert-" c ".crt"}' < CertBundle.pem
    volumeMounts:
      - name: ca-bundle
        mountPath: /CertBundle.pem
        subPath: CertBundle.pem
      - name: ca-bundle-unbundled
        mountPath: /certs
extraVolumes:
  # ** External to the chart **
  - name: ca-bundle
    secret:
      secretName: ca-bundle
  - name: ca-bundle-unbundled
    emptyDir: {}
extraVolumeMounts:
  - name: ca-bundle-unbundled
    mountPath: /usr/local/share/ca-certificates
    readOnly: false

And the installation via python:

extraConfig:
  00-install-ca-certs: |
    # register the unbundled certs with the OS trust store at startup
    import os
    os.system('update-ca-certificates')

Thanks for the help debugging this and posting your solutions!! Happy Sunday!