Trouble getting HTTPS / letsencrypt working with 0.9.0-beta.4

I’ve been using JH 0.8 (included as a requirements.yaml dependency for my own service) on Google Cloud, tried playing with 0.9.0-beta.4, and have not been able to get HTTPS working after the upgrade. It looks like autohttps isn’t properly serving the ACME challenge on port 80 in the 0.9.0 betas?

When I visit https://improc.ceresimaging.net (which maps correctly to 35.203.130.226) I get an SSL protocol error (basically the HTTPS server had an internal error). If I visit http://improc.ceresimaging.net, I get redirected to 443/https.
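For what it’s worth, both observations can be reproduced from a shell with plain curl, roughly like this:

$ curl -v https://improc.ceresimaging.net    # fails during the TLS handshake
$ curl -sI http://improc.ceresimaging.net    # shows the redirect to https/443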

All the logs look OK, except that the autohttps pod is failing to complete the Let’s Encrypt HTTP challenge, timing out trying to access .well-known/acme-challenge on port 80.

Running wget on my computer produces the same results:

➜  ~ wget http://improc.ceresimaging.net/.well-known/acme-challenge/018QQqoEpMphNo8_7J61TOcmQ7oGhZ7WOAl3VMfJuJc
--2020-03-10 15:37:04--  http://improc.ceresimaging.net/.well-known/acme-challenge/018QQqoEpMphNo8_7J61TOcmQ7oGhZ7WOAl3VMfJuJc
Resolving improc.ceresimaging.net (improc.ceresimaging.net)... 35.203.130.226
Connecting to improc.ceresimaging.net (improc.ceresimaging.net)|35.203.130.226|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2020-03-10 15:38:06 ERROR 404: Not Found.

The services look like they’re up and healthy, as do the pods.
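That can be double-checked with plain kubectl, e.g.:

$ kubectl get svc
$ kubectl get pods -o wide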

The relevant part of my values.yaml is as follows:

  proxy:
    https:
      hosts:
        - improc.ceresimaging.net
      letsencrypt:
        contactEmail: seth@ceresimaging.net
    secretToken: "SECRETS DELETED"
    service:
      loadBalancerIP: 35.203.130.226

In case it’s helpful, here’s the full values.yaml with secrets elided:

jupyterhub:
  proxy:
    https:
      hosts:
        - improc.ceresimaging.net
      letsencrypt:
        contactEmail: seth@ceresimaging.net
    secretToken: "SECRETS DELETED"
    service:
      loadBalancerIP: 35.203.130.226

  singleuser:
    defaultUrl: "/lab"
    image:
      name: gcr.io/ceres-imaging-science/improc-notebook
      tag: latest

    extraEnv:
      JUPYTER_ENABLE_LAB: "yes"
      GRANT_SUDO: "yes"

    storage:
      homeMountPath: /home/{username}
      extraVolumes:
        - name: ceres-flights
          persistentVolumeClaim:
            claimName: ceres-flights
      extraVolumeMounts:
        - name: ceres-flights
          mountPath: /home/{username}/flights

    cmd: "start-singleuser.sh"

    # start as root, we drop privs once NB_USER is set by CustomGoogleOAuthenticator below
    uid: 0
  hub:
    image:
      name: gcr.io/ceres-imaging-science/improc-hub
      tag: latest
    imagePullSecret:
      registry: gcr.io
      username: _json_key
      password: |-
        {
          "type": "service_account",
          # SECRETS DELETED
        }
    extraConfig:
      logo: |
        c.JupyterHub.logo_file = '/usr/local/share/jupyterhub/static/images/ceres-logo.svg'
      useCeresOAuthenticator: |
        c.JupyterHub.authenticator_class = CeresOAuthenticator
  prePuller:
    hook:
      enabled: false

  auth:
    admin:
      users:
        - SECRETS DELETED
    type: google
    google:
      # SECRETS DELETED

    state:
      enabled: true
      cryptoKey: SECRETS DELETED

debug:
  enabled: true

Which pod should be responding to the ACME challenge, and what’s the path of load balancer / service / route that the request should be taking from proxy-public to that pod?

I notice the kube-lego pod(s) and service are no longer present; I’m guessing that autohttps is taking over this role?
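A quick way to check that (just grepping the default resource names) would be something like:

$ kubectl get pods,svc | grep -iE 'kube-lego|autohttps'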

Hi,

I’m running into a very similar problem - the default LetsEncrypt step is failing at the challenge.

I’m using 0.9.0, but I’ve also tried the latest 0.9.0 chart.

My config.yaml is as simple as I could make it:

proxy:
  secretToken: "need-to-know-basis"
  https:
    hosts:
      - uobhub.org
    letsencrypt:
      contactEmail: matthew.brett@gmail.com
  service:
    loadBalancerIP: 35.189.82.198

Log from kubectl logs pod/autohttps-7b465f7b8b-lp5ww traefik -f gives:

time="2020-07-02T14:21:55Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-07-02T14:21:55Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-07-02T14:21:55Z" level=info msg="Starting provider *acme.Provider {\"email\":\"matthew.brett@gmail.com\",\"caServer\":\"https://acme-v02.api.l
etsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\
",\"store\":{},\"ChallengeStore\":{}}"
time="2020-07-02T14:21:55Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-07-02T14:21:55Z" level=info msg="Starting provider *traefik.Provider {}"
time="2020-07-02T14:22:11Z" level=error msg="Unable to obtain ACME certificate for domains \"uobhub.org\" : unable to generate a certificate for the doma
ins [uobhub.org]: acme: Error -> One or more domains had a problem:\n[uobhub.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching h
ttp://uobhub.org/.well-known/acme-challenge/btuQKX8X9Q6RlJGzpIgN7wi9RsCDxB8luT7r6oI2IE0: Timeout during connect (likely firewall problem), url: \n" provi
derName=le.acme
time="2020-07-02T14:22:24Z" level=error msg="Unable to obtain ACME certificate for domains \"uobhub.org\" : unable to generate a certificate for the doma
ins [uobhub.org]: acme: Error -> One or more domains had a problem:\n[uobhub.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching h
ttp://uobhub.org/.well-known/acme-challenge/btuQKX8X9Q6RlJGzpIgN7wi9RsCDxB8luT7r6oI2IE0: Timeout during connect (likely firewall problem), url: \n" provi
derName=le.acme

I can make LetsEncrypt work on my own Mac - here’s the result of a LetsEncrypt certificate obtained on my home machine: https://jupyterhub.dynevor.org/

Any suggestions of what I could try next to debug?

Cheers,

Matthew

When not using a very recent version of the Helm chart (newer than 0.9.0), the autohttps pod can save a failed attempt into a k8s secret and get stuck in a bad state. Due to this, I suggest the following (a consolidated command sketch follows the lists below):

  1. Verify your domain points to the external IP you see from kubectl get svc proxy-public.
  2. Upgrade to a Helm chart version like 0.9.0-n116.h1c766a1 or newer, to get a version of the autohttps setup that avoids getting stuck in a corrupt state saved to the secret it reloads on startup.
  3. Delete both the secret named proxy-public-tls-acme and the autohttps pod.

If these are done in order and it still fails, try deleting the autohttps pod a few more times. If it still fails after that:

  1. Inspect the logs of the autohttps pod.
  2. Inspect the logs of the proxy pod.
  3. Set the Helm chart configuration debug.enabled: true and repeat, to get more details in the logs.
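A rough, consolidated sketch of the commands behind the steps above (release name and config file are placeholders; the chart version is the development release mentioned in step 2):

# step 1: the external IP shown here should match your domain's DNS record
$ kubectl get svc proxy-public
$ nslookup <your-domain>

# step 2: upgrade to a newer development release of the chart
$ helm upgrade <release> jupyterhub/jupyterhub --version=0.9.0-n116.h1c766a1 --values config.yaml

# step 3: delete the stale ACME secret and the autohttps pod so a clean retry is made
$ kubectl delete secret proxy-public-tls-acme
$ kubectl delete pods $(kubectl get pods -o custom-columns=POD:metadata.name | grep autohttps-)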

This is the autohttps pod. With it around, traffic is routed from the proxy-public svc, to the autohttps pod, to the proxy-http svc, to the proxy pod, and then to a destination depending on the path (/hub to the hub pod, unknown paths to the hub pod, and /user to user pods if they have servers running).

The TLS termination is done by the autohttps pod, which is now Traefik v2 using the LEGO ACME client library.
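If it helps when tracing that path, the pieces involved can be listed with something like the following (service names as in the chart; the component label selectors are assumed from the chart’s labelling conventions):

$ kubectl get svc proxy-public proxy-http
$ kubectl get pods -l component=autohttps
$ kubectl get pods -l component=proxy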

Upcoming fix in Traefik’s use of LEGO

The need to restart the autohttps pod is caused by this issue that I opened with Traefik. They have successfully reproduced this issue and a PR is now open to resolve it.


@consideRatio - thank you!

I found and deleted the secret:

$ kubectl get secrets
$ kubectl delete secret proxy-public-tls-acme
$ kubectl get secrets

I found the latest chart from https://jupyterhub.github.io/helm-chart/#development-releases-jupyterhub, which was 0.9.0-n116.h1c766a1.

I then purged and restarted using this chart:

$ helm delete jhub-testing --purge
$ helm upgrade --install jhub-testing jupyterhub/jupyterhub   --namespace jhub-testing --version=0.9.0-n116.h1c766a1 --values config.yaml

Then I checked the logs, but got the same error:

$ kubectl logs pod/$(kubectl get pods -o custom-columns=POD:metadata.name | grep autohttps-) traefik -f

giving:

time="2020-07-03T17:46:42Z" level=error msg="Unable to obtain ACME certificate for domains \"testing.uobhub.org\" : unable to generate a certificate for th
e domains [testing.uobhub.org]: error: one or more domains had a problem:\n[testing.uobhub.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :
: Fetching http://testing.uobhub.org/.well-known/acme-challenge/QfUNDgaKU_3dw_WvkDiPaAADbFAOciVMXCMG99nZCiI: Timeout during connect (likely firewall proble
m), url: \n" providerName=default.acme

Finally, I tried deleting the autohttps pod:

$ kubectl delete pods $(kubectl get pods -o custom-columns=POD:metadata.name | grep autohttps-)

And - hey presto - it worked! Thanks very much for your help.

Do you know why I had to delete, even with the newest chart? Is that something that will be easy to fix in due course?

Cheers,

Matthew


Hmm - interesting - the exact same procedure also worked for my non-testing cluster - I had to delete the autohttps pod once … I wonder - could it be starting up before the external IP is assigned?

Cheers,

Matthew

I tried a similar approach but it doesn’t seem to resolve the issue for me.


@Yasharth_Bajpai - maybe it’s worth posting the exact steps you took and their output, just in case you missed something, or I missed out a step in what I did?

For example, I didn’t record the nslookup output, but it is correct, in that it matches my config.yaml and the output from kubectl get svc --namespace jhub-testing:

$ nslookup testing.uobhub.org
Server:         169.254.169.254
Address:        169.254.169.254#53
Non-authoritative answer:
Name:   testing.uobhub.org
Address: 34.89.20.96
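For comparison, the external IP the service actually received can also be read directly off proxy-public, e.g.:

$ kubectl get svc proxy-public --namespace jhub-testing \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'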

It could be one reason, but not the most common one, I think. I’m not sure if Traefik retries after a while, but if it does, that would only delay the process until a retry is made.

The key reason for this issue is the one reported with Traefik: it sometimes ends up making multiple requests to the ACME server when only one should be made, and then responds to the wrong challenge. It is already on its way to being resolved, and then we will update to the new version of Traefik that avoids this issue in its use of LEGO as an ACME client interacting with Let’s Encrypt as an ACME server.


Interesting - thanks.

Is that multiple-request issue compatible with my “Timeout during connect” error?

Cheers,

Matthew

Just following up - I have this same problem every time I start my cluster.

For the last four times or so, I did not delete the stored secret, I only deleted the autohttps pod:

kubectl delete pods $(kubectl get pods -o custom-columns=POD:metadata.name | grep autohttps-)

The last time I did this, I had to do it twice.
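In case it’s useful to others, one way to watch whether a given retry succeeded is to follow the Traefik container’s logs in the freshly created pod, e.g.:

$ kubectl logs deploy/autohttps -c traefik -f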

That is only to say that deleting the secret does not seem to be relevant in my case.

Cheers,

Matthew


See LetsEncrypt certificate generation failing on basic default z2jh / GKE setup · Issue #2601 · jupyterhub/zero-to-jupyterhub-k8s · GitHub as a followup. This issue relates to the k8s cluster’s networking not being set up quickly enough after the Pod was scheduled to a node. This won’t happen in all k8s clusters, but it is confirmed to be a problem in GKE clusters as of 2022-02-25, both when using the default settings for a GKE cluster and when using Calico, which is an opt-in feature.

Hi all, just wanted to share that (a) I really appreciate all of you for posting here and in the associated github issue, and (b) there’s a really critical comment that’s a bit buried in the GitHub thread, which is not replicated here.

Since this Discourse thread is now linked from all over the web, I want to share that comment from @consideRatio explicitly below as well, in hopes it will save someone else the tremendous amounts of time I spent troubleshooting over the past days, before coming across the GitHub comment.


consideRatio commented on Feb 25, 2022

I think I’ve nailed it, as I could make sure it didn’t occur by introducing a delay after the k8s Pod had been scheduled on a node and received an IP, either by pulling a new image on the node, or by tweaking the startup command to first run sleep 10 before starting Traefik as usual.

I’ve proposed a new feature for Traefik in traefik/traefik#8803. But for now, a workaround could be to redeploy by setting new image tags to force restarts of the autohttps pod. Example config:

proxy:
  traefik:
    image:
      # tag modified to trigger a restart of the autohttps pod
      # and induce a delay while downloading the image
      # that ensures networking gets setup in time
      # which allows the requested ACME challenge
      # where the Pod will receive inbound network traffic
      # can succeed.
      tag: 2.6.0 # default is 2.6.1

Another option is to edit the autohttps deployment like this:

kubectl edit deploy autohttps
        # ...
       containers:
       - image: traefik:v2.6.1
+        command: ["sh", "-c", "sleep 10 && /entrypoint.sh traefik"]
         imagePullPolicy: IfNotPresent
         # ...