I’ve been using the JupyterHub 0.8 Helm chart (included via requirements.yaml as a dependency of my own chart) on Google Cloud, tried playing with 0.9.0-beta.4, and have not been able to get HTTPS working after the upgrade. It looks like autohttps isn’t properly serving the ACME challenge on port 80 in the 0.9.0 betas?
When I visit https://improc.ceresimaging.net (which maps correctly to 35.203.130.226) I get an SSL protocol error (essentially the HTTPS server had an internal error). If I visit http://improc.ceresimaging.net, I get redirected to 443/https.
All the logs look OK, except that the autohttps pod is failing to complete the Let’s Encrypt HTTP challenge, timing out trying to access .well-known/acme-challenge on port 80.
Running wget on my computer produces the same results:
➜ ~ wget http://improc.ceresimaging.net/.well-known/acme-challenge/018QQqoEpMphNo8_7J61TOcmQ7oGhZ7WOAl3VMfJuJc
--2020-03-10 15:37:04-- http://improc.ceresimaging.net/.well-known/acme-challenge/018QQqoEpMphNo8_7J61TOcmQ7oGhZ7WOAl3VMfJuJc
Resolving improc.ceresimaging.net (improc.ceresimaging.net)... 35.203.130.226
Connecting to improc.ceresimaging.net (improc.ceresimaging.net)|35.203.130.226|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2020-03-10 15:38:06 ERROR 404: Not Found.
The services look like they’re up and healthy, as do the pods.
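(For reference, a minimal sketch of the checks behind that claim; resource names assume the default z2jh chart:)
$ kubectl get svc proxy-public    # EXTERNAL-IP shows 35.203.130.226
$ kubectl get pods                # hub, proxy, and autohttps pods all Running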
The relevant part of values.yaml is as follows:
proxy:
  https:
    hosts:
      - improc.ceresimaging.net
    letsencrypt:
      contactEmail: seth@ceresimaging.net
  secretToken: "SECRETS DELETED"
  service:
    loadBalancerIP: 35.203.130.226
In case it’s helpful, here’s the full values.yaml with secrets elided:
jupyterhub:
  proxy:
    https:
      hosts:
        - improc.ceresimaging.net
      letsencrypt:
        contactEmail: seth@ceresimaging.net
    secretToken: "SECRETS DELETED"
    service:
      loadBalancerIP: 35.203.130.226
  singleuser:
    defaultUrl: "/lab"
    image:
      name: gcr.io/ceres-imaging-science/improc-notebook
      tag: latest
    extraEnv:
      JUPYTER_ENABLE_LAB: "yes"
      GRANT_SUDO: "yes"
    storage:
      homeMountPath: /home/{username}
      extraVolumes:
        - name: ceres-flights
          persistentVolumeClaim:
            claimName: ceres-flights
      extraVolumeMounts:
        - name: ceres-flights
          mountPath: /home/{username}/flights
    cmd: "start-singleuser.sh"
    # start as root, we drop privs once NB_USER is set by CustomGoogleOAuthenticator below
    uid: 0
  hub:
    image:
      name: gcr.io/ceres-imaging-science/improc-hub
      tag: latest
    imagePullSecret:
      registry: gcr.io
      username: _json_key
      password: |-
        {
          "type": "service_account",
          # SECRETS DELETED
        }
    extraConfig:
      logo: |
        c.JupyterHub.logo_file = '/usr/local/share/jupyterhub/static/images/ceres-logo.svg'
      useCeresOAuthenticator: |
        c.JupyterHub.authenticator_class = CeresOAuthenticator
  prePuller:
    hook:
      enabled: false
  auth:
    admin:
      users:
        - SECRETS DELETED
    type: google
    google:
      # SECRETS DELETED
    state:
      enabled: true
      cryptoKey: SECRETS DELETED
  debug:
    enabled: true
Which pod should be responding to the ACME challenge, and what’s the path of load balancer/service/route that the request should be taking from proxy-public to that pod?
I notice the kube-lego pod(s) and service are no longer present - I’m guessing that autohttps is taking over this role?
Hi,
I’m running into a very similar problem - the default LetsEncrypt step is failing at the challenge.
I’m using 0.9.0 - but I’ve also tried the latest 0.9.0 chart.
My config.yaml is as simple as I could make it:
proxy:
  secretToken: "need-to-know-basis"
  https:
    hosts:
      - uobhub.org
    letsencrypt:
      contactEmail: matthew.brett@gmail.com
  service:
    loadBalancerIP: 35.189.82.198
Log from kubectl logs pod/autohttps-7b465f7b8b-lp5ww traefik -f gives:
time="2020-07-02T14:21:55Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-07-02T14:21:55Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-07-02T14:21:55Z" level=info msg="Starting provider *acme.Provider {\"email\":\"matthew.brett@gmail.com\",\"caServer\":\"https://acme-v02.api.l
etsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\
",\"store\":{},\"ChallengeStore\":{}}"
time="2020-07-02T14:21:55Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-07-02T14:21:55Z" level=info msg="Starting provider *traefik.Provider {}"
time="2020-07-02T14:22:11Z" level=error msg="Unable to obtain ACME certificate for domains \"uobhub.org\" : unable to generate a certificate for the doma
ins [uobhub.org]: acme: Error -> One or more domains had a problem:\n[uobhub.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching h
ttp://uobhub.org/.well-known/acme-challenge/btuQKX8X9Q6RlJGzpIgN7wi9RsCDxB8luT7r6oI2IE0: Timeout during connect (likely firewall problem), url: \n" provi
derName=le.acme
time="2020-07-02T14:22:24Z" level=error msg="Unable to obtain ACME certificate for domains \"uobhub.org\" : unable to generate a certificate for the doma
ins [uobhub.org]: acme: Error -> One or more domains had a problem:\n[uobhub.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching h
ttp://uobhub.org/.well-known/acme-challenge/btuQKX8X9Q6RlJGzpIgN7wi9RsCDxB8luT7r6oI2IE0: Timeout during connect (likely firewall problem), url: \n" provi
derName=le.acme
I can make LetsEncrypt work on my own Mac - here’s the result of a LetsEncrypt certificate generated on my home machine: https://jupyterhub.dynevor.org/
Any suggestions of what I could try next to debug?
Cheers,
Matthew
When not using a very recent version of the Helm chart (newer than 0.9.0), the autohttps pod can save a failed attempt into a k8s secret and get stuck in a bad state. Due to this, I suggest:
- Verify your domain points to the external IP you see from kubectl get svc proxy-public.
- Upgrade to a Helm chart version like 0.9.0-n116.h1c766a1 or newer, whose autohttps setup avoids getting stuck in a corrupt state saved to a secret that it reloads on startup.
- Delete both the secret named proxy-public-tls-acme and the autohttps pod (see the command sketch after this list).
If done in order and it still fails, try deleting the autohttps pod a few more times. If it still fails:
- Inspect the logs of the autohttps pod.
- Inspect the logs of the proxy pod.
- Set the Helm chart configuration debug.enabled: true and repeat, for more details in the logs.
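For reference, a rough sketch of the corresponding commands. The resource names and labels here are assumptions based on a default z2jh install, so adjust for your namespace/release:
$ kubectl get svc proxy-public                    # step 1: EXTERNAL-IP should match your DNS record
$ kubectl delete secret proxy-public-tls-acme     # step 3: drop the possibly corrupt ACME state
$ kubectl delete pod -l component=autohttps       # step 3: recreate the pod so a fresh challenge is attempted
$ kubectl logs deploy/autohttps -c traefik        # inspect the autohttps (traefik) logs
$ kubectl logs deploy/proxy                       # inspect the proxy (chp) logs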
This is handled by the autohttps pod. With it around, traffic is routed from the proxy-public svc to the autohttps pod, to the proxy-http svc, to the proxy pod, and from there to whatever destination depending on the path (/hub to the hub pod, unknown paths to the hub pod, and /user to user pods if they have servers running).
The TLS termination is done by the autohttps pod, which is now Traefik v2 using the LEGO ACME client library.
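In case it helps debugging, a hedged sketch of how each hop in that chain can be inspected - the names and labels assume the default z2jh chart:
$ kubectl get svc proxy-public proxy-http     # public LB service, and the internal service in front of the proxy pod
$ kubectl get pod -l component=autohttps      # traefik pod doing TLS termination and the ACME challenge
$ kubectl get pod -l component=proxy          # CHP pod routing /hub and /user/... paths
$ kubectl logs deploy/autohttps -c traefik    # where ACME/certificate errors show up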
Upcoming fix in Traefik’s use of LEGO
The need to restart the autohttps pod is caused by an issue I opened with Traefik. They have successfully reproduced it, and a PR is now open to resolve it.
@consideRatio - thank you!
I found and deleted the secret:
$ kubectl get secrets
$ kubectl delete secret proxy-public-tls-acme
$ kubectl get secrets
I found the latest chart from https://jupyterhub.github.io/helm-chart/#development-releases-jupyterhub, which was 0.9.0-n116.h1c766a1.
I then purged and restarted using this chart:
$ helm delete jhub-testing --purge
$ helm upgrade --install jhub-testing jupyterhub/jupyterhub --namespace jhub-testing --version=0.9.0-n116.h1c766a1 --values config.yaml
Then I checked the logs, but got the same error:
$ kubectl logs pod/$(kubectl get pods -o custom-columns=POD:metadata.name | grep autohttps-) traefik -f
giving:
time="2020-07-03T17:46:42Z" level=error msg="Unable to obtain ACME certificate for domains \"testing.uobhub.org\" : unable to generate a certificate for th
e domains [testing.uobhub.org]: error: one or more domains had a problem:\n[testing.uobhub.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :
: Fetching http://testing.uobhub.org/.well-known/acme-challenge/QfUNDgaKU_3dw_WvkDiPaAADbFAOciVMXCMG99nZCiI: Timeout during connect (likely firewall proble
m), url: \n" providerName=default.acme
Finally, I tried deleting the autohttps pod:
$ kubectl delete pods $(kubectl get pods -o custom-columns=POD:metadata.name | grep autohttps-)
And - hey presto - it worked! Thanks very much for your help.
Do you know why I had to delete the pod, even with the newest chart? Is that something that will be easy to fix in due course?
Cheers,
Matthew
Hmm - interesting - the exact same procedure also worked for my not-testing cluster - I had to delete the autohttps pod once… I wonder - could it be starting up before the external IP is assigned?
Cheers,
Matthew
I tried a similar approach, but it doesn’t seem to resolve the issue for me.
@Yasharth_Bajpai - maybe it’s worth posting the exact steps you took and their output, just in case you missed something, or I missed out a step in what I did?
For example, I didn’t record the nslookup output, but it is correct, in that it matches my config.yaml and the output from kubectl get svc --namespace jhub-testing:
$ nslookup testing.uobhub.org
Server: 169.254.169.254
Address: 169.254.169.254#53
Non-authoritative answer:
Name: testing.uobhub.org
Address: 34.89.20.96
It could be one reason, but I don’t think it’s the most common one. I’m not sure if Traefik retries after a while, but if it does, that would only delay the process until a retry is made.
The key reason for this issue is a bug reported to Traefik: it sometimes ends up making multiple requests to the ACME server when only one should be made, and then responds to the wrong challenge. It is already on its way to being resolved, and once it is, we will update to the new version of Traefik that avoids this issue when using LEGO as an ACME client interacting with Let’s Encrypt as the ACME server.
Interesting - thanks.
Is that multiple-request issue compatible with my “Timeout during connect” error?
Cheers,
Matthew
Just following up - I have this same problem every time I start my cluster.
For the last four times or so, I did not delete the stored secret, I only deleted the autohttps pod:
kubectl delete pods $(kubectl get pods -o custom-columns=POD:metadata.name | grep autohttps-)
The last time I did this, I had to do it twice.
That’s just to say that deleting the secret does not seem to be relevant in my case.
Cheers,
Matthew
See LetsEncrypt certificate generation failing on basic default z2jh / GKE setup · Issue #2601 · jupyterhub/zero-to-jupyterhub-k8s · GitHub as a follow-up. That issue relates to the k8s cluster’s networking not being set up quickly enough after the pod was scheduled to a node. This won’t happen in all k8s clusters, but it is confirmed to be a problem in GKE clusters as of 2022-02-25, both when using a GKE cluster’s default settings and when using Calico, which is an opt-in feature.
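If you hit that on GKE, one possible workaround - only a sketch, and the deployment name assumes the default z2jh chart - is to restart the autohttps pod once the node’s networking has settled:
$ kubectl rollout restart deployment/autohttps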