I’ve been using JH 0.8 (included via a requirements.yaml for my own service) on Google Cloud. I tried 0.9.0-beta.4 and have not been able to get HTTPS working after the upgrade. It looks like autohttps isn’t properly serving the ACME challenge on port 80 in the 0.9.0 betas?
All the logs look OK, except that the autohttps pod fails to complete the Let’s Encrypt HTTP challenge, timing out when trying to access .well-known/acme-challenge on port 80.
Which pod should be responding to the acme challenge, and what’s the path of loadbalancer/service/route that the request should be taking from public-proxy to that pod?
I notice the kube-lego pod(s) and service are no longer present; I’m guessing that autohttps is taking over this role?
Log from kubectl logs pod/autohttps-7b465f7b8b-lp5ww traefik -f gives:
time="2020-07-02T14:21:55Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-07-02T14:21:55Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-07-02T14:21:55Z" level=info msg="Starting provider *acme.Provider {\"email\":\"matthew.brett@gmail.com\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-07-02T14:21:55Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-07-02T14:21:55Z" level=info msg="Starting provider *traefik.Provider {}"
time="2020-07-02T14:22:11Z" level=error msg="Unable to obtain ACME certificate for domains \"uobhub.org\" : unable to generate a certificate for the domains [uobhub.org]: acme: Error -> One or more domains had a problem:\n[uobhub.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://uobhub.org/.well-known/acme-challenge/btuQKX8X9Q6RlJGzpIgN7wi9RsCDxB8luT7r6oI2IE0: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
time="2020-07-02T14:22:24Z" level=error msg="Unable to obtain ACME certificate for domains \"uobhub.org\" : unable to generate a certificate for the domains [uobhub.org]: acme: Error -> One or more domains had a problem:\n[uobhub.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://uobhub.org/.well-known/acme-challenge/btuQKX8X9Q6RlJGzpIgN7wi9RsCDxB8luT7r6oI2IE0: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
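The "Timeout during connect" error above means the Let’s Encrypt servers could not open a connection to port 80 on the domain. One way to check the same thing from a machine outside the cluster is a plain TCP connection test. This is only a minimal sketch: the function name can_connect is made up for illustration, and a successful connection only shows that the port is open, not that the challenge file is being served.

```python
import socket


def can_connect(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Timeouts, refused connections, and DNS failures all land here.
        return False
```

For example, can_connect("uobhub.org", 80) returning False while can_connect("uobhub.org", 443) returns True would point at something blocking port 80 specifically.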
I can make LetsEncrypt work on my own Mac - here’s the result of a LetsEncrypt certification running on my home machine: https://jupyterhub.dynevor.org/
Any suggestions of what I could try next to debug?
Unless you use a very recent version of the Helm chart (newer than 0.9.0), the autohttps pod can save a failed attempt into a k8s secret and get stuck in a bad state. Because of this, I suggest:
Verify that your domain points to the external IP shown by running kubectl get svc proxy-public.
Upgrade to a Helm chart version like 0.9.0-n116.h1c766a1 or newer to get an autohttps setup that avoids getting stuck in a corrupt state saved to a secret and reloaded on startup.
Delete both the secret named proxy-public-tls-acme and the autohttps pod.
If you did these in order and it still fails, try deleting the autohttps pod a few times. If it still fails:
Inspect logs of the autohttps pod
Inspect logs of proxy pod
Set helm chart configuration: debug.enabled: true and repeat for more details in logs.
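The steps above could be scripted roughly as follows. This is a sketch rather than a tested procedure: the helper names recovery_commands and run are hypothetical, the label selector component=autohttps is an assumption about how the chart labels the pod, and the commands are only printed unless dry_run=False is passed.

```python
import shlex
import subprocess
from typing import List


def recovery_commands(namespace: str) -> List[List[str]]:
    """Build the kubectl invocations for the suggested steps, in order."""
    ns = ["--namespace", namespace]
    return [
        # Show the external IP that the domain should point to.
        ["kubectl", "get", "svc", "proxy-public", *ns],
        # Remove the stored ACME state, then the pod, so it starts fresh.
        ["kubectl", "delete", "secret", "proxy-public-tls-acme", *ns],
        # ASSUMPTION: the autohttps pod carries the label component=autohttps.
        ["kubectl", "delete", "pod", "--selector", "component=autohttps", *ns],
        # Read the traefik container's logs from the replacement pod.
        ["kubectl", "logs", "deployment/autohttps", "--container", "traefik", *ns],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$ " + shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling run(recovery_commands("jhub")) prints the four commands for a namespace called jhub without touching the cluster.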
The autohttps pod is the one that should respond to the ACME challenge. When it is present, traffic is routed from the proxy-public svc to the autohttps pod, then to the proxy-http svc, then to the proxy pod, and finally to a destination that depends on the path (/hub to the hub pod, unknown paths to the hub pod, and /user to user pods if they have servers running).
The TLS termination is done by the autohttps pod, which now runs Traefik v2 using the LEGO ACME client library.
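To make the route concrete, here is a toy model of it in Python. It is purely illustrative: the function route and the hop labels are made up, and sending /user paths for users without a running server to the hub pod is an assumption based on "unknown paths go to the hub pod".

```python
from typing import List, Set

INGRESS_HOPS = ["proxy-public svc", "autohttps pod", "proxy-http svc", "proxy pod"]


def route(path: str, running_users: Set[str]) -> List[str]:
    """Return the hops a request for `path` passes through, in order."""
    if path.startswith("/.well-known/acme-challenge/"):
        # The ACME challenge is answered by the autohttps pod itself.
        return INGRESS_HOPS[:2]
    parts = path.strip("/").split("/")
    if parts[0] == "user" and len(parts) > 1 and parts[1] in running_users:
        destination = f"user pod ({parts[1]})"
    else:
        # /hub, unknown paths, and (ASSUMPTION) /user paths with no running server.
        destination = "hub pod"
    return INGRESS_HOPS + [destination]
```

In this model a challenge request never gets past the autohttps pod, so a timeout on .well-known/acme-challenge points at the path between the public IP and that pod rather than at the hub or proxy pods.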
Upcoming fix in Traefik’s use of LEGO
The need to restart the autohttps pod is caused by this issue that I opened with Traefik. They have successfully reproduced this issue and a PR is now open to resolve it.
time="2020-07-03T17:46:42Z" level=error msg="Unable to obtain ACME certificate for domains \"testing.uobhub.org\" : unable to generate a certificate for the domains [testing.uobhub.org]: error: one or more domains had a problem:\n[testing.uobhub.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://testing.uobhub.org/.well-known/acme-challenge/QfUNDgaKU_3dw_WvkDiPaAADbFAOciVMXCMG99nZCiI: Timeout during connect (likely firewall problem), url: \n" providerName=default.acme
Hmm - interesting - the exact same procedure also worked for my not-testing cluster - I had to delete the autohttps pod once … I wonder - could it be starting up before the external IP is assigned?
@Yasharth_Bajpai - maybe it’s worth posting the exact steps you took and their output, just in case you missed something, or I missed out a step in what I did?
For example I didn’t record the nslookup output, but it is correct, in that it matches my config.yaml and the output from kubectl get svc --namespace jhub-testing:
It could be one reason, but not the most common one, I think. I’m not sure whether Traefik retries after a while, but if it does, that would only delay the process until the next retry.
The key reason for this issue has been reported to Traefik, which sometimes ends up making multiple requests to the ACME server when only one should be made, and then responds to the wrong challenge. A fix is already on its way, and we will then update to the new version of Traefik, which avoids this issue when it uses LEGO as an ACME client interacting with Let’s Encrypt as an ACME server.
Hi all, just wanted to share that (a) I really appreciate all of you for posting here and in the associated github issue, and (b) there’s a really critical comment that’s a bit buried in the GitHub thread, which is not replicated here.
Since this Discourse thread is now linked from all over the web, I want to share that comment from @consideRatio explicitly below as well, in hopes it will save someone else the tremendous amounts of time I spent troubleshooting over the past days, before coming across the GitHub comment.
I think I’ve nailed it: I could make sure it didn’t occur by introducing a delay after the k8s Pod had been scheduled on a node and received an IP, either by pulling a new image on the node or by tweaking the startup command to run sleep 10 before starting Traefik as usual.
I’ve proposed a new feature for Traefik in traefik/traefik#8803. But for now, a workaround could be to redeploy by setting new image tags to force restarts of the autohttps pod. Example config:
proxy:
  traefik:
    image:
      # tag modified to trigger a restart of the autohttps pod
      # and induce a delay while downloading the image
      # that ensures networking gets setup in time
      # which allows the requested ACME challenge
      # where the Pod will receive inbound network traffic
      # can succeed.
      tag: 2.6.0 # default is 2.6.1
Another option is to edit the autohttps deployment like this.