Server becomes "stale" in a matter of days, even hours

I have noticed that our current (Z2JH-based) JupyterHub instance becomes unresponsive and single-user server spawn requests disappear in some K8s galaxy “black hole” (that is, never reach the hub pod). I assumed (based on kubectl describe output that it is because, for some reason, autohttps pod (or connectivity to it) becomes broken [liveness tests report 1 success and 3 errors]. To test my assumption, I have deleted relevant autohttps pod and, indeed, upon a new pod instantiation, the server responds to user requests as expected and spawns corresponding single-user servers. Have you seen such behavior and what optimal approach (or options) to prevent this issue would you suggest?

Also posted here: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1359.

Oh hmm…

This was new to me. I would suggest the use of cert-manager and nginx-ingress-controller instead of kube-lego as the optimal long term solution. I don’t know why there is a dependency for autohttps to function at all time, it is my understanding that it will simply renew certificates from time to time if needed. It apparently works in an interruptive manner somehow.

I setup cert-manager for use of a binderhub deployment, but there is a z2jh deployment nested within it, see the configuration options I provide in values.yaml.

Note that this deployment is a helm chart with dependencies nested within it like nginx-ingress and cert-manager that helps me accomplish everything needed.


This is not a fully functional setup right now, but this is the relevant parts extracted for such setup.

# default values for a meta-chart that has a requirements.yaml file
# that depends on:
# - jupyterhub chart
# - cert-manager chart (provides HTTPS certificates from letsencrypt)
# - nginx-ingress chart (routes traffic, use HTTPS certificates)

# Entries for the meta-chart
## needed by clusterissuer.yaml
letsencrypt:
  contactEmail: <email here>

# Default values: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/154913a496b6f8fbaa20f3a3e641c548832ddafa/jupyterhub/values.yaml
jupyterhub:
  proxy:
    service:
      type: ClusterIP
  ingress:
    enabled: true
    annotations:
      kubernetes.io/tls-acme: "true"
      kubernetes.io/ingress.class: nginx
    hosts:
      - myhub.example.local
    tls:
      - secretName: jupyterhub-tls
        hosts:
          - myhub.example.local

# Default values: https://github.com/helm/charts/blob/f2fdf5332eb722d0f260e71984e7bf8a9907c687/stable/nginx-ingress/values.yaml
nginx-ingress:
  controller:
    service:
      # I reserved an regional ip address in us-east1...
      # > gcloud compute addresses create <name here> --region us-east1
      # ... and inspected what address got ...
      # > gcloud compute addresses list
      # ... and assigned the IP to the field below
      loadBalancerIP: 35.190.132.172
    config:
      # Allow POSTs of upto 64MB, for large notebook support.
      proxy-body-size: 64m

# Default values: https://github.com/helm/charts/blob/f2fdf5332eb722d0f260e71984e7bf8a9907c687/stable/cert-manager/values.yaml
cert-manager:
  ingressShim:
    defaultIssuerName: "neurips"
    defaultIssuerKind: "ClusterIssuer"
    defaultACMEChallengeType: "http01"
# A clusterissuer.yaml helm template file of relevance provided with the custom helm chart
apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: neurips
  labels:
    helm.sh/chart: {{ include "neurips.chart" . }}
    app.kubernetes.io/name: {{ include "neurips.name" . }}
    app.kubernetes.io/managed-by: {{ .Release.Service }}
    app.kubernetes.io/instance: {{ .Release.Name }}
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    # production: https://acme-v02.api.letsencrypt.org/directory
    # staging:    https://acme-staging-v02.api.letsencrypt.org/directory
    email: {{ .Values.letsencrypt.contactEmail }}
    privateKeySecretRef:
      name: neurips-acme-key
    http01: {}
# NOTE: The dependent charts versions were updated as of 17 Aug 2019
# by looking at the latest version within the Chart.yaml file of the
# git repos.
dependencies:
  # https://github.com/helm/charts/tree/master/stable/nginx-ingress
  - name: nginx-ingress
    version: "1.16.0"
    repository: "https://kubernetes-charts.storage.googleapis.com"

  # https://github.com/helm/charts/tree/master/stable/cert-manager
  - name: cert-manager
    version: "v0.6.2"
    repository: "https://kubernetes-charts.storage.googleapis.com"

  # https://github.com/jupyterhub/jupyterhub
  # https://jupyterhub.github.io/helm-chart/
  - name: jupyterhub
    version: "0.9-470ec04"
    repository: "https://jupyterhub.github.io/helm-chart"
1 Like

@consideRatio I appreciate you sharing your thoughts and suggested solution. Since it is seemingly more complex (or, at least, more verbose) solution, I will wait a bit to hear other people’s thoughts and suggestions on the kube-lego approach. Since it is a default approach recommended in Z2JH documentation (and a less verbose one), I doubt that kube-lego does not support uninterrupted operations (otherwise, it should not be a default recommended approach in the documentation) and I would prefer to keep using it, if possible. Thank you, again!

1 Like

@minrk @yuvipanda Any thoughts on stability of kube-lego in the context of my question?

In continuation of this discussion, I have two questions, as follows.

  1. Current Z2JH documentation has Ingress section, with a kube-lego-based approach that seems to be a suggested solution: https://zero-to-jupyterhub.readthedocs.io/en/latest/advanced.html#ingress-and-automatic-https-with-kube-lego-let-s-encrypt. However, an older suggested solution, based on previous versions of the documentation (and/or advice on GitHub) is different from the currently recommended one. Currently, I have this in my config.yaml:
proxy:
  secretToken: "<SECRET>"
  type: ClusterIP
  https:
    enabled: true
    type: letsencrypt
    letsencrypt:
      contactEmail: <EMAIL_ADDR>
    hosts:
      - <FQDN>

whereas the current documentation recommends this:

ingress:
  annotations:
    kubernetes.io/tls-acme: "true"
  tls:
   - hosts:
      - <hostname>
     secretName: kubelego-tls-jupyterhub

So, relevant questions are: 1) Is one of these approaches preferable to the other and, if Yes, why and 2) could the former approach still to be expected to work with current Z2JH chart?

  1. A brief look at kube-lego GitHub repository reveals the following.

kube-lego is in maintenance mode only. There is no plan to support any new features. The latest Kubernetes release that kube-lego officially supports is 1.8 . The officially endorsed successor is cert-manager .

With this in mind, I’m wondering about whether current Z2JH Helm chart supports cert-manager-based ingress (if Yes, please point to relevant docs or resources) or (if No) whether such plans are on the near-term roadmap.

cc: @consideRatio @minrk @yuvipanda

Below is some additional information (logs from the hub pod), which has confused me even more. Because why, if the core issue is Preventing implicit spawn for ablekh because last spawn failed: Timeout[hub/server], deleting autohttps pod clears up the issue [ingress]… Clarifications will be much appreciated.

[E 2019-08-21 23:12:03.586 JupyterHub log:158] 500 GET /hub/user/ablekh/ (ablekh@10.244.1.1) 27.49ms
[I 2019-08-21 23:12:38.235 JupyterHub proxy:301] Checking routes
[I 2019-08-21 23:13:07.713 JupyterHub log:158] 302 GET / -> /hub (@10.244.1.1) 1.21ms
[I 2019-08-21 23:13:07.879 JupyterHub log:158] 302 GET /hub -> /hub/ (ablekh@10.244.1.1) 0.82ms
[I 2019-08-21 23:13:07.923 JupyterHub log:158] 302 GET /hub/ -> /user/ablekh/ (ablekh@10.244.1.1) 31.11ms
[I 2019-08-21 23:13:07.993 JupyterHub log:158] 302 GET /user/ablekh/ -> /hub/user/ablekh/ (@10.244.1.1) 0.94ms
[E 2019-08-21 23:13:08.253 JupyterHub base:1011] Preventing implicit spawn for ablekh because last spawn failed: Timeout
[E 2019-08-21 23:13:08.253 JupyterHub web:1788] Uncaught exception GET /hub/user/ablekh/ (10.244.1.1)
    HTTPServerRequest(protocol='https', host='<FQDN>', method='GET', uri='/hub/user/ablekh/', version='HTTP/1.1', remote_ip='10.244.1.1')
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/web.py", line 1699, in _execute
        result = await result
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 1013, in get
        raise copy.copy(exc).with_traceback(exc.__traceback__)
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 589, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 589, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 636, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 489, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 409, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

[E 2019-08-21 23:13:08.255 JupyterHub log:150] {
      "Cookie": "<CONTENT>",
      "Upgrade-Insecure-Requests": "1",
      "Dnt": "1",
      "Referer": "<REFERRER_FQDN>",
      "Accept-Encoding": "gzip, deflate, br",
      "Accept-Language": "en-US,en;q=0.5",
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
      "X-Scheme": "https",
      "X-Original-Uri": "/hub/user/ablekh/",
      "X-Forwarded-Proto": "https,http",
      "X-Forwarded-Port": "443,80",
      "X-Forwarded-Host": "FQDN",
      "X-Forwarded-For": "10.244.1.1,10.244.1.20",
      "X-Real-Ip": "10.244.1.1",
      "X-Request-Id": "8316450b12991d29bb7b64ddb87460cd",
      "Connection": "close",
      "Host": "FQDN"
    }
[E 2019-08-21 23:13:08.255 JupyterHub log:158] 500 GET /hub/user/ablekh/ (ablekh@10.244.1.1) 54.96ms

There’s an open PR with a docs update:

And in future Traefik may handle letsencrypt without an additional app:

1 Like

@manics Thank you so much for the update. Please share your opinion (if you have one) on the original issue. As I said above, the more I investigate this, the more confusing the issue is (it is not clear to me what is the exact relationship dynamics between components involved).

The fact that kube-lego is in maintenance mode isn’t nice but it works well and reliably. We have been using it with mybinder.org for a long time and haven’t had any problems. So I would not try and switch to cert-manager.

It seems very unlikely that HTTPS would influence internal communication between the hub and the singleuser pod it is trying to start.

Network connections sometimes not working sounds like a problem I’ve seen when a node gets cordoned. What happens is a node gets cordoned (for what ever reason) and then all pods that are on this node are not considered “ready” any more by the kubernetes Service objects. Sometimes we make a mistake when admin’ing mybinder.org and cordon all nodes on which ingress pods are running which shuts down all traffic to the cluster, but if you do a quick kubectl get pods all pods seem to be running and are happy.

Not sure this is exactly your issue, but it is the direction I would investigate.

It is very hard to come up with an answer or even a suggestion to your problem. What we need ideally is a minimal example to run and inspect ourselves. This is hard to do for this case! Without it people are limited to reading what is written and then (unless they happen to recognise it) speculate about what might or might not be the problem and maybe what the solution could be.

A rule I try to follow/strive for when asking questions is: could someone standing on public transport on their way home from work with a crappy old smartphone write a reply that gets me further towards debugging this? If the answer is no, I find, you are unlikely to receive good replies.

2 Likes

@betatim I appreciate your extended comment (and I hope that you didn’t write it “standing on public transport on their way home from work with a crappy old smartphone” :wink: ). Being a long-time member of several StackExchange sites (and a moderator of one of them), I certainly understand your point about minimal reproducible examples (MREs) and, in many cases, I do my best to follow that route. However, with everyone using Z2JH on different cloud providers’ infrastructure, it is unlikely that people would be willing to take time and effort to duplicate our environment [we use AKS]. Thus, I think that it is quite reasonable to hope that the amount of details that I usually share would paint a clear enough picture and, consequently, generate helpful enough replies (even if some/all of them are speculations).

Anyway, I very much appreciate your suggestions on potential issues and directions for further investigation. If I can provide any additional information (not sure what MRE would look like in this case) that might help you / others to provide better replies, please let me know and I will be happy to do that.

1 Like

My further investigation revealed some additional information, which I have posted here: https://github.com/jupyterhub/jupyterhub/issues/1677#issuecomment-524143230 (a potentially related JupyterHub issue). Also see two following comments …

1 Like

Nods. It is pretty tough to build MREs for this, I also don’t have a good idea on how to do it. The problem I see is that even with good MREs it is hard to get answers and without I think we stand very little chance. Which is bad because, well, things need debugging and people need help :slight_smile:

I think the fact that there are about as many cloud providers as there are active-in-this-forum deployers of JupyterHub doesn’t help us.

I reckon people read the original post and if they have an idea they will post, otherwise they won’t. Unless new information gets added by someone to the thread I’d consider it a case of “no one knows” unless there is a response early on.

For this kind of cloud setup MREs will remain an aspirational goal for a while but once we figure out how to do them we will make a lot of progress :slight_smile:

1 Like

@betatim I appreciate you sharing your thoughts. And I agree with what you have said. :slight_smile: We just need to continue doing our best to help each other and, in the meantime, hopefully, simplify the codebase (simple things do not break as often as more complicated ones, right?) as well as streamline the documentation as much as possible. On this issue, it seems that it not related to ingress at all, as I have originally suspected, but rather to mysterious and crazy delays when starting and stopping single-user servers. With the output I have provided in https://github.com/jupyterhub/jupyterhub/issues/1677, relevant JupyterHub core developers (hi, @minrk! :slight_smile: ) should be able to figure out what the problem is and how to fix it to everyone’s satisfaction.

By the way, I’m curious … Do you, guys, find that upgrading your Z2JH clusters to the latest (available for relevant vendor) K8s version always results in a more stable environment? Please share your experiences (I can submit this as a separate issue on GitHub, if preferred).

Note to self: below is the log output for a recent normal single-user server startup case (using Chrome in this case; I suspect potential browser dependency, e.g. session tokens or Web Sockets). I will verify my hypothesis by trying to access the cluster again after some time, using Chrome again …

[I 2019-08-24 02:09:42.939 JupyterHub log:158] 200 GET /hub/spawn (ablekh@10.244.1.1) 23.65ms
[W 2019-08-24 02:09:44.797 JupyterHub base:724] User ablekh is slow to start (timeout=0)
[I 2019-08-24 02:09:44.871 JupyterHub log:158] 302 POST /hub/spawn -> /user/ablekh/ (ablekh@10.244.1.1) 223.62ms
[I 2019-08-24 02:09:44.892 JupyterHub spawner:1659] PVC claim-ablekh already exists, so did not create new pvc.
[I 2019-08-24 02:09:44.925 JupyterHub log:158] 302 GET /user/ablekh/ -> /hub/user/ablekh/ (@10.244.1.1) 0.75ms
[I 2019-08-24 02:09:44.999 JupyterHub base:1022] Pending spawn for ablekh didn't finish in 0.0 seconds
[I 2019-08-24 02:09:45.000 JupyterHub base:1028] ablekh is pending spawn
[I 2019-08-24 02:09:45.001 JupyterHub log:158] 200 GET /hub/user/ablekh/ (ablekh@10.244.1.1) 26.28ms
[I 2019-08-24 02:10:15.865 JupyterHub proxy:301] Checking routes
[I 2019-08-24 02:10:22.682 JupyterHub log:158] 200 GET /hub/api (@10.244.1.31) 1.03ms
[I 2019-08-24 02:10:25.511 JupyterHub base:638] User ablekh took 40.840 seconds to start
[I 2019-08-24 02:10:25.512 JupyterHub proxy:242] Adding user ablekh to proxy /user/ablekh/ => http://10.244.1.31:8888
[I 2019-08-24 02:10:25.515 JupyterHub users:533] Server ablekh is ready
[I 2019-08-24 02:10:25.516 JupyterHub log:158] 200 GET /hub/api/users/ablekh/server/progress (ablekh@10.244.1.1) 40402.11ms
[I 2019-08-24 02:10:25.653 JupyterHub log:158] 302 GET /hub/user/ablekh/ -> /user/ablekh/?redirects=1 (ablekh@10.244.1.1) 59.69ms
[I 2019-08-24 02:10:25.904 JupyterHub log:158] 302 GET /hub/api/oauth2/authorize?client_id=jupyterhub-user-ablekh&redirect_uri=%2Fuser%2Fablekh%2Foauth_callback&response_type=code&state=[secret] -> /user/ablekh/oauth_callback?code=[secret]&state=[secret] (ablekh@10.244.1.1) 50.23ms
[I 2019-08-24 02:10:26.063 JupyterHub log:158] 200 POST /hub/api/oauth2/token (ablekh@10.244.1.31) 57.44ms
[I 2019-08-24 02:10:26.146 JupyterHub log:158] 200 GET /hub/api/authorizations/token/[secret] (ablekh@10.244.1.31) 56.18ms

Additional suspicious moment: after I stop the server from UI, relevant page displays “Start My Server” button, which implies that the server has been stopped. However, checking pods from CLI results in the output, indicating that the server (jupyter-ablekh) is still running. Clearly, this seems to be a case of some kind of mis-synchronization or a similar issue …

ablekh@mgmt:~$ kubectl get pods --namespace=sbdh-jh-v2-3
NAME                            READY   STATUS    RESTARTS   AGE
autohttps-69bf56889c-bnzwg      2/2     Running   0          14h
continuous-image-puller-lj68z   1/1     Running   0          11d
hook-image-puller-66vt7         1/1     Running   0          11d
hub-5889b46768-f67hx            1/1     Running   1          11d
jupyter-ablekh                  1/1     Running   0          14m
proxy-597c8cc768-qj4hg          1/1     Running   0          11d

UPDATE: after several minutes, refreshing UI in both Chrome and (in another tab) Firefox results in now displaying “My Server” button, clicking on which generates a page with the following content:

Your server is stopping.

You will be able to start it again once it has finished stopping.

(circular progress indicator / icon)

“Refresh” button

In the meantime, there are multiple following messages in the logs output:

[I 2019-08-24 02:33:02.101 JupyterHub base:1028] ablekh is pending stop

So, in addition to suspected mis-synchronization, there is a weird delay in stopping server.

What should we be looking for in this log snippet? It looks normal to me. The fact that the single user pod took 40s to start isn’t by itself a problem. For example it could be because the image for the pod had to be pulled from a registry.

[I 2019-08-24 02:09:44.999 JupyterHub base:1022] Pending spawn for ablekh didn't finish in 0.0 seconds

This line is easy to misunderstand as if there is a problem but actually it is working “as intended”. Maybe a PR to update what is logged when timeout=0 would be good. If you search the forum or GitHub a bit you should find a post where I tried to explain what happens/why the log message “makes sense”.

The button in the control panel changes state immediately. I don’t think it has a way of knowing what the actual state is. This is why if you click it immediately after stopping a server you get the message “pod is pending stop, can’t start it now”. Making the button “dynamic” for all possible spawners would be great but probably a few days work.

What I’d investigate is what kubectl get pods says right after you click stop. Does the status of the user pod change from running to something else? What is in the logs of the single user pod?

If clicking the button leads to log messages in the hub like “blah is pending stop” or “stopping pod” but the kubernetes status of the pod isn’t changing I’d start suspecting that there is a problem with talking to the kubernetes API or that the k8s controllers are broken. This is now a super specific problem. Either with AWS or with how you setup the cluster or modifications you made to the cluster.