Spawn failed: Timeout even when start_timeout is set to 3600 seconds

Hi,

I’m running JupyterHub on an Azure cluster with node autoscaling enabled. When starting a new notebook server, it takes a while until the node is spun up. I’ve set start_timeout to 3600 seconds (and even http_timeout to 3600), yet after 5 minutes the UI errors out with

Spawn failed: Timeout

In the logs I’m seeing the following:

[I 2021-02-24 18:42:40.159 JupyterHub log:181] 200 GET /hub/api/users (cull-idle@127.0.0.1) 19.70ms
[I 2021-02-24 18:43:40.071 JupyterHub proxy:320] Checking routes
[I 2021-02-24 18:44:40.071 JupyterHub proxy:320] Checking routes
[I 2021-02-24 18:45:40.070 JupyterHub proxy:320] Checking routes
[I 2021-02-24 18:46:40.071 JupyterHub proxy:320] Checking routes
[I 2021-02-24 18:47:40.071 JupyterHub proxy:320] Checking routes
[I 2021-02-24 18:48:40.071 JupyterHub proxy:320] Checking routes
[E 2021-02-24 18:48:57.191 JupyterHub spawner:1926] Pod jupyter-sam never showed up in reflector, restarting pod reflector
[I 2021-02-24 18:48:57.222 JupyterHub reflector:201] watching for pods with label selector='component=singleuser-server' in namespace jupyterhub2
[E 2021-02-24 18:48:57.223 JupyterHub gen:623] Exception in Future <Task finished name='Task-11690' coro=<KubeSpawner._start() done, defined at /usr/local/lib/python3.8/dist-packages/kubespawner/spawner.py:1871> exception=TimeoutError('pod/jupyter-sam did not start in 3600 seconds!')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/kubespawner/spawner.py", line 1917, in _start
        await exponential_backoff(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 179, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: pod/jupyter-sam did not start in 3600 seconds!

So the message TimeoutError: pod/jupyter-sam did not start in 3600 seconds! is misleading, because the failure actually happens after only 5 minutes.

What am I doing wrong?

Sam

I tried this but it’s not helping: kubespawner: pending spawn in loop until urllib3.exceptions.ReadTimeoutError generated · Issue #382 · jupyter/jupyter · GitHub

I increased start_timeout even further and I’m still getting this misleading error:

sam's server failed to start in 999999 seconds, giving up

Hi! Could you show us your full Z2JH configuration with secrets redacted?

Here it is:

# a token randomly generated for this instance of jupyterhub
# generated with openssl rand -hex 32
# more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/installation.html
proxy:
  secretToken: "<redacted>"

# use ingress
# more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/administrator/advanced.html#ingress
ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt
    ingress.kubernetes.io/proxy-body-size: 4000m
    nginx.ingress.kubernetes.io/proxy-body-size: 4000m
  hosts:
    - <redacted>
  tls:
   - hosts:
      - <redacted>
     secretName: main-tls

# use google SSO as login system
# more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/administrator/authentication.html#google
# note: this has been reworked in chart version 0.11.0+. It used to be under the 'auth' parameter.
hub:
  config:
    GoogleOAuthenticator:
      client_id: <redacted>.apps.googleusercontent.com
      client_secret: <redacted>
      oauth_callback_url: https://<redacted>/hub/oauth_callback
      hosted_domain:
        - <redacted>
      login_service: Google # this is the string shown in the UI where it says "Sign in with <login_service>"
    JupyterHub:
      authenticator_class: google
    # make some people admin (they can kill others' notebooks):
    # more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/customizing/user-management.html#admin-users
    Authenticator:
      admin_users:
        - <redacted>

# Jupyter was first set up for a single user with a single notebook; that's why this section is still called singleuser
singleuser:
  defaultUrl: "/lab" # boot into the jupyterlab UI instead of the classic UI
  storage:
    capacity: 50Gi # give each user a PVC of 50Gi

  # choose a default Docker image:
  # this is the default notebook image if profileList is not specified or if the image inside a profileList entry is not overridden
  # more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/customizing/user-environment.html#choose-and-use-an-existing-docker-image
  image:
    name: jupyter/minimal-notebook
    tag: latest

  # add multiple profiles so users can choose their Docker image:
  # more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/customizing/user-environment.html#using-multiple-profiles-to-let-users-select-their-environment
  # this makes use of KubeSpawner, which has its own documentation which can be found here https://jupyterhub-kubespawner.readthedocs.io/en/latest/spawner.html#module-kubespawner.spawner
  profileList:
    - display_name: "Minimal environment"
      description: "To avoid too much bells and whistles: Python."
      default: true
      # no image is defined here, so the default notebook will spawn, see above

    - display_name: "Datascience environment"
      description: "If you want the additional bells and whistles: Python, R, and Julia."
      kubespawner_override:
        image: jupyter/datascience-notebook:latest

    - display_name: "Spark environment"
      description: "The Jupyter Stacks spark image!"
      kubespawner_override:
        image: jupyter/all-spark-notebook:latest

    - display_name: "GPU-Jupyter (GPU-based)"
      description: "Latest Tensorflow and Pytorch based on CUDA 10.1 More info at https://github.com/iot-salzburg/gpu-jupyter"
      kubespawner_override:
        image: cschranz/gpu-jupyter:v1.3_cuda-10.1_ubuntu-18.04_python-only
        extra_resource_limits:
          nvidia.com/gpu: 1
        # tolerations needed because our GPU nodepool has these taints
        tolerations:
          - key: nvidia.com/gpu
            effect: NoSchedule
            value: present
        # increasing timeouts as it can take a while for a node to spin up on Azure (doesn't help):
        start_timeout: 3600
        http_timeout: 3600

It’s only a problem when we start the GPU-based server, because then Azure has to spin up a node in the background, which takes a while.

The node stays alive for 10 minutes so if we retry within 10 minutes, it does work.

If you set a spawner-wide start_timeout instead of setting it in the profile, is it respected? If so, that implies it’s a problem with how the profile overrides are handled.

Can you try turning on debug logging and show us the full logs? The extra debug messages should help track down the issue.

Setting the start_timeout on a global level with

singleuser:
  # Setting a global start_timeout instead of 'per profile'
  start_timeout: 3600

did not help; the error looks a bit different, though:

[W 2021-03-24 10:32:20.725 JupyterHub user:681] sam's server failed to start in 3600 seconds, giving up
[I 2021-03-24 10:32:20.725 JupyterHub spawner:1997] Deleting pod jupyter-sam
[I 2021-03-24 10:32:47.748 JupyterHub proxy:319] Checking routes
[E 2021-03-24 10:33:31.287 JupyterHub gen:623] Exception in Future <Task finished name='Task-59436' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py:895> exception=TimeoutError('Timeout')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py", line 902, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 707, in spawn
        raise e
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 606, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

Will try again with

debug:
  enabled: true

The extra debug messages don’t say much:

[D 2021-03-24 13:11:09.189 JupyterHub user:602] Calling Spawner.start for sam
[D 2021-03-24 13:11:09.190 JupyterHub spawner:2133] Applying KubeSpawner override for profile 'GPU-Jupyter (GPU-based)'
[D 2021-03-24 13:11:09.190 JupyterHub spawner:2140] .. overriding KubeSpawner value extra_resource_limits={'nvidia.com/gpu': 1}
[D 2021-03-24 13:11:09.190 JupyterHub spawner:2140] .. overriding KubeSpawner value http_timeout=3600
[D 2021-03-24 13:11:09.190 JupyterHub spawner:2140] .. overriding KubeSpawner value image=cschranz/gpu-jupyter:v1.3_cuda-10.1_ubuntu-18.04_python-only
[D 2021-03-24 13:11:09.190 JupyterHub spawner:2140] .. overriding KubeSpawner value start_timeout=3600
[D 2021-03-24 13:11:09.190 JupyterHub spawner:2140] .. overriding KubeSpawner value tolerations=[{'effect': 'NoSchedule', 'key': 'nvidia.com/gpu', 'value': 'present'}]
[I 2021-03-24 13:11:09.191 JupyterHub spawner:1875] Attempting to create pvc claim-sam, with timeout 3
[I 2021-03-24 13:11:09.192 JupyterHub log:181] 302 POST /hub/spawn -> /hub/spawn-pending/sam (sam@10.244.0.1) 87.66ms
[I 2021-03-24 13:11:09.219 JupyterHub pages:402] sam is pending spawn
[I 2021-03-24 13:11:09.222 JupyterHub log:181] 200 GET /hub/spawn-pending/sam (sam@10.244.0.1) 4.58ms
[I 2021-03-24 13:11:09.238 JupyterHub spawner:1887] PVC claim-sam already exists, so did not create new pvc.
[I 2021-03-24 13:11:09.241 JupyterHub spawner:1840] Attempting to create pod jupyter-sam, with timeout 3
[D 2021-03-24 13:11:09.420 JupyterHub spawner:1711] progress generator: jupyter-sam
[D 2021-03-24 13:11:09.899 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.69ms
[D 2021-03-24 13:11:11.899 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.61ms
[D 2021-03-24 13:11:13.718 JupyterHub reflector:275] pods watcher timeout
[D 2021-03-24 13:11:13.719 JupyterHub reflector:206] Connecting pods watcher
[D 2021-03-24 13:11:13.738 JupyterHub reflector:275] events watcher timeout
[D 2021-03-24 13:11:13.738 JupyterHub reflector:206] Connecting events watcher
[D 2021-03-24 13:11:13.899 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.58ms
[D 2021-03-24 13:11:15.894 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.52ms
[D 2021-03-24 13:11:15.898 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.31ms
... last 7 lines are repeated several times here ...
[D 2021-03-24 13:16:04.074 JupyterHub reflector:275] events watcher timeout
[D 2021-03-24 13:16:04.074 JupyterHub reflector:206] Connecting events watcher
[D 2021-03-24 13:16:04.112 JupyterHub reflector:275] pods watcher timeout
[D 2021-03-24 13:16:04.112 JupyterHub reflector:206] Connecting pods watcher
[D 2021-03-24 13:16:05.894 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.65ms
[D 2021-03-24 13:16:05.898 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.31ms
[D 2021-03-24 13:16:07.899 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.54ms
[W 2021-03-24 13:16:09.191 JupyterHub user:681] sam's server failed to start in 3600 seconds, giving up
[D 2021-03-24 13:16:09.191 JupyterHub user:790] Stopping sam
[I 2021-03-24 13:16:09.192 JupyterHub spawner:1997] Deleting pod jupyter-sam
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /api/v1/namespaces/jupyterhub2/pods/jupyter-sam?gracePeriodSeconds=1
[D 2021-03-24 13:16:09.244 JupyterHub user:816] Deleting oauth client jupyterhub-user-sam
[D 2021-03-24 13:16:09.267 JupyterHub user:819] Finished stopping sam
[E 2021-03-24 13:16:09.286 JupyterHub gen:623] Exception in Future <Task finished name='Task-11564' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py:895> exception=TimeoutError('Timeout')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py", line 902, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 707, in spawn
        raise e
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 606, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout
    
[I 2021-03-24 13:16:09.288 JupyterHub log:181] 200 GET /hub/api/users/sam/server/progress (sam@10.244.0.1) 299871.08ms
[D 2021-03-24 13:16:09.899 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.52ms
[D 2021-03-24 13:16:11.899 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.50ms
[D 2021-03-24 13:16:13.899 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.63ms
[D 2021-03-24 13:16:14.085 JupyterHub reflector:275] events watcher timeout
[D 2021-03-24 13:16:14.085 JupyterHub reflector:206] Connecting events watcher
[D 2021-03-24 13:16:14.123 JupyterHub reflector:275] pods watcher timeout
[D 2021-03-24 13:16:14.123 JupyterHub reflector:206] Connecting pods watcher

Can you remove the start_timeout override from the profile and set:

singleuser:
  startTimeout: 3600

I suspect the issue is that the line in JupyterHub where the timeout starts counting runs before the point in KubeSpawner.start() where the profile override is allowed to change options.

Then spawner.start_timeout is read again for the log message when the timeout is reached, on the assumption that spawner.start() didn’t change the timeout for start itself. I suspect that’s the source of the confusion, leading to this sequence of events (see the sketch below):

  1. the default spawner.start_timeout is used as the real timeout for Spawner.start()
  2. spawner.start_timeout is changed during start, while the timer is already running, so the change has no effect
  3. the original timeout is reached, and the log message reports the updated, but not actually used, spawner.start_timeout
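
To make the ordering concrete, here is a minimal standalone sketch. This is not the real JupyterHub/KubeSpawner code, just a toy illustration of the race, with the numbers scaled down (3 stands in for the 300-second default, 36 for the 3600-second override) so it runs quickly:

import asyncio

class FakeSpawner:
    start_timeout = 3            # stands in for the 300s spawner-wide default

    async def start(self):
        # the profile's kubespawner_override is applied in here,
        # after the caller has already armed the timer
        self.start_timeout = 36  # stands in for the 3600s profile override
        await asyncio.sleep(10)  # stands in for the slow GPU-node scale-up

async def spawn(spawner):
    try:
        # the timeout is captured once, before start() runs: still 3
        await asyncio.wait_for(spawner.start(), timeout=spawner.start_timeout)
    except asyncio.TimeoutError:
        # the failure message re-reads start_timeout, which start() has since
        # bumped, so it reports 36 even though 3 is what actually fired
        print(f"server failed to start in {spawner.start_timeout} seconds, giving up")

asyncio.run(spawn(FakeSpawner()))

That matches your earlier logs: the message says 3600 seconds, but the failure arrives after the 300-second default.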

So I’ve only set this as requested:

singleuser:
  start_timeout: 3600
  http_timeout: 3600

Seems like it is picked up by the config:

[I 2021-04-01 21:12:38.761 JupyterHub service:339] Starting service 'cull-idle': ['python3', '-m', 'jupyterhub_idle_culler', '--url=http://localhost:8081/hub/api', '--timeout=3600', '--cull-every=600', '--concurrency=10']
[I 2021-04-01 21:12:38.762 JupyterHub service:121] Spawning python3 -m jupyterhub_idle_culler --url=http://localhost:8081/hub/api --timeout=3600 --cull-every=600 --concurrency=10

But it still uses 300 seconds (5 minutes) instead of 3600 seconds. You can see it in the logs:

[D 2021-04-01 22:03:41.088 JupyterHub base:879] 0/64 concurrent spawns
[D 2021-04-01 22:03:41.088 JupyterHub base:884] 0 active servers
[D 2021-04-01 22:03:41.158 JupyterHub user:602] Calling Spawner.start for sam
[D 2021-04-01 22:03:41.158 JupyterHub spawner:2133] Applying KubeSpawner override for profile 'GPU-Jupyter (GPU-based)'
[D 2021-04-01 22:03:41.158 JupyterHub spawner:2140] .. overriding KubeSpawner value extra_resource_limits={'nvidia.com/gpu': 1}
[D 2021-04-01 22:03:41.158 JupyterHub spawner:2140] .. overriding KubeSpawner value image=cschranz/gpu-jupyter:v1.3_cuda-10.1_ubuntu-18.04_python-only
[D 2021-04-01 22:03:41.158 JupyterHub spawner:2140] .. overriding KubeSpawner value tolerations=[{'effect': 'NoSchedule', 'key': 'nvidia.com/gpu', 'value': 'present'}]
[I 2021-04-01 22:03:41.159 JupyterHub spawner:1875] Attempting to create pvc claim-sam, with timeout 3
[I 2021-04-01 22:03:41.160 JupyterHub log:181] 302 POST /hub/spawn -> /hub/spawn-pending/sam (sam@10.244.0.1) 89.17ms
[D 2021-04-01 22:03:41.161 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 1.77ms
[I 2021-04-01 22:03:41.198 JupyterHub spawner:1887] PVC claim-sam already exists, so did not create new pvc.
[I 2021-04-01 22:03:41.200 JupyterHub spawner:1840] Attempting to create pod jupyter-sam, with timeout 3
[I 2021-04-01 22:03:41.208 JupyterHub pages:402] sam is pending spawn
[I 2021-04-01 22:03:41.209 JupyterHub log:181] 200 GET /hub/spawn-pending/sam (sam@10.244.0.1) 2.82ms
[D 2021-04-01 22:03:41.375 JupyterHub spawner:1711] progress generator: jupyter-sam
[D 2021-04-01 22:03:41.675 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.49ms
[D 2021-04-01 22:03:43.109 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.54ms
[D 2021-04-01 22:03:45.109 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.58ms
[D 2021-04-01 22:03:45.268 JupyterHub reflector:275] pods watcher timeout
[D 2021-04-01 22:03:45.268 JupyterHub reflector:206] Connecting pods watcher
[D 2021-04-01 22:03:45.996 JupyterHub reflector:275] events watcher timeout
[D 2021-04-01 22:03:45.996 JupyterHub reflector:206] Connecting events watcher
...
[D 2021-04-01 22:08:37.109 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.53ms
[D 2021-04-01 22:08:38.772 JupyterHub proxy:795] Proxy: Fetching GET http://proxy-api:8001/api/routes
[I 2021-04-01 22:08:38.774 JupyterHub proxy:319] Checking routes
[D 2021-04-01 22:08:39.109 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.46ms
[D 2021-04-01 22:08:41.109 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.48ms
[W 2021-04-01 22:08:41.159 JupyterHub user:681] sam's server failed to start in 300 seconds, giving up
[D 2021-04-01 22:08:41.159 JupyterHub user:790] Stopping sam
[I 2021-04-01 22:08:41.159 JupyterHub spawner:1997] Deleting pod jupyter-sam
[E 2021-04-01 22:08:41.640 JupyterHub gen:623] Exception in Future <Task finished name='Task-3952' coro=<KubeSpawner._start() done, defined at /usr/local/lib/python3.8/dist-packages/kubespawner/spawner.py:1906> exception=TimeoutError('pod/jupyter-sam did not start in 300 seconds!')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/kubespawner/spawner.py", line 1952, in _start
        await exponential_backoff(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 183, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: pod/jupyter-sam did not start in 300 seconds!
    
[D 2021-04-01 22:08:41.675 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.60ms
...
[D 2021-04-01 22:09:39.109 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.51ms
[D 2021-04-01 22:09:41.110 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.52ms
[D 2021-04-01 22:09:41.675 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.49ms
[D 2021-04-01 22:09:43.110 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.56ms
[D 2021-04-01 22:09:44.923 JupyterHub user:816] Deleting oauth client jupyterhub-user-sam
[D 2021-04-01 22:09:44.945 JupyterHub user:819] Finished stopping sam
[E 2021-04-01 22:09:44.961 JupyterHub gen:623] Exception in Future <Task finished name='Task-3951' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py:895> exception=TimeoutError('Timeout')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py", line 902, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 707, in spawn
        raise e
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 606, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout
    
[I 2021-04-01 22:09:44.962 JupyterHub log:181] 200 GET /hub/api/users/sam/server/progress (sam@10.244.0.1) 363589.21ms
[D 2021-04-01 22:09:45.109 JupyterHub log:181] 200 GET /hub/health (@10.244.0.1) 0.47ms
[D 2021-04-01 22:09:45.612 JupyterHub reflector:275] pods watcher timeout

This configuration is not being picked up because you have snake_case where the helm chart only understands camelCase. Use:

singleuser:
  startTimeout: 3600

The logs you found are from the idle culler and are not related to the startup time. The relevant line is this one:

sam's server failed to start in 300 seconds, giving up

which shows that the default of 300 seconds is being used.
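
For reference, the camelCase chart key is translated into the c.KubeSpawner.start_timeout traitlet for you. If you ever want to set a traitlet the chart doesn't expose directly, hub.extraConfig lets you do that yourself (the "timeouts" label below is an arbitrary name for the snippet, and the values are just examples):

singleuser:
  startTimeout: 3600

hub:
  extraConfig:
    timeouts: |
      c.KubeSpawner.start_timeout = 3600
      c.KubeSpawner.http_timeout = 3600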

Thanks @minrk.

This finally did it!!!

Thanks again.

Hello, can you help me check my problem? I’m also running JupyterHub on an AKS cluster deployed on Azure, but when I start user pods, some of them fail to start: no pod is created on Kubernetes, and after 300 seconds I get this error:

[E 2021-04-09 06:08:01.147 JupyterHub gen:599] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/handlers/base.py:845> exception=TimeoutError('Timeout',)> after timeout
    Traceback (most recent call last):
      File "/home/jovyan/.local/lib/python3.6/site-packages/tornado/gen.py", line 593, in error_callback
        future.result()
      File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/handlers/base.py", line 852, in finish_user_spawn
        await spawn_future
      File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/user.py", line 656, in spawn
        raise e
      File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/user.py", line 560, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

@minrk I am having a similar issue, with the difference that startTimeout doesn’t seem to apply to pod creation: the timeout shows up as 3 while the spawner is “attempting to create pod”.

The failure message, however, does show the value of startTimeout (30000 in my case), yet the failure happens within 30 seconds of requesting the server. Is there a way to increase the timeout of 3 for “attempting to create pod”? My cluster has a lot of resources, so pod creation and related steps can take up to a minute at times.
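
In case it’s relevant, the knob I’m guessing at (based on my reading of the KubeSpawner docs, not something I’ve confirmed fixes this) is KubeSpawner’s per-request Kubernetes API timeout rather than startTimeout, via something like:

hub:
  extraConfig:
    apiTimeouts: |
      # per-request timeout for Kubernetes API calls (default 3s) and the
      # overall retry window around them (default 30s); values here are
      # just examples I'm considering
      c.KubeSpawner.k8s_api_request_timeout = 30
      c.KubeSpawner.k8s_api_request_retry_timeout = 120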

I see the following in the hub logs

[D 2023-04-25 17:28:31.789 JupyterHub user:430] Creating <class 'kubespawner.spawner.KubeSpawner'> for user3:
[D 2023-04-25 17:28:31.792 JupyterHub pages:213] Triggering spawn with default options for user3
[D 2023-04-25 17:28:31.792 JupyterHub base:934] Initiating spawn for user3
[D 2023-04-25 17:28:31.792 JupyterHub base:938] 0/64 concurrent spawns
[D 2023-04-25 17:28:31.792 JupyterHub base:943] 0 active servers
[I 2023-04-25 17:28:31.803 JupyterHub provider:651] Creating oauth client jupyterhub-user-user3
[D 2023-04-25 17:28:31.821 JupyterHub user:743] Calling Spawner.start for user3
[I 2023-04-25 17:28:31.822 JupyterHub spawner:2509] Attempting to create pvc claim-user3, with timeout 3
[I 2023-04-25 17:28:31.823 JupyterHub log:186] 302 GET /hub/spawn/user3 -> /hub/spawn-pending/user3 (user3@100.64.31.165) 36.76ms
[D 2023-04-25 17:28:31.844 JupyterHub scopes:796] Checking access via scope servers
[D 2023-04-25 17:28:31.844 JupyterHub scopes:623] Argument-based access to /hub/spawn-pending/user3 via servers
[I 2023-04-25 17:28:31.844 JupyterHub pages:394] user3 is pending spawn
[D 2023-04-25 17:28:31.845 JupyterHub log:186] 304 GET /hub/spawn-pending/user3 (user3@100.64.31.165) 3.24ms
[I 2023-04-25 17:28:31.846 JupyterHub spawner:2525] PVC claim-user3 already exists, so did not create new pvc.
[I 2023-04-25 17:28:31.851 JupyterHub spawner:2469] Attempting to create pod jupyter-user3, with timeout 3
[D 2023-04-25 17:28:31.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.55ms
[D 2023-04-25 17:28:31.928 JupyterHub scopes:796] Checking access via scope read:servers
[D 2023-04-25 17:28:31.928 JupyterHub scopes:623] Argument-based access to /hub/api/users/user3/server/progress via read:servers
[D 2023-04-25 17:28:31.929 JupyterHub spawner:2308] progress generator: jupyter-user3
[D 2023-04-25 17:28:32.901 JupyterHub reflector:362] pods watcher timeout
[D 2023-04-25 17:28:32.902 JupyterHub reflector:281] Connecting pods watcher
[D 2023-04-25 17:28:33.541 JupyterHub reflector:362] events watcher timeout
[D 2023-04-25 17:28:33.541 JupyterHub reflector:281] Connecting events watcher
[D 2023-04-25 17:28:33.859 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.94ms
[D 2023-04-25 17:28:33.859 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.92ms
[I 2023-04-25 17:28:34.983 JupyterHub spawner:2469] Attempting to create pod jupyter-user3, with timeout 3
[D 2023-04-25 17:28:35.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.65ms
[D 2023-04-25 17:28:37.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.61ms
[I 2023-04-25 17:28:38.181 JupyterHub spawner:2469] Attempting to create pod jupyter-user3, with timeout 3
[D 2023-04-25 17:28:39.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.72ms
[I 2023-04-25 17:28:41.730 JupyterHub spawner:2469] Attempting to create pod jupyter-user3, with timeout 3
[D 2023-04-25 17:28:41.859 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.69ms
[D 2023-04-25 17:28:42.913 JupyterHub reflector:362] pods watcher timeout
[D 2023-04-25 17:28:42.913 JupyterHub reflector:281] Connecting pods watcher
[D 2023-04-25 17:28:43.566 JupyterHub reflector:362] events watcher timeout
[D 2023-04-25 17:28:43.566 JupyterHub reflector:281] Connecting events watcher
[D 2023-04-25 17:28:43.859 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.92ms
[D 2023-04-25 17:28:43.859 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.81ms
[D 2023-04-25 17:28:45.859 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.66ms
[I 2023-04-25 17:28:46.027 JupyterHub spawner:2469] Attempting to create pod jupyter-user3, with timeout 3
[D 2023-04-25 17:28:47.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.70ms
[D 2023-04-25 17:28:49.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.63ms
[I 2023-04-25 17:28:51.422 JupyterHub spawner:2469] Attempting to create pod jupyter-user3, with timeout 3
[D 2023-04-25 17:28:51.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.67ms
[D 2023-04-25 17:28:52.922 JupyterHub reflector:362] pods watcher timeout
[D 2023-04-25 17:28:52.923 JupyterHub reflector:281] Connecting pods watcher
[D 2023-04-25 17:28:53.587 JupyterHub reflector:362] events watcher timeout
[D 2023-04-25 17:28:53.588 JupyterHub reflector:281] Connecting events watcher
[D 2023-04-25 17:28:53.859 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.94ms
[D 2023-04-25 17:28:53.859 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.89ms
[D 2023-04-25 17:28:55.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.69ms
[D 2023-04-25 17:28:57.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.63ms
[I 2023-04-25 17:28:57.899 JupyterHub spawner:2469] Attempting to create pod jupyter-user3, with timeout 3
[D 2023-04-25 17:28:59.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.63ms
[D 2023-04-25 17:29:01.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.67ms
[I 2023-04-25 17:29:02.208 JupyterHub spawner:2469] Attempting to create pod jupyter-user3, with timeout 3
[D 2023-04-25 17:29:02.934 JupyterHub reflector:362] pods watcher timeout
[D 2023-04-25 17:29:02.934 JupyterHub reflector:281] Connecting pods watcher
[D 2023-04-25 17:29:03.613 JupyterHub reflector:362] events watcher timeout
[D 2023-04-25 17:29:03.614 JupyterHub reflector:281] Connecting events watcher
[D 2023-04-25 17:29:03.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.81ms
[D 2023-04-25 17:29:03.858 JupyterHub log:186] 200 GET /hub/health (@100.64.24.192) 0.82ms
[W 2023-04-25 17:29:05.211 JupyterHub user:824] user3's server failed to start in 30000 seconds, giving up.
    
    Common causes of this timeout, and debugging tips:
    
    1. Everything is working, but it took too long.
       To fix: increase `Spawner.start_timeout` configuration
       to a number of seconds that is enough for spawners to finish starting.
    2. The server didn't finish starting,
       or it crashed due to a configuration issue.
       Check the single-user server's logs for hints at what needs fixing.
    
[D 2023-04-25 17:29:05.211 JupyterHub user:930] Stopping user3
[D 2023-04-25 17:29:05.216 JupyterHub user:950] Deleting oauth client jupyterhub-user-user3
[D 2023-04-25 17:29:05.224 JupyterHub user:953] Finished stopping user3
[W 2023-04-25 17:29:05.233 JupyterHub base:1030] 3 consecutive spawns failed.  Hub will exit if failure count reaches 5 before succeeding
[E 2023-04-25 17:29:05.233 JupyterHub gen:630] Exception in Future <Task finished name='Task-1601' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.9/site-packages/jupyterhub/handlers/base.py:954> exception=TimeoutError('Could not create pod dev-jupyterhub/jupyter-user3')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/tornado/gen.py", line 625, in error_callback
        future.result()
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/handlers/base.py", line 961, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/user.py", line 850, in spawn
        raise e
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/user.py", line 747, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
      File "/usr/local/lib/python3.9/site-packages/kubespawner/spawner.py", line 2663, in _start
        await exponential_backoff(
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/utils.py", line 236, in exponential_backoff
        raise asyncio.TimeoutError(fail_message)
    asyncio.exceptions.TimeoutError: Could not create pod dev-jupyterhub/jupyter-user3
    
[I 2023-04-25 17:29:05.234 JupyterHub log:186] 200 GET /hub/api/users/user3/server/progress (user3@100.64.31.165) 33307.70ms
[D 2023-04-25 17:29:05.325 JupyterHub proxy:884] Proxy: Fetching GET http://proxy-api:8001/api/routes
[D 2023-04-25 17:29:05.327 JupyterHub proxy:395] Checking routes
[I 2023-04-25 17:29:05.495 JupyterHub log:186] 200 GET /hub/api/ (jupyterhub-idle-culler@127.0.0.1) 7.92ms
[D 2023-04-25 17:29:05.498 JupyterHub scopes:796] Checking access via scope list:users

I’m also experiencing a spawn that just hangs at Pending, even after 30 minutes.

I’m trying to build the simplest possible example implementation following Z2JH, using DigitalOcean.

As far as I understand, the only point of failure should be my config.yaml, yet I can’t figure out what I’ve done wrong.



debug:
  enabled: true

hub:
  config:
    JupyterHub:
      authenticator_class: dummy
      log_level: DEBUG
    Authenticator:
      admin_users:
        - daniel
      allowed_users:
        - student1
    DummyAuthenticator:
      password: (redacted)

singleuser: 
  
  image:
    name: jupyter/scipy-notebook
    tag: ae885c0a6226

  # `cmd: null` allows the custom CMD of the Jupyter docker-stacks to be used
  # which performs further customization on startup.
  cmd: null

  # Setting a global start_timeout instead of 'per profile'
  startTimeout: 3600

  storage:
    dynamic:
      storageClass: "standard"
    capacity: 2Gi

  cpu:
    limit: 1
    guarantee: .5 

  memory:
    limit: 2G
    guarantee: 2G

Although I’ve set debug to true, it’s not clear to me how to get detailed JupyterHub log messages. I’m not seeing them when running kubectl logs on the various existing pods.

Any thoughts appreciated!

ae885c0a6226 is a 7-year-old image; can you try a more recent one? Alternatively, if you omit singleuser.image you’ll fall back to Z2JH’s default minimal image, which is ideal for testing your installation.

Hi @manics, thanks for your thoughts. That’s the tag noted in the docs… I’ll try with “latest” instead (although the docs seem to frown on using “latest”).

I’ll also try the default image, thanks. I had tried it earlier, but I’ve been trying all kinds of permutations of the YAML file, so it deserves another go.

I’ve reduced my config.yaml down to the bare minimum to see if that helps:

debug:
  enabled: true

hub:
  config:
    JupyterHub:
      authenticator_class: dummy
      log_level: DEBUG
    Authenticator:
      admin_users:
        - daniel
      allowed_users:
        - student1
    DummyAuthenticator:
      password: (redacted)

singleuser: 
  
  # `cmd: null` allows the custom CMD of the Jupyter docker-stacks to be used
  # which performs further customization on startup.
  cmd: null

  # Setting a global start_timeout instead of 'per profile'
  startTimeout: 3600
  
  # Default to base image
  #image:
  #  name: jupyter/scipy-notebook
  #  tag: latest

  #storage:
  #  dynamic:
  #    storageClass: "standard"
  #  capacity: 2Gi

  #cpu:
  #  limit: 1
  #  guarantee: .5 

  #memory:
  #  limit: 2G
  #  guarantee: 2G

What’s interesting is that I can’t seem to get debug logs that show what might be going wrong in the spawning process.

These are my pods:

continuous-image-puller-4pxch     1/1     Running   0          23m
continuous-image-puller-5d8tt     1/1     Running   0          23m
continuous-image-puller-ssb6q     1/1     Running   0          23m
hub-5f5fc74974-k9w8b              1/1     Running   0          3m17s
jupyter-student1                  0/1     Pending   0          2m20s
proxy-5b4b6fcc-9p5lr              1/1     Running   0          3m17s
user-scheduler-7bf8fbbcd8-9kx2j   1/1     Running   0          23m
user-scheduler-7bf8fbbcd8-sphx6   1/1     Running   0          23m

But when I use kubectl logs -f on each of them, none provide any debug logging from JupyterHub that would elucidate what’s going on. Hmm…
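
For reference, these are the sorts of commands I’ve been running (pod names are from the listing above; “jhub” is just what I called my namespace, so replace it with whatever you installed into):

# list everything in the JupyterHub namespace
kubectl get pods -n jhub

# follow the hub's logs, which is where I'd expect the JupyterHub debug output to land
kubectl logs -f deploy/hub -n jhub

# check why the user pod is stuck in Pending (scheduling events show up here)
kubectl describe pod jupyter-student1 -n jhub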

Hey @danielmcquillen

I had a similar issue. The spawner just refused to work, and in my case it had something to do with the communication between the cluster and the hub API.

I added this to the config.yaml file and it solved it:

hub:
  networkPolicy:
    egress:
      - ports:
          - port: 6443
          - port: 443

Not sure if that will help, but let’s see if it changes anything. Good luck!

Thanks @IvanYingX. I tried that to no avail. I also tried boosting the size of the nodes, but that didn’t help.

I do see the following errors in one of the continuous-image-puller pods, but I’m not sure if they’re relevant…

I0903 02:31:28.208827       1 reflector.go:243] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:134
E0903 02:31:28.213678       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.PodDisruptionBudget: failed to list *v1beta1.PodDisruptionBudget: the server could not find the requested resource