I’m running JupyterHub on an Azure cluster with node autoscaling enabled. When starting a new notebook server, it takes a while until the node is spun up. I’ve set start_timeout to 3600 seconds (and even http_timeout to 3600), yet after 5 minutes the UI errors out with
Spawn failed: Timeout
In the logs I’m seeing the following:
[I 2021-02-24 18:42:40.159 JupyterHub log:181] 200 GET /hub/api/users (cull-idle@127.0.0.1) 19.70ms
[I 2021-02-24 18:43:40.071 JupyterHub proxy:320] Checking routes
[I 2021-02-24 18:44:40.071 JupyterHub proxy:320] Checking routes
[I 2021-02-24 18:45:40.070 JupyterHub proxy:320] Checking routes
[I 2021-02-24 18:46:40.071 JupyterHub proxy:320] Checking routes
[I 2021-02-24 18:47:40.071 JupyterHub proxy:320] Checking routes
[I 2021-02-24 18:48:40.071 JupyterHub proxy:320] Checking routes
[E 2021-02-24 18:48:57.191 JupyterHub spawner:1926] Pod jupyter-sam never showed up in reflector, restarting pod reflector
[I 2021-02-24 18:48:57.222 JupyterHub reflector:201] watching for pods with label selector='component=singleuser-server' in namespace jupyterhub2
[E 2021-02-24 18:48:57.223 JupyterHub gen:623] Exception in Future <Task finished name='Task-11690' coro=<KubeSpawner._start() done, defined at /usr/local/lib/python3.8/dist-packages/kubespawner/spawner.py:1871> exception=TimeoutError('pod/jupyter-sam did not start in 3600 seconds!')> after timeout
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
    future.result()
  File "/usr/local/lib/python3.8/dist-packages/kubespawner/spawner.py", line 1917, in _start
    await exponential_backoff(
  File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 179, in exponential_backoff
    raise TimeoutError(fail_message)
TimeoutError: pod/jupyter-sam did not start in 3600 seconds!
So the message TimeoutError: pod/jupyter-sam did not start in 3600 seconds! is misleading, because the error actually appears after only 5 minutes.
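For context on where that message text comes from: KubeSpawner waits for the pod via JupyterHub's exponential_backoff helper, which polls a condition and raises TimeoutError with a fail message once the timeout elapses. Here is a simplified, hypothetical re-implementation of that pattern (illustrative names and defaults, not the actual jupyterhub.utils code):

```python
import asyncio

async def exponential_backoff(pass_func, fail_message,
                              start_wait=0.2, scale_factor=2,
                              max_wait=5, timeout=10):
    """Poll pass_func with exponentially growing sleeps until it
    returns True or `timeout` seconds elapse."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    wait = start_wait
    while loop.time() < deadline:
        if pass_func():
            return
        await asyncio.sleep(min(wait, max_wait))
        wait *= scale_factor
    raise TimeoutError(fail_message)

async def wait_for_pod(pod_name="jupyter-sam", timeout=2):
    # In the real spawner, pass_func checks the pod reflector for a
    # running pod; here it never succeeds, to force the timeout.
    await exponential_backoff(
        lambda: False,
        f"pod/{pod_name} did not start in {timeout} seconds!",
        timeout=timeout,
    )

try:
    asyncio.run(wait_for_pod())
except TimeoutError as e:
    print(e)  # pod/jupyter-sam did not start in 2 seconds!
```

The key point is that the fail message embeds whatever timeout value is visible at that moment, which matters for the discussion below.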
# a token randomly generated for this instance of jupyterhub
# generated with openssl rand -hex 32
# more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/installation.html
proxy:
  secretToken: "<redacted>"

# use ingress
# more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/administrator/advanced.html#ingress
ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt
    ingress.kubernetes.io/proxy-body-size: 4000m
    nginx.ingress.kubernetes.io/proxy-body-size: 4000m
  hosts:
    - <redacted>
  tls:
    - hosts:
        - <redacted>
      secretName: main-tls

# use Google SSO as the login system
# more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/administrator/authentication.html#google
# note: this has been reworked in chart version 0.11.0+. It used to be under the 'auth' parameter.
hub:
  config:
    GoogleOAuthenticator:
      client_id: <redacted>.apps.googleusercontent.com
      client_secret: <redacted>
      oauth_callback_url: https://<redacted>/hub/oauth_callback
      hosted_domain:
        - <redacted>
      login_service: Google  # the string shown in the UI where it says "Sign in with <login_service>"
    JupyterHub:
      authenticator_class: google
    # make some people admins (they can kill other users' notebooks):
    # more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/customizing/user-management.html#admin-users
    Authenticator:
      admin_users:
        - <redacted>

# jupyter was first set up for a single user with a single notebook; that's the reason this is still called singleuser
singleuser:
  defaultUrl: "/lab"  # boot into the JupyterLab UI instead of the classic UI
  storage:
    capacity: 50Gi  # give each user a PVC of 50Gi
  # choose a default Docker image:
  # this is the notebook that spawns if profileList were not specified, or if the image inside a profile is not overridden
  # more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/customizing/user-environment.html#choose-and-use-an-existing-docker-image
  image:
    name: jupyter/minimal-notebook
    tag: latest
  # add multiple profiles so users can choose their Docker image:
  # more info at https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/customizing/user-environment.html#using-multiple-profiles-to-let-users-select-their-environment
  # this makes use of KubeSpawner, whose own documentation can be found at https://jupyterhub-kubespawner.readthedocs.io/en/latest/spawner.html#module-kubespawner.spawner
  profileList:
    - display_name: "Minimal environment"
      description: "To avoid too many bells and whistles: Python."
      default: true
      # no image is defined here, so the default notebook (see above) will spawn
    - display_name: "Datascience environment"
      description: "If you want the additional bells and whistles: Python, R, and Julia."
      kubespawner_override:
        image: jupyter/datascience-notebook:latest
    - display_name: "Spark environment"
      description: "The Jupyter Stacks spark image!"
      kubespawner_override:
        image: jupyter/all-spark-notebook:latest
    - display_name: "GPU-Jupyter (GPU-based)"
      description: "Latest TensorFlow and PyTorch based on CUDA 10.1. More info at https://github.com/iot-salzburg/gpu-jupyter"
      kubespawner_override:
        image: cschranz/gpu-jupyter:v1.3_cuda-10.1_ubuntu-18.04_python-only
        extra_resource_limits:
          nvidia.com/gpu: 1
        # tolerations needed because our GPU nodepool has these taints
        tolerations:
          - key: nvidia.com/gpu
            effect: NoSchedule
            value: present
        # increasing the timeouts, as it can take a while for a node to spin up on Azure (doesn't help):
        start_timeout: 3600
        http_timeout: 3600
It’s only a problem when we start the GPU-based server, because then Azure has to spin up a node in the background, which takes a while. The node stays alive for 10 minutes, so if we retry within 10 minutes, it does work.
If you set a spawner-wide start_timeout instead of one in the profile, is it respected? If so, this implies it’s a problem with how the profile overrides are handled.
Can you try turning on debug logging and showing us the full logs? This should include debug logs like
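For reference, debug logging in the Z2JH Helm chart can be switched on with the top-level debug flag (assuming a reasonably recent chart version):

```yaml
debug:
  enabled: true
```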
singleuser:
  # Setting a global start_timeout instead of 'per profile'
  start_timeout: 3600
This did not help, though the error looks a bit different now:
[W 2021-03-24 10:32:20.725 JupyterHub user:681] sam's server failed to start in 3600 seconds, giving up
[I 2021-03-24 10:32:20.725 JupyterHub spawner:1997] Deleting pod jupyter-sam
[I 2021-03-24 10:32:47.748 JupyterHub proxy:319] Checking routes
[E 2021-03-24 10:33:31.287 JupyterHub gen:623] Exception in Future <Task finished name='Task-59436' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py:895> exception=TimeoutError('Timeout')> after timeout
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
    future.result()
  File "/usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py", line 902, in finish_user_spawn
    await spawn_future
  File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 707, in spawn
    raise e
  File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 606, in spawn
    url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
tornado.util.TimeoutError: Timeout
spawner.start_timeout is then read again for the log message when the timeout is reached, which assumes that Spawner.start() didn’t change the timeout for start itself. I suspect that’s the source of the confusion, leading to this sequence of events:
1. the default spawner.start_timeout is used as the real timeout for Spawner.start()
2. spawner.start_timeout is changed during start, while the timer is already running, so the change has no effect
3. the original timeout is reached, and the log message reports the updated, but never actually used, spawner.start_timeout
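A minimal sketch of this sequence (hypothetical names, not JupyterHub's actual code): the timer is armed with the old start_timeout before start() runs, the override applied inside start() cannot extend it, and the log message then re-reads the attribute and reports the new, never-used value.

```python
import asyncio

class FakeSpawner:
    start_timeout = 1  # stands in for the default value

    async def start(self):
        # the profile's kubespawner_override takes effect during start
        self.start_timeout = 3600
        await asyncio.sleep(2)  # node spin-up outlasts the real timer
        return "http://pod:8888"

async def spawn(spawner):
    # the timeout argument is evaluated BEFORE start() runs, so it is still 1
    try:
        await asyncio.wait_for(spawner.start(), spawner.start_timeout)
    except asyncio.TimeoutError:
        # the attribute is re-read here and is now 3600, producing the
        # misleading "in 3600 seconds" message after only 1 second
        return f"server failed to start in {spawner.start_timeout} seconds"

print(asyncio.run(spawn(FakeSpawner())))  # server failed to start in 3600 seconds
```

This reproduces the shape of the reports above: the real timeout is the pre-override default, while the logged number is the overridden one.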
Hello, can you help me check my problem? I also have JupyterHub deployed on an AKS cluster on Azure, but when user pods start, some of them fail: no pod is created on Kubernetes at all. After 300 seconds I get this error:
[E 2021-04-09 06:08:01.147 JupyterHub gen:599] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/handlers/base.py:845> exception=TimeoutError('Timeout',)> after timeout
Traceback (most recent call last):
  File "/home/jovyan/.local/lib/python3.6/site-packages/tornado/gen.py", line 593, in error_callback
    future.result()
  File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/handlers/base.py", line 852, in finish_user_spawn
    await spawn_future
  File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/user.py", line 656, in spawn
    raise e
  File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/user.py", line 560, in spawn
    url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
tornado.util.TimeoutError: Timeout