Hi, I am trying to set up JupyterHub with YarnSpawner against a Hadoop cluster. I’m seeing a peculiar failure in which the notebook is spawned and successfully begins running, at least one activity message is sent back to the Hub (received and logged), but the Hub still times out waiting for the notebook and seems to eventually kill it, as though it never did receive activity. The Hub’s UI shows a progress bar that stops halfway and then never finishes. Ultimately it goes red when the Hub times out the spawn.
No errors are occurring in the container, it just starts, runs for a little while (sending back an activity message, as noted), then seems to be killed by the Hub. The Hub logs the received message but still times out. I’m not sure what to do from here. I’m including some short log excerpts below, but I’m happy to supply whatever might be useful. First, below is the Hub log. Notice the line:
[I 2023-07-10 15:11:40.001 JupyterHub log:186] 200 POST /hub/api/users/sorokaa/activity (sorokaa@192.168.60.12)
That’s the IP of the node with the application container. I understand that to mean that an activity message was successfully received by the Hub? Here is the Hub log section in full.
23/07/10 15:11:30 INFO skein.Driver: Submitting application...
23/07/10 15:11:30 INFO impl.YarnClientImpl: Submitted application application_1687889389675_0006
[W 2023-07-10 15:11:35.587 JupyterHub base:1072] User sorokaa is slow to start (timeout=10)
[I 2023-07-10 15:11:39.969 JupyterHub log:186] 200 GET /hub/api (@192.168.60.12) 0.95ms
[I 2023-07-10 15:11:40.001 JupyterHub log:186] 200 POST /hub/api/users/sorokaa/activity (sorokaa@192.168.60.12) 21.40ms
15:16:09.880 [ConfigProxy] info: 200 GET /api/routes
[W 2023-07-10 15:16:25.621 JupyterHub user:825] sorokaa's server failed to start in 300 seconds, giving up.
Common causes of this timeout, and debugging tips:
1. Everything is working, but it took too long.
To fix: increase `Spawner.start_timeout` configuration
to a number of seconds that is enough for spawners to finish starting.
2. The server didn't finish starting,
or it crashed due to a configuration issue.
Check the single-user server's logs for hints at what needs fixing.
23/07/10 15:16:25 INFO impl.YarnClientImpl: Killed application application_1687889389675_0006
[E 2023-07-10 15:16:25.876 JupyterHub gen:623] Exception in Future <Task finished name='Task-26' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/jupyterhub/handlers/base.py:963> exception=TimeoutError('Timeout')> after timeout
Traceback (most recent call last):
File "/cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/tornado/gen.py", line 618, in error_callback
future.result()
File "/cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/jupyterhub/handlers/base.py", line 970, in finish_user_spawn
await spawn_future
File "/cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/jupyterhub/user.py", line 851, in spawn
raise e
File "/cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/jupyterhub/user.py", line 748, in spawn
url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
tornado.util.TimeoutError: Timeout
[I 2023-07-10 15:16:25.885 JupyterHub log:186] 200 GET /hub/api/users/sorokaa/server/progress (sorokaa@::ffff:10.254.127.252) 299089.40ms
[E 2023-07-10 15:16:25.890 JupyterHub gen:623] Exception in Future <Task finished name='Task-27' coro=<YarnSpawner.start() done, defined at /cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/yarnspawner/spawner.py:203> exception=Exception('Application application_1687889389675_0006 failed to start, check application logs for more information')> after timeout
Traceback (most recent call last):
File "/cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/tornado/gen.py", line 618, in error_callback
future.result()
File "/cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/yarnspawner/spawner.py", line 247, in start
raise Exception("Application %s failed to start, check "
Exception: Application application_1687889389675_0006 failed to start, check application logs for more information
Now here an except from the ResourceManager log, showing the successful launching and running of the container with the notebook in it:
23/07/10 15:11:32 INFO nodemanager.DefaultContainerExecutor: Initializing user sorokaa
23/07/10 15:11:32 INFO nodemanager.DefaultContainerExecutor: Copying from /hadoop/yarn/local/nmPrivate/container_1687889389675_0006_01_00000118941375351.tokens to /hadoop/yarn/local/usercache/sorokaa/appcache/application_1687889389675_0006/container_1687889389675_0006_01_000001.tokens
23/07/10 15:11:32 INFO nodemanager.DefaultContainerExecutor: Localizer CWD set to /hadoop/yarn/local/usercache/sorokaa/appcache/application_1687889389675_0006 = file:/hadoop/yarn/local/usercache/sorokaa/appcache/application_1687889389675_0006
23/07/10 15:11:32 INFO container.ContainerImpl: Container container_1687889389675_0006_01_000001 transitioned from LOCALIZING to SCHEDULED
23/07/10 15:11:32 INFO scheduler.ContainerScheduler: Starting container [container_1687889389675_0006_01_000001]
23/07/10 15:11:33 INFO container.ContainerImpl: Container container_1687889389675_0006_01_000001 transitioned from SCHEDULED to RUNNING
23/07/10 15:11:33 INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1687889389675_0006_01_000001
23/07/10 15:11:33 INFO nodemanager.DefaultContainerExecutor: launchContainer: [bash, /hadoop/yarn/local/usercache/sorokaa/appcache/application_1687889389675_0006/container_1687889389675_0006_01_000001/default_container_executor.sh]
23/07/10 15:11:35 INFO monitor.ContainersMonitorImpl: container_1687889389675_0006_01_000001's ip = 192.168.60.12, and hostname = dl-test-03.cm.cluster
and here it is seemingly getting shut down from the Hub:
23/07/10 15:16:25 INFO ipc.Server: Auth successful for appattempt_1687889389675_0006_000001 (auth:SIMPLE) from 192.168.60.11:48346
23/07/10 15:16:25 INFO containermanager.ContainerManagerImpl: Stopping container with container Id: container_1687889389675_0006_01_000001
23/07/10 15:16:25 INFO container.ContainerImpl: Container container_1687889389675_0006_01_000001 transitioned from RUNNING to KILLING
23/07/10 15:16:25 INFO launcher.ContainerCleanup: Cleaning up container container_1687889389675_0006_01_000001
23/07/10 15:16:25 WARN nodemanager.DefaultContainerExecutor: Exit code from container container_1687889389675_0006_01_000001 is : 143
23/07/10 15:16:25 INFO container.ContainerImpl: Container container_1687889389675_0006_01_000001 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
That auth call which begins the operation is coming from 192.168.60.11, which is the edge node running the Hub.
Lastly, below is a slice of the log from the notebook itself. You’ll see it successfully launch and begin notifying the Hub:
[I 2023-07-10 15:11:39.977 YarnSingleUserNotebookApp mixins:591] Updating Hub with activity every 300 seconds [D 2023-07-10 15:11:39.977 YarnSingleUserNotebookApp mixins:553] Notifying Hub of activity 2023-07-10T19:11:39.915883Z
directly beneath which you’ll see me make a successful HTTP GET against the notebook from inside the cluster, to prove to myself that it really was running and that I am not hallucinating:
[I 2023-07-10 15:13:17.799 YarnSingleUserNotebookApp log:186] 302 GET /user/sorokaa/tree/ -> /user/sorokaa/tree (@192.168.60.11) 0.77ms
Here is the full excerpt from the notebook log:
[I 2023-07-10 15:11:39.881 YarnSingleUserNotebookApp mixins:609] Starting jupyterhub single-user server version 3.1.1
[I 2023-07-10 15:11:39.881 YarnSingleUserNotebookApp mixins:623] Extending __main__.YarnSingleUserNotebookApp from __main__
[I 2023-07-10 15:11:39.881 YarnSingleUserNotebookApp mixins:623] Extending jupyter_server.serverapp.ServerApp from jupyter_server 1.23.6
[D 2023-07-10 15:11:39.900 YarnSingleUserNotebookApp application:190] Searching ['/home/.jupyter', '/home/.local/etc/jupyter', '/cm/shared/datalake/jupyterhub/miniconda/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'] for config files
[D 2023-07-10 15:11:39.900 YarnSingleUserNotebookApp application:902] Looking for jupyter_config in /etc/jupyter
[D 2023-07-10 15:11:39.900 YarnSingleUserNotebookApp application:902] Looking for jupyter_config in /usr/local/etc/jupyter
[D 2023-07-10 15:11:39.900 YarnSingleUserNotebookApp application:902] Looking for jupyter_config in /cm/shared/datalake/jupyterhub/miniconda/etc/jupyter
[D 2023-07-10 15:11:39.900 YarnSingleUserNotebookApp application:902] Looking for jupyter_config in /home/.local/etc/jupyter
[D 2023-07-10 15:11:39.901 YarnSingleUserNotebookApp application:902] Looking for jupyter_config in /home/.jupyter
[D 2023-07-10 15:11:39.901 YarnSingleUserNotebookApp application:902] Looking for jupyter_server_config in /etc/jupyter
[D 2023-07-10 15:11:39.902 YarnSingleUserNotebookApp application:902] Looking for jupyter_server_config in /usr/local/etc/jupyter
[D 2023-07-10 15:11:39.902 YarnSingleUserNotebookApp application:902] Looking for jupyter_server_config in /cm/shared/datalake/jupyterhub/miniconda/etc/jupyter
[D 2023-07-10 15:11:39.902 YarnSingleUserNotebookApp application:902] Looking for jupyter_server_config in /home/.local/etc/jupyter
[D 2023-07-10 15:11:39.902 YarnSingleUserNotebookApp application:902] Looking for jupyter_server_config in /home/.jupyter
[D 2023-07-10 15:11:39.904 YarnSingleUserNotebookApp config_manager:93] Paths used for configuration of jupyter_server_config:
/etc/jupyter/jupyter_server_config.json
[D 2023-07-10 15:11:39.904 YarnSingleUserNotebookApp config_manager:93] Paths used for configuration of jupyter_server_config:
/usr/local/etc/jupyter/jupyter_server_config.json
[D 2023-07-10 15:11:39.904 YarnSingleUserNotebookApp config_manager:93] Paths used for configuration of jupyter_server_config:
/cm/shared/datalake/jupyterhub/miniconda/etc/jupyter/jupyter_server_config.json
[D 2023-07-10 15:11:39.904 YarnSingleUserNotebookApp config_manager:93] Paths used for configuration of jupyter_server_config:
/home/.local/etc/jupyter/jupyter_server_config.json
[D 2023-07-10 15:11:39.905 YarnSingleUserNotebookApp config_manager:93] Paths used for configuration of jupyter_server_config:
/home/.jupyter/jupyter_server_config.json
/cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/yarnspawner/singleuser.py:16: RuntimeWarning: coroutine 'HubAuth._api_request' was never awaited
self.hub_auth._api_request(method='POST',
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
[I 2023-07-10 15:11:39.963 YarnSingleUserNotebookApp mixins:670] Starting jupyterhub-singleuser server version 3.1.1
[D 2023-07-10 15:11:39.969 YarnSingleUserNotebookApp _version:74] jupyterhub and jupyterhub-singleuser both on version 3.1.1
[I 2023-07-10 15:11:39.970 YarnSingleUserNotebookApp serverapp:2686] Serving notebooks from local directory: /hadoop/yarn/local/usercache/sorokaa/appcache/application_1687889389675_0006/container_1687889389675_0006_01_000001
[I 2023-07-10 15:11:39.970 YarnSingleUserNotebookApp serverapp:2686] Jupyter Server 1.23.6 is running at:
[I 2023-07-10 15:11:39.970 YarnSingleUserNotebookApp serverapp:2686] http://dl-test-03:34484/user/sorokaa/tree/
[I 2023-07-10 15:11:39.970 YarnSingleUserNotebookApp serverapp:2686] or http://127.0.0.1:34484/user/sorokaa/tree/
[I 2023-07-10 15:11:39.970 YarnSingleUserNotebookApp serverapp:2687] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2023-07-10 15:11:39.977 YarnSingleUserNotebookApp mixins:591] Updating Hub with activity every 300 seconds
[D 2023-07-10 15:11:39.977 YarnSingleUserNotebookApp mixins:553] Notifying Hub of activity 2023-07-10T19:11:39.915883Z
[I 2023-07-10 15:13:17.799 YarnSingleUserNotebookApp log:186] 302 GET /user/sorokaa/tree/ -> /user/sorokaa/tree (@192.168.60.11) 0.77ms
[I 2023-07-10 15:13:20.626 YarnSingleUserNotebookApp log:186] 302 GET /user/sorokaa/tree/ -> /user/sorokaa/tree (@192.168.60.11) 0.71ms
[D 2023-07-10 15:13:26.029 YarnSingleUserNotebookApp auth:673] No user identified
[D 2023-07-10 15:13:26.035 YarnSingleUserNotebookApp handlers:273] Using contents: services/contents
[D 2023-07-10 15:13:26.132 YarnSingleUserNotebookApp handlers:881] Path favicon.ico served from /cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/jupyter_server/static/favicon.ico
[D 2023-07-10 15:13:26.155 YarnSingleUserNotebookApp handlers:881] Path style/bootstrap.min.css served from /cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/jupyter_server/static/style/bootstrap.min.css
[D 2023-07-10 15:13:26.174 YarnSingleUserNotebookApp handlers:881] Path style/bootstrap-theme.min.css served from /cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/jupyter_server/static/style/bootstrap-theme.min.css
[D 2023-07-10 15:13:26.185 YarnSingleUserNotebookApp handlers:881] Path style/index.css served from /cm/shared/datalake/jupyterhub/miniconda/lib/python3.10/site-packages/jupyter_server/static/style/index.css
[W 2023-07-10 15:13:26.193 YarnSingleUserNotebookApp log:186] 404 GET /user/sorokaa/tree (@192.168.60.11) 165.33ms
End of LogType:application.driver.log.This log file belongs to a running container (container_1687889389675_0006_01_000001) and so may not be complete.
The only thing that seems obviously strange is the RuntimeWarning:
RuntimeWarning: coroutine 'HubAuth._api_request' was never awaited self.hub_auth._api_request(method='POST', RuntimeWarning: Enable tracemalloc to get the object allocation traceback
but I do not know enough about Jupyter to understand whether this is important or not.
Any help much appreciated. My knowledge of Jupyter is almost nil and this has flummoxed me. Hopefully it is something obvious I just haven’t recognized?