EKS Binderhub launching server fails after successful build

I am having trouble getting my Zero-to-BinderHub instance up and running. I am currently using EKS with GCR configured as my container registry (the service account has admin credentials). When I try to build an example project (https://github.com/binder-examples/requirements), it builds and pushes the container, but the server fails to start.

Pushing image
Successfully pushed gcr.io/graceful-karma-302801/binderhub-stage-binder-2dexamples-2drequirements-55ab5c:82543a814cdd4cb125d03112a193dd4137d8751c
Built image, launching...
Launching server...
Launch attempt 1 failed, retrying...
Launch attempt 2 failed, retrying...
Launch attempt 3 failed, retrying...
Internal Server Error

Things I’ve tried:

  • Multiple different repos (all build but fail to launch)
  • Different sizes of worker nodes (t2.small, t2.medium); no change, and kubectl top nodes shows plenty of headroom
  • Checked to make sure the nodes are not restarting and causing network instability

Any ideas what could be causing these launches to fail? Any advice on how to find and dig into error messages for the launch would also be helpful.

Have a look at your JupyterHub pod logs. If you check the list of pods with kubectl get pods whilst the launch occurs, you should also be able to view the logs of the failing pod.

Thanks for the advice! It helped me dig in a bit further. I also tried to follow the steps for app debugging here. The pods are launching but they never get past the Pending state. The pod below (*) will restart three times before the launch fails.

kubectl get pods --all-namespaces
NAMESPACE     NAME                                                  READY   STATUS    RESTARTS   AGE
kube-system   aws-node-7k5dz                                        1/1     Running   0          32m
kube-system   aws-node-l2dxk                                        1/1     Running   0          32m
kube-system   aws-node-ttmgq                                        1/1     Running   0          32m
kube-system   binder-854784654c-9v2rn                               1/1     Running   0          27m
kube-system   binderhub-image-cleaner-fsn5h                         1/1     Running   0          31m
kube-system   binderhub-image-cleaner-hzlk5                         1/1     Running   0          31m
kube-system   binderhub-image-cleaner-jnnlc                         1/1     Running   0          31m
kube-system   coredns-66bc8b7b7b-8c88m                              1/1     Running   0          35m
kube-system   coredns-66bc8b7b7b-rcg9v                              1/1     Running   0          35m
kube-system   hub-8546d44db4-2k7k9                                  1/1     Running   0          32m
*kube-system   jupyter-binder-2dexamples-2drequirements-2dftbxioh8   0/1     Pending   0          7s
kube-system   kube-proxy-2s2ln                                      1/1     Running   0          32m
kube-system   kube-proxy-kvcf8                                      1/1     Running   0          32m
kube-system   kube-proxy-wdrqr                                      1/1     Running   0          32m
kube-system   proxy-699bd568c7-8rbls                                1/1     Running   0          32m
kube-system   user-scheduler-78456bb65c-99t2r                       1/1     Running   0          32m
kube-system   user-scheduler-78456bb65c-n6rfn                       1/1     Running   0          32m

When I try to access the logs of this pod to see what’s going on, I hit some brick walls:

kubectl logs jupyter-binder-2dexamples-2drequirements-2dftbxioh8
Error from server (NotFound): pods "jupyter-binder-2dexamples-2drequirements-2dftbxioh8" not found

and

kubectl describe pods jupyter-binder-2dexamples-2drequirements-2dftbxioh8
Error from server (NotFound): pods "jupyter-binder-2dexamples-2drequirements-2dftbxioh8" not found
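
In hindsight, the NotFound above is probably just a namespace issue: the pod is in kube-system (see the list above) and kubectl defaults to the default namespace unless the context says otherwise, so the commands need an explicit flag (unless the pod had already been cleaned up after the failed launch). Something like:

kubectl logs jupyter-binder-2dexamples-2drequirements-2dftbxioh8 --namespace=kube-system
kubectl describe pod jupyter-binder-2dexamples-2drequirements-2dftbxioh8 --namespace=kube-system

Once the pod actually exists, the Events section of the describe output is where scheduling and image-pull problems normally show up.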

Events look fine AFAICT:

LAST SEEN   TYPE     REASON                    OBJECT                                          MESSAGE
41m         Normal   Starting                  node/ip-10-0-1-60.us-east-2.compute.internal    Starting kubelet.
41m         Normal   NodeHasSufficientMemory   node/ip-10-0-1-60.us-east-2.compute.internal  Node ip-10-0-1-60.us-east-2.compute.internal status is now: NodeHasSufficientMemory
41m         Normal   NodeHasNoDiskPressure     node/ip-10-0-1-60.us-east-2.compute.internal  Node ip-10-0-1-60.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
41m         Normal   NodeHasSufficientPID      node/ip-10-0-1-60.us-east-2.compute.internal    Node ip-10-0-1-60.us-east-2.compute.internal status is now: NodeHasSufficientPID
41m         Normal   NodeAllocatableEnforced   node/ip-10-0-1-60.us-east-2.compute.internal    Updated Node Allocatable limit across pods
41m         Normal   RegisteredNode            node/ip-10-0-1-60.us-east-2.compute.internal    Node ip-10-0-1-60.us-east-2.compute.internal event: Registered Node ip-10-0-1-60.us-east-2.compute.internal in Controller
40m         Normal   Starting                  node/ip-10-0-1-60.us-east-2.compute.internal    Starting kube-proxy.
40m         Normal   NodeReady                 node/ip-10-0-1-60.us-east-2.compute.internal    Node ip-10-0-1-60.us-east-2.compute.internal status is now: NodeReady
41m         Normal   Starting                  node/ip-10-0-3-203.us-east-2.compute.internal   Starting kubelet.
41m         Normal   NodeHasSufficientMemory   node/ip-10-0-3-203.us-east-2.compute.internal   Node ip-10-0-3-203.us-east-2.compute.internal status is now: NodeHasSufficientMemory
41m         Normal   NodeHasNoDiskPressure     node/ip-10-0-3-203.us-east-2.compute.internal   Node ip-10-0-3-203.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
41m         Normal   NodeHasSufficientPID      node/ip-10-0-3-203.us-east-2.compute.internal   Node ip-10-0-3-203.us-east-2.compute.internal status is now: NodeHasSufficientPID
41m         Normal   NodeAllocatableEnforced   node/ip-10-0-3-203.us-east-2.compute.internal   Updated Node Allocatable limit across pods
41m         Normal   RegisteredNode            node/ip-10-0-3-203.us-east-2.compute.internal   Node ip-10-0-3-203.us-east-2.compute.internal event: Registered Node ip-10-0-3-203.us-east-2.compute.internal in Controller
40m         Normal   Starting                  node/ip-10-0-3-203.us-east-2.compute.internal   Starting kube-proxy.
40m         Normal   NodeReady                 node/ip-10-0-3-203.us-east-2.compute.internal   Node ip-10-0-3-203.us-east-2.compute.internal status is now: NodeReady
41m         Normal   Starting                  node/ip-10-0-3-235.us-east-2.compute.internal   Starting kubelet.
41m         Normal   NodeHasSufficientMemory   node/ip-10-0-3-235.us-east-2.compute.internal   Node ip-10-0-3-235.us-east-2.compute.internal status is now: NodeHasSufficientMemory
41m         Normal   NodeHasNoDiskPressure     node/ip-10-0-3-235.us-east-2.compute.internal   Node ip-10-0-3-235.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
41m         Normal   NodeHasSufficientPID      node/ip-10-0-3-235.us-east-2.compute.internal   Node ip-10-0-3-235.us-east-2.compute.internal status is now: NodeHasSufficientPID
41m         Normal   NodeAllocatableEnforced   node/ip-10-0-3-235.us-east-2.compute.internal   Updated Node Allocatable limit across pods
41m         Normal   RegisteredNode            node/ip-10-0-3-235.us-east-2.compute.internal   Node ip-10-0-3-235.us-east-2.compute.internal event: Registered Node ip-10-0-3-235.us-east-2.compute.internal in Controller
40m         Normal   Starting                  node/ip-10-0-3-235.us-east-2.compute.internal   Starting kube-proxy.
40m         Normal   NodeReady                 node/ip-10-0-3-235.us-east-2.compute.internal   Node ip-10-0-3-235.us-east-2.compute.internal status is now: NodeReady

The pod also never shows up in kubectl top pods, even with the Kubernetes metrics-server running.

Any other advice on how to explore this problem further?

Could you look at your JupyterHub hub logs? Something like

kubectl logs hub-8546d44db4-2k7k9

or

kubectl logs deploy/hub

If that doesn’t show the error, you could also try a pod log-tailing tool: you can give it a pod regex and it will automatically display the logs as soon as a matching pod starts.

Yes, I was able to get those! On the t2.smalls the error turned out to be the size of the nodes: when I ran kubectl describe against the pending pod on the smalls, the events showed Insufficient memory.

Events:
Type     Reason            Age                  From                      Message
----     ------            ----                 ----                      -------
Warning  FailedScheduling  89s (x3 over 2m46s)  binderhub-user-scheduler  0/3 nodes are available: 3 Insufficient memory.
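
A quick way to confirm how tight memory is on each node is to compare its allocatable memory with what has already been requested:

# Allocatable capacity per node, and what has already been requested on it
kubectl describe nodes | grep -A 7 "Allocatable:"
kubectl describe nodes | grep -A 10 "Allocated resources:"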

I switched over to t2.larges just to make sure I had enough room for the next test. The pods all get out of the Pending stage this time, but the image pull is still failing.

kubectl get pods --all-namespaces
NAMESPACE     NAME                                                  READY   STATUS             RESTARTS   AGE
kube-system   aws-node-mkqdt                                        1/1     Running            0          22m
kube-system   aws-node-mwdrr                                        1/1     Running            0          22m
kube-system   aws-node-vw4f9                                        1/1     Running            0          22m
kube-system   binder-56c97d9545-rvbsk                               1/1     Running            0          18m
kube-system   binderhub-image-cleaner-4mt28                         1/1     Running            0          21m
kube-system   binderhub-image-cleaner-jjxz2                         1/1     Running            0          21m
kube-system   binderhub-image-cleaner-jkw6k                         1/1     Running            0          21m
kube-system   coredns-66bc8b7b7b-vzvnh                              1/1     Running            0          26m
kube-system   coredns-66bc8b7b7b-z4c4z                              1/1     Running            0          26m
kube-system   hub-8546d44db4-2ct2l                                  1/1     Running            0          22m
kube-system   jupyter-binder-2dexamples-2drequirements-2ds93l6m1k   0/1     ImagePullBackOff   0          4m9s
kube-system   kube-proxy-jps5s                                      1/1     Running            0          22m
kube-system   kube-proxy-m2v97                                      1/1     Running            0          22m
kube-system   kube-proxy-m67jv                                      1/1     Running            0          22m
kube-system   kubernetes-dashboard-69574fd85c-gx5hw                 1/1     Running            0          22m
kube-system   metrics-server-7d7bbd97fc-nxpw6                       1/1     Running            0          22m
kube-system   proxy-699bd568c7-xh2pb                                1/1     Running            0          22m
kube-system   user-scheduler-78456bb65c-852rp                       1/1     Running            0          22m
kube-system   user-scheduler-78456bb65c-dnrp2                       1/1     Running            0          22m

Checking the logs again, I found the error occurring in the KubeSpawner. It’s mostly INFO messages until the timeout error, but I’ve attached the full output from the first launch in case it’s helpful.

kubectl logs jupyter-binder-2dexamples-2drequirements-2dt4p4cf70 -c notebook --namespace=kube-system    
Error from server (BadRequest): container "notebook" in pod "jupyter-binder-2dexamples-2drequirements-2dt4p4cf70" is waiting to start: trying and failing to pull image

and

kubectl logs hub-8546d44db4-2ct2l --namespace=kube-system
No config at /etc/jupyterhub/config/values.yaml
Loading /etc/jupyterhub/secret/values.yaml
Loading extra config: 00-binder
[I 2021-01-30 02:33:54.781 JupyterHub app:2332] Running JupyterHub version 1.2.2
[I 2021-01-30 02:33:54.781 JupyterHub app:2362] Using Authenticator: nullauthenticator.NullAuthenticator-1.0.0
[I 2021-01-30 02:33:54.781 JupyterHub app:2362] Using Spawner: builtins.BinderSpawner
[I 2021-01-30 02:33:54.781 JupyterHub app:2362] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-1.2.2
[I 2021-01-30 02:33:54.782 JupyterHub app:1457] Writing cookie_secret to /srv/jupyterhub/jupyterhub_cookie_secret
[I 2021-01-30 02:33:54.798 alembic.runtime.migration migration:155] Context impl SQLiteImpl.
[I 2021-01-30 02:33:54.799 alembic.runtime.migration migration:158] Will assume non-transactional DDL.
[I 2021-01-30 02:33:54.808 alembic.runtime.migration migration:517] Running stamp_revision  -> 4dc2d5a8c53c
[I 2021-01-30 02:33:54.819 alembic.runtime.migration migration:155] Context impl SQLiteImpl.
[I 2021-01-30 02:33:54.819 alembic.runtime.migration migration:158] Will assume non-transactional DDL.
[W 2021-01-30 02:33:54.895 JupyterHub app:1687] No admin users, admin interface will be unavailable.
[W 2021-01-30 02:33:54.895 JupyterHub app:1688] Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2021-01-30 02:33:54.895 JupyterHub app:1717] Not using allowed_users. Any authenticated user will be allowed.
[I 2021-01-30 02:33:54.946 JupyterHub app:1840] Adding API token for service: binder
[I 2021-01-30 02:33:54.969 JupyterHub app:2399] Initialized 0 spawners in 0.002 seconds
[I 2021-01-30 02:33:54.971 JupyterHub app:2611] Not starting proxy
[I 2021-01-30 02:33:54.977 JupyterHub app:2647] Hub API listening on http://:8081/hub/
[I 2021-01-30 02:33:54.977 JupyterHub app:2649] Private Hub API connect url http://hub:8081/hub/
[I 2021-01-30 02:33:54.977 JupyterHub app:2662] Starting managed service cull-idle
[I 2021-01-30 02:33:54.977 JupyterHub service:339] Starting service 'cull-idle': ['python3', '-m', 'jupyterhub_idle_culler', '--url=http://localhost:8081/hub/api', '--timeout=3600', '--cull-every=600', '--concurrency=10', '--cull-users']
[I 2021-01-30 02:33:54.979 JupyterHub service:121] Spawning python3 -m jupyterhub_idle_culler --url=http://localhost:8081/hub/api --timeout=3600 --cull-every=600 --concurrency=10 --cull-users
[I 2021-01-30 02:33:54.983 JupyterHub app:2671] Adding external service binder
[I 2021-01-30 02:33:54.985 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:33:54.986 JupyterHub proxy:400] Adding default route for Hub: / => http://hub:8081
[I 2021-01-30 02:33:54.989 JupyterHub app:2722] JupyterHub is now running at http://:8000
[I 2021-01-30 02:33:55.110 JupyterHub log:181] 200 GET /hub/api/users (cull-idle@127.0.0.1) 13.04ms
[I 2021-01-30 02:34:54.993 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:35:54.993 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:36:54.994 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:37:54.995 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:38:54.993 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:39:54.994 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:40:54.992 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:41:54.994 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:42:25.828 JupyterHub log:181] 201 POST /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.33) 34.98ms
[I 2021-01-30 02:42:25.873 JupyterHub reflector:201] watching for pods with label selector='component=singleuser-server' in namespace kube-system
[W 2021-01-30 02:42:25.911 JupyterHub spawner:2140] Ignoring unrecognized KubeSpawner user_options: binder_launch_host, binder_persistent_request, binder_ref_url, binder_request, image, repo_url, token
[I 2021-01-30 02:42:25.915 JupyterHub log:181] 202 POST /hub/api/users/binder-examples-requirements-a94hfpwr/servers/ (binder@::ffff:10.0.3.44) 69.58ms
[I 2021-01-30 02:42:25.915 JupyterHub spawner:1805] Attempting to create pod jupyter-binder-2dexamples-2drequirements-2da94hfpwr, with timeout 3
[I 2021-01-30 02:42:25.941 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 9.37ms
[I 2021-01-30 02:42:26.984 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.237) 9.07ms
[I 2021-01-30 02:42:28.408 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 8.65ms
[I 2021-01-30 02:42:30.394 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.33) 8.88ms
[I 2021-01-30 02:42:33.161 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 8.79ms
[I 2021-01-30 02:42:37.035 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 8.89ms
[I 2021-01-30 02:42:42.474 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.237) 8.71ms
[I 2021-01-30 02:42:50.028 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.33) 8.66ms
[I 2021-01-30 02:42:54.994 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:43:00.061 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.237) 16.00ms
[I 2021-01-30 02:43:10.086 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.33) 9.03ms
[I 2021-01-30 02:43:20.140 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.237) 8.78ms
[I 2021-01-30 02:43:30.182 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 14.57ms
[I 2021-01-30 02:43:40.249 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.33) 8.89ms
[I 2021-01-30 02:43:50.363 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 8.78ms
[I 2021-01-30 02:43:54.993 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:43:55.107 JupyterHub log:181] 200 GET /hub/api/users (cull-idle@127.0.0.1) 10.54ms
[W 210130 02:43:55 __init__:138] Not culling server binder-examples-requirements-a94hfpwr with pending spawn
[I 2021-01-30 02:44:00.395 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 15.73ms
[I 2021-01-30 02:44:10.421 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 9.05ms
[I 2021-01-30 02:44:20.448 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 9.08ms
[I 2021-01-30 02:44:30.508 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.237) 15.03ms
[I 2021-01-30 02:44:40.563 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 8.92ms
[I 2021-01-30 02:44:50.613 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 9.65ms
[I 2021-01-30 02:44:54.994 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:45:00.645 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.33) 15.49ms
[I 2021-01-30 02:45:10.671 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.237) 8.84ms
[I 2021-01-30 02:45:20.697 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.33) 8.96ms
[I 2021-01-30 02:45:30.742 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.237) 15.10ms
[I 2021-01-30 02:45:40.781 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 9.08ms
[I 2021-01-30 02:45:50.815 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 8.82ms
[I 2021-01-30 02:45:54.994 JupyterHub proxy:320] Checking routes
[I 2021-01-30 02:46:00.916 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 15.60ms
[I 2021-01-30 02:46:10.940 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 9.06ms
[I 2021-01-30 02:46:20.974 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 8.86ms
[I 2021-01-30 02:46:31.008 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.33) 15.53ms
[I 2021-01-30 02:46:41.061 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.3.44) 9.07ms
[I 2021-01-30 02:46:51.109 JupyterHub log:181] 200 GET /hub/api/users/binder-examples-requirements-a94hfpwr (binder@::ffff:10.0.1.237) 8.79ms
[I 2021-01-30 02:46:54.993 JupyterHub proxy:320] Checking routes
[E 2021-01-30 02:46:56.915 JupyterHub user:690] Unhandled error starting binder-examples-requirements-a94hfpwr's server: pod/jupyter-binder-2dexamples-2drequirements-2da94hfpwr did not start in 300 seconds!
[I 2021-01-30 02:46:56.916 JupyterHub spawner:1955] Deleting pod jupyter-binder-2dexamples-2drequirements-2da94hfpwr
[E 2021-01-30 02:46:58.309 JupyterHub gen:623] Exception in Future <Task finished name='Task-558' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py:893> exception=TimeoutError('pod/jupyter-binder-2dexamples-2drequirements-2da94hfpwr did not start in 300 seconds!')> after timeout
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
    future.result()
  File "/usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py", line 900, in finish_user_spawn
    await spawn_future
  File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 708, in spawn
    raise e
  File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 607, in spawn
    url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
  File "/usr/local/lib/python3.8/dist-packages/kubespawner/spawner.py", line 1917, in _start
    await exponential_backoff(
  File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 179, in exponential_backoff
    raise TimeoutError(fail_message)
TimeoutError: pod/jupyter-binder-2dexamples-2drequirements-2da94hfpwr did not start in 300 seconds!

It sounds like JupyterHub’s KubeSpawner can’t access the Docker registry. I’m not familiar with how EKS service accounts work, but have you tried explicitly creating a Z2JH secret with the registry credentials? See the Configuration Reference in the Zero to JupyterHub with Kubernetes documentation.
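
As a rough sketch only (the exact key names depend on your BinderHub/Z2JH chart versions, so double-check them against the Configuration Reference), the registry secret config for GCR might look something like this in the BinderHub chart’s values, using GCR’s key-file login convention:

jupyterhub:
  imagePullSecret:
    create: true               # ask the chart to create the pull secret
    registry: gcr.io           # registry host the singleuser images are pulled from
    username: _json_key        # GCR's username convention for service-account key auth
    password: |                # paste the full service-account JSON key here
      <contents of the GCP service-account key JSON>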

That does seem to be the problem. It’s interesting because the hub pod has no problem reading or writing to the container registry. When I choose a previously built GitHub project, it knows it has been built before and attempts to launch it. This seems to be a problem only in the specific binder pod. I wasn’t able to find a workaround all week. I also tried starting a cluster and explicitly adding the GCR secret from the Z2BH tutorial to the EKS service account, without any luck. Any other suggestions for how to patch this? I’m leaning toward moving the project to Google Cloud 🙂

Note it’s the binder pod that handles building the image and pushing it to the registry. The hub pod then launches the image via kubespawner.

If you’ve got at least one singleuser image in your registry (either from BinderHub testing, or pushed there manually), it might be worth trying a Z2JH deployment on its own; see the Zero to JupyterHub with Kubernetes documentation.

Set singleuser.image.pullPolicy: Always and you can then play with your Z2JH registry credentials without the additional complexity of BinderHub.
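
For a BinderHub deployment, that option would normally sit under the top-level jupyterhub block, as a sibling of hub rather than nested inside it; a minimal sketch, assuming a recent chart layout:

jupyterhub:
  singleuser:
    image:
      pullPolicy: Always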

I have a similar error. Here, the image was built with content from GitHub, but the hub is failing to pull the image from the registry. I have tried the solution suggested by @manics, but the same issue persists. My question is: am I setting the values correctly? I set singleuser.image.pullPolicy: Always in my secret.yaml as follows:

jupyterhub:
  hub:
    singleuser:
      image:
        pullPolicy: Always
    services:
      binder:

I am building BinderHub on an Azure Kubernetes cluster. I have a namespace for everything BinderHub-related, and JupyterHub is in its own namespace on the same cluster. Would this be a problem?

Below is the pod description while the image is being pulled. Any help is greatly appreciated.

kubectl --namespace=uncg-bhub describe pod jupyter-jtande-2djhub-2dfile-2ddownload-2dewbwocfr
Name:         jupyter-jtande-2djhub-2dfile-2ddownload-2dewbwocfr
Namespace:    uncg-bhub
Priority:     0
Node:         aks-nodepool1-16125732-vmss000002/10.240.0.66
Start Time:   Tue, 25 May 2021 11:15:29 -0400
Labels:       app=jupyterhub
              chart=jupyterhub-0.11.1
              component=singleuser-server
              heritage=jupyterhub
              hub.jupyter.org/network-access-hub=true
              release=uncg-bhub
Annotations:  hub.jupyter.org/username: jtande-jhub-file-download-ewbwocfr
Status:       Pending
IP:           10.240.0.88
IPs:
  IP:  10.240.0.88
Init Containers:
  block-cloud-metadata:
    Container ID:  containerd://9d54095e2018474fa519dab2a149c856c5e4e0118139928c9be471125d14c9b6
    Image:         jupyterhub/k8s-network-tools:0.11.1
    Image ID:      docker.io/jupyterhub/k8s-network-tools@sha256:e562b1f91469d71e843aeece0500be249344e82154183f082492c7c51a636b52
    Port:          <none>
    Host Port:     <none>
    Command:
      iptables
      -A
      OUTPUT
      -d
      169.254.169.254
      -j
      DROP
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 25 May 2021 11:15:29 -0400
      Finished:     Tue, 25 May 2021 11:15:30 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:         <none>
Containers:
  notebook:
    Container ID:  
    Image:         uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9
    Image ID:      
    Port:          8888/TCP
    Host Port:     0/TCP
    Args:
      jupyter-notebook
      --ip=0.0.0.0
      --port=8888
      --NotebookApp.base_url=/user/jtande-jhub-file-download-ewbwocfr/
      --NotebookApp.token=1MODIFIED
      --NotebookApp.trust_xheaders=True
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Requests:
      memory:  1073741824
    Environment:
      JUPYTERHUB_API_TOKEN:           ed0MODIFIED
      JPY_API_TOKEN:                  ed0b245dfMODIFIED
      JUPYTERHUB_ADMIN_ACCESS:        1
      JUPYTERHUB_CLIENT_ID:           jupyterhub-user-jtande-jhub-file-download-ewbwocfr
      JUPYTERHUB_HOST:                
      JUPYTERHUB_OAUTH_CALLBACK_URL:  /user/jtande-jhub-file-download-ewbwocfr/oauth_callback
      JUPYTERHUB_USER:                jtande-jhub-file-download-ewbwocfr
      JUPYTERHUB_SERVER_NAME:         
      JUPYTERHUB_API_URL:             http://hub:8081/hub/api
      JUPYTERHUB_ACTIVITY_URL:        http://hub:8081/hub/api/users/jtande-jhub-file-download-ewbwocfr/activity
      JUPYTERHUB_BASE_URL:            /
      JUPYTERHUB_SERVICE_PREFIX:      /user/jtande-jhub-file-download-ewbwocfr/
      MEM_GUARANTEE:                  1073741824
      JUPYTER_IMAGE_SPEC:             uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9
      JUPYTER_IMAGE:                  uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9
      BINDER_REPO_URL:                https://github.com/jtande/jhub-file-download
      BINDER_REF_URL:                 https://github.com/jtande/jhub-file-download/tree/28754d8d6fd4be4b3b06df7445977ebbf54904e9
      BINDER_LAUNCH_HOST:             http://40.76.149.6/
      BINDER_PERSISTENT_REQUEST:      v2/gh/jtande/jhub-file-download/28754d8d6fd4be4b3b06df7445977ebbf54904e9
      BINDER_REQUEST:                 v2/gh/jtande/jhub-file-download.git/intro2python
    Mounts:                           <none>
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:            <none>
QoS Class:          Burstable
Node-Selectors:     <none>
Tolerations:        hub.jupyter.org/dedicated=user:NoSchedule
                    hub.jupyter.org_dedicated=user:NoSchedule
                    node.kubernetes.io/memory-pressure:NoSchedule
                    node.kubernetes.io/not-ready:NoExecute for 300s
                    node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                From                                        Message
  ----     ------     ----               ----                                        -------
  Normal   Scheduled  <unknown>                                                      Successfully assigned uncg-bhub/jupyter-jtande-2djhub-2dfile-2ddownload-2dewbwocfr to aks-nodepool1-16125732-vmss000002
  Normal   Pulled     97s                kubelet, aks-nodepool1-16125732-vmss000002  Container image "jupyterhub/k8s-network-tools:0.11.1" already present on machine
  Normal   Created    97s                kubelet, aks-nodepool1-16125732-vmss000002  Created container block-cloud-metadata
  Normal   Started    96s                kubelet, aks-nodepool1-16125732-vmss000002  Started container block-cloud-metadata
  Normal   BackOff    16s (x5 over 95s)  kubelet, aks-nodepool1-16125732-vmss000002  Back-off pulling image "uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9"
  Warning  Failed     16s (x5 over 95s)  kubelet, aks-nodepool1-16125732-vmss000002  Error: ImagePullBackOff
  Normal   Pulling    3s (x4 over 96s)   kubelet, aks-nodepool1-16125732-vmss000002  Pulling image "uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9"
  Warning  Failed     3s (x4 over 95s)   kubelet, aks-nodepool1-16125732-vmss000002  Failed to pull image "uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9": [rpc error: code = Unknown desc = failed to pull and unpack image "uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9": failed to resolve reference "uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9": failed to authorize: failed to fetch oauth token: unexpected status: 401 Unauthorized, rpc error: code = Unknown desc = failed to pull and unpack image "uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9": failed to resolve reference "uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9": failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized]
  Warning  Failed     3s (x4 over 95s)   kubelet, aks-nodepool1-16125732-vmss000002  Error: ErrImagePull

After digging into the AKS documentation, I was able to fix this issue. When you create an Azure service principal you grant it push and pull roles, but even though the service principal has those roles, your Azure cluster is not aware of them. Hence, after following the BinderHub recipe, your test image is built and pushed to the Azure Container Registry (ACR), but when JupyterHub (I think) tries to pull the image, it fails with the error

  ... failed to resolve reference "uncgbhubbuilder.azurecr.io/uncgbhubbuilder/deploy-jtande-2djhub-2dfile-2ddownload-c6d4b6:28754d8d6fd4be4b3b06df7445977ebbf54904e9": failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized]

You can solve this by attaching your ACR when you create your cluster (this assumes you already have a registry set up), as follows:

AKS_NAME=your_azure_cluster_name
ACR_NAME=youracrname
RG_NAME=your_resource_group_name

az aks create -n $AKS_NAME -g $RG_NAME \
   --generate-ssh-keys \
   --attach-acr $(az acr show -n $ACR_NAME --query "id" -o tsv)

If you already have a cluster set up on Azure as well as an existing ACR, then attach the ACR to your cluster as follows:

AKS_NAME=your_azure_cluster_name
ACR_NAME=youracrname
RG_NAME=your_resource_group_name

az aks update -n $AKS_NAME -g $RG_NAME \
   --attach-acr $(az acr show -n $ACR_NAME --query "id" -o tsv)
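
If your Azure CLI version supports it, az aks check-acr can then confirm that the cluster is able to pull from the registry (the flags here are from memory, so verify them with az aks check-acr --help):

# Validate that the cluster's identity can authenticate to and pull from the ACR
az aks check-acr -n $AKS_NAME -g $RG_NAME --acr $ACR_NAME.azurecr.io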

@LeBoss, you live up to your username! Thanks a bunch, this was exactly what was wrong with my setup

For me, the issue was that the ImagePullSecret was not being added to the jupyter pods for some reason. For GCR, this link was helpful for explaining how to create an ImagePullSecret and add it to a ServiceAccount. Doing this in Terraform is explained here.
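
For reference, the manual kubectl equivalent of that approach looks roughly like the following; the secret name, namespace, service account, and key-file path are all placeholders:

# Create a docker-registry secret from a GCP service-account key
# (_json_key is GCR's username convention for key-file auth)
kubectl create secret docker-registry gcr-pull-secret \
  --docker-server=https://gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat /path/to/service-account-key.json)" \
  --namespace=<namespace>

# Reference the secret from the ServiceAccount the singleuser pods use (often "default")
kubectl patch serviceaccount default \
  --namespace=<namespace> \
  --patch '{"imagePullSecrets": [{"name": "gcr-pull-secret"}]}'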

Hello, I have a similar problem in a GKE environment.

I had a working JupyterHub/BinderHub release, a version from June 2020, with HTTPS and CORS set up. I’ve now upgraded to the latest version (15-Nov-2021). I first had to migrate the release from Helm 2 to Helm 3, then tweaked my config.yaml to make the upgrade work. I can now build fine, but I can’t launch any image; I only get ‘Launch attempt 1 failed, retrying…’ and so on. I’ve tried many changes in my config but I can’t make it work. The built images can be pulled and run with Docker without problems.

Thanks to the valuable hints in this topic, I’ve been able to dig out some logs. I can see:

ConnectionRefusedError(111, 'Connection refused')

originating from this http call

200 GET /v2/gh/i386uk/kernel_py36_conda/master_v5 (anonymous@<my binderhub ip>) 675.69ms

What is going on? Any advice to get more debugging info would also be helpful. Thanks.

Hi! Unfortunately single log messages aren’t enough to diagnose an error; there’s usually a lot of context before and after.
Can you show us the logs for BinderHub and JupyterHub for the whole launch process? If you’re not sure, restart everything and show us the whole set of logs. It might also be worth turning on debug logging. Please also show us the pod description for the failed pod.
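
As an aside, debug logging can usually be enabled through the chart config; the keys below are a sketch to verify against your BinderHub and Z2JH chart versions rather than a definitive reference:

config:
  BinderHub:
    debug: true        # assumed BinderHub option for verbose logging
jupyterhub:
  debug:
    enabled: true      # Z2JH option that raises the hub log level to DEBUG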

Hi, thanks for the quick reply. I can’t show you more logs because in the meantime I went for the classic solution: I deleted the whole deployment and reinstalled everything.

I followed (again) the instructions at zero-to-binderhub and was able to get a working JupyterHub/BinderHub setup under plain HTTP.

Then I tried to enable HTTPS, and failed. After a few hurdles, all I’ve got is a BinderHub web page giving a NET::ERR_CERT_AUTHORITY_INVALID error, and I can’t access the JupyterHub URL because its certificate is not ready (Message: Issuing certificate as Secret does not exist).
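
Since the “Issuing certificate as Secret does not exist” message comes from cert-manager, the certificate objects themselves are probably the first thing to inspect (the name and namespace below are placeholders):

kubectl get certificates,certificaterequests,orders,challenges --all-namespaces
kubectl describe certificate <certificate-name> --namespace=<namespace>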

As other people have pointed out, the documentation is pretty old. Apart from waiting for a doc update, what is the best course of action? Should I open a new topic with the details of my failed deployment?

Thanks

I’ve created a new topic for this. Any hints are very welcome. Thank you.