Hi all, I’m trying to deploy JupyterHub with a custom docker image. I continue to get the error “timed out waiting for the condition”. I would appreciate some help in further debugging what could be going on. Thank you!
Regarding the infra:
- Using AWS EKS
- Docker image hosted on AWS ECR
I followed the instructions and put the following in my config.yaml:
singleuser:
  image:
    name: XXX.dkr.ecr.us-west-2.amazonaws.com/XXXX
    tag: latest
However, I always get the following:
Error: timed out waiting for the condition
I have passed --wait --timeout 12000 to the helm command. Still no dice. Here are what I think are the relevant bits from the kubectl logs:
[kube] 2019/05/20 03:39:14 Watching for changes to Job hook-image-awaiter with timeout of 20m0s
[kube] 2019/05/20 03:39:14 Add/Modify event for hook-image-awaiter: ADDED
[kube] 2019/05/20 03:39:14 hook-image-awaiter: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
[kube] 2019/05/20 03:39:14 Add/Modify event for hook-image-awaiter: MODIFIED
[kube] 2019/05/20 03:39:14 hook-image-awaiter: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[tiller] 2019/05/20 03:59:14 warning: Release jhub pre-upgrade jupyterhub/templates/image-puller/job.yaml could not complete: timed out waiting for the condition
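For context, a Helm 2 style upgrade with those flags would look roughly like this. This is a sketch, not the exact command: Helm 2 is assumed because the logs mention tiller, the release name jhub and chart version 0.8.0 are taken from elsewhere in this thread, and --timeout is in seconds.

```shell
# Sketch of the upgrade invocation; requires Helm 2 configured against the
# cluster. Release name, namespace, and chart version come from this thread.
helm upgrade jhub jupyterhub/jupyterhub \
  --namespace jhub \
  --version 0.8.0 \
  --values config.yaml \
  --wait \
  --timeout 12000   # Helm 2 timeouts are given in seconds
```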
Heya! Thank you for posting!
This is usually because the image you are using either is not found or your EKS cluster is not authorized to pull it. Another possibility is that you are out of quota for a needed resource, such as external IP addresses or nodes that are large enough.
- Does it work if you don’t try to use your custom image?
- What objects are in the namespace you tried deploying to? Running
kubectl -n <namespace> get pod
should provide useful information. If any are in a non-Running state, running kubectl -n <namespace> describe pod <pod-name>
would provide even more useful information.
If this doesn’t help, post the contents of your config.yaml file along with the output of the commands above; that will make it easier for someone to help.
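Taken together, the debugging sequence looks like this. It is a sketch against a live cluster: the jhub namespace is the one used later in this thread, and the events query is an extra step that often helps:

```shell
# Requires kubectl configured for the EKS cluster; "jhub" is the
# namespace used in this thread.
kubectl -n jhub get pod                                 # list pods and their phases
kubectl -n jhub describe pod <pod-name>                 # detailed state + events for one pod
kubectl -n jhub get events --sort-by='.lastTimestamp'   # recent namespace events, newest last
```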
Good luck!
Thanks for the quick reply. Here is the experiment I ran to rule out issues that are permission related:
- I deployed a stock image from the JupyterHub registry (jupyter-datascience). This was successful.
- Then I pulled that Docker image, changed the tag, and pushed it to my own ECR. I ran helm upgrade with that image, and that succeeded. Here’s the relevant part of the config.yaml:
singleuser:
  image:
    name: 791598104349.dkr.ecr.us-west-2.amazonaws.com/jupyter-datascience-notebook
    tag: latest
- Then, I changed the name to point to the docker image I actually want to use:
singleuser:
  image:
    name: 791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter
    tag: latest
Now I run into the problem. The one thing I know of that may be an issue: the image I’m trying to use is huge. 2.5 GB huge. That said, I passed in a very large timeout (20 minutes). I would have thought that’d be enough?
In any case, based on what I’ve been able to isolate, this doesn’t seem to be a permission issue, or an issue with the state of the cluster.
Are there other ways to get more information on what aspect is timing out?
Thank you!
Thanks for testing! This could still be a permissions problem - maybe the other Docker image is accessible, but the one you want to use is not?
Step (2) from my answer above (the describe pod command) should give you more information. Providing that, along with your config.yaml file, would be useful.
Thanks.
Ah! The “describe pod” command is useful! Thank you! I’m a Kubernetes newbie and haven’t explored that command yet.
I found this in the list of events… It seems I don’t have /bin/sh in the image. Is there a way to customize the spec so it doesn’t execute that command?
Warning Failed 13m (x5 over 14m) kubelet, ip-192-168-80-136.us-west-2.compute.internal Error: failed to start container "image-pull-singleuser": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory": unknown
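A side note on the failure mode: exit code 127 ("command not found") is exactly what exec-ing a missing binary produces, which matches the Exit Code the kubelet reports below. A minimal local illustration, plus a hedged way to check the image itself (the docker command requires Docker and ECR credentials; the image name is the one from this thread):

```shell
# Exec-ing a path that does not exist fails with exit code 127 -- the same
# code reported for the image-pull-singleuser container.
/nonexistent/sh -c true 2>/dev/null; echo "exit code: $?"

# To check the actual image (requires Docker and ECR credentials):
#   docker run --rm --entrypoint /bin/sh \
#     791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter:latest -c 'echo ok'
```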
Here’s the config.yaml:
proxy:
  secretToken: "xxxxx"
auth:
  type: google
  google:
    clientId: "xxxx-gcja1uuqrjk9l7l9viarcodfs64uv7sf.apps.googleusercontent.com"
    clientSecret: "xxxx-w"
    callbackUrl: "http://xxxx-162851808.us-west-2.elb.amazonaws.com/hub/oauth_callback"
    hostedDomain: "relational.ai"
    loginService: "Relational AI"
singleuser:
  image:
    name: 791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter
    tag: latest
Here’s the full output on one of the hook-image-puller pods:
$ kubectl -n jhub describe pod hook-image-puller-8w5t2
Name: hook-image-puller-8w5t2
Namespace: jhub
Priority: 0
PriorityClassName: <none>
Node: ip-192-168-80-136.us-west-2.compute.internal/192.168.80.136
Start Time: Mon, 20 May 2019 10:59:41 -0700
Labels: app=jupyterhub
component=hook-image-puller
controller-revision-hash=6f54b45dbc
pod-template-generation=1
release=jhub
Annotations: <none>
Status: Pending
IP: 192.168.73.200
Controlled By: DaemonSet/hook-image-puller
Init Containers:
image-pull-singleuser:
Container ID: docker://b99f5a108c9ae6928424377a18fdccd9b14b05ac216776f37151c48b52b5c7ec
Image: 791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter:latest
Image ID: docker-pullable://791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter@sha256:0ca9af090ad6edd4687bfcbfa0a9c4626555bc2b4279d9f053309ea5c3bd6381
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
echo "Pulling complete"
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory": unknown
Exit Code: 127
Started: Mon, 20 May 2019 11:10:29 -0700
Finished: Mon, 20 May 2019 11:10:29 -0700
Ready: False
Restart Count: 7
Environment: <none>
Mounts: <none>
image-pull-metadata-block:
Container ID:
Image: jupyterhub/k8s-network-tools:0.8.0
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
echo "Pulling complete"
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts: <none>
Containers:
pause:
Container ID:
Image: gcr.io/google_containers/pause:3.0
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts: <none>
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes: <none>
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: hub.jupyter.org/dedicated=user:NoSchedule
hub.jupyter.org_dedicated=user:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 15m default-scheduler Successfully assigned jhub/hook-image-puller-8w5t2 to ip-192-168-80-136.us-west-2.compute.internal
Normal Pulled 13m (x5 over 14m) kubelet, ip-192-168-80-136.us-west-2.compute.internal Container image "791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter:latest" already present on machine
Normal Created 13m (x5 over 14m) kubelet, ip-192-168-80-136.us-west-2.compute.internal Created container
Warning Failed 13m (x5 over 14m) kubelet, ip-192-168-80-136.us-west-2.compute.internal Error: failed to start container "image-pull-singleuser": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory": unknown
Warning BackOff 4m52s (x47 over 14m) kubelet, ip-192-168-80-136.us-west-2.compute.internal Back-off restarting failed container
Cool! Glad you found it useful. Lots of fun things to learn!
If I understand this correctly, your custom image does not have /bin/sh in it. Is that right? Is there a specific reason it was removed? It’s present in almost all images, and its absence may point to other problems that will prevent the image from running with JupyterHub. Do you have a Dockerfile you can share with us?
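For what it’s worth, if the image is intentionally shell-less (for example, built FROM scratch or on a distroless base), a minimal sketch of a fix is to copy in a statically linked busybox as /bin/sh. The base image name below is a placeholder, not a real image:

```dockerfile
# Hypothetical sketch: provide /bin/sh in a shell-less image by copying
# busybox's statically linked binary; when invoked as "sh", busybox acts
# as a POSIX shell. "your-shell-less-base" is a placeholder.
FROM busybox:1.31 AS shell-donor

FROM your-shell-less-base:latest
COPY --from=shell-donor /bin/busybox /bin/sh
```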