Custom docker image on EKS -- timed out waiting for condition

Hi all, I’m trying to deploy JupyterHub with a custom docker image. I continue to get the error “timed out waiting for the condition”. I would appreciate some help in further debugging what could be going on. Thank you!

Regarding the infra:

  • Using AWS EKS,
  • Docker image using AWS ECR

I followed the instructions and put the following in my config.yaml:
singleuser:
  image:
    name: XXX.dkr.ecr.us-west-2.amazonaws.com/XXXX
    tag: latest

However, I always get the following:
Error: timed out waiting for the condition

I have put --wait --timeout 12000 in the helm command. Still no dice. Here are what I think are the relevant bits from kubectl log:

[kube] 2019/05/20 03:39:14 Watching for changes to Job hook-image-awaiter with timeout of 20m0s
[kube] 2019/05/20 03:39:14 Add/Modify event for hook-image-awaiter: ADDED
[kube] 2019/05/20 03:39:14 hook-image-awaiter: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
[kube] 2019/05/20 03:39:14 Add/Modify event for hook-image-awaiter: MODIFIED
[kube] 2019/05/20 03:39:14 hook-image-awaiter: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[tiller] 2019/05/20 03:59:14 warning: Release jhub pre-upgrade jupyterhub/templates/image-puller/job.yaml could not complete: timed out waiting for the condition

Heya! Thank you for posting :slight_smile:

This is usually because the image you are using is either not found or your EKS cluster is not authorized to pull it. Another possibility is that you are out of quota for a resource that is needed, such as external IP addresses or nodes that are large enough.

  1. Does it work if you don’t try and use your custom image?
  2. What objects are in the namespace you tried deploying to? Running kubectl -n <namespace> get pod should provide useful information. If any are in non-Running state, running kubectl -n <namespace> describe pod <pod-name> would provide even more useful information.
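
As a rough sketch of step 2, the check for pods in a non-Running state could be scripted as below. The helper name `non_running_pods` is made up for illustration, and it assumes the default `kubectl get pod` table layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: reads `kubectl get pod` table output on stdin and
# prints the NAME and STATUS of every pod whose STATUS is not "Running".
# Assumes the default column layout: NAME READY STATUS RESTARTS AGE.
non_running_pods() {
  # NR > 1 skips the header row; $1 is NAME, $3 is STATUS.
  awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}

# Usage (replace <namespace> and <pod-name> with your own values):
#   kubectl -n <namespace> get pod | non_running_pods
#   kubectl -n <namespace> describe pod <pod-name>
```

Any pod this prints is a good candidate for `kubectl describe pod`, whose Events section usually explains why the pod is stuck.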

If this doesn’t help, post the contents of your config.yaml file along with outputs from above commands, and that would make it easier for someone to help.

Good luck!

Thanks for the quick reply. Here is the experiment I ran to rule out issues that are permission related:

  • I installed a custom image from the jupyter hub registry (jupyter-datascience). This was successful
  • Then, I pulled that docker image, changed the tag and pushed to my own ECR. I then ran helm upgrade on that image, and that succeeded. Here’s the relevant part of the config.yaml

singleuser:
  image:
    name: 791598104349.dkr.ecr.us-west-2.amazonaws.com/jupyter-datascience-notebook
    tag: latest

  • Then, I changed the name to point to the docker image I actually want to use:

singleuser:
  image:
    name: 791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter
    tag: latest

Now I run into the problem. The one thing I know of that may be an issue is that the image I’m trying to use is huge. 2.5G huge. That said, I passed in a very large timeout (20 minutes). I would have thought that’d be enough?

In any case, based on what I’ve been able to isolate, this does not seem to be a permission issue or an issue with the state of the cluster.

Are there other ways to get more information on what aspect is timing out?

Thank you!

Thanks for testing! This could still be a permissions problem - maybe the other docker image is available but the one you wanna use is not?

(2) from my answer above should give you more information. Providing that, along with your config.yaml file, would be useful.

Thanks.

Ah! The “describe pod” command is useful! Thank you! I’m a kubernetes newbie and haven’t explored that command yet :smiley:

I found this in the list of events… Seems I don’t have /bin/sh in the image… Is there a way to customize the spec to not execute that command?

Warning Failed 13m (x5 over 14m) kubelet, ip-192-168-80-136.us-west-2.compute.internal Error: failed to start container "image-pull-singleuser": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory": unknown

Here’s the config.yaml

proxy:
  secretToken: "xxxxx"

auth:
  type: google
  google:
    clientId: "xxxx-gcja1uuqrjk9l7l9viarcodfs64uv7sf.apps.googleusercontent.com"
    clientSecret: "xxxx-w"
    callbackUrl: "http://xxxx-162851808.us-west-2.elb.amazonaws.com/hub/oauth_callback"
    hostedDomain: "relational.ai"
    loginService: "Relational AI"

singleuser:
  image:
    name: 791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter
    tag: latest

Here’s the full output on one of the hook-image-puller pods:

$ kubectl -n jhub describe pod hook-image-puller-8w5t2
Name:               hook-image-puller-8w5t2
Namespace:          jhub
Priority:           0
PriorityClassName:  <none>
Node:               ip-192-168-80-136.us-west-2.compute.internal/192.168.80.136
Start Time:         Mon, 20 May 2019 10:59:41 -0700
Labels:             app=jupyterhub
                    component=hook-image-puller
                    controller-revision-hash=6f54b45dbc
                    pod-template-generation=1
                    release=jhub
Annotations:        <none>
Status:             Pending
IP:                 192.168.73.200
Controlled By:      DaemonSet/hook-image-puller
Init Containers:
  image-pull-singleuser:
    Container ID:  docker://b99f5a108c9ae6928424377a18fdccd9b14b05ac216776f37151c48b52b5c7ec
    Image:         791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter:latest
    Image ID:      docker-pullable://791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter@sha256:0ca9af090ad6edd4687bfcbfa0a9c4626555bc2b4279d9f053309ea5c3bd6381
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      echo "Pulling complete"
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory": unknown
      Exit Code:    127
      Started:      Mon, 20 May 2019 11:10:29 -0700
      Finished:     Mon, 20 May 2019 11:10:29 -0700
    Ready:          False
    Restart Count:  7
    Environment:    <none>
    Mounts:         <none>
  image-pull-metadata-block:
    Container ID:  
    Image:         jupyterhub/k8s-network-tools:0.8.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      echo "Pulling complete"
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:         <none>
Containers:
  pause:
    Container ID:   
    Image:          gcr.io/google_containers/pause:3.0
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:         <none>
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:            <none>
QoS Class:          BestEffort
Node-Selectors:     <none>
Tolerations:        hub.jupyter.org/dedicated=user:NoSchedule
                    hub.jupyter.org_dedicated=user:NoSchedule
                    node.kubernetes.io/disk-pressure:NoSchedule
                    node.kubernetes.io/memory-pressure:NoSchedule
                    node.kubernetes.io/not-ready:NoExecute
                    node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type     Reason     Age                   From                                                   Message
  ----     ------     ----                  ----                                                   -------
  Normal   Scheduled  15m                   default-scheduler                                      Successfully assigned jhub/hook-image-puller-8w5t2 to ip-192-168-80-136.us-west-2.compute.internal
  Normal   Pulled     13m (x5 over 14m)     kubelet, ip-192-168-80-136.us-west-2.compute.internal  Container image "791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter:latest" already present on machine
  Normal   Created    13m (x5 over 14m)     kubelet, ip-192-168-80-136.us-west-2.compute.internal  Created container
  Warning  Failed     13m (x5 over 14m)     kubelet, ip-192-168-80-136.us-west-2.compute.internal  Error: failed to start container "image-pull-singleuser": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory": unknown
  Warning  BackOff    4m52s (x47 over 14m)  kubelet, ip-192-168-80-136.us-west-2.compute.internal  Back-off restarting failed container

cool! Glad you found it useful :slight_smile: Lots of fun things to learn!

If I understand this correctly, your custom image does not have /bin/sh in it. Is that right? Is there a specific reason it has been removed? It’s present in almost all images. Its absence might also mean there are other problems with the image that will prevent it from running with JupyterHub. Do you have a Dockerfile you can share with us?
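
One way to confirm this outside the cluster is to run the same command the image puller uses (/bin/sh -c echo "Pulling complete", as shown in the describe output) directly against the image. This is only a sketch: the helper name `image_has_sh` is made up, and it assumes Docker is installed locally and able to pull the image from ECR.

```shell
# Hypothetical helper: exits 0 if the image can start /bin/sh, non-zero otherwise.
# It mirrors the command the hook-image-puller init container runs.
image_has_sh() {
  image="$1"
  # --entrypoint overrides any ENTRYPOINT baked into the image, so the
  # result depends only on whether /bin/sh exists and is executable.
  docker run --rm --entrypoint /bin/sh "$image" -c 'echo "Pulling complete"' >/dev/null 2>&1
}

# Usage:
#   if image_has_sh 791598104349.dkr.ecr.us-west-2.amazonaws.com/delve-jupyter:latest; then
#     echo "/bin/sh is present"
#   else
#     echo "/bin/sh is missing or cannot run"
#   fi
```

If this fails for your image but succeeds for the jupyter-datascience-notebook image you pushed earlier, the difference is in the image contents rather than in ECR permissions.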