Binder deployed in AWS EKS - domain name resolution errors

Hi,

I installed BinderHub into my k8s cluster using the directions here: https://binderhub.readthedocs.io/en/latest/setup-binderhub.html. I did the install as written, then tried a more recent binderhub release (version=0.2.0-612ade7) to see if that might fix my problem, but it did not.

When I attempt to build and launch a repo, I get an error in the Build logs, which are included at the end of this post.

Does anyone have any advice on this? I tried launching a BusyBox pod in my cluster to ping archive.ubuntu.com, and connectivity was OK.
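
For reference, the connectivity check was roughly the following (a sketch from memory; the exact image and flags may have differed slightly):

# Throwaway BusyBox pod that pings the Ubuntu archive and is cleaned up afterwards
kubectl run busybox-test --rm -it --image=busybox --restart=Never -- ping -c 3 archive.ubuntu.com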

TIA,
Dave Benham
Purdue University


Waiting for build to start...
Picked Git content provider.
Cloning into '/tmp/repo2dockerwbam9_7o'...
HEAD is now at 24f42ee Add files via upload
Building conda environment for python=3.7
Using PythonBuildPack builder
Step 1/42 : FROM buildpack-deps:bionic
 ---> d69026b2a83e
Step 2/42 : ENV DEBIAN_FRONTEND noninteractive
 ---> Using cache
 ---> a29299f3738d
Step 3/42 : RUN apt-get -qq update &&     apt-get -qq install --yes --no-install-recommends locales > /dev/null &&     apt-get -qq purge &&     apt-get -qq clean &&     rm -rf /var/lib/apt/lists/*
 ---> Running in b29b6f864577
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-backports/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/bionic-security/InRelease  Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
E: Package 'locales' has no installation candidate
Removing intermediate container b29b6f864577
The command '/bin/sh -c apt-get -qq update &&     apt-get -qq install --yes --no-install-recommends locales > /dev/null &&     apt-get -qq purge &&     apt-get -qq clean &&     rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100

Great to hear others are trying to set up their own BinderHubs :heart_eyes:!

That is an interesting failure, especially given that you can ping/resolve archive.ubuntu.com.

Have you (re(re))tried the build just in case it fixed itself between when you ran the build and the busybox test?

I don’t think the helm chart does anything by default to limit the network connectivity of build pods. Did you do anything with respect to network policies or such? Could you kubectl describe pod <buildpodnamehere> the running build pod to see if that contains anything interesting?
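
For example, something along these lines (generic kubectl commands; the namespace and pod name are placeholders):

# Check whether any NetworkPolicies exist that could restrict egress from build pods
kubectl get networkpolicies --all-namespaces

# Inspect the build pod while it is still running
kubectl describe pod <buildpodnamehere> -n <namespace>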

TL;DR: I don’t have a good idea right now so fishing around a bit.

Betatim, thanks for the reply. I love the idea behind BinderHub, so I’m eager to get it up and running properly as soon as I can.

I could re-re-try the build and see if that works; it seems like an optimistic shot in the dark, but one I’m willing to take.

I can’t do a kubectl describe pod; the pod only exists for several seconds and then disappears. I can barely see the pod register itself before it is gone.
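
About the best I can do is watch for it as it appears and pull events after the fact, e.g. (generic kubectl commands, nothing BinderHub-specific):

# Watch pods appear and disappear across all namespaces
kubectl get pods --all-namespaces --watch

# Events usually survive for a while after the pod itself is gone
kubectl get events --sort-by=.metadata.creationTimestamp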

Did you do anything to configure docker-in-docker (DIND) pods? If that doesn’t ring a bell, the answer is no.

I think by default the builds end up using the docker daemon on the node that the pod runs on. I think you can get access to that by ssh'ing to a node, running docker build somedockerfilethatdoessomethinglikethefailingsteps, and seeing what you see.

Is this a local kubernetes cluster or one at a cloud provider? fishing

Thanks again for getting back to me.

I ssh’ed to a host in my cluster. I’m running k8s on AWS EKS, and all my nodes are Amazon Linux, a RedHat derivative.

I constructed a small Dockerfile on the host I ssh’d to:

FROM buildpack-deps:bionic
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get -qq update && apt-get -qq install --yes --no-install-recommends locales > /dev/null && apt-get -qq purge && apt-get -qq clean && rm -rf /var/lib/apt/lists/*q

Then I ran docker build . and got the following output:

docker build .
Sending build context to Docker daemon 6.656kB
Step 1/3 : FROM buildpack-deps:bionic
 ---> d69026b2a83e
Step 2/3 : ENV DEBIAN_FRONTEND noninteractive
 ---> Using cache
 ---> a29299f3738d
Step 3/3 : RUN apt-get -qq update && apt-get -qq install --yes --no-install-recommends locales > /dev/null && apt-get -qq purge && apt-get -qq clean && rm -rf /var/lib/apt/lists/*q
 ---> Running in 91f1c85a5b34
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic/InRelease  Temporary failure resolving 'archive.ubuntu.com'


NOTE: additional lines similar to the one above are omitted here; I couldn’t include them because the forum wouldn’t let a new user post more than two links, and each error line contains two URLs.

E: Package 'locales' has no installation candidate
The command '/bin/sh -c apt-get -qq update && apt-get -qq install --yes --no-install-recommends locales > /dev/null && apt-get -qq purge && apt-get -qq clean && rm -rf /var/lib/apt/lists/*q' returned a non-zero code: 100

I think this means that something with the Kubernetes cluster and its networking isn’t set up properly. I don’t have any experience with setting up clusters though, so I’m not really sure where to point you :-/

Looks like a DNS issue when resolving archive.ubuntu.com.

What happens when running the following command on the machine?

docker run busybox nslookup archive.ubuntu.com

[ec2-user@ip-192-168-141-71 ~]$ docker run busybox nslookup archive.ubuntu.com
nslookup: can't connect to remote host (192.168.0.2): Network is unreachable

I’m not sure where that address is coming from or what should be there. I’m a bit of a k8s newbie.

I am currently working on spinning up BinderHub on AWS EKS and have encountered the same issue.

I’ve found that SSHing directly into an EC2 node that is part of a NodeGroup and then running a docker container directly on the host is definitely different from how a Kubernetes deployment would run the same docker container. My understanding is that there is a VPC CNI networking layer that allows containers to talk to each other and the outside world (as opposed to a docker0/overlay interface as there would be for a local docker deployment).
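
One way to see that difference from the node itself (just a sketch; exact interface names and output will vary by AMI and CNI version):

# List the docker networks on the node; on recent EKS-optimized AMIs the default bridge is disabled
docker network ls

# docker0 may be missing or down, while the VPC CNI attaches its own interfaces for pods
ip addr show docker0
ip link | grep -E 'eni|eth'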

So, even though

[ec2-user@ip-192-168-141-71 ~]$ docker run busybox nslookup archive.ubuntu.com
nslookup: can't connect to remote host (192.168.0.2): Network is unreachable

is the same behaviour I observe, I can successfully run a k8s Job. For example, creating a file called job-nslookup.yaml with the contents

apiVersion: batch/v1
kind: Job
metadata:
  name: nslookup-test
spec:
  template:
    spec:
      containers:
      - name: nslookup-test
        image: busybox
        command: ["nslookup",  "-type=a", "archive.ubuntu.com"]
      restartPolicy: Never
  backoffLimit: 4

to get

$ kubectl apply -f job-nslookup.yaml
job.batch/nslookup-test created
$ kubectl get pods -l job-name=nslookup-test
NAME                  READY   STATUS      RESTARTS   AGE
nslookup-test-wnrw8   0/1     Completed   0          28s
$ kubectl logs nslookup-test-wnrw8
Server:	10.100.0.10
Address:	10.100.0.10:53

Non-authoritative answer:
Name:	archive.ubuntu.com
Address: 91.189.88.162
Name:	archive.ubuntu.com
Address: 91.189.88.152
Name:	archive.ubuntu.com
Address: 91.189.91.23
Name:	archive.ubuntu.com
Address: 91.189.88.149
Name:	archive.ubuntu.com
Address: 91.189.88.161

as I expect. This confirms, to me at least, that DNS is working within a k8s-managed container.
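
Another quick comparison that highlights the two different resolvers in play (the pod name here is just a placeholder):

# Inside a k8s-managed pod: resolv.conf points at the cluster DNS service (10.100.0.10 above)
kubectl run resolv-test --rm -it --image=busybox --restart=Never -- cat /etc/resolv.conf

# On the node itself: resolv.conf points at the VPC resolver (192.168.0.2 above)
cat /etc/resolv.conf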

My next thought is to explore whether there is anything with the docker-in-docker (DIND) pods that could be the issue. I’d be happy to hear whether @dbenham has managed to get this resolved.

I think I’ve found a likely source for the error:

The docker bridge network is now disabled by default in EKS AMI images.
To confirm, we can force docker to use the node's network and get successful name resolution:

[ec2-user@ip-192-168-70-145 ~]$ docker run --network=host --rm -it  busybox nslookup -type=a archive.ubuntu.com
Server:		192.168.0.2
Address:	192.168.0.2:53

Non-authoritative answer:
Name:	archive.ubuntu.com
Address: 91.189.88.152
Name:	archive.ubuntu.com
Address: 91.189.88.162
Name:	archive.ubuntu.com
Address: 91.189.91.23
Name:	archive.ubuntu.com
Address: 91.189.88.149
Name:	archive.ubuntu.com
Address: 91.189.88.161

However, a new --enable-docker-bridge bootstrap argument was added that is supposed to restore the previous behaviour.

I am currently searching for the correct way to pass in --enable-docker-bridge with eksctl (I know how to do it with CloudFormation but I figure there has to be a way of passing this option in when the nodegroup is created with eksctl).
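
For reference, with plain CloudFormation the flag goes into the node group's bootstrap arguments, roughly like this (a sketch from memory using a hypothetical cluster name; check the amazon-eks-ami docs for the exact form):

# In the worker node UserData / BootstrapArguments
/etc/eks/bootstrap.sh my-eks-cluster --enable-docker-bridge true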

Success!

As far as I can tell, eksctl does not (yet?) have support for --enable-docker-bridge. I am not even sure that the bootstrap.sh script that would be used for an AMI through CloudFormation (used when setting up EKS without eksctl) is the same bootstrapping script used by eksctl. eksctl does support both the concepts of overrideBootstrapScript and preBootstrapCommand, though. So after several false starts, I’ve added the following to the configuration YAML that I use to spin up my eksctl-controlled cluster:

preBootstrapCommand:
  # Replicate what --enable-docker-bridge does in /etc/eks/bootstrap.sh:
  # enabling the docker bridge network. We have to disable live-restore as it
  # prevents docker from recreating the default bridge network on restart.
  - "cp /etc/docker/daemon.json /etc/docker/daemon_backup.json"
  - "echo -e '.bridge=\"docker0\" | .\"live-restore\"=false' > /etc/docker/jq_script"
  - "jq -f /etc/docker/jq_script /etc/docker/daemon_backup.json | tee /etc/docker/daemon.json"
  - "systemctl restart docker"

The shell commands are a bit longer than I would have expected, but it seemed that I needed to make a copy of daemon.json (otherwise I would end up with a blank file), and there were some odd quotes in strings that needed to be handled.
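
For the curious, the jq filter just rewrites two keys in the daemon config. Demonstrated here on a minimal stand-in for the stock daemon.json (the real file on the AMI contains more keys):

# The stock config disables the bridge; the filter re-enables it and turns off live-restore
echo '{"bridge": "none", "live-restore": true}' | jq '.bridge="docker0" | ."live-restore"=false'
# => { "bridge": "docker0", "live-restore": false }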

But, overall, success! I’ve now got a skeleton BinderHub up and running on AWS using EKS and eksctl.

Whoop! I love that, thanks to the forum, someone who isn’t a JupyterHub maintainer could help solve a problem for someone else :grinning: Way to go!

One question: should we update the title to something more descriptive? It seems like it is about resolving names, dind, and some AWS tooling.

jmunroe, thanks for the help. I haven’t had a chance to try your suggestion yet; I look forward to trying it next week and letting you know how it goes.

I just ran into this issue. I’m not deploying BinderHub, but I am deploying an app that uses repo2docker. FWIW, I’m using Terraform, which is pretty nice, and for Terraform the workaround is here:
