Z2jh bad gateway 502 (connect() failed in the ingresscontroller logs)

Hi,
All pods, svc, ingress are running but I get a bad gateway error when I try to reach the website in browser. There are some problems in the logs of my ingress nginx controller kubectl logs ingress-nginx-controller-76df688779-bxvjn:

2024/01/26 11:41:57 [error] 574#574: *1211328 connect() failed (113: Host is unreachable) while connecting to upstream, client: 10.42.0.1, server: mydomain, request: "GET / HTTP/2.0", upstream: "http://10.42.0.32:8000/", host: "mydomain"
10.42.0.1 - - [26/Jan/2024:11:41:57 +0000] "GET / HTTP/2.0" 502 552 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36" 647 0.001 [default-proxy-public-http] [] 10.42.0.32:8000, 10.42.0.32:8000, 10.42.0.32:8000 0, 0, 0 0.000, 0.000, 0.001 502, 502, 502 e9832cac970ee7d7ecc0d5e2298f6c7a
2024/01/26 11:41:57 [error] 574#574: *1211328 connect() failed (113: Host is unreachable) while connecting to upstream, client: 10.42.0.1, server: mydomain, request: "GET /favicon.ico HTTP/2.0", upstream: "http://10.42.0.32:8000/favicon.ico", host: "mydomain", referrer: "mydomain"

Here is the result of k get ingress:

kubectl get ingress
NAME         CLASS   HOSTS                     ADDRESS       PORTS     AGE
jupyterhub   nginx   mydomain   10.1.253.32   80, 443   5h21m

Would you have any idea what’s going wrong here? and how can I fix it?
best

Here is the result of ‘dig’:

kubectl run dig-container --image=alpine:latest --rm -it --restart=Never --command -- /bin/sh -c 'apk add --no-cache bind-tools && dig proxy-public.default.svc.cluster.local'


OK: 15 MiB in 29 packages

; <<>> DiG 9.18.19 <<>> proxy-public.default.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25347
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 2b5ee6f72e018e04 (echoed)
;; QUESTION SECTION:
;proxy-public.default.svc.cluster.local.        IN A

;; ANSWER SECTION:
proxy-public.default.svc.cluster.local. 5 IN A  10.43.137.16

;; Query time: 0 msec
;; SERVER: 10.43.0.10#53(10.43.0.10) (UDP)
;; WHEN: Fri Jan 26 14:09:54 UTC 2024
;; MSG SIZE  rcvd: 133

It’s strange. Why does it try to reach 10.42.0.1 or 10.43.0.1 I have no jupyterhub service or ep there:

$ kubectl get ep
NAME                                            ENDPOINTS          AGE
kubernetes                                      10.1.253.32:6443   26h
proxy-public                                    10.42.0.32:8000    5h42m
proxy-api                                       10.42.0.32:8001    5h42m
hub                                             10.42.0.34:8081    5h42m
cluster.local-nfs-subdir-external-provisioner   <none>             26h
$ kubectl get svc -o wide
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE     SELECTOR
kubernetes     ClusterIP   10.43.0.1       <none>        443/TCP    26h     <none>
hub            ClusterIP   10.43.83.95     <none>        8081/TCP   5h42m   app=jupyterhub,component=hub,release=nbgrader
proxy-api      ClusterIP   10.43.129.135   <none>        8001/TCP   5h42m   app=jupyterhub,component=proxy,release=nbgrader
proxy-public   ClusterIP   10.43.137.16    <none>        80/TCP     5h42m   app=jupyterhub,component=proxy,release=nbgrader

The ingresscontroller sends traffic to the peoxy pod access namespaces, is it allowed to do so?

I think it may need a label added to it, see infrastructure/helm-charts/support/values.yaml at 096e96be206dde3e0de62a22899e798a763be003 · 2i2c-org/infrastructure · GitHub for example

I figure this may need to be better documented if it isn’t already.

1 Like

Thanks for your response. How should i make sure ingresscontroller is allowed to? Should i add that annotation you mentioned in the ingresscontrollerpod? Or jupyterhub config file?

Best

Add it as a label to the ingress controller, if its the ingres-nginx helm chart, you have the config on how to do it linked above

1 Like

I installed ingress-nginx using

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --values ingress-nginx-controller.yaml

where ingress-nginx-controller.yaml is:

$ cat ingress-nginx-controller-config.yaml
controller:
  service:
    annotations:
      hub.jupyter.org/network-access-proxy-http: "true"
$ kubectl get service/ingress-nginx-controller -n ingress-nginx -o yaml | grep annotation -A 2
  annotations:
    hub.jupyter.org/network-access-proxy-http: "true"

But this didn’t work. Am I missing something?

best

Ah but then the exact example is the one provided in the link above: infrastructure/helm-charts/support/values.yaml at 096e96be206dde3e0de62a22899e798a763be003 · 2i2c-org/infrastructure · GitHub

Note that its a label for the controller pods, not a annotation for the controller service.

1 Like

Thanks for your clarification. I modified the values file as:

$ cat ingress-nginx-controller-config.yaml
controller:
  podLabels:
    hub.jupyter.org/network-access-proxy-http: "true"

and installed ingress but I still get the same bad gateway 502 error. Still same entry in the ingress-controller-pod logs:

2024/01/27 12:04:29 [error] 40#40: *50787 connect() failed (113: Host is unreachable) while connecting to upstream, client: 10.42.0.1, server: mydomain, request: "GET /favicon.ico HTTP/2.0", upstream: "http://10.42.0.32:8000/favicon.ico", host: "mydomain", referrer: "mydomain"

In Liveness & readiness probes failed z2jh you mentioned you were having problems with a multi-cluster k3s deployment. Are you 100% sure the cluster is fully working? If it’s only partially configured some things may work, others won’t, and some things may intermittently work.

1 Like

I couldn’t make it work in a multicluster so i sticked to a single cluster. Now all pods and services and ingress from jupyterhub side is working and i no longer have liveness & rediness problem. But this bad gateway error persists

Although the hub status is running I found this also in the logs of hub pod, repeatedly

[E 2024-01-28 00:48:47.189 JupyterHubSingleUser] Failed to connect to my Hub at http://hub:8081/hub/api (attempt 1/5). Is it running?
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/jupyterhub/singleuser/extension.py", line 336, in check_hub_version
        resp = await client.fetch(self.hub_auth.api_url)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ConnectionRefusedError: [Errno 111] Connection refused

And here is the result of k describe pod hub...:

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  2m49s  default-scheduler  0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Normal   Scheduled         2m47s  default-scheduler  Successfully assigned default/hub-856b4799c4-55nbq to k3s-master-01
  Normal   Pulled            2m47s  kubelet            Container image "bpfrd/nbgrader-hub:latest" already present on machine
  Normal   Created           2m47s  kubelet            Created container hub
  Normal   Started           2m47s  kubelet            Started container hub

So maybe these two issues are related? I was wondering if you have any idea about this log?
best

also this one

[E 240128 01:08:45 ioloop:923] Exception in callback functools.partial(<function cull_idle at 0x7f1005eca520>, url='http://localhost:8081/hub/api', api_token='303f6418f5a54858b93213d06537d8ec', inactive_limit=3600, cull_users=False, remove_named_servers=False, max_age=0, concurrency=10, ssl_enabled=False, internal_certs_location='internal-ssl', cull_admin_users=True, api_page_size=0)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/tornado/ioloop.py", line 921, in _run
        await val
      File "/usr/local/lib/python3.11/site-packages/jupyterhub_idle_culler/__init__.py", line 422, in cull_idle
        async for user in fetch_paginated(req):
      File "/usr/local/lib/python3.11/site-packages/jupyterhub_idle_culler/__init__.py", line 135, in fetch_paginated
        response = await resp_future
                   ^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/jupyterhub_idle_culler/__init__.py", line 117, in fetch
        return await client.fetch(req)
               ^^^^^^^^^^^^^^^^^^^^^^^
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden

Perhaps something is wrong that causes 502 bad gateway error as a side effect?

It might be helpful to mention that the same script ran perfectly on two different servers. But running it on this server gives various errors in the logs. It’s the same server where I had problems with multi-cluster k3s.

I was wondering if,
1 - it is a networking issue?
2 - how can I run hub and proxy pods on 0.0.0.0? does it help? Does it help?
3 - I set networkpolicy to false but does it help if I explicitly set network policies? what should I do in this way? is there any example
4 - is there anyway to find out if there are some restrictions on this server that prevents networking?

best

I think it’s worth taking a step back, and focussing on k3s/k8s. One of the major advantages of Kubernetes is that it’s a standard platform that is (more or less) agnostic to the underlying hardware or cloud provider. This makes it a lot easier to write and deploy applications on it.

The big downside is it means the K8s admin, i.e. you, are responsible for ensuring K8s is setup correctly. This may include installing and configuring some addons, configuring the K8s cluster itself, checking there are no weird hardware/storage/networking issues, etc. For example, some k8s distributions don’t include network policies, dynamic storage, load balancers or ingress by default and if they do they may still require some additional configuration.

Given you’ve had problems with your existing server, and you’re not sure about what state it’s in, I think it’d be worth starting with a completely new server and make notes of everything you do.

1 Like