GKE - 503 : Service Unavailable Your server appears to be down. Try restarting it from the hub

I'm running JupyterHub on Kubernetes 1.21 in GKE. I installed JupyterHub using the Helm chart:
chart version: jupyterhub-1.2.0
app version: 1.5.0
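For reference, the chart was installed roughly like this (the release name and namespace below are placeholders, not necessarily what we use):

# Add the JupyterHub Helm chart repo and install chart version 1.2.0
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm upgrade --install jupyterhub jupyterhub/jupyterhub \
  --namespace jhub --create-namespace \
  --version 1.2.0 \
  --values values.yaml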

The Helm install works fine and the pods come up too, but more often than not I see the error below in the browser, which I noticed is a pretty common report:
503 : Service Unavailable Your server appears to be down. Try restarting it from the hub
Sometimes, after reloading multiple times, it does work as expected (but that's very rare).
I'm using Google OAuth for authentication.
A similar issue has been raised here, but the suggestions there did not fix the problem.
The hub pod logs from around the time of the error are below.
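(These were pulled with something like the following; the namespace is a placeholder.)

# Stream the hub pod logs (the hub Deployment is named "hub" in this chart)
kubectl logs deploy/hub -n jhub --timestamps -f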

[I 2022-03-17 14:43:58.771 JupyterHub log:189] 200 GET /hub/error/503?url=%2Fuser%2Fgohar.hovsepyan%40verve.com%2Flab%2Fworkspaces%2Fauto-y (@10.246.6.17) 1.40ms
[I 2022-03-17 14:44:07.805 JupyterHub log:189] 200 GET /hub/error/503?url=%2F (@10.246.6.17) 1.49ms
[W 2022-03-17 14:44:38.512 JupyterHub log:189] 403 GET /hub/metrics (@10.246.32.7) 1.32ms
[I 2022-03-17 14:44:39.447 JupyterHub proxy:347] Checking routes
[W 2022-03-17 14:45:38.511 JupyterHub log:189] 403 GET /hub/metrics (@10.246.32.7) 1.15ms
[I 2022-03-17 14:45:39.415 JupyterHub proxy:347] Checking routes
[W 2022-03-17 14:46:38.512 JupyterHub log:189] 403 GET /hub/metrics (@10.246.32.7) 1.32ms
[I 2022-03-17 14:46:39.416 JupyterHub proxy:347] Checking routes
[W 2022-03-17 14:47:38.512 JupyterHub log:189] 403 GET /hub/metrics (@10.246.32.7) 1.22ms
[W 2022-03-17 14:47:49.428 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[I 2022-03-17 14:47:49.563 JupyterHub proxy:347] Checking routes
.
.
.
[I 2022-03-17 14:51:06.467 JupyterHub oauth2:111] OAuth redirect: 'https://jupy-----redacted------ve.io/hub/oauth_callback'
[I 2022-03-17 14:51:06.468 JupyterHub log:189] 302 GET /hub/oauth_login?next=%2Fhub%2Fuser%2Fgohar.hovsepyan%40verve.com%2Flab%2Fworkspaces%2Fauto-y -> https://accounts.google.com/o/oauth2/v2/auth?response_type=code&redirect_uri=https%3A%2F%2Fjupy-----redacted------ve.io%2Fhub%2Foauth_callback&client_id=324621593441-gonsvpicbljnh0th79g0463il2dnfrhv.apps.googleusercontent.com&state=[secret]&scope=openid+email (@10.255.66.89) 1.33ms
[I 2022-03-17 14:51:17.193 JupyterHub base:762] User logged in: gohar.hovsepyan@verve.com
[I 2022-03-17 14:51:17.194 JupyterHub log:189] 302 GET /hub/oauth_callback?state=[secret]&code=[secret]&scope=email+openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email&authuser=[secret]&hd=verve.com&prompt=none -> /hub/user/gohar.hovsepyan@verve.com/lab/workspaces/auto-y (@10.255.66.89) 10155.52ms
[E 2022-03-17 14:51:17.364 JupyterHub log:189] 503 GET /hub/user/gohar.hovsepyan@verve.com/lab/workspaces/auto-y (gohar.hovsepyan@verve.com@10.255.66.89) 15.28ms
[I 2022-03-17 14:51:22.781 JupyterHub log:189] 200 GET /hub/error/503?url=%2Fhub%2Fstatic%2Fjs%2Fnot_running.js%3Fv%3D20220317141739 (@10.246.6.17) 1.32ms
[I 2022-03-17 14:51:35.255 JupyterHub log:189] 200 GET /hub/error/503?url=%2F (@10.246.6.17) 1.32ms
[W 2022-03-17 14:51:38.512 JupyterHub log:189] 403 GET /hub/metrics (@10.246.32.7) 1.12ms
[W 2022-03-17 14:51:59.430 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-03-17 14:52:09.589 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[E 2022-03-17 14:52:09.591 JupyterHub ioloop:761] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fde2ea36c40>>, <Task finished name='Task-2798' coro=<JupyterHub.update_last_activity() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/app.py:2666> exception=TimeoutError('Repeated api_request to proxy path "" failed.')>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 741, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 765, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/app.py", line 2668, in update_last_activity
        routes = await self.proxy.get_all_routes()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 898, in get_all_routes
        resp = await self.api_request('', client=client)
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 862, in api_request
        result = await exponential_backoff(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Repeated api_request to proxy path "" failed.
2022-03-17T14:52:09.592137627Z

I have also tried deleting the network policies, but the same error comes back.
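(Roughly what I ran for that test; the namespace is a placeholder, and a later helm upgrade recreates the policies anyway.)

# List the chart-managed network policies, then delete them for testing
kubectl get networkpolicy -n jhub
kubectl delete networkpolicy --all -n jhub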
Below is the Helm values file we are using (I could not attach a file, hence pasting it here):

# custom can contain anything you want to pass to the hub pod, as all passed
# Helm template values will be made available there.
custom: {}

# imagePullSecret is configuration to create a k8s Secret that Helm chart's pods
# can get credentials from to pull their images.
imagePullSecret:
  create: false
  automaticReferenceInjection: true
  registry: ''
  username: ''
  email: ''
  password: ''
# imagePullSecrets is configuration to reference the k8s Secret resources the
# Helm chart's pods can get credentials from to pull their images.
imagePullSecrets:
  - name: auth-container-gcr
  - name: auth-container-docker


# hub relates to the hub pod, responsible for running JupyterHub, its configured
# Authenticator class KubeSpawner, and its configured Proxy class
# ConfigurableHTTPProxy. KubeSpawner creates the user pods, and
# ConfigurableHTTPProxy speaks with the actual ConfigurableHTTPProxy server in
# the proxy pod.
hub:
  config:
    GoogleOAuthenticator:
      client_id: 3246215x-----redacted------content.com
      client_secret: Po2------redacted------K5LAl3Ov
      oauth_callback_url: https://jupyter-----redacted------uth_callback
    JupyterHub:
      admin_access: true
      authenticator_class: google
  service:
    type: ClusterIP
    annotations: {}
    ports:
      nodePort:
    loadBalancerIP:
  baseUrl: /
  cookieSecret:
  initContainers: []
  fsGid: 1000
  nodeSelector: {}
  tolerations: []
  concurrentSpawnLimit: 64
  consecutiveFailureLimit: 5
  activeServerLimit:
  deploymentStrategy:
    ## type: Recreate
    ## - sqlite-pvc backed hubs require the Recreate deployment strategy, as a
    ##   typical PVC storage can only be bound to one pod at a time.
    ## - JupyterHub isn't designed to support being run in parallel. More work
    ##   needs to be done in JupyterHub itself before a fully highly available
    ##   (HA) deployment of JupyterHub on k8s is possible.
    type: Recreate
  db:
    type: sqlite-pvc
    upgrade:
    pvc:
      annotations: {}
      selector: {}
      accessModes:
        - ReadWriteOnce
      storage: 1Gi
      subPath:
      storageClassName:
    url:
    password:
  image:
    pullPolicy: IfNotPresent
    pullSecrets: []
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
  containerSecurityContext:
    runAsUser: 1000
    runAsGroup: 1000
    allowPrivilegeEscalation: false
  services: {}
  pdb:
    enabled: false
    minAvailable: 1
  networkPolicy:
    enabled: true
    ingress: []
    ## egress for JupyterHub already includes Kubernetes internal DNS and
    ## access to the proxy. It can be restricted further, but make sure to
    ## allow access to the Kubernetes API server, which couldn't be pinned
    ## ahead of time.
    ##
    ## ref: https://stackoverflow.com/a/59016417/2220152
    egress:
      - to:
          - ipBlock:
              cidr: 0.0.0.0/0
    interNamespaceAccessLabels: ignore
    allowedIngressPorts: []
  allowNamedServers: false
  namedServerLimitPerUser:
  authenticatePrometheus:
  redirectToServer:
  shutdownOnLogout:
  templatePaths: []
  templateVars: {}
  livenessProbe:
    # The livenessProbe's aim is to give JupyterHub sufficient time to start up,
    # but to be able to restart it if it becomes unresponsive for ~5 min.
    enabled: true
    initialDelaySeconds: 300
    periodSeconds: 10
    failureThreshold: 30
    timeoutSeconds: 3
  readinessProbe:
    # The readinessProbe's aim is to provide a successful startup indication,
    # but following that it should never become unready before its livenessProbe
    # fails and restarts it if needed. Becoming unready after startup serves no
    # purpose, as there is no other pod to fall back to in our non-HA deployment.
    enabled: true
    initialDelaySeconds: 0
    periodSeconds: 2
    failureThreshold: 1000
    timeoutSeconds: 1
  existingSecret:

rbac:
  enabled: true


# proxy relates to the proxy pod, the proxy-public service, and the autohttps
# pod and proxy-http service.
proxy:
  secretToken: 'bab124-----redacted------8f1203abdf'
  annotations: {}
  deploymentStrategy:
    ## type: Recreate
    ## - JupyterHub's interaction with the CHP proxy becomes a lot more robust
    ##   with this configuration. To understand this, consider that JupyterHub
    ##   during startup will interact a lot with the k8s service to reach a
    ##   ready proxy pod. If the hub pod during a helm upgrade is restarting
    ##   directly while the proxy pod is making a rolling upgrade, the hub pod
    ##   could end up running a sequence of interactions with the old proxy pod
    ##   and finishing up the sequence of interactions with the new proxy pod.
    ##   As CHP proxy pods carry individual state this is very error prone. One
    ##   outcome when not using Recreate as a strategy has been that user pods
    ##   have been deleted by the hub pod because it considered them unreachable
    ##   as it only configured the old proxy pod but not the new before trying
    ##   to reach them.
    type: Recreate
    ## rollingUpdate:
    ## - WARNING:
    ##   This is required to be set explicitly blank! Without it being
    ##   explicitly blank, k8s will let eventual old values under rollingUpdate
    ##   remain and then the Deployment becomes invalid and a helm upgrade would
    ##   fail with an error like this:
    ##
    ##     UPGRADE FAILED
    ##     Error: Deployment.apps "proxy" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy `type` is 'Recreate'
    ##     Error: UPGRADE FAILED: Deployment.apps "proxy" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy `type` is 'Recreate'
    rollingUpdate:
  # service relates to the proxy-public service
  service:
    type: ClusterIP 
    labels: {}
    annotations: {}
    nodePorts:
      http:
      https:
    extraPorts: []
    loadBalancerIP:
    loadBalancerSourceRanges: []
  # chp relates to the proxy pod, which is responsible for routing traffic based
  # on dynamic configuration sent from JupyterHub to CHP's REST API.
  chp:
    containerSecurityContext:
      runAsUser: 65534  # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
    extraCommandLineFlags: []
    livenessProbe:
      enabled: true
      initialDelaySeconds: 60
      periodSeconds: 10
    readinessProbe:
      enabled: true
      initialDelaySeconds: 0
      periodSeconds: 2
      failureThreshold: 1000
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
    extraEnv: {}
    nodeSelector: {}
    tolerations: []
    networkPolicy:
      enabled: true
      ingress: []
      egress:
        - to:
            - ipBlock:
                cidr: 0.0.0.0/0
      interNamespaceAccessLabels: ignore
      allowedIngressPorts: [http, https]
    pdb:
      enabled: false
      minAvailable: 1
  # traefik relates to the autohttps pod, which is responsible for TLS
  # termination when proxy.https.type=letsencrypt.
  traefik:
    containerSecurityContext:
      runAsUser: 65534  # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
    hsts:
      includeSubdomains: false
      preload: false
      maxAge: 15724800 # About 6 months
    resources: {}
    extraEnv: {}
    extraVolumes: []
    extraVolumeMounts: []
    extraStaticConfig: {}
    extraDynamicConfig: {}
    nodeSelector: {}
    tolerations: []
    extraPorts: []
    networkPolicy:
      enabled: true
      ingress: []
      egress:
        - to:
            - ipBlock:
                cidr: 0.0.0.0/0
      interNamespaceAccessLabels: ignore
      allowedIngressPorts: [http, https]
    pdb:
      enabled: false
      minAvailable: 1
  secretSync:
    containerSecurityContext:
      runAsUser: 65534  # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
    resources: {}
  labels: {}
  https:
    enabled: false
    type: letsencrypt
    #type: letsencrypt, manual, offload, secret
    letsencrypt:
      contactEmail: ''
      # Specify custom server here (https://acme-staging-v02.api.letsencrypt.org/directory) to hit staging LE
      acmeServer: https://acme-v02.api.letsencrypt.org/directory
    manual:
      key:
      cert:
    secret:
      name: ''
      key: tls.key
      crt: tls.crt
    hosts: []


# singleuser relates to the configuration of KubeSpawner which runs in the hub
# pod, and its spawning of user pods such as jupyter-myusername.
singleuser:
  podNameTemplate:
  extraTolerations: []
  nodeSelector: {}
  extraNodeAffinity:
    required: []
    preferred: []
  extraPodAffinity:
    required: []
    preferred: []
  extraPodAntiAffinity:
    required: []
    preferred: []
  networkTools:
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
  cloudMetadata:
    # blockWithIptables set to true will append a privileged initContainer that
    # uses iptables to block the sensitive metadata server at the provided IP.
    blockWithIptables: true
    ip: 169.254.169.254
  networkPolicy:
    enabled: true
    ingress: []
    egress:
      # Required egress to communicate with the hub and DNS servers will be
      # added on top of these egress rules.
      #
      # This default rule explicitly allows all outbound traffic from singleuser
      # pods, except to a typical IP used to return metadata that can be used by
      # someone with malicious intent.
      - to:
          - ipBlock:
              cidr: 0.0.0.0/0
              except:
                - 169.254.169.254/32
    interNamespaceAccessLabels: ignore
    allowedIngressPorts: []
  events: true
  extraAnnotations: {}
  extraLabels:
    hub.jupyter.org/network-access-hub: 'true'
  extraEnv: {}
  lifecycleHooks: {}
  initContainers: []
  extraContainers: []
  uid: 1000
  fsGid: 100
  serviceAccountName: spark
  storage:
    type: dynamic
    extraLabels: {}
    extraVolumes:
      - name: aws-credentials
        secret:
          secretName: aws-credentials
      - name: gcp-credentials-applift
        secret:
          secretName: gcp-credentials-applift
      - name: gcp-credentials-data-jobs
        secret:
          secretName: gcp-credentials-data-jobs
      - name: gcp-credentials
        secret:
          secretName: gcp-credentials
      - name: shared
        persistentVolumeClaim:
          claimName: jupyterhub-rwmany-claim
    extraVolumeMounts:
      - mountPath: /home/jovyan/.aws/credentials
        name: aws-credentials
        subPath: credentials
        readOnly: true
      - mountPath: /home/jovyan/gcp-credentials-applift.json
        name: gcp-credentials-applift
        subPath: gcp-credentials-applift.json
        readOnly: true
      - mountPath: /home/jovyan/gcp-credentials-data-jobs.json
        name: gcp-credentials-data-jobs
        subPath: gcp-credentials-data-jobs.json
        readOnly: true
      - mountPath: /home/jovyan/gcp-credentials.json
        name: gcp-credentials
        subPath: gcp-credentials.json
        readOnly: true
      - mountPath: /home/jovyan/shared
        name: shared
    static:
      pvcName:
      subPath: '{username}'
    capacity: 10Gi
    homeMountPath: /home/jovyan
    dynamic:
      storageClass:
      pvcNameTemplate: claim-{username}{servername}
      volumeNameTemplate: volume-{username}{servername}
      storageAccessModes: [ReadWriteOnce]
  image:
    pullPolicy: IfNotPresent
    pullSecrets: []
  startTimeout: 300
  cpu:
    limit:
    guarantee: 1
  memory:
    limit:
    guarantee: 4G
  extraResource:
    limits: {}
    guarantees: {}
  cmd: jupyterhub-singleuser
  defaultUrl: "/lab"
  extraPodConfig: {}
  profileList:
    - display_name: "Small"
      description: "4G Memory, 1 CPU Guaranteed"
      default: true
    - display_name: "Medium"
      description: "8G Memory, 2 CPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 8G
        mem_limit:
        cpu_guarantee: 2
        cpu_limit:
    - display_name: "Large"
      description: "16G Memory, 4 CPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 16G
        mem_limit:
        cpu_guarantee: 4
        cpu_limit:
    - display_name: "XLarge"
      description: "25G Memory, 6 CPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 28G
        mem_limit:
        cpu_guarantee: 6
        cpu_limit: 25
    - display_name: "N2-HighMem-Small"
      description: "16G Memory, 2 vCPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 16G
        mem_limit:
        cpu_guarantee: 2
        cpu_limit: 
    - display_name: "N2-HighMem-Medium"
      description: "32G Memory, 4 vCPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 32G
        mem_limit:
        cpu_guarantee: 4
        cpu_limit: 
    - display_name: "N2-HighMem-Large"
      description: "64G Memory, 8 vCPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 64G
        mem_limit:
        cpu_guarantee: 8
        cpu_limit: 
    - display_name: "N2-HighMem-XL"
      description: "128G Memory, 16 vCPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 128G
        mem_limit:
        cpu_guarantee: 16
        cpu_limit: 
    - display_name: "N2-HighMem-XXL"
      description: "256G Memory, 32 vCPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 240G
        mem_limit:
        cpu_guarantee: 30
        cpu_limit: 
    

# scheduling relates to the user-scheduler pods and user-placeholder pods.
scheduling:
  userScheduler:
    enabled: true
    replicas: 2
    logLevel: 4
    # plugins ref: https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins-1
    plugins:
      score:
        disabled:
          - name: SelectorSpread
          - name: TaintToleration
          - name: PodTopologySpread
          - name: NodeResourcesBalancedAllocation
          - name: NodeResourcesLeastAllocated
          # Disable plugins to be allowed to enable them again with a different
          # weight and avoid an error.
          - name: NodePreferAvoidPods
          - name: NodeAffinity
          - name: InterPodAffinity
          - name: ImageLocality
        enabled:
          - name: NodePreferAvoidPods
            weight: 161051
          - name: NodeAffinity
            weight: 14631
          - name: InterPodAffinity
            weight: 1331
          - name: NodeResourcesMostAllocated
            weight: 121
          - name: ImageLocality
            weight: 11
    containerSecurityContext:
      runAsUser: 65534  # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
    nodeSelector: {}
    tolerations: []
    pdb:
      enabled: true
      maxUnavailable: 1
    resources:
      requests:
        cpu: 50m
        memory: 256Mi
  podPriority:
    enabled: false
    globalDefault: false
    defaultPriority: 0
    userPlaceholderPriority: -10
  userPlaceholder:
    enabled: true
    replicas: 0
    containerSecurityContext:
      runAsUser: 65534  # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
  corePods:
    nodeAffinity:
      matchNodePurpose: prefer
  userPods:
    nodeAffinity:
      matchNodePurpose: prefer


# prePuller relates to the hook|continuous-image-puller DaemonSets
prePuller:
  annotations: {}
  resources:
    requests:
      cpu: 0
      memory: 0
  containerSecurityContext:
    runAsUser: 65534  # nobody user
    runAsGroup: 65534 # nobody group
    allowPrivilegeEscalation: false
  extraTolerations: []
  # hook relates to the hook-image-awaiter Job and hook-image-puller DaemonSet
  hook:
    enabled: true
    # image and the configuration below relates to the hook-image-awaiter Job
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
    containerSecurityContext:
      runAsUser: 65534  # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    podSchedulingWaitDuration: 10
    nodeSelector: {}
    tolerations: []
    resources:
      requests:
        cpu: 0
        memory: 0
  continuous:
    enabled: true
  pullProfileListImages: true
  extraImages: {}
  pause:
    containerSecurityContext:
      runAsUser: 65534  # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []

ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: nginx
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: 32m
  hosts:
    - jupy----redacted-----ve.io
  pathSuffix: ''
  tls:
    - hosts:
      - jupyte----redacted-----ve.io
      secretName: jupyterhub-tls

cull:
  enabled: true
  users: false
  removeNamedServers: false
  timeout: 28800
  every: 3600
  concurrency: 10
  maxAge: 0


debug:
  enabled: false

global: {}

What does kubectl describe on the hub and proxy pods show? Are they running low on resources?
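For example, something like this (the namespace is a placeholder; the component labels are the chart's defaults):

# Look for restarts, OOMKills, and failing probes on the hub and proxy pods
kubectl describe pod -l component=hub -n jhub
kubectl describe pod -l component=proxy -n jhub
# Check actual resource usage (kubectl top works out of the box on GKE)
kubectl top pod -n jhub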

Thanks! As it turns out, the issue was fixed after restarting kube-dns. I wish the error messages were more informative.
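For anyone who hits the same thing, this is roughly how kube-dns was restarted (standard kube-system deployment on GKE):

# Check the cluster DNS pods, then trigger a rolling restart of kube-dns
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl rollout restart deployment kube-dns -n kube-system
kubectl rollout status deployment kube-dns -n kube-system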