I’m running JupyterHub on Kubernetes 1.21 in GKE. I have installed JupyterHub using the Helm chart:
chart version: jupyterhub-1.2.0
app version: 1.5.0
The helm install works fine and the pods come up, but more often than not I see the error below in the browser, which I’ve noticed is a fairly common one:
503 : Service Unavailable Your server appears to be down. Try restarting it from the hub
Sometimes reloading the page a few times does get it working as expected, but that is very rare.
I’m using Google OAuth for authentication.
A similar issue has been raised here, but that does not fix the problem.
The hub pod logs from the time of the error are below:
[I 2022-03-17 14:43:58.771 JupyterHub log:189] 200 GET /hub/error/503?url=%2Fuser%2Fgohar.hovsepyan%40verve.com%2Flab%2Fworkspaces%2Fauto-y (@10.246.6.17) 1.40ms
[I 2022-03-17 14:44:07.805 JupyterHub log:189] 200 GET /hub/error/503?url=%2F (@10.246.6.17) 1.49ms
[W 2022-03-17 14:44:38.512 JupyterHub log:189] 403 GET /hub/metrics (@10.246.32.7) 1.32ms
[I 2022-03-17 14:44:39.447 JupyterHub proxy:347] Checking routes
[W 2022-03-17 14:45:38.511 JupyterHub log:189] 403 GET /hub/metrics (@10.246.32.7) 1.15ms
[I 2022-03-17 14:45:39.415 JupyterHub proxy:347] Checking routes
[W 2022-03-17 14:46:38.512 JupyterHub log:189] 403 GET /hub/metrics (@10.246.32.7) 1.32ms
[I 2022-03-17 14:46:39.416 JupyterHub proxy:347] Checking routes
[W 2022-03-17 14:47:38.512 JupyterHub log:189] 403 GET /hub/metrics (@10.246.32.7) 1.22ms
[W 2022-03-17 14:47:49.428 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[I 2022-03-17 14:47:49.563 JupyterHub proxy:347] Checking routes
.
.
.
[I 2022-03-17 14:51:06.467 JupyterHub oauth2:111] OAuth redirect: 'https://jupy-----redacted------ve.io/hub/oauth_callback'
[I 2022-03-17 14:51:06.468 JupyterHub log:189] 302 GET /hub/oauth_login?next=%2Fhub%2Fuser%2Fgohar.hovsepyan%40verve.com%2Flab%2Fworkspaces%2Fauto-y -> https://accounts.google.com/o/oauth2/v2/auth?response_type=code&redirect_uri=https%3A%2F%2Fjupy-----redacted------ve.io%2Fhub%2Foauth_callback&client_id=324621593441-gonsvpicbljnh0th79g0463il2dnfrhv.apps.googleusercontent.com&state=[secret]&scope=openid+email (@10.255.66.89) 1.33ms
[I 2022-03-17 14:51:17.193 JupyterHub base:762] User logged in: gohar.hovsepyan@verve.com
[I 2022-03-17 14:51:17.194 JupyterHub log:189] 302 GET /hub/oauth_callback?state=[secret]&code=[secret]&scope=email+openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email&authuser=[secret]&hd=verve.com&prompt=none -> /hub/user/gohar.hovsepyan@verve.com/lab/workspaces/auto-y (@10.255.66.89) 10155.52ms
[E 2022-03-17 14:51:17.364 JupyterHub log:189] 503 GET /hub/user/gohar.hovsepyan@verve.com/lab/workspaces/auto-y (gohar.hovsepyan@verve.com@10.255.66.89) 15.28ms
[I 2022-03-17 14:51:22.781 JupyterHub log:189] 200 GET /hub/error/503?url=%2Fhub%2Fstatic%2Fjs%2Fnot_running.js%3Fv%3D20220317141739 (@10.246.6.17) 1.32ms
[I 2022-03-17 14:51:35.255 JupyterHub log:189] 200 GET /hub/error/503?url=%2F (@10.246.6.17) 1.32ms
[W 2022-03-17 14:51:38.512 JupyterHub log:189] 403 GET /hub/metrics (@10.246.32.7) 1.12ms
[W 2022-03-17 14:51:59.430 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-03-17 14:52:09.589 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[E 2022-03-17 14:52:09.591 JupyterHub ioloop:761] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fde2ea36c40>>, <Task finished name='Task-2798' coro=<JupyterHub.update_last_activity() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/app.py:2666> exception=TimeoutError('Repeated api_request to proxy path "" failed.')>)
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/app.py", line 2668, in update_last_activity
routes = await self.proxy.get_all_routes()
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 898, in get_all_routes
resp = await self.api_request('', client=client)
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 862, in api_request
result = await exponential_backoff(
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
raise TimeoutError(fail_message)
TimeoutError: Repeated api_request to proxy path "" failed.
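As far as I understand, the repeated 403 GET /hub/metrics lines are just our Prometheus scraper hitting the hub’s authenticated metrics endpoint without a token, so they are probably unrelated to the 503s. If it is useful to silence them while debugging, I believe the chart exposes this through hub.authenticatePrometheus (left blank in our values below), e.g.:

hub:
  authenticatePrometheus: false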
I have also tried deleting the network policies, but the same error comes back.
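For that test I deleted the NetworkPolicy objects by hand; as far as I know the equivalent through the chart values (which helm would otherwise recreate on the next upgrade) would be to flip the networkPolicy keys that appear in the values file below, roughly:

hub:
  networkPolicy:
    enabled: false
proxy:
  chp:
    networkPolicy:
      enabled: false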
Below is the Helm values file we are using (I could not attach a file, hence pasting it here):
# custom can contain anything you want to pass to the hub pod, as all passed
# Helm template values will be made available there.
custom: {}
# imagePullSecret is configuration to create a k8s Secret that Helm chart's pods
# can get credentials from to pull their images.
imagePullSecret:
  create: false
  automaticReferenceInjection: true
  registry: ''
  username: ''
  email: ''
  password: ''
# imagePullSecrets is configuration to reference the k8s Secret resources the
# Helm chart's pods can get credentials from to pull their images.
imagePullSecrets:
  - name: auth-container-gcr
  - name: auth-container-docker
# hub relates to the hub pod, responsible for running JupyterHub, its configured
# Authenticator class KubeSpawner, and its configured Proxy class
# ConfigurableHTTPProxy. KubeSpawner creates the user pods, and
# ConfigurableHTTPProxy speaks with the actual ConfigurableHTTPProxy server in
# the proxy pod.
hub:
  config:
    GoogleOAuthenticator:
      client_id: 3246215x-----redacted------content.com
      client_secret: Po2------redacted------K5LAl3Ov
      oauth_callback_url: https://jupyter-----redacted------uth_callback
    JupyterHub:
      admin_access: true
      authenticator_class: google
  service:
    type: ClusterIP
    annotations: {}
    ports:
      nodePort:
    loadBalancerIP:
  baseUrl: /
  cookieSecret:
  initContainers: []
  fsGid: 1000
  nodeSelector: {}
  tolerations: []
  concurrentSpawnLimit: 64
  consecutiveFailureLimit: 5
  activeServerLimit:
  deploymentStrategy:
    ## type: Recreate
    ## - sqlite-pvc backed hubs require the Recreate deployment strategy as a
    ##   typical PVC storage can only be bound to one pod at the time.
    ## - JupyterHub isn't designed to support being run in parallell. More work
    ##   needs to be done in JupyterHub itself for a fully highly available (HA)
    ##   deployment of JupyterHub on k8s is to be possible.
    type: Recreate
  db:
    type: sqlite-pvc
    upgrade:
    pvc:
      annotations: {}
      selector: {}
      accessModes:
        - ReadWriteOnce
      storage: 1Gi
      subPath:
      storageClassName:
    url:
    password:
  image:
    pullPolicy: IfNotPresent
    pullSecrets: []
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
  containerSecurityContext:
    runAsUser: 1000
    runAsGroup: 1000
    allowPrivilegeEscalation: false
  services: {}
  pdb:
    enabled: false
    minAvailable: 1
  networkPolicy:
    enabled: true
    ingress: []
    ## egress for JupyterHub already includes Kubernetes internal DNS and
    ## access to the proxy, but can be restricted further, but ensure to allow
    ## access to the Kubernetes API server that couldn't be pinned ahead of
    ## time.
    ##
    ## ref: https://stackoverflow.com/a/59016417/2220152
    egress:
      - to:
          - ipBlock:
              cidr: 0.0.0.0/0
    interNamespaceAccessLabels: ignore
    allowedIngressPorts: []
  allowNamedServers: false
  namedServerLimitPerUser:
  authenticatePrometheus:
  redirectToServer:
  shutdownOnLogout:
  templatePaths: []
  templateVars: {}
  livenessProbe:
    # The livenessProbe's aim to give JupyterHub sufficient time to startup but
    # be able to restart if it becomes unresponsive for ~5 min.
    enabled: true
    initialDelaySeconds: 300
    periodSeconds: 10
    failureThreshold: 30
    timeoutSeconds: 3
  readinessProbe:
    # The readinessProbe's aim is to provide a successful startup indication,
    # but following that never become unready before its livenessProbe fail and
    # restarts it if needed. To become unready following startup serves no
    # purpose as there are no other pod to fallback to in our non-HA deployment.
    enabled: true
    initialDelaySeconds: 0
    periodSeconds: 2
    failureThreshold: 1000
    timeoutSeconds: 1
  existingSecret:
rbac:
  enabled: true
# proxy relates to the proxy pod, the proxy-public service, and the autohttps
# pod and proxy-http service.
proxy:
  secretToken: 'bab124-----redacted------8f1203abdf'
  annotations: {}
  deploymentStrategy:
    ## type: Recreate
    ## - JupyterHub's interaction with the CHP proxy becomes a lot more robust
    ##   with this configuration. To understand this, consider that JupyterHub
    ##   during startup will interact a lot with the k8s service to reach a
    ##   ready proxy pod. If the hub pod during a helm upgrade is restarting
    ##   directly while the proxy pod is making a rolling upgrade, the hub pod
    ##   could end up running a sequence of interactions with the old proxy pod
    ##   and finishing up the sequence of interactions with the new proxy pod.
    ##   As CHP proxy pods carry individual state this is very error prone. One
    ##   outcome when not using Recreate as a strategy has been that user pods
    ##   have been deleted by the hub pod because it considered them unreachable
    ##   as it only configured the old proxy pod but not the new before trying
    ##   to reach them.
    type: Recreate
    ## rollingUpdate:
    ## - WARNING:
    ##   This is required to be set explicitly blank! Without it being
    ##   explicitly blank, k8s will let eventual old values under rollingUpdate
    ##   remain and then the Deployment becomes invalid and a helm upgrade would
    ##   fail with an error like this:
    ##
    ##     UPGRADE FAILED
    ##     Error: Deployment.apps "proxy" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy `type` is 'Recreate'
    ##     Error: UPGRADE FAILED: Deployment.apps "proxy" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy `type` is 'Recreate'
    rollingUpdate:
  # service relates to the proxy-public service
  service:
    type: ClusterIP
    labels: {}
    annotations: {}
    nodePorts:
      http:
      https:
    extraPorts: []
    loadBalancerIP:
    loadBalancerSourceRanges: []
  # chp relates to the proxy pod, which is responsible for routing traffic based
  # on dynamic configuration sent from JupyterHub to CHP's REST API.
  chp:
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
    extraCommandLineFlags: []
    livenessProbe:
      enabled: true
      initialDelaySeconds: 60
      periodSeconds: 10
    readinessProbe:
      enabled: true
      initialDelaySeconds: 0
      periodSeconds: 2
      failureThreshold: 1000
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
    extraEnv: {}
    nodeSelector: {}
    tolerations: []
    networkPolicy:
      enabled: true
      ingress: []
      egress:
        - to:
            - ipBlock:
                cidr: 0.0.0.0/0
      interNamespaceAccessLabels: ignore
      allowedIngressPorts: [http, https]
    pdb:
      enabled: false
      minAvailable: 1
  # traefik relates to the autohttps pod, which is responsible for TLS
  # termination when proxy.https.type=letsencrypt.
  traefik:
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
    hsts:
      includeSubdomains: false
      preload: false
      maxAge: 15724800 # About 6 months
    resources: {}
    extraEnv: {}
    extraVolumes: []
    extraVolumeMounts: []
    extraStaticConfig: {}
    extraDynamicConfig: {}
    nodeSelector: {}
    tolerations: []
    extraPorts: []
    networkPolicy:
      enabled: true
      ingress: []
      egress:
        - to:
            - ipBlock:
                cidr: 0.0.0.0/0
      interNamespaceAccessLabels: ignore
      allowedIngressPorts: [http, https]
    pdb:
      enabled: false
      minAvailable: 1
  secretSync:
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
    resources: {}
  labels: {}
  https:
    enabled: false
    type: letsencrypt
    #type: letsencrypt, manual, offload, secret
    letsencrypt:
      contactEmail: ''
      # Specify custom server here (https://acme-staging-v02.api.letsencrypt.org/directory) to hit staging LE
      acmeServer: https://acme-v02.api.letsencrypt.org/directory
    manual:
      key:
      cert:
    secret:
      name: ''
      key: tls.key
      crt: tls.crt
    hosts: []
# singleuser relates to the configuration of KubeSpawner which runs in the hub
# pod, and its spawning of user pods such as jupyter-myusername.
singleuser:
  podNameTemplate:
  extraTolerations: []
  nodeSelector: {}
  extraNodeAffinity:
    required: []
    preferred: []
  extraPodAffinity:
    required: []
    preferred: []
  extraPodAntiAffinity:
    required: []
    preferred: []
  networkTools:
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
  cloudMetadata:
    # block set to true will append a privileged initContainer using the
    # iptables to block the sensitive metadata server at the provided ip.
    blockWithIptables: true
    ip: 169.254.169.254
  networkPolicy:
    enabled: true
    ingress: []
    egress:
      # Required egress to communicate with the hub and DNS servers will be
      # augmented to these egress rules.
      #
      # This default rule explicitly allows all outbound traffic from singleuser
      # pods, except to a typical IP used to return metadata that can be used by
      # someone with malicious intent.
      - to:
          - ipBlock:
              cidr: 0.0.0.0/0
              except:
                - 169.254.169.254/32
    interNamespaceAccessLabels: ignore
    allowedIngressPorts: []
  events: true
  extraAnnotations: {}
  extraLabels:
    hub.jupyter.org/network-access-hub: 'true'
  extraEnv: {}
  lifecycleHooks: {}
  initContainers: []
  extraContainers: []
  uid: 1000
  fsGid: 100
  serviceAccountName: spark
  storage:
    type: dynamic
    extraLabels: {}
    extraVolumes:
      - name: aws-credentials
        secret:
          secretName: aws-credentials
      - name: gcp-credentials-applift
        secret:
          secretName: gcp-credentials-applift
      - name: gcp-credentials-data-jobs
        secret:
          secretName: gcp-credentials-data-jobs
      - name: gcp-credentials
        secret:
          secretName: gcp-credentials
      - name: shared
        persistentVolumeClaim:
          claimName: jupyterhub-rwmany-claim
    extraVolumeMounts:
      - mountPath: /home/jovyan/.aws/credentials
        name: aws-credentials
        subPath: credentials
        readOnly: true
      - mountPath: /home/jovyan/gcp-credentials-applift.json
        name: gcp-credentials-applift
        subPath: gcp-credentials-applift.json
        readOnly: true
      - mountPath: /home/jovyan/gcp-credentials-data-jobs.json
        name: gcp-credentials-data-jobs
        subPath: gcp-credentials-data-jobs.json
        readOnly: true
      - mountPath: /home/jovyan/gcp-credentials.json
        name: gcp-credentials
        subPath: gcp-credentials.json
        readOnly: true
      - mountPath: /home/jovyan/shared
        name: shared
    static:
      pvcName:
      subPath: '{username}'
    capacity: 10Gi
    homeMountPath: /home/jovyan
    dynamic:
      storageClass:
      pvcNameTemplate: claim-{username}{servername}
      volumeNameTemplate: volume-{username}{servername}
      storageAccessModes: [ReadWriteOnce]
  image:
    pullPolicy: IfNotPresent
    pullSecrets: []
  startTimeout: 300
  cpu:
    limit:
    guarantee: 1
  memory:
    limit:
    guarantee: 4G
  extraResource:
    limits: {}
    guarantees: {}
  cmd: jupyterhub-singleuser
  defaultUrl: "/lab"
  extraPodConfig: {}
  profileList:
    - display_name: "Small"
      description: "4G Memory, 1 CPU Guaranteed"
      default: true
    - display_name: "Medium"
      description: "8G Memory, 2 CPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 8G
        mem_limit:
        cpu_guarantee: 2
        cpu_limit:
    - display_name: "Large"
      description: "16G Memory, 4 CPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 16G
        mem_limit:
        cpu_guarantee: 4
        cpu_limit:
    - display_name: "XLarge"
      description: "25G Memory, 6 CPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 28G
        mem_limit:
        cpu_guarantee: 6
        cpu_limit: 25
    - display_name: "N2-HighMem-Small"
      description: "16G Memory, 2 vCPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 16G
        mem_limit:
        cpu_guarantee: 2
        cpu_limit:
    - display_name: "N2-HighMem-Medium"
      description: "32G Memory, 4 vCPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 32G
        mem_limit:
        cpu_guarantee: 4
        cpu_limit:
    - display_name: "N2-HighMem-Large"
      description: "64G Memory, 8 vCPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 64G
        mem_limit:
        cpu_guarantee: 8
        cpu_limit:
    - display_name: "N2-HighMem-XL"
      description: "128G Memory, 16 vCPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 128G
        mem_limit:
        cpu_guarantee: 16
        cpu_limit:
    - display_name: "N2-HighMem-XXL"
      description: "256G Memory, 32 vCPU Guaranteed"
      kubespawner_override:
        mem_guarantee: 240G
        mem_limit:
        cpu_guarantee: 30
        cpu_limit:
# scheduling relates to the user-scheduler pods and user-placeholder pods.
scheduling:
  userScheduler:
    enabled: true
    replicas: 2
    logLevel: 4
    # plugins ref: https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins-1
    plugins:
      score:
        disabled:
          - name: SelectorSpread
          - name: TaintToleration
          - name: PodTopologySpread
          - name: NodeResourcesBalancedAllocation
          - name: NodeResourcesLeastAllocated
          # Disable plugins to be allowed to enable them again with a different
          # weight and avoid an error.
          - name: NodePreferAvoidPods
          - name: NodeAffinity
          - name: InterPodAffinity
          - name: ImageLocality
        enabled:
          - name: NodePreferAvoidPods
            weight: 161051
          - name: NodeAffinity
            weight: 14631
          - name: InterPodAffinity
            weight: 1331
          - name: NodeResourcesMostAllocated
            weight: 121
          - name: ImageLocality
            weight: 11
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
    nodeSelector: {}
    tolerations: []
    pdb:
      enabled: true
      maxUnavailable: 1
    resources:
      requests:
        cpu: 50m
        memory: 256Mi
  podPriority:
    enabled: false
    globalDefault: false
    defaultPriority: 0
    userPlaceholderPriority: -10
  userPlaceholder:
    enabled: true
    replicas: 0
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
  corePods:
    nodeAffinity:
      matchNodePurpose: prefer
  userPods:
    nodeAffinity:
      matchNodePurpose: prefer
# prePuller relates to the hook|continuous-image-puller DaemonsSets
prePuller:
  annotations: {}
  resources:
    requests:
      cpu: 0
      memory: 0
  containerSecurityContext:
    runAsUser: 65534 # nobody user
    runAsGroup: 65534 # nobody group
    allowPrivilegeEscalation: false
  extraTolerations: []
  # hook relates to the hook-image-awaiter Job and hook-image-puller DaemonSet
  hook:
    enabled: true
    # image and the configuration below relates to the hook-image-awaiter Job
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    podSchedulingWaitDuration: 10
    nodeSelector: {}
    tolerations: []
    resources:
      requests:
        cpu: 0
        memory: 0
  continuous:
    enabled: true
  pullProfileListImages: true
  extraImages: {}
  pause:
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: nginx
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: 32m
  hosts:
    - jupy----redacted-----ve.io
  pathSuffix: ''
  tls:
    - hosts:
        - jupyte----redacted-----ve.io
      secretName: jupyterhub-tls
cull:
  enabled: true
  users: false
  removeNamedServers: false
  timeout: 28800
  every: 3600
  concurrency: 10
  maxAge: 0
debug:
  enabled: false
global: {}
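Happy to provide more logs if needed. If more verbose hub output would help, I can re-deploy with the chart’s debug flag turned on (the debug block at the end of the values above), i.e.:

debug:
  enabled: true

Any pointers on what else to check for the repeated "api_request to the proxy failed with status code 599" retries would be much appreciated.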