We noticed that running a Helm upgrade (which only affects the hub, not the proxy) causes a short downtime because the hub deployment uses the Recreate strategy.
During this period, the hub is unavailable and the proxy responds with “Service unavailable”.
According to the docs, Recreate is preferred over RollingUpdate because:
JupyterHub does not support running in parallel, due to this we default to using a deployment strategy of Recreate.
If I understand the RollingUpdate strategy correctly, it keeps the old hub pod running (i.e., all traffic is still forwarded to it) until the new pod has been created and is ready (i.e., its readiness probe has succeeded). Thus, the two pods would not be running at the same time.
Or might both pods run in parallel briefly, say for less than a second?
If that’s not the case, I don’t see why Recreate is preferred over RollingUpdate, which does not cause any downtime.
Thus, can I safely use RollingUpdate?
We haven’t noticed any issues so far.
I am no expert on the JupyterHub Helm chart, but running two hubs at the same time can be problematic because of the DB. Two hubs talking to the same DB can lead to DB inconsistencies, especially when JupyterHub applies DB migrations. There might be more issues, but this is the first that came to mind!
Yes, the jupyterhub process assumes it is the only one modifying the database and proxy state. If you have two hubs running, the internal state of jupyterhub can get out of sync.
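To make the overlap concrete, here is a rough timeline sketch for a single-replica deployment. It is a toy model, not how Kubernetes is implemented: the function `simulate` and its parameters `startup_seconds` and `shutdown_seconds` are made up for illustration, and the RollingUpdate branch assumes the new pod is created before the old one is removed (i.e., `maxSurge=1`, `maxUnavailable=0`).

```python
def simulate(strategy: str, startup_seconds: float, shutdown_seconds: float) -> dict:
    """Toy model of a single-replica rollout that begins at t=0.

    Returns how long both pods exist at once ("overlap") and how long
    no ready pod exists ("downtime"), both in seconds.
    """
    if strategy == "Recreate":
        # The old pod is terminated first; the new pod is created afterwards.
        old_stopped = shutdown_seconds
        new_started = old_stopped
        new_ready = new_started + startup_seconds
    elif strategy == "RollingUpdate":
        # Assumption: the new pod is created first and the old pod is only
        # terminated once the new pod has passed its readiness probe.
        new_started = 0.0
        new_ready = new_started + startup_seconds
        old_stopped = new_ready + shutdown_seconds
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return {
        "overlap": max(0.0, old_stopped - new_started),
        "downtime": max(0.0, new_ready - old_stopped),
    }


if __name__ == "__main__":
    for name in ("Recreate", "RollingUpdate"):
        print(name, simulate(name, startup_seconds=20.0, shutdown_seconds=5.0))
```

In this model RollingUpdate avoids downtime only by letting both hub pods exist for the whole startup period of the new pod, which is exactly the window in which two hubs could touch the same database.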
If I’m correct, the message “Service Unavailable” is shown by configurable-http-proxy (CHP) because the error target, which by default is the hub itself, is not reachable while the hub is down.
According to the docs, changing the error target makes it possible to show more informative error messages.
Besides the hub being completely unavailable, what other “error scenarios” should be considered when providing a custom error message? (For example, is a 404 an “error” handled by the error target?)
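For anyone experimenting with this, below is a minimal sketch of a standalone error target. It assumes, based on my reading of the CHP README, that the proxy issues a GET request to the error target with the status code as the last path segment and the original path in a `url` query parameter (e.g. `/503?url=%2Fhub%2Fhome`). The names `MESSAGES`, `render_error`, `ErrorHandler`, and `serve`, the port, and the message texts are all made up for illustration.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple
from urllib.parse import parse_qs, urlparse

# Hypothetical messages; adjust to whatever your users should see.
MESSAGES = {
    404: "No route matches this URL.",
    503: "JupyterHub is restarting or temporarily unavailable. Please retry shortly.",
}


def render_error(request_path: str) -> Tuple[int, str]:
    """Build (status code, HTML body) from a path such as /503?url=%2Fhub%2Fhome."""
    parsed = urlparse(request_path)
    try:
        code = int(parsed.path.rstrip("/").split("/")[-1])
    except ValueError:
        code = 500
    if not 400 <= code <= 599:
        code = 500
    original = parse_qs(parsed.query).get("url", ["(unknown)"])[0]
    message = MESSAGES.get(code, "The proxy reported an error.")
    body = (
        f"<h1>{code}</h1><p>{html.escape(message)}</p>"
        f"<p>Requested URL: {html.escape(original)}</p>"
    )
    return code, body


class ErrorHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        code, body = render_error(self.path)
        payload = body.encode("utf-8")
        # Mirroring the reported status code is a choice made for this sketch.
        self.send_response(code)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Run the error target until interrupted."""
    HTTPServer(("0.0.0.0", port), ErrorHandler).serve_forever()
```

Calling `serve()` starts the server; CHP would then be pointed at it with its `--error-target` option. Under the assumption above, both 404 (no matching route) and 503 (target not reachable) would arrive here, so those two seem like the minimum set of cases to cover.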