Thanks for the topic, Evan, and great picture, Tony! That picture should supplant the one in the Kernel Gateway docs.
There’s a bit of history here that should probably be laid out. I suspect you’re already aware of it, but, for the benefit of others…
Yes, Enterprise Gateway is completely derived from Kernel Gateway. We didn’t extend KG with EG’s remote kernels due to “time-to-market” constraints - it was quicker to create a new repo. However, EG still gets all of its request handling from KG today - thus the dependency. As a result, we leveraged the NB2KG extension. (By the way, as of Notebook 6.0 - which is imminent - the NB2KG extension and its multi-step configuration will no longer be required; the equivalent can be achieved simply by adding --gateway-url <gateway-server-url>!)
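To make that concrete, here’s a minimal sketch of the config-file equivalent of that flag, assuming the GatewayClient options that ship with Notebook 6.0 (the host name is just a placeholder):

```python
# jupyter_notebook_config.py - config-file equivalent of --gateway-url
# (assumes Notebook 6.0+; "my-gateway-host:8888" is a placeholder)
c.GatewayClient.url = "http://my-gateway-host:8888"
```

With either form, the local Notebook server proxies all kernel requests to the gateway instead of launching kernels itself.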
I believe the creators of Kernel Gateway created the NB2KG mechanism because they wanted a way to move kernels closer to the compute clusters while keeping the notebooks on the analysts’ desktops. Since there was no support for remote kernels, this was a way to achieve that. It’s very important because it eliminates the need to have Spark/YARN, etc., installed on the desktop, and that configuration was their primary use case - as it was for EG. The last thing cluster admins want is a bunch of user-based files and notebooks on their compute cluster, so the separation NB2KG provides is quite important.
Enterprise Gateway simply extends Kernel Gateway with the ability to launch kernels across a cluster via a resource manager - which in turn makes it possible to scale the number of kernels that can be launched, since they no longer reside on the Gateway server. That launch still requires that EG be on an “edge node” of the cluster.
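For a flavor of how a kernel gets tied to a resource-manager launch, here’s a rough sketch of the kind of kernelspec stanza EG keys off of. The process-proxy class name is the real YARN cluster proxy that ships with EG, but the kernel name, argv, and paths are placeholders I made up for illustration - not a stock EG kernelspec:

```python
# Sketch: write a kernel.json whose metadata tells Enterprise Gateway to launch
# the kernel through its YARN cluster process proxy rather than locally.
import json

kernel_spec = {
    "display_name": "Python on YARN Cluster (example)",
    "language": "python",
    "metadata": {
        "process_proxy": {
            # Real class shipped with Enterprise Gateway; it asks the YARN
            # resource manager to schedule the kernel on a cluster node.
            "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
        }
    },
    # argv is a simplified placeholder; real EG kernelspecs invoke a launcher script.
    "argv": ["python", "/path/to/launch_kernel.py", "{connection_file}"],
}

with open("kernel.json", "w") as f:
    json.dump(kernel_spec, f, indent=2)
```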
Regarding HA/DR, the idea is that we persist kernel sessions (not to be confused with the sessions managed in the Notebook server) while a remote kernel is active. Should the EG server go down, another would be started (or already be running) and receive the request. (Of course, this presumes the client is hitting a load-balanced URL that is routed and honors session affinity, etc.) When the handler on the “foreign” server goes to look up the kernel id corresponding to the request, it won’t find a kernel manager. Rather than return 404, it would load the persisted state corresponding to that kernel, “hydrate” a kernel manager seeded with that information, and start talking to the remote kernel - thereby continuing to honor the request.
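In sketch form, the “hydrate on miss” path would look something like the following. Everything here is hypothetical - the store layout, the shared path, and the helper shapes are mine, not actual KG/EG code - though jupyter_client’s KernelManager and load_connection_info are real:

```python
# Sketch of the "hydrate on miss" idea described above. The handler/store
# shapes are hypothetical, not actual Kernel Gateway / Enterprise Gateway code.
import json
import os

from jupyter_client import KernelManager  # real class; its use here is a sketch

SESSION_DIR = "/mnt/shared/eg-kernel-sessions"  # shared filesystem (assumption)


def persist_kernel_session(kernel_id, connection_info):
    """Persist just enough state to rebuild a kernel manager on another server."""
    with open(os.path.join(SESSION_DIR, f"{kernel_id}.json"), "w") as f:
        json.dump(connection_info, f)


def get_kernel_manager(kernel_managers, kernel_id):
    """Return a manager for kernel_id, hydrating from persisted state when
    this server has never seen the kernel (i.e., after a failover)."""
    km = kernel_managers.get(kernel_id)
    if km is not None:
        return km

    path = os.path.join(SESSION_DIR, f"{kernel_id}.json")
    if not os.path.exists(path):
        raise KeyError(f"unknown kernel: {kernel_id}")  # the genuine 404 case

    with open(path) as f:
        connection_info = json.load(f)

    # "Hydrate": seed a manager with the persisted connection info so this
    # server can start talking to the still-running remote kernel.
    km = KernelManager()
    km.load_connection_info(connection_info)
    kernel_managers[kernel_id] = km
    return km
```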
I’ve demonstrated this can be accomplished in an Active/Passive mode by simply killing EG on one server and starting EG on another, with the persisted state located on a shared filesystem. This does require an explicit ‘reconnect’ to be issued from the client to reestablish the websocket, but I’m hoping we can figure out how to automate that.
For Active/Active, we’d need to modify the handlers (as previously described), but since those still reside in Kernel Gateway, we’d want to get them into EG (and probably remove our dependence on KG at that time).
Futures:
When jupyter_server gets off the ground, we will likely adopt @takluyver’s proposal for Kernel Providers - which essentially allows “providers” to bring their own kernel manager and launch mechanisms to the party. With that, EG can probably go away, although I suspect we’ll want some “extension” to perform things like HA and kernel session persistence that aren’t really required for regular notebook users. However, the ability to launch kernels into resource-managed clusters should be achievable - provided the server has access to the cluster. (Note: there will still be a way to “gateway” from one Jupyter server instance to another - just to get the separation that admins require.)
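To give a flavor of the provider idea, here’s a purely illustrative sketch. The class and method names are my own, not the actual Kernel Providers proposal or any real API - the point is just that a provider owns both kernel discovery and launch for one kind of target:

```python
# Illustrative sketch of the "provider" concept: something that can discover
# kernel types and launch/manage them its own way (e.g., via YARN).
# None of these names come from the actual Kernel Providers proposal.
from abc import ABC, abstractmethod


class KernelProviderSketch(ABC):
    """A provider owns kernel discovery and launch for one kind of target."""

    #: prefix used to namespace the kernel types this provider offers
    id: str

    @abstractmethod
    def find_kernels(self):
        """Yield (kernel_type_name, attributes) pairs this provider offers."""

    @abstractmethod
    def launch(self, kernel_type_name, cwd=None):
        """Start a kernel of the given type and return an object that can
        manage it (interrupt, restart, shutdown) and expose connection info."""


class YarnClusterProviderSketch(KernelProviderSketch):
    """Hypothetical provider that schedules kernels through a YARN RM."""

    id = "yarn"

    def find_kernels(self):
        yield "python", {"display_name": "Python on YARN (example)"}

    def launch(self, kernel_type_name, cwd=None):
        # A real provider would submit an application to the resource manager
        # and wait for the kernel's connection info to come back.
        raise NotImplementedError("sketch only")
```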
You can get a feel for this via the repos in the Jupyter Enterprise Gateway Experiments organization, where I’ve been performing POC exercises recently.
I’m hoping this answers your questions. If not, let’s continue this discussion!