Why does nb2kg connect to the kernel via EG rather than directly?

When the gateway goes down, the connection between nb2kg and the kernel is affected, even with session persistence configured in EG.
For example, a busy kernel does not respond to the client after it reconnects, because it is still running the code requested earlier.

I’ve tried to figure out a solution to the problem, but I couldn’t understand why EG’s fundamental architecture proxies the kernel connections, and I couldn’t shake this curiosity.

I guess EG could just provide the kernel connection info to nb2kg, and nb2kg could connect to the kernels using that connection info. Of course, nb2kg would get heavier. If this is possible, however, EG wouldn’t have to manage the sessions, so it would be a stateless component that is easy to configure for HA.

Thanks,
Evan Park

I’ve been trying to produce a series of indicative sketches for myself about how I think the various Jupyter jigsaw pieces fit together (I have no idea if they are correct or not; I’m hoping they’re at least not too wrong).

Here’s the one I have for nb2kg:

Thanks for your response and kind figure!

But my curiosity was about why the path is nb2kg -> websocket -> Enterprise Gateway -> ZMQ -> kernel rather than nb2kg -> ZMQ -> kernel.

When it comes to Kernel Gateway, it’s fine because kernels have exactly the same life cycle as the Kernel Gateway. OTOH, in the case of Enterprise Gateway, the kernels’ life cycle is independent of Enterprise Gateway.

I’m guessing it is because EG wants to be compatible with Kernel Gateway (nb2kg interface), right?

Thanks for the topic Evan and great picture Tony! That picture should supplant the one in the Kernel Gateway docs.

There’s a bit of history here that should probably be presented, although I suspect you’re aware of this, but, for the benefit of others…

Yes, Enterprise Gateway is completely derived from Kernel Gateway. We didn’t extend KG with the EG remote kernels due to “time-to-market” constraints - finding it quicker to create a new repo. However, EG gets all of its request handling from KG today - thus the dependency. As a result, we leveraged the NB2KG extension. (By the way, as of Notebook 6.0 - which is imminent - the NB2KG extension and its multi-step configuration will no longer be required and its equivalent can be achieved simply by adding --gateway-url <gateway-server-url>!)

I believe the creators of Kernel Gateway created the NB2KG mechanism because they wanted a way to move kernels closer to the compute clusters, keeping the notebooks on the desktops of the analysts. Since there was no support for remote kernels, this was a way that could be achieved. This is very important because it eliminates the need for having Spark/YARN, etc., installed on the desktop and that configuration was their primary use-case - as it was for EG. The last thing cluster admins want is a bunch of user-based files and notebooks on their compute cluster. So separation via NB2KG is quite important.

Enterprise Gateway simply extends Kernel Gateway with the ability to launch kernels across a cluster via a resource manager - which then enables the ability to scale the number of kernels that can be launched since they no longer reside on the Gateway server. That launch still requires that EG be on an “edge node” of the cluster.

Regarding HA/DR, the idea is that we persist kernel sessions (not to be confused with the sessions that are managed in the Notebook server) while a remote kernel is active. Should the EG server go down, another would be started (or already be running) that gets a request. (Of course, this presumes the client is hitting a load-balanced URL that is routed and honors session affinity, etc.) When the handler on the “foreign” server goes to look up the kernel-id corresponding to the request, it wouldn’t find the kernel manager. Rather than return 404, it would load the persisted state corresponding to that kernel and “hydrate” a kernel manager, seeding it with the persisted information, and start talking to that remote kernel - thereby continuing to honor the request.
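
As a rough illustration of the lookup-then-hydrate flow described above, here is a minimal Python sketch. Every name in it (`SessionStore`, `GatewayInstance`, `KernelManagerStub`) is hypothetical and not taken from EG’s code, and the in-memory store only stands in for persisted state on something like a shared filesystem.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class KernelManagerStub:
    """Hypothetical stand-in for a kernel manager bound to one remote kernel."""
    kernel_id: str
    connection_info: dict


class SessionStore:
    """Hypothetical persisted-session store shared by all gateway instances."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, kernel_id: str, connection_info: dict) -> None:
        self._data[kernel_id] = json.dumps(connection_info)

    def load(self, kernel_id: str) -> Optional[dict]:
        raw = self._data.get(kernel_id)
        return json.loads(raw) if raw is not None else None


class GatewayInstance:
    """Hypothetical gateway process holding its own in-memory kernel managers."""

    def __init__(self, store: SessionStore) -> None:
        self.store = store
        self.managers: Dict[str, KernelManagerStub] = {}

    def start_kernel(self, kernel_id: str, connection_info: dict) -> None:
        # Persist the session while the remote kernel is active.
        self.managers[kernel_id] = KernelManagerStub(kernel_id, connection_info)
        self.store.save(kernel_id, connection_info)

    def get_manager(self, kernel_id: str) -> Optional[KernelManagerStub]:
        manager = self.managers.get(kernel_id)
        if manager is not None:
            return manager
        # Unknown to this instance: try to hydrate from persisted state.
        info = self.store.load(kernel_id)
        if info is None:
            return None  # the request handler would answer 404 here
        manager = KernelManagerStub(kernel_id, info)
        self.managers[kernel_id] = manager
        return manager
```

In this sketch, a second `GatewayInstance` built on the same `SessionStore` can serve a kernel that the first instance started, which is the behavior the paragraph describes for the “foreign” server.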

I’ve demonstrated this can be accomplished in an Active/Passive mode by simply killing EG on the server and starting EG on another with the persisted state being located on a shared filesystem. This does require an explicit ‘reconnect’ to be issued from the client to reestablish the websocket, but I’m hoping we can figure out how that can be automated.

For Active/Active, we’d need to modify the handlers (as previously described), but since those still reside in Kernel Gateway, we’d want to get them into EG (and probably remove our dependence on KG at that time).

Futures:
When jupyter_server gets off the ground, we will likely adopt @takluyver’s proposal for Kernel Providers - which essentially allows “providers” to bring their own kernel manager and launch mechanisms to the party. With that, EG can probably go away, although I suspect we’ll want some “extension” to perform things like HA and kernel session persistence that are not really required for regular notebook users. However, the ability to launch kernels into resource-managed clusters should be achievable - provided the server has access to the cluster. (Note: there will still be a way to “gateway” from one jupyter server instance to another - just to get that separation that admins require.)

You can get a feel for this via the Jupyter Enterprise Gateway Experiments organization of repos. I’ve been performing POC exercises for this recently.

I’m hoping this answers your questions. If not, let’s continue this discussion!

Always a kind and detailed answer! Thank you, it is exactly what I wanted. :smile:

Also keep in mind that nb2kg and EG or KG are typically running on different nodes. Kernels use multiple ZMQ connections, there’s not much authentication around them, and the port numbers are random. That’s not something network administrators will appreciate, so you’ll find it hard to get through firewalls that way. Therefore, ZMQ is not a good choice for connecting over the network to the kernels, or to the cluster in which the kernels are running.
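
To make the “multiple ZMQ connections” point concrete, here is a small Python sketch that derives the ZMQ endpoints from kernel connection info. The field names follow the usual Jupyter connection file layout; the helper name `zmq_endpoints`, the IP address, the port numbers, and the key are made up for illustration.

```python
from typing import Dict

# Channel names found in a Jupyter kernel connection file.
CHANNELS = ("shell", "iopub", "stdin", "control", "hb")


def zmq_endpoints(connection_info: dict) -> Dict[str, str]:
    """Return one ZMQ endpoint URL per kernel channel."""
    transport = connection_info.get("transport", "tcp")
    ip = connection_info["ip"]
    return {
        ch: f"{transport}://{ip}:{connection_info[ch + '_port']}"
        for ch in CHANNELS
    }


# Illustrative values only; real ports are usually picked at random per kernel.
example_info = {
    "transport": "tcp",
    "ip": "10.0.0.5",
    "shell_port": 53794,
    "iopub_port": 53795,
    "stdin_port": 53796,
    "control_port": 53797,
    "hb_port": 53798,
    "key": "made-up-key",
    "signature_scheme": "hmac-sha256",
}

endpoints = zmq_endpoints(example_info)
for channel, url in endpoints.items():
    print(channel, url)
```

Each kernel needs all five endpoints to be reachable, so a direct nb2kg-to-kernel connection would mean opening five unpredictable ports per kernel through the firewall.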

EG and KG do authenticate the incoming websocket connections before bridging messages to the ZMQ connections. You need only two well-known ports for connecting to EG and KG over a network, one for the REST calls (https) and one for the websocket connections (wss). The ZMQ connections between the gateway and the kernels are then only inside the cluster.
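
By contrast, a client talking to the gateway only needs a REST URL and a websocket URL. The sketch below builds both from a base URL, using the `/api/kernels` and `/api/kernels/<kernel_id>/channels` paths of the Jupyter kernel API. The helper name `gateway_urls`, the host name, and the kernel id are hypothetical, and for simplicity the sketch assumes REST and websocket traffic share one host and port.

```python
from urllib.parse import urlsplit, urlunsplit


def gateway_urls(base_url: str, kernel_id: str) -> dict:
    """Build the REST and websocket URLs a client would use for one kernel."""
    parts = urlsplit(base_url)
    # https pairs with wss, plain http pairs with ws.
    ws_scheme = "wss" if parts.scheme == "https" else "ws"
    base_path = parts.path.rstrip("/")
    return {
        "start_kernel": urlunsplit(
            (parts.scheme, parts.netloc, base_path + "/api/kernels", "", "")
        ),
        "channels": urlunsplit(
            (ws_scheme, parts.netloc,
             f"{base_path}/api/kernels/{kernel_id}/channels", "", "")
        ),
    }


urls = gateway_urls("https://gateway.example.com:8888", "my-kernel-id")
print(urls["start_kernel"])
print(urls["channels"])
```

All five kernel channels are multiplexed over the single `channels` websocket, which is why the ZMQ ports never have to leave the cluster.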

I think the separation is useful just in terms of how folk might think of architecting new services.

E.g., I was trying to riff on the idea of how VS Code does and could connect to services, and this sketch fell out, showing (with my limited understanding of how VS Code works!) how a currently non-existent nb2kg plugin might expand the reach of VS Code.

(I’ve also tried to tempt the VS Code folk with the idea :wink:)

FWIW, the image files are done with drawio here and they just represent unchecked / unverified sketches of how some of the Jupyter pieces seem to me to stick together.

Re: my arch diagram for nb2kg, I guess I could add an optional auth block at the entry to the kernel gateway.

Is the main difference for EG that it can spawn kernels in arbitrary locations, whereas the KG spawns them on a single node?

Another good diagram, although I’m not certain what ‘VS Code’ is. Is this related to Visual Studio at all? (Nice repo btw. I’ll need to check out some of the other diagrams!) At any rate, nb2kg requires Notebook since it’s essentially replacing the kernel manager classes. However, a non-notebook client can easily manage kernels via the REST API. We provide an experimental “gateway client” in EG. Not sure if that would be the more appropriate interface to “VS Code”.

Regarding KG vs. EG, yes kernel location is the primary difference. We also provide a foundation for things like HA/DR, impersonation, port ranges, etc. aside from the various flavors of remotability. I.e., things enterprise-scale admins might require.

VS Code is Visual Studio Code, yes. E.g., https://code.visualstudio.com/docs/python/jupyter-support

Kernel Gateway and Jupyter Server provide basically the same API for working with kernels, so you can simplify the diagram. Put them in the same box on the right (under the heading “Remote Kernels”) and remove the nb2kg plugin (or rename it to some to-be-developed VSC plugin).

We’re going to need to modify Jupyter Server’s kernel handler to pick out the env: stanza from the body and convey it into the kernel launch logic for it to have parity with a gateway server. At that time, we should also include a parameters: stanza in what can be sent in the request - in preparation for parameterized kernels.

That sounds intriguing…

I believe this issue can be used as a springboard for what has been discussed: https://github.com/jupyter/jupyter_client/issues/434

Generally speaking, I think we’d include the parameter metadata in the kernelspec response so applications know how to prompt, and the start kernel request would include a parameters: stanza containing the inputted/defaulted values, that are then used by the kernel manager/provider to substitute or make available to the kernel, or, more often, the entity that creates the environment in which the kernel runs (e.g., number of cpus/gpus, memory, etc.).
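
To sketch what such a start-kernel request body might look like, here is a small Python helper. The `name` and `env` fields follow the existing kernel API as used with a gateway; the `parameters` stanza is only the proposal discussed here and does not exist yet. The helper name `build_start_kernel_body`, the kernelspec name, and all values are hypothetical.

```python
import json
from typing import Mapping, Optional


def build_start_kernel_body(
    kernel_name: str,
    env: Optional[Mapping[str, str]] = None,
    parameters: Optional[Mapping[str, object]] = None,
) -> str:
    """Serialize the JSON body for a POST to /api/kernels."""
    body: dict = {"name": kernel_name}
    if env:
        # Environment variables conveyed to the kernel launch logic.
        body["env"] = dict(env)
    if parameters:
        # Proposed stanza only: values the user entered or accepted as
        # defaults, based on parameter metadata in the kernelspec.
        body["parameters"] = dict(parameters)
    return json.dumps(body, sort_keys=True)


body = build_start_kernel_body(
    "example_spark_kernel",               # hypothetical kernelspec name
    env={"KERNEL_USERNAME": "evan"},      # illustrative env entry
    parameters={"cpus": 2, "memory_gb": 8},  # hypothetical parameter names
)
print(body)
```

The kernel manager or provider on the receiving side would then decide which values go to the kernel itself and which go to whatever creates the environment the kernel runs in.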

Just did a mod of the KG diagram as an attempt at an enterprise gateway diagram; I tried to capture how EG builds on KG, as well as how notebook server may ship w/ native nb2kg support as well as via an extension

Please let me know if there is anything obviously wrong with that diagram; I’m still feeling my way for what pieces go where!

One thing I should try to include is the various places where different bits of auth might go.

Thanks Tony - lots of work here. Got a few comments…

  1. The “embedded gateway” (nb2kg API) doesn’t introduce any new API. It merely forwards kernel and kernelspec requests to the gateway server. This is accomplished via the gateway package in the NB server. I just didn’t want to imply there’s a diversion from the typical protocol.
  2. For container envs, EG (and the handlers provided by JKG) are containerized and running in the same cluster (k8s/swarm) as the kernel pods/containers. For other clusters, we recommend EG be deployed on a node of the cluster, although that isn’t required. As a result, EG is typically intended to target one kind of cluster at a time. I’m sure we could make a multi-cluster config work, but that’s really not the design center.
  3. Since the detail is so fine, EG sits on KG which sits on Notebook Server. I.e., the class hierarchies used by these entities derive from NB and down to jupyter_client.
  4. There’s a typo in your Docker Swarm cluster block.

I’m hoping @rolweber can chime in on the auth for KG.

Also note that in Jupyter Server, the “plan” is to use a Kernel Provider model which would essentially turn the Gateway blocks into another (and optional) Jupyter Server in “Gateway Mode” block, for those installations that desire a KernelAsAService type of behavior. I think we (the community) need to have a more concrete discussion regarding the transition to that model.

Great, thanks for that… here’s another attempt…

(The cluster stuff is all a bit beyond me… :wink:)

I straddled the jupyter server with an nb2kg relay as code for myself that it’s an extension or part of the server…

That reminds me that I never liked the name “nb2kg”. So short to write, yet so silly to pronounce :stuck_out_tongue_winking_eye:
But “relay” is a good term for the functionality needed at that point.

I’m not so sure about the routing functionality you imply by the multiple EKG behind the relay. I would much rather expect to start one Jupyter Server + relay for each cluster in the backend. It’s so much easier to use different URLs in the browser, instead of doing a path mapping in the Jupyter server.

What’s the use case for providing multiple backends through a single Jupyter server? How many users would actually want to switch from a kernel provided by the Docker Swarm to a kernel provided by Spark/YARN? In Watson Studio on Cloud, we decided not to support that kind of scenario, and saved a lot of work and trouble that way. If users want to switch from one environment to another, we spin up a new container and Jupyter.

Imagine the browser sending a request for the list of kernel specs to the Jupyter server. According to the diagram, the Jupyter server would then have to request the list of kernel specs from each of the backend EKG, merge those lists into one, and then return that to the browser. If even one of the backend systems is slow to respond, so will be the Jupyter server. Same for listing running kernels.
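
The fan-out and merge described above can be sketched in a few lines of Python. The backends here are simulated with `asyncio.sleep`; the backend names, delays, kernelspec names, and both function names are made up, and a real relay would issue HTTP requests instead.

```python
import asyncio
from typing import Dict, List, Tuple


async def fetch_kernelspecs(name: str, delay: float, specs: List[str]) -> Dict[str, dict]:
    """Simulate one backend gateway answering a kernelspec listing request."""
    await asyncio.sleep(delay)  # stands in for network and backend latency
    # Prefix each spec with its backend so names cannot collide after merging.
    return {f"{name}/{spec}": {"backend": name} for spec in specs}


async def merged_kernelspecs(backends: List[Tuple[str, float, List[str]]]) -> Dict[str, dict]:
    """Query all backends concurrently and merge their answers into one dict."""
    results = await asyncio.gather(
        *(fetch_kernelspecs(name, delay, specs) for name, delay, specs in backends)
    )
    merged: Dict[str, dict] = {}
    for result in results:
        merged.update(result)
    return merged


backends = [
    ("swarm", 0.01, ["python3"]),
    ("yarn", 0.05, ["spark_python"]),  # the slow backend
]
merged = asyncio.run(merged_kernelspecs(backends))
print(sorted(merged))
```

Even with concurrent requests, `asyncio.gather` only returns once the slowest backend has answered, so the merged response is never faster than the slowest cluster.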

This would be handled on the “relay” side (nb2kg :stuck_out_tongue:). I uploaded a PR for it: “Attempt to re-establish websocket connection to KG” (Pull Request #42 by esevan in jupyter/nb2kg). I don’t have much experience handling kernel websocket connections, so I’d appreciate your review :smiley:

Ah, okay… so I really need to move the kernel gateway out as a single point of entry and then allow that to pass traffic to individual enterprise gateways sitting in their own cluster?

Seems to me that is then a misnaming? Wouldn’t you want the enterprise gateway to be able to route anywhere and the kernel gateway to sit in each cluster?

Lovely drawings in progress! Let me give it a try too! :smiley:.

[Attached draw.io diagram (mxfile, last modified 2019-07-18, single page “Page-1”); compressed diagram source omitted]