The Enterprise Gateway project allows you to spawn kernels on a number of different platforms for distributed computing.
What exactly is the enterprise gateway, in simple terms? What scale must a business be at to benefit from one?
Is it a way for companies to connect multiple servers and have services be accessible from one log-in window? A centralized place to manage access to applications/data? Is it a way to avoid moving data? When do I need one?
Who is responsible at an organization for setting it up? How does it differ from a Docker or Kubernetes deployment of JupyterHub?
I tried reading the docs after seeing the announcement, but it’s not really clear to me (perhaps because I’m not the intended audience, but I’d like to help work on language that introduces facets of the Jupyter ecosystem in clearer terms with fewer presumptions of background knowledge).
At a high level, Enterprise Gateway enables notebook kernels to run across a compute cluster. It distributes kernels by leveraging the underlying resource manager to determine where each kernel should run. As a result, the Notebook server is no longer susceptible to resource exhaustion, because kernels no longer run local to the Notebook server.
This configuration benefits organizations with large compute clusters where a given data scientist or analyst requires multiple simultaneously active notebooks that previously might have combined to exhaust the resources of a single notebook server. In addition, specialized servers with GPUs and other high-end compute configurations can be reserved exclusively for kernel activity.
By installing the NB2KG server extension on the Notebook server, all kernel management is redirected to the Enterprise Gateway server, which then uses a pluggable architecture to spawn, locate, and manage the lifecycle of kernels across the compute cluster. Enterprise Gateway currently supports the Hadoop YARN, Kubernetes, Docker Swarm, Dask YARN, and IBM Spectrum Conductor resource managers, along with a simple round-robin distributed mode that uses SSH to accomplish kernel remoting.
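To make that concrete, here's a sketch of wiring a Notebook server to a gateway via NB2KG, following the NB2KG README (the gateway host and port are placeholders - substitute your own):

```shell
# Install and enable the NB2KG server extension on the Notebook server
pip install nb2kg
jupyter serverextension enable --py nb2kg --sys-prefix

# Point NB2KG at the Enterprise Gateway server (placeholder address)
export KG_URL=http://eg-host:8888

# Swap in NB2KG's remote managers so all kernel requests are
# redirected to the gateway instead of running locally
jupyter notebook \
  --NotebookApp.session_manager_class=nb2kg.managers.SessionManager \
  --NotebookApp.kernel_manager_class=nb2kg.managers.RemoteKernelManager \
  --NotebookApp.kernel_spec_manager_class=nb2kg.managers.RemoteKernelSpecManager
```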
Cluster administrators would most likely be the ones responsible for configuring an Enterprise Gateway installation.
Although there is quite a lot of information in our documentation, we completely agree that it could use better organization and would like to break it down into role-based topics for administrators, data scientists, and other stakeholders. We would greatly appreciate contributions in this area, as well as others.
hey @kevin-bates - per your response I tried slightly updating the “about” text above in @betatim’s post, but please let me know if there’s a better wording to use!
Sounds great @choldgraf - thank you.
Thanks for the description @kevin-bates!
What are the differences between Enterprise Gateway and JupyterHub?
Hi @tekumara. I just noticed that I hadn't answered this question when it appeared earlier in the thread - I apologize - it is commonly asked.
At its core, JupyterHub is a spawner. What it typically spawns are Notebook servers tied to an authenticated user. As such, the Hub supports a rich array of authentication providers that allow organizations to integrate it with their authentication framework.
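Both the spawner and the authentication provider are pluggable via `jupyterhub_config.py`. A minimal sketch - the particular classes below (KubeSpawner from the kubespawner project, the LDAP authenticator from ldapauthenticator) are illustrative choices, not defaults:

```python
# jupyterhub_config.py -- illustrative fragment; the spawner and
# authenticator classes shown are example plug-ins, not requirements.
c = get_config()  # provided by JupyterHub when it loads this file

# Spawn each user's Notebook server as a pod in a Kubernetes cluster
c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'

# Authenticate users against an organization's LDAP directory
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
```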
Each kernel started by any of the spawned Notebook servers runs on that Notebook server. In Kubernetes, for example, the Notebook server is a pod running within the Kubernetes cluster, and all kernels associated with that Notebook server also run in that pod. So if a user needs high-end processing for one of their notebooks, the spawned pod must support that high-end processing (GPU, TPU, etc.) even though the user's other kernels would still consume resources from this "expensive" pod.
Enterprise Gateway, on the other hand, is merely a kernel server. Notebook servers can be configured to proxy their kernel-related operations to an EG server via Notebook's `--gateway-url` option (the NB2KG server extension mentioned previously has been baked into the Notebook server as of release 6.0). On request, EG spawns and manages the kernel's lifecycle across various resource-managed clusters (see previous). This allows for more optimal utilization of resources, since each kernel runs independently of its associated Notebook server.
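With Notebook 6.0 and later, no separate extension is needed - the proxying described above reduces to a single launch option (the gateway address here is a placeholder):

```shell
# Notebook 6.0+: built-in gateway support replaces the NB2KG extension.
# All kernel management is proxied to the Enterprise Gateway at this URL
# (placeholder host/port -- substitute your EG server's address).
jupyter notebook --gateway-url=http://eg-host:8888
```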
It is recommended that EG reside behind a reverse proxy like Hub or Apache Knox, since EG itself doesn't provide authentication capabilities. EG does allow for basic application-level authorization, whereby an administrator can restrict various kernels to various users by configuring authorization lists.
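For instance, EG's documentation describes allow/deny lists that can be supplied through its environment when the gateway is launched - a sketch (the user names are made up):

```shell
# Only these users may start kernels (comma-separated allow list)
export EG_AUTHORIZED_USERS="alice,bob"

# Alternatively, deny specific users while allowing everyone else
# export EG_UNAUTHORIZED_USERS="guest"

# Launch Enterprise Gateway with the authorization lists in effect
jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0
```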
So, in essence, EG should be thought of as a means of optimizing resources. If users tend to work within only a single notebook, there really isn't a benefit to EG. It becomes advantageous when the notebook user (data scientist, analyst, etc.) requires multiple active notebooks, where at least one requires a fair amount of resources that you don't necessarily want to allocate via a single notebook server/pod.