Scalable Enterprise Gateway

Description

Recently I’ve observed that EG cannot handle 30 concurrent kernel start requests. Here’s the itest code.

    import logging
    import unittest
    from multiprocessing import Pool

    # GatewayClient and the KERNELSPECS / TEST_MAX_* settings come from EG's itest
    # harness; adjust the import/configuration to match your EG checkout.

    LOG = logging.getLogger(__name__)


    def scale_test(kernelspec, example_code, _):
        """Test function for the scalability test: start a kernel, run the sample code, shut it down."""
        res = True
        gateway_client = GatewayClient()

        kernel = None
        try:
            # The scalability test is a kind of stress test, so expand launch_timeout to our service request timeout.
            kernel = gateway_client.start_kernel(kernelspec)
            if example_code:
                res = kernel.execute(example_code)
        finally:
            if kernel is not None:
                gateway_client.shutdown_kernel(kernel)

        return res


    class TestScale(unittest.TestCase):
        def _scale_test(self, spec, test_max_scale):
            LOG.info('Spawn {} {} kernels'.format(test_max_scale, spec))
            example_code = []
            example_code.append('print("Hello World")')

            # Issue all kernel start requests concurrently from a process pool.
            with Pool(processes=test_max_scale) as pool:
                children = []
                for i in range(test_max_scale):
                    children.append(pool.apply_async(scale_test,
                                                     (self.KERNELSPECS[spec],
                                                      example_code,
                                                      i)))

                test_results = [child.get() for child in children]

            for result in test_results:
                self.assertRegex(result, "Hello World")

        def test_python3_scale_test(self):
            test_max_scale = int(self.TEST_MAX_PYTHON_SCALE)
            self._scale_test('python3', test_max_scale)

        def test_spark_python_scale_test(self):
            test_max_scale = int(self.TEST_MAX_SPARK_SCALE)
            self._scale_test('spark_python', test_max_scale)

I’ve set LAUNCH_TIMEOUT to 60 seconds and used kernelspecs whose images were already pulled onto the node. In the case of the Spark kernel, the situation got worse because the spark-submit processes launched by EG cause process starvation between the EG process and the other spark-submit processes.
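
For reference, this is roughly how I raise the timeout when hitting the REST API directly; a minimal sketch, assuming the stock /api/kernels endpoint and that the deployment honors a KERNEL_LAUNCH_TIMEOUT value passed in the request env (the host/port and kernelspec name are placeholders):

    import requests

    # Placeholder EG endpoint; substitute your gateway host/port.
    EG_URL = "http://eg-host:8888/api/kernels"

    # Ask EG to wait up to 60 seconds for the kernel to start.
    body = {
        "name": "spark_python_kubernetes",        # example kernelspec name
        "env": {"KERNEL_LAUNCH_TIMEOUT": "60"},   # per-request launch timeout
    }

    resp = requests.post(EG_URL, json=body)
    resp.raise_for_status()
    print(resp.json()["id"])  # kernel id returned by EG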

When I did the test, CPU utilization rose to more than 90% on a 4-core, 8 GiB memory instance.

I know that there’s work on HA in progress, but it looks like an Active/Stand-by mode. With that approach we can’t scale EG out, only up. However, scaling up always has limitations in that we cannot expand the instance beyond the size of the node EG is running on.

For those reasons, I want to start working on increasing the scalability of EG, and I need your opinion on the following ideas. (Let me assume that EG is running on k8s.)

  • Process starvation: In order to resolve process starvation in the EG instance, I have two ideas (see the sketch after this list).
    1. Spawn a spark-submit pod and a launch-kubernetes pod instead of launching processes. Using containers, isolate the spark-submit process from the EG instance.
    2. Create a separate submitter pod. The submitter pod queues the requests from EG and launches processes with a limited process pool. This submitter pod is also scalable because EG always passes the parameters needed to launch a process.
  • Session persistence: This is actually a very delicate issue with a high possibility of side effects. My idea is to move all session objects into an external in-memory DB such as Redis, so that every time EG needs to access a session, it reads the session information from Redis. I cannot estimate how much of the source would need to be modified since I haven’t looked into the code yet, but I’m guessing I would have to change the Session Manager and Kernel Manager classes. Could anybody give me feedback on this?
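
Here is a rough sketch of what I have in mind for the submitter (idea 2), assuming a plain HTTP endpoint in front of a bounded pool; the port, payload shape, and spark-submit arguments are placeholders, not an existing EG interface:

    import json
    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from http.server import BaseHTTPRequestHandler, HTTPServer

    MAX_CONCURRENT_LAUNCHES = 4  # bound the number of simultaneous spark-submit processes
    launch_pool = ThreadPoolExecutor(max_workers=MAX_CONCURRENT_LAUNCHES)


    def run_spark_submit(args):
        """Run one spark-submit with the parameters EG passed along (placeholder command)."""
        return subprocess.run(["spark-submit"] + args, capture_output=True).returncode


    class SubmitHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length))
            # Queue the launch; the bounded pool keeps the submitter pod from being overwhelmed.
            launch_pool.submit(run_spark_submit, request.get("args", []))
            self.send_response(202)  # accepted; the launch proceeds asynchronously
            self.end_headers()


    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8889), SubmitHandler).serve_forever()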

Through those two changes, I think we can scale out EG instances. Any advice will be appreciated.

Thanks.

Environment

  • Enterprise Gateway Version 2.x with Asynchronous kernel start feature (#580)
  • Platform: Kubernetes
  • Others: nb2kg latest

Hi Evan - thank you for opening this discussion!

The test code snippet is great - would you be willing to contribute that to the project? Also, I need to make sure that when you used #580, you also built notebook and jupyter_client with the corresponding PRs (#4479 and #428, respectively). I assume that’s the case.

I also need to make sure that the kernelspecs python3 and spark_python are indeed using the KubernetesProcessProxy and not getting launched as kernels local to EG. I assume each gets launched into their respective pods, etc.

Your post strikes me as being more about concurrency than scalability - although the two are closely related. Once kernels have started, EG really doesn’t need to do much from a scalability perspective - kernel management is really just a hash map of kernel manager instances that transfer the kernel inputs/outputs between the front end and the kernel, and the web handlers tend to be async already - so it’s a relatively small footprint. However, as you have found, concurrent starts can be an issue - thus the need to introduce more EG instances, once we’ve optimized for the single-instance case of course.

Yes, the current “HA/DR” story is Active/Passive and that shouldn’t be conflated with scalability. I totally agree we need to get to an Active/Active configuration that will alleviate both HA and scale issues.

Process Starvation:
I like the idea of introducing a submitter or launcher pod, although I’m not sure if it just moves the problem into that pod. How did you conclude that this approach might be beneficial? Is k8s imposing process limits on a given pod? Are there areas we can address in lieu of this - i.e., there may still be async-related changes since those PRs haven’t been fully exercised. For example, I’m not sure if shutdown has been fully “asynced” in the PRs. As a result, it might be helpful to insert a delay prior to shutdown such that all kernels are running simultaneously before the first is shut down.
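
To illustrate the delay idea, the worker in your snippet could hold each kernel open before shutting it down so that all kernels overlap; a rough sketch (the 30-second hold is an arbitrary placeholder):

    import time

    def scale_test_with_hold(kernelspec, example_code, hold_seconds=30):
        """Variant of scale_test that keeps the kernel alive before shutdown so starts and runs overlap."""
        gateway_client = GatewayClient()
        kernel = None
        res = True
        try:
            kernel = gateway_client.start_kernel(kernelspec)
            if example_code:
                res = kernel.execute(example_code)
            # Hold the kernel/websocket session open so every kernel is running
            # simultaneously before the first shutdown request is issued.
            time.sleep(hold_seconds)
        finally:
            if kernel is not None:
                gateway_client.shutdown_kernel(kernel)
        return res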

I’m assuming this pod (deployment or service?) would be a singleton and not created/deleted on each kernel start request. The overall kernel start time probably won’t improve since the same sequence must occur, whether from EG or not. What’s nice is that inserting this kind of mechanism doesn’t affect kernel discovery. Keep in mind that in Spark 3, the plan is to provide the ability for users to bring their own pod templates - although launch would still go through spark-submit, a pre-processing step to substitute parameters into the template would need to occur (see #559) - as a result, I think the same pod could handle both styles of launch.
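
For that pre-processing step, I imagine something along these lines; a minimal sketch assuming a simple ${...} placeholder convention, which isn’t necessarily the mechanism #559 will land on:

    from string import Template

    # Hypothetical pod template with placeholders for values EG knows at launch time.
    POD_TEMPLATE = """\
    apiVersion: v1
    kind: Pod
    metadata:
      name: ${kernel_pod_name}
      namespace: ${kernel_namespace}
    spec:
      containers:
        - name: kernel
          image: ${kernel_image}
    """

    def render_pod_template(kernel_pod_name, kernel_namespace, kernel_image):
        """Substitute launch parameters into the user-supplied pod template before spark-submit."""
        return Template(POD_TEMPLATE).substitute(
            kernel_pod_name=kernel_pod_name,
            kernel_namespace=kernel_namespace,
            kernel_image=kernel_image,
        )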

I’m also wondering, if this is a matter of using another entity for the launch with a queue in between, whether it could be implemented generically via threading. That would allow other platforms (on-prem YARN, Docker, etc.) to leverage this feature. Perhaps most of the initial effort could be done in this manner.
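
A generic, in-process version might be little more than a bounded thread pool fronting the launch call; a rough sketch, where launch_process stands in for whatever the process proxy actually invokes and isn’t an existing EG hook:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Bound how many launch commands (spark-submit, docker run, yarn, ...) run at once,
    # regardless of how many kernel start requests arrive concurrently.
    launch_pool = ThreadPoolExecutor(max_workers=4)

    def launch_process(argv):
        """Placeholder for the platform-specific launch command."""
        return subprocess.run(argv, capture_output=True)

    def queue_launch(argv):
        """Called by the kernel start path; returns a future the caller can wait on with a timeout."""
        return launch_pool.submit(launch_process, argv)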

Session Persistence:
Yes - the ultimate target for this is a lightweight, NOSQL DB like Redis, MongoDB (see #562) or Cloudant (CouchDB), etc. and is the driving force behind the recent refactoring of the KernelSessionManager. I think the big effort here is removing EG’s dependency on Kernel Gateway and subsuming the websocket personality handlers so we can add 404 handling that then hits the persisted session storage (whether that be Redis or NFS, etc.). Some of the issues are discussed in #562 and #594. Regardless, we’ll want to continue using sticky sessions so as to not thrash within a given kernel’s session.

I’m not sure we need to worry about references throughout the server. Since sessions are sticky, it’s really a matter of establishing the in-memory state from the persisted storage within the handlers when the handlers detect the kernel_id can’t be found (i.e., 404 handling). Once established, those sessions should be usable. EG really just uses the kernel manager state; sessions are not sent from the client via NB2KG, although users issuing requests directly against the REST API can start a kernel via a POST to /api/sessions.

One area that gets affected as a side effect of this is idle kernel culling, since the kernel’s last activity is stored in the EG process. This is why sticky sessions are critical - a kernel’s management should only switch servers if the original server has gone down. If idle culling is enabled, then this implies the idle periods would reset when the kernel switches servers - which is probably correct, since the new server doesn’t load the kernel’s session state unless there’s activity against it. Where idle culling would fail to behave is if the idle timeout were to occur after the server has died but before the kernel got loaded by another instance. In that case, the kernel would not be culled.
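
To gauge the shape of the persisted-storage piece, here’s a minimal sketch of a Redis-backed session store; the class and method names are illustrative only and don’t mirror the actual KernelSessionManager hooks:

    import json
    import redis  # assumes the redis-py client is installed

    class RedisSessionStore:
        """Illustrative store keeping per-kernel session state in Redis instead of EG memory."""

        def __init__(self, url="redis://localhost:6379/0", prefix="eg:kernel-session:"):
            self._client = redis.Redis.from_url(url)
            self._prefix = prefix

        def save(self, kernel_id, session_info):
            # session_info would carry whatever EG needs to rehydrate the kernel manager.
            self._client.set(self._prefix + kernel_id, json.dumps(session_info))

        def load(self, kernel_id):
            raw = self._client.get(self._prefix + kernel_id)
            return json.loads(raw) if raw else None

        def delete(self, kernel_id):
            self._client.delete(self._prefix + kernel_id)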

Thank you for looking into this! It’s an extremely important topic - one that larger installations can definitely benefit from.


@kevin-bates thanks for your response! I can’t answer everything yet because I haven’t had enough time to read through it and come up with ideas, but I can say that I used the notebook and jupyter_client builds you shared. They were working perfectly before the scalability test. Let me answer the rest of the starvation questions tomorrow ;). I don’t know about other platforms, but threading might not be a solution for starvation - the processes should be isolated!

Well… as for the code snippet, my concerns about making it public are how many concurrent kernel starts are appropriate, and whether it is okay to just print “Hello World” rather than keeping the websocket session open for a longer time. Moreover, for a scalability or stress test, I think we need more results than the Pass/Fail that the unittest library provides - for example, response time and requests per second. I’m considering Locust or JMeter for that, but nothing has been decided.
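
If Locust ends up being the tool, a minimal locustfile could drive the kernel start endpoint directly; a sketch assuming the stock /api/kernels REST API and an example kernelspec name:

    from locust import HttpUser, task, between

    class KernelStartUser(HttpUser):
        """Each simulated user starts and stops a kernel, so Locust reports latency and request rate."""
        wait_time = between(1, 5)

        @task
        def start_and_stop_kernel(self):
            # Start a kernel (the kernelspec name is an example).
            resp = self.client.post("/api/kernels", json={"name": "python3"})
            if resp.status_code == 201:
                kernel_id = resp.json()["id"]
                # Clean up so the cluster isn't left full of idle kernels.
                self.client.delete(f"/api/kernels/{kernel_id}")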

By the way, it’s great news to know you think this is important as well!

thanks for your interest!


FYI, launching 30 JVMs for spark-submit was the load in the EG pod.

Thanks for your responses - I know it’s a lot (that’s the way I tend to roll - unfortunately).

ok - I’m not surprised by your starvation comment then. The idea behind k8s is that each kernel runs in its own pod - otherwise there’s no difference from notebook usage, where kernel resources are consumed locally - and Spark drivers consume lots of resources.

Would it be possible to try the scale test using kernelspecs configured with KubernetesProcessProxy? The gating factor for that should be your k8s cluster (rather than the EG pod/namespace/node limits).
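
As a quick sanity check, something like this can confirm which process proxy each kernelspec is configured with; a small sketch assuming the default kernelspec location (adjust the path for your image):

    import json
    from pathlib import Path

    # Example path; point this at the kernelspecs EG serves.
    KERNELSPEC_DIR = Path("/usr/local/share/jupyter/kernels")

    for kernel_json in KERNELSPEC_DIR.glob("*/kernel.json"):
        spec = json.loads(kernel_json.read_text())
        proxy = spec.get("metadata", {}).get("process_proxy", {}).get("class_name", "<local>")
        # Kernels meant for this test should report the KubernetesProcessProxy class.
        print(f"{kernel_json.parent.name}: {proxy}")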

Regarding the starvation issue, I didn’t mean that the kernels run on the EG pod. As you know well, each kernel runs in a separate pod, but the process that launches the kernel runs on the EG pod. I want those processes to run on another pod.

As you mentioned earlier, it just moves the problem to another pod - but one that is scalable.
Let me assume for now that EG has only a Spark kernelspec, which is spawned by spark-submit.
Launching multiple spark-submit processes inside EG slows down the performance of all the processes in EG. What I expect from the submitter idea is that all spark-submit processes run with consistent performance, regardless of how many requests come into EG, by isolating each process in a pod (container).

The first idea is to spawn a pod that launches the spark-submit process for each kernel start request, so that the processes spawned by kernel start requests are isolated and run with consistent performance.
The second idea is to run a daemon pod that launches the spark-submit processes instead of EG.
The second idea might sound like just what EG is doing now, but the point is that the submitter can be scaled out even though EG can’t.

Thanks for your idea about session persistence.
I’ll come back for that soon after taking a look at it more.


Thank you for the details - I misunderstood and thought you might be running kernels locally as part of your test. The files you reference contain the Kubernetes process-proxy configuration - so we’re good.

I’ll be interested to see if this helps. I think you’re right - it might help out EG a bit, although I’m a little anxious that the overall startup time of the kernels will increase. Thank you for digging into this!

Please go ahead and create an issue in EG and we can work from there.


EG issue created: https://github.com/jupyter/enterprise_gateway/issues/732