Description
Recently I’ve observed that EG cannot handle 30 concurrent kernel start requests. Here’s the itest code:
```python
import logging
import unittest
from multiprocessing import Pool

# GatewayClient comes from the EG integration-test helpers.
from enterprise_gateway.client.gateway_client import GatewayClient

LOG = logging.getLogger(__name__)


def scale_test(kernelspec, example_code, _):
    """Test function for scalability test."""
    res = True
    gateway_client = GatewayClient()
    kernel = None
    try:
        # The scalability test is a kind of stress test, so expand
        # launch_timeout to our service request timeout.
        kernel = gateway_client.start_kernel(kernelspec)
        if example_code:
            res = kernel.execute(example_code)
    finally:
        if kernel is not None:
            gateway_client.shutdown_kernel(kernel)
    return res


class TestScale(unittest.TestCase):
    # KERNELSPECS, TEST_MAX_PYTHON_SCALE, and TEST_MAX_SPARK_SCALE are
    # assumed to be supplied by the test harness / environment.

    def _scale_test(self, spec, test_max_scale):
        LOG.info('Spawn {} {} kernels'.format(test_max_scale, spec))
        example_code = []
        example_code.append('print("Hello World")')
        with Pool(processes=test_max_scale) as pool:
            children = []
            for i in range(test_max_scale):
                children.append(pool.apply_async(scale_test,
                                                 (self.KERNELSPECS[spec],
                                                  example_code,
                                                  i)))
            test_results = [child.get() for child in children]
        for result in test_results:
            self.assertRegexpMatches(result, "Hello World")

    def test_python3_scale_test(self):
        test_max_scale = int(self.TEST_MAX_PYTHON_SCALE)
        self._scale_test('python3', test_max_scale)

    def test_spark_python_scale_test(self):
        test_max_scale = int(self.TEST_MAX_SPARK_SCALE)
        self._scale_test('spark_python', test_max_scale)
```
I’ve set `LAUNCH_TIMEOUT` to 60 seconds and used kernelspecs whose images were already pulled onto the node. In the case of the Spark kernel, the situation got even worse, because the `spark-submit` processes launched by EG cause process starvation between the EG process and the other `spark-submit` processes.
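For reference, here is roughly how I extend the timeout per request; a minimal sketch assuming the standard `/api/kernels` REST endpoint and the `KERNEL_LAUNCH_TIMEOUT` env passthrough (the endpoint URL and kernelspec name below are placeholders):

```python
import json

import requests

EG_URL = "http://localhost:8888"  # assumed EG endpoint

# Ask EG for a longer launch window by passing KERNEL_LAUNCH_TIMEOUT in the
# start request's env stanza (EG_KERNEL_LAUNCH_TIMEOUT sets the server default).
response = requests.post(
    f"{EG_URL}/api/kernels",
    data=json.dumps({
        "name": "spark_python_kubernetes",       # hypothetical kernelspec name
        "env": {"KERNEL_LAUNCH_TIMEOUT": "60"},  # seconds
    }),
)
response.raise_for_status()
print(response.json()["id"])  # the new kernel's id
```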
While the test ran, CPU utilization rose above 90% (on a 4-core, 8 GiB memory instance).
I know there’s HA work in progress, but it looks like an active/stand-by mode. With that approach we can’t scale EG out, only up, and scaling up is always limited: we cannot grow the instance beyond the node EG is running on.
For those reasons, I want to start working on the scalability of EG, and I’d like your opinion on the following ideas. (Let me assume EG is running on Kubernetes.)
- Process starvation: in order to resolve process starvation in the EG instance, I have two ideas (sketched right after this list).
  - Spawn a `spark-submit` pod and a `launch-kubernetes` pod instead of launching local processes. By using containers, the `spark-submit` process is isolated from the EG instance.
  - Create a separate `submitter` pod. The `submitter` pod queues the requests from EG and launches processes with a limited process pool. This `submitter` pod is also scalable, because EG always passes the parameters needed to launch a process.
- Session persistence: this is actually a very delicate issue with a high risk of side effects. My idea is to move all session objects into an external in-memory DB such as Redis, so that every time EG needs to access a session, it reads the session information from Redis (see the sketch below). I cannot estimate how much of the source would have to change, and I haven’t looked into the code yet, but I’m guessing the Session Manager and Kernel Manager classes would be affected. Could anybody give me feedback on this?
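To make the first idea a bit more concrete, here is a minimal sketch assuming the official `kubernetes` Python client and in-cluster configuration; the image, namespace, and resource limits are placeholders, and `spawn_spark_submit_pod` is a hypothetical helper, not part of EG:

```python
from kubernetes import client, config


def spawn_spark_submit_pod(kernel_id, submit_args):
    """Run spark-submit in its own pod so the process is isolated from EG."""
    config.load_incluster_config()  # EG itself runs inside the cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"spark-submit-{kernel_id}"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="spark-submit",
                    image="my-registry/spark-submitter:latest",  # hypothetical image
                    command=["spark-submit"] + submit_args,
                    resources=client.V1ResourceRequirements(
                        # Cap each submit so it can't starve neighbors.
                        limits={"cpu": "1", "memory": "1Gi"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(
        namespace="enterprise-gateway", body=pod)
```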
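For the second idea, a rough sketch of the `submitter` pod’s core loop, under the same caveat (nothing here is EG code): launch parameters received from EG go into a queue, and a fixed-size process pool bounds how many `spark-submit` processes run at once:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from queue import Queue

MAX_CONCURRENT_SUBMITS = 4  # bounded pool prevents process starvation

# In a real submitter pod, an HTTP/RPC layer would feed this queue with the
# launch parameters EG passes along.
launch_requests = Queue()


def run_launch(argv):
    """Execute one launch command (e.g. spark-submit ...) to completion."""
    return subprocess.run(argv, capture_output=True).returncode


def submitter_loop():
    with ProcessPoolExecutor(max_workers=MAX_CONCURRENT_SUBMITS) as pool:
        while True:
            argv = launch_requests.get()  # blocks until EG sends a request
            pool.submit(run_launch, argv)


if __name__ == "__main__":
    launch_requests.put(["spark-submit", "--version"])  # example request
    submitter_loop()  # runs forever; the pod's main process
```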
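And to sketch the session-persistence idea (the key layout and helper names are hypothetical, not EG’s real classes): every session write goes straight to Redis, so any EG replica can rehydrate the state:

```python
import json

import redis

r = redis.Redis(host="redis", port=6379)  # assumed in-cluster Redis service


def save_session(kernel_id, session_info):
    """Persist session state externally so any EG replica can serve the kernel."""
    r.set(f"eg:session:{kernel_id}", json.dumps(session_info))


def load_session(kernel_id):
    raw = r.get(f"eg:session:{kernel_id}")
    return json.loads(raw) if raw is not None else None


# Hypothetical usage inside a kernel manager:
save_session("abc-123", {"kernelspec": "python3",
                         "connection_info": {"shell_port": 52000}})
print(load_session("abc-123"))
```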
With those two changes, I think we could scale EG out across multiple instances. Any advice would be appreciated.
Thanks.
Environment
- Enterprise Gateway Version 2.x with Asynchronous kernel start feature (#580)
- Platform: Kubernetes
- Others: nb2kg latest