Weird CPU usage behavior with multiprocessing on a microk8s JupyterHub server

Dear all,
I am managing a JupyterHub instance on top of a microk8s server on a supercomputer.
We have used multiple versions of JupyterHub (and even DaskHub), and the problem has always been present.
Our latest version is 1.2.0 because we are using daskhub-2022.6.0.
We have been noticing a weird behavior in our CPU usage when doing multiprocessing for a long time (probably 2 years).
As you can see in the GIF below, the first time we run the multiprocessing code, all 64 CPUs are used. But once the kernel is restarted, it will always run on 1 CPU until we go through a weird and long workflow:

  • shut down all kernels,
  • shut down the server,
  • log out,
  • log in,
  • start the server,
  • shut down all kernels,
  • start a kernel.

And even then, sometimes it works again and sometimes it does not. We have not been able to pinpoint which step is actually the important one.

[GIF: cpu_utilization_jupyterhub]

What we have already tested:

  • If we run a standalone Jupyter Notebook on the same computer, there is no problem.
  • If I manually kill the user pod, the hub pod, or the proxy pod, the problem is still there.
  • There is no CPU affinity set anywhere; I double- and triple-checked it at every level of the machine.

I am running out of ideas here and was hoping someone has already seen this problem.

Can you reproduce this problem if you run JupyterLab on k8s on its own, without JupyterHub? If you can, that simplifies things; if you can't, then check the configuration of the pod when it's launched by JupyterHub, and add those options to your manually created JupyterLab pod until you hopefully reproduce the problem.

That's a good idea. After testing, the problem does not appear in the manually created JupyterLab. I will try adding options to the config now.

If we run a standalone Jupyter Notebook on the same computer, there is no problem.

Is this outside microk8s, or still in a manual microk8s pod? If outside, is it still in a container?

Can you also verify whether the pool processes are shut down after you restart the kernel? I wonder if leftovers from the pool, due to an unclean shutdown of the kernel the first time, could be related.
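
Something along these lines (a rough sketch using psutil; exactly how leftover workers show up will depend on your image) run in the freshly restarted kernel would show whether any workers survived:

import psutil

# After the kernel restart, list python processes and their parents;
# leftover pool workers from the old kernel would show up here as
# orphaned python processes (typically reparented, e.g. to PID 1).
for p in psutil.process_iter(["pid", "ppid", "name"]):
    if p.info["name"] and "python" in p.info["name"]:
        print(p.info["pid"], p.info["ppid"], p.info["name"])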

One shot in the dark to try: before starting the pool, set the start method to spawn instead of fork:

multiprocessing.set_start_method('spawn')
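
As a standalone script that would look roughly like this (a minimal sketch; in a notebook the worker function may need to live in an importable module for spawn to pick it up):

import multiprocessing as mp
import os

def which_cpu(_):
    # Report which CPUs this worker is allowed to run on.
    return os.getpid(), sorted(os.sched_getaffinity(0))

if __name__ == "__main__":
    # Spawn fresh interpreters instead of forking the kernel process,
    # so workers do not inherit state left over from a previous run.
    mp.set_start_method("spawn")
    with mp.Pool() as pool:
        for pid, cpus in pool.map(which_cpu, range(8)):
            print(pid, cpus)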

There is no CPU affinity set anywhere; I double- and triple-checked it at every level of the machine.

Were you checking code, or inspecting the processes at runtime? Since this behavior looks so much like CPU affinity pinning (something like the forked subprocess modifying something somewhere that affects the parent when it shouldn't), checking at runtime would give me more confidence that it's truly not involved. You can do this with psutil or taskset: what do you get from taskset --all-tasks -cp 1 and/or the following Python code?

import psutil

# Print the CPU affinity of every process visible in the pod/host,
# to confirm nothing has been pinned to a single core at runtime.
for p in psutil.process_iter():
    print(p.pid, p.cpu_affinity())

Both inside microk8s (with a manual pod) and outside microk8s, not in a container.

Thanks for your advice, I will try it.

In the meantime, I gave up on the lead of rebuilding the manual pod with the same config. On the other hand, I noticed that the impact depends on the base image of the server.
For example, with my private Docker image it happens on every restart, while with one of the JupyterLab base images it happens less often (it still happens, though).
In the end, excluding CPUs 0-3 from the pool did the trick, but it is not very practical; a rough sketch of that kind of workaround is below.
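
(A sketch only, assuming a 64-CPU node; restricting the parent's affinity with sched_setaffinity is one way to do it, the exact approach may differ.)

import multiprocessing as mp
import os

def work(x):
    return x * x

if __name__ == "__main__":
    # Restrict the parent (notebook/kernel) process to CPUs 4-63 before
    # the pool is created; forked workers inherit this affinity mask.
    os.sched_setaffinity(0, range(4, 64))
    with mp.Pool() as pool:
        print(pool.map(work, range(10)))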
Thanks for the input.