Dear all,
I am managing a JupyterHub instance on top of a MicroK8s server on a supercomputer.
We have used multiple versions of JupyterHub, and even DaskHub, and the problem is always present.
Our current JupyterHub version is 1.2.0, since we are using daskhub-2022.6.0.
We have been noticing for a long time (roughly two years) a weird behavior in our CPU usage when doing multiprocessing.
As you can see in the GIF below, the first time we run the multiprocessing code, all 64 CPUs are used. But once the kernel is restarted, it only ever runs on 1 CPU until we go through a weird and long workflow:
shut down all kernels,
shut down the server,
log out,
log in,
start the server,
shut down all kernels,
start a kernel.
And even then, sometimes it works again and sometimes it does not. We have not been able to pinpoint which step is actually the important one.
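For reference, here is a minimal sketch of the kind of CPU-bound multiprocessing workload involved (our actual code is different; the `burn` function is a made-up stand-in that just keeps one core busy per worker):

```python
import multiprocessing
import os

def burn(n):
    # CPU-bound busy loop: each worker should saturate one core
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_pool():
    # On a healthy kernel this spreads across all available CPUs;
    # after a kernel restart everything stays on a single CPU.
    with multiprocessing.Pool(os.cpu_count()) as pool:
        return pool.map(burn, [200_000] * os.cpu_count())

if __name__ == "__main__":
    print(len(run_pool()))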
What we have already tested:
If we run a standalone Jupyter notebook on the same computer, there is no problem.
If I manually kill the user pod, the hub pod, or the proxy pod, the problem is still there.
There is no CPU affinity set anywhere; I double- and triple-checked at every level of the machine.
I am running out of ideas here and was hoping someone has already seen this problem.
Can you reproduce this problem if you run JupyterLab on k8s on its own, without JupyterHub? If you can that simplifies things, if you can’t then check the configuration of the pod when it’s launched by JupyterHub, and add those options to your manually created JupyterLab pod until you hopefully reproduce the problem.
> If we run a standalone Jupyter notebook on the same computer, there is no problem.
Is this outside microk8s, or still in a manual microk8s pod? If outside, is it still in a container?
Can you also verify whether the pool processes are shut down after you restart the kernel? I wonder if leftover processes from the pool, due to an unclean shutdown of the kernel the first time, could be related.
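To make that check concrete, here is a sketch using psutil (assuming it is available in the image) that lists the kernel's children and scans for orphaned workers; the name match on `multiprocessing`/`resource_tracker` is only a heuristic:

```python
import psutil

# Children of the current process (run this in a fresh cell after restart;
# a clean restart should show no leftover pool workers here)
me = psutil.Process()
for child in me.children(recursive=True):
    print(child.pid, child.name(), child.status())

# Scan for workers that were reparented after an unclean kernel shutdown
for p in psutil.process_iter(["pid", "cmdline"]):
    cmd = " ".join(p.info["cmdline"] or [])
    if "multiprocessing" in cmd or "resource_tracker" in cmd:
        print(p.info["pid"], cmd)
```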
One shot in the dark to try: before starting the pool, set the start method to spawn instead of fork:

```python
multiprocessing.set_start_method('spawn')
```
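For context, a minimal self-contained sketch of that suggestion (the `square` function is a made-up example). Note that `set_start_method` may only be called once per process, so in a notebook `multiprocessing.get_context("spawn")` can be more convenient; also, spawn workers re-import the main module, so functions defined interactively in notebook cells may not be importable by them:

```python
import multiprocessing

def square(x):
    return x * x

if __name__ == "__main__":
    # Must be called once, before any pool or process is created
    multiprocessing.set_start_method("spawn")
    with multiprocessing.Pool(4) as pool:
        print(pool.map(square, range(8)))
```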
> There is no CPU affinity set anywhere; I double- and triple-checked at every level of the machine.
Were you checking configuration, or inspecting the processes at runtime? Since this behavior looks so much like CPU affinity pinning (something like the forked subprocess modifying something somewhere that affects the parent when it shouldn't), checking at runtime would give me more confidence that it's truly not involved. You can do this with psutil or taskset: what do you get from `taskset --all-tasks -cp 1` and/or the Python code:

```python
import psutil

for p in psutil.process_iter():
    print(p.pid, p.cpu_affinity())
```
Both inside microk8s (with a manual pod) and outside microk8s, not in a container.
Thanks for your advice, I will try it.
In the meantime, I gave up on the lead of rebuilding a manual pod with the same config. On another note, I noticed that the base image of the server has a greater or lesser impact.
For example, with my private Docker image it happens on every restart, while with one of the official jupyterlab base images it happens less often (it still happens, though).
In the end, excluding CPUs 0-3 from the pool did the trick, but it is not very practical.
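For anyone hitting the same issue, one way to implement that workaround (a sketch assuming Linux, using `os.sched_setaffinity`; not necessarily exactly what we did) is to pin each pool worker to the reduced CPU set via an initializer:

```python
import multiprocessing
import os

def pin(worker_cpus):
    # Initializer run in each worker: restrict it to the given CPU set
    os.sched_setaffinity(0, worker_cpus)

def work(x):
    return x * x

def run():
    # Linux-only: exclude CPUs 0-3, use the rest;
    # fall back to the full set on small machines
    allowed = set(os.sched_getaffinity(0)) - {0, 1, 2, 3}
    if not allowed:
        allowed = set(os.sched_getaffinity(0))
    with multiprocessing.Pool(
        processes=len(allowed),
        initializer=pin,
        initargs=(allowed,),
    ) as pool:
        return pool.map(work, range(8))

if __name__ == "__main__":
    print(run())
```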
Thanks for the input!