CPU limits in JupyterHub/k8s

Hello,

I am running JupyterHub on microk8s on a supercomputer (1 node, 64 CPUs).

I added a CPU request/limit of 8/24 to the config.yaml.
When opening a user session, I use the top command to look at my CPU usage.
Would you expect to see all 64 CPUs here? I would have naively thought we would only see the CPUs used by the pod, i.e. between 8 and 24.

Even with this limit and only 4 users at the same time, we are seeing a huge slowdown, mainly because all the usage is concentrated on at most 3 CPUs. Two of these users are using Dask with 10 threads and big data, so their CPU/memory usage is high.

Do you have an idea of what is happening here?

config.yaml for the limit part:

  singleuser:
    cpu:
      guarantee: 8
      limit: 24
    memory:
      limit: 128G

Thanks for your advice

Getting usage inside a container can be tricky. The container will generally be able to “see” all the CPUs on the host, but not actually use them if you’ve set limits. However, this can mean that tools which auto-select thread counts based on hardware availability pick too many, causing big slowdowns. If you can, I’d try using top, etc. outside the container to see what real usage is.

If you are only seeing 3 CPUs used, that makes me suspicious of CPU affinity being set, which can pin all threads in a given group (perhaps a whole pod). Often this is related to the BLAS library in use. It could be set in an environment variable somewhere, but it can be hard to track down.
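A quick way to check from inside a user pod is something like the sketch below (the environment variable names are just the usual OpenBLAS/OpenMP/MKL ones; your image may use others):

  import os

  # CPUs the kernel reports vs. CPUs this process is actually allowed to run on
  print("os.cpu_count():", os.cpu_count())
  allowed = sorted(os.sched_getaffinity(0))
  print("sched_getaffinity:", allowed, "->", len(allowed), "usable CPUs")

  # Common thread-count variables that BLAS/OpenMP libraries honour
  for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
              "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
      print(var, "=", os.environ.get(var, "<unset>"))

If the affinity set is much smaller than the CPU count, or the thread-count variables are set surprisingly high or low, that would be a good place to dig.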

I’d also recommend setting the request and limit to the same value while you are tracking things down, to narrow down the variables.

Thanks for your answer.
Running top outside the container gives me something similar, with only around 3 CPUs used.

However, I think I found out why the hub and proxy slowed down so much each time I ran a notebook. I added some CPU requests to the config file, similar to an example found in the documentation, and since then at least the whole JupyterHub no longer slows down.

I still have a CPU problem: if I run two notebooks at the same time, they use the same CPU even though many more are available, which makes them very slow.
I will look into the CPU affinity / BLAS library in use, that could be a good lead. Thank you!

Looks like NumPy is using OpenBLAS in the background, and the CPU affinity for every process I could find is ffffffffffffffff, so everything looks fine on that side.
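In case it helps someone else, this kind of check can be done with something like the following (threadpoolctl is an extra package and may not be installed in your image):

  import os
  import numpy as np

  # Which BLAS NumPy was built against
  np.__config__.show()

  # Affinity mask of the current process (should list all CPUs if nothing is pinned)
  print(sorted(os.sched_getaffinity(0)))

  # Optional: per-library thread counts, if threadpoolctl is available
  try:
      from threadpoolctl import threadpool_info
      print(threadpool_info())
  except ImportError:
      pass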

Maybe things would be clearer with a Grafana dashboard. Do you know of a guide for installing a Grafana dashboard for JupyterHub? I know there is one specialized for JupyterHub metrics, but I did not find a clear guide on how to set everything up.
Thanks!

We are in fact working on a tool for that here using jsonnet.

Multiple processes stuck on the same CPU really does sound like pinning happening at some level, but unfortunately I don’t know all the ways this can happen. The fact that you are on a supercomputer makes me suspicious that there might be other resource management involved somehow. How is microk8s being launched? Is it inside the system’s job control, or just directly on a node without interacting with Slurm etc.?

If I understand correctly, you are seeing one pod using a maximum of one CPU, no matter how many processes, but another pod can use another CPU? So there’s an effective CPU limit of 1 on the pods? Does it change if you remove the CPU request and limit at the Kubernetes level?
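One rough way to test that directly is to run something CPU-bound with several processes in a user pod and watch top on the node while it runs; a sketch like this (the worker count is arbitrary):

  # Spin up several CPU-bound workers; if nothing is pinned and the limit allows it,
  # they should spread over several host CPUs (watch top/htop on the node meanwhile).
  import multiprocessing as mp
  import time

  def burn(seconds=30):
      end = time.time() + seconds
      x = 0
      while time.time() < end:
          x += 1
      return x

  if __name__ == "__main__":
      n_workers = 8  # arbitrary, just needs to be > 1
      with mp.Pool(n_workers) as pool:
          pool.map(burn, [30] * n_workers)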

microk8s is installed directly on the node without any interaction with Slurm or anything else (as you would on a classic computer, in fact). I agree with you that there is probably a config somewhere pinning CPUs.

The exact usage is a bit unclear, but I would say yes: 1 CPU per pod, even in a multiprocessing/multithreading configuration. Sometimes it can expand to more CPUs, but not reliably.

No change if I remove the CPU request/limit.

I will take the time to install Prometheus/Grafana, thank you for the link.

Hello everyone, many months later I have finally fixed my CPU problems. In case someone else working on a supercomputer rather than in the cloud encounters the same problem, I thought I would give some feedback here.

It seems the autoscaler-related scheduling features, which are not useful in my case, were creating weird behavior combined with my supercomputer config.
After deactivating most of them, I am able to multiprocess correctly and use all my CPU cores. Hopefully, it won’t create problems in the long run.

scheduling:
  userScheduler:
    enabled: false
  podPriority:
    enabled: false

Thanks for the help