Hello,
We run a JupyterHub service on a VM with around 150 daily users. For some time now, we have been seeing intermittent instability during the day. At first, we thought the cause was a single user generating too much activity at once and making the service unusable for everyone else.
During our investigation, we repeatedly found problems linked to a process hitting its limit on the number of open files, which produced errors in the logs such as ENOTFOUND, EMFILE, and EBUSY. We raised the limit on the VM, which helped for a while but did not fix the problem.
When we monitored the number of open files for the different JupyterHub processes (python, nodejs, nginx), we noticed that nodejs sometimes does not close all of its open files during the night, when almost no users are online. Usually it closes most of its files and keeps just over 40 open, but when it doesn't, the number keeps growing until it reaches a limit (a little over 2100) and the service breaks. We then have to restart the service, after which the number of open files goes back to normal.
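For anyone who wants to reproduce this kind of monitoring, here is a minimal sketch of one way to count open files per process. It assumes a Linux host where `/proc` is available and that the process can be matched by its command name (for example `node`). The function names and the `proc_root` parameter are illustrative and not taken from our actual tooling.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count open file descriptors of one process (Linux only)."""
    # Every entry in /proc/<pid>/fd is one open descriptor.
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_counts_by_name(name, proc_root="/proc"):
    """Map PID -> open descriptor count for processes whose command name is `name`."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() != name:
                    continue
            counts[int(entry)] = count_open_fds(entry, proc_root)
        except OSError:
            # Process exited in the meantime, or we lack permission to inspect it.
            continue
    return counts
```

Calling `fd_counts_by_name("node")` at regular intervals and logging the result should show whether the count drops back overnight or keeps climbing. Reading `/proc/<pid>/fd` for another user's process generally requires root.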
For now, we have a script that runs during the night and restarts the JupyterHub service if the number of files opened by nodejs is too high, but that is not really a fix.
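The logic of such a workaround can be sketched roughly as follows. This is not our actual script. It assumes the service is a systemd unit named `jupyterhub`, that a restart threshold of 1500 is sensible (chosen here only because it sits below the point where the service broke), and that a mapping of PID to open-file count has already been collected, for instance by listing `/proc/<pid>/fd`. All names are hypothetical.

```python
import subprocess

SERVICE_NAME = "jupyterhub"  # assumption: name of the systemd unit
FD_THRESHOLD = 1500          # assumption: set below the ~2100 point where the service broke


def should_restart(fd_counts, threshold=FD_THRESHOLD):
    """True if any monitored process holds more descriptors than the threshold."""
    return any(count > threshold for count in fd_counts.values())


def restart_service(service=SERVICE_NAME, run=subprocess.run):
    """Restart the service through systemctl; `run` is injectable for testing."""
    run(["systemctl", "restart", service], check=True)


def check_and_restart(fd_counts, threshold=FD_THRESHOLD,
                      service=SERVICE_NAME, run=subprocess.run):
    """Restart the service if needed; return whether a restart was triggered."""
    if should_restart(fd_counts, threshold):
        restart_service(service, run)
        return True
    return False
```

Run from a nightly cron job or systemd timer, `check_and_restart({pid: count, ...})` only restarts the service when at least one nodejs process is above the threshold, so normal nights are left untouched.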