I have a deployment of JupyterHub on AWS EKS with multiple nodes.
An issue my team keeps running into is that after a class is over, our nodes begin emptying as users log out or the culler removes inactive pods. About an hour after the class ends, most single-user pods have been shut down one way or another. However, a few users always keep their pods active for hours after the class has finished. These remaining pods are generally spread sparsely across the cluster nodes, often with only one or two active single-user pods left on each node. Because no node is ever completely empty, the autoscaler cannot reduce the node count.
In the interest of minimizing server costs, we are looking for a solution to reschedule the sparsely distributed single-user pods onto a single node.
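To make the goal concrete, here is a rough sketch of the kind of consolidation plan we have in mind. It is plain Python and does not talk to Kubernetes at all; the function name `plan_consolidation`, the node and pod names, and the `capacity` parameter (maximum single-user pods per node) are all made up for illustration. It only decides which pods would need to move so that the emptiest nodes become free. Actually moving a pod without interrupting the user is the part we do not know how to do.

```python
from typing import Dict, List, Optional, Tuple


def plan_consolidation(
    pods_by_node: Dict[str, List[str]], capacity: int
) -> Tuple[Dict[str, str], List[str]]:
    """Greedy sketch: empty the least-loaded nodes into the fullest ones.

    pods_by_node maps a (hypothetical) node name to its single-user pods.
    capacity is an assumed maximum number of pods per node.
    Returns (moves, freed_nodes), where moves maps pod -> target node.
    """
    load = {node: len(pods) for node, pods in pods_by_node.items()}
    moves: Dict[str, str] = {}
    freed = set()
    receivers = set()

    # Visit nodes from emptiest to fullest; ties are broken by name.
    for source in sorted(pods_by_node, key=lambda n: (load[n], n)):
        # A node that already received pods stays in service.
        if source in receivers:
            continue

        trial = dict(load)
        plan: Optional[Dict[str, str]] = {}
        for pod in pods_by_node[source]:
            candidates = [
                n for n in trial
                if n != source and n not in freed and trial[n] < capacity
            ]
            if not candidates:
                plan = None  # this node cannot be fully emptied
                break
            # Prefer the fullest node that still has room.
            target = max(candidates, key=lambda n: (trial[n], n))
            trial[target] += 1
            plan[pod] = target

        if plan is None:
            continue

        # Commit the moves only if every pod on the source found a home.
        trial[source] = 0
        load = trial
        moves.update(plan)
        freed.add(source)
        receivers.update(plan.values())

    return moves, sorted(freed)


if __name__ == "__main__":
    cluster = {"node-a": ["u1"], "node-b": ["u2", "u3"], "node-c": ["u4"]}
    print(plan_consolidation(cluster, capacity=4))
```

In this example the two stragglers on `node-a` and `node-c` would both be placed on `node-b`, leaving two nodes empty for the autoscaler to remove.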
@yuvipanda are you aware of any existing solution that would meet this need? If not, do you have any suggestions on how we might achieve this while minimizing interruptions to users during the rescheduling process?
Peter