Our app uses JupyterHub as a backend. When we launch notebook servers for users, we poll the /hub/api/users/{user} endpoint to get the user JSON, which includes status information for the user’s servers. The reason we poll is because the status event endpoint is not available outside of /hub origin.
This breaks down pretty quickly - we poll once every three seconds and with a small number of users (~10?) the response latency goes through the roof, which makes the requests just pile up.
I ran some very simple tests using Locust (https://locust.io) just hitting the /hub/api/users/{user} endpoint. This is the resulting chart:
Thanks for investigating! My guess is that it’s the activity tracking that’s bogging it down. Every authenticated request is performing a write to the database to update the activity counter for the authenticated user/token. That’s probably not good! We should probably throttle this so we don’t record activity for a given user more often than every ~30 seconds or so. I’ve added config to do this: "add activity_resolution config" by minrk (Pull Request #2605, jupyterhub/jupyterhub).
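To make the throttling idea concrete, here is a minimal sketch of recording activity at most once per resolution window. It is not JupyterHub's actual implementation; the `ActivityTracker` class, its `record` method, and the in-memory dict standing in for the database are all hypothetical names used for illustration.

```python
from datetime import datetime, timedelta


class ActivityTracker:
    """Hypothetical stand-in for the Hub's activity bookkeeping.

    A dict plays the role of the database; `writes` counts how many
    times we would have committed to it.
    """

    def __init__(self, resolution_seconds=30):
        self.resolution = timedelta(seconds=resolution_seconds)
        self.last_activity = {}  # user name -> datetime of last recorded activity
        self.writes = 0

    def record(self, user, now):
        """Record activity for `user` unless it was recorded too recently.

        Returns True if a write happened, False if it was skipped.
        """
        last = self.last_activity.get(user)
        if last is None or now - last >= self.resolution:
            self.last_activity[user] = now
            self.writes += 1
            return True
        return False


if __name__ == "__main__":
    # Simulate one user polling every 3 seconds for one minute (20 requests).
    start = datetime(2020, 1, 1)
    tracker = ActivityTracker(resolution_seconds=30)
    for offset in range(0, 60, 3):
        tracker.record("alice", start + timedelta(seconds=offset))
    print(f"20 requests, {tracker.writes} writes")
```

With a 30-second resolution, a client polling every three seconds triggers a write on roughly one request in ten instead of on every request.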
Is your locust setup anywhere that we can look at? I’d love to have something to play with.
> The reason we poll is because the status event endpoint is not available outside of /hub origin.
Can you clarify this one? Cookie-authenticated requests are indeed restricted to /hub/ origins to avoid issues with user-to-user CORS, but token-authenticated requests should have no such limitation.
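For reference, a token-authenticated request can be sketched as below. This only builds the request object and sends nothing. The `build_hub_request` helper, the example host, and the token value are hypothetical; the `Authorization: token ...` header format matches the locust script later in this thread. The path of the status event endpoint is not named in this thread, so the caller supplies it.

```python
from urllib.request import Request


def build_hub_request(hub_url, api_path, token):
    """Build (but do not send) a token-authenticated Hub API request.

    hub_url:  base URL of the deployment, e.g. "https://hub.example.com"
    api_path: path below /hub/api/, e.g. "users/alice"
    token:    an API token; no cookies are involved
    """
    url = hub_url.rstrip("/") + "/hub/api/" + api_path.lstrip("/")
    return Request(url, headers={"Authorization": f"token {token}"})


if __name__ == "__main__":
    req = build_hub_request("https://hub.example.com", "users/alice", "abc123")
    print(req.full_url)
    print(req.get_header("Authorization"))
```

Passing the result to `urllib.request.urlopen` would perform the actual call.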
> Every authenticated request is performing a write to the database to update the activity counter for the authenticated user/token.
Ok that confirms my hunch (I was surprised by the commits to the DB when retrieving information). It seems that API requests should be excluded from activity tracking, since an application may be doing the request on behalf of the user.
re: CORS - that rings a bell, it’s been a while since I’ve tried this. It’s entirely possible that I did it with cookies. Will try again, thanks for clarifying that point!
re: locust - I was just running it from my laptop against the deployed JH instance, so nothing to look at, but this is the setup:
from locust import HttpLocust, TaskSet, task


class UserBehavior(TaskSet):
    @task(1)
    def page_projects(self):
        # Hit the Hub API with a token, the same way our app does.
        self.client.get(
            '/jupyterhub/hub/api/user',
            headers={'x-requested-with': 'XMLHttpRequest',
                     'Authorization': 'token <token>'},
        )


class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    # Each simulated user waits 100 to 5000 ms between requests.
    min_wait = 100
    max_wait = 5000
I don’t think so. When a user makes an API request, that is activity. If an application is using a user token to make such a request, I think that ought to be considered activity for the user, since using a user token in this way is the application pretending to be the user. Plus, activity is tracked for the token itself. I think the key is reducing how often the activity is recorded, so the write doesn’t happen on every request.
I’m digging into profiling and performance scale/load testing our hub and came across this older thread. I just wanted to say that earlier in the year, when we had some load issues on the hub API during a large user event, that config was the thing we found and tuned to ease the pressure, so definitely thanks for adding that. We actually bumped it up from the default (30 seconds) to 6000 seconds.
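For anyone landing here later, the change amounts to roughly the following in jupyterhub_config.py. This is a sketch of our setting, not a recommendation; 6000 is simply the value that worked for our event.

```python
# jupyterhub_config.py (fragment; get_config() is provided by JupyterHub
# when it loads this file, so this does not run standalone)
c = get_config()  # noqa: F821

# Record activity for a given user/token at most once per this many seconds.
# The default is 30.
c.JupyterHub.activity_resolution = 6000
```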
While this is fresh in mind and I’ve been poking around docs looking at performance tuning, would it be worthwhile to mention that config option here [1]?
Our team has identified a performance issue with the KubeSpawner’s PodReflector when running at high scale (2-3K notebook server pods) [1]. The default behavior of the kubernetes Python client is to convert JSON K8S API responses into objects, and using py-spy we see that a lot of time is spent in that conversion. Reducing the CPU time spent by KubeSpawner makes the hub API much more responsive. We have also turned down activity_resolution, hub_activity_interval, and last_activity_interval (we are going to try to optimize the update_last_activity method, which can be slow when you have 3K users), and we have disabled events.