Towards JupyterHub deployment insights

This topic: a feature exploration

This topic is meant to be a collaborative exploration of what could make sense to develop in order to provide usage-related insights for the funders, administrators, and users of a JupyterHub deployment! Let’s embark on a balancing act to arrive at a viable feature suggestion.

Value of data based insight

JupyterHub admins often need to demonstrate the value and estimate the cost of their JupyterHub deployment, so they could benefit from having accessible information about its usage.

I believe that the typical administrator of a JupyterHub deployment currently has only a vague perception of how it is used, and I think that users mostly know whether they have a server running right now or not. Could we provide significant value by surfacing more information about the collective usage to administrators and about individual usage to individual users?

Let’s consider an example where an institution funds a JupyterHub deployment.

  • If the institution had a measurable indication of the cost and value the deployment provides, that could help motivate its continued funding and development, allocate costs appropriately, and help administrators optimize it further.
  • If individual users were better informed about their own usage and its cost implications, they would likely gain the agency to use the resources appropriately, which would benefit everyone.

Related terminology

  • Monitoring is typically about tracking the current status of various metrics, such as the number of currently running users. Its main purpose is more technical and oriented toward current operations.
  • Events can be emitted and recorded to track discrete occurrences, such as the spawn of a user pod.
  • Key Performance Indicators (KPIs) are values that typically act as a statistical summary over a trailing window of time. For a JupyterHub it could be weekly/monthly active users, where an active user is a user who has started a server at least once, or weekly/monthly regular users, where a regular user is a user who has been active for at least 8 hours per week on average.

Feature inspiration

I added some reference examples of related features; edit this wiki post to add more.

JupyterHub’s admin dashboard

JupyterHub currently provides a snapshot number of the total users in the JupyterHub database and of the currently running servers. It is also possible to list users based on their latest activity.

GitLab’s admin dashboard

A GitLab deployment provides some snapshot indicators of its usage, as well as the latest projects/users/groups, which gives an indication of activity.

Discourse’s reports

Discourse takes it a bit further by presenting a dashboard with an opinionated selection of metrics rendered as graphs, together with suggestive help about each graph. They also present a wide range of reports that can be exported either manually or through a REST API.

Info provided about DAU/MAU when hovering over the question mark:
“Number of members that logged in in the last day divided by number of members that logged in in the last month – returns a % which indicates community ‘stickiness’. Aim for >30%.”
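
As a made-up worked example of that formula: if 120 members logged in during the last day and 300 members logged in during the last month, then DAU/MAU = 120 / 300 = 40%, which would be above the suggested 30% target.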

Grafana

Grafana is a tool dedicated to presenting dashboards of metrics. Grafana typically relies on something like Prometheus, which repeatedly polls various services (example: https://hub.mybinder.org/hub/metrics) to build up a time series of their status. Prometheus can then provide this historic information to Grafana, which can use it to define dashboards with different graphs.

Grafana allows dashboards to be exported as JSON objects and also makes it possible to publish these dashboard descriptions on grafana.com/grafana/dashboards.

This has allowed admins of the mybinder.org deployment to create dashboards backed by the data collected by Prometheus; these dashboards with graphs are publicly available at https://grafana.mybinder.org.

Here is an example from a graph in the Node Activity dashboard. From this, an administrator could learn that they had configured more user placeholders than seem needed. User placeholders are a feature for Z2JH deployments, acting as seat warmers to ensure users don’t end up waiting for nodes to start.

Here is another graph that provides fine-grained insights about the usage of the deployment, but the data presented comes from the Kubernetes API rather than from JupyterHub, so the dashboard definition, which can be extracted as JSON, is tightly coupled to Z2JH.

Since JupyterHub exposes a /hub/metrics endpoint (thanks to @GeorgianaElena and others :tada:) with, for example, the total number of currently spawned servers, we could define a Grafana dashboard with common metrics independent of the kind of JupyterHub deployment.
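
To make that concrete, here is a minimal sketch of fetching and parsing the hub’s metrics, using the same Prometheus text format that a Prometheus server would scrape. The deployment URL is hypothetical, and some deployments may require authentication for /hub/metrics:

```python
# Sketch: fetch and parse the hub's Prometheus metrics text format.
# The deployment URL is hypothetical; some deployments may require
# authentication for /hub/metrics.
import requests
from prometheus_client.parser import text_string_to_metric_families

HUB_URL = "https://hub.example.org"  # hypothetical deployment URL

response = requests.get(f"{HUB_URL}/hub/metrics")
response.raise_for_status()

for family in text_string_to_metric_families(response.text):
    for sample in family.samples:
        # e.g. jupyterhub_running_servers {} 12.0
        print(sample.name, sample.labels, sample.value)
```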

Events with jupyter_telemetry

mybinder.org collects data and makes it available at https://archive.analytics.mybinder.org/. The published data describes the launches made by users at mybinder.org. This is, to some degree, enabled by BinderHub itself rather than something specific to mybinder. It works by logging these events using code that has now been extracted to the jupyter_telemetry package.

@yuvipanda provides a relevant discussion about the difference between events and metrics, which I quote below. Note that Grafana presents metrics above, while @choldgraf tweeted an analysis based on events.

BinderHub also exposes prometheus metrics. These are pre-aggregated, and extremely limited in scope. They can efficiently answer questions like ‘how many launches happened in the last hour?’ but not questions like ‘how many times was this repo launched in the last 6 months?’. Events are discrete and can be aggregated in many ways during analysis. Metrics are aggregated at source, and this limits what can be done with them during analysis. Metrics are mostly operational, while events are for analytics.
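
To make the events side concrete, here is a minimal sketch of recording a custom discrete event with jupyter_telemetry, based on my reading of its EventLog API. The schema id and fields are hypothetical, and the exact schema requirements (e.g. the per-property "categories" classification) depend on the jupyter_telemetry version:

```python
# Sketch: record a custom discrete event with jupyter_telemetry.
# The schema id and fields are hypothetical; the per-property
# "categories" classification is required by recent versions.
import logging

from jupyter_telemetry.eventlog import EventLog

eventlog = EventLog(
    # Events flow through standard Python logging handlers,
    # here appended to a newline-delimited JSON file.
    handlers=[logging.FileHandler("events.jsonl")],
    allowed_schemas=["example.org/server-usage"],
)

eventlog.register_schema(
    {
        "$id": "example.org/server-usage",  # hypothetical schema id
        "version": 1,
        "title": "Server usage",
        "description": "Emitted when a user server starts or stops.",
        "type": "object",
        "properties": {
            "action": {
                "enum": ["start", "stop"],
                "categories": ["unrestricted"],
                "description": "What happened to the server.",
            },
            "username": {
                "type": "string",
                "categories": ["user-identifier"],
                "description": "The user owning the server.",
            },
        },
    }
)

# Record one discrete event; it is validated against the schema
# and emitted through the handlers above.
eventlog.record_event(
    "example.org/server-usage", 1, {"action": "start", "username": "ada"}
)
```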

Website analytics with Matomo

Matomo allows website usage to be recorded and presented; as I understand it, it is like Google Analytics, but open source. The mybinder.org deployment has deployed it alongside BinderHub, and it can help track how many users arrive at mybinder.org and how they navigate various subpages, which may not involve additional requests to the BinderHub backend.

Feature ideas

I list some feature ideas below; edit this wiki post to add yours or edit existing entries.

/metrics endpoint additions

We already expose a /metrics endpoint on JupyterHub, but what metrics do we currently expose, and what metrics do we want to have exposed there?
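
For context on what adding such a metric involves: JupyterHub’s existing metrics are defined with the prometheus_client library, so a new metric would look roughly like this sketch (the metric name, label, and the place it would be updated are hypothetical):

```python
# Sketch: what defining an additional hub metric could look like,
# using prometheus_client as JupyterHub's own metrics do. The metric
# name, label, and the place it gets updated are hypothetical.
from prometheus_client import Gauge

ACTIVE_USERS = Gauge(
    "active_users",
    "number of users active within the given trailing period",
    ["period"],  # e.g. "24h", "7d", "30d"
    namespace="jupyterhub",  # exposed as jupyterhub_active_users
)

# Updated from some periodic housekeeping task in the hub:
ACTIVE_USERS.labels(period="24h").set(42)
```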

Grafana dashboard

Since we have a /metrics endpoint, it would pair very well with a Grafana dashboard that defines the graphs, assuming such data has been collected by Prometheus over a period of time.
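
Each graph in such a dashboard is essentially a Prometheus range query. As a sketch of what Grafana does under the hood, here is the corresponding Prometheus HTTP API call; the Prometheus URL is an assumption, while jupyterhub_running_servers is among the metrics the hub exposes:

```python
# Sketch: the kind of range query a Grafana graph panel issues
# against Prometheus. The Prometheus URL is an assumption.
import time

import requests

PROMETHEUS_URL = "http://prometheus.example.org"  # hypothetical

now = time.time()
resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={
        "query": "jupyterhub_running_servers",
        "start": now - 7 * 24 * 3600,  # one week back
        "end": now,
        "step": "1h",
    },
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], f"{len(series['values'])} samples")
```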

KPI reports for admins

Just like Discourse provides a predefined set of Key Performance Indicators (KPIs) such as DAU/MAU, I think it could be good to define some KPIs that are general enough for all JupyterHubs.

To get KPI reports, I think we need to:

  1. Define a set of KPIs to expose
    • What would an admin want to see? (Operational insights)
    • What would an investor want to see? (Usage / value / outcome insights)
  2. Enable collection of relevant data
    JupyterHub is an extensible system where Spawner, Authenticator, and Proxy are base classes that can be overridden. Some KPIs may require us to add something to these base classes and implement some additional logic in derivative classes like KubeSpawner.
  3. Collect and store relevant data
    We need to collect and store the raw data so we can analyze it later.
  4. Process data into a KPI
    We need to be able to process the collected data into a KPI (see the sketch after this list).
  5. Publish through a web UI and/or API (+ notebook)
    We need to be able to expose the KPIs, either directly from the JupyterHub web UI, through a built in JupyterHub REST API, or through a JupyterHub service.
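
As a rough sketch of step 4, a weekly-active-users KPI could already be approximated from JupyterHub’s existing REST API, which reports each user’s last_activity. The hub URL and API token below are assumptions, and last_activity only reflects each user’s most recent activity, so a real event-based KPI would be more accurate:

```python
# Sketch: approximate weekly active users from the JupyterHub REST
# API's last_activity field. The hub URL and token are assumptions.
from datetime import datetime, timedelta, timezone

import requests

HUB_URL = "https://hub.example.org"  # hypothetical
API_TOKEN = "an-admin-scoped-api-token"  # hypothetical

resp = requests.get(
    f"{HUB_URL}/hub/api/users",
    headers={"Authorization": f"token {API_TOKEN}"},
)
resp.raise_for_status()

one_week_ago = datetime.now(timezone.utc) - timedelta(days=7)
weekly_active = sum(
    1
    for user in resp.json()
    if user["last_activity"]
    and datetime.fromisoformat(user["last_activity"].replace("Z", "+00:00"))
    >= one_week_ago
)
print(f"Weekly active users: {weekly_active}")
```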

Usage reports for users

This regards the idea of providing usage reports to individual users. What information would be beneficial for users to receive?

Events server (a sink for jupyter_telemetry)

We could provide a JupyterHub service (internal, external, or either) that could act as a sink that receives events from various sources and exposes them somehow. This could act as a common place to send events that can then be exposed for analysis.
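
As a rough illustration of the idea (everything here is hypothetical), such a sink could be a small Tornado application (Tornado is the web framework JupyterHub itself is built on) that appends received JSON events to a file:

```python
# Sketch of a hypothetical event sink: a tiny Tornado app that
# accepts POSTed JSON events and appends them to a file. A real
# JupyterHub service would also need authentication and schema
# validation.
import json

from tornado.ioloop import IOLoop
from tornado.web import Application, RequestHandler


class EventSinkHandler(RequestHandler):
    def post(self):
        event = json.loads(self.request.body)
        with open("events.jsonl", "a") as f:
            f.write(json.dumps(event) + "\n")
        self.set_status(204)  # accepted, nothing to return


def main():
    app = Application([(r"/events", EventSinkHandler)])
    app.listen(8999)  # hypothetical port
    IOLoop.current().start()


if __name__ == "__main__":
    main()
```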

Summary and questions to you!

The JupyterHub ecosystem has some mechanisms to provide insights about its usage, such as JupyterHub’s /metrics endpoint and the jupyter_telemetry package, but we could likely benefit from some more pieces. I hope that we can define a feature valuable enough to develop, sustainable to maintain, and with early adopters ready to dogfood it during development.

  1. What feature do you think could make sense to develop?
  2. What insights would you benefit from as a project funder, administrator, or user?
8 Likes

I like the idea and the collected examples are very cool.

I got stuck/captured by the section “KPIs for a JupyterHub” and am wondering what numbers a typical hub admin would want/need to have access to. What would help them?

I think there are (at least) two different situations in which you’d find yourself as an admin:

  1. live usage/debugging the current situation (plots like we have on grafana.mybinder.org)
  2. general/long term health and optimisation of a hub

For (1) something “grafana style” is probably the right answer. The hurdle here is IMHO setup complexity. So maybe something like https://github.com/netdata/netdata (and friends) that “just works” as a single binary is a better recommendation for TLJH and “small” z2jh deployments?

For (2) I am thinking of something that can answer questions like “what is the maximum number of concurrently logged in users?”, “how much peak/median disk space are my users using?”, “how much RAM do my users actually use?”, “how many distinct users actually use this hub regularly?”, “how often and when do we run out of CPU?”, “how often and when do we run out of RAM?”. Having an easy way to get answers to these kinds of questions would help people optimise the cost of their hub by increasing/decreasing the resources they assign. It would let people figure out if a MOOC with 2000 subscribers needs to be able to handle all 2000 using the service at once or 200 or 20 or 2.

For (2) I think we could/should implement some custom things. For (1) we can hopefully find an off-the-shelf solution that can be bundled or recommended.

3 Likes

Thanks for this excellent write-up @consideRatio! I definitely agree with @betatim’s ideas and I’m going to add some thoughts (mostly a wishlist :wink: ) based on both administering and using a jupyterhub over the last year running on AWS EKS with ~200 users (https://aws-uswest2.pangeo.io). It seems like some features could be implemented in any jupyterhub deployment, but others might be limited to k8s.

  1. What feature do you think could make sense to develop?

There are definitely levels to explore here. For starters, I really like the idea of a dashboard endpoint without many external dependencies that exposes more “events” data. The Discourse example interface seems great! Just like the /hub/admin page, a more sophisticated pre-canned /hub/dashboard with access to historical data and fields in addition to ‘Last Activity’ would be highly useful (see response to question 2). A secondary, but equally important, level is facilitating adding external tools to a k8s hub - like prometheus/grafana with pod-level CPU, RAM, and other resource metrics.

  2. What insights would you benefit from as a project funder, administrator, or user?

I’ve never dug into the hub database before, so maybe some things are already in place, but here are several simple data fields per user that would come in handy as an admin - “total logins”, “login/logout timestamps”, “average session duration”, “login IP” (to track geographic usage), “image launched” (if the hub uses a profile list).

CPU, RAM, disk, and network traffic are all extremely useful things to have a record of for hub infrastructure optimization. Each of these metrics can also be translated into cost metrics with some multiplier depending on the cloud provider. I think these metrics are extremely useful for admins but also for users as an educational tool. But I imagine most users wouldn’t go out of their way to track down that information in a centralized metrics endpoint, so one possibility would be to display a per-session summary of these metrics to a hub or binderhub user when they shut down their server.

1 Like

As always, great overview, @consideratio! Thanks for putting that together.

At Berkeley, we’ve published data on our usage pattern and our cost. This lets us do at least some math on how much things cost. The public data is almost a year old now, but still very useful I think.

You can see the data and notebooks at https://github.com/berkeley-dsep-infra/datahub-usage-analysis. https://github.com/berkeley-dsep-infra/datahub-usage-analysis/blob/91e3fd8716fc40886886957c567e974b768641b8/notebooks/03-visualize-cost-and-usage.ipynb is probably the most useful notebook. For example, here is a graph of daily costs per user:

You can see the results of some cost reduction exercises we did, for example. We have done more of those since.

We do have more recent data, and that’s in jupyter_telemetry format. Maybe that will also be useful to publish?

Here’s what I would want, right now:

  1. Standardized code to read JupyterHub telemetry events and infer, for example, session length (see the sketch after this list). The data for this already exists, and some code also exists (I have some more code somewhere else too, I believe).
  2. Ability to generate reports based on this data, possibly adding in data from various cloud vendors on daily cost usage.
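
As a sketch of (1): assuming the telemetry events are stored as newline-delimited JSON along the lines of JupyterHub’s server start/stop events, session lengths can be inferred by pairing start and stop events per user. The field names below (‘timestamp’, ‘action’, ‘username’) are my assumptions and may need adjusting to the actual schema:

```python
# Sketch: infer per-user session lengths by pairing server start
# and stop events from a newline-delimited JSON event log. Field
# names are assumptions and may differ from the actual schema.
import json
from datetime import datetime


def session_lengths(path):
    starts = {}  # username -> start time of the open session
    sessions = []  # (username, duration) tuples
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            ts = datetime.fromisoformat(
                event["timestamp"].replace("Z", "+00:00")
            )
            if event["action"] == "start":
                starts[event["username"]] = ts
            elif event["action"] == "stop" and event["username"] in starts:
                user = event["username"]
                sessions.append((user, ts - starts.pop(user)))
    return sessions


for user, duration in session_lengths("events.jsonl"):
    print(user, duration)
```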

These two will go a long way in helping JupyterHub admins plan and justify their resource usage upstream.

1 Like

I think this is an important topic. It’s somewhat related to what I posted here:

One additional point I tried to make is that I think that tools for simple monitoring of the kubernetes cluster state (nodes, pods, etc.) could actually be an important educational tool for people trying to learn about how cloud computing works.

1 Like

Thanks for your input everyone!

Cross reference to pangeo’s forum

Hello @consideRatio @yuvipanda,

Do you know of any straightforward deployment of a JupyterHub dashboard? Just as easy as Z2JH.

The preview of this looks great and it’s exactly what I’m looking for, but the complexity is a bit too much for me, being new to the Kubernetes ecosystem.

Thanks for your help.

In the end, I am using: GitHub - prometheus-operator/kube-prometheus: Use Prometheus to monitor Kubernetes and applications running on Kubernetes

With a customized dashboard:

1 Like