In Pangeo, we run JupyterHub and Binder clusters using Kubernetes on several different clouds. We often use dask_kubernetes to launch additional dask worker pods from our notebooks.
Both administrators and users are generally very curious about the status of the cluster as a whole. They would like to know:

- What is the status of my dask pods?
- How many other users are on the cluster?
- What is the status of the VM nodes and the pod distribution among them?
- How much is it costing?
This information is available to admins via kubectl or the cloud console. But what if we could monitor it directly from our JupyterLab window (or, alternatively, from the JupyterHub interface)? This would be valuable for debugging, but also for education: lots of people are simply curious about how the cloud works. HPC users are used to being able to query the cluster load and job queue, and expect similar information to be available in the cloud.
Perhaps some tools already exist for this purpose that could be plugged in to meet this need.
This sounds like a great idea. We could do this with the same libraries that dask_kubernetes uses. Kubectl and all the other tools are built on the same API, so that info should be available.
Cost is slightly more challenging as that is at the cloud provider level rather than the kubernetes level.
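For instance, here is a minimal sketch with the official kubernetes Python client (the same library dask_kubernetes builds on). The "pangeo" namespace is a placeholder, and this assumes your pod or kubeconfig has permission to list pods:

```python
from kubernetes import client, config

# Use in-cluster credentials when running inside a pod,
# otherwise fall back to the local kubeconfig (what kubectl uses).
try:
    config.load_incluster_config()
except config.ConfigException:
    config.load_kube_config()

v1 = client.CoreV1Api()

# List every pod in the namespace with its phase and node assignment.
pods = v1.list_namespaced_pod(namespace="pangeo")  # namespace is an assumption
for pod in pods.items:
    print(f"{pod.metadata.name:40s} {pod.status.phase:10s} {pod.spec.node_name}")
```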
Throwing in another idea: if you already run Grafana for your cluster, could you put together a dashboard there which lets people see relevant things based on their username?
Making it available directly in lab/a notebook could be done via an iframe which already has the right username set.
You get all the features of Grafana for free; the downside is that it won't look quite as integrated.
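The iframe route is only a few lines from a notebook. A sketch, assuming a dashboard parameterized by a `var-user` templating variable (both the URL and that variable name are made up; JupyterHub does set `JUPYTERHUB_USER` in single-user servers):

```python
import os
from IPython.display import IFrame

user = os.environ.get("JUPYTERHUB_USER", "unknown")

# Hypothetical Grafana dashboard URL; "kiosk" hides Grafana's own chrome.
IFrame(
    f"https://grafana.example.org/d/abc123/usage?var-user={user}&kiosk",
    width=900,
    height=500,
)
```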
Nods. I was assuming that if you can run stuff on the cluster, you'd also be allowed to look at the Grafana charts. If the same auth is used for launching pods as for accessing Grafana, you could reuse the token. Though the more complicated this gets, the less attractive it is to reuse Grafana charts.
Yeah, it's an option. Although the Grafana dashboard generally gives you access to more things than your personal usage on the cluster, so we might want to manage that.
Looking at permissions: we currently do not provide enough permissions on Pangeo by default to do the things that @rabernat has mentioned.
We can get info about the dask pods (and all other pods in the namespace, including notebooks). From this we can infer the number of users, dask clusters, and dask workers.
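For example, a sketch of that inference using label selectors. The selectors below match a typical Zero-to-JupyterHub deployment, and dask-kubernetes label names vary by version, so treat them as assumptions and check `kubectl get pods --show-labels` first:

```python
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ns = "pangeo"  # placeholder namespace

# Singleuser notebook pods: one per active user (Z2JH convention).
users = v1.list_namespaced_pod(ns, label_selector="component=singleuser-server")
print(f"active users: {len(users.items)}")

# Dask worker pods, grouped by the user label dask-kubernetes attaches.
workers = v1.list_namespaced_pod(ns, label_selector="app=dask")
by_user = Counter(p.metadata.labels.get("user", "?") for p in workers.items)
for user, n in sorted(by_user.items()):
    print(f"{user}: {n} dask workers")
```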
We do not provide credentials to get information about the underlying nodes. We could add this, but it can cause security headaches for those of us running Pangeo/Z2JH on multi-tenant Kubernetes clusters.
I guess in many cases the Kubernetes admin and the Jupyter user are going to be the same person, but when running at an institutional level this will not be true.
It's true that Grafana may provide this information, but I have to admit that I detest the Grafana UX. I made a mockup of the information I would like to see. Ideally this would all be responsive, with lots of tooltips when I hover over the different objects.
Nice picture! If I were to implement such a thing, I would probably write a Bokeh Server application (which I claim a moderately skilled Python dev could learn in a day) and then use something like this template (work by Ian Rose) that shows how to integrate a Bokeh Server application into JupyterLab as an extension.
I did this with someone at NVIDIA who had never seen Bokeh before and he was up and running with a GPU diagnostics dashboard within about a day.
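To make that concrete, here is a minimal sketch of such a Bokeh server app: it polls the Kubernetes API every few seconds and plots pod counts by phase. Run it with `bokeh serve app.py`; the namespace is a placeholder, and the JupyterLab embedding would follow the template above:

```python
from collections import Counter

from bokeh.io import curdoc
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
NAMESPACE = "pangeo"  # placeholder

phases = ["Pending", "Running", "Succeeded", "Failed"]
source = ColumnDataSource(data=dict(phase=phases, count=[0] * len(phases)))

fig = figure(x_range=phases, height=300, title="Pods by phase")
fig.vbar(x="phase", top="count", width=0.8, source=source)

def update():
    # Re-count pod phases and push the new data to the browser.
    pods = v1.list_namespaced_pod(NAMESPACE)
    counts = Counter(p.status.phase for p in pods.items)
    source.data = dict(phase=phases, count=[counts.get(p, 0) for p in phases])

update()
curdoc().add_periodic_callback(update, 5000)  # refresh every 5 s
curdoc().add_root(fig)
```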
I'd be really excited to see something like this come together. A few months back, I mocked up a prototype ipywidget (https://gist.github.com/jhamman/a7f8a00fa19cfa9ecaf5a252a4707842) to start exploring this space. I'd be psyched if someone was able to make a real extension happen.
I’ll also agree with others that we may well be able to configure grafana and the k8s dashboard to behave sufficiently well that we don’t have to build something new.
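In the same spirit as that gist (though not taken from it), a rough ipywidgets sketch of a refreshable status readout; the namespace is a placeholder:

```python
import ipywidgets as widgets
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

status = widgets.HTML()
refresh = widgets.Button(description="Refresh")

def update(_=None):
    # Re-query the API and summarize pod state in the widget.
    pods = v1.list_namespaced_pod("pangeo").items  # placeholder namespace
    running = sum(p.status.phase == "Running" for p in pods)
    status.value = f"<b>{running}</b> running / {len(pods)} total pods"

refresh.on_click(update)
update()
widgets.VBox([status, refresh])
```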
Did anything ever come of this thread? I would love to have a tool like this, and I'm willing to help build it if anyone here has begun any more work on a solution.
As far as I know, no, nothing ever came of this suggestion. The current best practice appears to be to use Grafana and Prometheus to do monitoring of hubs.
@ntor, thanks for volunteering to help build something! Perhaps the best way forward would be to try to build an extension that can query data from Prometheus?
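The querying itself is straightforward over Prometheus's HTTP API. The server URL and the kube-state-metrics metric below are assumptions about a particular deployment, but something like this could back such an extension:

```python
import requests

PROM = "http://prometheus.example.org"  # hypothetical endpoint
query = 'sum(kube_pod_status_phase{namespace="pangeo"}) by (phase)'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query})
resp.raise_for_status()

# Each result carries its label set and a [timestamp, value] pair.
for result in resp.json()["data"]["result"]:
    phase = result["metric"].get("phase", "?")
    print(f"{phase}: {result['value'][1]}")
```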
I am a bit more interested in building an extension to help debug issues in deployments: something like the VSCode Kubernetes extension, which gives easy access to logs and resource status/descriptions. To do that, I was thinking of building this around something like kubernetes-client: https://github.com/kubernetes-client/python/issues/333.
However, it definitely seems cool to integrate monitoring from Prometheus along with this. What do you think would be the best way to do that?
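For the logs/status side, kubernetes-client already gives you the kubectl-equivalent calls directly; a sketch with placeholder pod and namespace names:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

ns, pod = "pangeo", "dask-worker-abc123"  # placeholders

# Tail recent logs, like `kubectl logs --tail=50`.
print(v1.read_namespaced_pod_log(pod, ns, tail_lines=50))

# Events for the pod, like the Events section of `kubectl describe`.
events = v1.list_namespaced_event(ns, field_selector=f"involvedObject.name={pod}")
for ev in events.items:
    print(f"{ev.reason}: {ev.message}")
```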