In Pangeo, we run JupyterHub and Binder clusters using Kubernetes on several different clouds. We often use dask_kubernetes to launch additional dask worker pods from our notebooks.
Both administrators and users are generally very curious about the status of the cluster as a whole. They would like to know:

- What is the status of my dask pods?
- How many other users are on the cluster?
- What is the status of the VM nodes and the pod distribution among them?
- How much is it costing?
This information is available to admins via kubectl or the cloud console. But what if we could monitor it directly from our JupyterLab window (or, alternatively, from the JupyterHub interface)? This would be valuable for debugging, but also for education: lots of people are simply curious about how the cloud works. HPC users are used to being able to query the cluster load and job queue, and expect similar information to be available in the cloud.
Perhaps some tools already exist for this purpose that could be plugged in to meet this need.
This sounds like a great idea. We could do this with the same libraries that dask_kubernetes uses. Kubectl and all the other tools are built on the same API, so that info should be available.
Cost is slightly more challenging as that is at the cloud provider level rather than the kubernetes level.
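For instance, here is a minimal sketch with the official kubernetes Python client (the same library dask_kubernetes builds on). The "pangeo" namespace is a placeholder, and this assumes your pod or kubeconfig has permission to list pods:

```python
from kubernetes import client, config

# Use in-cluster credentials when running inside a pod,
# otherwise fall back to the local kubeconfig (what kubectl uses).
try:
    config.load_incluster_config()
except config.ConfigException:
    config.load_kube_config()

v1 = client.CoreV1Api()

# List every pod in the namespace with its phase and node assignment.
pods = v1.list_namespaced_pod(namespace="pangeo")  # namespace is an assumption
for pod in pods.items:
    print(f"{pod.metadata.name:40s} {pod.status.phase:10s} {pod.spec.node_name}")
```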
Throwing in another idea: if you already run Grafana for your cluster, could you put together a dashboard there which lets people see relevant things based on their username?
Making it available directly in lab/a notebook could be done via an iframe which already has the right username set.
You get all the features of Grafana for free; the downside is that it won't look quite as integrated.
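The iframe route is only a few lines from a notebook. A sketch, assuming a dashboard parameterized by a `var-user` templating variable (both the URL and that variable name are made up; JupyterHub does set `JUPYTERHUB_USER` in single-user servers):

```python
import os
from IPython.display import IFrame

user = os.environ.get("JUPYTERHUB_USER", "unknown")

# Hypothetical Grafana dashboard URL; "kiosk" hides Grafana's own chrome.
IFrame(
    f"https://grafana.example.org/d/abc123/usage?var-user={user}&kiosk",
    width=900,
    height=500,
)
```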
Nods. I was assuming that if you can run stuff on the cluster, you'd also be allowed to look at the Grafana charts. If the same auth is used for launching pods as for accessing Grafana, you could reuse the token. Though the more complicated this gets, the less attractive it is to reuse Grafana charts.
Yeah, it's an option. Although the Grafana dashboard generally gives you access to more things than your personal usage on the cluster, so we might want to manage that.
Looking at permissions: we currently do not provide enough permissions on Pangeo by default to do the things that @rabernat has mentioned.
We can get info about the dask pods (and all other pods in the namespace, including notebooks). From this we can infer the number of users, dask clusters, and dask workers.
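For example, a sketch of that inference using label selectors. The selectors below match a typical Zero-to-JupyterHub deployment, and dask-kubernetes label names vary by version, so treat them as assumptions and check `kubectl get pods --show-labels` first:

```python
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ns = "pangeo"  # placeholder namespace

# Singleuser notebook pods: one per active user (Z2JH convention).
users = v1.list_namespaced_pod(ns, label_selector="component=singleuser-server")
print(f"active users: {len(users.items)}")

# Dask worker pods, grouped by the user label dask-kubernetes attaches.
workers = v1.list_namespaced_pod(ns, label_selector="app=dask")
by_user = Counter(p.metadata.labels.get("user", "?") for p in workers.items)
for user, n in sorted(by_user.items()):
    print(f"{user}: {n} dask workers")
```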
We do not provide credentials to get information about the underlying nodes. We could add this, but it can cause security headaches for those of us running Pangeo/Z2JH on multi-tenant Kubernetes clusters.
I guess in many cases the Kubernetes admin and the Jupyter user are going to be the same person, but when running at an institutional level this will not be true.
It's true that Grafana may provide this information, but I have to admit that I detest the Grafana UX. I made a mockup of the information I would like to see. Ideally this would all be responsive, with lots of tooltips when I hover over the different objects.
Nice picture! If I were to implement such a thing, I would probably write a Bokeh Server application (which I claim a moderately skilled Python dev could learn in a day) and then use something like this template (work by Ian Rose) that shows how to integrate a Bokeh Server application into JupyterLab as an extension.
I did this with someone at NVIDIA who had never seen Bokeh before and he was up and running with a GPU diagnostics dashboard within about a day.
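To make that concrete, here is a minimal sketch of such a Bokeh server app: it polls the Kubernetes API every few seconds and plots pod counts by phase. Run it with `bokeh serve app.py`; the namespace is a placeholder, and the JupyterLab embedding would follow the template above:

```python
from collections import Counter

from bokeh.io import curdoc
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
NAMESPACE = "pangeo"  # placeholder

phases = ["Pending", "Running", "Succeeded", "Failed"]
source = ColumnDataSource(data=dict(phase=phases, count=[0] * len(phases)))

fig = figure(x_range=phases, height=300, title="Pods by phase")
fig.vbar(x="phase", top="count", width=0.8, source=source)

def update():
    # Re-count pod phases and push the new data to the browser.
    pods = v1.list_namespaced_pod(NAMESPACE)
    counts = Counter(p.status.phase for p in pods.items)
    source.data = dict(phase=phases, count=[counts.get(p, 0) for p in phases])

update()
curdoc().add_periodic_callback(update, 5000)  # refresh every 5 s
curdoc().add_root(fig)
```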
I'd be really excited to see something like this come together. A few months back, I mocked up a prototype ipywidget (https://gist.github.com/jhamman/a7f8a00fa19cfa9ecaf5a252a4707842) to start exploring this space. I'd be psyched if someone was able to make a real extension happen.
I’ll also agree with others that we may well be able to configure grafana and the k8s dashboard to behave sufficiently well that we don’t have to build something new.
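In the same spirit as that gist (though not taken from it), a rough ipywidgets sketch of a refreshable status readout; the namespace is a placeholder:

```python
import ipywidgets as widgets
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

status = widgets.HTML()
refresh = widgets.Button(description="Refresh")

def update(_=None):
    # Re-query the API and summarize pod state in the widget.
    pods = v1.list_namespaced_pod("pangeo").items  # placeholder namespace
    running = sum(p.status.phase == "Running" for p in pods)
    status.value = f"<b>{running}</b> running / {len(pods)} total pods"

refresh.on_click(update)
update()
widgets.VBox([status, refresh])
```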
Did anything ever come of this thread? I would love to have a tool like this, and I'm willing to help build it if anyone here has begun any more work on a solution.
As far as I know, no, nothing ever came of this suggestion. The current best practice appears to be to use Grafana and Prometheus to do monitoring of hubs.
@ntor, thanks for volunteering to help build something! Perhaps the best way forward would be to try to build an extension that can query data from Prometheus?
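The querying itself is straightforward over Prometheus's HTTP API. The server URL and the kube-state-metrics metric below are assumptions about a particular deployment, but something like this could back such an extension:

```python
import requests

PROM = "http://prometheus.example.org"  # hypothetical endpoint
query = 'sum(kube_pod_status_phase{namespace="pangeo"}) by (phase)'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query})
resp.raise_for_status()

# Each result carries its label set and a [timestamp, value] pair.
for result in resp.json()["data"]["result"]:
    phase = result["metric"].get("phase", "?")
    print(f"{phase}: {result['value'][1]}")
```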
I am a bit more interested in building an extension to help debug issues in deployments: something like the VSCode Kubernetes extension, which gives easy access to logs and resource status/descriptions. To do that, I was thinking of building this around something like kubernetes-client: https://github.com/kubernetes-client/python/issues/333.
However, it definitely seems cool to integrate monitoring from Prometheus along with this. What do you think would be the best way to do that?
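For the logs/status side, kubernetes-client already gives you the kubectl-equivalent calls directly; a sketch with placeholder pod and namespace names:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

ns, pod = "pangeo", "dask-worker-abc123"  # placeholders

# Tail recent logs, like `kubectl logs --tail=50`.
print(v1.read_namespaced_pod_log(pod, ns, tail_lines=50))

# Events for the pod, like the Events section of `kubectl describe`.
events = v1.list_namespaced_event(ns, field_selector=f"involvedObject.name={pod}")
for ev in events.items:
    print(f"{ev.reason}: {ev.message}")
```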