[JupyterHub] GPU-related metrics not visible at the Prometheus metrics endpoint

Hi everyone!

I would like to monitor JupyterHub using the Prometheus monitoring solution.
I read through the documentation here and enabled the monitoring metrics: Monitoring — JupyterHub documentation
But I have a question about the GPU metrics.
As I saw in the JupyterLab dashboard, there is a GPU resource usage tab:
[screenshot: GPU resource usage dashboard in JupyterLab]
I tried to find the GPU metrics shown in the image through the Prometheus endpoint, but it seems like they are not exposed, because I only see the usual CPU and memory usage metrics:

# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 9.72066816e+08
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.63274752e+08
# HELP total_memory_usage counter for total memory usage
# TYPE total_memory_usage gauge
total_memory_usage 1.1664384e+09
# HELP max_memory_usage counter for max memory usage
# TYPE max_memory_usage gauge
max_memory_usage 1.34814277632e+11

So, do I need to enable some option to expose GPU metrics through the Prometheus endpoint, or are those types of metrics currently not exposed?

Thanks for taking the time to read my question!

The JupyterHub metrics relate to the performance of the JupyterHub server itself, e.g. the number of users and the resource usage of the hub process. They don't include any metrics for your single-user servers; the metrics for those need to be scraped separately.
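As a rough illustration of what scraping them separately could look like (a sketch only, not an official recipe: the hub URL, the environment variables, and the token permissions are assumptions about your deployment), you can enumerate users through the JupyterHub REST API and fetch each running server's /metrics endpoint:

# Sketch: list users via the JupyterHub REST API, then fetch the /metrics
# endpoint of each running single-user server. JUPYTERHUB_URL and the
# token's permissions (listing users, accessing servers) are assumptions.
import os
import requests

hub_url = os.environ.get("JUPYTERHUB_URL", "http://127.0.0.1:8000")
headers = {"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"}

users = requests.get(f"{hub_url}/hub/api/users", headers=headers).json()
for user in users:
    for name, server in user.get("servers", {}).items():
        if not server.get("ready"):
            continue
        # server["url"] is a path such as /user/<name>/ or /user/<name>/<server>/
        resp = requests.get(f"{hub_url}{server['url']}metrics", headers=headers)
        print(f"--- {user['name']}/{name or 'default'} ---")
        print(resp.text[:500])  # beginning of the Prometheus text output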

If the metrics you need aren't included you may need to install some extensions, or perhaps even write one. The GPU dashboard in your screenshot isn't part of standard JupyterLab, so you must already have installed some customisations.
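If it comes to writing one, a minimal sketch of a Jupyter Server extension that publishes a custom gauge could look roughly like this (the module name, the metric name, and the GPU read are purely illustrative; real values would come from e.g. NVML):

# my_gpu_metrics.py -- hypothetical server extension, sketch only.
# jupyter-server's /metrics handler serves the default prometheus_client
# registry, so a Gauge registered here should appear on that endpoint.
from prometheus_client import Gauge
from tornado.ioloop import PeriodicCallback

GPU_UTILIZATION = Gauge(
    "gpu_utilization_percent",  # illustrative metric name
    "GPU utilization as reported by the driver",
)


def read_gpu_utilization() -> float:
    # Placeholder: replace with a real query, e.g. via pynvml/NVML.
    return 0.0


def _load_jupyter_server_extension(serverapp):
    # Refresh the gauge every 10 seconds while the server is running.
    PeriodicCallback(
        lambda: GPU_UTILIZATION.set(read_gpu_utilization()),
        10_000,
    ).start()

The module would still need to be enabled like any other server extension (e.g. through c.ServerApp.jpserver_extensions).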


Hi @manics,

Thanks for your response!
Regarding the GPU metrics in my image, I need to investigate which extension the dashboard comes from.
Could you please give me some clues about the metrics from the documentation (List of Prometheus Metrics — JupyterHub documentation)?
It seems like the metrics that I gathered do not match the metric names in the documentation. Here are some of the metrics that I gathered:

http_request_duration_seconds_count{handler="jupyter_server_terminals.api_handlers.TerminalHandler",method="DELETE",status_code="204"} 8.0
http_request_duration_seconds_sum{handler="jupyter_server_terminals.api_handlers.TerminalHandler",method="DELETE",status_code="204"} 0.238997220993042
http_request_duration_seconds_created{handler="jupyter_server_terminals.handlers.TermSocket",method="GET",status_code="404"} 1.6994337649306383e+09
http_request_duration_seconds_created{handler="jupyter_server_terminals.api_handlers.TerminalRootHandler",method="GET",status_code="200"} 1.699433773478154e+09
http_request_duration_seconds_created{handler="jupyter_server_terminals.api_handlers.TerminalRootHandler",method="POST",status_code="200"} 1.699433919273691e+09
http_request_duration_seconds_created{handler="jupyter_server_terminals.handlers.TermSocket",method="GET",status_code="101"} 1.6994339193902347e+09
http_request_duration_seconds_created{handler="jupyter_server_terminals.api_handlers.TerminalHandler",method="DELETE",status_code="204"} 1.699451404645297e+09
# HELP terminal_currently_running_total counter for how many terminals are running
# TYPE terminal_currently_running_total gauge
terminal_currently_running_total 2.0

I cannot see the metric name “terminal_currently_running_total” in the documentation. Conversely, I cannot find the metric name “jupyterhub_active_users” in the list of metrics above.

JupyterHub and Jupyter Server/JupyterLab are separate components. JupyterHub is designed to manage multiple Jupyter server/lab/notebook instances for multiple users, including managing logins and creating/running/destroying the Jupyter servers. JupyterLab/server is what users use for running notebooks.

Based on your post I assume you’re only interested in the metrics for Jupyter server? If so you can ignore the JupyterHub documentation.
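To make that split concrete, the two components expose separate metrics endpoints; a rough sketch (the URLs, the user name, and the token handling are assumptions about the deployment):

# Sketch: the hub and a single-user server expose different metric sets.
import os
import requests

headers = {"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"}
user_name = "your-username"  # illustrative

# JupyterHub's own metrics (e.g. jupyterhub_active_users):
hub_metrics = requests.get(
    "http://127.0.0.1:8000/hub/metrics", headers=headers
).text

# A single-user server's metrics (e.g. terminal_currently_running_total):
server_metrics = requests.get(
    f"http://127.0.0.1:8000/user/{user_name}/metrics", headers=headers
).text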

Hi @manics,

Sorry for the late response!
It's much clearer now that you've explained JupyterLab and JupyterHub. So, is there any documentation for the JupyterLab/server Prometheus metrics?

Someone from the Jupyter server/lab teams will know better than me.

Those are the metrics you've already found.

This extension has some additional metrics.


Thank you for the information!

I think your findings give me a valuable basis for a deeper investigation.

Hi @Tan_Trinh @manics, I'm curious how you manage to scrape the /metrics endpoint hosted by jupyterhub-singleuser. I'm using Kubernetes to host JupyterHub and trying to allow the Prometheus server to scrape the Jupyter-server-related Prometheus metrics, but it always shows the hub auth page when I curl the endpoint directly inside the Jupyter server container. I think jupyterhub-singleuser is different from jupyter-server: even though it inherits from it under the hood, it probably overrides the existing auth layer for Jupyter server with HubOAuth, so setting --ServerApp.authenticate_prometheus=False didn't help to bypass the auth. I'd appreciate any insight here.

Jupyter-related versions:

  • jupyterhub 3.1.1
  • jupyterlab 3.4.7
  • jupyter-server 1.24.0

Update here: I was able to get the Prometheus metrics by using the JupyterHub API token; the base URL also needs to be updated:

curl 127.0.0.1:<jupyter-server-port>/user/<user_name>/<server_name>/metrics  -H "Authorization: Bearer $JUPYTERHUB_API_TOKEN"
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 119586.0
python_gc_objects_collected_total{generation="1"} 18505.0
python_gc_objects_collected_total{generation="2"} 1534.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
...
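For a longer-term setup, rather than reusing $JUPYTERHUB_API_TOKEN from inside the user container, one option (a sketch only; the service name, token handling, and scopes are assumptions) is to give the scraper its own JupyterHub service token with access to user servers, configured in jupyterhub_config.py:

# jupyterhub_config.py -- sketch of a dedicated token for a metrics scraper.
# `c` is provided by JupyterHub when it loads this config file.
import os

c.JupyterHub.services = [
    {
        "name": "prometheus-scraper",                    # illustrative name
        "api_token": os.environ["PROM_SCRAPER_TOKEN"],   # supplied via a secret
    }
]

c.JupyterHub.load_roles = [
    {
        "name": "metrics-scraper",
        "scopes": ["read:metrics", "access:servers"],  # hub metrics + user servers
        "services": ["prometheus-scraper"],
    }
]

The scraper could then send that token in the Authorization header when requesting /hub/metrics and each /user/<user_name>/<server_name>/metrics path.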