tornado.web.HTTPError: HTTP 500: Internal Server Error (Permission failure checking authorization, I may need a new token)

Hi all,
I’m hoping to get some input on a problem that has cropped up recently.

We have JupyterHub running on Rocky Linux 8, using Batchspawner to launch notebooks on a Slurm cluster (also running Rocky Linux 8). This has been working well for many months.

Now, whenever a notebook spawn is attempted, the web interface reports:
"Spawn failed: sbatch: error: Batch job submission failed: Socket timed out on send/recv operation"

On the cluster node where the job was trying to spawn, there are files like /tmp/jupyterhub-31717.error that contain:

I don't have permission to check authorization with JupyterHub, my auth token may have expired: [403] Forbidden
{"status": 403, "message": "Forbidden"}
Traceback (most recent call last):
  File "/mnt/local/python3.9/bin/batchspawner-singleuser", line 8, in <module>
    sys.exit(main())
  File "/mnt/local/python3.9/lib/python3.9/site-packages/batchspawner/singleuser.py", line 17, in main
    hub_auth._api_request(
  File "/mnt/local/python3.9/lib/python3.9/site-packages/jupyterhub/services/auth.py", line 436, in _api_request
    raise HTTPError(
tornado.web.HTTPError: HTTP 500: Internal Server Error (Permission failure checking authorization, I may need a new token)
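
For context, the failing call is batchspawner-singleuser making an authenticated request to the Hub API with the job's token. As a rough, generic sanity check (not the exact request batchspawner makes), the token can be exercised by hand from the compute node, assuming the JUPYTERHUB_API_TOKEN and JUPYTERHUB_API_URL environment variables that the spawner normally injects into the job environment:

# Hypothetical manual token check, run from inside the failing job's environment.
# JUPYTERHUB_API_URL is usually something like http://<hub-host>:8081/hub/api.
curl -sS -i \
  -H "Authorization: token ${JUPYTERHUB_API_TOKEN}" \
  "${JUPYTERHUB_API_URL}/user"
# HTTP 200 with the user's JSON model means the token is fine; a 403 here
# matches the "Forbidden" in /tmp/jupyterhub-31717.error and points at the
# token / Hub connectivity rather than at Slurm itself.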

So far the only similar issue I have found when searching the forums is a reference to a netrc or .netrc file in the user's home directory causing the problem, but no such file exists in the user's home directory, /etc, or any other obvious location.

Anyone have any thoughts to share on what might be causing this, or how to troubleshoot?

Thanks,

-Dj

I believe I have traced the problem to slow communication between sssd on the Linux systems and the campus Active Directory service.

In sssd.log I noticed lots of messages of the form "sssd ('default':'%BE_default') was terminated by own WATCHDOG".

As a workaround, I added "timeout = 45" to the domain section for the AD server in sssd.conf, and so far that seems to have helped: no more WATCHDOG termination messages, and the jobs are running again. A rough sketch of the change is below.
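
For anyone wanting to replicate the workaround, the change looks roughly like this (a sketch only; the domain name is a placeholder for our actual AD domain, and 45 seconds is just the value that happened to work for us):

# /etc/sssd/sssd.conf (excerpt; domain name is a placeholder)
[domain/ad.example.edu]
# ... existing id_provider / auth_provider settings for the AD domain ...
# Lengthen the heartbeat interval for this domain's backend process so that
# slow responses from AD no longer trip the internal watchdog.
timeout = 45

Restart sssd afterwards (systemctl restart sssd) for the change to take effect.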

I still haven't figured out why communication with the AD service is taking so much longer than it used to, but at least this lets us keep functioning while we troubleshoot.

-Dj