JupyterHub 500 Error - SSL connection has been closed unexpectedly [tornado, ioloop]

JupyterHub version 1.2.1 hosted on EKS
AWS Aurora PostgreSQL Serverless as the hub database

Every day we receive a database connection exception that results in a 500 error response.
We're not really sure what triggers this on a daily basis, but the hub fails after this exception. I've pasted the initial exception from the IOLoop callback below.

As a workaround we have to refresh the hub from the UI. I'm looking for leads on fixing this recurring issue.

Thanks.

Error stack trace

[E 2021-05-06 16:23:24.252 JupyterHub ioloop:761] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f9fce2c8d00>>, <Task finished name='Task-173862' coro=<JupyterHub.update_last_activity() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/app.py:2519> exception=OperationalError('(psycopg2.OperationalError) SSL connection has been closed unexpectedly
')>)

Full Stack Trace

2021-05-06T16:41:57.823Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:57.725 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.38.75) 0.98ms
2021-05-06T16:41:57.823Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:57.052 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.39.70) 1.04ms
2021-05-06T16:41:56.823Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:56.741 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.37.37) 1.03ms
2021-05-06T16:41:49.822Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[W 2021-05-06 16:41:49.409 JupyterHub log:181] 403 GET /hub/metrics (@10.78.39.194) 1.01ms
2021-05-06T16:41:47.822Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:47.714 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.38.75) 1.03ms
2021-05-06T16:41:47.821Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:47.042 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.39.70) 0.99ms
2021-05-06T16:41:46.821Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:46.730 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.37.37) 1.03ms
2021-05-06T16:41:37.820Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:37.704 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.38.75) 1.11ms
2021-05-06T16:41:37.820Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:37.031 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.39.70) 1.05ms
2021-05-06T16:41:36.820Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:36.722 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.37.37) 1.03ms
2021-05-06T16:41:27.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:27.693 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.38.75) 1.01ms
2021-05-06T16:41:27.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:27.020 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.39.70) 1.01ms
2021-05-06T16:41:26.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:26.714 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.37.37) 1.15ms
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[parameters: [immutabledict({})]]
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,LIMIT %(param_1)s]
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,WHERE users.name = %(name_1)s
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,FROM users
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[SQL: SELECT users.id AS users_id, users.name AS users_name, users.admin AS users_admin, users.created AS users_created, users.last_activity AS users_last_activity, users.cookie_id AS users_cookie_id, users.state AS users_state, users.encrypted_auth_state AS users_encrypted_auth_state
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,sqlalchemy.exc.StatementError: (sqlalchemy.exc.InvalidRequestError) Can't reconnect until invalid transaction is rolled back
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,raise exc.InvalidRequestError(
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 469, in _revalidate_connection
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,conn = self._revalidate_connection()
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1202, in _execute_context
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,raise exception
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 182, in raise_
2021-05-06T16:41:24.819Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,util.raise_(
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,self._handle_dbapi_exception(
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1206, in _execute_context
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,ret = self._execute_context(
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,return connection._execute_clauseelement(self, multiparams, params)
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,return meth(self, multiparams, params)
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1011, in execute
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,result = conn.execute(querycontext.statement, self._params)
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/query.py", line 3560, in _execute_and_instances
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,return self._execute_and_instances(context)
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,return list(res)
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/query.py", line 3203, in __getitem__
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,ret = list(self[0:1])
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/query.py", line 3429, in first
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,return db.query(cls).filter(cls.name == name).first()
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/jupyterhub/orm.py", line 222, in find
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,user = orm.User.find(self.db, route_data['user'])
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/jupyterhub/app.py", line 2536, in update_last_activity
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,future.result()
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 765, in _discard_future_result
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,ret = callback()
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 741, in _run_callback
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,Traceback (most recent call last):
2021-05-06T16:41:24.818Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[E 2021-05-06 16:41:24.252 JupyterHub ioloop:761] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f9fce2c8d00>>, <Task finished name='Task-176108' coro=<JupyterHub.update_last_activity() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/app.py:2519> exception=StatementError("(sqlalchemy.exc.InvalidRequestError) Can't reconnect until invalid transaction is rolled back")>)
2021-05-06T16:41:18.623Z,i-088fb79e20ea8169a,180-adl-workspacemanager,2021-05-06 16:41:18.586 INFO [scheduling-1] - [TransId:80554a0b-01ee-42ff-b043-b8d0e05bd0a1][Class:DeleteScheduler$$EnhancerBySpringCGLIB$$230ee039][Method:deLinkArticlesScheduled] - Exiting
2021-05-06T16:41:18.623Z,i-088fb79e20ea8169a,180-adl-workspacemanager,2021-05-06 16:41:18.586 INFO [scheduling-1] - [TransId:80554a0b-01ee-42ff-b043-b8d0e05bd0a1][Class:DeleteScheduler$$EnhancerBySpringCGLIB$$230ee039][businessmetric: service=workspacemanager;transactionId=80554a0b-01ee-42ff-b043-b8d0e05bd0a1;actionName=deLinkArticlesScheduled;]
2021-05-06T16:41:18.623Z,i-088fb79e20ea8169a,180-adl-workspacemanager,2021-05-06 16:41:18.584 TRACE [scheduling-1] - binding parameter [1] as [VARCHAR] - [DELETED]
2021-05-06T16:41:18.623Z,i-088fb79e20ea8169a,180-adl-workspacemanager,2021-05-06 16:41:18.581 INFO [scheduling-1] - [TransId:80554a0b-01ee-42ff-b043-b8d0e05bd0a1][Class:DeleteScheduler$$EnhancerBySpringCGLIB$$230ee039][Method:deLinkArticlesScheduled] - Entering
2021-05-06T16:41:17.817Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:17.682 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.38.75) 1.03ms
2021-05-06T16:41:17.817Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:17.010 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.39.70) 1.15ms
2021-05-06T16:41:16.817Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:16.703 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.37.37) 1.13ms
2021-05-06T16:41:07.816Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:07.671 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.38.75) 1.06ms
2021-05-06T16:41:07.816Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:06.999 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.39.70) 1.02ms
2021-05-06T16:41:06.816Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:41:06.693 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.37.37) 1.00ms
2021-05-06T16:40:57.815Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:40:57.660 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.38.75) 0.99ms
2021-05-06T16:40:57.815Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:40:56.987 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.39.70) 1.03ms
2021-05-06T16:40:56.815Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[I 2021-05-06 16:40:56.682 JupyterHub log:181] 200 GET /hub/login (@::ffff:10.78.37.37) 1.01ms
2021-05-06T16:40:49.814Z,i-0a6884385fc6e34b2,180-adl-jupyterhub,[W 2021-05-06 16:40:49.409 JupyterHub log:181] 403 GET /hub/metrics (@10.78.39.194) 1.03ms

Is there anything in your PostgreSQL logs?


Hi @manics,

Thanks for the reply.

I've checked the serverless database for the past 24 hours and I haven't noticed any errors in the database logs that correlate with the above-mentioned error.

I've also checked the CPU, Memory, FreeLocalStorage, ServerlessDatabaseCapacity, and Deadlocks metrics for the past 3 days, and I don't see anything in them that correlates with the SSL connection error. Our serverless capacity stays at the minimum (2 ACUs) the whole time.

Unfortunately this could be difficult to investigate as it involves an AWS-specific service; it sounds like they run a proxy/load balancer in front of PostgreSQL to manage connections. A bit of Googling turned up a few mentions of connection drops with this AWS service, but nothing definitive.
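If that proxy is silently dropping idle connections, one thing you could experiment with (untested on my side, so treat it as a sketch) is passing SQLAlchemy pool options through to the hub's database engine via `c.JupyterHub.db_kwargs`, so stale connections are detected and recycled before they're reused:

```python
# jupyterhub_config.py -- hypothetical values, adjust for your deployment.
# db_kwargs is forwarded to SQLAlchemy's create_engine().
c.JupyterHub.db_kwargs = {
    "pool_pre_ping": True,  # test each pooled connection with a cheap ping before using it
    "pool_recycle": 300,    # recycle connections older than 5 minutes, before the proxy can drop them
}
```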

You could try enabling debug logging in JupyterHub to see if you get more information. Otherwise it might be worth testing with your own PostgreSQL server?
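For reference, debug logging can be switched on in jupyterhub_config.py roughly like this (option names taken from the JupyterHub docs, so please verify them for your version; if you deploy with the z2jh Helm chart the equivalent is `debug.enabled: true` in the chart values):

```python
# jupyterhub_config.py -- verbose logging to capture more context around the failure
c.JupyterHub.log_level = 'DEBUG'  # general hub debug logging
c.JupyterHub.debug_db = True      # also echo all database transactions (very noisy)
```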


I don’t know what could be the cause of the connection error, but even if it keeps happening, the consequences should be intermittent. The persistent bad state, especially indicated by this log line:

Can't reconnect until invalid transaction is rolled back

indicates a bug in JupyterHub. We have a few db.rollback()s sprinkled around that are meant to get the db back by doing the rollback suggested in the error, but somehow that's not happening. This probably means there are some places where a db error can occur that don't have this catch & rollback logic, and/or this catch & rollback isn't sufficient by itself (for instance, I would expect the first request after getting into this bad state to trigger the rollback mentioned in the error here).

I have never managed to reliably reproduce this kind of error well enough to add a test case to verify that our rollback code is robust, which it clearly isn’t.
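For illustration, the pattern in question has roughly this shape (a simplified sketch, not the exact code in the hub; `session` stands for the hub's SQLAlchemy session):

```python
from sqlalchemy.exc import SQLAlchemyError

def run_db_operation(session, operation):
    """Run a database operation, rolling back on failure so the session
    stays usable instead of being stuck in an invalid transaction.
    (Hypothetical sketch of the catch & rollback pattern, not JupyterHub's code.)"""
    try:
        return operation(session)
    except SQLAlchemyError:
        # Without this rollback, every later query fails with
        # "Can't reconnect until invalid transaction is rolled back".
        session.rollback()
        raise
```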


@manics Thanks for the reply. Unfortunately our prod systems are built on the AWS Aurora Serverless service, so even if we test with our own PostgreSQL server, mapping the results back to the AWS service might not be possible… Did you have a chance to reproduce this by any means?

Hi @minrk

Thanks for the reply. It's not self-healing at this point. We are using two workarounds:

  1. Restart the JupyterHub server, so the connections are fresh.
  2. Log in with a dummy user via the JupyterHub login URL… this recovers the connections.

Is it possible to point me to the right location for adding the db.rollback(), so I can add it in my local setup and check?

Error triggers…

Thanks again for the support.

I'm not sure how common this is, probably not very, given there hasn't been a JupyterHub 1.5.x release yet, but I've proposed a backport of the related DB rollback fix to the 1.x branch in the hope of a 1.5.1 release for this issue:

We're still on 1.5.0 because we haven't had the time or priority to take the plunge on the 3.0 upgrade (via z2jh), but we hit this pending rollback transaction error frequently in one of our deployments and have to manually restart the hub each time it happens. It'd be nice if the fix could go out in a 1.5.1 patch release, though I'm not familiar with the project's backport policies, so I'll understand if it's rejected.